
Synthetic OCR Image Generator

2025.12 - 2026.02

📋 Project Overview

A markdown-first synthetic OCR dataset generator and benchmarking toolkit that covers the full pipeline: LLM-backed corpus generation → headless Playwright rendering with realistic noise → publishing to the Hugging Face Hub → reproducible OCR/VLM model evaluation → ranked leaderboard. Supports Korean and Japanese.
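The last two pipeline stages (evaluation and ranking) can be illustrated with a minimal sketch. It assumes models are scored by character error rate (CER), a common OCR metric. The overview does not name the metric the project uses, so CER is an assumption here. The names `edit_distance`, `cer`, `LeaderboardRow`, and `rank_models` are hypothetical and not taken from the project.

```python
from dataclasses import dataclass


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        # Assumed convention: an empty reference is only matched by empty output.
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


@dataclass
class LeaderboardRow:
    model: str
    mean_cer: float


def rank_models(references: list[str],
                predictions: dict[str, list[str]]) -> list[LeaderboardRow]:
    """Average CER per model, then sort ascending (lower CER ranks higher)."""
    rows = []
    for model, hyps in predictions.items():
        if len(hyps) != len(references):
            raise ValueError(f"{model}: expected {len(references)} predictions")
        scores = [cer(r, h) for r, h in zip(references, hyps)]
        rows.append(LeaderboardRow(model, sum(scores) / len(scores)))
    return sorted(rows, key=lambda row: row.mean_cer)
```

Calling `rank_models(ground_truth_texts, {"model_a": [...], "model_b": [...]})` returns rows ordered best-first. Because CER counts characters rather than whitespace-delimited words, it also applies to Japanese text, which has no spaces between words.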

🎯 Problem Definition & Goals

⚙️ Key Features & Contributions

🔧 Technical Challenges & Solutions

📈 Results & Learnings

🛠️ Technologies

Python, Playwright, OpenAI API, Hugging Face Hub, Transformers, XeLaTeX, uv

🔗 Links