📋 Project Overview
A markdown-first synthetic OCR dataset generator and benchmarking toolkit covering the full pipeline: LLM-backed corpus generation → headless Playwright rendering with realistic noise → HuggingFace publishing → reproducible OCR/VLM model evaluation → ranked leaderboard. Supports Korean and Japanese.
🎯 Problem Definition & Goals
- Problem: High-quality labeled OCR data for Korean and Japanese is scarce and expensive to collect, and no reproducible benchmark exists to compare models fairly across text, table, and formula categories.
- Goal 1: Build a realistic synthetic OCR pipeline for Korean and Japanese using an LLM-generated corpus and browser-rendered markdown.
- Goal 2: Create a reproducible evaluation framework with checkpoint resume, model config YAMLs, and structured reports.
- Goal 3: Publish open datasets to HuggingFace Hub and maintain a living leaderboard.
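A model config for the evaluation framework might look like the sketch below. The field names (`model`, `evaluation`, `report`, etc.) are illustrative assumptions, not the toolkit's actual schema:

```yaml
# Hypothetical model config — field names are illustrative, not the real schema.
model:
  name: example-ocr-model          # identifier used in reports and the leaderboard
  api: openai-batch                # batch API support, per Goal 2
  max_concurrency: 8
evaluation:
  languages: [ko, ja]
  categories: [text, table, formula]
  checkpoint_dir: checkpoints/example-ocr-model/   # enables resume after interruption
report:
  formats: [json, markdown, html]
```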
⚙️ Key Features & Contributions
- LLM-Backed Corpus Generation: The OpenAI API generates a realistic Korean/Japanese text corpus, reused across the generation and publish workflows.
- Headless Playwright Renderer: Chromium renders markdown pages with configurable noise, blur, and character-similarity-based typo substitutions for realistic documents.
- Evaluation Pipeline: YAML model configs, batch API support, checkpoint-based resume — outputs JSON/Markdown/HTML reports and leaderboard files per language.
- HuggingFace Publishing: Sharded dataset upload with per-shard metadata, `realism_stats.json`, and `run_manifest.json` for full reproducibility.
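The character-similarity-based typo substitution can be sketched roughly as follows. The similarity table and function name here are illustrative assumptions, not the toolkit's actual API; a real database would be far larger and derived from observed OCR confusions:

```python
import random

# Illustrative similarity table: visually confusable character pairs.
# Assumption for demonstration — the real database is much larger.
SIMILAR_CHARS = {
    "ㅏ": ["ㅓ"], "ㅁ": ["ㅇ"],      # Hangul jamo that OCR engines often confuse
    "力": ["刀"], "ー": ["一"],      # Japanese katakana/kanji look-alikes
    "O": ["0"], "l": ["1", "I"],
}

def inject_typos(text: str, rate: float = 0.02, seed: int = 0) -> str:
    """Replace a small fraction of characters with visually similar ones."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        candidates = SIMILAR_CHARS.get(ch)
        if candidates and rng.random() < rate:
            out.append(rng.choice(candidates))
        else:
            out.append(ch)
    return "".join(out)
```

Seeding the RNG keeps the noise reproducible across runs, which matters when the same corpus is rendered for multiple dataset versions.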
🔧 Technical Challenges & Solutions
- Formula Rendering: Markdown pages include LaTeX formulas, requiring XeLaTeX + `latex-to-image` integration and bounded formula-render caching for long runs.
- Typo Realism: Built a character-similarity database to generate plausible substitution errors, making synthetic text more representative of real scan noise.
- Scale & Resumability: Sharded output with `run_manifest.json` allows generation to resume mid-run without duplicating completed shards.
- Evaluation Consistency: Checkpoint-based scoring prevents double-counting; protocol snapshots pin dataset versions so results remain comparable across runs.
📈 Results & Learnings
- Korean OCR Leaderboard: Top model (LightOnOCR-2-1B) achieved 0.9737 avg_markdown_overall_score across text, table, and formula categories (100/100 success rate).
- Japanese OCR Leaderboard: Top model achieved 0.9777, with Nanonets-OCR2-3B scoring 0.9605 and DotsOCR 0.9288.
- Pipeline Reliability: End-to-end generation, publish, and evaluation flow runs without manual intervention, enabling repeatable benchmark updates.
- Key Learning: Gained deep expertise in synthetic data quality, multi-language OCR challenges, and building reproducible ML benchmarks.