📋 Project Overview
A markdown-first synthetic OCR dataset generator and benchmarking toolkit covering the full pipeline: LLM-backed corpus generation → headless Playwright rendering with realistic noise → HuggingFace publishing → reproducible OCR/VLM model evaluation → ranked leaderboard. Supports Korean and Japanese.
🎯 Problem Definition & Goals
- Problem: High-quality labeled OCR data for Korean and Japanese is scarce and expensive to collect, and no reproducible benchmark exists to compare models fairly across text, table, and formula categories.
- Goal 1: Build a realistic synthetic OCR pipeline for Korean and Japanese using an LLM-generated corpus and browser-rendered markdown.
- Goal 2: Create a reproducible evaluation framework with checkpoint resume, model config YAMLs, and structured reports.
- Goal 3: Publish open datasets to HuggingFace Hub and maintain a living leaderboard.
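A model config for the evaluation framework might look like the sketch below. The field names (`model`, `evaluation`, `report`, etc.) are illustrative assumptions, not the toolkit's actual schema:

```yaml
# Hypothetical model config — field names are illustrative, not the real schema.
model:
  name: example-ocr-model          # identifier used in reports and the leaderboard
  api: openai-batch                # batch API support, per Goal 2
  max_concurrency: 8
evaluation:
  languages: [ko, ja]
  categories: [text, table, formula]
  checkpoint_dir: checkpoints/example-ocr-model/   # enables resume after interruption
report:
  formats: [json, markdown, html]
```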
⚙️ Key Features & Contributions
- LLM-Backed Corpus Generation: The OpenAI API generates a realistic Korean/Japanese text corpus, reused across the generation and publish workflows.
- Headless Playwright Renderer: Chromium renders markdown pages with configurable noise, blur, and character-similarity-based typo substitutions for realistic documents.
- Evaluation Pipeline: YAML model configs, batch API support, checkpoint-based resume — outputs JSON/Markdown/HTML reports and leaderboard files per language.
- HuggingFace Publishing: Sharded dataset upload with per-shard metadata, `realism_stats.json`, and `run_manifest.json` for full reproducibility.
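The character-similarity-based typo substitution can be sketched roughly as follows. The similarity table and function name here are illustrative assumptions, not the toolkit's actual API; a real database would be far larger and derived from observed OCR confusions:

```python
import random

# Illustrative similarity table: visually confusable character pairs.
# Assumption for demonstration — the real database is much larger.
SIMILAR_CHARS = {
    "ㅏ": ["ㅓ"], "ㅁ": ["ㅇ"],      # Hangul jamo that OCR engines often confuse
    "力": ["刀"], "ー": ["一"],      # Japanese katakana/kanji look-alikes
    "O": ["0"], "l": ["1", "I"],
}

def inject_typos(text: str, rate: float = 0.02, seed: int = 0) -> str:
    """Replace a small fraction of characters with visually similar ones."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        candidates = SIMILAR_CHARS.get(ch)
        if candidates and rng.random() < rate:
            out.append(rng.choice(candidates))
        else:
            out.append(ch)
    return "".join(out)
```

Seeding the RNG keeps the noise reproducible across runs, which matters when the same corpus is rendered for multiple dataset versions.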
🔧 Technical Challenges & Solutions
- Formula Rendering: Markdown pages include LaTeX formulas, requiring XeLaTeX + `latex-to-image` integration and bounded formula-render caching for long runs.
- Typo Realism: Built a character-similarity database to generate plausible substitution errors, making synthetic text more representative of real scan noise.
- Scale & Resumability: Sharded output with `run_manifest.json` allows generation to resume mid-run without duplicating completed shards.
- Evaluation Consistency: Checkpoint-based scoring prevents double-counting; protocol snapshots pin dataset versions so results remain comparable across runs.
📈 Results & Learnings
- Korean OCR Leaderboard: Top model (LightOnOCR-2-1B) achieved 0.9737 avg_markdown_overall_score across text, table, and formula categories (100/100 success rate).
- Japanese OCR Leaderboard: Top model achieved 0.9777, with Nanonets-OCR2-3B scoring 0.9605 and DotsOCR 0.9288.
- Pipeline Reliability: End-to-end generation, publish, and evaluation flow runs without manual intervention, enabling repeatable benchmark updates.
- Key Learning: Gained deep expertise in synthetic data quality, multi-language OCR challenges, and building reproducible ML benchmarks.