Synthetic OCR Image Generator

2025.12 - 2026.02

Project Overview

A markdown-first synthetic OCR dataset generator and benchmarking toolkit that covers the full pipeline: LLM-backed corpus generation → headless Playwright rendering with realistic noise → publishing to the Hugging Face Hub → reproducible OCR/VLM model evaluation → ranked leaderboard. Supports Korean and Japanese.
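
As a rough illustration of the last two pipeline stages (evaluation and ranked leaderboard), the sketch below scores model outputs against ground-truth text and sorts models by mean error. It assumes character error rate (CER) as the metric, which suits Korean and Japanese text where word boundaries are unreliable; the project's actual metric is not stated here. The names `cer`, `rank_models`, `model_a`, and `model_b` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


def rank_models(references, predictions):
    """Return (model, mean CER) pairs sorted best-first (lowest CER).

    `predictions` maps a model name to its outputs, aligned with `references`.
    """
    board = []
    for model, outputs in predictions.items():
        if len(outputs) != len(references):
            raise ValueError(f"{model}: expected {len(references)} outputs")
        scores = [cer(r, h) for r, h in zip(references, outputs)]
        board.append((model, sum(scores) / len(scores)))
    return sorted(board, key=lambda row: row[1])


if __name__ == "__main__":
    # Hypothetical ground truth and model outputs, for illustration only.
    references = ["안녕하세요", "こんにちは"]
    predictions = {
        "model_b": ["안녕하세오", "こんばんは"],
        "model_a": ["안녕하세요", "こんにちは"],
    }
    for name, score in rank_models(references, predictions):
        print(f"{name}\t{score:.3f}")
```

Because the scoring is deterministic and depends only on the reference and output strings, rerunning it over the same published dataset reproduces the same ranking.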

Problem Definition & Goals

Key Features & Contributions

Technical Challenges & Solutions

Results & Learnings

Technologies

Python, Playwright, OpenAI API, Hugging Face Hub, Transformers, XeLaTeX, uv

Links