📋 Project Overview
A synthetic data generator that creates natural-looking, diverse text images for OCR model training.
🎯 Problem Definition & Goals
- Problem: OCR training requires large amounts of labeled text images, but real data collection is expensive.
- Goal 1: Develop a synthetic text image generator realistic enough for OCR training.
- Goal 2: Apply diverse fonts, backgrounds, and noise to simulate real-world variability.
- Goal 3: Build an efficient pipeline for generating large quantities of synthetic images.
⚙️ Key Features & Contributions
- Font Diversity: Supports various Korean and English fonts for diverse document styles.
- Background Diversity: Applies solid colors, textures, and gradients.
- Noise Simulation: Simulates camera noise, blur, and distortion.
- Auto Labeling: Automatically generates accurate text labels.
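
As a rough illustration of the Noise Simulation feature above, the sketch below applies additive Gaussian noise and a 3x3 box blur to a grayscale image stored as a nested list of 0..255 values. This is an assumption-laden toy: the overview does not say which image library or noise models the project uses, and the function names `add_gaussian_noise` and `box_blur` are hypothetical.

```python
import random


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise to every pixel, clamped to 0..255."""
    return [
        [min(255, max(0, round(p + rng.gauss(0.0, sigma)))) for p in row]
        for row in image
    ]


def box_blur(image):
    """3x3 mean filter; border pixels average only the neighbours that exist."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
            ]
            row.append(round(sum(vals) / len(vals)))
        out.append(row)
    return out


if __name__ == "__main__":
    rng = random.Random(0)          # seeded so runs are repeatable
    img = [[0] * 5 for _ in range(5)]
    img[2][2] = 255                 # a single bright pixel standing in for a glyph
    degraded = add_gaussian_noise(box_blur(img), sigma=10.0, rng=rng)
    for row in degraded:
        print(row)
```

Blurring before adding noise loosely mimics optics followed by sensor noise; a real pipeline would more likely run the same steps on arrays from an image library.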
🔧 Technical Challenges & Solutions
- Realism: Synthetic images had to look realistic enough to be useful for training, so several noise and distortion effects were combined.
- Font Rendering: Supporting diverse Korean font styles was difficult, so multiple font libraries were explored.
- Data Diversity: Generated data had to cover real-world variability, so generation parameters were randomized.
- Generation Efficiency: Large-scale generation required performance optimization, so batch processing and parallelization were applied.
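
The parameter randomization and batch parallelization described above can be sketched as follows. Everything here is illustrative: the option pools (`FONTS`, `BACKGROUNDS`, `TEXTS`), the parameter ranges, and the names `SampleSpec`, `make_spec`, and `generate_batch` are assumptions, not the project's actual configuration. A thread pool keeps the sketch simple; CPU-bound rendering would more plausibly use a process pool.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Hypothetical option pools; the real fonts and backgrounds are not listed here.
FONTS = ["NanumGothic", "NotoSansKR", "DejaVuSans"]
BACKGROUNDS = ["solid", "texture", "gradient"]
TEXTS = ["영수증", "합계 12,000원", "Invoice", "Total: 42.00"]


@dataclass(frozen=True)
class SampleSpec:
    index: int
    text: str            # the rendered text doubles as the ground-truth label
    font: str
    font_size: int
    background: str
    blur_radius: float
    noise_sigma: float


def make_spec(index, base_seed=0):
    """Draw one sample's parameters from a per-sample seeded RNG."""
    # Seeding per sample makes the result independent of worker scheduling.
    rng = random.Random(base_seed * 1_000_003 + index)
    return SampleSpec(
        index=index,
        text=rng.choice(TEXTS),
        font=rng.choice(FONTS),
        font_size=rng.randint(16, 48),
        background=rng.choice(BACKGROUNDS),
        blur_radius=round(rng.uniform(0.0, 2.0), 2),
        noise_sigma=round(rng.uniform(0.0, 15.0), 2),
    )


def generate_batch(count, base_seed=0, workers=4):
    """Build specs for a batch in parallel; map() returns them in index order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda i: make_spec(i, base_seed), range(count)))


if __name__ == "__main__":
    for spec in generate_batch(3, base_seed=7):
        print(spec)
```

Deriving each sample's seed from its index means a batch can be regenerated, or split across more workers, without changing which images and labels come out.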
📈 Results & Learnings
- OCR Improvement: Models trained with synthetic data outperformed real-data-only models.
- Data Efficiency: Generated large volumes of training data without manual labeling costs.
- Customization: Enabled domain-specific data generation (receipts, documents, etc.).
- Key Learning: Gained expertise in synthetic data generation and image processing.