🔒 Internal project — source code and detailed outputs are confidential. Metrics and architecture described here are based on personal notes from the internship period.
📋 Project Overview
Built a multi-agent pipeline that quantitatively analyzes drama scripts for storyline quality, character appeal, and commercial potential to support production decisions. Developed by a 4-person team and deployed for actual script review during the alpha-beta-gamma test period at CJ AI Center.
🎯 Problem Definition & Goals
- Problem: Manual review of 1–4 episodes per script was expensive and slow, with reviewer bias creating inconsistent evaluations and bottlenecks that risked overlooking valuable scripts.
- Goal 1: Automate script analysis with objective metrics through a multi-agent system.
- Goal 2: Robustly parse diverse script formats (PDF, HWP, DOCX) with high Korean text accuracy.
- Goal 3: Improve OCR quality (WER) and scene-classification quality (F1) to a level viable for real production use.
⚙️ Key Features & Contributions
- Multi-Format Document Parser: Preprocessing modules to convert PDF, HWP, DOCX scripts into structured text, with a Korean OCR benchmark dataset for validation.
- VLM Integration (Qwen2.5-VL-7B): Replaced traditional OCR with a Vision-Language Model to handle complex multi-column layouts and Korean text, eliminating column-ordering errors.
- Scene Analysis Agent: Scene-level strength/weakness classifier with CoT prompting, enabling the LLM to reason about contextual nuance and subtext.
- AWS On-Demand GPU Deployment: Designed a boto3-driven on-demand EC2 provisioning flow — pre-initialized EBS volume mounts, AMI-based startup, and region-aware GPU capacity selection — eliminating idle GPU costs for this periodic-use workload.
- LangChain Orchestration: Connected Parser, Analyzer, and Evaluator agents via LangChain with state-based workflow and conditional parallel execution.
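The on-demand provisioning flow above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the AMI IDs, instance type, tag values, and region list are placeholders, and the real flow also mounted pre-initialized EBS volumes after launch.

```python
"""Sketch of boto3-driven on-demand EC2 provisioning with region-aware
fallback. All identifiers here are hypothetical placeholders."""


# Regions to try in priority order when GPU capacity is scarce (illustrative).
REGION_PRIORITY = ["ap-northeast-2", "us-west-2", "us-east-1"]


def build_launch_params(ami_id: str, instance_type: str = "g5.xlarge") -> dict:
    """Build run_instances kwargs for an AMI with the environment baked in,
    so instance boot skips any slow S3 download."""
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        # Tag instances so a cleanup job can terminate them after each run,
        # keeping the workload strictly on-demand (no idle GPU cost).
        "TagSpecifications": [{
            "ResourceType": "instance",
            "Tags": [{"Key": "workload", "Value": "script-analysis"}],
        }],
    }


def launch_in_first_available_region(ami_by_region: dict) -> str:
    """Try regions in priority order; return the instance id of the first
    successful launch. Capacity errors fall through to the next region."""
    import boto3  # imported lazily so the pure helper above needs no AWS deps

    for region in REGION_PRIORITY:
        if region not in ami_by_region:
            continue
        ec2 = boto3.client("ec2", region_name=region)
        try:
            resp = ec2.run_instances(**build_launch_params(ami_by_region[region]))
            return resp["Instances"][0]["InstanceId"]
        except ec2.exceptions.ClientError:
            continue  # e.g. InsufficientInstanceCapacity: try the next region
    raise RuntimeError("no region had GPU capacity")
```

Keeping the launch parameters in a pure function makes the capacity-fallback logic easy to test without touching AWS.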
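The Parser → Analyzer → Evaluator flow ran on LangChain in the actual project; the state-based pattern with conditional parallel execution can be sketched in plain Python (agent bodies, state keys, and the scene-count threshold below are illustrative stand-ins, not the real implementation):

```python
"""Stdlib-only sketch of a state-based multi-agent workflow: each agent
reads and extends a shared state dict, and the analyzer fans out in
parallel only when the scene count warrants it."""
from concurrent.futures import ThreadPoolExecutor


def parser_agent(state: dict) -> dict:
    # Stand-in for the VLM-based parser: split the script into scenes.
    state["scenes"] = [s.strip() for s in state["raw_script"].split("\n\n") if s.strip()]
    return state


def analyzer_agent(state: dict) -> dict:
    # Stand-in for the CoT scene classifier: toy strength/weakness rule.
    def classify(scene: str) -> str:
        return "strength" if len(scene) > 20 else "weakness"

    # Conditional parallelism: fan out only when there are many scenes.
    if len(state["scenes"]) > 4:
        with ThreadPoolExecutor() as pool:
            state["labels"] = list(pool.map(classify, state["scenes"]))
    else:
        state["labels"] = [classify(s) for s in state["scenes"]]
    return state


def evaluator_agent(state: dict) -> dict:
    # Aggregate scene labels into a single script-level score.
    state["score"] = state["labels"].count("strength") / max(len(state["labels"]), 1)
    return state


def run_pipeline(raw_script: str) -> dict:
    state = {"raw_script": raw_script}
    for agent in (parser_agent, analyzer_agent, evaluator_agent):
        state = agent(state)
    return state
```

Passing one mutable state object through the chain is what lets downstream agents reuse upstream outputs without bespoke interfaces between each pair.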
🔧 Technical Challenges & Solutions
- Korean OCR — WER >20%: Traditional OCR (PaddleOCR) showed high error rates and column-ordering failures on multi-column script layouts.
- Solution: Switched to Qwen2.5-VL-7B for document-structure-aware parsing, cutting WER from 20% to 7%.
- Scene Classification — F1 0.2: Sparse training data and over-segmented genre-specific prompts caused the model to overfit to surface patterns rather than understand narrative nuance.
- Solution: Benchmarked reasoning models with strong Korean performance (DeepSeek-R1, c4ai-command-a) and simplified prompts to let the model reason freely, improving F1 from 0.2 to 0.5.
- AWS Cold Start & Cost: On-premises deployment failed, and a naive AWS setup with S3 downloads at boot caused 20–30 minute cold starts.
- Solution: Pre-initialized EBS volumes, AMI snapshots, and uv-based environment setup cut startup time significantly while removing idle GPU costs.
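The WER numbers quoted above follow the standard definition: word-level edit distance divided by the reference word count. A minimal implementation of that metric (mine, not the project's benchmark code) looks like:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance (substitutions +
    deletions + insertions) divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1] / max(len(ref), 1)
```

Because it tokenizes on whitespace, results for Korean depend on consistent spacing in both transcripts; character-level variants are common when spacing itself is noisy.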
📈 Results & Learnings
- OCR Accuracy: WER reduced from 20% → 7% via VLM-based parsing, securing reliable input data for downstream agents.
- Classification Quality: Scene-level F1 score improved from 0.2 → 0.5 through model selection and prompt simplification.
- Production Deployment: Pipeline used in actual script review during the alpha-beta-gamma test period. On-demand GPU structure eliminated idle costs for this periodic workload.
- Key Learnings: Data quality and simple prompting outperform over-engineered constraints. Infrastructure details (EBS mount stability, AMI readiness, regional GPU capacity) matter as much as model performance for real deployments.