📋 Project Overview
Developed a bidirectional translation model between Jeju dialect and standard Korean, addressing the challenge of preserving endangered regional languages through AI technology.
🎯 Problem Definition & Goals
- Problem: Jeju dialect is classified as a critically endangered language by UNESCO.
- Goal 1: Build an accurate bidirectional translation system between Jeju dialect and standard Korean.
- Goal 2: Compare different fine-tuning methods (Full, LoRA, QLoRA) for encoder-decoder translation models.
- Goal 3: Create reusable datasets and models to support Korean dialect preservation efforts.
⚙️ Key Features & Contributions
- Dataset Preparation: Cleaned and processed the AIhub "Korean Dialect Utterances (Jeju)" dataset into paired Jeju–standard Korean examples.
- Multi-Model Comparison: Evaluated both KoT5 and KoBART encoder-decoder architectures.
- Parameter-Efficient Fine-tuning: Implemented LoRA and QLoRA (4-bit) and compared them against full fine-tuning.
- Comprehensive Evaluation: Built an evaluation pipeline using BLEU and ROUGE scores.
- Public Release: Published the processed dataset on HuggingFace for reuse by other dialect-preservation efforts.
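The core idea behind the parameter-efficient methods listed above can be sketched in plain NumPy: LoRA freezes the pretrained weight matrix and trains only a small low-rank update. (The project presumably used the `peft` library on real KoT5/KoBART layers; the class name, dimensions, and rank below are purely illustrative.)

```python
import numpy as np

class LoRALinear:
    """A frozen dense layer W plus a trainable low-rank update B @ A (the LoRA idea)."""

    def __init__(self, d_in, d_out, rank=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in))        # frozen pretrained weight
        self.A = rng.standard_normal((rank, d_in)) * 0.01  # trainable, small init
        self.B = np.zeros((d_out, rank))                   # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # y = x (W + scale * B A)^T; at init B = 0, so output equals the base model
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size

layer = LoRALinear(d_in=768, d_out=768, rank=8)
full, lora = layer.W.size, layer.trainable_params()
print(f"full: {full:,}  LoRA: {lora:,}  ({100 * (1 - lora / full):.1f}% fewer trainable)")
```

For a single 768×768 projection at rank 8 the reduction is even steeper than the model-wide figure, since only the two small factors are trained while the base weight stays frozen.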
🔧 Technical Challenges & Solutions
- Low-Resource Language: Training data for Jeju dialect is scarce; mitigated by starting from pretrained Korean seq2seq models (KoT5, KoBART) rather than training from scratch.
- Dialect Variability: Jeju dialect varies significantly across regions, so a single standard-Korean sentence can map to several valid dialect forms.
- Encoder-Decoder Complexity: Seq2seq models required careful hyperparameter tuning to train stably.
- Quantization Trade-offs: 4-bit QLoRA cut memory requirements but showed some quality degradation relative to LoRA and full fine-tuning.
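The quantization trade-off above can be made concrete with a simplified symmetric 4-bit quantizer. (QLoRA actually uses blockwise NF4 quantization via `bitsandbytes`; this plain linear version only demonstrates why squeezing weights into 16 levels loses some precision.)

```python
import numpy as np

def quantize_4bit(w):
    """Symmetric linear 4-bit quantization: integer levels in [-7, 7] plus one scale."""
    scale = np.abs(w).max() / 7.0  # map the largest-magnitude weight to level 7
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate float weights from integer levels."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)  # stand-in for a weight tensor
q, scale = quantize_4bit(w)
w_hat = dequantize(q, scale)
err = np.abs(w - w_hat).mean()
print(f"mean absolute reconstruction error: {err:.4f}")
```

The rounding error is bounded by half a quantization step per weight; accumulated across a whole model, this is the kind of noise that shows up as the mild quality drop observed with QLoRA.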
📈 Results & Learnings
- Translation Quality: Achieved competitive BLEU scores in both translation directions.
- Efficiency Analysis: LoRA reached near full fine-tuning quality while training roughly 90% fewer parameters.
- Cultural Impact: Created tools that can help bridge generational language gaps between Jeju dialect speakers and younger generations.
- Key Learning: Gained practical experience in low-resource NLP, from data cleaning to parameter-efficient fine-tuning and seq2seq evaluation.
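For readers unfamiliar with the metric behind the results above, here is a minimal sentence-level BLEU (clipped n-gram precision plus a brevity penalty). A real evaluation pipeline would typically rely on a library such as `sacrebleu`; this sketch only shows the mechanics.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped n-gram precisions times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        # epsilon smoothing so one missing n-gram order does not zero the whole score
        log_precisions.append(math.log(max(clipped, 1e-9) / total))
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

# a hypothesis identical to its reference scores 1.0
print(bleu("hello world how are you", "hello world how are you"))  # → 1.0
```

The brevity penalty matters for dialect translation in particular, since a model that emits short, safe standard-Korean fragments would otherwise be rewarded by precision alone.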