📋 Project Overview
Developed a bidirectional translation model between Jeju dialect and standard Korean, addressing the challenge of preserving endangered regional languages through AI technology.
🎯 Problem Definition & Goals
- Problem: Jeju dialect is classified as a critically endangered language by UNESCO.
- Goal 1: Build an accurate bidirectional translation system between Jeju dialect and standard Korean.
- Goal 2: Compare different fine-tuning methods (Full, LoRA, QLoRA) for encoder-decoder translation models.
- Goal 3: Create reusable datasets and models to support Korean dialect preservation efforts.
⚙️ Key Features & Contributions
- Dataset Preparation: Cleaned and processed the AIhub "Korean Dialect Utterances (Jeju)" dataset into paired Jeju–standard Korean examples.
- Multi-Model Comparison: Evaluated both KoT5 and KoBART encoder-decoder architectures.
- Parameter-Efficient Fine-tuning: Implemented LoRA and QLoRA (4-bit) and compared them against full fine-tuning.
- Comprehensive Evaluation: Built an evaluation pipeline using BLEU and ROUGE scores.
- Public Release: Published the processed dataset on HuggingFace for reuse by other dialect-preservation efforts.
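The core idea behind the parameter-efficient methods listed above can be sketched in plain NumPy: LoRA freezes the pretrained weight matrix and trains only a small low-rank update. (The project presumably used the `peft` library on real KoT5/KoBART layers; the class name, dimensions, and rank below are purely illustrative.)

```python
import numpy as np

class LoRALinear:
    """A frozen dense layer W plus a trainable low-rank update B @ A (the LoRA idea)."""

    def __init__(self, d_in, d_out, rank=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in))        # frozen pretrained weight
        self.A = rng.standard_normal((rank, d_in)) * 0.01  # trainable, small init
        self.B = np.zeros((d_out, rank))                   # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # y = x (W + scale * B A)^T; at init B = 0, so output equals the base model
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size

layer = LoRALinear(d_in=768, d_out=768, rank=8)
full, lora = layer.W.size, layer.trainable_params()
print(f"full: {full:,}  LoRA: {lora:,}  ({100 * (1 - lora / full):.1f}% fewer trainable)")
```

For a single 768×768 projection at rank 8 the reduction is even steeper than the model-wide figure, since only the two small factors are trained while the base weight stays frozen.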
🔧 Technical Challenges & Solutions
- Low-Resource Language: Training data for Jeju dialect is scarce; mitigated by starting from pretrained Korean seq2seq models (KoT5, KoBART) rather than training from scratch.
- Dialect Variability: Jeju dialect varies significantly across regions, so a single standard-Korean sentence can map to several valid dialect forms.
- Encoder-Decoder Complexity: Seq2seq models required careful hyperparameter tuning to train stably.
- Quantization Trade-offs: 4-bit QLoRA cut memory requirements but showed some quality degradation relative to LoRA and full fine-tuning.
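The quantization trade-off above can be made concrete with a simplified symmetric 4-bit quantizer. (QLoRA actually uses blockwise NF4 quantization via `bitsandbytes`; this plain linear version only demonstrates why squeezing weights into 16 levels loses some precision.)

```python
import numpy as np

def quantize_4bit(w):
    """Symmetric linear 4-bit quantization: integer levels in [-7, 7] plus one scale."""
    scale = np.abs(w).max() / 7.0  # map the largest-magnitude weight to level 7
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate float weights from integer levels."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)  # stand-in for a weight tensor
q, scale = quantize_4bit(w)
w_hat = dequantize(q, scale)
err = np.abs(w - w_hat).mean()
print(f"mean absolute reconstruction error: {err:.4f}")
```

The rounding error is bounded by half a quantization step per weight; accumulated across a whole model, this is the kind of noise that shows up as the mild quality drop observed with QLoRA.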
📈 Results & Learnings
- Translation Quality: Achieved competitive BLEU scores in both translation directions.
- Efficiency Analysis: LoRA reached near full fine-tuning quality while training roughly 90% fewer parameters.
- Cultural Impact: Created tools that can help bridge generational language gaps between Jeju dialect speakers and younger generations.
- Key Learning: Gained practical experience in low-resource NLP, from data cleaning to parameter-efficient fine-tuning and seq2seq evaluation.
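For readers unfamiliar with the metric behind the results above, here is a minimal sentence-level BLEU (clipped n-gram precision plus a brevity penalty). A real evaluation pipeline would typically rely on a library such as `sacrebleu`; this sketch only shows the mechanics.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped n-gram precisions times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        # epsilon smoothing so one missing n-gram order does not zero the whole score
        log_precisions.append(math.log(max(clipped, 1e-9) / total))
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

# a hypothesis identical to its reference scores 1.0
print(bleu("hello world how are you", "hello world how are you"))  # → 1.0
```

The brevity penalty matters for dialect translation in particular, since a model that emits short, safe standard-Korean fragments would otherwise be rewarded by precision alone.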