📋 Project Overview
A lightweight, educational implementation of Stable Diffusion 3 (SD3) built from scratch in PyTorch.
🎯 Problem Definition & Goals
- Problem: State-of-the-art Stable Diffusion models are too large to train on consumer hardware.
- Goal 1: Create a scaled-down SD3 implementation trainable on a single consumer GPU.
- Goal 2: Implement the complete pipeline including VAE, text encoders, and MMDiT from scratch.
- Goal 3: Extend to video/GIF generation using motion modules.
⚙️ Key Features & Contributions
- MMDiT Architecture: Implemented the Multimodal Diffusion Transformer (MMDiT), which processes text and image tokens in a joint attention stream.
- Rectified Flow: Used the rectified-flow (flow-matching) formulation, which learns near-straight noise-to-data paths and allows sampling in fewer steps.
- Complete Pipeline: Built VAE, CLIP/T5 text conditioning, and full denoising pipeline.
- Motion Module: Added temporal attention layers for animated GIFs.
- Educational Design: Clean, well-commented codebase.
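The rectified-flow objective above is compact enough to sketch. The snippet below is a minimal illustration, not the project's actual training loop: it assumes a hypothetical `model(x_t, t, cond)` denoiser (standing in for the MMDiT) that predicts the velocity along the straight path between data and noise.

```python
import torch

def rectified_flow_loss(model, x0, cond):
    """One rectified-flow training step (sketch).

    The model is trained to predict the constant velocity (noise - x0)
    along the straight interpolation x_t = (1 - t) * x0 + t * noise.
    `model` and `cond` are placeholders for the MMDiT and text embeddings.
    """
    noise = torch.randn_like(x0)
    # Sample one timestep t in [0, 1] per batch element.
    t = torch.rand(x0.shape[0], device=x0.device)
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))  # broadcast over spatial dims
    # Linear interpolation between data and pure noise.
    x_t = (1.0 - t_) * x0 + t_ * noise
    # Target velocity is the straight-line direction from data to noise.
    target = noise - x0
    pred = model(x_t, t, cond)
    return torch.mean((pred - target) ** 2)
```

Because the learned paths are (approximately) straight, an Euler sampler can integrate them with far fewer steps than a curved diffusion trajectory requires.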
🔧 Technical Challenges & Solutions
- Memory Constraints: Scaled down model width and depth so training fits in consumer GPU memory while preserving SD3's core architectural innovations.
- VAE Training Stability: Implemented KL annealing and perceptual loss.
- Text-Image Alignment: Used classifier-free guidance (CFG) and improved cross-attention between text embeddings and image tokens.
- Motion Consistency: Added temporal attention so generated frames transition smoothly across a clip.
📈 Results & Learnings
- Successful Generation: Model generates coherent images from text prompts.
- Training Efficiency: The entire pipeline trains end-to-end on a single RTX 3090 (24 GB).
- Educational Impact: Codebase serves as learning reference.
- Key Learning: Building every component from scratch gave a comprehensive understanding of the SD3 architecture and its training dynamics.