📋 Project Overview
A lightweight, educational implementation of Stable Diffusion 3 (SD3) built from scratch in PyTorch.
🎯 Problem Definition & Goals
- Problem: State-of-the-art Stable Diffusion models are too large to train on consumer hardware.
- Goal 1: Create a scaled-down SD3 implementation trainable on a single consumer GPU.
- Goal 2: Implement the complete pipeline including VAE, text encoders, and MMDiT from scratch.
- Goal 3: Extend to video/GIF generation using motion modules.
⚙️ Key Features & Contributions
- MMDiT Architecture: Implemented the Multimodal Diffusion Transformer (MMDiT), which processes text and image tokens in a joint attention stream.
- Rectified Flow: Used the rectified-flow (flow-matching) formulation, which learns near-straight noise-to-data paths and allows sampling in fewer steps.
- Complete Pipeline: Built VAE, CLIP/T5 text conditioning, and full denoising pipeline.
- Motion Module: Added temporal attention layers for animated GIFs.
- Educational Design: Clean, well-commented codebase.
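The rectified-flow objective above is compact enough to sketch. The snippet below is a minimal illustration, not the project's actual training loop: it assumes a hypothetical `model(x_t, t, cond)` denoiser (standing in for the MMDiT) that predicts the velocity along the straight path between data and noise.

```python
import torch

def rectified_flow_loss(model, x0, cond):
    """One rectified-flow training step (sketch).

    The model is trained to predict the constant velocity (noise - x0)
    along the straight interpolation x_t = (1 - t) * x0 + t * noise.
    `model` and `cond` are placeholders for the MMDiT and text embeddings.
    """
    noise = torch.randn_like(x0)
    # Sample one timestep t in [0, 1] per batch element.
    t = torch.rand(x0.shape[0], device=x0.device)
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))  # broadcast over spatial dims
    # Linear interpolation between data and pure noise.
    x_t = (1.0 - t_) * x0 + t_ * noise
    # Target velocity is the straight-line direction from data to noise.
    target = noise - x0
    pred = model(x_t, t, cond)
    return torch.mean((pred - target) ** 2)
```

Because the learned paths are (approximately) straight, an Euler sampler can integrate them with far fewer steps than a curved diffusion trajectory requires.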
🔧 Technical Challenges & Solutions
- Memory Constraints: Scaled down model width and depth so training fits in consumer GPU memory while preserving SD3's core architectural innovations.
- VAE Training Stability: Implemented KL annealing and perceptual loss.
- Text-Image Alignment: Used classifier-free guidance (CFG) and improved cross-attention between text embeddings and image tokens.
- Motion Consistency: Added temporal attention so generated frames transition smoothly across a clip.
📈 Results & Learnings
- Successful Generation: Model generates coherent images from text prompts.
- Training Efficiency: The entire pipeline trains end-to-end on a single RTX 3090 (24 GB).
- Educational Impact: Codebase serves as learning reference.
- Key Learning: Building every component from scratch gave a comprehensive understanding of the SD3 architecture and its training dynamics.