A text-conditioned U-Net diffusion model implemented in PyTorch and trained on COCO Captions (118K images). The architecture is a U-Net with cross-attention layers for text conditioning via CLIP. The model has 35M parameters and generates images at 64×64 resolution. Every component (noise scheduler, U-Net, cross-attention, DDIM sampling) was implemented from scratch.
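
To illustrate two of the components named above, the noise scheduler and DDIM sampling, here is a minimal sketch. It is not this project's code: the linear beta schedule, its endpoints (1e-4 to 0.02), the 1000 training steps, and the function names `linear_alpha_bars`, `add_noise`, and `ddim_step` are all assumptions chosen for illustration. It works on plain Python floats so it runs without dependencies; the same arithmetic would apply elementwise to PyTorch tensors.

```python
import math


def linear_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Return alpha_bar_t = prod_{s<=t} (1 - beta_s) for a linear beta schedule.

    The schedule type and its endpoints are assumed, not taken from the project.
    """
    alpha_bars, running_product = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running_product *= 1.0 - beta
        alpha_bars.append(running_product)
    return alpha_bars


def add_noise(x0, eps, alpha_bar_t):
    """Forward process: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    return math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * eps


def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) from timestep t to an earlier one.

    eps_pred stands in for the noise the U-Net would predict at timestep t.
    """
    # Estimate the clean sample implied by the predicted noise.
    x0_pred = (x_t - math.sqrt(1.0 - alpha_bar_t) * eps_pred) / math.sqrt(alpha_bar_t)
    # Re-noise that estimate to the earlier timestep along the same noise direction.
    return math.sqrt(alpha_bar_prev) * x0_pred + math.sqrt(1.0 - alpha_bar_prev) * eps_pred


# Usage: noise a scalar "pixel", then step back with a perfect noise prediction.
alpha_bars = linear_alpha_bars()
x0, eps = 0.5, -1.2
x_t = add_noise(x0, eps, alpha_bars[500])
x_prev = ddim_step(x_t, eps, alpha_bars[500], alpha_bars[450])
```

Because the update is deterministic, DDIM can skip timesteps (here 500 to 450 in a single step). When the noise prediction is exact, the result matches noising `x0` directly to the earlier timestep. In a real sampler the prediction would come from the text-conditioned U-Net instead of the known `eps`.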