CLIP is training. Finally. #PDDiffusion
Training times are already kinda high - probably around one hour per epoch. I think part of it is that some of the Wikimedia Commons imagery needs to be downscaled in advance, because there are some absurdly large images in that dataset.
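For reference, the kind of pre-downscaling pass I mean is roughly this. It's just a sketch with Pillow; the 512px cap and folder names are placeholders, not what the repo actually uses:

```python
import os
from PIL import Image

# Commons has images big enough to trip Pillow's decompression-bomb check.
Image.MAX_IMAGE_PIXELS = None

MAX_SIDE = 512  # placeholder cap


def downscale_dir(src_dir: str, dst_dir: str) -> None:
    os.makedirs(dst_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        src = os.path.join(src_dir, name)
        dst = os.path.join(dst_dir, os.path.splitext(name)[0] + ".png")
        try:
            with Image.open(src) as img:
                img = img.convert("RGB")
                # thumbnail() only ever shrinks, and keeps the aspect ratio.
                img.thumbnail((MAX_SIDE, MAX_SIDE), Image.LANCZOS)
                img.save(dst)
        except OSError:
            # Skip anything Pillow can't decode (SVGs, truncated downloads, etc.).
            continue


downscale_dir("commons_raw", "commons_downscaled")  # placeholder paths
```

Doing this once up front means the trainer isn't repeatedly decoding multi-hundred-megapixel scans every epoch.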
✅ Set up a U-Net trainer
✅ Set up a CLIP trainer
❌ Set up conditional U-Net training for txt2image (rough sketch of the training step after this list)
❌ Test with some actual prompts
❌ Calculate how expensive it is to scale this up
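The conditional training step I'm working toward is roughly this shape: noise an image at a random timestep, run the CLIP text embeddings into the U-Net's cross-attention, and train the U-Net to predict the noise. This is only a sketch on top of HuggingFace diffusers/transformers; the CLIP checkpoint name, image size, channel counts, and hyperparameters are placeholders (I'd swap in the CLIP I'm training), not what PDDiffusion actually has yet:

```python
import torch
from diffusers import UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

# Placeholder checkpoint - in practice this would be my own trained CLIP.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

# Pixel-space U-Net with cross-attention sized to the text encoder (assumed config).
unet = UNet2DConditionModel(
    sample_size=64,
    in_channels=3,
    out_channels=3,
    cross_attention_dim=text_encoder.config.hidden_size,
)
scheduler = DDPMScheduler(num_train_timesteps=1000)
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-4)


def train_step(images: torch.Tensor, captions: list[str]) -> torch.Tensor:
    # Encode captions with the (frozen) CLIP text encoder.
    tokens = tokenizer(
        captions,
        padding="max_length",
        truncation=True,
        max_length=tokenizer.model_max_length,
        return_tensors="pt",
    )
    with torch.no_grad():
        text_embeds = text_encoder(tokens.input_ids)[0]

    # Add noise at random timesteps, then predict that noise conditioned on the text.
    noise = torch.randn_like(images)
    timesteps = torch.randint(
        0, scheduler.config.num_train_timesteps, (images.shape[0],)
    )
    noisy = scheduler.add_noise(images, noise, timesteps)
    pred = unet(noisy, timesteps, encoder_hidden_states=text_embeds).sample

    loss = torch.nn.functional.mse_loss(pred, noise)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss
```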