Pinned post

Hello, and welcome to the Fediverse!

No, I didn't say Metaverse, I said Fediverse! That's a whole different thing that is also the product of a billionaire playboy driving a multi-billion dollar enterprise into the ground.

You may have known me as Libertardian on Twitter, or the branded account that is now just @admin.

And now the U-Net training loop is choking because CLIP wants everything on the CPU for some reason...

Might as well just move the CLIP step into dataset loading at this point

Never mind, it turns out the unet part of PDDiffusion has an os.chdir("output") right at the start that throws everything off
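
For the record, the less surprising pattern is to resolve one absolute output directory up front instead of rebasing the whole process. A hedged sketch (not PDDiffusion's actual code; `output_path` is a hypothetical helper):

```python
# Instead of calling os.chdir("output") at startup - which silently
# rebases every relative path in the entire process - resolve one
# absolute output directory once and join against it.
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def output_path(name):
    # hypothetical helper: always returns an absolute path under OUTPUT_DIR
    return OUTPUT_DIR / name
```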

Oh no, I forgot to save the image preprocessor config when training vocabulary

No matter, we're just using the defaults from CLIPFeatureExtractor, I can just copy the preprocessor_config.json from OpenAI CLIP (they're the same, and uncopyrightable)

...Oh no, it's not actually trying to read the file, is it?
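
For posterity: the file in question is a tiny JSON blob, so writing it by hand works too. The values below are the stock CLIP preprocessing defaults as far as I know (OpenAI's normalization constants, 224px resize/crop) — double-check them against your transformers version:

```python
# Hedged sketch: write the default CLIP image-preprocessing settings to
# preprocessor_config.json by hand. Values are believed to match the
# CLIPFeatureExtractor defaults; verify against your installed version.
import json

DEFAULT_PREPROCESSOR = {
    "crop_size": 224,
    "do_center_crop": True,
    "do_normalize": True,
    "do_resize": True,
    "image_mean": [0.48145466, 0.4578275, 0.40821073],
    "image_std": [0.26862954, 0.26130258, 0.27577711],
    "size": 224,
}

def save_preprocessor_config(path):
    with open(path, "w") as f:
        json.dump(DEFAULT_PREPROCESSOR, f, indent=2)
```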

Checking the Stable Diffusion config shows that their cross_attention_dim matches their text encoder's hidden_size.

So... is that supposed to match the projection_dim on CLIPProcessor, too? Because the default image encoder's hidden_size is bigger than the text encoder's on my CLIP. How does that even work?
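
Best I can tell from reading around, the answer is shape-mechanical: in cross-attention, K and V are linear projections *of the text hidden states*, so cross_attention_dim has to equal the text encoder's hidden_size — the image encoder's hidden_size and the shared projection_dim never enter the U-Net at all. A toy version of the shape constraint (plain lists, stand-in for the real attention):

```python
# Toy cross-attention showing where cross_attention_dim lives: it is the
# input width of the K/V projections, i.e. the width of the text hidden
# states being attended over. Not the real diffusers code.
def linear(x, w):
    # x: (n, d_in), w: (d_in, d_out) -> (n, d_out)
    return [[sum(row[k] * w[k][j] for k in range(len(w)))
             for j in range(len(w[0]))] for row in x]

def cross_attend(image_tokens, text_hidden, w_q, w_k, w_v):
    q = linear(image_tokens, w_q)   # queries come from the image side
    k = linear(text_hidden, w_k)    # w_k has cross_attention_dim rows
    v = linear(text_hidden, w_v)
    # softmax omitted - the shapes are the point here
    scores = [[sum(qi[t] * kj[t] for t in range(len(qi))) for kj in k]
              for qi in q]
    return [[sum(s[j] * v[j][t] for j in range(len(v)))
             for t in range(len(v[0]))] for s in scores]
```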

I love how the conditional U-Net classes don't explain what the "cross_attention_dim" parameter actually means.

Is it the dimension of the CLIP space I trained yesterday? Is it some other thing? I'm reading through Diffusers source just to find out!

update: CLIP has finished training.

Now to figure out how to train a *conditional* U-Net...

On November 25, at the urging of a far-right troll, Elon Musk banned the @CrimethInc Twitter account.

Musk’s goal in acquiring Twitter had nothing to do with “free speech”—it was a partisan move to silence opposition, paving the way for fascist violence.

Please help us circulate our full statement:

Please follow us on Telegram and Mastodon.

Oh dear. I've just been informed that collecting the names of every person on the planet for my naughty and nice lists is, and I quote, “a significant and wholly irresponsible breach of #GDPR”.
I'm going to hand out about 8 billion consent forms soon. If you could all get them back to me ASAP that would be appreciated.

I took the day off today, under the impression that I would be relaxing and playing video games. The cat was under the impression that I would be attending to her every whim.
So we have compromised and I am attending to the cat's every whim.

Get £1,000,000s worth of ebooks and digital audiobooks for FREE this #BlackFriday.

Just register with your local public library, download the app and they're right there.

p.s. it works every other day too.


That last one is the thing that really bugs me.

Art generators do not "take references" in the same way a human does. You have a model that does noise prediction (U-Net), and a pair of models that do image classification (CLIP).

What the generator does is take a starting image, ask CLIP how close it is to the guidance, and then let the U-Net use that information to condition its noise prediction.

This is entirely a hack that just so happens to work for """drawing""" an image.
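
To make the hack concrete, here's a toy version of that loop. The two "models" are stand-in functions, not the real networks — only the control flow is the point:

```python
# Toy sketch of CLIP-guided denoising: score the current image against
# the guidance, feed that score into a conditioned noise prediction,
# subtract the predicted noise, repeat. Stand-in math throughout.
def clip_similarity(image, prompt_embedding):
    # stand-in "how close is this image to the guidance?" score
    return sum(a * b for a, b in zip(image, prompt_embedding))

def unet_predict_noise(image, conditioning):
    # stand-in conditioned noise prediction
    return [0.1 * x + 0.01 * conditioning for x in image]

def generate(image, prompt_embedding, steps=50):
    for _ in range(steps):
        cond = clip_similarity(image, prompt_embedding)
        noise = unet_predict_noise(image, cond)
        image = [x - n for x, n in zip(image, noise)]
    return image
```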

The result of asking on The Orange Site™️ why Stable Diffusion has no train-from-scratch example code (related to ):

- One guy excited that I was talking about actually using licensed images

- A handful of people telling me to just fine-tune because I'll never be able to afford to scale up

- The usual smoldering tire fire of arguments between people who hate Copilot, people who hate copyright, and people who don't understand how diffusion models work

Thinking about scaling up ...

The current size of my scraped subset of Wikimedia Commons images is around 33GB. It will get way larger. We already know from my escapades with that local storage is hella expensive in AWS but S3 is super cheap.

My escapades with Paparouna CI have also told me that spot pricing for EC2 is hella cheap.

But these won't jibe well - having all the data on S3 means idle time as the instances grab pieces of the dataset on startup.

CLIP is training. Finally.

Train times are already kinda high - like, probably one hour per epoch. I think part of it is just that some of the Wikimedia Commons imagery needs to get downscaled in advance, because there are some absurdly large images in that dataset.
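
The pre-pass I have in mind is just a long-edge cap, something like this (the 1024px number is a guess on my part, assuming it comfortably covers the training resolution):

```python
# Compute a downscale target that caps the long edge while preserving
# aspect ratio - the kind of pre-pass that keeps absurdly large Commons
# scans from dominating per-epoch decode time.
def downscale_size(width, height, max_edge=1024):
    if max(width, height) <= max_edge:
        return width, height  # already small enough; leave untouched
    scale = max_edge / max(width, height)
    return round(width * scale), round(height * scale)
```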

✅ Set up a U-Net trainer
✅ Set up a CLIP trainer
❌ Set up conditional U-Net training for txt2image
❌ Test with some actual prompts
❌ Calculate how expensive it is to scale this up

Next problem: the CLIP tokenizer is refusing to pad things even though it knows the maximum length and the padding token, so torch.tensor shits its pants
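
If I'm reading the transformers docs right, you have to ask for this explicitly (padding="max_length" or padding=True in the tokenizer call) — it doesn't happen just because the tokenizer knows the values. What I wanted it to do, as a toy:

```python
# Toy version of the padding I expected: right-pad (and truncate) every
# id sequence to max_length with pad_token_id, so the batch is a
# rectangle that torch.tensor can actually stack.
def pad_batch(id_lists, pad_token_id, max_length):
    return [ids[:max_length] + [pad_token_id] * max(0, max_length - len(ids))
            for ids in id_lists]
```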

I'll give it that it tried to warn me about unnamed columns, but that was in the middle of about 40 other lines of logspew, and I had already gone back and forth checking, rechecking, and renaming columns to try and get it to work

Ok, you know how I've been banging my head against an inscrutable bug setting up for the past week or so?

The fix? There's an on-by-default option in TrainingArguments called remove_unused_columns. It deletes data the model doesn't know about.

Except I'm using Dataset transforms to transform all the columns into the names that the model wants, and those run AFTER the Trainer decides to delete ALL THE DATA in the dataset!
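
Simplified, the default behavior amounts to this (and the fix, as far as I can tell, is just passing remove_unused_columns=False to TrainingArguments):

```python
# Roughly what Trainer does with remove_unused_columns=True (simplified,
# not the actual transformers code): drop any column whose name isn't an
# argument of model.forward() - BEFORE any on-the-fly dataset transform
# gets a chance to rename things.
def prune_columns(row, forward_args):
    return {k: v for k, v in row.items() if k in forward_args}

# my raw column names vs. the names the model actually wants:
raw = {"image": "<pil image>", "caption": "a brown dog"}
pruned = prune_columns(raw, {"pixel_values", "input_ids"})  # nothing matches
```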
