I love how the conditional U-Net classes don't explain what the "cross_attention_dim" parameter actually means.

Is it the dimension of the CLIP space I trained yesterday? Is it some other thing? I'm reading through Diffusers source just to find out!

Checking the Stable Diffusion config shows that their cross_attention_dim matches their text encoder's hidden_size.
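That match would make sense if cross_attention_dim is simply the width of each conditioning token that the U-Net's cross-attention layers receive (the encoder_hidden_states input). Here is a shape-only sketch in plain Python, assuming the keys and values are linearly projected from cross_attention_dim into the attention layer's own width. All sizes are toy values and the helpers (random_matrix, matmul, softmax) are made up for illustration, not Diffusers code.

```python
import math
import random

rng = random.Random(0)

def random_matrix(rows, cols):
    """Random (rows x cols) matrix as a list of lists."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# Toy sizes (assumptions for illustration, not real model values).
cross_attention_dim = 6   # width of each conditioning (text) token
inner_dim = 4             # width the attention layer works in
num_image_tokens = 5
num_text_tokens = 3

# Queries come from the U-Net's own features, keys/values from the conditioning.
image_features = random_matrix(num_image_tokens, inner_dim)
text_states = random_matrix(num_text_tokens, cross_attention_dim)

to_q = random_matrix(inner_dim, inner_dim)
to_k = random_matrix(cross_attention_dim, inner_dim)  # input width = cross_attention_dim
to_v = random_matrix(cross_attention_dim, inner_dim)  # input width = cross_attention_dim

q = matmul(image_features, to_q)
k = matmul(text_states, to_k)
v = matmul(text_states, to_v)

# Scaled dot-product attention: each image token attends over the text tokens.
k_t = [list(col) for col in zip(*k)]
scores = [[s / math.sqrt(inner_dim) for s in row] for row in matmul(q, k_t)]
weights = [softmax(row) for row in scores]
out = matmul(weights, v)

print(len(k), len(k[0]))      # text tokens projected to inner_dim
print(len(out), len(out[0]))  # one output row per image token
```

If that reading is right, the only hard constraint is that the last dimension of whatever you pass as conditioning equals cross_attention_dim. Stable Diffusion passes the text encoder's per-token hidden states, which would explain why the number lines up with hidden_size.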

So... is that supposed to match the projection_dim in CLIPConfig, too? Because the default image encoder's hidden_size is bigger than the text encoder's on my CLIP. How does that even work?
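On the "how does that even work" part: as far as I can tell, CLIP gives each tower its own bias-free linear projection (text_projection and visual_projection in the Transformers CLIPModel), and both map into the shared projection_dim space. The two hidden_size values never need to agree. A toy sketch in plain Python, with made-up sizes and hypothetical helper functions:

```python
import math
import random

rng = random.Random(0)

def random_matrix(rows, cols):
    """Random (rows x cols) matrix as a list of lists."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy sizes standing in for the config fields (not real model values).
text_hidden_size = 4
vision_hidden_size = 6   # deliberately bigger than the text side
projection_dim = 3       # shared embedding space

# One projection per tower, each with its own input width.
text_projection = random_matrix(text_hidden_size, projection_dim)
visual_projection = random_matrix(vision_hidden_size, projection_dim)

# Pooled outputs of each encoder (random stand-ins here).
text_pooled = random_matrix(1, text_hidden_size)
image_pooled = random_matrix(1, vision_hidden_size)

text_embed = matmul(text_pooled, text_projection)[0]
image_embed = matmul(image_pooled, visual_projection)[0]

# Both embeddings now live in the same projection_dim space and can be compared.
similarity = cosine(text_embed, image_embed)
print(len(text_embed), len(image_embed))
```

So projection_dim only matters if you condition on the projected embeddings. If you feed the text encoder's per-token hidden states instead, hidden_size is the number that has to equal cross_attention_dim.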
