I love how the conditional U-Net classes don't explain what the "cross_attention_dim" parameter actually means.

Is it the dimension of the CLIP space I trained yesterday? Is it some other thing? I'm reading through the Diffusers source just to find out!


Checking the Stable Diffusion UNet config shows that its cross_attention_dim matches the text encoder's hidden_size.

So... is that supposed to match the CLIP projection_dim, too? Because the default image encoder's hidden_size is bigger than the text encoder's on my CLIP. How does that even work?
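For what it's worth, my reading of the cross-attention blocks in the Diffusers source is that cross_attention_dim is just the input width of the key/value projections — so it has to equal the last dimension of whatever you pass as encoder_hidden_states, i.e. the text encoder's hidden_size, not the projection_dim. A minimal single-head sketch of that wiring (shapes picked for illustration, loosely modeled on SD 1.x's 768-wide text encoder and a 320-channel UNet block — not tied to any real checkpoint):

```python
import math
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Single-head sketch of the UNet's cross-attention wiring.

    `dim` is the UNet feature width; `context_dim` plays the role of
    cross_attention_dim: the *input* width of the key/value projections,
    i.e. the last dimension of encoder_hidden_states.
    """
    def __init__(self, dim, context_dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)          # queries from UNet features
        self.to_k = nn.Linear(context_dim, dim, bias=False)  # keys from text hidden states
        self.to_v = nn.Linear(context_dim, dim, bias=False)  # values from text hidden states
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x, context):
        q = self.to_q(x)        # (B, N, dim)
        k = self.to_k(context)  # (B, T, dim)
        v = self.to_v(context)  # (B, T, dim)
        attn = torch.softmax(q @ k.transpose(-1, -2) / math.sqrt(q.shape[-1]), dim=-1)
        return self.to_out(attn @ v)  # (B, N, dim) -- the context length T drops out

# Illustrative shapes: 77 text tokens of hidden_size 768, attended to by
# 64 spatial positions of 320-channel UNet features.
text_hidden_states = torch.randn(2, 77, 768)  # (B, seq_len, hidden_size)
unet_features = torch.randn(2, 64, 320)       # (B, H*W, channels)

block = CrossAttention(dim=320, context_dim=768)
out = block(unet_features, text_hidden_states)
print(out.shape)  # torch.Size([2, 64, 320])
```

On that reading, projection_dim would only matter for the pooled embedding that CLIP's contrastive head produces, which this cross-attention never sees — so the image encoder's width being different wouldn't enter into it.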
