I love how the conditional U-Net classes don't explain what the "cross_attention_dim" parameter actually means.
Is it the dimension of the CLIP space I trained yesterday? Is it some other thing? I'm reading through Diffusers source just to find out! #PDDiffusion
Checking the Stable Diffusion config shows that its cross_attention_dim matches the text encoder's hidden_size (both 768 in v1.x, which uses CLIP ViT-L/14 for text).
So... is it supposed to match the projection_dim on CLIPProcessor, too? Because on my CLIP, the default image encoder's hidden_size is bigger than the text encoder's. How does that even work?
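For what it's worth, here's a toy numpy sketch of why cross_attention_dim tracks the text encoder's hidden_size rather than projection_dim: the U-Net's cross-attention K/V projections consume the text encoder's last hidden states directly, so their input width has to equal that hidden size. This is NOT Diffusers' actual code, and the sizes below are made up for illustration (only the 77-token CLIP sequence length is real):

```python
import numpy as np

# Hypothetical sizes for illustration; not the real SD values.
query_dim = 320             # U-Net feature channels at some block
cross_attention_dim = 512   # must equal the text encoder's hidden_size
inner_dim = 320             # heads * head_dim inside the attention layer

rng = np.random.default_rng(0)
to_q = rng.standard_normal((query_dim, inner_dim))
# K and V project FROM the text hidden size -- this is the coupling point.
to_k = rng.standard_normal((cross_attention_dim, inner_dim))
to_v = rng.standard_normal((cross_attention_dim, inner_dim))

image_tokens = rng.standard_normal((64, query_dim))  # flattened latent patches
# The conditioning is the text encoder's per-token last hidden states
# (seq_len x hidden_size), not the single pooled/projected embedding.
text_hidden_states = rng.standard_normal((77, cross_attention_dim))

q = image_tokens @ to_q
k = text_hidden_states @ to_k
v = text_hidden_states @ to_v

scores = q @ k.T / np.sqrt(inner_dim)
scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
attn = np.exp(scores)
attn /= attn.sum(axis=-1, keepdims=True)
out = attn @ v  # shape (64, inner_dim): image tokens, text-conditioned
```

Under this reading, projection_dim only matters for CLIP's contrastive similarity head, so a mismatched image-encoder hidden_size wouldn't enter the U-Net at all, but I'd double-check that against the Diffusers source.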