Checking the Stable Diffusion configs shows that the UNet's cross_attention_dim matches the text encoder's hidden_size.
So... is that supposed to match the projection_dim in CLIPConfig, too? Because on my CLIP, the default image encoder's hidden_size is bigger than the text encoder's. How does that even work?
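
To make the dimensions concrete, here is a rough sketch of how I think the sizes relate. It assumes the values from the openai/clip-vit-large-patch14 config (text hidden_size 768, vision hidden_size 1024, projection_dim 768) and a Stable Diffusion v1.x UNet (cross_attention_dim 768), which may not match the checkpoint in question. `ClipDims`, `projection_shapes`, and `cross_attention_compatible` are made-up helper names for illustration, not library APIs.

```python
from dataclasses import dataclass


@dataclass
class ClipDims:
    """Hypothetical container for the three CLIP sizes being compared."""
    text_hidden_size: int
    vision_hidden_size: int
    projection_dim: int


def projection_shapes(dims):
    """Return (in_features, out_features) for each CLIP projection layer.

    Each encoder has its own linear projection into the shared embedding
    space, so the two hidden sizes do not have to match each other.
    """
    return {
        "text_projection": (dims.text_hidden_size, dims.projection_dim),
        "visual_projection": (dims.vision_hidden_size, dims.projection_dim),
    }


def cross_attention_compatible(cross_attention_dim, dims):
    """The UNet attends over the text encoder's per-token hidden states,
    so the width that has to match is text_hidden_size, not projection_dim."""
    return cross_attention_dim == dims.text_hidden_size


# Assumed values (see the note above); swap in your own config numbers.
clip_l14 = ClipDims(text_hidden_size=768, vision_hidden_size=1024, projection_dim=768)
sd_v1_cross_attention_dim = 768

shapes = projection_shapes(clip_l14)
print(shapes)
print(cross_attention_compatible(sd_v1_cross_attention_dim, clip_l14))
```

If this reading is right, projection_dim only matters for the contrastive image/text embeddings, and it just happens to equal the text hidden_size in this particular checkpoint.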