That last one is the thing that really bugs me.
Art generators do not "take references" in the same way a human does. You have a model that does noise prediction (the U-Net), and a pair of encoders (CLIP, one for text and one for images) that score how well an image matches a caption.
What the generator does is start from noise, ask CLIP how close the current image is to the guidance text, and then use that signal to steer the U-Net's denoising steps toward the prompt.
This is entirely a hack that just so happens to work for """drawing""" an image.
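The loop described above can be sketched in a few lines. This is a toy with hypothetical stand-ins (no real U-Net or CLIP, just simple functions playing their roles), meant only to show the shape of guided denoising: predict noise, remove a bit of it, then nudge the image up the gradient of a CLIP-style score.

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.normal(size=16)        # stand-in for a CLIP text embedding

def unet(x, t):
    # Stand-in noise predictor (a real U-Net is a trained network).
    return 0.1 * x

def clip_score(x):
    # Stand-in for CLIP's image/text similarity score.
    return -float(np.sum((x - target) ** 2))

def clip_grad(x, eps=1e-4):
    # Finite-difference gradient of the score w.r.t. the image.
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (clip_score(x + d) - clip_score(x - d)) / (2 * eps)
    return g

x = rng.normal(size=16)             # start from pure noise
x0 = x.copy()
for t in range(50, 0, -1):
    x = x - unet(x, t)              # remove the predicted noise
    x = x + 0.05 * clip_grad(x)     # nudge toward a higher CLIP score

print(clip_score(x) > clip_score(x0))
```

Nothing here "draws": the image only improves because the score's gradient happens to point somewhere CLIP likes, which is exactly the hack being described.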
@kmeisthax Indeed it’s not like the conscious inspiration of having a reference. But IMO there is a valid analogy to how the brain unconsciously builds up a concept of ‘what an <insert object> looks like’, based on every time it’s seen that object, in art or (unlike models) in real life.