Ok, you know how I've been banging my head against an inscrutable bug setting up #PDDiffusion for the past week or so?
The fix? There's an on-by-default option in TrainingArguments called remove_unused_columns. It drops every dataset column whose name doesn't match an argument of the model's forward().
Except I'm using Dataset transforms to rename all the columns to the names the model wants, and those run AFTER the Trainer decides to delete ALL THE DATA in the dataset!
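
For anyone who wants to see the ordering problem in miniature, here's a rough sketch. It is a plain-Python stand-in, not the actual Trainer code: ToyModel, drop_unused_columns, rename_transform, and the column names "caption" and "image" are all made up for illustration. The only thing assumed about the real library is the behavior described above: columns get matched against forward()'s argument names before the lazy transform ever runs.

```python
import inspect

# Hypothetical stand-in for a model: only forward()'s argument names matter here.
class ToyModel:
    def forward(self, input_ids, pixel_values):
        return None

def drop_unused_columns(rows, model):
    """Keep only the columns whose names match forward()'s parameters."""
    wanted = set(inspect.signature(model.forward).parameters)
    return [{k: v for k, v in row.items() if k in wanted} for row in rows]

def rename_transform(row):
    """Lazy transform: maps raw column names to the names the model wants."""
    mapping = {"caption": "input_ids", "image": "pixel_values"}  # assumed names
    return {mapping[k]: v for k, v in row.items() if k in mapping}

raw_rows = [{"caption": [1, 2, 3], "image": [0.5, 0.25]}]

# Default order: columns are dropped first, then the transform runs on what is left.
pruned_first = [rename_transform(r) for r in drop_unused_columns(raw_rows, ToyModel())]

# With column removal skipped, the transform still sees the raw columns.
kept = [rename_transform(r) for r in raw_rows]

print(pruned_first)  # every row comes out empty
print(kept)          # rows carry input_ids and pixel_values
```

In the real setup, the equivalent of skipping the drop step is passing remove_unused_columns=False to TrainingArguments.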
Next problem: the CLIP tokenizer is refusing to pad things even though it knows the maximum length and the padding token, so torch.tensor shits its pants on the uneven-length lists.
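
A guess at one possible cause, not a diagnosis: Hugging Face tokenizers don't pad unless the call asks for it (the padding argument defaults to off), so knowing the max length and the pad token isn't enough by itself. Passing padding="max_length" along with truncation=True is the usual way to get fixed-length output. The sketch below is plain Python showing what that padding amounts to and why ragged lists can't become a tensor; pad_to_max_length is a hypothetical helper, and the token IDs and pad ID are made up, not CLIP's real values.

```python
def pad_to_max_length(batch, max_length, pad_id):
    """Truncate or right-pad every token list to exactly max_length."""
    padded = []
    for ids in batch:
        ids = ids[:max_length]  # truncate anything too long
        padded.append(ids + [pad_id] * (max_length - len(ids)))
    return padded

# Made-up token IDs; 0 is a placeholder pad ID, not CLIP's real one.
ragged = [[5, 6, 7], [5, 6], [5, 6, 7, 8, 9, 10]]

# Rows of different lengths have no rectangular shape, which is what a tensor needs.
print(sorted({len(row) for row in ragged}))

padded = pad_to_max_length(ragged, max_length=4, pad_id=0)

# Every row now has the same length, so the batch is rectangular.
print(padded)
```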