Ok, you know how I've been banging my head against an inscrutable bug setting up #PDDiffusion for the past week or so?
The fix? Turn off remove_unused_columns, an on-by-default option in TrainingArguments. It deletes any dataset columns the model doesn't know about.
Except I'm using Dataset transforms to rename all the columns to the names the model wants, and those run AFTER the Trainer has already decided to delete ALL THE DATA in the dataset!
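
For anyone who wants to see the ordering problem in miniature, here's a rough stdlib-only sketch. It is not the actual Trainer code: forward, rename_transform, drop_unused_columns and the column names are all made up for illustration, and I'm assuming the pruning step works by matching column names against the model's forward() parameters.

```python
import inspect

# Hypothetical stand-in for a model's forward(); parameter names are illustrative.
def forward(pixel_values, input_ids):
    return len(pixel_values), len(input_ids)

# Raw dataset rows use different column names than forward() expects.
raw_rows = [{"image": [0.1, 0.2], "caption": [101, 102, 103]}]

def rename_transform(row):
    # Lazy transform: maps the raw column names to the names forward() wants.
    return {"pixel_values": row["image"], "input_ids": row["caption"]}

def drop_unused_columns(rows, fn):
    # Keep only the columns whose names match a parameter of fn.
    wanted = set(inspect.signature(fn).parameters)
    return [{k: v for k, v in row.items() if k in wanted} for row in rows]

# Pruning runs first, against the raw names, so no column survives.
pruned = drop_unused_columns(raw_rows, forward)
print(pruned)  # [{}]

# With pruning skipped, the transform sees the full row and forward() works.
batch = rename_transform(raw_rows[0])
print(forward(**batch))  # (2, 3)
```

In the real setup the equivalent of skipping the pruning step is passing remove_unused_columns=False to TrainingArguments, so the transform still has the original columns to work with.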