Oh god this is fascinating. Is there some domain of perception that is completely inaccessible to us? God, find out what the features look like already!!!
I feel like this parallels the universal adversarial triggers in NLP models. Those are effective because they exploit low-level features of the dataset the model is trained on. I wonder how you could apply "noise" to the input of an NLP model to reduce that low-level feature dependence... perhaps by substituting words with close synonyms?
A token in a sentence is more analogous to a pixel in an image. Adding noise could mean inserting or swapping random words in a way that doesn't mislead a human but does mislead the model.
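A rough sketch of what that synonym-substitution "noise" could look like. The synonym table below is a toy stand-in (in a real experiment you'd pull neighbors from WordNet or an embedding space), so the entries and function names here are illustrative, not from any particular library:

```python
import random

# Toy synonym table -- stands in for WordNet synsets or embedding neighbors.
SYNONYMS = {
    "quick": ["fast", "rapid", "speedy"],
    "happy": ["glad", "cheerful", "content"],
    "big": ["large", "huge", "sizable"],
}

def synonym_noise(tokens, p=0.3, rng=None):
    """Replace each token with a random close synonym with probability p.

    The idea: perturb surface form (the low-level features a trigger
    might exploit) while roughly preserving the meaning a human reads.
    """
    rng = rng or random.Random(0)
    out = []
    for tok in tokens:
        if tok.lower() in SYNONYMS and rng.random() < p:
            out.append(rng.choice(SYNONYMS[tok.lower()]))
        else:
            out.append(tok)
    return out

print(synonym_noise("the quick dog is happy".split(), p=1.0))
```

With p=1.0 every known word gets swapped; a human still reads the same sentence, but a model keying on exact token identities sees a different input.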
There was a paper from Ilyas et al. out of MIT ("Adversarial Examples Are Not Bugs, They Are Features") proposing that adversarial examples come from well-generalizing features in the datasets. They call these "non-robust" features: highly predictive but brittle, and not what humans would pick up on.
16:55 the comment is valid; the second model just learned to imitate the first one. The fact that the classifier architecture is slightly different is irrelevant.