The "projection layer" is not an architecture, but a job description. Any module that performs the job of a projection is a "projection layer". simCLR is an abstract framework for self-supervised contrastive learning. It consists of the following components: 1. data augmentation: turning data points into data point pairs (or triples, or n-tuples), to be used for contrastive learning. 2. working layer: a module for turning data points into general representations 3. projection layer: a module for turning general representation into specific representation adapted to specific purposes. 4. student network: a different network for distilling the teacher network. In the paper, simCLRv2 is concretely instantiated as the following: 1. data augmentation: random cropping, color distortion, and gaussian blur 2. working layer: ResNet-152 3. projection layer: 3-layered MLP 4. student network: ResNet but smaller than ResNet-152 The idea of a projection layer is to allow the working layer to focus on learning the general representation, instead of learning both a general representation AND the specific task in self-supervised training. Even self-supervised training is not a general task; it is specific! As they said in simCLRv1 paper. > We conjecture that the importance of using the representation before the nonlinear projection is due to loss of information induced by the contrastive loss. In particular, z = g(h) is trained to be invariant to data transformation. Thus, g can remove information that may be useful for the downstream task, such as the color or orientation of objects. By leveraging the nonlinear transformation g(·), more information can be formed and maintained in h. This is similar to how, in iGPT (2020), the authors found that linear probing works best in the middle. Probably because in the middle, the Transformer has fully understood the image, and would then start to focus back to the next pixel. 
Imagine its attention as a spindle, starting local, then global, finally local again.
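For concreteness, here is a minimal NumPy sketch of that flow: a "working layer" f producing the general representation h, a projection g producing z, and the NT-Xent contrastive loss attached to z only. Everything here is a made-up stand-in (tiny random linear maps, noise instead of real augmentation); the real components are the ResNet, the MLP, and the augmentations listed above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: in SimCLRv2 the working layer f is a ResNet-152
# and the projection g is a 3-layer MLP; here both are tiny random maps
# just to show where the contrastive loss attaches.
W_f = rng.normal(size=(128, 32)) / np.sqrt(128)  # "working layer": input -> h
W_g = rng.normal(size=(32, 16)) / np.sqrt(32)    # "projection layer": h -> z

def encode(x):
    return np.tanh(x @ W_f)        # h: the general representation, kept downstream

def project(h):
    return h @ W_g                 # z: used ONLY by the contrastive loss

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent loss over a batch of positive pairs (z1[i], z2[i])."""
    z = np.concatenate([z1, z2])                       # 2N x d
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarities
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    n = len(z1)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # each row's positive
    log_prob = sim[np.arange(2 * n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()

# Two "augmented views" of the same batch (augmentation faked with noise here).
x = rng.normal(size=(8, 128))
z1 = project(encode(x + 0.1 * rng.normal(size=x.shape)))
z2 = project(encode(x + 0.1 * rng.normal(size=x.shape)))
print(nt_xent(z1, z2))
```

After pre-training, z is thrown away and the downstream task reads h, which is exactly the point of the quoted conjecture.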
I think there are two reasons to make a big deal out of that extra projection layer. 1. It's not standard practice, so their comparisons with previous methods aren't fully fair, and adding it might improve the other methods as well. 2. The last layer of ResNet-50 is CNN -> activation -> global average pooling, so it's kind of different from regular models with only a single linear layer on top of the CNN.
There usually isn't an activation in front of GAP, I think, at least. But yeah, it's basically not just a stacked matrix multiplication (which would be equivalent to just using a wider layer) because of GAP. And it's pretty obvious why it works better: they're basically bringing back the fully connected layer that was commonplace before GAP replaced it for most cases, so there simply is more representational power. We shouldn't forget that a fully connected layer has orders of magnitude more weights than a conv layer (depending on the number of filters, but let's keep that reasonable). I'd bet it wouldn't matter if they just replaced GAP with a regular fully connected layer.
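For scale, a quick back-of-the-envelope weight comparison (the shapes are assumed ResNet-50-style numbers, not taken from the paper):

```python
# Back-of-the-envelope weight counts, assuming a 7x7x2048 final
# feature map and a 1000-class output (ResNet-50-style shapes).
C, H, W, classes = 2048, 7, 7, 1000

gap_head = C * classes              # GAP -> single linear layer
fc_head  = (C * H * W) * classes    # flatten -> fully connected layer
conv3x3  = 3 * 3 * C * C            # one extra 3x3 conv with 2048 filters, for scale

print(f"GAP + linear  : {gap_head:,}")      # 2,048,000
print(f"flatten + FC  : {fc_head:,}")       # 100,352,000
print(f"extra 3x3 conv: {conv3x3:,}")       # 37,748,736
```

So flattening into a fully connected head costs 49x the weights of the GAP head, and even dwarfs an extra large conv layer.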
Essentially, they pre-train with contrastive learning and fine-tune, then do pseudo-labeling (but using the full probability distribution over the labels) and retrain on that.
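A tiny sketch of that last step, showing the difference between soft pseudo-labels and ordinary hard pseudo-labels (all logits are made up for illustration):

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

# Made-up logits for one unlabeled example:
teacher_logits = np.array([2.0, 1.0, 0.1])   # fine-tuned teacher
student_logits = np.array([1.5, 1.2, 0.3])   # smaller student

p_teacher = softmax(teacher_logits)          # soft targets: a full distribution
p_student = softmax(student_logits)

# Distillation: cross-entropy against the teacher's full distribution...
distill_loss = -(p_teacher * np.log(p_student)).sum()
# ...versus hard pseudo-labeling, which keeps only the teacher's argmax:
hard_loss = -np.log(p_student[np.argmax(teacher_logits)])

print(distill_loss, hard_loss)
```

The soft version preserves the teacher's uncertainty over the non-argmax classes instead of discarding it.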
It's been a while since I laughed while listening to something technical :D Excellent review, and I appreciate the funny commentary, as I had similar questions.
You are so good at explaining things in a non-mathematical way, which helps me grasp the insights very quickly. I feel like I've gained so much knowledge just watching your videos. Thank you so much, keep posting. Could you make a video about SIREN?
Self-distillation is bootstrapping / self-play in RL. The recent BYOL paper also uses bootstrapping, to do away with the negative samples altogether. I guess the reason these self-play or distillation methods work is the initial inductive bias in the random initialization + architecture. If you can't bootstrap learning from the initial inductive biases, no learning is possible. And because we know learning is possible, even from zero labels, as long as the inductive biases and procedure are correct, then bootstrapping / self-distillation / self-play must work.
Many thanks @Yannik for the video, you did a great job. On the intuition behind the Figure 1 plots and why they look the way they do, here are my 2 cents: you just have to think in terms of the percentage of trainable parameters for the downstream task. To elaborate: first, keep in mind that growing the model size means growing only the encoder, while the size of the classification (linear) head remains constant. Now, since in fine-tuning only the head parameters are trained, as you grow the self-supervised encoder, the ratio of trainable parameters (those of the head) shrinks relative to the total model parameters. Therefore a downstream task with fewer labels benefits more from this drop in the percentage of trainable parameters (as the encoder grows) than its counterparts with more labels. I think this is the intuition behind the observed larger gains. In other words, the fewer the labels, the more expressive an encoder is required, to capture as much information about the structure and geometry of the (unlabeled) data as possible and compensate for the shortage of labels.
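The shrinking-ratio argument in numbers (all parameter counts here are illustrative round figures, not the paper's):

```python
# Illustrative only: a fixed-size linear head on top of encoders of
# growing size, following the argument above.
head = 2_000_000                                   # assumed-constant classification head
encoders = [25_000_000, 60_000_000, 800_000_000]   # ResNet-50-ish up to "very large"
ratios = [head / (head + e) for e in encoders]
for e, r in zip(encoders, ratios):
    print(f"encoder {e:>11,}: trainable head fraction {r:.2%}")
```

The fraction of trainable parameters falls steadily as the encoder grows, which is the quantity the argument hinges on.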
Great paper. Definitely the quality you expect from Hinton. Fun fact: his great-great-grandfather was George Boole (of Boolean algebra). 21:20 I think it should be noted that ResNet-50 probably went through some extensive hyperparameter tuning to do exactly what it was supposed to do, and thus had a fixed number of dense layers at the end. So perhaps adding a new layer just happens to help in the problem we are trying to solve, i.e. the teacher-student thing instead of one-hot labels. 18:43 The whistling in the background. Is someone snoring?
Wow, didn't know Hinton had royal blood :D Yea, I agree this extra layer is super problem-specific, but I don't get why they don't just say the encoder is now bigger, and instead claim that this is part of the projection head. And no, I have no clue what that noise is O.o
Regarding the broader impact statement: while I generally agree that many broader impact statements appear not to be useful, I do think the "where it is more expensive or difficult to label additional data than to train larger models" point, along with the example of needing clinicians to carefully create annotations for medical applications, was probably worth saying. That part points to a specific area in which this improvement is useful. Of course, it would still be interesting even if it couldn't be used for anything, but I do think that detail is worthy of note. I imagine (with no real justification) that they mentioned crop yield because they felt obligated to include at least one negative example but wanted the positive examples to outnumber the negative ones, so they needed a second one. Another beneficial use case where labeled data is especially and demonstrably expensive or difficult to get might have been better than the part about food, but eh.
Yea, it's kind of like a job interview where they ask you about your weaknesses and you want to say something that's so minor it's almost irrelevant :D Jokes aside, it's actually awesome that you don't have to collect as many labels. But that doesn't belong in the broader impact section, at least not as it is defined by NeurIPS, because it still deals with the field of ML. In the BI section, you're supposed to preview how your method will influence greater society.
Very interesting hypothesis about why a bigger model provides better improvements through self-supervised learning. However, I would caution that bigger models do not necessarily mean more learned features: for instance, suppose you use a giant model whose last layer is 1-dimensional. In fact, the dimensionality of the feature space does not depend on the size of the model at all, but on the dimensionality of the output layer.
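A toy illustration of that point, with arbitrary shapes: a much bigger network can still emit a much smaller representation, because only the last layer's width sets the feature dimensionality.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, widths):
    """Random tanh MLP; each entry of `widths` is one layer's output size."""
    for w_out in widths:
        W = rng.normal(size=(x.shape[-1], w_out))
        x = np.tanh(x @ W)
    return x

x = rng.normal(size=(4, 64))                 # a batch of 4 inputs
big_narrow = mlp(x, [1024, 1024, 1024, 1])   # big model, 1-d features
small_wide = mlp(x, [32, 256])               # tiny model, 256-d features

print(big_narrow.shape, small_wide.shape)    # (4, 1) (4, 256)
```

The many-parameter model ends up with a 1-dimensional feature space; the tiny one with 256 dimensions.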
Hi Yannic, this set of ideas feels like gold. This is how 'we' as humans learn. Children are allowed to experience the world with as few 'adult-labels' as possible.. "to get a feel" of the world.. we then come along, explain things.. they kinda memorise what you said, but then years later come back going, "Ahh, now I get it on my terms..". So perhaps the warning for future abuse of this technique is somewhat valid. Can we now make a form of "accumulative consciousness scheme" (over all time) that could then be queried & labelled in the future. Retroactively plucking the knowledge after you become aware of the concept-label. ..this could be quite far reaching.
Hmm, how about teaching such a system as described, then [later] giving it an inference-based connection to the internet (let it search) and letting it figure out the labels later? Going off on a tangent, but I wonder if there has been much research into clustering multiple image classifiers & NLP transformers into a label-acquisition learning scheme?
This form of learning (minimal-labelling combined with jittered-input forms) gives the network time to breathe. A bad teacher barks the answer. I like it a lot.
Super interesting suggestions, I think what you're describing really goes into the direction of AGI where the system sort-of learns to reflect on what it knows and how it can learn the things it doesn't!
In this one video, I feel like I've learned so many different insights. I'm still trying to level up to where I understand the math notation as easily and clearly as Mr. Kilcher, but the insights here are amazing. If I could suggest a paper/video to explain: SIREN: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-Q2fLWGBeaiI.html Paper: arxiv.org/abs/2006.09661 The video does a decent job of explaining the concept and application. I'm more interested in your opinion on the impact this could have on the rest of the field by replacing ReLU and others with SIREN. As always, thank you for your work.
So the only novel idea in this paper is just adding the self-training or distillation part, right? I wonder how we never thought of it before for unlabelled data, given it seems so obvious, especially after realising the benefits of label smoothing and the mix-up technique.
Great video, thanks for making this so digestible. I am curious what the long-term goal is here; it feels like we are piling on hack after hack to improve small percentage points. I understand that the overall goal of transitioning to semi-supervised learning is important, but so far it feels very incremental.
Thanks Yannic, this is great! I wonder, are you aware of any approach for domains where augmentation, or at least most of it, is not available? The best I remember is the ablation study on augmentations from... I forget which paper; it might have been v1 of this one. In my domain, most augmentations other than random crop invalidate the image completely (they are physics simulations). I wonder if anyone has tested whether the SSL approach still helps in these cases.
I still don't get the self-distillation part. If the teacher and the student are the same network, then they produce the same outputs, so what is there to even learn? In this case, the student doesn't have the additional projection layer, so at least the networks aren't identical (though I still don't understand what there is to learn). But Kilcher made it look like it would be useful even if the networks were the same.
Do you have more information or intuition on self-distillation? Why does distilling the same model/architecture on unlabeled data into an identical architecture improve the student over the teacher?
So SSL gives us access to a sort of large feature space, and distillation filters which of those features are important for the task at hand? I wonder if there is an experiment without distillation, to see if that extra noise in the feature space hurts (so fine-tune and predict without the student). Okay, I'll stop being lazy and check!
You mentioned that it's not well understood why we get a better model after distillation. Let me push this question further: if that's the case, why can't we now take the student and treat it as a new teacher, to obtain an even better student? That doesn't make too much sense, does it?
One thing confuses me about distillation/self-supervised learning: some methods enhance the pseudo-label, some use confidence thresholds, some use augmentations for the student input, but this paper doesn't do any of those?
It uses augmentation. From the paper: "SimCLR learns representations by maximizing agreement [26] between differently augmented views of the same data example via a contrastive loss in the latent space. (...) We use the same set of simple augmentations as SimCLR."
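For reference, those augmentations sketched in plain NumPy on an HxWx3 float image (in practice you'd use a vision library; the parameter values here are arbitrary, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop(img, size):
    """Crop a random size x size window from an HxWx3 image."""
    h, w, _ = img.shape
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size]

def color_distort(img, strength=0.5):
    """Crude stand-in for color jitter: random per-channel scaling."""
    scale = 1 + strength * rng.uniform(-1, 1, size=3)
    return np.clip(img * scale, 0.0, 1.0)

def gaussian_blur(img, sigma=1.0, radius=2):
    """Separable Gaussian blur via two 1-D convolution passes."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    k /= k.sum()
    for axis in (0, 1):
        img = np.apply_along_axis(
            lambda v: np.convolve(v, k, mode="same"), axis, img)
    return img

img = rng.uniform(size=(32, 32, 3))
view1 = gaussian_blur(color_distort(random_crop(img, 24)))
view2 = gaussian_blur(color_distort(random_crop(img, 24)))
print(view1.shape, view2.shape)   # two differently augmented 24x24 views
```

The two views of the same image are what the contrastive loss then pulls together.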
21:20 I think the representations do pass through a non-linearity; there's a sigma there. But anyway, the notation is frankly more complicated than it needed to be.
Lol. The same busybodies and morality police sticking their noses into open source communities, renaming NIPS, etc. Why does no one say No to these humorless twats and control freaks?