The "projection layer" is not an architecture, but a job description. Any module that performs the job of a projection is a "projection layer". simCLR is an abstract framework for self-supervised contrastive learning. It consists of the following components: 1. data augmentation: turning data points into data point pairs (or triples, or n-tuples), to be used for contrastive learning. 2. working layer: a module for turning data points into general representations 3. projection layer: a module for turning general representation into specific representation adapted to specific purposes. 4. student network: a different network for distilling the teacher network. In the paper, simCLRv2 is concretely instantiated as the following: 1. data augmentation: random cropping, color distortion, and gaussian blur 2. working layer: ResNet-152 3. projection layer: 3-layered MLP 4. student network: ResNet but smaller than ResNet-152 The idea of a projection layer is to allow the working layer to focus on learning the general representation, instead of learning both a general representation AND the specific task in self-supervised training. Even self-supervised training is not a general task; it is specific! As they said in simCLRv1 paper. > We conjecture that the importance of using the representation before the nonlinear projection is due to loss of information induced by the contrastive loss. In particular, z = g(h) is trained to be invariant to data transformation. Thus, g can remove information that may be useful for the downstream task, such as the color or orientation of objects. By leveraging the nonlinear transformation g(·), more information can be formed and maintained in h. This is similar to how, in iGPT (2020), the authors found that linear probing works best in the middle. Probably because in the middle, the Transformer has fully understood the image, and would then start to focus back to the next pixel. 
Imagine its attention as a spindle, starting local, then global, finally local again.
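For concreteness, here is a minimal NumPy sketch of that flow: a "working layer" f producing the general representation h, a projection g producing z, and the NT-Xent contrastive loss attached to z only. Everything here is a made-up stand-in (tiny random linear maps, noise instead of real augmentation); the real components are the ResNet, the MLP, and the augmentations listed above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: in SimCLRv2 the working layer f is a ResNet-152
# and the projection g is a 3-layer MLP; here both are tiny random maps
# just to show where the contrastive loss attaches.
W_f = rng.normal(size=(128, 32)) / np.sqrt(128)  # "working layer": input -> h
W_g = rng.normal(size=(32, 16)) / np.sqrt(32)    # "projection layer": h -> z

def encode(x):
    return np.tanh(x @ W_f)        # h: the general representation, kept downstream

def project(h):
    return h @ W_g                 # z: used ONLY by the contrastive loss

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent loss over a batch of positive pairs (z1[i], z2[i])."""
    z = np.concatenate([z1, z2])                       # 2N x d
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarities
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    n = len(z1)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # each row's positive
    log_prob = sim[np.arange(2 * n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()

# Two "augmented views" of the same batch (augmentation faked with noise here).
x = rng.normal(size=(8, 128))
z1 = project(encode(x + 0.1 * rng.normal(size=x.shape)))
z2 = project(encode(x + 0.1 * rng.normal(size=x.shape)))
print(nt_xent(z1, z2))
```

After pre-training, z is thrown away and the downstream task reads h, which is exactly the point of the quoted conjecture.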
I think there are two reasons to make a big deal out of that extra projection layer. 1. It's not standard practice, so their comparisons with previous methods aren't fully fair, and adding it might improve the other methods as well. 2. The last layer of ResNet-50 is CNN -> activation -> global average pooling, so it's kind of different from regular models with only a single linear layer on top of the CNN.
There usually isn't an activation in front of GAP, I think, at least. But yeah, it's basically not just a stacked matrix multiplication (which would be equivalent to just using a wider layer) because of GAP. And it's pretty obvious why it works better: they're basically bringing back the fully connected layer that was commonplace before GAP replaced it for most cases, so there simply is more representational power. We shouldn't forget that a fully connected layer has orders of magnitude more weights than a conv layer (depending on the number of filters, but let's keep that reasonable). I'd bet it wouldn't matter if they just replaced GAP with a regular fully connected layer.
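For scale, a quick back-of-the-envelope weight comparison (the shapes are assumed ResNet-50-style numbers, not taken from the paper):

```python
# Back-of-the-envelope weight counts, assuming a 7x7x2048 final
# feature map and a 1000-class output (ResNet-50-style shapes).
C, H, W, classes = 2048, 7, 7, 1000

gap_head = C * classes              # GAP -> single linear layer
fc_head  = (C * H * W) * classes    # flatten -> fully connected layer
conv3x3  = 3 * 3 * C * C            # one extra 3x3 conv with 2048 filters, for scale

print(f"GAP + linear  : {gap_head:,}")      # 2,048,000
print(f"flatten + FC  : {fc_head:,}")       # 100,352,000
print(f"extra 3x3 conv: {conv3x3:,}")       # 37,748,736
```

So flattening into a fully connected head costs 49x the weights of the GAP head, and even dwarfs an extra large conv layer.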
Essentially, they pre-train with contrastive learning and fine-tune, then do pseudo-labeling (but using the full probability distribution over the labels) and retrain on that.
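A tiny sketch of that last step, showing the difference between soft pseudo-labels and ordinary hard pseudo-labels (all logits are made up for illustration):

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

# Made-up logits for one unlabeled example:
teacher_logits = np.array([2.0, 1.0, 0.1])   # fine-tuned teacher
student_logits = np.array([1.5, 1.2, 0.3])   # smaller student

p_teacher = softmax(teacher_logits)          # soft targets: a full distribution
p_student = softmax(student_logits)

# Distillation: cross-entropy against the teacher's full distribution...
distill_loss = -(p_teacher * np.log(p_student)).sum()
# ...versus hard pseudo-labeling, which keeps only the teacher's argmax:
hard_loss = -np.log(p_student[np.argmax(teacher_logits)])

print(distill_loss, hard_loss)
```

The soft version preserves the teacher's uncertainty over the non-argmax classes instead of discarding it.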
It's been a while since I laughed while listening to something technical :D Excellent review, and I appreciate the funny commentary, as I had similar questions.
You are so good at explaining things in a non-mathematical way, which helps me grasp the insights very quickly. I feel like I've gained so much knowledge just watching your videos. Thank you so much, keep posting. Could you make a video about SIREN?
Self-distillation is bootstrapping / self-play in RL. The recent BYOL paper also uses bootstrapping, to do away with the negative samples altogether. I guess the reason these self-play or distillation methods work is the initial inductive bias in the random initialization + architecture. If you can't bootstrap learning from the initial inductive biases, no learning is possible. And because we know learning is possible, even from zero labels, as long as the inductive biases and procedure are correct, then bootstrapping / self-distillation / self-play must work.
Many thanks @Yannik for the video, you did a great job. On the intuition behind the Figure 1 plots and why they look the way they do, here are my 2 cents: you just have to think in terms of the percentage of trainable parameters for the downstream task. To elaborate: first, keep in mind that growing the model size means growing only the encoder, while the size of the classification (linear) head remains constant. Now, since in fine-tuning only the head parameters are trained, as you grow the self-supervised encoder, the ratio of trainable parameters (those of the head) shrinks relative to the total model parameters. Therefore a downstream task with fewer labels benefits more from this drop in the percentage of trainable parameters (as the encoder grows) than its counterparts with more labels. I think this is the intuition behind the observed larger gains. In other words, the fewer the labels, the more expressive an encoder is required, to capture as much information about the structure and geometry of the (unlabeled) data as possible and compensate for the shortage of labels.
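The shrinking-ratio argument in numbers (all parameter counts here are illustrative round figures, not the paper's):

```python
# Illustrative only: a fixed-size linear head on top of encoders of
# growing size, following the argument above.
head = 2_000_000                                   # assumed-constant classification head
encoders = [25_000_000, 60_000_000, 800_000_000]   # ResNet-50-ish up to "very large"
ratios = [head / (head + e) for e in encoders]
for e, r in zip(encoders, ratios):
    print(f"encoder {e:>11,}: trainable head fraction {r:.2%}")
```

The fraction of trainable parameters falls steadily as the encoder grows, which is the quantity the argument hinges on.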
Great paper. Definitely the quality you expect from Hinton. Fun fact: his great-great-grandfather was George Boole (of Boolean algebra). 21:20 I think it should be noted that ResNet-50 probably went through some extensive hyperparameter tuning to do exactly what it was supposed to do, and thus had a fixed number of dense layers at the end. So perhaps adding a new layer just happens to help in the problem we are trying to solve, i.e. the teacher-student thing instead of one-hot labels. 18:43 The whistling in the background. Is someone snoring?
Wow, didn't know Hinton had royal blood :D Yea, I agree this extra layer is super problem-specific, but I don't get why they don't just say the encoder is now bigger, and instead claim that this is part of the projection head. And no, I have no clue what that noise is O.o
Regarding the broader impact statement: while I generally agree that many broader impact statements appear not to be useful, I do think the "where it is more expensive or difficult to label additional data than to train larger models" point, along with the example of needing clinicians to carefully create annotations for medical applications, was probably worth saying. That part points to a specific area in which this improvement is useful. Of course, it would still be interesting even if it couldn't be used for anything, but I do think that detail is worthy of note. I imagine (with no real justification) that they mentioned crop yield because they felt obligated to include at least one negative example but wanted the positive examples to outnumber the negative ones, so they needed a second one. Another beneficial use case where labeled data is especially and demonstrably expensive or difficult to get might have been better than the part about food, but eh.
Yea, it's kind of like a job interview where they ask you about your weaknesses and you want to say something that's so minor it's almost irrelevant :D Jokes aside, it's actually awesome that you don't have to collect as many labels. But that doesn't belong in the broader impact section, at least not as it is defined by NeurIPS, because it still deals with the field of ML. In the BI section, you're supposed to preview how your method will influence greater society.
Very interesting hypothesis about why a bigger model provides better improvements through self-supervised learning. However, I would caution that bigger models do not necessarily mean more learned features: for instance, suppose you use a giant model whose last layer is 1-dimensional. In fact, the dimensionality of the feature space does not depend on the size of the model at all, but on the dimensionality of the output layer.
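A toy illustration of that point, with arbitrary shapes: a much bigger network can still emit a much smaller representation, because only the last layer's width sets the feature dimensionality.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, widths):
    """Random tanh MLP; each entry of `widths` is one layer's output size."""
    for w_out in widths:
        W = rng.normal(size=(x.shape[-1], w_out))
        x = np.tanh(x @ W)
    return x

x = rng.normal(size=(4, 64))                 # a batch of 4 inputs
big_narrow = mlp(x, [1024, 1024, 1024, 1])   # big model, 1-d features
small_wide = mlp(x, [32, 256])               # tiny model, 256-d features

print(big_narrow.shape, small_wide.shape)    # (4, 1) (4, 256)
```

The many-parameter model ends up with a 1-dimensional feature space; the tiny one with 256 dimensions.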
Hi Yannic, this set of ideas feels like gold. This is how 'we' as humans learn. Children are allowed to experience the world with as few 'adult-labels' as possible.. "to get a feel" of the world.. we then come along, explain things.. they kinda memorise what you said, but then years later come back going, "Ahh, now I get it on my terms..". So perhaps the warning for future abuse of this technique is somewhat valid. Can we now make a form of "accumulative consciousness scheme" (over all time) that could then be queried & labelled in the future. Retroactively plucking the knowledge after you become aware of the concept-label. ..this could be quite far reaching.
Hmm, how about teaching such a system as described, then [later] giving it an inference-based connection to the internet (let it search) and letting it figure out the labels later? Going off on a tangent, but I wonder if there has been much research into clustering multiple image classifiers & NLP transformers into a label-acquisition learning scheme?
This form of learning (minimal-labelling combined with jittered-input forms) gives the network time to breathe. A bad teacher barks the answer. I like it a lot.
Super interesting suggestions, I think what you're describing really goes into the direction of AGI where the system sort-of learns to reflect on what it knows and how it can learn the things it doesn't!
In this one video, I feel like I've learned so many different insights. I'm still trying to level up to where I understand the math notation as easily and clearly as Mr. Kilcher, but the insights here are amazing. If I could suggest a paper/video to explain: SIREN: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-Q2fLWGBeaiI.html Paper: arxiv.org/abs/2006.09661 The video does a decent job of explaining the concept and application. I'm more interested in your opinion on the impact this could have on the rest of the field by replacing ReLU and others with SIREN. As always, thank you for your work.
So the only novel idea in this paper is just adding the self-training or distillation part, right? I wonder how we never thought of it before for unlabelled data, given it seems so obvious, especially after realising the benefits of label smoothing and the mix-up technique.
Great video, thanks for making this so digestible. I am curious what the long-term goal is here; it feels like we are piling on hack after hack to improve small percentage points. I understand that the overall goal of transitioning to semi-supervised learning is important, but so far it feels very incremental.
Thanks Yannic, this is great! I wonder, are you aware of any approach for domains where augmentation, or at least most of it, is not available? The best I remember is the ablation study on augmentations from... I forget which paper; it might have been v1 of this one. In my domain, most augmentations other than random crop invalidate the image completely (they are physics simulations). I wonder if anyone has tested whether the SSL approach still helps in these cases.
I still don't get the self-distillation part. If the teacher and the student are the same network, then they produce the same outputs, so what is there to even learn? In this case, the student doesn't have the additional projection layer, so at least the networks aren't identical (though I still don't understand what there is to learn). But Kilcher made it look like it would be useful even if the networks were the same.
Do you have more information or intuition on self-distillation? Why does distilling the same model/architecture on unlabeled data into an identical architecture improve the student over the teacher?
So SSL gives us access to a sort of large feature space, and distillation filters which of those features are important for the task at hand? I wonder if there is an experiment without distillation, to see if that extra noise in the feature space hurts (so fine-tune and predict without the student). Okay, I'll stop being lazy and check!
You mentioned that it's not well understood why we get a better model after distillation. Let me push this question further: if that's the case, why can't we now take the student and treat it as a new teacher, to obtain an even better student? That doesn't make too much sense, does it?
One thing confuses me about distillation/self-supervised learning: some methods enhance the pseudo-label, some use confidence thresholds, some use augmentations for the student input, but this paper doesn't do any of those?
It uses augmentation. From the paper: "SimCLR learns representations by maximizing agreement [26] between differently augmented views of the same data example via a contrastive loss in the latent space. (...) We use the same set of simple augmentations as SimCLR."
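For reference, those augmentations sketched in plain NumPy on an HxWx3 float image (in practice you'd use a vision library; the parameter values here are arbitrary, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop(img, size):
    """Crop a random size x size window from an HxWx3 image."""
    h, w, _ = img.shape
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size]

def color_distort(img, strength=0.5):
    """Crude stand-in for color jitter: random per-channel scaling."""
    scale = 1 + strength * rng.uniform(-1, 1, size=3)
    return np.clip(img * scale, 0.0, 1.0)

def gaussian_blur(img, sigma=1.0, radius=2):
    """Separable Gaussian blur via two 1-D convolution passes."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    k /= k.sum()
    for axis in (0, 1):
        img = np.apply_along_axis(
            lambda v: np.convolve(v, k, mode="same"), axis, img)
    return img

img = rng.uniform(size=(32, 32, 3))
view1 = gaussian_blur(color_distort(random_crop(img, 24)))
view2 = gaussian_blur(color_distort(random_crop(img, 24)))
print(view1.shape, view2.shape)   # two differently augmented 24x24 views
```

The two views of the same image are what the contrastive loss then pulls together.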
21:20 I think the representations do pass through a non-linearity; there's a sigma there. But anyway, the notation is frankly more complicated than it needed to be.
Lol. The same busybodies and morality police sticking their noses into open source communities, renaming NIPS, etc. Why does no one say No to these humorless twats and control freaks?