Knowledge Distillation in Deep Learning - Basics 

Dingu Sagar
20K views

Published: 19 Sep 2024

Comments: 46
@mariama8157 1 year ago
Great professor. Easy and high-level explanation.
@manub.n2451 1 year ago
Thanks Sagar for a brilliant explanation of the basics of KD
@TechNewsReviews 10 months ago
The explanation looks good. However, many words are unclear because of bad sound quality. My suggestion is to use some AI-based audio enhancement tools to make the voice clearer and noise-free, then update the video. You will definitely get more views.
@dingusagar 10 months ago
Thanks for the feedback. Yes, the audio is really bad. I am planning to re-record this and upload it soon.
@gaurav230187 2 years ago
Well done, good and simple explanation.
@dhirajkumarsahu999 2 years ago
Thank you so much, you have earned my subscription
@dingusagar 2 years ago
Thanks. Will try to do more such videos.
@mariama8157 1 year ago
Thank you so much. Please make a lot of videos on machine learning.
@WALLACE9009 1 year ago
Amazing that this works at all
@GulnazShalgumbayeva-d8i 2 months ago
Excellent explanation!
@goelnikhils 1 year ago
Amazing explanation of knowledge distillation.
@lazy.researcher 1 year ago
Can you please explain the advantage of smoothing the logits with a temperature? Why can't we use the plain softmax outputs to compare the teacher and student models for the distillation loss?
@dingusagar 1 year ago
Good question. One way to think about it is this: the teacher's probabilities over the different classes are semantically rich; they capture the data distribution and the relationships between classes, as explained with the animals example in the video. But those probabilities come from a softmax that was trained to match the one-hot labels of the correct class. So even though the class probabilities from the teacher's final layer carry rich information about the data distribution, the value for the correct class ends up very high while the other classes get very low probabilities. The signal is there, but it is hard to see unless we amplify it. That is why we use softmax with temperature: it amplifies the probabilities of the remaining classes at the cost of slightly lowering the probability of the positive class (because the softmax outputs must sum to 1). This way the student can see those other probabilities more clearly and learn from them.
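To make that amplification concrete, here is a minimal NumPy sketch of softmax with temperature; the logit values are made up for illustration and are not taken from the video:

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    # Divide the logits by T before the usual softmax.
    # T = 1 is the standard softmax; T > 1 flattens the distribution,
    # making the non-target class probabilities easier for the student to see.
    z = np.asarray(logits, dtype=np.float64) / T
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Made-up teacher logits for three classes, e.g. [deer, horse, peacock].
logits = [6.0, 3.5, 0.5]
print(softmax_with_temperature(logits, T=1.0))  # ~[0.92, 0.08, 0.00]: signal hidden
print(softmax_with_temperature(logits, T=4.0))  # ~[0.56, 0.30, 0.14]: signal amplified
```

With T = 1 the correct class swamps everything else; with T = 4 the ordering is preserved but the "dark knowledge" in the other classes becomes visible to the student.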
@miriamsilverman5126 2 years ago
Great explanation!!! Thank you! I wish the sound was better... maybe you can record it again :)
@dingusagar 2 years ago
Thanks :). Sorry about the bad sound quality. Will definitely work on it next time.
@kristianmamforte4129 1 year ago
Wow, thanks for this video!
@teay5767 8 months ago
Nice video, thanks for the help.
@tranvoquang1224 1 year ago
Thank you.
@ilhamafounnas8279 3 years ago
Looking forward to more information about KD. Thank you.
@dingusagar 3 years ago
Glad to hear that. I was exploring KD in the NLP space and thought of creating a few videos around it. Let me know if there is any specific topic, in KD or in general, that you are looking forward to. If it overlaps with the things I am exploring, I would be happy to make videos around it.
@ruksharalam173 11 months ago
It would be great if you could please improve the audio.
@Jamboreeni 1 year ago
Great video! Love how you simplified it so that even a novice like me understood it 😊😊😊 If possible, please use a better mic; the sound quality on this video was a little low and foggy.
@dingusagar 1 year ago
Thanks. Glad to hear that.😊 Yes, I will definitely work on the sound quality.
@jatinsingh9062 2 years ago
Thanks!!
@shipan5940 2 years ago
I have a probably stupid question: why don't we just train the student model directly? Does having the pre-trained teacher model make the student model more accurate?
@dingusagar 2 years ago
There is no such thing as a stupid question :) Let me try to answer as per my understanding; feel free to reply with further queries. In a simplified analogy, knowledge distillation is like a real-life teacher and student. If a student tries to learn a new subject from scratch all by herself, it takes a lot of time, whereas an intelligent teacher who has already done the hard work of learning everything can pass on rich, summarised information, skipping the useless parts, and the student can learn it in less time. The trend today is that really large models trained on huge amounts of data tend to have better representational power and thus higher accuracy, which is why the big companies are in a constant race to build the next biggest model trained on ever larger datasets. In our analogy, this is like the teacher reading lots of books to really understand the subject. Since the teacher has a bigger brain (more layers), it can go through the huge datasets, learn the interesting patterns, and discard the useless ones. After this intensive learning is done, the teacher acts as a pretrained model. The output coming out of the teacher model is very rich in information (refer to 1:53), which is why a student model with a smaller brain (fewer layers) is able to consume this rich information and learn in a shorter time.
@Speedarion 2 years ago
If the final layer has a sigmoid activation function, can the output of the sigmoid be used as the input to a softmax function with temperature?
@dingusagar 2 years ago
Interesting idea. In theory we could define the loss function as you said and the training would still work, but practically I am not sure how much it would help; it is worth trying out. We are essentially doing softmax twice. Here is an article on why we shouldn't do that in a normal NN setup: jamesmccaffrey.wordpress.com/2018/03/07/why-you-shouldnt-apply-softmax-twice-to-a-neural-network/ Intuitively, applying softmax twice makes the function smoother, and that is what we are trying to achieve in the KD setup. But if the same effect can be achieved by tuning the hyperparameter T of the softmax with temperature directly on the logits, that is a simpler approach from a training perspective. Nevertheless, it is an interesting idea to explore.
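As a rough illustration of the two options being compared here, a small NumPy sketch; the logit values are invented, and this is only a sketch of the idea, not the video's implementation:

```python
import numpy as np

def softmax(x, T=1.0):
    # Temperature-scaled softmax; T = 1 is the standard softmax.
    z = np.asarray(x, dtype=np.float64) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=np.float64)))

logits = np.array([6.0, 3.5, 0.5])   # made-up final-layer logits

# Option A (the idea above): squash with sigmoid first, then softmax.
# The sigmoid outputs all lie in (0, 1), so the softmax sees a very
# narrow input range and its output ends up close to uniform.
print(softmax(sigmoid(logits)))

# Option B: temperature-scaled softmax applied directly to the logits,
# where T is a single hyperparameter controlling how soft the targets are.
print(softmax(logits, T=4.0))
```

The sigmoid-then-softmax route gives a near-uniform distribution whose softness is hard to control, whereas the temperature route exposes one knob, T, that can be tuned directly.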
@Speedarion 2 years ago
@@dingusagar Thanks for the reply. If the final layer is a fully connected layer followed by a sigmoid activation, then essentially the logits are the inputs going into the sigmoid, right? I guess to perform KD I would take these inputs and pass them to a softmax function with temperature.
@dingusagar 2 years ago
@@Speedarion Yes, you are right. Logits are what comes out of the final layer before any activation is applied.
@andreisimion1636 1 year ago
For Loss 2, don't you want to do CrossEntropy(p(1), y_true), i.e., use the probabilities from the student without temperature scaling? Also, y_true is a one-hot vector, no? It seems like Loss 2 is a cross-entropy between two one-hot vectors, so I am unsure if this is right. Am I missing something?
@dingusagar 1 year ago
Yes, correct, loss 2 is between two one-hot vectors. Cross-entropy is just defined over two distributions; it has no requirement that a distribution be one-hot or not, so what you suggested is also correct, I feel. It is just how the authors defined it initially, and different implementations can modify the loss based on what they find empirically more accurate. Having said that, I can think of one intuitive reason in favour of this approach: loss 1 already uses soft predictions, which help the models converge on learning the rich differences between image features from the teacher model, so loss 2 is restricted to just focusing on getting the classification correct, which is expressed in one-hot format.
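For anyone who wants to see the two losses side by side in code, here is a minimal PyTorch sketch of one common way to combine them. The function name, the weighting alpha, and the value of T are assumptions for illustration, not taken from the video; loss 1 is written with KL divergence, which has the same gradient with respect to the student as the cross-entropy form discussed above (they differ only by the teacher's entropy, a constant).

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, y_true, T=4.0, alpha=0.5):
    # Loss 1 (soft targets): compare the temperature-softened teacher and
    # student distributions. The T*T factor keeps gradient magnitudes
    # comparable across temperatures, as in the original KD paper.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    loss1 = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)

    # Loss 2 (hard targets): ordinary cross-entropy between the student's
    # T = 1 predictions and the ground-truth labels (passed here as class
    # indices, which is equivalent to one-hot targets).
    loss2 = F.cross_entropy(student_logits, y_true)

    return alpha * loss1 + (1.0 - alpha) * loss2

# Toy usage: a batch of 8 examples and 3 classes with random logits.
student_logits = torch.randn(8, 3)
teacher_logits = torch.randn(8, 3)
y_true = torch.randint(0, 3, (8,))
print(kd_loss(student_logits, teacher_logits, y_true))
```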
@lm_mage 1 year ago
If the second argument of CrossEntropy() is the true labels, shouldn't Loss 1 be CrossEntropy(p,q) instead of CrossEntropy(q,p)?
@ThePaintingpeter 1 year ago
Great video, but the sound could be improved.
@nayanshah6715 2 years ago
Why does Dingu sound like an Englishman? Or is it just me... Also, good content!!!
@terrortalkhorror 4 months ago
If the model has just 1s and 0s in the actual labels, then you must have mistakenly said that the model predicts with probability 0.39 that it is a horse. Instead, it should be that with 0.39 the model thinks it's the deer.
@terrortalkhorror 4 months ago
But I must say your explanation is really good.
@dingusagar 4 months ago
@@terrortalkhorror Thanks for the feedback. I am not sure I understood exactly what you pointed out. The predictions are made on the input image; the three images on the right are just for visualizing the classes. From the perspective of predicting the input image, the model thinks it is a deer, a horse, and a peacock with probabilities 0.6, 0.39, and 0.01 respectively, as mentioned in the slide. The audio quality is poor, and that could have created some confusion.
@terrortalkhorror 4 months ago
@@dingusagar Yes, you are right. I just rewatched it and now it makes sense.
@terrortalkhorror 4 months ago
I just sent a connection request on LinkedIn
@prasanthnoelpanguluri7167 1 year ago
Can you share this PPT?
@dingusagar 1 year ago
docs.google.com/presentation/d/1IkPeSGOcUSO_qyCwtrP9ZBMx-l2aBzj7FDqwPLK9Ekk/edit?usp=drivesdk
@prasanthnoelpanguluri7167 1 year ago
@@dingusagar Can you share the other file which talks about DistilBERT too?
@dingusagar 1 year ago
@@prasanthnoelpanguluri7167 docs.google.com/presentation/d/1wU1ZVkgA-qU-5kkHqe824IVxsyLQEqqojOVaNK6Afv8/edit?usp=sharing
@lm_mage 1 year ago
@@dingusagar you are a saint.
@utkarshtiwari3696 9 months ago
Please use a better mic or don't use a mic at all.