
ResNet (actually) explained in under 10 minutes 

rupert ai
6K subscribers
89K views

Want an intuitive and detailed explanation of Residual Networks? Look no further! This video is an animated guide to the paper 'Deep Residual Learning for Image Recognition', created using Manim.
Sources / credits
ResNet paper: arxiv.org/abs/1512.03385
Manim animation library: www.manim.community/
PyTorch ResNet implementation: github.com/pytorch/vision/blo...

Science

Published: 22 Oct 2022

Comments: 72
@nialperry9563 1 year ago
Cracking video, Rupert. Well animated and explained. I am already satisfied with my understanding of ResNets after this.
@devanshsharma5159 11 months ago
love the animation! Thanks for the clean and clear explanation!
@poopenfarten4222 1 year ago
legit one of the best explanations i found
@rupert_ai 1 year ago
Thanks myyy dude!
@Cypher195 1 year ago
Thanks. Been out of touch with AI for far too long so this summary is very helpful.
@rupert_ai 1 year ago
Thanks Aziz, good luck with getting back in touch with AI
@agenticmark 5 months ago
lol, I have fought that exact trendline so many times in ML :D Great humor. Great video work.
@TheBlendedTech 1 year ago
Thank you, this was well put together and very useful.
@rupert_ai 1 year ago
Thanks!
@MuhammadHamza-o3r 2 days ago
Very well explained
@Omsip123 7 months ago
I pushed it to exactly 1k likes, cause it deserves it ... and many more
@ShahidulAbir 1 year ago
Amazing explanation. Thank you for the video
@rupert_ai 1 year ago
Thank you Shahidul!
@djauschan 6 months ago
Amazing explanation of this concept. Thank you very much
@terencechengde 1 year ago
Great content! Thank you for the effort!
@rupert_ai 1 year ago
Thanks Terence! :)
@sarthakpatwari7988 11 months ago
Mark my words, if he becomes consistent, this channel will become one of the next big things in AI
@christianondo9637 5 months ago
great video, super intuitive explanation
@rezajavadzadeh5597 1 year ago
thank you so much
@rupert_ai 1 year ago
Thanks Reza!
@carolinavillamizar795 9 months ago
Thanks!!
@sergioorozco7331 6 months ago
Is the right-hand side of the addition supposed to have height and width dimensions of 32x32 at 7:08? I think there is a small typo in the visual.
@user-mg3ey1uq8f 4 months ago
It's amazing. Both ResNet and this explanation.
@datascience8775 1 year ago
Good content, just subscribed, keep sharing.
@rupert_ai 1 year ago
Thanks, will do :)
@logon2778 1 year ago
You say that the identity function is added element-wise at the end of the block. So say I have an identity [1,2] and the result of the block is [3,4]. Would the output of the layer be [4,6]? So it's not a concatenation of the identity, which would be [1,2,3,4], correct? You basically ensure the identity has the same dimensionality as the output of the block and then add them element-wise.
@rupert_ai 1 year ago
Hey Logon, great question, you are totally correct: the output from your example (identity [1,2] and block output [3,4]) would be [4,6], i.e. you simply add the values at corresponding positions. You don't concatenate! Yes, the last section on dimension matching covers the scenario where the dimensions don't match (and therefore you can't add them element-wise until you modify them).
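A quick PyTorch sketch of that distinction, using the numbers from the question above (purely illustrative):

```python
import torch

identity = torch.tensor([1.0, 2.0])    # the saved copy of the input
block_out = torch.tensor([3.0, 4.0])   # the output of the residual block

print(identity + block_out)              # tensor([4., 6.])  -- element-wise addition (what ResNet does)
print(torch.cat([identity, block_out]))  # tensor([1., 2., 3., 4.])  -- concatenation (not what ResNet does)
```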
@logon2778 1 year ago
@@rupert_ai So in the case of the 1x1 convolutions where there are 3 input channels and 6 output channels of equal size... how are they added element-wise? Are the input features added element-wise twice, once for each group of 3 output channels? Or does it only add element-wise to the first 3 output channels and leave the other 3 untouched?
@rupert_ai 1 year ago
Hi @@logon2778, as is standard with convolutional neural networks, each 1x1 convolution takes contributions from all channels (in this case all 3 channels of the input). So in order to have 6 output channels you have 6 separate 1x1 convolutions, each taking contributions from all 3 channels. In order to halve the size you skip every other pixel (i.e. a stride of 2). That is simply what is used in the original paper; obviously other approaches work too. Now you have a 6-channel output with half the height and width, which matches the network dimensions, and you can do element-wise addition as usual. Have a watch of the video again and look up convolution basics - I have a video on this actually - hopefully that might shed some light on things ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-6VP9k2WM6k0.html
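A minimal sketch of that projection step with the shapes from this thread (3 channels at 64x64 going to 6 channels at 32x32); the layer name and bias choice here are assumptions, not the exact torchvision code:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 64, 64)  # identity copy: 3 channels, 64x64

# 6 separate 1x1 kernels, each drawing on all 3 input channels; stride 2 halves H and W
projection = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=1, stride=2, bias=False)

print(projection(x).shape)  # torch.Size([1, 6, 32, 32]) -- now it can be added element-wise
```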
@logon2778 1 year ago
@@rupert_ai I understand how convolution works for the most part. At 8:45 you show that there are 6 output channels of equal size to the input. But how can you element-wise add 3 input channels to 6 output channels of equal size? In my mind you have double the dimensions: you have 6 output channels of 64x64, but 3 input channels of 64x64. So how can you element-wise multiply them?
@rupert_ai 1 year ago
@@logon2778 The section you mention discusses what must be done to the copy of the identity along the residual connection BEFORE you do element-wise addition with the output from the ResNet block. The process follows this logic: 1) save a copy of your input as the identity (e.g. 3 channels, 64x64); 2) run your input through the main block, which outputs a new tensor. This new tensor can have the same dimensions or different dimensions (e.g. 6 channels, 32x32). If it has different dimensions proceed to step 3); if it has the same dimensions proceed to step 4). 3) Take the copy of the identity from step 1) and apply 6 1x1 convolution kernels with stride 2 to it; this outputs 6 channels at 32x32. 4) Do element-wise addition of your identity and your ResNet block output. Note that if the dimensions changed, then you also changed your identity in step 3) to ensure you can do element-wise addition. Element-wise addition is simply adding each corresponding value: e.g. the value in the top-left corner of channel 2 of the first tensor is added to the value in the top-left corner of channel 2 of the second tensor. You don't do element-wise multiplication as you mention. Hope that clears it up!
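The same four steps written out as a forward pass; this is a simplified sketch (the layer composition inside `body` is an assumption), not the exact paper or torchvision block:

```python
import torch
import torch.nn as nn

class ResidualBlockSketch(nn.Module):
    def __init__(self, in_ch=3, out_ch=6, stride=2):
        super().__init__()
        # the main block: conv / batch norm / ReLU stack
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # step 3: 1x1 convolutions with stride 2, used only when dimensions change
        self.project = nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)

    def forward(self, x):
        identity = x                        # step 1: keep a copy of the input
        out = self.body(x)                  # step 2: run the main block (6 x 32 x 32 here)
        identity = self.project(identity)   # step 3: reshape the copy to 6 x 32 x 32
        return torch.relu(out + identity)   # step 4: element-wise addition, then ReLU

x = torch.randn(1, 3, 64, 64)
print(ResidualBlockSketch()(x).shape)  # torch.Size([1, 6, 32, 32])
```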
@mahneh7121 11 months ago
great video man
@Antagon666 10 months ago
Idk why, but simply adding a bicubically upscaled image to the output of a CNN with a pixel shuffle layer achieves much better results than having any number of residual blocks. Also it's much faster.
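A rough sketch of the setup this comment describes (a bicubic global skip around a small CNN that upsamples with PixelShuffle); the layer sizes are assumptions, not the commenter's exact model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySR(nn.Module):
    """2x super-resolution: small CNN + pixel shuffle, plus a bicubic skip connection."""
    def __init__(self, width=32, scale=2):
        super().__init__()
        self.scale = scale
        self.body = nn.Sequential(
            nn.Conv2d(3, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, 3 * scale * scale, 3, padding=1),
        )
        self.shuffle = nn.PixelShuffle(scale)  # (3*s*s, H, W) -> (3, s*H, s*W)

    def forward(self, x):
        skip = F.interpolate(x, scale_factor=self.scale, mode="bicubic", align_corners=False)
        return skip + self.shuffle(self.body(x))  # the CNN only has to learn the missing detail

x = torch.randn(1, 3, 64, 64)
print(TinySR()(x).shape)  # torch.Size([1, 3, 128, 128])
```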
@nxtboyIII 1 year ago
Great video, well explained, thanks!
@nxtboyIII 1 year ago
I liked the visuals too
@rupert_ai 1 year ago
@@nxtboyIII Thank you Lucas 🙏
@moosemorse1 11 months ago
Subscribed. Thank you so much
@jamesnorton4953 1 year ago
🔥
@RadenRenggala 1 year ago
Hello, is the term "residual" referring to the convolutional feature maps from the previous layer that are then added to the feature maps output by the current layer?
@rupert_ai 1 year ago
The residual is actually the 'difference' between two sets of features! In ResNets the feature maps from previous layers are added onto the current feature maps; this means the current layer can learn the 'residual' function, where it only needs to learn the difference.
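In the paper's notation: if H(x) is the mapping a block is meant to represent, the block instead learns F(x) = H(x) - x, and the output is F(x) + x. A toy numerical sketch (made-up values):

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])        # features from the previous layer
desired = torch.tensor([1.1, 2.0, 2.7])  # H(x): what the layer should output

residual = desired - x   # F(x) = H(x) - x, the (often small) correction the block learns
print(residual)          # tensor([ 0.1000,  0.0000, -0.3000])
print(x + residual)      # adding it back onto the skip connection recovers H(x)
```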
@RadenRenggala 1 year ago
@@rupert_ai So the residual is the difference between the current feature map and the previous feature map, and to obtain the residual we need to perform an addition between those feature maps? Thank you.
@tanmayvaity9437 1 year ago
nice video
@rupert_ai 1 year ago
Thanks Tanmay!
@egesener1932 1 year ago
Everyone says ResNet solves the vanishing/exploding gradient problem, but don't we already use the ReLU function instead of sigmoid to solve that? Also, section 4.1 of the paper says the plain counterpart with batch normalization doesn't cause vanishing gradients, but it still has a higher error rate when the layers are increased from 18 to 34. Can you explain that?
@rupert_ai 1 year ago
1) There are multiple things that help solve the vanishing/exploding gradient problem; residual connections in general help massively with the learning process, as they ground the learning process around the desired result, i.e. you learn the difference between what you have and the correct result (the residual). 2) Batch normalisation also helps with the vanishing/exploding gradient problem, as it allows the features of each layer to have a normalised distribution that is scaled so it won't explode/vanish, etc. 3) On your point about section 4.1: they are saying that networks without residual connections (plain) have worse error when they have more layers (18 vs 34) for the exact reason I stated in part 1) of this answer; it is a difficult optimisation problem for the network to solve without the residual, and when you add residuals, adding more layers to your network isn't penalised. Hope that makes sense!
@SakshamGupta-em2zw 1 month ago
Love the Music
@SakshamGupta-em2zw 1 month ago
And love that you used manim, keep it up
@enzogurijala5464 1 year ago
great video
@mohamed_akram1 1 year ago
Nice video. Did you use Manim?
@rupert_ai 1 year ago
Hey Mohamed! Yes I did - my first video using manim! I hope to use it for some more complex things in the future :)
@ColorfullHD 1 year ago
Hey, it's 3blue1brown! All jokes aside, great explanation, cheers
@rupert_ai 1 year ago
Hahaha well it is using his animation library ;) all hail Grant Sanderson
@JoydurnYup 1 year ago
great vid sir
@rupert_ai 1 year ago
Thanks Joydurn! :)
@wege8409 1 month ago
6:38 this is the part that really made me understand, thank you
@firefistace8569 1 year ago
What is the residual in the image classification task?
@rupert_ai 1 year ago
Good question! It can be tricky to understand what the residual might be in the image classification task, as it is more abstract than in the super-resolution task. Essentially, you use the feature maps from previous layers and learn the 'residual' between previous layers and the current layer; in essence this makes a very powerful block of computation that is grounded by the skip connections. This makes image classification easier, as the network itself can process the image in a more comprehensive way. There really isn't any 'end-to-end' residual in image classification like there is with super resolution. I hope that answers your question.
@firefistace8569 1 year ago
@@rupert_ai Thanks!
@januarchristie615 1 year ago
Hello, I apologize for my question, but I still don't quite understand why learning residuals improves model predictions. Thank you
@giovannyencinia9239 1 year ago
I think that is because this architecture can fall back to the identity function. First you have an input a^[l], which passes forward through the convolutions, batch normalization, activation function, etc., and finally there is an output z^[l+2] (computed in the hidden layers with some parameters theta). Here the architecture adds a^[l], giving ReLU(z^[l+2] + a^[l]). Then, in the backpropagation step, there is the possibility that the optimal parameters producing z^[l+2] are 0, so the result is just a^[l] (because you apply a ReLU activation function), and this means the intermediate layers won't be used. If you build a big, deep NN, this architecture can skip the residual blocks that don't help to reach the optimum.
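A small sketch of that collapse-to-identity point: if a (hypothetical) two-conv block's parameters are all zero, ReLU(z + a) reduces to a for non-negative activations a, so the block behaves as an identity:

```python
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(8, 8, 3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(8, 8, 3, padding=1),
)

# zero out every parameter: the "these layers aren't needed" case
with torch.no_grad():
    for p in block.parameters():
        p.zero_()

a = torch.relu(torch.randn(1, 8, 16, 16))  # non-negative input activations a^[l]
out = torch.relu(block(a) + a)             # ReLU(z^[l+2] + a^[l]) with z^[l+2] == 0
print(torch.allclose(out, a))              # True: the block acts as the identity
```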
@dapr98 8 months ago
Great video! Thanks. Would you recommend ResNet over CNN for music classification?
@krishnashah6654 5 months ago
i'd just say thank you so much man!
@BABA-oi2cl 4 months ago
Thanks a lot ❤
@doudouban 7 months ago
2:06, the equation shift seems problematic.
@the_random_noob9860 4 months ago
Lifesaver! Also, for classification, it's inevitable that the dimensions go down and the channels go up across the network. But the 1x1 convolution on the input features to 'match the dimensions' kind of loses the original purpose, i.e. to retain/boost the original signal. In a sense it's another conv operation, so the result is no longer similar to the input (I mean it could be similar, but certainly not as similar as the input features themselves). The original idea was to keep the same input features so that we could zero out the weights if no transformation is needed. At least it's not as different as the way the input features are transformed across the usual conv block (conv, pooling, batch norm and activation). Let me know if I am missing anything.
@lifeisbeautifu1 5 months ago
that was good!
@gusromul3356 3 months ago
cool info, thanks rupert ai
@cocgamingstar6990 1 year ago
Very bad
@rupert_ai 1 year ago
Feel free to leave some constructive feedback :) Or did you mean to write badass? If so, thanks!