ResNets are tricky to conceptualise as there are many nuances to consider. Dr Bryce, you have done a great job here offering such a brilliant explanation that is both logical and easy to follow. You definitely have a gift of explaining complex ideas. Thank you!
I am writing a thesis on content-based image retrieval and I had to understand the ResNet architecture in depth, and this is by far the most transparent explanation ever!!
Very, very good explanation. Almost all explanations of this forget about the influence of random weights on the forward propagation and focus solely on the backward gradient multiplication, which is why I never understood why you needed to feed forward the input. Thanks a lot!
I have seen a lot of online lectures, but you are the best for two reasons: first, the way you speak is not monotonous, which gives me time to comprehend and process what you are explaining; and second, the effort put into video editing to speed up the writing on the board, which doesn't break the flow of the lecture. Liked your video. Thanks🙂!
This is the clearest video I've ever seen that explains ResNet for a layman while still conveying all the important and relevant information. I couldn't understand the paper, but with this video I finally understood it. Thanks a lot, Professor Bryce - I hope you create more such videos on deep learning!
This is the best Residual Network tutorial I have found so far. As constructive feedback, I would like you to dive more deeply into how shape mismatches are handled, because that part is not on par with the rest of your highly intuitive explanations of the various things happening in a ResNet.
Excellent class! I watched many videos before I came to this video and none explained the concept of residual networks as clearly as you did. Greetings from México!
Wow, this explanation is amazing. So clear! I saw some videos about ResNets, but none of them described the internal structure and working logic of skip connections. Your explanation gives me much more: you explained the way of thinking, the internal structure, and the advantages. Wow!
This is such a clean and helpful video! Thank you very much! The only thing I still don't know: during backpropagation, we now have two sets of gradients for each block, one for the path going through the layers and one for the path going around them. How do we know which one to use to update the weights and biases?
Good question. For any given weight (or bias), its partial derivative expresses how it affects the loss along *all* paths. That means we have to use both the around- and through-paths to calculate the gradient. Luckily, this is easy to compute because the way to combine those paths is just to add up their contributions!
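To make the "just add up their contributions" point concrete, here's a tiny numeric sketch (a made-up one-weight scalar "block", not the network from the video) that compares the summed two-path gradient against a numerical check:

```python
import numpy as np

# Toy residual "block": y = f(x) + x, with f(x) = w * x (a single scalar weight).
# Loss L = 0.5 * (y - t)^2. The gradient of L w.r.t. x flows along BOTH paths:
# the through-path contributes w * dL/dy, the around-path contributes 1 * dL/dy,
# and the total gradient is simply the sum of the two.
w, x, t = 0.5, 2.0, 1.0
y = w * x + x                     # forward pass: through-path + skip-path
dL_dy = y - t                     # gradient at the block's output
dL_dx = w * dL_dy + 1.0 * dL_dy   # through-path + around-path, added together

# Sanity check against a numerical gradient.
eps = 1e-6
def loss(x_):
    return 0.5 * ((w * x_ + x_) - t) ** 2
numeric = (loss(x + eps) - loss(x - eps)) / (2 * eps)
print(dL_dx, numeric)  # both ≈ 3.0
```

The same additive rule applies to the weights themselves: a weight's partial derivative sums its effect on the loss over every path it participates in.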
@csprof, By consistently including the original information alongside the features obtained from each residual block, are we inadvertently constraining our ResNet model to closely adhere to the input data, possibly leading to a form of over-memorization?
Thanks for the nice explanation! But I have one query: in the part at 16:00 where you said "each output neuron gets input from every neuron across the depth of the previous layer", doesn't that make each output neuron along the depth the same??
Thank you very much. I am not sure yet how residual blocks lead to faster gradient passing when the gradient has to go through both paths. As I understand it, this adds more overhead to computing the gradient. Please correct me if I am wrong. Also, could you explain more (or make a video, if possible) on how 1x1 convolutions reduce the depth? For example, I am not sure how an entire depth of, say, 255 gives output to one neuron.
You're right that the residual connections mean more-complicated gradient calculations, which are therefore slower to compute for one pass. The sense in which it's faster is that it takes fewer training iterations for the network to learn something useful, because each update is more informative. Another way to think about it is that the function you're trying to learn with a residual architecture is simpler, so your random starting point is a lot more likely to be in a place where gradient descent can make rapid downhill progress.

For the second part of your question: whenever we have 2D convolutions applied to a 3D tensor (whether the third dimension is color channels in the initial image, or different outputs from a preceding convolutional layer) we generally have a connection from *every* input along that third dimension to each of the neurons. If you do 1x1 convolution, each neuron gets input from a 1x1 patch in the first two dimensions, so the *only* thing it's doing is computing some function over all the third-dimension inputs. And then by choosing how many output channels you want, you can change the size on that dimension. For example, say that you have a 20x20x3 image. If you use 1x1 convolution with 8 output channels, then each neuron will get input from a 1x1x3 sub-image, but you'll have 8 different functions computed on that same patch, resulting in a 20x20x8 output.
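The 20x20x3 → 20x20x8 example above can be sketched in a few lines of numpy: a 1x1 convolution is just a per-pixel linear map over the channel dimension (biases and any deep-learning library omitted for brevity; the random data is purely illustrative).

```python
import numpy as np

# A 1x1 convolution computes a function over all channels at each pixel.
rng = np.random.default_rng(0)
image = rng.standard_normal((20, 20, 3))  # H x W x C_in
weights = rng.standard_normal((3, 8))     # C_in x C_out: 8 functions of the 3 inputs

# Each output neuron sees a 1x1x3 patch: one pixel, all input channels.
# Matrix-multiplying along the last axis applies all 8 functions per pixel.
out = image @ weights                     # H x W x C_out
print(out.shape)  # (20, 20, 8)
```

Picking a smaller `C_out` than `C_in` is exactly how 1x1 convolutions shrink the depth before a more expensive 3x3 layer.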
Isn't this similar to RNNs, where subsets of data are used for each epoch? And in a residual network, a block of layers is injected with a fresh signal, much like boosting.