Hello Alf, thank you for your amazing effort in recording these videos. I also appreciate the time you spend answering the comments. I am wondering if it would be possible for you to also share the course homework with us?
You are also working on the independent and identically distributed (i.i.d.) hypothesis in deep learning; this is the line of research closest to my heart for a thesis!
@@alfcnz Okay! Anyway, I love the courses and I thank you for making them available to everyone. For my part, I am more interested in the independent and identically distributed (i.i.d.) hypothesis in deep learning.
LeCun: Ok, that's it for today, thank you, see you next week. (dead silence) It struck me right away. I rewatched it 10 times and it was very sad, almost painful. After a beautiful 1-hour-and-43-minute-long lecture, what does he get? Dead silence. Of course, money too. It is still sad. Maybe I am used to the German academia style, where at the end of a lecture there is knocking on the tables (instead of applause). LeCun deserved a lot after this lecture. I'm home, and I knocked for two.
Thank you so much for posting these lectures. They are definitely a great source for learning the mechanics behind DL. So, at 1:22:55, what does the professor want us to look out for when using the cross-entropy loss in PyTorch?
The loss is the log of a soft(arg)max, which has an exponential in the numerator that can easily cause numerical stability problems. If you analytically compute the logsoft(arg)max, the log cancels the exp and no issues arise. More info here: timvieira.github.io/blog/post/2014/02/11/exp-normalize-trick/
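To make this concrete, here's a minimal NumPy sketch (my own, not from the lecture) contrasting the naive formula with the analytically simplified log-sum-exp form:

```python
import numpy as np

def naive_log_softmax(x):
    # direct formula log(exp(x) / Σ exp(x)): exp overflows for large scores,
    # yielding inf/inf = nan
    e = np.exp(x)
    return np.log(e / e.sum())

def stable_log_softmax(x):
    # log-sum-exp trick: shift by the max before exponentiating,
    # then simplify log∘exp analytically
    m = x.max()
    return x - m - np.log(np.exp(x - m).sum())

scores = np.array([1000.0, 1001.0, 1002.0])  # perfectly plausible logits
# naive_log_softmax(scores) -> nan's; stable_log_softmax(scores) -> finite values
```

This is also why PyTorch's `nn.CrossEntropyLoss` takes raw scores and applies the log-soft(arg)max internally, rather than expecting you to pass probabilities.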
Thank you so much for sharing this. If you're accepting feedback: I think it's worth using either hand-drawn or LaTeX equations rather than "monospace" or other standard computer fonts. Also, having learned backprop before, it wasn't super clear how ∂C/∂(nth module) is useful as opposed to just ∂C/∂(nth weight).
Hello, thank you so much for uploading this course online; it's very helpful. A question on building the Jacobian: intuitively speaking, I understand that we need ∂c/∂wₖ to minimise the cost function w.r.t. the weights wₖ (around 48:00 in the video), but I'm wondering why we also derive ∂c/∂zₖ?
Thank you for the excellent course. I would like to ask if there are English subtitles for every video, rather than just the auto-generated ones. My English is not that good, so I would like to understand everything literally, especially the technical terms. Thank you :)
Yes, there are. Meaning, you, my subscribers and Twitter followers, are cleaning up the subtitles and translating them into your own languages. When you're done, I'll upload them here on RU-vid. Actually, some are already ready, so I guess I'll upload what I have available right now.
Words cannot really describe how useful this course is, and I don't really know how to thank you for your tremendous effort on this series! I do appreciate it. I just have one question regarding this session. Going deep into the maths, trying to understand how the system works, and following every word of Prof. LeCun, I was wondering: what is your opinion on "why do we add non-linearity to the network?" I know perfectly well that if we compose several linear functions, the output is still linear, so there would be no point; but I want to know the pure mathematical or functional reason for adding such non-linearity. I appreciate your thoughts on it.
I'm glad you like the course. I'll open a Patreon account very soon, in case you want to show your gratitude. That's the actual reason. In order to avoid the collapse of the entire hierarchy, you need "separators" that keep the sandwich of linear layers from reverting to a simple loaf of bread. In more technical terms, the class of functions that you can fit is drastically larger when non-linearities are used.
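A tiny NumPy sketch (my illustration, not from the course) of that collapse: without a non-linearity in between, two stacked linear layers are equivalent to a single one.

```python
import numpy as np

W1 = np.array([[1.0, -2.0],   # first linear layer
               [3.0,  0.5]])
W2 = np.array([[0.5, -1.0]])  # second linear layer
x = np.array([1.0, 1.0])

# no non-linearity: the two layers collapse into the single map W2 @ W1
deep = W2 @ (W1 @ x)
shallow = (W2 @ W1) @ x       # both give -4.0

# a ReLU in between acts as the "separator" and breaks the collapse
separated = W2 @ np.maximum(0.0, W1 @ x)  # -3.5, a genuinely different function
```
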
@@alfcnz Awesome! So, with respect to your words, the main reason is that the class of functions we can fit, once we add non-linearities, will be much larger, while a purely linear model would basically be too restricted? Did I understand correctly?
@@alfcnz Sorry, dear Alfredo, if you don't mind, I have another quick question regarding the video. At 56:56, Dr. LeCun explains the different types of basic modules. I was wondering: how can we choose a module for our model? In other words, how should I know that, for one specific problem, I should use a ReLU, while for another I should use a Duplicate module, and so forth? What is the selection criterion, basically?
There's no independence assumption whatsoever. The sensitivity of a summation module 𝑦 = ∑𝑥ᵢ is just one ⇒ ∂𝑦/∂𝑥ᵢ = 1. That's it. Therefore, ∂𝐶/∂𝑥ᵢ = ∂𝐶/∂𝑦.
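That claim is easy to verify numerically; here's a small NumPy sketch (my own example) computing the chain rule through a summation module by hand:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = x.sum()                 # summation module: y = Σ xᵢ = 6
C = (y - 10.0) ** 2         # some downstream cost C(y)

dC_dy = 2 * (y - 10.0)      # ∂C/∂y = 2(y − 10) = −8
dy_dx = np.ones_like(x)     # ∂y/∂xᵢ = 1 for every i
dC_dx = dC_dy * dy_dx       # chain rule: every input receives ∂C/∂y unchanged
```

The gradient simply fans out through the sum, which is why no independence assumption is needed.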
54:30 What does it mean to differentiate with respect to a matrix? I've asked around and googled a lot, but it does not seem like there is a sensible definition for it. So I wonder how to do it consistently, or does one just do the calculation with index notation and then, after that, one makes up something that works in the context at hand?
Hi @Alfredo, awesome edits ;) Yann mentioned that we need both Jacobians: w.r.t. the parameters (w) and w.r.t. the inputs of each layer (z/x0). What is the reason for the latter? Thanks in advance!
What is the use of a ReLU if I'm not sending any negative scores to it? Doesn't it become linear in that case? Because when I send a positive score, the ReLU just passes it to the next layer without any transformation.
Hi, I am not a super DL hero, but I'll try to explain the basic idea. When your ReLU neuron outputs a non-zero number, that means this neuron is learning something, and therefore it impacts your final output. Second, about passing the score computed by the linear layer before the ReLU: you said the ReLU becomes an identity function, so it does not affect the final result (see the ResNet paper). The purpose of the ReLU is to determine the active neurons; by active, I mean neurons that learn. Finally, it has been shown that using ReLU activations in hidden layers can speed up training more than tanh or sigmoid (see the AlexNet paper).
What do you mean by «if I'm not sending any negative score to it»? How do you _choose_ not to send any negative values? We use the non-linear behaviour around zero to introduce a means for the network to treat data differently. For a more visual explanation, please check out my (high-school) student's video: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-UOvPeC8WOt8.html
@@alfcnz When I commented on this, I was thinking of an example: if I have image data and I normalise all the pixel values between 0 and 1, then, as the data gets multiplied by the weight matrix (hidden layer), I will end up with data points around zero mean; is my understanding correct? To give one hypothetical example: if my data is in the range [0, 1] and my initialised weights are also in [0, 1], i.e. if I'm not using He or Xavier initialisation, then in these cases the ReLU will not work, right?
@@rakshithv5073 I see what you mean. To begin with, your data is zero-mean (even if it was not originally, we always zero-mean it because that's what we expect when initialising the model's parameters) and so are the weights (which are sampled from a zero-mean Gaussian of appropriate variance). The beauty of the ReLU is that you "select" regions of your input space to which we apply an arbitrary linear (affine) transformation. So, you can think of a ReLU net as being a piecewise linear (affine) model.
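Here's a small NumPy sketch of that piecewise-affine view (my own illustration): the on/off pattern of the ReLU units defines a region of input space, and within that region the whole net is exactly one affine map.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 2)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal(1)

def net(x):
    h = np.maximum(0.0, W1 @ x + b1)  # ReLU zeroes out the inactive units
    return W2 @ h + b2

def local_affine(x):
    # the mask of active ReLUs "selects" the region x lives in
    mask = (W1 @ x + b1 > 0).astype(float)
    A = (W2 * mask) @ W1              # effective linear part in this region
    c = (W2 * mask) @ b1 + b2         # effective offset
    return A, c

x = np.array([0.3, -0.2])
A, c = local_affine(x)
# net(x) coincides exactly with the affine map A @ x + c in this region
```
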
Have you tried printing them to PDF from the website? Don't forget to change the theme to a light one, so you actually get a "normal printing format". I'll have a look at Markdown-to-LaTeX conversion, though. I should release both formats; you're right.
@@alfcnz I tried it, and it somehow worked. But sometimes a figure moves from one page to the next, leaving a blank space. It would be much appreciated if you could manage to convert them to TeX or PDF format.
EDIT: Found it! Kaiming initialisation! What is the weight initialisation trick that Prof. LeCun is referring to at 1:38:26 ? My google-fu failed me here :(
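For anyone else landing here: Kaiming (He) initialisation draws weights from a zero-mean Gaussian with variance 2/fan_in, which keeps the activation scale roughly constant through ReLU layers. A minimal NumPy sketch (the function name is my own; in PyTorch the corresponding API is `torch.nn.init.kaiming_normal_`):

```python
import numpy as np

def kaiming_normal(fan_in, fan_out, rng):
    # He et al. (2015): std = sqrt(2 / fan_in) compensates for the ReLU
    # halving the variance of its input on average
    std = np.sqrt(2.0 / fan_in)
    return rng.standard_normal((fan_out, fan_in)) * std

rng = np.random.default_rng(0)
W = kaiming_normal(fan_in=512, fan_out=256, rng=rng)
# empirical std should be close to sqrt(2/512) ≈ 0.0625
```
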
I really think courses like these should be taught either handwriting on a blackboard (see Kilian Weinberger’s ML course on YT), using a pointer that can appropriately direct attention, or designing slides and transitions that carefully direct attention. Explaining busy static slides like the one at ~32:30 by vaguely waving at the projection is pretty suboptimal for anyone who doesn’t already understand what’s going on. Great lecture otherwise!
ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-d9vdh3b787Y.html : why does ∂c/∂w have size [1 × N] even though w is an [N × 1] vector? I thought they were supposed to have the same shape?
@@faizanshaikh5326 d0 is the input dimension: it's equal to the number of elements of a single input data point, and it's used to initialise the mynet object with the correct dimensionality.
@@alfcnz Thanks for the clarification on d0. Actually, I wanted to understand the part at 41:11 onward where the speaker says "This doesn't work actually. It's just a screen dump". Maybe I'm overfocusing, but just wanted to clarify
@@faizanshaikh5326 got it! I fixed the slides / code, and I should have cut out the segment where he says it doesn't work. I'm trying to edit this out now. Thanks for the pointer!
It's okay not to like our content. I only wish they would provide pointers to what exactly they would do differently. Otherwise, it's hard for us to please them next time…