Week 2 - Lecture: Stochastic gradient descent and backpropagation

Alfredo Canziani

Подписаться 39 тыс.

Просмотров 54 тыс.

50% 1

Видео Поделиться Скачать Добавить в

Опубликовано:

16 июл 2024

Ссылка:

Скачать:

Готовим ссылку...

Добавить в:

Мой плейлист

Посмотреть позже

Комментарии : 100

@thinhon3421 4 года назад

thanks Alfredo for sharing, this series is so wonderful, I have never approached DL in such a detailed way.

@alfcnz 3 года назад

I'm glad now you have the chance! 🤓

@kanishkmair2920 3 года назад

@@alfcnz I tried 11-785 from CMU (which is also online), but it was too difficult at that time. This is comparatively simpler to grasp!

@amirhosseinmohammadi5675 2 года назад

Hello Alf, Thank you for your amazing effort to record these videos. I also appreciate the time you spend answering the comments. I am wondering if it is possible that you also share course homeworks with us?

@alfcnz 2 года назад

Check the Spring 21 course website. 😉

@mahdijavadi3769 2 года назад

Thank you very much. God keep you healthy🙏🙏

@alfcnz 2 года назад

🥰🥰🥰

@madaragrothendieckottchiwa8648 3 года назад

You are also working on the hypothesis of identically distributed independence in deep learning , this is the line of research that is close to my heart for a thesis !!!!!

@alfcnz 3 года назад

That's great! 🤩

@madaragrothendieckottchiwa8648 3 года назад

@@alfcnz Okay !!!!! however I love the courses and I thank you for making them available to everyone for my part I am more interested in the hypothesis of identically distributed independence (iid) in deep learning

@incendioraven4269 4 года назад

thanks for the correction work. nice explanation

@alfcnz 4 года назад

Hehe 😁, I didn't want to interrupt him 😅 Although not too pretty, at least is correct now.

@incendioraven4269 4 года назад

@@alfcnz hahaha

@tominikolla2699 3 года назад

Lecun: Ok, that's it for today, thank you, see you next week. (dead silence) It struck me right away. I rewatched it 10 times and it was very sad, almost painful. After a beautiful 1 hour and 43 minutes long lecture what does he get? Dead silence. Of course money too. It is still sad. Maybe I am used to the german academia style where at the end of a lecture there will be knocking on the tables (instead of applause). Lecun deserved a lot after this lecture. I'm home and I knocked for two.

@alfcnz 3 года назад

🥺😍🥳

@anurag316 3 года назад

Thank you so much for posting these lectures. They are definetly a great source to learn the mechanics behind DL. So at 1:22:55 , what does professor wants us to look out for when using cross-entropy loss in pytorch?

@alfcnz 3 года назад

The loss is the log of a soft(arg)max, which has an exponential at the numerator that can easily cause numerical stability problems. If you analytically compute the logsoft(arg)max, then you can simplify the exp with the log, and no issues arise. More info here timvieira.github.io/blog/post/2014/02/11/exp-normalize-trick/

@ryanglambert 3 года назад

Thank you so much for sharing this. If you're accepting feedback: - I think it's worth using either hand drawn or latex equations over "monospace" or other standard computer fonts. - Having learned backprop before, it wasn't super clear how dC/(nth module) is useful as opposed to just dC/(nth weight).

@alfcnz 3 года назад

• I'm forcing him to update all his equations. • dC/(n-th module) is necessary to compute dC/(n-1-th weight).

@purnimakamath3474 3 года назад

Hello, Thank you so much for uploading this course online. Very helpful. A question on building the Jacobian - intuitively speaking, I understand that we need δc/δwk to minimise the cost function w.r.t weights wk (around 48:00 in the video) but wondering why do we derive δc/δzk?

@alfcnz 3 года назад

Actually, we need ∂c/∂z_n+1 to compute ∂c/∂w_n (as you can see from the timestamp you pointed out).

@NiRaNaam555 3 года назад

Thank you for the excellent course. I would like to ask if there are English subtitles for every videos rather than auto-generated ones?. My English is not that perfect so I would like to understand literally, especially technical terms. Thank you :)

@alfcnz 3 года назад

Yes, there are. Meaning, you, my subscribers and Twitter followers, are cleaning the subtitles and translating them in your own language. When you're done, I'll upload them here on RU-vid. Actually, some are already ready. So, I guess I'll upload what I have available right now.

@shirinyamani9733 3 года назад

Words cannot really describe how useful is this course and I don't really know how to be thankful for your greatest effort in this series! I do appreciate it. I just have one question in regards to this session of class, as I went for deep math, trying to understand the system work, following every word of Prof. Lecun, I was wondering to know what is your opinion on "why do we add Non-linearity to the network?" I perfectly know that if we sum several linear functions, the output gonna be linear so there would be no point, but I wanna know the pure mathematical or functional reason for adding such non-linearity? I appreciate your thoughts on it.

@alfcnz 3 года назад

I'm glad you like the course. I'll open a Patreon account very soon, in case you want to show your gratitude. That's the actual reason. In order to avoid collapse of the entire hierarchy, you need “separators” which keep the sandwich of linear layers to revert to a simply loaf of bread. In more technical terms, the class of functions that you can fit is drastically larger when non linearities are used.

@shirinyamani9733 3 года назад

@@alfcnz Awesome! So, with respect to your words, the main reason is because of the number of objective functions, in case we keep it linear, will be much large, and basically computationally inefficient!! Did I understand correctly?

@shirinyamani9733 3 года назад

@@alfcnz Sorry, Dear Alfredo, if you don't mind I have another quick question in regards to the video, in 56:56 minutes of the video, Dr. Lecun is explaining abt the different types of basic modules. Accordingly, I was wondering to know "How" can we chose a module for our model? In other words, "how" should I know that for example for one specific problem I gotta use RelU, while for another one I use Duplicate module and so forth! what is the choice measure basically?

@alfcnz 3 года назад

It does have nothing to do with the objective function. The objective function is what you minimise in order to find the correct parameters.

@alfcnz 3 года назад

We teach that throughout the course. This class gives you a panoramic about what's out there.

@theperfectprogrammer 2 года назад

why cross entropy expect OUTPUT FROM log(softmax,,?? 1:23:00

@rakshithv5073 3 года назад

Could you please guide me to any blog or literature how exactly LASSO pushes weights to zero ?

@alfcnz 3 года назад

Check out the regularisation (practicum 14) video!

@madaragrothendieckottchiwa8648 3 года назад

You are working on a research axis on causal learning in deep learning ?

@alfcnz 3 года назад

No, no causal learning in this course.

@faizanshaikh5326 4 года назад

In the diagram of NN at 37:15, s2 should be followed by h(s2) rather than h(s1), am I interpreting this correctly?

@alfcnz 4 года назад

Yes, indeed. I cannot fix this kind of typos after the video is uploaded.

@MrFurano 4 года назад

1:02:10 When y=x1 + x2, then dC/dx1 = dC/dy .... That's true only when x1 and x2 are independent, right?? Is this assumption practical?

@alfcnz 4 года назад

There's no independence assumption whatsoever. The sensitivity of a summation module 𝑦 = ∑𝑥ᵢ is just one ⇒ ∂𝑦/∂𝑥ᵢ = 1. That's it. Therefore, ∂𝐶/∂𝑥ᵢ = ∂𝐶/∂𝑦.

@MrFurano 4 года назад

24:40 Yann talks about why he views this network as a two-layered network.

@alfcnz 4 года назад

Hahaha, and I still see input, hidden, output, hence 3 layer net 😜

@faizanshaikh5326 4 года назад

@@alfcnz And hence the speaker quotes "you do this - but you don't wanna do this" :D

@sillymesilly 3 года назад

@@alfcnz I input X counted as layer 0?

@alfcnz 3 года назад

@@sillymesilly I start counting from 1.

@TheAravindppadman 3 года назад

For a Turing award winner, the chain rule is definitely kinder garden stuff. hehe

@bhagwat1210 3 года назад

Hi @Alfredo Why can't i play the youtube video any more. Is there any other place where i can find them. Please let me know.

@alfcnz 3 года назад

Have you tried connecting to a more stable WiFi?

@allnamesaregiven 3 года назад

54:30 What does it mean to differentiate with respect to a matrix? I've asked around and googled a lot, but it does not seem like there is a sensible definition for it. So I wonder how to do it consistently, or does one just do the calculation with index notation and then, after that, one makes up something that works in the context at hand?

@pranavpandey2965 3 года назад

Maybe It's the "Jacobian" they are talking about.

@allnamesaregiven 3 года назад

@@pranavpandey2965 But the result of the differentiation of Wx is x^T, x transposed, which is not the jacobian I think.

@alfcnz 3 года назад

Right, just use the index notation and you'll be fine. Moreover, I'd suggest checking out my friend Marc's book mml-book.github.io/ chapter 5.

@rafacardenas8783 3 года назад

Hi @Alfredo, awesome edits ;) Yann mentioned that we need both jacobians wrt the parameters (w) and wrt the inputs of each layer (z/x0), what is the reason of the latter case? Thanks in advance!

@alfcnz 3 года назад

We express the Jacobian with respect to the n module's weight in terms of Jacobian of the input of the n+1 module.

@rakshithv5073 3 года назад

What is the use of using relu if I'm not sending any negative score to it, doesn't it becomes linear in that case ? Bcz when I send a positive score relu just passes to next layer without any transformation .

@kh41r17 3 года назад

Hi, i am not a super DL hero but i'll try to explain some basic idea. when your relu neuron outputs a non zero number that means that this neuron is learning some thing, therefore this neuron impacts your final output. second about passing the score that has been calculate throw the linear layer before the relu, you said that relu becomes an identity function so it is not impacting the final result (see resnet paper). the purpose of relu is to determine the active neuron by active i mean neuron that learns. finally, it has been shown that using relu activation in hidden layer can speed up training more than tanh pr sigmoid (see alexnet paper).

@rakshithv5073 3 года назад

@@kh41r17 Thanks for your reply

@alfcnz 3 года назад

What do you mean by «if I'm not sending any negative score to it»? How do you _choose_ not to send any negative value? We use the non-linear behaviour around the zero to introduce a mean for the network to treat data differently. For a more visual explanation, please check out my (high school) student video: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-UOvPeC8WOt8.html

@rakshithv5073 3 года назад

@@alfcnz When I was commenting about this, I was thinking about an example like if I have an image data and if I normslise all the pixel values between 0 and 1 but now I think as and when it gets multiplied by weight matrix (hidden layer) I will have data points around 0 mean, is my understanding is correct ? If I really have to talk about one hypothetical example if my data is in the range of [0,1] and my initialised weights are also [0,1] if I'm not using He or Xavier initialisation in these cases relu will not work right ?

@alfcnz 3 года назад

@@rakshithv5073 I see what you mean. To begin with, your data is zero-mean (even if it was not originally, we always zero-mean it because that's what we expect when initialising the model's parameters) and so are the weights (which are sampled from a zero-mean Gaussian of appropriate variance). The beauty of the ReLU is that you "select" regions of your input space to which we apply an arbitrary linear (affine) transformation. So, you can think of a ReLU net as being a piecewise linear (affine) model.

@TheAravindppadman 3 года назад

Where can I find the assignments for this course?

@alfcnz 3 года назад

I'll release this semester's when we're done.

@TheAravindppadman 3 года назад

@@alfcnz Thank You

@naimshant7129 4 года назад

Is there any way to make the lecture notes as pdf? It is easier to read the notes as pdf by marking the important parts and revise them again

@alfcnz 4 года назад

Have you tried printing them in PDF from the website? Don't forget to change the theme to a light one, so you can actually have a "normal printing format". I'll have a look about markdown to LaTeX conversion, though. I should be releasing both formats, you're right.

@naimshant7129 4 года назад

@@alfcnz I tried it. It somehow worked. But figure, sometimes, of one page move to the next page keeping a blank . It will be so much appreciated if you can manage to convert the on tex or PDF format.

@jwc7663 3 года назад

Why is the product of two linear functions is a linear function, sir?... counter e.g. x * x = x^2

@jwc7663 3 года назад

@@hihi-pr4uq Thanks for the clear explanation

@SubhomMitra 3 года назад

EDIT: Found it! Kaiming initialisation! What is the weight initialisation trick that Prof. LeCun is referring to at 1:38:26 ? My google-fu failed me here :(

@alfcnz 3 года назад

Léon Bottou's trick. You can find more context here: yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf

@SubhomMitra 3 года назад

Thanks Alfredo. Love the work you're doing with uploading this amazing content for the wider world! 😍

@alfcnz 3 года назад

@@SubhomMitra glad to please. 😊😊😊

@alfcnz 3 года назад

Also, see lecture 1 from David's course: davidrosenberg.github.io/ml2016

@SubhomMitra 3 года назад

@@alfcnz Thank you. I'll check it out.

@ChrisOffner 3 года назад

I really think courses like these should be taught either handwriting on a blackboard (see Kilian Weinberger’s ML course on YT), using a pointer that can appropriately direct attention, or designing slides and transitions that carefully direct attention. Explaining busy static slides like the one at ~32:30 by vaguely waving at the projection is pretty suboptimal for anyone who doesn’t already understand what’s going on. Great lecture otherwise!

@alfcnz 3 года назад

Indeed. That's why I'm teaching the practica, step by step, either with slides animations or digital whiteboard.

@jwc7663 3 года назад

ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-d9vdh3b787Y.html : why is dc/dw is [1 x N] size even though w is [N x 1] vector? I think they are supposed to be same shape?

@alfcnz 3 года назад

Nope. This is called _numerator layout._ You can read more here en.wikipedia.org/wiki/Matrix_calculus#Numerator-layout_notation

@napulen 3 года назад

What was the question on 1:11:52?

@alfcnz 3 года назад

Investigating (increasing volume, denoising + asking the author).

@napulen 3 года назад

@@alfcnz Thanks!

@alfcnz 3 года назад

@@napulen «What does the log do that you don't get from only using the softmax?»

@faizanshaikh5326 4 года назад

What is the issue at 41:23 that the speaker talks about?

@faizanshaikh5326 4 года назад

Is it d0 passed in mynet?

@alfcnz 4 года назад

The question was: “why are we introducing those nonlinearities?”. Answer, everything would collapse to a single linear layer otherwise.

@alfcnz 4 года назад

@@faizanshaikh5326 d0 is the input dimension, and it's equal to the number of elements of a single input data point. It's used to initialise the mynet object with correct dimensionality.

@faizanshaikh5326 4 года назад

@@alfcnz Thanks for the clarification on d0. Actually, I wanted to understand the part at 41:11 onward where the speaker says "This doesn't work actually. It's just a screen dump". Maybe I'm overfocusing, but just wanted to clarify

@alfcnz 4 года назад

@@faizanshaikh5326 got it! I fixed the slides / code, and I should have cut out the segment where he says it doesn't work. I'm trying to edit this out now. Thanks for the pointer!

@ahmadanis9930 2 года назад

Shouldn't dc/ds = dc/dz*dz/dh*dh/ds instead of dc/dz*dz/ds

@ahmadanis9930 2 года назад

as we want to calculate dc/ds where c can be formulated by z(h(s))

@user-ud9hc5sh6n 3 года назад

This guy looks very similar to Bill Gates, except he has Margot (Alan Ford) hairstyle.

@alfcnz 3 года назад

😮😮😮

@user-ud9hc5sh6n 3 года назад

@@alfcnz The meaning of Margot hairstyle is: "I work for who pays more". :)

@alfcnz 3 года назад

@@user-ud9hc5sh6n 😮😮😮

@BuzzBizzYou 3 года назад

I wonder who are those 7 people who didn’t like the video...

@alfcnz 3 года назад

It's okay not to like our content. I only wish they could provide pointers and what exactly they would do differently. Otherwise it's hard for us to please them next time…