This is a terrific lecture on a difficult subject. Andrej strips all the complication away, gets to the essentials, and avoids jargon and "academese". The slides are great and the presentation engaging and easy to understand. The kind of teacher we wish was teaching every class!
Having suffered and struggled a lot through Andrew Ng's cs229, I can certify that this lecture is far more accessible to the least math-oriented souls out there. The way Andrej explains concepts *in English* and connects ideas to each other is one of the reasons I did not give up studying these topics.
Can't thank Andrej, the cs231n team, and Stanford enough. I thoroughly enjoy your lectures. Knowledge is one form of addiction and pleasure, and thank you so much for providing it freely. I hope you all enjoy giving it as much as we enjoy receiving it.
This was gold to me! "... and this process is called Back Propagation: it's a way of computing, through recursive application of the chain rule in a computational graph, the influence of every single intermediate value in that graph on the final loss function." (14:48)
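That quote can be made concrete with a tiny sketch (my own toy example, in the spirit of the lecture's small (x + y) * z graph; the exact numbers are only illustrative): each node multiplies its local derivative by the gradient flowing in from above, recursively, until every intermediate value and input knows its influence on the output.

```python
# Toy computational graph: f = (x + y) * z, backprop by the chain rule.
x, y, z = -2.0, 5.0, -4.0   # illustrative values

# forward pass
q = x + y        # q = 3.0
f = q * z        # f = -12.0

# backward pass: each node multiplies its local gradient by the gradient from above
df = 1.0               # df/df
dq = z * df            # * gate: local gradient wrt q is z
dz = q * df            # * gate: local gradient wrt z is q
dx = 1.0 * dq          # + gate: local gradient wrt x is 1
dy = 1.0 * dq          # + gate: local gradient wrt y is 1

print(dx, dy, dz)      # -4.0 -4.0 3.0
```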
Andrej Karpathy's lectures and articles are basically the reason I know anything about anything. The webpage he uses near the end of the lecture to show decision boundaries is quite literally one of the reasons I got interested in machine learning at all.
Thank you so much for making these lectures public!! I had been struggling to understand backpropagation for a long time. You have made it all so simple in this lecture. I am planning to watch the entire series.
Finally, a lecture that lifts the curtain of mystery on backward propagation! Excellent delivery of the core concepts of neural networks. This is the best lecture I have watched on the topic so far (including the most popular ones that are heavily promoted on public platforms!)
After seeing a dozen videos on backpropagation (that includes CSxxx courses too), this is by far the best explanation given by anybody. Thank you Andrej, you have made life a bit simpler during COVID.
This is a beautiful lecture - it gave a very fundamental understanding of backward propagation and its concepts. I see that backward propagation corresponds to demultiplexing and forward prop corresponds to multiplexing, where we are multiplexing the input.
Starts at 6:30.
13:40 - the local gradient is computed on the forward pass: while doing the forward pass, this node can immediately know what dz/dx and dz/dy are, because it knows what function it's computing.
31:30 - when two branches come into a node, we add their gradients during backprop.
34:15 - example of backward-pass code.
37:00 - the gradient of the loss wrt x is NOT just "y" but y*z, because this z is dL/dZ and we want dL/dX: dL/dX = dL/dZ * dZ/dX, where z, the incoming gradient, is dL/dZ and y = dZ/dX.
43:30 - why are x, y, z vectors? Because, recalling the notation for forward propagation, we do forward prop for more than one input at a time; each input can be represented as a column vector or a row vector in the input matrix X?
1:10:00 - decision boundaries formed by a neural network.
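To make the 13:40 and 37:00 notes concrete, here is a minimal multiply gate as I would sketch it (my own code, not the lecture's): it caches its inputs on the forward pass so the local gradients are already known, and the backward pass multiplies them by the incoming gradient; the 31:30 point about branches is just gradient addition.

```python
# Minimal multiply gate (my own sketch, not the course starter code).
class MultiplyGate:
    def forward(self, x, y):
        # Cache the inputs: this is why the local gradients dz/dx = y and
        # dz/dy = x are already known by the time the backward pass arrives.
        self.x, self.y = x, y
        return x * y

    def backward(self, dz):
        # dz is dL/dz flowing in from above; chain rule, not just the local part:
        dx = self.y * dz   # dL/dx = (dz/dx) * (dL/dz) = y * dz
        dy = self.x * dz   # dL/dy = (dz/dy) * (dL/dz) = x * dz
        return dx, dy

# If a value feeds two branches, the gradients coming back from each branch are added:
# grad_x = dx_from_branch_1 + dx_from_branch_2
```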
In summary: we calculate forward to get the loss value, calculate backward to get the gradient, and then update the weights for the next step. And so on, until we're happy with the loss value.
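Spelled out as a bare-bones sketch (a toy linear least-squares problem of my own, not anything from the lecture code), the forward / backward / update cycle looks like this:

```python
import numpy as np

# Toy problem: fit y = X @ w_true with a squared-error loss.
np.random.seed(0)
X = np.random.randn(100, 3)
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w = np.zeros(3)          # the weights we update
lr = 0.1
for step in range(200):
    y_hat = X @ w                          # forward pass
    loss = np.mean((y_hat - y) ** 2)       # loss value
    dw = 2.0 * X.T @ (y_hat - y) / len(y)  # backward pass: dL/dw
    w -= lr * dw                           # update the weights for the next step

print(loss, w)   # loss near 0, w near w_true
```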
This video made me cry😿. This guy did his best and enthusiastically tried to make the students understand BackProp. Stanford students are so lucky. It’s not that you are smarter, but that other former children, such as me, don’t have such a good teacher.
Just one question. Why here ( ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-i94OvYb6noo.html ) do you take the input value to the gate for its own derivative, while in all the other parts you take the already computed value ( ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-i94OvYb6noo.html )? It's kinda confusing. I'd expect that at the beginning we take just 1, not the input to the (1/x) gate.
Suppose the output of (1/x) is F and the output of (+1) is Q. So the local gradient on the F gate should be (dF/dQ); why is that not the case? Here the local gradient over F is (dF/dx)... please explain? @ 18:20
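For what it's worth, here is how I read that step, with the slide's numbers as I remember them (so treat them as approximate): the local gradient of the 1/x gate is d(1/x)/dx = -1/x^2, and it is evaluated at the gate's input; the gradient of 1.0 only appears at the very output of the whole graph.

```python
# The 1/x gate at the end of the sigmoid example (numbers as I remember them
# from the slide, so treat them as approximate).
x_in = 1.37                          # input flowing into the 1/x gate on the forward pass
out = 1.0 / x_in                     # ~0.73, the final sigmoid output
d_out = 1.0                          # backprop starts with gradient 1.0 at the output
d_x_in = (-1.0 / x_in ** 2) * d_out  # local gradient d(1/x)/dx = -1/x^2, times incoming
print(out, d_x_in)                   # ~0.73, ~-0.53
```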
+Profilic Locker LOL, thank God it's not only me. He picked up a pretty accent too. Well, his recent years in the US did him good. Those 8 years changed Badmephisto's voice a lot.
Really good lectures & assignments. I could not figure out how to get dW; I worked out the other parts of the assignments. Now that Assignments 1 & 2 are graded, can someone please tell me? Thanks
The output layer has to match the dimensions of the output, so its dimensions are fixed. Also, since the output is not used for further computation, don't think about it in terms of weights and biases: output layers have no weights and biases, it's just the output. If you're using a sigmoid layer at the end, there is no bias coming in from the previous layer, since that would skew your probability estimates. All the hidden layers can have biases, though.
Around 24:00, shouldn't the gradients be switched? I.e., x0: [-1] x [0.2] = -0.2 and w0: [2] x [0.2] = 0.4. Oh wait, never mind, I see it's explained a couple of slides further!
No. dL/dx0 = df/dx0 * dL/d(w0*x0). The local gradient df/dx0 = 2.0, because x0 is multiplied by w0 = 2.0: every time we increase x0 by h, the value of w0*(x0 + h) increases by h*w0. dL/d(w0*x0) is already calculated as 0.20, so dL/dx0 = 2.0 * 0.20 = 0.40. The same applies to the calculation of dL/dw0.
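A quick numerical check of that step (my own sketch, taking dL/d(w0*x0) = 0.20 as given from the slide):

```python
w0, x0 = 2.0, -1.0
dL_dz = 0.20                 # gradient arriving at the w0*x0 node (from the slide)

# analytic chain rule
dL_dx0 = w0 * dL_dz          # 0.40
dL_dw0 = x0 * dL_dz          # -0.20

# numerical check of the local gradient d(w0*x0)/dx0
h = 1e-5
dz_dx0 = (w0 * (x0 + h) - w0 * x0) / h
print(dL_dx0, dz_dx0 * dL_dz)   # both ~0.40
```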
Someone please help. I am a pianist, so my math is weak, I know I am stupid, but... the derivative of (x + y): if I do an implicit differentiation, it turns out
(x + y) = 3
d(x + y)/dx = d(3)/dx
d(x + y)/dx = 0
1 + 1*(dy/dx) = 0
dy/dx = -1
And the result must be a positive 1! Where is my mistake? Please, someone point it out.
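One way to see it: backprop treats x and y as independent inputs to the + gate, so it takes a partial derivative with y held fixed rather than differentiating along the constraint x + y = 3. A tiny numerical sketch (values of my own, chosen so that x + y = 3):

```python
# The + gate's derivative wrt x is a partial derivative: nudge x, hold y fixed.
f = lambda x, y: x + y
x, y = -2.0, 5.0
h = 1e-5
partial_wrt_x = (f(x + h, y) - f(x, y)) / h
print(partial_wrt_x)   # ~1.0, not -1: we are not differentiating along x + y = 3
```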
Good lecture, but not a good idea to teach the next generation of data scientists that (in this case, Python) code should have no comments and maximum terseness.