
Week 2 - Practicum: Training a neural network 

Alfredo Canziani
26K views

Published: 28 Aug 2024

Comments: 69
@dontwannabefound 2 years ago
This practicum series is incredible (I've only watched this one so far); I didn't even watch Yann's lectures and was still able to follow. Kudos to the instructor. Thank you for teaching with such care; I am excited to jump into DL.
@alfcnz 2 years ago
😅😅😅
@Papillon375 4 years ago
Best explanation and slides I have ever seen on this topic. Please keep going!
@alfcnz 4 years ago
🤗 Heading to office to produce two new videos.
@yashtrivedi2702 2 years ago
Man, You are a LEGEND! Big FAN.
@alfcnz 2 years ago
🥰🥰🥰
@introvertedbot 1 year ago
Your humorous way of teaching really makes me LMAO 🤣
@alfcnz 1 year ago
Yay! 😇😇😇
@luksdoc 3 years ago
You're teaching this in a beautiful way, simply amazing.
@alfcnz 3 years ago
Yay! I try very hard, at least 😅
@zukofire6424 1 year ago
This class is gold! Thank you! :)
@alfcnz 1 year ago
You’re welcome 🤗🤗🤗
@jonathansum9084 4 years ago
Thank you for sharing the course. I am sure your work is one of the greatest contributions to the community. But I also hope one thing: I hope you will teach something more advanced too in the future.
@alfcnz 4 years ago
Hahahaha! Just wait until the semester kicks in! I believe you'll be satisfied with the level of advancement!
@MrFurano 4 years ago
33:35 I really like your use of concrete examples to illustrate how the math is done and what the matrices look like. You're awesome! Thank you very much! (The downside of watching this online compared to attending your class in person is that I don't get to see what your laser pointer is pointing at. Hope the pure online lessons will show the pointer too!)
@alfcnz 4 years ago
Thank you for your feedback! Yes, these were in-class lectures, and therefore are tailored to my physical students. I have completely different lecturing styles, which try to maximise engagement with my current audience. If you check some of my other videos on my channel, you'll notice the difference.
@nurbekss2729 1 year ago
Watching your playlist during my university deep learning courses. Thank you for your work, it's very appreciated.
@alfcnz 1 year ago
During? 😮😮😮
@juanmanuelcirotorres6155 3 years ago
This is pure gold, you're amazing, thanks a lot
@alfcnz 3 years ago
Jaja, ¡muchas gracias! 😄
@mizupof 3 years ago
Wow! The students are interacting with you this week! Congrats Alf :D
@alfcnz 3 years ago
Haha, yeah. It takes a few iterations before they get enough courage, haha! That's when I'm really having fun! The interaction and "fun" are keys to learning!
@anrilombard1121 1 year ago
This is better than watching anime!
@alfcnz 1 year ago
👀👀👀
@dontwannabefound 2 years ago
At around 32:52 I felt really dumb, as I could not make sense of l(yhat, c) := -log(yhat[c]). Specifically, I was confused by the expression yhat[c]. This is basically array indexing: if yhat (the predicted value) is (1,0,0) and c = 1, then yhat[c] here is just (1,0,0)[1] = 1 (the first element of the yhat array). I come from a purely mathematical background and this notation was non-standard for me, but I think it is standard in CS. I also suspect this might be the question alluded to around 33:30 and... I think this is what he was trying to answer around 35:00.
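For anyone else tripped up by the notation, here is a minimal sketch of that indexing in PyTorch; the probabilities and class index are made up for illustration, and note that Python indexing is 0-based, unlike the 1-based convention used in the comment above.

```python
import torch

y_hat = torch.tensor([0.7, 0.2, 0.1])  # made-up softargmax output over 3 classes
c = 0                                   # index of the correct class (0-based in Python)

# y_hat[c] simply picks out the c-th entry of the vector, here 0.7
loss = -torch.log(y_hat[c])
print(loss)                             # ≈ 0.3567, i.e. -log(0.7)
```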
@manchen5221 4 years ago
Thank you for sharing the course.
@alfcnz 4 years ago
You're extremely welcome 😄
@pranavpandey2965 4 years ago
18:58 The reason we don't use plain normalisation instead of softmax is that the layer outputs can have negative values; if we normalise those values the usual way, we won't get all the values between 0 and 1, nor will they sum to 1. Since softmax takes an exponential, all the values become positive and we get the required probability distribution.
@alfcnz 3 years ago
Normalisation with what parameters? We want a pseudo-probability as output, therefore we use a softargmax, whose inverse temperature β parametrises its coldness / sharpness. (Super cold, β → +∞ ⇒ argmax.)
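To make the two replies above concrete, here is a small sketch; the logits are made up, and `softargmax` is just a thin wrapper around torch.softmax with an inverse-temperature parameter, not a function from the course code.

```python
import torch

z = torch.tensor([2.0, -1.0, 0.5])   # made-up logits; note the negative entry

# Naive normalisation fails: entries can be negative and the result is not a distribution
print(z / z.sum())                   # tensor([ 1.3333, -0.6667,  0.3333])

# softargmax with inverse temperature beta: always positive, always sums to 1
def softargmax(z, beta=1.0):
    return torch.softmax(beta * z, dim=-1)

print(softargmax(z))                 # a valid (pseudo-)probability distribution
print(softargmax(z, beta=100.0))     # very "cold": approaches a one-hot argmax
```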
@siddhantverma532 2 years ago
When you talk about moving from low dimension to high dimension and things being far apart: is the metric for "far" the Euclidean distance, something else, or just an intuition?
@thomasdeniffel2122 4 years ago
If I change the optimizer at 46:00 from optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, weight_decay=lamda_l2) to optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, weight_decay=lamda_l2) (like in the first example), without changing anything else, I get a very bad result (acc ≈ 0.524). Why is that? I did not expect this, as I expected everything else to be the same (46:08). BTW: thank you so much for your effort. These videos are the best material I know of. Keep going!
@thomasdeniffel2122 4 years ago
I was confused because you stated that you didn't change anything but add the ReLU (ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-WAn6lip5oWk.html&list=PLLHTzKZzVU9eaEyErdV26ikyolxOsz6mq&t=2768). BTW: thank you so much for your effort. These videos are the best material I know of. Keep going!
@alfcnz 4 years ago
Try to run the training a few times? See if you simply got a wacky initialisation. Also, next time use minute:second instead of links to point out the timestamp you're referring to. There's an edit button on these messages, so you can put everything in a single comment. I did not notice the optimiser was different. You can send a PR where you replace the first SGD with Adam, and post the results you're getting (so we compare just networks, and not optimisers). Thank you for following up and applying yourself by running the notebooks. Keep the feedback coming (and if it's coding-specific, feel free to open issues directly on GitHub, so I can better understand what's going on). I'm glad you enjoy the material.
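A minimal sketch of the fair comparison suggested here, assuming a fixed seed so both runs start from identical weights; the model and hyperparameters below are placeholders, not the notebook's actual ones.

```python
import torch
import torch.nn as nn

def make_optimizer(model, kind, lr=1e-3, weight_decay=1e-5):
    # Only the optimiser changes between the two runs
    if kind == "adam":
        return torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    return torch.optim.SGD(model.parameters(), lr=lr, weight_decay=weight_decay)

for kind in ("adam", "sgd"):
    torch.manual_seed(0)                  # identical initialisation for both runs
    model = nn.Sequential(                # stand-in model; the notebook's differs
        nn.Linear(2, 100), nn.ReLU(), nn.Linear(100, 3)
    )
    optimizer = make_optimizer(model, kind)
    # ... run the same training loop for both and compare the final accuracies ...
```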
@mattst.hilaire9101 3 years ago
Loving these lectures so far. Question around 45:40 about the initial loss calculation: I thought the initial y_hat vector would be [1/3, 1/3, 1/3], as the model is randomly guessing between 3 choices, making the initial loss equal to -log(1/3), or 0.481. But it's coming out to be ~1.1. What am I missing here?
@alfcnz 3 years ago
In math we call log what engineers call ln 😜 1.1 _is_ the correct answer.
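In other words, the 0.481 above comes from a base-10 logarithm, while the cross-entropy loss uses the natural log; a quick check:

```python
import math

print(-math.log(1/3))    # natural log (ln): ≈ 1.0986, what the loss actually reports
print(-math.log10(1/3))  # base-10 log: ≈ 0.477, roughly the value computed above
```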
@theperfectprogrammer 3 years ago
I have a question: why do we use PyTorch instead of TensorFlow? What are the advantages?
@alfcnz 3 years ago
I don't know anyone using TensorFlow for research, nor have I ever used it. So I'm not too sure what the advantages of adopting it would be.
@mikhaeldito 3 years ago
I frequently fast-forwarded the video whenever math symbols appeared. I think I am missing a lot of your valuable insights. Math is my weakness, but I sincerely want to improve. I have tried going through several frequently recommended resources, such as Khan Academy, but I got bogged down and demotivated because I could not grasp the relevance of the math to deep learning. I have done FastAI, but I really want to understand the maths behind it. Could you recommend a list of resources that can give me a solid mathematical foundation, especially for deep learning? Thanks in advance!
@alfcnz 3 years ago
mml-book.github.io/
@ManpreetSinghMinhas 3 years ago
Hey Alfredo! Great videos and content. Thank you for sharing them with us! I have a question about the third notebook. I understand that x was drawn from a random normal distribution, but how could I have arrived at the number of iterations required for the norm to exceed 1000? I'm really curious! If you could explain, it would be awesome! Thanks. # BEFORE executing this, can you tell what would you expect it to print? print(i)
@alfcnz 3 years ago
Yup, you have that information already. Trying to figure it out is an exercise you should go through.
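A hedged reconstruction of the reasoning, assuming the cell in question is the usual autograd example where a 3-element standard-normal vector is repeatedly doubled until its norm exceeds 1000; the exact notebook code may differ.

```python
import torch

x = torch.randn(3, requires_grad=True)  # standard normal, so ‖x‖ is about sqrt(3) ≈ 1.7 on average
y = x * 2
i = 0
while y.norm() < 1000:                  # keep doubling until the norm exceeds 1000
    y = y * 2
    i += 1

# ‖y‖ starts around 2·sqrt(3) ≈ 3.5 and doubles on each pass, so you need roughly
# log2(1000 / 3.5) ≈ 8.2 doublings: i typically comes out as 8 or 9, depending on the draw.
print(i)
```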
@sepehreftekharian8659 3 years ago
Hi Alfredo. Thank you so much for the amazing training. I have a question: in the video (30:40), you mentioned that almost 1 is equal to 1+, or that almost 0 is considered 0+. Can you explain that, or point me to a source to read so I can figure it out? Thanks.
@alfcnz 3 years ago
Almost 1 is actually 1-. These are the left and right limits of an open interval. The extrema are not included in the domain. Look up the difference between [0, 1] and ]0, 1[ or (0, 1). Some use the flipped ] and others the ( to indicate that the extremum is not included.
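In symbols, assuming the quantity at 30:40 is a sigmoid/softargmax output (which lives in the open interval):

$$
\sigma(x) = \frac{1}{1 + e^{-x}} \in (0, 1) = \,]0, 1[\,,
\qquad
\lim_{x \to +\infty} \sigma(x) = 1^-,
\qquad
\lim_{x \to -\infty} \sigma(x) = 0^+ .
$$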
@Epistemophilos 2 years ago
The logit is normally ln(p/(1-p)). Why is it used here as the "output of the final linear layer"?
@alfcnz 2 years ago
You need to include the minutes:seconds. Moreover, that could be a mistake on my side. I've always seen the logits as the linear output of a model. I've never pinpointed the actual definition.
@Epistemophilos 2 years ago
@@alfcnz Pardon me, it's at 28:26. Thanks for answering. The logit is the inverse of the logistic (sigmoid), so the logit of the output of the softmax gets you back to the inputs, and in that sense one might call them logits, I suppose? If I'm not mistaken, this means you can interpret the output of the last linear layer (just before the final softmax) as the log of the odds.
@alfcnz 2 years ago
Yeah, that should be it.
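A quick numerical check of the statistics definition discussed above (in deep-learning usage, "logits" is used more loosely for the pre-softargmax linear outputs); the probabilities are made up for illustration.

```python
import torch

p = torch.tensor([0.1, 0.5, 0.9])

logit = torch.log(p / (1 - p))    # classic definition: log-odds, ln(p / (1 - p))
back = torch.sigmoid(logit)       # the sigmoid undoes the logit

print(torch.allclose(back, p))    # True: logit and sigmoid are inverses of each other
```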
@HemilDesai 4 years ago
Thanks a lot for sharing the course. At around 31:00, the cross-entropy is mentioned as l(yhat, c) = -log(yhat[c]). Shouldn't it be l(yhat, c) = -(y_c * log(yhat[c]))? l(yhat, c) = -log(yhat[c]) is just the information content of yhat[c], right?
@alfcnz 4 years ago
You're welcome 😊 About your question, c is the correct class, hence y_c = 1.
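Spelled out, using the notation from the exchange above: with a one-hot target the general cross-entropy sum collapses to a single term,

$$
\ell(\hat{y}, y) = -\sum_{j} y_j \log \hat{y}_j
\;\xrightarrow{\;y_c = 1,\; y_{j \neq c} = 0\;}\;
\ell(\hat{y}, c) = -\log \hat{y}_c .
$$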
@faizanshaikh5326 4 years ago
42:18 import random ** 🤣 Best class ever
@alfcnz 4 years ago
Oops 😛
@rguilliman9216 4 years ago
@@alfcnz What is wrong with import random ?)
@alfcnz 3 years ago
@@rguilliman9216 hum… I just imported “random stuff”, meaning “several libraries”.
@AbhijitGuptamjj 4 years ago
Great content! I really found the visualization very appealing; I can well imagine the effort that went into creating those. I have a question: you've shown the affine transformation as WX + b; however, in many places I have seen people represent the same thing as XW + b, thinking about it like a DAG where X flows into a node representing matmul(W, inp), with another node joining it further on representing the "+" operation with b. Which one do you think is more apt in the context of explaining backprop?
@alfcnz 4 years ago
Yeah, I've spent quite some time making these, hehe. Backprop is "simply" the chain rule. If you are not familiar with the chain rule, then drawing a chain definitely helps illustrate the algorithm that computes the partial derivatives of the final scalar (loss, objective, energy, whatever) with respect to the model parameters. In particular, I have not explained the chain rule in my lectures, but I can see how using a DAG can be helpful. I think I've responded to your question. Let me know if that's not the case.
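On the WX + b versus XW + b question, both describe the same affine map; here is a small sketch of how PyTorch's nn.Linear relates to the two conventions (shapes chosen arbitrarily for illustration).

```python
import torch
import torch.nn as nn

x = torch.randn(4, 3)        # 4 samples as rows, 3 features each
linear = nn.Linear(3, 2)     # weight has shape (2, 3), bias has shape (2,)

# PyTorch computes x @ W.T + b, i.e. the row-vector "XW + b" convention
y1 = linear(x)
y2 = x @ linear.weight.T + linear.bias
print(torch.allclose(y1, y2))   # True

# The column-vector "WX + b" convention from the slides is the same map, transposed
y3 = (linear.weight @ x.T).T + linear.bias
print(torch.allclose(y1, y3))   # True
```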
@ashishjohnsonburself 4 years ago
Hahaha... at 21:56, what a choice of words!
@alfcnz 4 years ago
😜😜😜
@MrYahoo660 3 years ago
How do you translate "squashing" into math language? I'm not a native English speaker :(
@alfcnz 3 years ago
"Squashing" → non-linear transformation, in contrast to "rotating" → linear transformation. In Russian I'd say компрессия. Does it make sense?
@MrYahoo660 3 years ago
@@alfcnz Thank you, Alfredo! I just found the part where you mention model uncertainty at 55:55. Variance is good for normal-like distributions, entropy is good for multimodal distributions, right? Are there any additions? I mean, in which cases do we usually use variance and in which entropy?
@alfcnz 3 years ago
@@MrYahoo660 no, no. To estimate the confidence of a given prediction, regardless of its nature, you want to compute the variance across multiple models or, alternatively, the variance of a model which has its dropout layers active during inference. The entropy of a classification model does not reflect the prediction certainty. They are indeed connected, but it shouldn't be used as a proxy for it.
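A minimal sketch of the second option mentioned above (keeping dropout active at inference and measuring the spread across stochastic passes); the model, sizes, and sample count are placeholders for illustration.

```python
import torch
import torch.nn as nn

model = nn.Sequential(                       # stand-in classifier, not the course model
    nn.Linear(2, 100), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(100, 3)
)

def mc_dropout_predict(model, x, n_samples=100):
    model.train()                            # .train() keeps the Dropout layers stochastic
    with torch.no_grad():
        probs = torch.stack([
            torch.softmax(model(x), dim=-1) for _ in range(n_samples)
        ])
    # Mean prediction and per-class variance across the stochastic forward passes;
    # a large variance signals low confidence in that prediction.
    return probs.mean(dim=0), probs.var(dim=0)

x = torch.randn(5, 2)                        # a batch of 5 made-up inputs
mean_p, var_p = mc_dropout_predict(model, x)
print(mean_p.shape, var_p.shape)             # torch.Size([5, 3]) torch.Size([5, 3])
```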
@PipiKaka9 2 years ago
the standard deviation is F****ed up :)
@alfcnz 2 years ago
😮😮😮
@adwaithvijayakumar9034 4 years ago
Can you post the link to the notebook?
@alfcnz 4 years ago
In the video description there's a link to the course website. All links are there! 😃
@adwaithvijayakumar9034 4 years ago
@@alfcnz Thanks a lot
@naimshant7129 4 years ago
Is there any way to contact you?
@dataaholic 4 years ago
Twitter
@alfcnz 4 years ago
I'm replying to every question y'all have been asking here on RU-vid.