
The Evolution of Gradient Descent 

Siraj Raval
770K subscribers
93K views

Which optimizer should we use to train our neural network? TensorFlow gives us lots of options, and there are way too many acronyms. We'll go over how the most popular ones work and in the process see how gradient descent has evolved over the years.
Code from this video (with coding challenge):
github.com/llS...
Please subscribe! And like. And comment. That's what keeps me going.
More learning resources:
sebastianruder....
www.tensorflow...
machinelearning...
cs231n.github.i...
www.cs.toronto...
• Gradient Descent - Art...
Join us in the Wizards Slack channel:
wizards.herokua...
And please support me on Patreon: www.patreon.co...
Follow me:
Twitter: / sirajraval
Facebook: / sirajology
Instagram: / sirajraval
Signup for my newsletter for exciting updates in the field of AI:
goo.gl/FZzJ5w
Hit the Join button above to sign up to become a member of my channel for access to exclusive content!
Join my AI community: chatgptschool.io/
Sign up for my AI Sports betting Bot, WagerGPT! (500 spots available):
www.wagergpt.co

Published: Aug 29, 2024
Comments: 185
@omarch 7 years ago
I'm finishing my Computer Science Master's; my research is in NLP (I use a lot of RNNs), but your videos always give me small insights that help me understand deep learning more "deeply" haha. Just wanted to say that, you're great Siraj!
@SirajRaval 7 years ago
hell yes great to hear
@12229sz 7 years ago
These videos are definitely getting better.
@alexandremarcotte7368 7 years ago
Luis, he for sure optimizes his video production with ADAM
@SirajRaval 7 years ago
thanks luis i optimize
@ananthraj7368 7 years ago
Hi Siraj, I got a chance to watch a few of your videos. I've been an ML researcher for 8 years, but I find your teaching method awesome; anyone could learn from it. Great work.
@pranavagarwal768 7 years ago
1:59 was simply awesome
@SirajRaval 7 years ago
glad u liked
@RahulSingh-xj5ry 6 years ago
Bro, that was awesome when you said "Oh, Gradient Descent, lead us to convergence!!"
@ryancooper944 7 years ago
Thank you so much for this video! I was just about to start researching the differences between the SGD optimization algorithms. Thank you so much for saving me so much time and making a video that has all the pertinent information in a very informative and understandable way. I love your videos so much. Thank you, Siraj, you are my favorite person on the internet. Don't stop what you're doing. You're helping so many people learn so much information that can be sometimes hard to find. Thanks!!!
@onefulltimeequivalent1230 7 years ago
your ode to gradient descent is priceless
@rasen58 7 years ago
Your last few videos have been so on point! Very interesting things that are useful for someone who already knows a decent amount of ML and NNs, but not NNs so deeply.
@RedShipsofSpainAgain 6 years ago
@2:30 Siraj, you should change that graphic of the function y=x^2. The function shown there is not x^2 and could confuse people. You're talking about decreasing the x value in the negative direction of the gradient, from x=2.3 to x=1.4 to x=0.7, basically moving from a high x on the right towards smaller x values on the left. Yet the graphic shows movement from the left to the right. Some newbies may be confused by that. But great vid overall.
@NStewF 4 years ago
Thank you for stating this, as I was confused about the gradient values (the slopes of the tangents to the cost function) in relation to the graph. I was looking for such a comment here and you provided it.
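For anyone wanting to check the numbers in this thread, here is a minimal sketch (an assumption for illustration, not from the video's repo) of gradient descent on f(x) = x². Each step computes x ← x − lr·2x = x(1 − 2·lr), so x shrinks toward the minimum at 0 regardless of how the plot is drawn; lr = 0.2 roughly reproduces the 2.3 → 1.4 → 0.8 sequence discussed above.

```python
# Minimal sketch of gradient descent on f(x) = x^2 with a fixed
# learning rate; the start 2.3 and lr 0.2 mirror the comment above.
def gradient_descent(x0, lr=0.2, steps=5):
    x = x0
    for _ in range(steps):
        grad = 2 * x       # f'(x) for f(x) = x^2
        x -= lr * grad     # step against the gradient, so x shrinks toward 0
        print(round(x, 2))
    return x

gradient_descent(2.3)  # prints 1.38, 0.83, 0.5, 0.3, 0.18
```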
@deniscandido4116 7 years ago
duuuuuude, you're the best. I wasn't able to understand these concepts (reading these overly complicated articles) but now it's becoming clear. Thanks a lot. For example, I realized that adam is the best solver after weeks of gridsearch tests, but I didn't know why.... and now it's clear.
@leosousa7404 7 years ago
*Laughs* I try to hate Siraj because of his overly animated video format, but I actually really like him. "Generalization is the hallmark of intelligence" is true and a rare statement, congratulations!
@Isti115 5 years ago
This video has way fewer views than it should have... I really hope that more people will find you and your great content!
@skatinho9883 2 years ago
Aye what an explanation man, big ups, you make an already interesting topic way more interesting. Thanks Siraj!
@Schmuck 7 years ago
Siraj is a robot. His videos keep getting better and better.
@stftcalculations 6 years ago
awesome video man. Never seen a guy explain something so technical in such an ebullient way!!
@TheNiklas3000 7 years ago
You are improving a lot in your presentation style Siraj! Talking slower and clearer is really working for your material. Great work👍
@heri_prieto 7 years ago
Just finished reading "Ch 4: Numerical Computation" from Bengio's "Deep Learning" book, I actually understood what you were talking about! haha
@suprotikdey1910 7 years ago
Yes, more evolution-of-DL-algorithms videos please. It's really hard to decide which algorithm to use in which situation most of the time! Thanks Siraj for these great videos.
@martonveto 7 years ago
I can't believe how useful this video is. Rad! Thanks Siraj
@janakpatelk9 7 years ago
Awesome video. I checked my video speed settings twice to see whether YouTube was having issues or Siraj was speaking slowly ;)
@apoorvwatsky 6 years ago
1:51 THAT looked so good. :D
@hanyuliangchina 7 years ago
I feel that your course is getting more and more scattered. If you had one exciting project in which a lot of the knowledge points you have taught could be used, it would help us understand those points much more deeply. So once again I suggest that you do a self-driving car project: Raspberry Pi + camera + remote-control car. There would be a lot of problems to solve, and a lot of the knowledge points you teach could be used in such a project.
@swapanjain892 7 years ago
libai tony exactly my thoughts on this
@hammadshaikhha 7 years ago
My mind is blown! Gradient descent telling me to subscribe now to optimize my future!
@centar15 7 years ago
Hi Siraj, just wanted to compliment everything that you do, the last three videos in particular; the pacing and the overviews were awesome (usually your videos are a bit too fast for me, and I have to go over and over...) :) And a question: I am working with Keras right now (it is just so much easier and more intuitive than TF, for which I see 5 different coding approaches and TF parameters in 5 tutorials for effectively the exact same network) and thought about 2 options for deploying: 1. export the model and weights, load them in TF, and do everything according to your video; 2. save the model and weights, and make a small script in Keras that loads the model and does prediction. Thoughts? Thanks!
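On the deployment question, option 2 is usually the least friction. A minimal sketch, assuming the model was saved earlier with model.save('model.h5'); the filename and input shape here are placeholder assumptions:

```python
# Minimal sketch of option 2 above: reload a saved Keras model in a
# small prediction script. 'model.h5' and the shape are assumptions.
import numpy as np
from keras.models import load_model

model = load_model('model.h5')   # restores architecture + weights
x = np.random.rand(1, 10)        # stand-in for a real input sample
print(model.predict(x))
```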
@Donaldo 7 years ago
One of Siraj's best IMHO
@jony7779 6 years ago
Fantastic video. Keep up the great work Siraj.
@nands4410 7 years ago
5:33 Was awesome
@oluchukwuokafor7729 7 years ago
Man, your memes are out of this world.
@yvonnemal7751 6 years ago
in love with this video! it makes my graduation work easier, thanks :)
@jnsmtzgr 7 years ago
I just wanted to tell you that this video and the similar one on the activation functions are by far the ones that helped me the most! It really helps getting started with ML. I guess it is probably much more difficult (if possible at all), but I would love to get a similar intuition for the appropriate size and number of layers in a deep neural network, depending on what I want to do with it. Or at least, how to tackle the task of finding the best choices for them. Are you planning to do something like that? :)
@gaaligadu148 7 years ago
Hey Siraj, love your videos. Can you do a video on batch normalization and batch renormalization? Thank you
@lucha6262 4 years ago
This is a very good video. Just wanted to add in from this paper (arxiv.org/abs/1705.08292) that adaptive optimizers seem to not always generalise so well
@cuchitp 7 years ago
Hi, I just came across your videos and ML, and I am loving it... I recently saw an example on YouTube of a Google employee training a mobile to identify labels on candy bar and chocolate wrappers... Would you kindly do one from beginning to end? None of the code shows how to capture the label in real time and do the recognition on the device itself... Thank you once again. You are really terrific doing all this for all of us hungry for knowledge.
@dkishore599 4 years ago
I've seen some videos on image classification that use rmsprop. Why and where do we use "rmsprop", as in model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=["accuracy"])?
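The string 'rmsprop' just selects the optimizer inside compile(); any model can use it. A minimal sketch of where it fits (the tiny classifier below is a made-up placeholder, not from any specific video):

```python
# Minimal sketch: 'rmsprop' is only the optimizer choice in compile();
# the model itself is an arbitrary placeholder.
from keras.models import Sequential
from keras.layers import Dense

model = Sequential([
    Dense(64, activation='relu', input_shape=(100,)),
    Dense(10, activation='softmax'),
])
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',   # swap for 'adam' or 'sgd' to compare
              metrics=['accuracy'])
```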
@jaituteja88 6 years ago
Thanks for sharing the knowledge.. this was really helpful in understanding the core concept, although lots of things are there to digest.. still very useful..!!
@dkishore599 4 years ago
You explained SGD, momentum, NAG, Adagrad, Adadelta, and Adam. Can you explain RMSprop and when we would use it?
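For reference, RMSprop (from Hinton's Coursera lecture notes) keeps a decaying average of squared gradients and divides the step by its square root, with γ usually 0.9; the standard statement of the update is:

```latex
E[g^2]_t = \gamma\, E[g^2]_{t-1} + (1-\gamma)\, g_t^2,
\qquad
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\, g_t
```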
@mk17173n 7 years ago
I cant focus. his shirt is distracting.
@beastybear4499 6 years ago
mk17173n lmfaoooo
@JorgetePanete 5 years ago
can't*
@swastiksingh8452 5 years ago
typical indian
@jeremysender 7 years ago
Great videos Siraj. Keep up the awesome work
@jonreynolds7143 6 years ago
Three cheers for Gradient Descent. Hip hip, hooray!
@boemioofworld 7 years ago
you are always improving, thanks
@embiem_ 7 years ago
You're awesome. Thanks for making these videos!! They really help and are entertaining as well.
@shivankgtm 7 years ago
1:52-2:02 was fabulous
@CryAgony88 7 years ago
@Siraj: Although ADAM is objectively the best call, I have noticed that it doesn't always generalise. Especially in computer vision problems, I have found that most of the time a vanilla SGD works better than the more advanced methods; or sometimes SGD works better for the first n epochs, and only afterwards does ADAM make a good contribution. What do you think about this?
@netrunningnow 7 years ago
Amen to that ode to gradient descent.
@johnhammer8668 7 years ago
Absolutely awesome
@revimfadli4666 7 years ago
Having endured a semester of Control Systems Engineering, I find the resemblance between Nesterov's accelerated gradient and PID controllers uncanny. The momentum alone acts like the I(ntegral) term in a PI controller, accelerating convergence while adding its own oscillations. Meanwhile, Nesterov's modification and the D(erivative) term both serve to "brake" the momentum/integral from overshooting. I wonder what other control theories could be applied?
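For readers following this analogy, the standard updates side by side make the "braking" concrete: momentum integrates past gradients, while NAG evaluates the gradient at the looked-ahead position θ − γv, correcting the step before it overshoots:

```latex
\text{Momentum:}\quad v_t = \gamma v_{t-1} + \eta\, \nabla_\theta J(\theta_{t-1}),
\qquad
\text{NAG:}\quad v_t = \gamma v_{t-1} + \eta\, \nabla_\theta J(\theta_{t-1} - \gamma v_{t-1}),
\qquad
\theta_t = \theta_{t-1} - v_t
```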
@noneofyourbusiness6913 7 years ago
Holy shit Siraj, the video quality has gotten so amazing. :)
@Fearror 7 years ago
great video. I would have been interested in knowing more about Nadam too
@rajivgoodboy 7 years ago
Hey Siraj, I was wondering if you could have a live session where we decide on some topic beforehand (something new and really challenging) and all of us contribute and try out some new network structures (brainstorming with ideas from arxiv-sanity). I mean something like the web sessions companies have among employees, but now you have the power to do it with a lot more people. I want to talk about general ideas in this field, and there are just too many being published every day :D
@FilipeSilva1 7 years ago
Awesome! Loved the visual demos
@sharpEAGLES 7 years ago
Bro, please do a video on face recognition using TensorFlow. If you have done some work on it, please share it with me. I am in grave need. Thanks, love your channel.
@eav300M 7 years ago
Where was Siraj when I had to take Numerical Analysis!!
@MarvinSchmarvin 5 years ago
I have trouble understanding 7:10 to 7:35. First, I don't get why a square root in a fraction causes the learning rate to decrease (dividing by small numbers returns big numbers). And I have trouble understanding the term E[g^2].
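On the first point: E[g^2] is a (decaying) average of squared gradients, so it is large, not small, once big gradients have been seen, and it sits under the square root in the denominator. The effective step is η / √(E[g²] + ε), so as E[g²] grows the step shrinks. A quick check with η = 0.1:

```latex
E[g^2] = 1 \;\Rightarrow\; \frac{0.1}{\sqrt{1}} = 0.1,
\qquad
E[g^2] = 100 \;\Rightarrow\; \frac{0.1}{\sqrt{100}} = 0.01
```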
@farzadkhorasani4023 5 years ago
Adagrad is not borrowing the idea from Nesterov. Nesterov emphasizes the momentum of the observed gradients, while Adagrad emphasizes the importance of less frequently seen (sparse) updates. This makes the transition in the middle of the video a bit confusing. Great video though.
@sau18794 7 years ago
Very Enlightening T-shirt, for a moment I thought of asking you about my future 😂
@sapiranimations 7 years ago
Hello Siraj, I enjoy your videos very much!😊 But I have a question. I don't use TensorFlow or any similar library; I enjoy coding the models completely from scratch in C++ and implementing the training algorithm the same way. How useful would such a skill set be in the market?
@tamilupk 6 years ago
Ron, it is always recommended to use a library for deep learning, as most deep learning training needs lots of calculations that are best done with GPUs and distributed computing, where established libraries like TensorFlow excel. But as a beginner you can hand-code everything from scratch to understand the concepts. As for the market you asked about, there are machine learning research labs and big software companies like Google that you can join.
@Philson 6 years ago
It depends right? Different situations might favor different ones.
@SirajRaval 6 years ago
yes
@e89647 7 years ago
You are amazing. Thank you so much for these videos. So entertaining and great content!
@renanangelodossantos4726 4 years ago
Is gradient descent with all those optimizations better than a genetic algorithm? Should I use only backpropagation with gradient descent?
@nesatdereli 6 years ago
In the visualization (8:14), the NAG and Momentum methods follow the adaptive gradient methods, but you said that they go in the wrong direction. Can you make that clearer?
@taihatranduc8613 4 years ago
You are always the best
@everydayhustler1637 5 years ago
This video is part of the Deep Learning/Neural Network Playlist in Siraj's channel; FYI B)
@Gouranshu 7 years ago
thanks man this video really helped me understand this concept.
@FelheartX 7 years ago
But what is the difference between having momentum vs. just having a higher learning rate? After everything in an update step is complete, the result is the same, is it not? Basic momentum calculation -> may overshoot because we are still "moving"; high learning rate -> we overshoot as well.
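One way to see the difference: unrolling the momentum recurrence shows each step is a decaying sum of past gradients, so the step only grows large (up to η/(1−γ), i.e. 10× for γ = 0.9) when successive gradients agree in sign, and the terms cancel when they alternate. A higher learning rate, by contrast, scales every step uniformly, oscillations included:

```latex
v_t = \gamma v_{t-1} + \eta g_t
\;\;\Longrightarrow\;\;
v_t = \eta \sum_{k=0}^{t-1} \gamma^{k} g_{t-k}
\;\xrightarrow{\,g_t \equiv g\,}\;
\frac{\eta\, g}{1-\gamma}
```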
@sibyjoseplathottam4828 7 years ago
Great video as always! Have you seen the Levenberg-Marquardt algorithm used with any deep learning frameworks? It is available in the neural network toolbox in MATLAB, and I have found it to give better results than Adam for single-layer NNs.
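For context, Levenberg-Marquardt interpolates between Gauss-Newton (λ → 0) and gradient descent (large λ). Its standard update, for a residual vector r = y − f(θ) with Jacobian J, is below; because JᵀJ is square in the number of parameters, it works well on small networks but scales poorly to deep ones:

```latex
\left(J^\top J + \lambda I\right)\delta = J^\top r,
\qquad
\theta_{t+1} = \theta_t + \delta
```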
@maanvis81 7 years ago
Love your shirt Siraj :)
@joannot6706 7 years ago
2:01 Siraj : To convergence! ML squad : *To convergence!*
@SirajRaval 7 years ago
woot!!
@andreasv9472 6 years ago
You're awesome! Also, cool shirt bro
@jrabelo_ 7 years ago
hey siraj, great video, do you have the code for the stochastic gradient descent animation?
@alexp5693 7 years ago
Hello. I hope you will answer, as it's really important for me. I'm currently working on a project, and my task is to generate meaningful unique text from a set of keywords. It doesn't need to be long, at least a couple of sentences. I'm pretty sure I have to use an LSTM, but I cannot find any good examples of generating meaningful text. I saw a few randomly generated ones, but that's all. I would be grateful for any advice. Thank you in advance.
@83ETai 7 years ago
Great stuff! Thanks Siraj!
@emas8484 6 years ago
Thank you
@JTMoustache 7 years ago
Clarity ! 💪🏾
@MH_HD 5 years ago
Another question: what about Stochastic Average Gradient descent, the SAG solver used in Python's sklearn?
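In scikit-learn, SAG is not a separate class but a solver option on the linear models. A minimal sketch (the dataset here is chosen arbitrarily for illustration); SAG converges fastest on roughly scaled features, hence the scaler:

```python
# Minimal sketch: SAG is selected via the solver argument.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
clf = make_pipeline(StandardScaler(),
                    LogisticRegression(solver='sag', max_iter=1000))
clf.fit(X, y)
print(clf.score(X, y))
```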
@raviraaja1282 6 years ago
@Siraj, could you explain more about Adagrad and Adadelta? I noticed that the part of the video from 7:09 to 7:35 is confusing. You stated that for Adadelta the *running average at a time step depends only on the previous average and the current gradient*, but a few seconds earlier you said that *E[g^2]_t is the sum of all past squared gradients*. These two statements feel mutually exclusive; could you explain more, please?
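The two statements describe two different denominators, which may be the source of the confusion: in Adagrad the accumulator really is a sum over all past squared gradients (so it only ever grows), while Adadelta replaces that sum with a decaying average that depends only on the previous average and the current gradient:

```latex
\text{Adagrad:}\quad G_t = \sum_{k=1}^{t} g_k^2,
\qquad
\text{Adadelta/RMSprop:}\quad E[g^2]_t = \gamma\, E[g^2]_{t-1} + (1-\gamma)\, g_t^2
```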
@l.mrteera 7 years ago
You are awesome. Thank you for this nice video!
@qwerty11111122 7 years ago
Adam was developed in 2014... imagine what type of optimizations the near future has for neural networks, eh?
@y__h 7 years ago
Adaptive Adam. We should call it Eve.
@luck3949 7 years ago
We need to train a neural network to predict the step.
@ukaszOgan01 6 years ago
@Siraj Raval 04:09 do you have Python code for this plot, for example?
@ismailelezi 7 years ago
Why not NAdam (Adam with Nesterov momentum)?
@Luckasborges 7 years ago
Amazing video!!
@nas4799 5 years ago
Helps a lot. Thanks!
@funeralsfriend7 7 years ago
So an optimiser's efficiency depends on the amount of data available? What's the best optimiser for all the data?
@nikhilmkul 7 years ago
Thanks
@yashchauhan5710 5 years ago
U are God of this field man haha
@gokulsreekumar4371 4 years ago
The neural networks in my brain are classifying this video as SPAM
@codelume 7 years ago
first view, first like, first comment, lol now let's watch the video, thanks
@kemal4282 5 years ago
why did you not mention RMSprop, which is very useful?
@prakashyadav008 7 years ago
Hey Siraj, what do you think of information security or cyber security as a career vs machine learning as a career? Could you make a video on different career options like web development, cyber security, and machine learning? Also, do we use "gradient" as a term only for derivatives of multivariable functions, or for both single- and multi-variable functions?
@MLDawn 5 years ago
I would much rather watch a video dissecting 1 or 2 of these methods in detail, rather than discussing the big picture of all of them. Of course you have done a good job. Take my comment with a pinch of salt ;-)
@StyleTrick 7 years ago
Siraj, how much maths should I know in order to learn Machine Learning? I understand I have to know Linear Algebra, Calculus and Probability, but to what extent?
@joannot6706 7 years ago
I am not Siraj, but I can suggest another way of learning things: start by learning what you find interesting (for you, probably something related to AI; you are awesome! ;) ). When you encounter maths, only then learn the maths that you need; this will prevent you from wasting your time on knowledge that you don't need. I laughed at myself when I first had to learn what a basic sigma does to understand softmax in an ML course on Udacity (yeah, I wanted to learn ML without even knowing that! xD), but now I know far more than that and I am on my way to learning and creating so much more, and so can anyone!
@StyleTrick 7 years ago
Ah that's awesome! How far are you into the Udacity ML course and what do you think of it? I want to take the ML course by Andrew Ng at Stanford, but I need prior knowledge of linear algebra and calculus. I am learning both at the moment, but I'm not too sure how much of each I need. What would you recommend?
@joannot6706 7 years ago
StyleTrick, I am still a beginner in ML as well, so I don't have the best opinion on that ^^ I stopped the Intro to Machine Learning course at lesson three. It's not my first Udacity class, and as usual it's a little slow going through all the videos; sometimes it really goes into every step, and sometimes I wish there were more details. I was too impatient, so I jumped right into TensorFlow, which I recommend no matter what course you take! Try the "Get Started" section on their website: try to understand and run "MNIST for ML Beginners" and "Deep MNIST for Experts". You'll see how much deep learning reduces the error rate, and also how fast it is to make a model with this library compared to the way it's taught in the Stanford course! I saw the ML course you mentioned, and it really goes into the details. I am stating the obvious, but when you build any code, you can do more, faster, by using libraries. If you are more interested in AI research, my guess is that it's good, but is it better than the Udacity class? That's for you to tell! It depends on how you want to use AI in the end. Whatever you do, the only thing that matters is to keep trying as much as you can. Seriously, I'm ditching calls and snaps from my friends; I know that's bad, but it doesn't even bother me ^^ Try to learn TensorFlow on their official website if you haven't yet!
@StyleTrick 7 years ago
Thanks so much for the information! I've heard great things about TensorFlow. It's just that I'm studying some Linear Algebra and Calculus to get some in-depth knowledge of ML. Btw, can I jump into TensorFlow straight away, with no prereqs?
@SirajRaval 7 years ago
wait 5 days
@eav300M 7 years ago
...lead us to singularity....
@joannot6706 7 years ago
*To singularity!*
@SirajRaval 7 years ago
i will
@FermionCP36 7 years ago
Can you explain the lambda and gamma variables in nesterov_method.py? I have no idea where they come from; they don't seem to be part of the equation in the video.
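Without seeing the file, one plausible (but unconfirmed) reading: Nesterov's original 1983 scheme derives the momentum coefficient gamma from an auxiliary lambda sequence that never appears in the simplified update shown in videos. A sketch of that sequence, under the assumption that this is what the script computes:

```python
# Hypothetical reading of nesterov_method.py (assumption, not verified):
# lambda_t is Nesterov's auxiliary sequence and gamma_t the mixing
# weight derived from it for each iteration.
import math

lam = 0.0
for t in range(5):
    lam_next = (1 + math.sqrt(1 + 4 * lam ** 2)) / 2
    gamma = (1 - lam) / lam_next   # momentum/mixing weight this step
    print(t, round(lam, 3), round(gamma, 3))
    lam = lam_next
```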
@itsRAWRtime007 7 years ago
you didn't specify how to submit the Adam optimizer implementation, or I missed it in the readme
@EndersupremE 5 years ago
Hey man, I'm in real need of help with the Adam algorithm; I just can't grasp it. I can't find anywhere what the terms in the algorithm mean. Like, what do Mt and Vt mean? I know they're the mean and a variance, but I don't understand what you mean by that. And what does t stand for? You also said that Adam adapts and "learns" the learning rate, making it a parameter instead of a hyperparameter. Then why is it still in the math? Is that N the initial learning rate that I have to tweak? All I can understand is the e (a really small number just so it doesn't divide by 0), Beta (another hyperparameter), and O (each weight or bias), but I don't know what the t's stand for. Thanks for all the videos btw. Your videos usually don't go into a lot of detail, but it's from them that I learn that these new optimizing functions exist in the first place. Usually I just run off to Google and learn the details of what you show in the videos for myself, but in this case I can't find it anywhere and I'm desperate lol. Thanks for the help.
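For what it's worth, here is the standard statement from the Adam paper (Kingma & Ba, 2014) with the symbols spelled out: t is the step counter, g_t the gradient at step t, η (the "N" in the comment) the step size you still choose yourself, β₁ and β₂ the decay rates (defaults 0.9 and 0.999), ε a tiny constant, and θ the parameter being updated. m_t and v_t are decaying averages of the gradient and the squared gradient (the "mean" and uncentered "variance"), bias-corrected because both start at zero; Adam adapts the per-parameter step scale, not the global η:

```latex
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t,
\qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2
```

```latex
\hat{m}_t = \frac{m_t}{1-\beta_1^{\,t}},
\qquad
\hat{v}_t = \frac{v_t}{1-\beta_2^{\,t}},
\qquad
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t
```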
@ianprado1488 7 years ago
you make amazing videos. thank you
@user-qw2vo5tg2d 6 years ago
You're the best!
@javierverbel9369 5 years ago
Dude, thank you for the video. Can you suggest some references for this topic?
@theruisu21 6 years ago
Great video!!
@yeeen123 4 years ago
RMSprop not covered?
@theophilusananias1416 7 years ago
Please, can you do a tutorial on generating images from text descriptions with TensorFlow?