I have been following your channel since its inception. You had 67 subscribers when I joined; congratulations on crossing 200k subscribers. It is a commendable achievement to get so many subscribers for a programming/technical channel. Keep up the good work.
SIRAJ !! YOU ARE THE BEST MAN !! THANKS SO MUCH FOR THE EFFORT YOU PUT INTO MAKING THE VIDEOS AND FOR KEEPING US SHARP ! I am a PhD student working with neural networks and support vector machines and you HAVE MADE THINGS SO MUCH FUN !
Hi Siraj, is there any video you posted on selecting the number of layers in a DNN, selecting hyperparameters, and choosing different types of activation functions based on the requirement, i.e. situation-based?
Siraj, you are a beast! Only a few months after it was posted on arXiv, and you are already well into the paper. A synthetic gradient model tells me you will be one of the multi-millionaires in the future :D Might meet each other one day, who knows.
Nice video Siraj, but have I missed it, or did you not compare the training results of the synthetic gradients with normal backprop? That should have been, IMO, a part of this vid.
I created a pull request fixing the notebook code and adding normal backprop for comparison. The synthetic gradient method converges drastically more slowly.
It may be, as with most parallelizations, that the performance gain only appears with a large number of parameters/layers, that is, when the computation at each layer becomes slower than the latency of the parallelization.
Yeah, and not waste so much time explaining basic gradient descent and what calculus is in a video explaining a recently invented top-notch technique. PBS Space Time does this much better. Just point to videos you already have explaining the prerequisites...
Dude! This is the first time backpropagation has made any kind of sense to me! Thanks! I actually also like the slower format for these technical topics. And seeing the code made all the complicated abstract math much more concrete and understandable. Best!
Hinton suggested only a month ago that backprop had run its course. Everyone is obsessed with it, but it is clearly suboptimal. And here you are, already putting forward new ideas. Love your neuroplasticity.
So, each layer essentially guesses what gradient the next layer will send back, and the bit that guesses updates itself upon receiving the 'real' gradient? Sounds like a great way to distribute ML computation, if it really works.
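A minimal numpy sketch of that idea (all names here are illustrative, not from the paper's reference code): the layer updates its weights immediately from a predicted gradient, and the predictor itself learns from the real gradient whenever it arrives.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class DNILayer:
    """One hidden layer that updates right away using a predicted
    (synthetic) gradient, then trains its predictor on the real
    gradient when it arrives later."""
    def __init__(self, n_in, n_out, lr=0.1):
        self.W = rng.normal(0, 0.1, (n_in, n_out))   # layer weights
        self.M = np.zeros((n_out, n_out))            # linear synthetic-gradient model
        self.lr = lr

    def forward_and_update(self, x):
        self.x = x
        self.out = sigmoid(x @ self.W)
        # Predict the gradient at our output and update immediately,
        # without waiting for the rest of the network.
        self.synthetic = self.out @ self.M
        delta = self.synthetic * self.out * (1 - self.out)
        self.W -= self.lr * (self.x.T @ delta)
        return self.out

    def learn_from_true_gradient(self, true_grad):
        # Train the predictor M to match the real gradient when it arrives.
        err = self.synthetic - true_grad
        self.M -= self.lr * (self.out.T @ err)
        # Pass a gradient back for the layer below (w.r.t. our input).
        delta = true_grad * self.out * (1 - self.out)
        return delta @ self.W.T

layer = DNILayer(4, 3)
x = rng.normal(size=(2, 4))
out = layer.forward_and_update(x)
grad_back = layer.learn_from_true_gradient(rng.normal(size=(2, 3)))
print(out.shape, grad_back.shape)  # (2, 3) (2, 4)
```

The key point the commenter is making is visible in the code: `forward_and_update` never waits on downstream layers, so in principle each `DNILayer` could run on its own worker.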
Sounds like at every layer but the last (output) layer, the synthetic gradient is using a second-order approximation: the gradient of the gradient from the next layer. That might explain why the technique would only improve accuracy in large networks, where it captures the loss information in the gradient propagating back through many layers.
It's really helpful that you're using pure Python. I felt kind of lost when using the built-in functions of TensorFlow, but now it makes much more sense how it's all connected. Thank you!
Ohhh shit... such a smart idea. Thanks for your good video. A small training neural network is the core of the synthetic gradient generator, so simple and clever. The whole AI community seems to reuse basic ideas and make AI even more powerful. Must try to play with this idea you shared.
Hey Siraj, can you make a video on node2vec? It came out (semi) recently and I think it's REALLY cool with the ability to predict features, and it's honestly not getting as much attention as it should. You should seriously give it a try. Btw, love your videos as always.
Siraj, I'm interested to know what sources you use, whether aggregate or primary, to keep up to date on new technology. Do you read papers from any particular publication? GitHub pages of particular research projects? Twitter feeds that aggregate promising research?
@Siraj, we are still back-propagating the true gradient in the synthetic gradient method, right, which makes it more like backpropagation? Since the code example does those steps sequentially in each iteration, we might not be seeing much of the power of synthetic gradients. Is this method mainly meant to allow parallel processing? If so, does the update of the synthetic weights have to happen at least once per iteration? If that is the case, we might still have the sequential dependency we had in backpropagation, right?
Very interesting video and technique. Just a suggestion: timings and convergence rate for the two versions of the binary adder network would have been useful to show the speed-up with the use of synthetic gradients.
So... first we trained a single model, adjusting some hyperparams. Then in GANs we train two models at the same time, which means twice the hyperparams. And now we have to train an extra model on each layer, with its own specific hyperparams?? I think I've had enough, Deep Learning. That's it for me. I'm mowing lawns for a living.
The method presented in the video will converge faster than normal backprop after the following improvements: a. the NN learning rate should be decoupled from the synthetic gradients' learning rate, b. synthetic gradients should not be used from the beginning; some iterations are needed for them to improve, c. the layer_2 normal update was missing from the original code. Download the modified code from here: (www.dropbox.com/s/0wspaeesz73ip16/synthetic_gradient_network%20_anikita.py?dl=0)
I don't see how this makes things any faster... because in a normal NN you do forward pass, backward pass, then update; in DNI you do forward pass, then update, then backward pass (so the same amount of total work, if not more), but your updates are less accurate because they look at the gradient from the previous step. Though I do think this could benefit from parallelisation, as every layer is decoupled and could be constructed as a self-contained unit using channels, with only in/out values passed between layers and no interdependence (layer 3 does not need to know layers 1 and 5 exist, as its gradient depends ONLY on layer 4).
Thanks. Actually it reminds me of quantized gradient methods (e.g. 1-bit SGD) for distributed training. In that case the size of the gradients being transferred is a bottleneck, so instead only a quantized value (possibly down to a mere 1-bit sign) is propagated as an approximation. It turns out that the true gradient information eventually propagates, but at a slower rate. Thanks to the smaller amount of data transferred, though, it may happen in shorter wall time.
A very interesting method they introduced there. But in my opinion, the synthetic gradient will only shine in more complex architectures, because of the update locking of the first layers. I believe that is the reason why you did not run it in this video: with only 3 layers, normal backpropagation would have been faster.
Hey Siraj, love the videos. I am picking classes for next semester and can choose either calculus 3 or linear algebra. What do you think will be best as far as becoming a better programmer/ML student?
Did I understand correctly? The synthetic gradient generator is being trained on its own gradient minus the gradient of the next layer? So this delta 'weight matrix' of 'g(n) - g(n+1)' is its training data?
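That's the gist: the generator's training error is its own prediction minus the true gradient sent back from the next layer. A toy sketch of just that update (assumed setup: a linear generator fit to a fixed target gradient that a linear map can represent exactly):

```python
import numpy as np

rng = np.random.default_rng(1)
out = rng.normal(size=(50, 5))      # this layer's activations
T = rng.normal(size=(5, 5))         # hidden target map (for illustration)
true_grad = out @ T                 # gradient arriving from the next layer
M = np.zeros((5, 5))                # linear synthetic-gradient generator

for _ in range(200):
    synthetic = out @ M
    err = synthetic - true_grad     # g_synthetic - g_true is the training signal
    M -= 0.01 * (out.T @ err)       # plain SGD on the generator itself

print(np.abs(out @ M - true_grad).mean())  # near zero: the generator has fit the gradient
```

So the "training data" is the pair (layer output, true gradient from the next layer), and the delta `err` drives the generator's own weight update.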
I wonder what function predicts the minimum size of neural network at which adding synthetic gradients pays off. I'm curious whether there would even be a size at which it would pay off to add a synthetic gradient to your synthetic gradient!
Is the cost vs. weight curve a parabola? I thought it is a parabola (convex) only for linear functions (linear/logistic regression), but for neural nets it is a complicated curve with many local minima. Am I right?
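A toy check of that claim (illustrative values, not from the video): for a single linear weight with squared error, cost(w) is an exact parabola, so its second differences on an even grid are constant; pass the same weight through a sigmoid and the curve is no longer quadratic.

```python
import numpy as np

x, y = 2.0, 1.0
ws = np.linspace(-3.0, 3.0, 13)

linear_cost = (ws * x - y) ** 2                      # parabola in w
net_cost = (1.0 / (1.0 + np.exp(-ws * x)) - y) ** 2  # one sigmoid unit

d2_linear = np.diff(linear_cost, 2)  # second differences
d2_net = np.diff(net_cost, 2)
print(np.allclose(d2_linear, d2_linear[0]))  # True: constant curvature
print(np.allclose(d2_net, d2_net[0]))        # False: not a parabola
```

(Strictly speaking, logistic regression with cross-entropy loss is convex but not a parabola either; "parabola" really only holds for squared error on a linear model.)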
This is a minor point, but when you say the dot product tells us how to multiply matrices at 11:00, isn't that a little bit wrong? Matrix multiplication also exists in linear algebra; what the dot product does is reduce two vectors to a single number.
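For what it's worth, NumPy itself blurs this distinction: `np.dot` on 1-D arrays is the true dot product (a scalar), while on 2-D arrays it performs matrix multiplication, which may be why the video conflates the two.

```python
import numpy as np

# On 1-D vectors, np.dot is the mathematical dot product: a single number.
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(np.dot(a, b))  # 32

# On 2-D arrays, np.dot is matrix multiplication, identical to the @ operator.
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(np.dot(A, B))                          # [[19 22] [43 50]]
print(np.array_equal(np.dot(A, B), A @ B))   # True
```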
Thanks Siraj. Ever since I heard of this technique in a lecture by Alex Graves, I have been interested in synthetic gradients. Google goes further with self-gated activation functions. I can hardly keep up with the rate of progress.
Do recurrent networks need to be unrolled to apply synthetic gradients? Thanks. Edit to clarify: it seems you would need a distinct gradient generator at each timestep t?
Training neural nets to approximate gradient values. But what do you train them with? The actual gradient values. Isn't this just an extra calculation? To benefit, one should stop calculating actual gradients (stop training the synthetic NN) at some point, I guess? But then how do you decide when, and how do you make sure they don't overfit, etc.?
@Siraj, another question I have: if the true gradient propagates slowly with synthetic gradients, I don't understand how this method could perform better than backpropagation. Am I missing something?
The gradient is used to update the weights, "teleporting us slightly sideways" to where the error looks to be smaller. A mini neural network can spot the pattern of these directions *just* by looking at how the slope of the *following* layer changes. What's cool is that after a while, it will get an idea of the *entire* downstream structure and how the true gradient usually flows through it, depending on the output our current layer generates. This mini-network will be really dumb initially, but after a few "gradient examples" it will learn to pick the next "sideways" direction better. It will still need occasional slaps to the face with a true gradient, but it will learn most of the rules after observing the true gradient several times. In other words, this mini-network will learn how to navigate the hyper-surface intelligently, anticipating a certain gradient response from its following (downstream) layer, just by knowing what it supplied to that downstream layer. And because this mini-net is intelligent enough, our main network will trust its advice with the "predicted" (synthetic) gradient.
Hey, I just saw your sigmoid_out2deriv function and was wondering, is it supposed to be the same as sigmoid prime? Because I was using def sigmoidprime(x): return np.exp(-x) / ((1 + np.exp(-x))**2). Can you explain what's the difference?
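They should agree: if out = sigmoid(x), then out * (1 - out) is algebraically the same as exp(-x) / (1 + exp(-x))**2; the out2deriv form just reuses the already-computed output instead of recomputing from x. A quick numerical check:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_out2deriv(out):
    # Derivative expressed in terms of the sigmoid's OUTPUT.
    return out * (1 - out)

def sigmoidprime(x):
    # Derivative expressed directly in terms of the INPUT x.
    return np.exp(-x) / ((1 + np.exp(-x)) ** 2)

x = np.linspace(-5, 5, 11)
print(np.allclose(sigmoid_out2deriv(sigmoid(x)), sigmoidprime(x)))  # True
```

The out2deriv version is what backprop code usually uses, since the forward pass has already computed out.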
I was just kidding... you explained it well. I am now really into ML because of your videos; before that I had no idea about it. Thank you. Keep it up. Blockchains are awesome too.