I have been following your channel since its inception. You had 67 subscribers when I joined; congratulations on crossing 200k subscribers. It is a commendable achievement to get so many subscribers for a programming/technical channel. Keep up the good work.
SIRAJ !! YOU ARE THE BEST MAN !! THANKS SO MUCH FOR THE EFFORT YOU PUT INTO MAKING THE VIDEOS AND FOR KEEPING US SHARP ! I am a PhD student working with neural networks and support vector machines and you HAVE MADE THINGS SO MUCH FUN !
Hi Siraj, is there any video you posted on selecting the number of layers in a DNN, selecting hyperparameters, and choosing different types of activation functions based on the requirement, i.e. situation-based?
Siraj, you are a beast! Only a few months after it was posted on arXiv, and you are already well into the paper. A synthetic gradient model tells me you will be one of the multi-millionaires in the future :D Might meet each other one day, who knows.
Nice video Siraj, but have I missed it, or did you not compare the training results of the synthetic gradients with normal backprop? That should have been, IMO, a part of this vid.
I created a pull request fixing the notebook code and adding normal backprop for comparison. The synthetic gradient method converges drastically more slowly.
It may be, as with most parallelizations, that the performance gain only appears with a large number of parameters/layers, that is, when the computation at each layer becomes slower than the latency of the parallelization.
Yeah, and not waste so much time explaining basic gradient descent and what calculus is in a video explaining a recently invented top-notch technique. PBS Space Time does this much better. Just point to videos you already have explaining the prerequisites...
Dude! This is the first time backpropagation has made any kind of sense to me! Thanks! I actually also like the slower format for these technical topics. And seeing the code made all the complicated abstract math much more concrete and understandable. Best!
Hinton suggested only a month ago that backprop had run its course. Everyone is obsessed with it, but it is clearly suboptimal. And here you are, already putting forward new ideas. Love your neuroplasticity.
So, each layer essentially guesses what gradient the next layer will send back, and the bit that guesses updates itself upon receiving the 'real' gradient? Sounds like a great way to distribute ML computation, if it really works.
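A minimal numpy sketch of that idea (all names here are illustrative, not from the paper's reference code): the layer updates its weights immediately from a predicted gradient, and the predictor itself learns from the real gradient whenever it arrives.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class DNILayer:
    """One hidden layer that updates right away using a predicted
    (synthetic) gradient, then trains its predictor on the real
    gradient when it arrives later."""
    def __init__(self, n_in, n_out, lr=0.1):
        self.W = rng.normal(0, 0.1, (n_in, n_out))   # layer weights
        self.M = np.zeros((n_out, n_out))            # linear synthetic-gradient model
        self.lr = lr

    def forward_and_update(self, x):
        self.x = x
        self.out = sigmoid(x @ self.W)
        # Predict the gradient at our output and update immediately,
        # without waiting for the rest of the network.
        self.synthetic = self.out @ self.M
        delta = self.synthetic * self.out * (1 - self.out)
        self.W -= self.lr * (self.x.T @ delta)
        return self.out

    def learn_from_true_gradient(self, true_grad):
        # Train the predictor M to match the real gradient when it arrives.
        err = self.synthetic - true_grad
        self.M -= self.lr * (self.out.T @ err)
        # Pass a gradient back for the layer below (w.r.t. our input).
        delta = true_grad * self.out * (1 - self.out)
        return delta @ self.W.T

layer = DNILayer(4, 3)
x = rng.normal(size=(2, 4))
out = layer.forward_and_update(x)
grad_back = layer.learn_from_true_gradient(rng.normal(size=(2, 3)))
print(out.shape, grad_back.shape)  # (2, 3) (2, 4)
```

The key point the commenter is making is visible in the code: `forward_and_update` never waits on downstream layers, so in principle each `DNILayer` could run on its own worker.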
Sounds like at every layer but the last (output) layer, the synthetic gradient is using a second-order approximation: the gradient of the gradient from the next layer. That might explain why the technique would only improve accuracy in large networks, where it captures the loss information in the gradient propagating back through many layers.
It's really helpful that you're using pure Python. I felt kind of lost when using the built-in functions of TensorFlow, but now it makes much more sense how it's all connected. Thank you!
Ohhh shit... such a smart idea. Thanks for your good video. A small training neural network is the core of the synthetic gradient generator, so simple and clever. The whole AI community seems to reuse basic ideas and make AI even more powerful. Must try to play with this idea you shared.
Hey Siraj, can you make a video on node2vec? It came out (semi) recently and I think it's REALLY cool with the ability to predict features, and it's honestly not getting as much attention as it should. You should seriously give it a try. Btw, love your videos as always.
Siraj, I'm interested to know what sources you use, whether aggregate or primary, to keep up to date on new technology. Do you read papers from any particular publication? GitHub pages of particular research projects? Twitter feeds that aggregate promising research?
@Siraj, we are still back-propagating the true gradient in the synthetic gradient method, right, which makes it more like backpropagation? Since the code example does those steps sequentially in each iteration, we might not be seeing much of the power of synthetic gradients. Is this method mainly meant to allow parallel processing? If so, does the update of the synthetic weights have to happen at least once per iteration? If that is the case, we might still have the sequential dependency we had in backpropagation, right?
Very interesting video and technique. Just a suggestion: timings and convergence rate for the two versions of the binary adder network would have been useful to show the speed-up with the use of synthetic gradients.
So... first we trained a single model, adjusting some hyperparams. Then in GANs we train two models at the same time, which means twice the hyperparams. And now we have to train an extra model on each layer, with its own specific hyperparams?? I think I've had enough, Deep Learning. That's it for me. I'm mowing lawns for a living.
The method presented in the video will converge faster than normal backprop after the following improvements: a. the NN learning rate should be decoupled from the synthetic gradients' learning rate, b. synthetic gradients should not be used from the beginning; some iterations are needed for them to improve, c. the layer_2 normal update was missing from the original code. Download the modified code from here: (www.dropbox.com/s/0wspaeesz73ip16/synthetic_gradient_network%20_anikita.py?dl=0)
I don't see how this makes things any faster... because in a normal NN you do forward pass, backward pass, then update; in DNI you do forward pass, then update, then backward pass (so the same amount of total work, if not more), but your updates are less accurate because they look at the gradient from the previous step. Though I do think this could benefit from parallelisation, as every layer is decoupled and could be constructed as a self-contained unit using channels, with only in/out values passed between layers and no interdependence (layer 3 does not need to know layers 1 and 5 exist, as its gradient depends ONLY on layer 4).
Thanks. Actually it reminds me of quantized gradient methods (e.g. 1-bit SGD) for distributed training. In that case the size of the gradients being transferred is a bottleneck, so instead only a quantized value (possibly down to a mere 1-bit sign) is propagated as an approximation. It turns out that the true gradient information eventually propagates, but at a slower rate. Thanks to the smaller amount of data transferred, though, it may happen in shorter wall time.
A very interesting method they introduced there. But in my opinion, the synthetic gradient will only shine in more complex architectures, because of the update locking of the first layers. I believe that is the reason why you did not run it in this video: with only 3 layers, normal backpropagation would have been faster.
Hey Siraj, love the videos. I am picking classes for next semester and can choose either calculus 3 or linear algebra. What do you think will be best as far as becoming a better programmer/ML student?
Did I understand correctly? The synthetic gradient generator is being trained on its own gradient minus the gradient of the next layer? So this delta 'weight matrix' of 'g(n) - g(n+1)' is its training data?
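That's the gist: the generator's training error is its own prediction minus the true gradient sent back from the next layer. A toy sketch of just that update (assumed setup: a linear generator fit to a fixed target gradient that a linear map can represent exactly):

```python
import numpy as np

rng = np.random.default_rng(1)
out = rng.normal(size=(50, 5))      # this layer's activations
T = rng.normal(size=(5, 5))         # hidden target map (for illustration)
true_grad = out @ T                 # gradient arriving from the next layer
M = np.zeros((5, 5))                # linear synthetic-gradient generator

for _ in range(200):
    synthetic = out @ M
    err = synthetic - true_grad     # g_synthetic - g_true is the training signal
    M -= 0.01 * (out.T @ err)       # plain SGD on the generator itself

print(np.abs(out @ M - true_grad).mean())  # near zero: the generator has fit the gradient
```

So the "training data" is the pair (layer output, true gradient from the next layer), and the delta `err` drives the generator's own weight update.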
I wonder what function predicts the minimum size of neural network at which adding synthetic gradients pays off. I'm curious whether there would even be a size at which it would pay off to add a synthetic gradient to your synthetic gradient!
Is the cost vs. weight curve a parabola? I thought it is a parabola (convex) only for linear functions (linear/logistic regression), but for neural nets it is a complicated curve with many local minima. Am I right?
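A toy check of that claim (illustrative values, not from the video): for a single linear weight with squared error, cost(w) is an exact parabola, so its second differences on an even grid are constant; pass the same weight through a sigmoid and the curve is no longer quadratic.

```python
import numpy as np

x, y = 2.0, 1.0
ws = np.linspace(-3.0, 3.0, 13)

linear_cost = (ws * x - y) ** 2                      # parabola in w
net_cost = (1.0 / (1.0 + np.exp(-ws * x)) - y) ** 2  # one sigmoid unit

d2_linear = np.diff(linear_cost, 2)  # second differences
d2_net = np.diff(net_cost, 2)
print(np.allclose(d2_linear, d2_linear[0]))  # True: constant curvature
print(np.allclose(d2_net, d2_net[0]))        # False: not a parabola
```

(Strictly speaking, logistic regression with cross-entropy loss is convex but not a parabola either; "parabola" really only holds for squared error on a linear model.)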
This is a minor point, but when you say the dot product tells us how to multiply matrices at 11:00, isn't that a little bit wrong? Matrix multiplication also exists in linear algebra; what the dot product does is reduce two vectors to a single number.
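For what it's worth, NumPy itself blurs this distinction: `np.dot` on 1-D arrays is the true dot product (a scalar), while on 2-D arrays it performs matrix multiplication, which may be why the video conflates the two.

```python
import numpy as np

# On 1-D vectors, np.dot is the mathematical dot product: a single number.
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(np.dot(a, b))  # 32

# On 2-D arrays, np.dot is matrix multiplication, identical to the @ operator.
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(np.dot(A, B))                          # [[19 22] [43 50]]
print(np.array_equal(np.dot(A, B), A @ B))   # True
```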
Thanks Siraj. Ever since I heard of this technique in a lecture by Alex Graves, I have been interested in synthetic gradients. Google goes further with self-gated activation functions. I can hardly keep up with the rate of progress.
Do recurrent networks need to be unrolled to apply synthetic gradients? Thanks. Edit to clarify: it seems you would need a distinct gradient generator at each timestep t?
Training neural nets to approximate gradient values. But what do you train them with? The actual gradient values. Isn't this just an extra calculation? To benefit, one should stop calculating actual gradients (stop training the synthetic NN) at some point, I guess? But then how do you decide when, and how do you make sure they don't overfit, etc.?
@Siraj, another question I have: if the true gradient propagates slowly with synthetic gradients, I don't understand how this method could perform better than backpropagation. Am I missing something?
The gradient is used to update the weights, "teleporting us slightly sideways" to where the error looks to be smaller. A mini neural network can spot the pattern of these directions *just* by looking at how the slope of the *following* layer changes. What's cool is that after a while, it will get an idea of the *entire* downstream structure and how the true gradient usually flows through it, depending on the output our current layer generates. This mini-network will be really dumb initially, but after a few "gradient examples" it will learn to pick the next "sideways" direction better. It will still need occasional slaps to the face with a true gradient, but it will learn most of the rules after observing the true gradient several times. In other words, this mini-network will learn how to navigate the hyper-surface intelligently, anticipating a certain gradient response from its following (downstream) layer, just by knowing what it supplied to that downstream layer. And because this mini-net is intelligent enough, our main network will trust its advice with the "predicted" (synthetic) gradient.
Hey, I just saw your sigmoid_out2deriv function and was wondering, is it supposed to be the same as sigmoid prime? Because I was using def sigmoidprime(x): return np.exp(-x) / ((1 + np.exp(-x))**2). Can you explain what's the difference?
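They should agree: if out = sigmoid(x), then out * (1 - out) is algebraically the same as exp(-x) / (1 + exp(-x))**2; the out2deriv form just reuses the already-computed output instead of recomputing from x. A quick numerical check:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_out2deriv(out):
    # Derivative expressed in terms of the sigmoid's OUTPUT.
    return out * (1 - out)

def sigmoidprime(x):
    # Derivative expressed directly in terms of the INPUT x.
    return np.exp(-x) / ((1 + np.exp(-x)) ** 2)

x = np.linspace(-5, 5, 11)
print(np.allclose(sigmoid_out2deriv(sigmoid(x)), sigmoidprime(x)))  # True
```

The out2deriv version is what backprop code usually uses, since the forward pass has already computed out.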
I was just kidding... you explained it well. I am now really into ML because of your videos; before that I had no idea about it. Thank you. Keep it up. Blockchains are awesome too.