1. How does JVP/VJP avoid computing the full Jacobian? I think the idea behind JVP is kinda covered here: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-wG_nF1awSSY.html&ab_channel=AriSeff . However, I would love to hear your explanation. 2. Are you planning to make videos on JIT, vmap, pmap? And on using these to train a neural net? And also the use of PRNGs in JAX? And pure functions in JAX? I think this will all come together when you build and train a neural net in JAX. Thanks a lot for these amazing videos! Your content is awesome.
Hi, thanks for the nice comment and the interesting question. :) I am (also) deeply fascinated by AD and the differentiable programming it allows. Regarding your questions:

(1) I think it is easier to grasp for forward mode, hence Jacobian-vector products (Jvp, also called pushforward). For a function y = f(x), you want to compute jvp(f, x, v) = (df/dx) @ v. The idea is that this is the directional derivative of f in the direction of v, i.e., jvp(f, x, v) = d/dh f(x + h * v) evaluated at h = 0, and you can get it by running forward-mode AD with respect to the !!scalar!! multiplier h. Hence, you only require one additional function evaluation instead of the many you would need for a full Jacobian. Check out this video in Julia, where I did exactly that: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-f_wre1FjPh4.html (I think this is clearer than writing it down mathematically). [BTW: This is closely related to how you would compute functional derivatives: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-6VvmMkAx5Jc.html ]. Also check out this video on how one could implement a dual-number approach to forward-mode AD: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-vAp6nUMrKYg.html .

For reverse mode, hence vector-Jacobian products (vJp, also called pullback), I do not have as good an intuition, except for "we know how the Jacobian acts". Maybe think of it like matrix-free solutions to linear systems of equations: instead of having a full matrix, you just know "how the matrix-vector product would be computed". I know this is not a fully satisfying answer :D Maybe I can find something more intuitive and make a video about it in the future.

More generally speaking, JAX (like any other AD engine) builds on a collection of Jvp and vJp rules for its primitives (in Julia, these live in ChainRules.jl). Those rules encode how the Jacobian acts on vectors from the right (Jvp) and from the left (vJp) without ever materializing it. I think I should do a video on what such a rule looks like for a specific example :).
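To make the "acting on vectors from the right and from the left without materializing the Jacobian" concrete, here is a small sketch using JAX's own jax.jvp and jax.vjp. The function f and the vectors x, v, u are just illustrative toy choices; the comparison against an explicit Jacobian is only feasible because the example is tiny.

```python
import jax
import jax.numpy as jnp

def f(x):
    # Toy vector-valued function R^3 -> R^2 (purely illustrative).
    return jnp.array([x[0] * x[1], jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])   # point of linearization
v = jnp.array([0.1, 0.2, 0.3])   # tangent vector (input-space direction)
u = jnp.array([1.0, -1.0])       # cotangent vector (output-space direction)

# Forward mode: Jvp computes (df/dx) @ v without building the Jacobian.
y, jvp_out = jax.jvp(f, (x,), (v,))

# Reverse mode: vJp computes u @ (df/dx), again matrix-free.
y2, vjp_fn = jax.vjp(f, x)
(vjp_out,) = vjp_fn(u)

# Sanity check against the explicitly materialized Jacobian.
J = jax.jacfwd(f)(x)
print(jnp.allclose(jvp_out, J @ v))   # matches the right-multiplication
print(jnp.allclose(vjp_out, u @ J))   # matches the left-multiplication
```

Note how jax.vjp returns a closure (vjp_fn) rather than a value: that is the "we know how the matrix acts" idea in code form, since you can pull back as many output-space vectors as you like through the same linearization.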
(2) That's definitely the plan, and it is what the playlist is leading towards: first introduce all the basics and then do a "neural network from scratch" :) Thanks again for the interesting comment :)