The masked attention is, as I understand it, really more of a trick for training: when feeding in a training example sentence, we let the decoder predict the "next word" for all prefixes simultaneously and then add up the losses for all of those predictions (and also still add the losses along the batch dimension). However, that requires that the attention heads are never allowed to attend to the next token, so that they are all put into the same situation of not knowing what comes after. And once your model has been trained with masking, you have to keep the masks during inference as well, because that is now the behaviour the subsequent layers in the network expect! (Small sketch below.)
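A minimal sketch of what I mean, assuming PyTorch and a standard scaled-dot-product setup (the names and shapes are my own illustration, not the exact code from the video):

```python
import torch
import torch.nn.functional as F

def causal_self_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    seq_len = q.size(-2)
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)  # (batch, heads, seq, seq)

    # Causal mask: position i may only attend to positions <= i, so every
    # prefix is scored as if the future tokens did not exist.
    mask = torch.triu(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device), diagonal=1
    )
    scores = scores.masked_fill(mask, float("-inf"))

    weights = F.softmax(scores, dim=-1)
    return weights @ v

# Training: one forward pass yields a next-token prediction for every prefix,
# and the per-position losses are summed (and also summed over the batch).
# Inference: the same mask is kept, so the layers see exactly the kind of
# input distribution they were trained on.
```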
Nice summary of the survey paper. Could you also make a video on how to reduce LLM responses for complex RAG architectures, and on prompting techniques that use reasoning?
Thank you for the complete explanation. In quiz 3, I didn't get why bypassing some modules may be helpful. Do you mean that sometimes the initial prompt is so straightforward that there is no need to use some of the modules?
At 37:42, shouldn't we permute the values to (1, 4, 8, 64) first and then reshape to (1, 4, 8*64) to correctly concatenate the multiple heads for the corresponding tokens in each sentence?
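For what it's worth, here is a small PyTorch sketch of the difference, using the shapes from my question above (my own illustration, not the exact code from the video):

```python
import torch

batch, heads, seq_len, head_dim = 1, 8, 4, 64
values = torch.randn(batch, heads, seq_len, head_dim)  # (1, 8, 4, 64)

# Permute first: bring the head dimension next to head_dim, so each token's
# 8 head outputs end up contiguous in its final 512-dim vector.
concat = values.permute(0, 2, 1, 3).reshape(batch, seq_len, heads * head_dim)  # (1, 4, 512)

# Reshaping directly from (1, 8, 4, 64) to (1, 4, 512) gives the right shape
# but mixes values from different tokens and heads, so it is not a true
# concatenation of the heads per token.
wrong = values.reshape(batch, seq_len, heads * head_dim)  # shape matches, content scrambled
```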
The development of calculus is largely attributed to European mathematicians Sir Isaac Newton and Gottfried Wilhelm Leibniz in the late 17th century. However, Indian mathematicians made significant contributions to mathematical concepts that are foundational to calculus.
Hi Ajay, great work!! Quick question: in the code, the default for mask is set to None. Is there an instance during training/inference where we won't add a mask? Or is masking needed only for inference?
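Not the author, but a hedged sketch of how a mask=None default is commonly used in transformer code (illustrative only; the actual code in the video may differ): the encoder's self-attention typically passes no mask, in both training and inference, while the decoder's self-attention passes a causal mask in both phases.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, mask=None):
    # mask=None: e.g. encoder self-attention, where every token may attend to
    # every other token, during both training and inference.
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    if mask is not None:
        # Decoder self-attention passes a causal mask here, in training AND
        # inference, so the layer only ever sees past positions.
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```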
Thanks for the detailed explanation. The first part on the Euclidean max-min distance vs. #dimensions was revealing! One point I am thinking over: even though the max-min distance is shrinking, the ranking of distances will (or might) still hold, irrespective of the number of dimensions. If that's the case, the algorithms should not lose any discriminative power in theory. In practice, yes, the strain this might put on compute requirements can make it impractical, hence the need to reduce dimensions. Would love to hear your thoughts @CodeEmporium
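A quick NumPy experiment illustrating the shrinking max-min contrast (my own sketch with arbitrary parameters, not taken from the video):

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 1000

for dim in (2, 10, 100, 1000):
    x = rng.random((n_points, dim))
    query = rng.random(dim)
    dists = np.linalg.norm(x - query, axis=1)  # Euclidean distances to one query point
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"dim={dim:5d}  relative max-min contrast={contrast:.3f}")

# As dim grows, the contrast shrinks toward 0: all points look almost equally
# far from the query, even though a ranking of distances still exists, which
# is exactly the tension raised in the comment above.
```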