You are amazing as always! We are all so gifted and blessed to have you teaching these classes. I am truly amazed by your level of commitment to the community.
I love how Jeremy explains techniques like gradient accumulation. He makes them seem so obvious and powerful that they're hard to forget. Never again will I think big models are out of scope for my experiments! :D
"At this point if you've heard about embeddings before you might be thinking: that can't be it. And yeah, it's just as complex as the rectified linear unit which turned out to be: replace negatives with zeros. Embedding actually means: “look something up in an array”. So there's a lot of things that we use, as deep learning practitioners, to try to make you as intimidated as possible so that you don't wander into our territory and start winning our Kaggle competitions." 🤣
Gradient accumulation is a nice trick; however, for sufficiently large datasets and run times, your memory-bandwidth latency will increase by the same multiple you accumulate.
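For anyone curious what the trick boils down to, here's a rough sketch of gradient accumulation in plain PyTorch; the model and data are made up, and this isn't fastai's GradientAccumulation callback, just the bare idea of taking one optimizer step per `accum` micro-batches:

```python
import torch

accum = 2                                   # accumulate over 2 micro-batches
model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

# fake data standing in for a real DataLoader
batches = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(4)]

for i, (xb, yb) in enumerate(batches):
    loss = loss_fn(model(xb), yb) / accum   # scale so summed grads match one big batch
    loss.backward()                         # grads accumulate in .grad across micro-batches
    if (i + 1) % accum == 0:
        opt.step()                          # one optimizer step per `accum` micro-batches
        opt.zero_grad()
```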
Jeremy - In the deep learning implementation of collaborative filtering, the input is the concatenated embedding of users and items; however, my understanding is that the model is not learning the embedding matrices here, but only the weights (176 × 100) in the first layer and (100 × 1) in the second layer. Am I missing something? I'd appreciate your input.
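For context, here's a minimal sketch of the kind of model being asked about (sizes and names are illustrative, chosen so the concatenated embedding is 176-wide as in the question, not the exact lesson notebook). In plain PyTorch the nn.Embedding matrices show up in model.parameters(), so they are learned by the optimizer right alongside the two linear layers:

```python
import torch
import torch.nn as nn

class CollabNN(nn.Module):
    def __init__(self, n_users=944, n_items=1665, user_dim=74, item_dim=102):
        super().__init__()
        # Learnable embedding matrices for users and items...
        self.user_emb = nn.Embedding(n_users, user_dim)
        self.item_emb = nn.Embedding(n_items, item_dim)
        # ...followed by the two linear layers (176 -> 100 -> 1).
        self.layers = nn.Sequential(
            nn.Linear(user_dim + item_dim, 100),
            nn.ReLU(),
            nn.Linear(100, 1),
        )

    def forward(self, user_ids, item_ids):
        # Look up and concatenate the user and item embeddings, then apply the MLP.
        x = torch.cat([self.user_emb(user_ids), self.item_emb(item_ids)], dim=1)
        return self.layers(x)

model = CollabNN()
preds = model(torch.tensor([0, 1]), torch.tensor([5, 9]))   # shape (2, 1)
print(sum(p.numel() for p in model.parameters()))           # includes embedding weights
```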
I understand the advantage of gradient accumulation in terms of being able to run your training on smaller GPUs by "imitating" a larger batch size when calculating the gradients, but wouldn't a major drawback of gradient accumulation be an increase in training time and ultimately in energy use? I.e., isn't your training going to run at half speed when accum is set to 2? And doesn't the training get slower the more you increase the accum number, because your actual batch sizes are getting smaller and smaller?