Welcome to AI Bites! I am here to help you understand AI concepts and research papers by providing clear and concise explanations. I will explain the most impactful papers and ideas in Computer Vision, Machine (Deep) Learning, Natural Language Processing, Reinforcement Learning and Generative Adversarial Networks (GANs).
I am a research engineer by profession. As a former member of the Visual Geometry Group (VGG) at the University of Oxford, I have worked with world-leading scientists from leading research labs. While I have been privileged to join a top research lab, I want to share my learning and help you be at your best.
During my MSc in Computer Vision, I noticed that some students with fantastic programming skills struggled to understand the mathematical terms and equations in papers. I found my strength in understanding these papers and explaining them to others in simple terms. So here I am, leveraging those skills to help you in your journey.
Bro, your statement at 05:22 is completely wrong and misleading. LoRA is used for finetuning LLMs when full finetuning is not possible. It does so by freezing all model weights and incorporating and training low-rank matrices (A*B) in the attention modules. LoRA speeds up training and reduces memory requirements, but it does not provide a speedup during inference. If the model is too large to be handled by LoRA due to GPU memory limitations, Quantized LoRA (QLoRA) is used to finetune it. Overall, QLoRA is the more advanced solution for when LoRA alone cannot handle large models.
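For anyone following along, here is a minimal numpy sketch of the low-rank update this comment describes. The sizes and initialisation are illustrative only (real LoRA applies this inside attention projections of a transformer, usually via a library like PEFT):

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8          # hidden size of the (frozen) layer, illustrative
r = 2          # low rank, r << d

W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                 # trainable, initialised to zero

x = rng.normal(size=(d,))

# Forward pass: frozen path plus low-rank update.
# During finetuning only A and B receive gradients; W stays frozen,
# which is where the memory savings come from.
y = W @ x + B @ (A @ x)

# Because B starts at zero, the adapted layer initially equals the
# frozen pretrained layer.
assert np.allclose(y, W @ x)
```

Note the forward pass still computes the full-rank `W @ x`, which is why training gets cheaper but inference does not (unless the factors are merged into W afterwards).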
Does the quantisation (changing the number format) apply only to the result of the activation function, or also to the individual weights? Where in the NN do we apply this quantisation?
Nice, I made this before! A model which picks the correct model! BUT: then I decided that a 1B agent can be the router model! Then I decided to treat models as TOOLS! So once you create an Anthropic model as a tool, it will select Anthropic instead! I think it's all about understanding the power of tools, and even graphs and nodes: if we create some graphs, then their start point is the tool! SO: the docstring methodology is the best version of the tool-calling method, perhaps with a ReAct-type framework (especially when using tools). By creating a detailed docstring with examples in it, each tool added will be woven into the prompt! So the aim is to create a model (or tune one) to use the ReAct framework as well as to select tools! -- I think the Hugging Face agents methodology is the correct one, because we can host models on Hugging Face and hit those Spaces! ... Spaces as TOOLS! So again we see tools taking a front role, as the main prompt is to select the correct tool for the intent. Also train for slot filling and intent detection (there is an HF dataset)... The routing method was a very good learning exercise! But it also needs Pydantic to send back the correct route to select, when it could be done via a tool which is already preprogrammed into the library (stopping reason)...
I always get this message: ImportError: cannot import name 'load_flow_from_json' from 'langflow' (unknown location). I already cloned it from GitHub, using Windows.
I don’t understand why you say that LoRA is fast for inference… in any case you need to forward through the full-rank pretrained weights + the low-rank finetuned weights.
Ah yes. If only we could quantize the weights, we could do better than the pre-trained weights. You are making a fair point here. Awesome, and thank you! :)
@@AIBites Yeah, if only we could replace the pretrained Full-Rank weights by the Low-Rank Weights... really nice video and illustrations! Thanks a lot!
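For what it's worth, the point this thread converges on can be sketched in a few lines of numpy: after finetuning, the low-rank factors can be merged into the pretrained weight once, so inference is a single full-rank matmul with no extra cost. Shapes here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 8, 2
W = rng.normal(size=(d, d))   # pretrained full-rank weight
A = rng.normal(size=(r, d))   # learned LoRA factors after finetuning
B = rng.normal(size=(d, r))

# Merge once, offline: a single full-rank matrix replaces the two paths
W_merged = W + B @ A

x = rng.normal(size=(d,))

# Inference with the merged weight is one matmul, the same cost as the
# original pretrained layer, and matches the two-path computation.
assert np.allclose(W_merged @ x, W @ x + B @ (A @ x))
```

So LoRA is not faster than the pretrained model at inference time; merging only removes the overhead of the extra low-rank path.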
Thank you, that's a beautiful explanation! One thing I struggle to understand is the term "quantization blocks" at 4:30 - why do we need several of them? My understanding from the video is that we ponder using 3 blocks of 16 bits to describe a number. That is 48 bits, which is more expensive than a 32-bit float. But couldn't we just use 16*3 = 48 bits per number instead? Using 48 bits (without splitting them) would give us very high precision within the [0,1] range, due to powers of two. I did ask GPT, and it responded that there exist a "scale factor" and a "zero point", which are constants that shift and stretch the distribution at 6:02. Although I understand these might be those quantization constants, I am not entirely sure what the 64 blocks described in the video at 6:52 are. Is this because the rank of the matrix decomposition is 1, with 64 entries in both vectors?
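Maybe a sketch helps with the "why several blocks" part. The blocks are groups of consecutive weights (e.g. 64 of them) that each get their own quantization constant, so one outlier only distorts its own block rather than the whole tensor. This toy version uses plain symmetric int8 absmax quantization for illustration, not the NF4 format QLoRA actually uses:

```python
import numpy as np

def quantize_blockwise(w, block_size=64, bits=8):
    """Symmetric absmax quantization with one scale per block.

    Storing a separate float scale for every block of 64 values is
    cheap (one extra constant per block), but it lets each block use
    the full integer range, instead of one global scale being ruined
    by a single outlier weight somewhere in the tensor.
    """
    w = w.reshape(-1, block_size)
    qmax = 2 ** (bits - 1) - 1                     # 127 for int8
    scales = np.abs(w).max(axis=1, keepdims=True)  # one constant per block
    q = np.round(w / scales * qmax).astype(np.int8)
    return q, scales

def dequantize_blockwise(q, scales, bits=8):
    qmax = 2 ** (bits - 1) - 1
    return (q.astype(np.float32) / qmax) * scales

w = np.random.default_rng(0).normal(size=(4, 64)).astype(np.float32)
q, s = quantize_blockwise(w.copy())
w_hat = dequantize_blockwise(q, s).reshape(w.shape)
```

Here each row of 64 weights is one block with its own scale `s`; the reconstruction `w_hat` stays close to `w` because the scale is fitted per block.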
Hi, I have 100k PDF documents and I store all the embeddings in a vector store without any chunking. Now, if I want to retrieve using a prompt, how can we augment the relevant information from such a huge un-chunked vector store? Please suggest the best way to handle this problem, and please share some references along with your inputs.
Is there any particular reason you skipped the chunking process? As the pre-processing and chunking operation is essentially a one-time operation, I can think of re-doing the entire vector store with chunking. It may then be much easier to retrieve several times, for multiple queries, as and when needed. What are your thoughts?
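To illustrate the suggestion, here is a deliberately simple character-level chunker. Real pipelines usually split on sentence or paragraph boundaries, and the sizes below are arbitrary; the point is just that each chunk, not each whole document, becomes one entry in the vector store:

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into fixed-size character chunks with overlap.

    The overlap means a sentence cut at a chunk boundary still
    appears whole in the neighbouring chunk, which helps retrieval.
    """
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

doc = "word " * 300          # stand-in for one extracted PDF's text
chunks = chunk_text(doc)     # each chunk would then be embedded
```

With retrieval over chunks, the prompt can be augmented with only the few most relevant passages instead of entire documents.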
The conversion doesn't go through unless we convert to GGUF. At least that was the case for me when I did the work. Maybe some recent commits to the library have eased the process and skipped that step?
I have a question. After embedding, we still have the same number of features x1 -> x4, and say each has dimension 1x10, i.e. 10 features each. W is 4x4, right? My question is: X is 4x10 and X^T is 10x4. How do we compute the dot product W . X^T when the dimensions are (4x4) . (10x4)? Or am I missing something?
We have to choose the dimensions accordingly, so the weights will be 10x10. The choice of these parameters is paramount in designing deep architectures that work end-to-end.
What makes the training of artificial neural networks computationally expensive?

The training of artificial neural networks is computationally expensive for several key reasons:

1. Large number of parameters: Deep neural networks often have millions or even billions of parameters (weights and biases) that need to be optimized during training. Updating such a large number of parameters requires significant computational resources.

2. Iterative process: Training typically involves many iterations over the entire dataset (epochs) to gradually adjust the parameters. This iterative nature makes it time-consuming, especially for large datasets.

3. Backpropagation: The backpropagation algorithm, used to calculate gradients and update weights, requires forward and backward passes through the network for each training example. This process is computationally intensive, especially for deep networks.

4. Matrix operations: Neural network computations involve many matrix multiplications and other mathematical operations, which are computationally expensive, especially as network size increases.

5. Large datasets: Training on large datasets, which is often necessary for good performance, requires processing massive amounts of data repeatedly.

6. Hyperparameter tuning: Finding optimal hyperparameters (e.g., learning rate, network architecture) often involves training multiple models with different configurations, multiplying the computational cost.

7. Complex architectures: Advanced architectures like convolutional neural networks or recurrent neural networks involve specialized operations that add to the computational complexity.

8. Gradient descent optimization: The optimization process itself, typically using variants of gradient descent, requires many small updates to the parameters, each requiring computation of gradients.
These factors combined make the training of artificial neural networks a computationally intensive task, often requiring specialized hardware like GPUs to accelerate the process.

Citations:
[1] Training Neural Networks | Machine Learning - Google for Developers: developers.google.com/machine-learning/crash-course/training-neural-networks/video-lecture
[2] Various Optimization Algorithms For Training Neural Network: towardsdatascience.com/optimizers-for-training-neural-network-59450d71caf6?gi=ea8e0c3dd721
[3] 5 Algorithms to Train a Neural Network - DataScienceCentral.com: www.datasciencecentral.com/5-algorithms-to-train-a-neural-network/
[4] The differences between Artificial and Biological Neural Networks: towardsdatascience.com/the-differences-between-artificial-and-biological-neural-networks-a8b46db828b7
[5] [PDF] Introduction to Neural Network Algorithm: einsteinmed.edu/uploadedFiles/labs/Yaohao-Wu/Lecture%209.pdf
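As a rough illustration of points 1, 2 and 5 above, here is a back-of-the-envelope multiply-add count for a small MLP. All numbers are made up for illustration, and the "3x parameters per example" rule is only a common rough heuristic (forward pass roughly one multiply-add per parameter per example, backward pass roughly twice that):

```python
# Layer widths of a small, illustrative MLP (e.g. MNIST-sized inputs).
layers = [784, 512, 512, 10]

# Parameters: one weight matrix plus one bias vector per layer pair.
params = sum(i * o + o for i, o in zip(layers, layers[1:]))

examples = 60_000   # dataset size (illustrative)
epochs = 50         # number of full passes over the data

# Heuristic: forward ~1x params multiply-adds per example,
# backward ~2x, so one training step costs ~3x params per example.
total_macs = 3 * params * examples * epochs

print(f"{params:,} parameters, ~{total_macs:,} multiply-adds")
```

Even this tiny network lands in the trillions of multiply-adds, which is why the factors listed above compound so quickly for modern billion-parameter models.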
Is there any kind of mathematical proof, or at least reasoning, that the weights of pretrained neural networks are normally distributed? In essence, this is the foundational data point they are using. And yes, very well done - I just found you via good old Google Search, looking for QLoRA. Thanks for investing your time to bring these concepts closer to the community and to people.
Great job mate, keep going. Could you please give an update on current research into proof-of-concept CLIP/text-based prompting for SAM? One kind suggestion: don't make the text animation bounce. It's very distracting when trying to read the text. Maybe you could try another kind of animation, or just keep it simple. Even other objects in the flowcharts bounce when they appear. Please avoid this bounce animation.
Thanks for the great feedback. I never thought the bouncing would turn out to be annoying - I thought it was cool to animate in different ways. So I will keep it simple going forward. I have done a video on SAM2, which is the updated version of SAM for videos. Would you still like text-based prompting for SAM? If so, can you give more details on what you wish to learn?
Yes, that's why they keep the KAN network much smaller than traditional MLP networks. Also, I think it's just the beginning of a new type of network. Let's wait and watch for developments that address overfitting and the other shortcomings of KANs.
Sorry to hear you feel this way. I have had some happy comments and feedback on LinkedIn too :) but I will keep this in mind for next time. YouTube doesn't allow us to edit already-uploaded videos, unfortunately!
Amazing video! I'm wondering, for the model architecture images you provided, are the spatial and temporal layers only describing the architecture of the U-Net block? Would the rest of the Video LDM model be the same?