
Fine-Tuning Meta's Llama 3 8B for IMPRESSIVE Deployment on Edge Devices - OUTSTANDING Results! 

Scott Ingram
30K subscribers
10K views

This video demonstrates an innovative workflow that combines Meta's open-weight Llama 3 8B model with efficient fine-tuning techniques (LoRA and PEFT) to deploy highly capable AI on resource-constrained devices.
We start by using a 4-bit quantized version of the Llama 3 8B model and fine-tune it on a custom dataset. The fine-tuned model is then exported in the GGUF format, optimized for efficient deployment and inference on edge devices using the GGML library.
Impressively, the fine-tuned Llama 3 8B model accurately recalls and generates responses based on our custom dataset when run locally on a MacBook. This demo highlights the effectiveness of combining quantization, efficient fine-tuning, and optimized inference formats to deploy advanced language AI on everyday devices.
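For readers who want to see roughly what that pipeline looks like in code, here is a minimal sketch using the Unsloth FastLanguageModel API; the model name, LoRA settings, and output directory are illustrative placeholders, not taken from the video notebook.

```python
# Minimal sketch (not the exact notebook from the video): load a 4-bit
# quantized Llama 3 8B with Unsloth, attach LoRA adapters (PEFT), then
# export the fine-tuned model to GGUF for local inference on edge devices.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",  # pre-quantized 4-bit checkpoint
    max_seq_length=2048,
    load_in_4bit=True,
)

# Wrap the base model with LoRA adapters so only a small set of weights is trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                 # LoRA rank
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# ... fine-tune on the custom dataset (e.g. with trl's SFTTrainer) ...

# Export to GGUF, quantized for llama.cpp/GGML-based runtimes (e.g. Ollama).
model.save_pretrained_gguf("llama3-finetuned-gguf", tokenizer,
                           quantization_method="q4_k_m")
```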
Join us as we explore the potential of fine-tuning and efficiently deploying the Llama 3 8B model on edge devices, making AI more accessible and opening up new possibilities for natural language processing applications.
Be sure to subscribe to stay up-to-date on the latest advances in AI.
My Links
Subscribe: / @scott_ingram
X.com: / scott4ai
GitHub: github.com/sco...
Hugging Face: huggingface.co...
Links:
Colab Demo: colab.research...
Dataset: github.com/sco...
Unsloth Colab: colab.research...
Unsloth Wiki: github.com/uns...
Unsloth Web: unsloth.ai/

Published: Sep 25, 2024

Comments: 43
@israelcohen4412 4 months ago
So I never post comments, but the way you explained this was by far the best I have seen online. I wish I had found your channel 8 months ago :) Please keep posting videos; your explanations are very well thought out and put together.
@ChargedPulsar 1 month ago
I'm still struggling. There are a lot of advertisements, explanations, and tutorials for other tools that I don't care about at all. I want to learn about fine-tuning; it should be all about some Python code and whatever libraries are used. People can use all kinds of other tools to support the development; if they can't open an Excel file or don't know where to write their Python code, they shouldn't be in the AI business at all.
@ratsock 4 months ago
Absolutely fantastic! Really appreciate the detailed, clear breakdown of concrete steps that let us drive value, rather than the clickbait hype train that everyone else is on.
@williammcguire9058 1 month ago
Big thanks for the detailed walkthrough; I really learned a lot from your video!
@petroff_ss 3 months ago
Thank you! You have a talent for explaining and planning a workshop! Thank you for your work!
@tal7atal7a66 4 months ago
I like the thumbnails, the choice of topics, the way things are explained, and the person explaining them. Nice channel with very valuable info ❤
@RameshBaburbabu 4 months ago
Thank you so much for sharing that fantastic clip! It was really informative. I'm currently looking into fine-tuning a model with my ERP system, which handles some pretty complex data. Right now, I'm creating dataframes and using panda-ai for analytics. Could you guide me on how to train and make inferences with this row/column data? I really appreciate your time and help!
@scott_ingram 4 months ago
Thanks for your question and for watching the video. I'm glad you found it informative! Your approach largely depends on your use case and the kind of insights you're looking to derive from your data. Generally, you're going to want to follow these steps to train a model with complex data:
* Decide how you plan to interact with the model. For instance, maybe you're doing text generation; natural language understanding tasks like sentiment analysis, named entity recognition and question answering; text summarization; or domain-specific queries like legal, medical or corporate.
* Choose a model that has high benchmarks for the specific requirements of your task, the nature of your data and the desired output format. A model is more likely to train well if the base model's capabilities are already very strong for the task you intend to use it for. Consider factors like model performance, computational resources, and the availability of pre-trained weights for your specific domain or language.
* Prepare and preprocess your dataframes, removing/filling missing values, encoding variables numerically and normalizing the data. The cleaner the data, the better the training will be.
* Split the data into a training set and a validation set. The validation set will be data you haven't trained the model on, to see how the model performs with unseen data.
* Fine-tune with your dataset, test the model out, then iterate on the process by tweaking data, adding more data, trying different training parameters, even trying different models.
Hope this helps guide you in your endeavor!
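As a rough illustration of the preprocessing and train/validation split steps described above, here is a minimal sketch in Python; the CSV file name and the "question"/"answer" column names are hypothetical, not taken from the video or the commenter's ERP system.

```python
# Minimal sketch: clean a tabular export, then split it into training and
# validation sets before formatting it for supervised fine-tuning.
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ERP export with "question" and "answer" columns.
df = pd.read_csv("erp_records.csv")

# Basic cleaning: drop rows with missing values (or use fillna for imputation).
df = df.dropna(subset=["question", "answer"])

# Hold out 10% of the rows as unseen validation data.
train_df, val_df = train_test_split(df, test_size=0.1, random_state=42)

# Format each row as an instruction/response pair for fine-tuning.
def to_example(row):
    return {"instruction": row["question"], "output": row["answer"]}

train_examples = [to_example(r) for _, r in train_df.iterrows()]
val_examples = [to_example(r) for _, r in val_df.iterrows()]
print(f"{len(train_examples)} training examples, {len(val_examples)} validation examples")
```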
@gustavomarquez2269 4 months ago
You are amazing! This is the best explanation about this topic. I liked it and just subscribed. Thank you very much !!!
@scott_ingram 4 months ago
Thank you so much for the kind words and for subscribing, I really appreciate it! I'm so glad you found the video helpful in explaining how to fine-tune LLaMA 3 and run it on your own device. It's a fascinating topic and technology with a lot of potential. I'm looking forward to sharing more content on large language models and AI that you'll hopefully find just as valuable. Stay tuned!
@hellohey8088 2 months ago
Nice video. I have a question: at 8:10, is there any reason why you set add_special_tokens=False in the .encode_plus method? I thought special tokens are added during training, so wouldn't it make more sense to set add_special_tokens=True if we want to know how large the biggest training example will be?
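For context, here is a small sketch of the kind of length check being discussed: measuring the longest tokenized training example while counting special tokens. Whether to include them depends on how the training examples are later assembled; the model id and example strings below are placeholders, not from the notebook.

```python
# Sketch: measure the longest training example in tokens, including the
# special tokens (e.g. BOS/EOS) that the tokenizer would normally add.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")  # placeholder model id

examples = ["First formatted training example...", "Second formatted training example..."]

max_len = 0
for text in examples:
    enc = tokenizer.encode_plus(text, add_special_tokens=True)
    max_len = max(max_len, len(enc["input_ids"]))

print(f"Longest example: {max_len} tokens")
```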
@vilivilhunen3383 2 months ago
Amazing! Thanks for the help :)
@SilentEcho-d9q 4 months ago
Is the output from Ollama on your MacBook in real time, or did you speed it up in the video? On my 2014 iMac it is significantly slower. It's about time for a new one. What are the technical specifications of your Mac?
@scott_ingram 4 months ago
Except for the download, which I sped up significantly, everything in terminal was shown in real time. The demo was done on a MacBook Pro M3 Pro Max. YMMV with other hardware.
@ganeshkumara8501 2 months ago
What parameters need to be updated if I'm using 1,000 question-and-answer pairs in a CSV, and what should the values of those parameters be?
@scott_ingram 2 months ago
Thanks for your question! When scaling up from just 30 records to 1000 question-answer pairs, you'll likely need to adjust several parameters to accommodate the increased dataset size and try to improve the fine-tuning. Here are some key parameters to consider updating:
* Batch size: With a larger dataset, you may want to increase the batch size to process more examples per iteration. This can help improve training efficiency.
* Learning rate: You might need to adjust the learning rate. With more data, you could potentially use a slightly higher learning rate, but be careful not to set it too high.
* Number of epochs: With more data, you might need fewer epochs to reach convergence. Monitor the training and validation loss to determine the optimal number of epochs.
* LoRA rank: You might consider increasing the LoRA rank to allow for more capacity in the adapted layers, given the larger dataset.
* Warmup steps: Adjust the number of warmup steps proportionally to the increased dataset size.
* Weight decay: You might need to adjust the weight decay to prevent overfitting on the larger dataset.
* Gradient accumulation steps: If memory is a constraint, you might need to use gradient accumulation to effectively increase the batch size.
* Evaluation strategy: Consider evaluating more frequently with the larger dataset.
As for specific values, it's hard to provide exact numbers, as they're going to differ between use cases, actual data, hardware, and desired outcomes. But here are some general guidelines:
* Batch size: Try increasing to 16 or 32, depending on your GPU memory.
* Learning rate: Start with 2e-4 to 5e-4 and adjust based on training performance.
* Number of epochs: Start with 3-5 epochs and adjust based on convergence.
* LoRA rank: Consider increasing to 16 or 32.
* Warmup steps: 10% of the total training steps is a good starting point.
* Weight decay: Start with 0.01 and adjust as needed.
* Gradient accumulation steps: Start with 4 or 8 and adjust based on your hardware capabilities.
* Evaluation strategy: Consider evaluating every 500 or 1000 steps.
These are just starting points. Every use case is different. You'll need to experiment. You might be able to use techniques like learning rate finders or hyperparameter tuning to optimize for your specific case. Good luck!
Scott
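To make those guidelines concrete, here is a hedged sketch of how such starting values might be expressed with Hugging Face TrainingArguments and a PEFT LoraConfig. The numbers simply mirror the suggestions above, they are not tuned values, and exact argument names can vary slightly between library versions.

```python
# Sketch only: starting-point hyperparameters for ~1000 QA pairs, mirroring
# the guidelines above. These are assumptions to iterate from, not tuned values.
from transformers import TrainingArguments
from peft import LoraConfig

training_args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=16,   # batch size: 16 or 32, GPU memory permitting
    gradient_accumulation_steps=4,    # raise if memory is tight
    learning_rate=2e-4,               # start in the 2e-4 to 5e-4 range
    num_train_epochs=3,               # start with 3-5 and watch convergence
    warmup_ratio=0.1,                 # ~10% of total training steps
    weight_decay=0.01,
    eval_strategy="steps",            # evaluate periodically on the validation set
    eval_steps=500,
    logging_steps=50,
)

lora_config = LoraConfig(
    r=16,              # LoRA rank: consider 16 or 32 for the larger dataset
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
```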
@andrepamplona9993 4 months ago
Super, hyper fantastic! Thank you.
@lorenzoplaatjies8971 2 months ago
I'm able to get as far as inference; once the model is trained I get an error: name 'FastLanguageModel' is not defined. But thank you for the tutorial.
@ganeshkumara8501 2 months ago
Hi sir. First I started with limited data and it worked fine. After I added another 20 records to the CSV, it is not answering correctly; it gives some other answer. Why?
@scott_ingram 2 months ago
Hi! Thanks for your question. When working with a 4-bit quantized version of Llama 3 8B using LoRA adaptation layers and PEFT, it's not uncommon to see performance changes as you add more data. This is especially true given the constraints of training in Colab on an A100. The issue you're experiencing could be due to several factors:
* Quantization effects: 4-bit quantization, while efficient, can sometimes lead to loss of precision, which might become more apparent with more diverse data.
* LoRA adaptation: The low-rank adaptation might need adjustment as your dataset grows. The rank or alpha parameters might need tuning.
* PEFT limitations: While PEFT is great for efficient fine-tuning, it might struggle to capture more complex patterns as your dataset expands.
* Data quality and consistency: Ensure the new data aligns well with your original dataset and task objective.
* Limited model capacity: Despite Llama 3 8B being a large model, the quantization and adaptation techniques might limit its capacity to absorb significant new information without losing previous learning.
To address this:
* Try adjusting your LoRA parameters (rank, alpha) to allow for more capacity.
* Experiment with different learning rates or training durations.
* Consider using a higher-bit quantization if possible, though this might be constrained by Colab's resources.
* Ensure your new data is well preprocessed and consistent with your original dataset.
* You might need to balance the amount of new data added against the model's capacity to learn from it given the constraints.
Remember, fine-tuning large language models, especially with techniques like quantization and PEFT, often requires iterative experimentation to find the right balance. Keep monitoring your performance metrics and adjusting your approach accordingly. Good luck!
Scott
@ganeshkumara8501 2 months ago
@scott_ingram Are you interested in collaborating with my company regarding this fine-tuning?
@Hotboy-q7n 2 months ago
Did you play Chef Slowik in the movie "The Menu"?
@SilentEcho-d9q 4 months ago
Amazing video, thanks for the best explanation I've ever seen on YouTube. Could you also please make a video on how to fine-tune the Phi-3 model? 🙏
@scott_ingram 4 months ago
Great suggestion! I will look into that.
@andrew.derevo 3 months ago
Good stuff sir, thanks a lot 🙌
@andrew.derevo 3 months ago
Do you have any experience with fine-tuning this model on non-English data? Any suggestions for good multilingual open-source models? 🙏
@scott_ingram 2 months ago
I haven't personally fine-tuned models for languages other than English, but Llama 3 is one of the best available multilingual open source models.
@hellohey8088 2 months ago
I think calling the .for_inference method before training will interfere with training, so it seems like a bad idea. The training in the notebook converges without a problem for me on a T4 GPU if I just skip that step.
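For anyone hitting the same issue, here is a minimal sketch of the ordering being suggested, assuming the Unsloth API used in the notebook; `model`, `tokenizer`, and a configured `trainer` (e.g. trl's SFTTrainer) are assumed to already exist.

```python
# Sketch of the suggested ordering: do NOT call for_inference() before training.
# Assumes `model`, `tokenizer`, and a configured `trainer` are already set up.
from unsloth import FastLanguageModel

trainer.train()                          # 1. fine-tune first

FastLanguageModel.for_inference(model)   # 2. only then enable Unsloth's fast inference mode

inputs = tokenizer(["Question: ...\nAnswer:"], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```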
@madhudson1 4 months ago
Rather than using Google Colab + compute for training, what are your thoughts on using a local machine + GPU?
@guyvandenberg9297 4 months ago
Good question. I am about to try that. I think you need Ampere architecture on the GPU (A100 or RTX 3090). Scott, thanks for a great video.
@guyvandenberg9297 4 months ago
Ampere architecture for BF16 as opposed to F16 as per Scott's explanation in the video.
@scott_ingram 4 months ago
Thanks for your question! The notebook is designed to do the training on Colab, but you can run it locally for training if you have compatible hardware; I haven't tested it locally though. The RTX 3090 does support brain float. Install Python, then set up a virtual environment:
```bash
python3 -m venv venv
source venv/bin/activate
```
Next, install and start the Jupyter notebook service:
```bash
pip install jupyter
jupyter notebook
```
That will run a local Jupyter notebook service; open the notebook and select the Python 3 kernel. Then, test GPU availability:
```python
import torch

print(torch.cuda.is_available())
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
```
Here's how you would create a tensor with PyTorch on the RTX 3090 and tell it to use brain float:
```python
tensor = torch.randn(1024, 1024, dtype=torch.bfloat16, device=device)
```
Some cells in the notebook won't run correctly, such as the first cell that sets up text wrapping (this cell is not relevant for training); that's designed for Colab specifically. There may be other compatibility issues, but I haven't tested it running locally. This should get you started to see whether your GPU could potentially work. Let me know how it works out!
@PreparelikeJoseph 4 months ago
@scott_ingram I'd really like to get some AI agents running locally on a self-hosted model. I'm hoping two RTX 3090s can combine just via PCIe and load a full 70B model.
@EuSouAnonimoCara 4 months ago
Awesome content!
@azkarathore4355 3 months ago
Hi, I want to fine-tune Llama 3 for English-to-Urdu machine translation. Can you guide me on this? The dataset is OPUS-100.
@Danishkhan-ni5qf 3 months ago
Wow!
@jonassteinberg9598 2 months ago
"The A100 works well" -- you don't say, lol. Bruh, this is a $50K GPU that costs $2-3K/month to run.
@John-kn6df 1 month ago
Search for Google Colab. You can use their GPU there.
@ronaldmatovu5383 2 months ago
Thank you for this wonderful video, very educative. I have a question: if I have a dataset with questions and answers, but the answers are not written in proper English grammar (it's on Hugging Face as matovu-ronald/kisan_call_centre_groundnut_crop_QA_dataset_cleaned), what is the best way to make a model return an answer that is grammatically correct?
@scott_ingram 2 months ago
Thanks for your question! It sounds like you want to clean the dataset and then use it to fine-tune a model before you use it for inference. This is because if you use grammatically bad data for training, you're going to get a model that dutifully responds with bad grammar. Fixing the grammar from a model's response means that you may already have lost some context before you even get to post-processing, although it can be done programmatically. That approach has many problems. Instead, I would suggest using a large language model to clean the dataset:
* Preprocess the dataset: Go through each question-answer pair in the dataset. Use a model like GPT-3.5 or GPT-4 to correct the grammar of each answer.
* Design a prompt for grammar correction: Create a prompt that instructs the model to correct grammar while preserving meaning.
* Run batch cleaning: Process the dataset in batches to speed up the cleaning process.
Here's a short example of how you might implement this using the OpenAI API (with API key):
```python
import openai
import pandas as pd

openai.api_key = 'your-api-key'

def correct_grammar(text):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that corrects grammar while preserving the original meaning."},
            {"role": "user", "content": f"Correct the grammar of the following text, maintaining its original meaning: '{text}'"}
        ]
    )
    return response.choices[0].message['content']

# Load your dataset
df = pd.read_csv('your_dataset.csv')

# Apply grammar correction to the 'answer' column
df['corrected_answer'] = df['answer'].apply(correct_grammar)

# Save the corrected dataset
df.to_csv('corrected_dataset.csv', index=False)
```
Here are the advantages to this approach:
* Consistency: It ensures a uniform level of grammatical correctness across the dataset.
* Meaning preservation: By instructing the model to maintain the original meaning, you reduce the risk of altering the content.
* Efficiency: You can process a large number of entries quickly.
* Flexibility: You can adjust the prompt to focus on specific aspects of grammar or style as needed.
After cleaning the dataset this way, you can then use it with a fine-tuning process on the grammatically correct data (or whatever your use case prescribes). You should end up with a model that produces more grammatically correct output so you can avoid post-processing steps. Hope this helps!
Scott
@ronaldmatovu5383 2 months ago
@scott_ingram Thank you very much, Scott, this is very helpful.