
Fine-tune Mixtral 8x7B (MoE) on Custom Data - Step by Step Guide 

Prompt Engineering
161K subscribers
36K views

In this tutorial, we walk through, step by step, how to fine-tune Mixtral MoE from Mistral AI on your own dataset.
LINKS:
Colab (free T4 will not work): tinyurl.com/2hfk2fru
Mistral 7B fine-tune video: • Mistral: Easiest Way t...
‪@AI-Makerspace‬
Want to Follow:
🦾 Discord: / discord
▶️️ Subscribe: www.youtube.com/@engineerprom...
Want to Support:
☕ Buy me a Coffee: ko-fi.com/promptengineering
🔴 Support my work on Patreon: / promptengineering
Need Help?
📧 Business Contact: engineerprompt@gmail.com
💼Consulting: calendly.com/engineerprompt/c...
Join this channel to get access to perks:
/ @engineerprompt
Timestamps:
[00:00] Introduction
[00:57] Prerequisites and Tools
[01:52] Understanding the Dataset
[03:35] Data Formatting and Preparation
[06:16] Loading the Base Model
[09:55] Setting Up the Training Configuration
[13:22] Fine-Tuning the Model
[16:28] Evaluating the Model Performance
All Interesting Videos:
Everything LangChain: • LangChain
Everything LLM: • Large Language Models
Everything Midjourney: • MidJourney Tutorials
AI Image Generation: • AI Image Generation Tu...

Science

Published: 1 Jul 2024

Comments: 70
@MikewasG 6 months ago
Thank you for sharing, this is very helpful! Looking forward to the next videos!
@dev_navdeep 4 months ago
Kudos, really simple and direct explanation.
@WelcomeToMyLife888 6 months ago
Awesome content as usual! Thanks!
@engineerprompt 6 months ago
Thank you 😊
@AbhishekShivkumar-ti6ru 5 months ago
very nicely explained!
@jprobichaud 6 months ago
🎯 Key Takeaways for quick navigation:
00:00 🚀 Introduction to fine-tuning the Mixtral 8x7B model
- Overview of the video's purpose: fine-tuning the Mixtral 8x7B model from Mistral AI on a custom dataset.
- Mention of the popularity and potential of Mixtral 8x7B as a mixture-of-experts model.
- Emphasis on practical considerations for fine-tuning, such as VRAM requirements and dataset details.
01:28 🛠️ Installing required packages and dataset overview
- Installation of the necessary packages: Transformers, TRL, Accelerate, PyTorch, and bitsandbytes.
- Discussion of using the MosaicML Instruct-v3 dataset for fine-tuning.
- Overview of the dataset structure, splits, and sources.
03:45 📝 Formatting data for fine-tuning Mixtral 8x7B
- Explanation of the prompt template for fine-tuning, specific to the Mixtral 8x7B Instruct version.
- Discussion of rearranging the data to make it more challenging by creating instructions from the provided text.
- Demonstration of a function to reformat the initial data into the desired prompt template.
06:28 🧩 Loading the base model and configuring it for fine-tuning
- Acknowledgment of the source of the notebook and clarification that the base version is used.
- Setting configurations, loading the model and tokenizer, and using Flash Attention.
- Explanation of the importance of setting up configurations for a smooth fine-tuning process.
08:18 🔄 Checking base model responses before fine-tuning
- Use of a function to check responses from the base model before any fine-tuning.
- Illustration of the base model's behavior when generating responses to a given prompt.
- Recognition that the base model tends to follow next-word prediction rather than explicit instructions.
10:06 📏 Determining the max sequence length for fine-tuning
- Explanation of the importance of max sequence length in fine-tuning Mixtral 8x7B.
- Presentation of a code snippet to analyze the distribution of sequence lengths in the dataset.
- Emphasis on selecting a max sequence length that covers the majority of examples.
12:20 🧠 Adding adapters with LoRA for fine-tuning
- Overview of the Mixtral 8x7B architecture, focusing on the linear layers for adding adapters.
- Introduction to the LoRA configuration for attaching adapters to specific layers.
- Demonstration of setting hyperparameters and using the TRL package for supervised fine-tuning.
14:36 🚥 Setting up the trainer and initiating fine-tuning
- Verification of multiple GPUs for parallelization during model training.
- Definition of the output directory and selection of training epochs or steps.
- Importance of configuring the trainer, including considerations for max sequence length.
16:50 📈 Analyzing fine-tuning results and storing the model
- Presentation of training and validation loss graphs, indicating a gradual decrease.
- Acknowledgment that longer training might be needed for better model performance.
- Demonstration of storing the fine-tuned model weights locally and pushing them to a Hugging Face repository.
17:46 🔄 Testing fine-tuned model responses
- Use of the fine-tuned model to generate responses to a given prompt.
- Comparison of responses before and after fine-tuning, showcasing improved adherence to instructions.
- Acknowledgment that further training could enhance the model's performance.
Made with HARPA AI
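For the "loading the base model" step summarized above, here is a hedged sketch of what a 4-bit (QLoRA-style) load with Flash Attention might look like. The checkpoint id, quantization settings, and dtype are illustrative assumptions rather than the exact values used in the notebook, and parameter names can vary with your transformers version.

```python
# Minimal sketch: load the Mixtral 8x7B base model in 4-bit for QLoRA-style fine-tuning.
# Assumes bitsandbytes and flash-attn are installed; values are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-v0.1"  # base (non-instruct) checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4-bit on load
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the matmuls in bf16
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                        # spread layers across available GPUs
    attn_implementation="flash_attention_2",  # requires flash-attn to be installed
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token     # Mixtral has no pad token by default
tokenizer.padding_side = "right"
```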
@ahmedmechergui8680 6 months ago
Thanks for the video 😃 I just have a question: is it possible to use the model through an API and also provide the source files for the data along with the response?
@user-hc5os4fs5k 6 months ago
Can you also make a video on fine-tuning multimodal models like LLaVA and CogVLM?
@HarmeetSingh-ry6fm 5 months ago
Great video! Just one question: can we use the fine-tuned model as a pickle file?
@AI-Makerspace 5 months ago
Thanks for the tag @Prompt Engineering! What else is your audience requesting the most these days? Would love to find ways to create some value for them together!
@engineerprompt 5 months ago
Thanks for the amazing work you guys are doing! really appreciate it. I think deployment is a topic that will be really valuable to my audience. Let's explore how to collaborate.
@AI-Makerspace 5 months ago
@@engineerprompt absolutely! We started delving deeper into deployment with LangServe and vLLM events in recent weeks. We'll connect to figure out next steps!
@kaio0777 6 months ago
Can you make this for home computer use, training on my personal data and teaching it to use tools on the system and online?
@user-nl4ry3wb1x 2 months ago
3:37 format · 4:15 follow a different format · 4:26 indicate the end of user input · 4:33 special token indicating the end of the model response · 4:39 you need to provide your data in this format · 5:08 def create_prompt · 5:31 system message · 6:16 load the base model
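For readers skimming these timestamps, here is a hedged sketch of the prompt formatting being described: `[/INST]` closes the user turn and the `</s>` token marks the end of the model response. The system message text and the "prompt"/"response" field names are assumptions (and note that the video actually swaps the fields to generate instructions from the provided text).

```python
# Hedged sketch of a create_prompt function for the Mixtral instruct template.
SYSTEM_MESSAGE = "Use the provided input to answer the question."  # assumed wording

def create_prompt(sample):
    return (
        "<s>[INST] "
        f"{SYSTEM_MESSAGE}\n\n{sample['prompt']} "
        "[/INST] "
        f"{sample['response']}</s>"
    )

example = {
    "prompt": "What is a mixture-of-experts model?",
    "response": "A model that routes each token to a small subset of expert networks.",
}
print(create_prompt(example))
```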
@alexxx4434 6 months ago
Thanks for the guide! How do you continue the fine-tuning process in a case like this? Can you load previous work (LoRA) and carry on, or do you need to restart?
@engineerprompt 6 months ago
I think you can do that by storing checkpoints during training.
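A minimal sketch of what resuming from a checkpoint can look like with the Hugging Face trainer; `trainer`, `base_model`, and the paths are placeholders, not the exact objects from the video.

```python
# Resume from the most recent checkpoint written to output_dir.
trainer.train(resume_from_checkpoint=True)

# Or resume from a specific checkpoint directory:
# trainer.train(resume_from_checkpoint="results/checkpoint-500")

# For LoRA runs you can also reload a saved adapter onto the base model and keep training:
# from peft import PeftModel
# model = PeftModel.from_pretrained(base_model, "path/to/adapter", is_trainable=True)
```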
@joaops4165 6 months ago
Could you make a tutorial showing how to convert a model to GGML format?
@sysadmin9396 6 months ago
Can I use this to train a model to answer questions from a list of PDFs?
@lukeskywalker7029 5 months ago
I'm sceptical that this is actually effectively training the Mixtral MoE model and not making it worse!
@caiyu538 6 months ago
Great
@rishabhkumar4443 6 months ago
How can I use a generative model to manipulate the content of my website? E.g., showing a response from my site based on a prompt given by the user.
@researchforumonline 5 months ago
Thanks, what is the cost to do this? Server cost?
@garyhutson6270 5 months ago
What were your VM instance specs? Is it struggling on an A100?
@Tiberiu255 6 months ago
Why are you using packing in the SFTTrainer if you just said you're going to pad the examples?
@big_sock_bully3461 5 months ago
Can you explain?
@user-ed2wf6wr5g 3 months ago
So with two 3090s this should work? And what about using multiple different GPUs for training? For example, I have one 3090 Ti 24 GB and one 4060 8 GB.
@DistortedV12 6 months ago
Awesome man, any idea how to get this running on a Colab GPU, or how to bring the inference cost down?
@engineerprompt 6 months ago
Probably no way at the moment to run it on a Colab GPU, but you can look at the 2-bit quantized version. If you are running this model as part of a production pipeline, I would suggest looking at API providers such as Together AI. They have really good pricing on it.
@Ai-Marshal 4 months ago
That's a great video, thanks for sharing. After pushing the model to Hugging Face, how do I host it independently on RunPod using vLLM? When I try to do that, it gives me an error. I've searched through a lot of videos and articles, but to no avail so far.
@FunkyByteAcademy 4 months ago
did you come right?
@Akshatgiri 4 months ago
I've noticed that Mixtral 8x7B-Instruct (and other Mistral models) constantly repeats part of the system prompt. Have you noticed this / found a fix for it?
@shinygoomy2460 4 months ago
How do you format a prompt that has multiple requests and responses within the same context?
@user-cu3dr6pt7s 3 months ago
Could you please share the requirements.txt? I am having version conflicts despite using an A100 GPU!
@abdeldjalilmouaz 3 months ago
Does this require Colab Pro to work?
@divyagarh 1 month ago
Great video! Could you please consider training and deploying it in SageMaker?
@engineerprompt 1 month ago
I am going to create a video on deployment soon
@DistortedV12 6 months ago
Are you fine-tuning the Mixtral Instruct version they just released or the base model?
@engineerprompt 6 months ago
In this video, just the base version
@user-rm8hx5ih4q 5 months ago
At 5:58, why is sample["response"] given as the input and sample["prompt"] given as the response?
@VerdonTrigance 4 months ago
Hi, thanks for this step-by-step guide, but in case we want the LLM to learn something new about our domain (let's say the book Lord of the Rings) and we later want to ask the model open questions about this book (like "Where does Frodo get his sword?"), what should we do? We definitely cannot prepare a dataset in the form of QnA, so it should be self-supervised training. But I have never seen examples of doing this and I can't imagine how it's supposed to be done. Is it even possible? It looks like we should start from the base model, fine-tune it somehow on our book, and later apply instruct fine-tuning on top of it, right? But in that case someone still has to prepare the QnA? I'm frustrated.
@xXCookieXx98 3 months ago
Your use case sounds like a classic RAG one. It's not necessary to fine-tune for that. Although a fine-tuned model + RAG would probably give even better results, the effort here doesn't seem worth it. The video "Building Corrective RAG from scratch with open-source, local LLMs" from LangChain (ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-E2shqsYwxck.html) might help you; it also includes a web-search option in case the provided context isn't sufficient, which should work pretty well with things like popular books. So it's not limited to that and can be used in basically any domain. But you could also just build a RAG app without that. I would suggest a combination of a MultiQueryRetriever and a ParentDocumentRetriever for retrieving your context. Nevertheless, if you still want to fine-tune: from what I have learned so far, it is possible to create datasets using LLMs, e.g. you prompt an instruct LLM to create questions based on context chunks and then use those questions and chunks to create answers. You will find similar methods on this channel, e.g. "Automate dataset creation for Llama-2 with GPT-4".
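A hedged sketch of the retriever combination suggested in this reply, assuming a recent LangChain install; module paths shift between releases, and `docs` (the book split into Documents) and `llm` are assumed to be defined elsewhere.

```python
# Sketch: ParentDocumentRetriever for context, wrapped in a MultiQueryRetriever.
from langchain.retrievers import ParentDocumentRetriever
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma(collection_name="book", embedding_function=embeddings)

# Small chunks are embedded for search; the larger parent chunks are what the LLM sees.
parent_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=InMemoryStore(),
    child_splitter=RecursiveCharacterTextSplitter(chunk_size=400),
    parent_splitter=RecursiveCharacterTextSplitter(chunk_size=2000),
)
parent_retriever.add_documents(docs)  # `docs` assumed to be prepared elsewhere

# MultiQueryRetriever rewrites the question several ways and merges the hits.
retriever = MultiQueryRetriever.from_llm(retriever=parent_retriever, llm=llm)
context_docs = retriever.invoke("Where does Frodo get his sword?")
```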
@kanshkansh6504 3 months ago
❤👍🏼
@lostInSocialMedia. 6 months ago
Can you fine-tune uncensored models of this with Gemini Pro AI?
@PotatoMagnet 6 months ago
The base model of Mistral is uncensored, but you can't fine-tune one model with another model. They are different architectures; you can't even merge or fine-tune between the same model at different parameter counts, like 7B and 13B, so forget completely different models.
@LeoAr37 6 months ago
Can't we train the quantized version on a smaller GPU instead of training the full model?
@engineerprompt 6 months ago
Even training the quantized version of the full model will need a powerful GPU. That's why LoRA is used to add small extra adapter layers that are trained instead of the actual model weights (see the sketch below). Hope this helps.
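A hedged sketch of attaching LoRA adapters with peft; the rank, alpha, and target module names are illustrative assumptions, not the exact values used in the video.

```python
# Attach LoRA adapters to the (quantized) base model; only the adapters get trained.
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)  # required prep when the base is loaded in 4-/8-bit

lora_config = LoraConfig(
    r=16,             # adapter rank (illustrative)
    lora_alpha=32,    # scaling factor
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # Linear layers in Mixtral's attention and expert blocks (names are assumptions):
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "w1", "w2", "w3"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```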
@electricskies1707 6 months ago
Can you clarify: one epoch would be one run over the full data (34,333 steps of your trimmed data). Why would you run this for 2 epochs; does going over the data twice improve it? Also, how did you determine that 32 was a good batch size for this data size? (Is this about 0.9% of the data?)
@LeoAr37 6 months ago
I think the companies that trained big LLMs usually used 2-3 epochs
@engineerprompt 6 months ago
Batch size determines how much data is fed to your model at once; 32 is the max I could do on the available hardware. Usually you will see it set much lower. Regarding the epochs, you are right: in one epoch the model sees each example once. If you have a small amount of data, you might want to go over multiple epochs so the model can actually learn from the data, but you need to be careful because the model can also overfit. For a large amount of data (billions or trillions of tokens) it's very expensive and time-consuming to do several epochs over the data, which is why you mostly see models trained for only one or two epochs. Hope this helps.
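As a hedged illustration of the knobs mentioned in this exchange (batch size, effective batch size via gradient accumulation, epochs vs. a fixed number of steps): the values below are placeholders, not the video's exact settings, and argument names can shift slightly between transformers versions.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="mixtral-finetune",
    per_device_train_batch_size=32,   # the largest batch that fit on the available GPU
    gradient_accumulation_steps=1,    # raise this if you must shrink the per-device batch
    num_train_epochs=2,               # small datasets: a few passes, but watch for overfitting
    # max_steps=500,                  # alternatively, cap training by steps instead of epochs
    learning_rate=2e-4,
    logging_steps=10,
    save_strategy="steps",
    save_steps=100,
    bf16=True,
)
```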
@AIEntusiast_ 4 months ago
I wish someone made a video going from collecting data (e.g., PDFs) to converting it into a working dataset that can be used to train a model. Everyone is using Hugging Face models and just retraining another LLM.
@protimaranipaul7107 4 months ago
Thank you for sharing such a wonderful video! Waiting for a video that discusses fine-tuning so that we can use more than 32k tokens. Have you or anyone else worked with the following?
0) How do we measure performance after fine-tuning? Did the models perform well? Perplexity?
1) JSON files? Creating graphs to store the context?
2) And/or large CSV/SQL files? (Llama-generated SQL code is not working well.)
3) Any image/diffusion models?
Appreciate it!
@pallavggupta 6 months ago
Hi, I am trying to build an organisation-level AI trained on my company data. I would like to know how I can create a dataset for my data to be trained on Mistral AI. I was unable to find any tutorial on how to create a dataset from large amounts of data.
@conscious_yogi 5 months ago
Did you find a solution for this?
@nishhaaann 5 months ago
Looking for the same thing @@conscious_yogi
@user-yd3zk4hb1o 6 months ago
So we can't run this in a Colab or Kaggle notebook?
@ilianos 6 months ago
In the video description it says no (not on a T4).
@luciolrv 6 months ago
I could not run it on Colab's A100. It complains of a lack of memory, though not by much: actually less than 1 GB. Colab's "copilot" gives some suggestions such as reducing the batch size or the max_split_size_mb parameter, but that does not reduce it enough. Any ideas? Good notebook.
@jonjino 6 months ago
​@@luciolrv It complains of less than 1GB of memory, but that's because it's loading the model a bit at a time so the error message isn't accurate. Kaggle doesn't offer better GPU's either. You'll need to setup a VM with an A100 80GB or H100. Unfortunately you'll probably just have to go through the hassle of setting up a VM with one of those GPU's via GCP or AWS.
@scortexfire 3 months ago
How do I fine-tune without a prompt and instruction? I basically want the model to "know" about a thousand very recent web articles.
@engineerprompt 3 months ago
In this case, you probably want to further pretrain the base model on your dataset (you don't need the prompt & instruction format) and then fine-tune it on an instruction dataset. Or just use RAG. A sketch of the pretraining option is below.
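A hedged sketch of the "further pretrain on raw text" option using TRL's SFTTrainer on plain article text; the dataset path, field name, and `model`/`training_args` objects are assumptions, and the exact SFTTrainer signature varies across trl versions.

```python
# Continued pretraining on raw text: no prompt/response template, just document chunks.
from datasets import load_dataset
from trl import SFTTrainer

raw = load_dataset("text", data_files={"train": "articles/*.txt"})  # placeholder path

trainer = SFTTrainer(
    model=model,                  # the (quantized) base model, e.g. with LoRA adapters attached
    train_dataset=raw["train"],
    dataset_text_field="text",    # each row is just a chunk of article text
    max_seq_length=1024,
    args=training_args,
)
trainer.train()
```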
@user-ig2og2yq3b 4 months ago
Please let me know how to create fixed forms with the structure below, using a special command to the LLM:
Give me a score out of 4 (based on the TOEFL rubric) without any explanation, just display the score.
General Description: / Topic Development: / Language Use: / Delivery: / Overall Score:
Identify the number of grammatical and vocabulary errors, providing a sentence-by-sentence breakdown.
'Sentence 1: Errors: Grammar: Vocabulary: Recommend effective academic vocabulary and grammar:'
'Sentence 2: Errors: Grammar: Vocabulary: Recommend effective academic vocabulary and grammar:'
.......
@tomski2671 6 months ago
I think you can rent an H100 for $5/hour. So this would cost about $7
@hemeleh8683 6 months ago
where?
@kunalr_ai 6 months ago
Where will you get 64 GB of VRAM from? No idea which dataset this was fine-tuned on, brother. This video is of no use. You got your money from the views; how will we make ours?
@bashafaris5908 6 months ago
🥹‼️ I am a student with no budget at all, but I'm interested in training an LLM with my own dataset. What are the cost-effective ways?
@jonjino 6 months ago
Get a 3B parameter model and play around with that. This can probably fit on the free T4 GPU in Google Colab since it's much smaller.
@matbeedotcom 6 months ago
How much VRAM is necessary?
@engineerprompt 6 months ago
About 45GB
@matbeedotcom 6 months ago
@@engineerprompt Do you suggest fine-tuning the base model first, and then further fine-tuning with Q&A instruct-format data?