
How to Create Custom Datasets To Train Llama-2 

Prompt Engineering · 172K subscribers
104K views

Published: 27 Sep 2024

Comments: 109
@chuckwashington6663 · 1 year ago
Thanks, this gives me exactly what I needed to understand how to create a dataset for fine-tuning. Most of the other videos skip over the details of the formatting and other parameters that go into creating your own dataset. Thanks again!
@engineerprompt · 1 year ago
Thank you for your support. I'm glad it was helpful 😊
@oliversilverstein1221 · 1 year ago
FYI, you're the man. I don't know why it was so hard to find a good training pipeline; I literally went through all the libs and no one mentioned autotrain-advanced lol
@engineerprompt · 1 year ago
Thank you!
@pareak · 8 months ago
Thank you so much! This gives me a really good basis for starting to fine-tune my own model, because in the end the model will only be as good as the training set.
@samcavalera9489 · 1 year ago
You're an AI champion. Thanks for the fine-tuning lectures 🙏🙏🙏
@engineerprompt · 1 year ago
Thank you for your kind words!
@samcavalera9489 · 1 year ago
@engineerprompt You're welcome, brother!
@drbinxy9433 · 11 months ago
You are a legend, my man.
@engineerprompt · 11 months ago
🙏🙏🙏
@umeshtiwari9249 · 1 year ago
Thanks. You explained the concept very nicely. It boosts knowledge in an area people are usually afraid to grasp, but the way you explained it made it look very easy to me. Today I gained the ability to fine-tune the model myself. Thanks a lot, sir. Looking forward to more advanced topics from you.
@engineerprompt · 1 year ago
Thanks and welcome!
@LeonvanBokhorst · 1 year ago
Wow. Thanks again, sir 🙏
@abhijitbarman · 1 year ago
@Prompt Engineering Wow, exactly what I was looking for. I have another request: can you please make a video on Prompt-Tuning/P-Tuning, which is also a PEFT technique?
@Phoenix-fr9ic · 11 months ago
Can I fine-tune Llama 2 for PDF-to-question-answer generation?
@AGAsnow · 1 year ago
How could I limit it? For example, if I train it with several relevant paragraphs about The Little Prince novel, how do I limit it so that it only answers questions within the context of The Little Prince?
@derejehinsermu6928 · 1 year ago
Thank you man, that is exactly what I am looking for.
@engineerprompt · 1 year ago
Glad I could help
@ishaanshettigar1554 · 1 year ago
How does this differ if I'm looking to fine-tune Llama 2 7B code instruct?
@vitocorleon6753 · 1 year ago
I need help, please. I just want to be pointed in the right direction, since I'm new to this and couldn't really find a proper guide summarizing the steps for what I want to accomplish. I want to integrate a Llama 2 70B chatbot into my website. I have no idea where to start. I looked into setting up the environment on one of my cloud servers (it has to be private). Now I'm looking into training/fine-tuning the chat model using data from our DBs (it's not clear to me here, but I assume it involves two steps: first, get the data into CSV format, since that's easier for me; second, format it in the Alpaca or OpenAssistant format). After that, the result should be a deployment-ready model? Just bullet points, I'd highly appreciate that.
@muhannadobeidat · 5 months ago
Thanks for the video. Two things, please: 1. When you use the autotrain package, all the details are hidden and one cannot see what is being done and in what exact steps; I would suggest a video on that, even with the same example. 2. It is not clear to me what the data vs. the label is that gets fed into the model during training, what the loss function is, how it is calculated, etc.
@engineerprompt · 5 months ago
I agree with you. Autotrain abstracts a lot of details, but if you are interested in a more detailed setup, I would recommend looking for the "fine-tune" videos on my channel. Here is one example: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-lCZRwrRvrWg.html
@TheCloudShepherd · 10 months ago
Daaaamn bro that's brilliant
@engineerprompt · 10 months ago
Thank you. More to come on fine-tuning :)
@xiangyao9192 · 1 year ago
I have a question. Why don't we use the conversation format given by Llama 2, which contains special tokens ([INST], <<SYS>>, something like that)? Thanks
@engineerprompt · 1 year ago
You will need to use that if you are using the instruct/chat version. Since I was fine-tuning the base version, you can define your own format. Hope this helps.
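For reference, a sketch of the two formats being contrasted here. The chat template follows Meta's published Llama-2 chat convention; the custom format is just one arbitrary choice, not a requirement:

```python
# Meta's published Llama-2 *chat* template (needed only for the
# chat/instruct weights, where these special tokens matter).
llama2_chat = (
    "<s>[INST] <<SYS>>\n"
    "{system_prompt}\n"
    "<</SYS>>\n\n"
    "{user_message} [/INST] {model_answer} </s>"
)

# A self-defined format for fine-tuning the *base* model, as in the video.
custom_format = "### Human: {user_message} ### Assistant: {model_answer}"
```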
@stickmanland · 1 year ago
Thanks for the informative video. I am wondering: Is there a way to do this, but with local LLMs?
@DemoGPT · 1 year ago
Kudos on the excellent video! Your hard work is appreciated. Could we expect a video about DemoGPT from you?
@haouarino · 1 year ago
Thank you very much for the video. In the case of plain text, how could the dataset be formatted?
@vbywrde · 9 months ago
Very coherent and well explained, thank you kindly. I'm also curious whether you have any advice about creating a dataset that would allow me to fine-tune my model on my database schema. What I'd like to do is run my model locally, ask it to interact with my database, and have it do so in a smooth and natural manner. I'm curious how one would structure a database schema as a dataset for fine-tuning. Any recommendations or advice would be greatly appreciated. Thanks again! Great videos!
@am0x01 · 8 months ago
Thanks for the great service to the community. In my experiment, the config.json is not created; is that normal?
@HarishRaoS · 7 months ago
Thanks for this video.
@Shahawir · 1 year ago
I wonder whether it is possible to train LLaMA on data where the inputs are numbers and categorical variables (strings) of fixed length, to predict a time series of fixed length. Does anyone know if this is possible? And how do I fine-tune the model if I have it locally?
@denissandmann · 1 month ago
And the model comes up with new prompts on its own?
@LeKhang98 · 5 months ago
Thank you very much. Is 300 rows a good number for training? I know it depends on many factors, but I don't know how to tell whether my dataset is bad or just too small.
@rahulrajpvr7d · 1 year ago
Thank you so much, brother ❤❤
@brunapupoo4809 · 5 months ago
When I try to run the command in the terminal it gives an error: autotrain [] llm: error: the following arguments are required: --project-name
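For anyone hitting that error, a sketch of a complete invocation is below. Exact flag names vary between autotrain-advanced releases (older ones use underscores, e.g. --project_name), so treat this as illustrative and check `autotrain llm --help` on your installed version; the model and project names here are placeholders:

```bash
# Illustrative only; verify flag spellings against your autotrain version.
autotrain llm --train \
  --project-name my-llama2-finetune \
  --model meta-llama/Llama-2-7b-hf \
  --data-path . \
  --text-column text \
  --peft \
  --lr 2e-4 \
  --batch-size 2 \
  --epochs 3 \
  --trainer sft
```

Note that --data-path points at the folder containing train.csv, not at the file itself.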
@gamingisnotacrime6711 · 1 year ago
So if we are fine-tuning the chat model, can we use the same format as above (### Human: ..., ### Assistant: ...)?
@engineerprompt · 1 year ago
Yes
@marcoabk · 6 months ago
Is there a way to do it with the original Llama 2 13B that's already on my hard drive?
@MohamedElGhazi-ek6vp · 1 year ago
It's very helpful, thanks. Is it the same process to create data from multiple documents for a question-answering model?
@engineerprompt · 1 year ago
Yes, this will work
@chiachinghsieh2150 · 1 year ago
Thanks SO MUCH for sharing this! Really helpful. I am also trying to train Llama 2 on my own data, but I am facing a problem deploying the model. I trained the model on AWS SageMaker and stored it in an S3 bucket. When I try to deploy the model and feed it a prompt, I keep getting errors. My input follows the rule like ### Human ... ### Assistant, but I still get errors. I wonder if I used the wrong tokenizer, but I couldn't use AutoTokenizer.from_pretrained() in SageMaker. I wonder if you have some advice!!
@jamesljl · 1 year ago
Would you please give a sample of what the CSV file looks like? Thanks a lot!
@engineerprompt · 1 year ago
Let me see what I can do; the format is shown in the video.
@Univers314 · 1 year ago
Can ChatGPT 3.5 generate files?
@Phoenix-fr9ic · 11 months ago
Can I use this technique for a document-based question-answer generation dataset?
@lx-l-xl · 4 months ago
Hey, did you find any solution for the Q&A model?
@georgekokkinakis7288 · 1 year ago
I am facing the following problem: the model gets uploaded to the Hugging Face repo but without a config.json file. Any solutions? Also, can the fine-tuned model run on the free Google Colab, or should we shard it?
@engineerprompt · 1 year ago
Are you fine-tuning it locally or on Google Colab? I am doing it locally without any issues.
@georgekokkinakis7288 · 1 year ago
@engineerprompt I am fine-tuning it on Google Colab. In a post I made on your other video about fine-tuning Llama 2, you mentioned that it seems to be a problem with the free tier of Colab. I hope 🙏 you will find a fix, because not everyone owns a GPU.
@techmontc8360 · 1 year ago
Hey sir, thank you for the great tutorial. I have a question: it seems that in this training you didn't define the "--model_max_length" parameter. Is there any difference if you define this parameter or not?
@aiwesee · 1 year ago
For fine-tuning large language models (llama-2-13b-chat), what should the format (.txt/.json/.csv) and structure of the training dataset be (e.g., an Excel or Docs file, prompt and response, or instruction and output)? And how should a tabular dataset be prepared or organized for training?
@mohammedmujtabaahmed490 · 6 months ago
Hey, did you find the answer to your question? If yes, please tell me what format the dataset should be in for fine-tuning.
@AdinathKale-z7w · 27 days ago
On which GPU are you training the model?
@islamicinterestofficial · 1 year ago
Getting the error: FileNotFoundError: Couldn't find a dataset script at /content/train.csv/train.csv.py or any data file in the same directory. Even though I'm running the autotrain command in the same directory where my train.csv file is present. I'm running on Colab, btw.
@islamicinterestofficial · 1 year ago
The solution is to just provide the path of the folder where the CSV file is present, but don't write the CSV file name...
@emrahe468 · 1 year ago
How or why do we decide on ### Human:? I see lots of variations in different videos: some use ->:, others use ### Input:, etc.
@engineerprompt · 1 year ago
It's really up to you how you want to define the format. Some models accept instructions along with the user input, so you really get to decide based on your application.
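As an illustration of that freedom, here is a minimal sketch of building the single "text" column from instruction/response pairs. The delimiter strings, column names, and sample data are all arbitrary choices for this example, not a required convention:

```python
import pandas as pd

# Hypothetical instruction/response pairs; replace with your own data.
pairs = [
    {"instruction": "What is a custom dataset?",
     "response": "A dataset you assemble yourself for fine-tuning."},
]

df = pd.DataFrame(pairs)
# Concatenate each pair into one training string; whatever delimiters
# you pick here must be reused verbatim at inference time.
df["text"] = ("### Human: " + df["instruction"]
              + " ### Assistant: " + df["response"])
df[["text"]].to_csv("train.csv", index=False)
```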
@ruizhou1243 · 8 months ago
There is no code snippet showing how it works? I don't understand the meaning.
@oxytic · 1 year ago
Great bro 👍
@JJ-yw3ug · 1 year ago
I would like to ask: is the RTX 4090 sufficient to fine-tune the 13B model, or can it only fine-tune the 7B model? I've noticed that the 13B model with default settings doesn't pose a problem for the RTX 4090 in terms of parameter handling, but I'm uncertain whether a single RTX 4090 is enough if fine-tuning on data is required.
@engineerprompt · 1 year ago
I don't think you can fine-tune 13B with 24 GB of VRAM. Your best bet will be 7B.
@prestonmccauley43 · 1 year ago
I did something similar, working on my test dataset to understand this a bit better. I created a Python script to merge all the datasets together. I still seem to be struggling to grasp the core training approach using SFT and which models work with what. It's like the last puzzle piece is missing.
@LeoNux-um7tg · 8 months ago
Can I use my own files as datasets? I'm planning to train a model that can remind me of Linux commands and their options, so I don't have to keep reading the manuals every time I use commands I don't use regularly.
@medicationrefill · 1 year ago
Can I train my own LLM using data generated by ChatGPT, if the model is intended for academic/commercial use?
@engineerprompt · 1 year ago
You probably can't use it for commercial purposes, but most of the open-source models out there (at least the initial versions) were trained on data generated by ChatGPT.
@LilibethCalva · 11 months ago
Hi, I'm new at this. What I'm trying to do is a custom chat with Llama 2. For example, I have data from a company X that is about 15 columns by 300 rows. So far it responds to me, although its answers are still illogical. Anyway, what I want to know is whether I should create the {human, assistant} text column for each column or possible question, and how the data should be prepared to train the model. Please, can someone guide me on this?
@fups8222 · 1 year ago
Why can't you fine-tune the chat model of Llama 2? The text completion of the fine-tuned model I'm using gives terrible results from my exact instructions in the prompt. I am using Puffin 13B, but when I feed it exact instructions, it just cannot do them the way I am prompting it to.
@MichealAngeloArts · 1 year ago
Do you need the 3 columns (Concept, Description, text) in train.csv, or is just 1 column (text) enough?
@engineerprompt · 1 year ago
Just one column
@godataprof · 1 year ago
Just the last text column.
@nutCaseBUTTERFLY · 1 year ago
@engineerprompt So I watched the video 5 times, and it is still not clear which columns go where. You didn't even bother to open the .csv file so that we could see the schema. But you did show us the log file!
@Enju-Aihara · 1 year ago
@engineerprompt I wanna know too.
@filippobistaffa5913 · 1 year ago
You just need a "text" column present in your train.csv file; the other columns will be ignored. If you want, you can change which column is used with --text_column column_name.
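To make the schema concrete, a hypothetical train.csv could be as small as this (the row content is invented for illustration; any extra columns are simply ignored):

```
text
"### Human: What is a custom dataset? ### Assistant: A dataset you assemble yourself for fine-tuning."
```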
@md.rakibulhaque2262 · 10 months ago
Getting this error with AutoModelForCausalLM (from transformers import AutoModelForCausalLM): "MJ_Prompts does not appear to have a file named config.json." Instead, I had to use "from peft import AutoPeftModelForCausalLM" and run inference from the model with AutoPeftModelForCausalLM. And one more question: did we train an adapter model here? Please tell me how I can solve this. I am using free Colab.
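For readers in the same situation, here is a sketch of that workaround, plus merging the adapter so that a standalone model with a config.json gets written. The directory names are placeholders:

```python
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Load the adapter checkpoint; the base model is resolved automatically
# from the adapter's metadata (adapter_config.json).
model = AutoPeftModelForCausalLM.from_pretrained(
    "path/to/adapter-output", torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("path/to/adapter-output")

# Merge the LoRA weights into the base model and save a standalone copy;
# the saved directory will contain a config.json.
merged = model.merge_and_unload()
merged.save_pretrained("path/to/merged-model")
tokenizer.save_pretrained("path/to/merged-model")
```

And yes: training with the PEFT/LoRA option produces an adapter rather than full model weights, which is why the peft loader is needed in the first place.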
@dmitrymalyshev3810 · 1 year ago
So, if you have the same problem as me: in Google Colab this code will not work, because on free Google Colab the script doesn't finish the job and doesn't create config.json, so you will have problems. I think that is also the reason why the script didn't push my model to the Hugging Face Hub. But your work is great, so thanks for that.
@georgekokkinakis7288 · 1 year ago
I am also facing the same problem. Actually, in my case the model is uploaded to the Hugging Face repo, but it is missing the config.json file. Any solutions?
@valthrudnir · 1 year ago
Hello, thank you for sharing. Is this method applicable to GGML/GPTQ models from, say, TheBloke's repo, for example 'Firefly Llama2 13B v1.2 - GPTQ', or would the training parameters need to be adjusted?
@engineerprompt · 1 year ago
I haven't tried this with quantized models, so I am not sure how that will behave. One thing to keep in mind is that you want to use the "base" model, not the chat version, for best results. I will look at it and see if it can be done.
@milesbarn · 7 months ago
According to OpenAI's terms, it is not allowed to use GPT-4's output, any of it, even if the data fed into it is yours, to train models other than OpenAI's.
@phat80 · 4 months ago
Who cares? Who will know? 😅
@sauravmukherjeecom · 1 year ago
Is it possible to directly fine-tune GPTQ models?
@stephenf3838 · 1 year ago
QLoRA
@srikrishnavamsi1470 · 11 months ago
How can I contact you, sir?
@engineerprompt · 11 months ago
Check out my email
@manavshah9062 · 1 year ago
Hey, I fine-tuned the model using my own dataset, but when running the bash command the model somehow did not get uploaded to Hugging Face. After the training completed, I zipped the model and downloaded it. Is there a way I can upload this fine-tuned model to Hugging Face now?
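One way to do that after the fact is with the huggingface_hub client; a minimal sketch, assuming you are logged in (via `huggingface-cli login` or a token) and with placeholder repo and folder names:

```python
from huggingface_hub import HfApi, create_repo

repo_id = "your-username/my-finetuned-llama2"  # placeholder
create_repo(repo_id, exist_ok=True)

api = HfApi()
# Upload the unzipped model output folder as-is.
api.upload_folder(folder_path="path/to/unzipped-model", repo_id=repo_id)
```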
@YamengLi-g4l · 11 months ago
How can I build my labels from the input and output? I found that Llama 2 pieces the input and output together; can my labels match the input_ids?
@SafetyLabsInc_ca · 11 months ago
Datasets are key for fine-tuning. This is a great video!
@engineerprompt · 11 months ago
Yes! Thank you!
@Mysteriousworld1902 · 3 months ago
I'm getting a "your session crashed after using all available RAM" error.
@lrkx_ · 1 year ago
If you don't mind sharing, what's the performance of a Mac like when fine-tuning? I'm quite keen to see how long it takes to fine-tune a 7B vs. a 13B parameter model on a consumer machine with a small/medium-sized dataset. Thanks for the tutorial, very helpful!
@vedchaudhary1597 · 1 year ago
7B with 4-bit quantization takes about 12.9 GB of GPU RAM; I don't think a Mac will be able to run it locally.
@xiangye524 · 11 months ago
Getting the error: ValueError: Batch does not contain any data (`None`). At the end of all iterable data available before expected stop iteration. Does anyone know what the issue is? Running on Google Colab, thanks.
@AADITKASHYAPDPSN-STD · 5 months ago
Sir, I don't have ChatGPT Plus. Are there any alternatives?
@engineerprompt · 5 months ago
Look into Groq; it's a free API (at the moment).
@AADITKASHYAPDPSN-STD · 5 months ago
@engineerprompt Thank you so much, sir. But I decided I'd do it without any LLMs, so I wrote my own code using Python and pandas. If you want, I could share the code with you?
@nqaiser · 11 months ago
What hardware specifications would be needed to fine-tune a 70B model? Once the fine-tuning is complete, can you run the model using oobabooga?
@susteven4974 · 1 year ago
How can I fine-tune llama-2-7b-chat? Can I use your dataset format?
@tarun4705 · 1 year ago
Very informative
@fernando88to · 1 year ago
How do I use this local template in the localGPT project?
@engineerprompt · 1 year ago
The localGPT code has support for custom prompt templates. You will need to provide your template there.
@kevon217 · 8 months ago
Thanks for covering this topic!
@engineerprompt · 8 months ago
My pleasure!
@vobbilisettyjayadeep4346 · 1 year ago
You are a saviour
@engineerprompt · 1 year ago
Thank you 😊