Code here, including a short explanation of how to get the dataset: github.com/adidror005/youtube-videos/blob/main/LLAMA_3_Fine_Tuning_for_Sequence_Classification_Actual_Video.ipynb
Hello, thanks for the really informative walkthrough. I was going back through your notebook for further review, but the notebook is no longer available at the link.
Wow, thanks for the nice comments! Share the video lol 🤣 I want to make more videos like this but am looking for the right topic. I'm thinking RLHF or something; open to suggestions.
I have a question regarding the modules: is there a methodology for picking those specific ones? Can I read more about them somewhere, perhaps? Great video, btw. Very well put together and to the point.
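For context, the question is about the LoRA `target_modules`. A common heuristic is to adapt the attention projection matrices, and sometimes the MLP projections as well. A hedged sketch with `peft` (the module names below are the usual Llama projection names, but they are an assumption here; confirm against your own model with `model.named_modules()` before relying on them):

```python
from peft import LoraConfig

# Sketch only: standard Llama attention/MLP projection names, not
# verified against the notebook. Check model.named_modules() first.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="SEQ_CLS",  # sequence classification head
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```

Adapting more modules raises the trainable-parameter count (and memory use) in exchange for capacity, so the choice is a trade-off rather than a fixed rule.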
@@MLAlgoTrader Also, please consider that knowing what you are working on helps me plan the next steps of my own development. Currently I use and pay for the OpenAI API, but I do plan to run Llama in my home lab. Once I start to learn and practice with Llama, I will go through your videos again.
Honestly, it is completely random. My next videos are on sequential bootstrap, implementing a gap trading strategy with both stocks and options, the dangers of backtesting, and then I also plan to do ib_insync for beginners. ...I think Llama 3 8B works on the free version of Colab for a bit, until you get kicked off the GPU. There is also an API I used; I think you get quite a bit for free at first: docs.llama-api.com/quickstart
@MLAlgoTrader Hi, may I ask you one question, please? I am trying to use the Llama 3 8B model for text classification. I have about 170k records and 11 categories. The maximum accuracy I was able to achieve was 68%. The data is properly preprocessed; I also tried, for example, the BERT and RoBERTa models, and both had over 90% accuracy. I would expect better results from a model like Llama 3 8B. I used both 8-bit and 4-bit quantization (both give similar results) and LoRA. I also played with different hyperparameters, but the results were hardly different. Do you think that, in short, this model might be a bad choice for text classification? Could we discuss this somewhere in a private message with more details? Thank you.
Hey, thanks for the message. I just don't have time right now since I need to move apartments. RoBERTa is a more direct fit for classification, so it isn't that surprising. I will try to get back to you sometime; I just have no time at all now.
@R8man012 @MLAlgoTrader I am working on a classification problem using Llama 3 with QLoRA. On 10k rows of data its accuracy is around 98%. The issue I'm facing is that 10k rows take about 1 hour (trainable params: 13,639,680 || all params: 7,518,572,544 || trainable%: 0.1814). How do I make it faster so it works on the whole dataset (2.4 million rows) in a reasonable amount of time?
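To put the question in perspective, a naive linear extrapolation from the figures quoted in the comment above (an assumption: real throughput varies with batch size, sequence length, and hardware):

```python
# Back-of-envelope scaling of fine-tuning time: 10k rows ~ 1 hour,
# full dataset 2.4M rows, assuming time grows linearly with rows.
rows_small, hours_small = 10_000, 1.0
rows_full = 2_400_000
hours_full = hours_small * (rows_full / rows_small)
print(hours_full)  # 240.0 hours, i.e. ~10 days per epoch at that throughput
```

Which is why the practical answers are usually to subsample the data, shorten `max_length`, raise the per-device batch size, or train on a fraction of an epoch, rather than to hope the model itself gets faster.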
Thank you so much :) I have a question! When fine-tuning with a sequence classification head, why don't you use apply_chat_template? Is there any reason? 👀
Thank you so much for the tutorial, it's very clear. I'm wondering if I can add some context to each training text, such as an explanation of how to classify the different sentiments. I don't know if it works, but LLMs like Llama have the ability to understand context, so maybe it would help. What's your opinion?
@@MLAlgoTrader Yes, I would add the prompt at the beginning of each text, something like: "Classify the text messages as 1. positive, explanation: xxxxxxx, example: xxxxx; 2. negative, explanation: xxxx, example: xxxxxx. The message is: 'Tesla's market cap soared to over $1 trillion ...'"
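A minimal sketch of that idea in plain Python, with hypothetical label explanations standing in for the xxxx placeholders above:

```python
# Hypothetical instruction prefix; the explanations and examples are
# placeholders you would replace with your own label descriptions.
INSTRUCTION = (
    "Classify the text message as:\n"
    "1. positive, explanation: good news for the company or stock, "
    "example: 'Shares jumped 10% after earnings.'\n"
    "2. negative, explanation: bad news for the company or stock, "
    "example: 'The company missed revenue estimates.'\n"
    "The message is: "
)

def with_context(message: str) -> str:
    """Prepend the classification instruction to one training text."""
    return INSTRUCTION + message

prompt = with_context("Tesla's market cap soared to over $1 trillion ...")
```

The same `with_context` would be mapped over every row of the dataset before tokenization, so the model always sees the task description alongside the message.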
For some LLMs it does better even that way before fine-tuning, but fine-tuning makes it less necessary. Check out the deeplearning.ai course on LlamaIndex; he does it similarly to what you suggest.
Thank you for your video! I have the following question: when you make predictions before fine-tuning the model, are you evaluating the model's zero-shot capabilities?
@@MLAlgoTrader In that case, how could I add this linear layer and then a classification layer in your code? I'm interested in comparing zero-shot and few-shot learning with this model.
@@michelejoshuamaggini3822 So this automatically adds those layers. To be precise: "The LLaMa Model transformer with a sequence classification head on top (linear layer). LlamaForSequenceClassification uses the last token in order to do the classification, as other causal models (e.g. GPT-2) do. Since it does classification on the last token, it requires to know the position of the last token. If a pad_token_id is defined in the configuration, it finds the last token that is not a padding token in each row. If no pad_token_id is defined, it simply takes the last value in each row of the batch. Since it cannot guess the padding tokens when inputs_embeds are passed instead of input_ids, it does the same (take the last value in each row of the batch)." See huggingface.co/docs/transformers/main/en/model_doc/llama2 I don't have time now, but I can show you in code sometime how you would see it.
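The "last non-padding token" behaviour quoted from the docs can be illustrated in plain Python. A sketch of the indexing logic only, not the actual Hugging Face implementation:

```python
def last_token_index(input_ids, pad_token_id=None):
    """Per row, the index of the token whose hidden state feeds the
    classification head, mirroring the docs quoted above."""
    if pad_token_id is None:
        # No pad token configured: simply take the last position.
        return [len(row) - 1 for row in input_ids]
    indices = []
    for row in input_ids:
        non_pad = [i for i, tok in enumerate(row) if tok != pad_token_id]
        # Fall back to the last position if a row is all padding.
        indices.append(non_pad[-1] if non_pad else len(row) - 1)
    return indices

# Right-padded batch with pad_token_id = 0:
print(last_token_index([[5, 6, 0, 0], [1, 2, 3, 4]], pad_token_id=0))  # [1, 3]
```

In the real model those indices select one hidden-state vector per row, which the linear `score` head maps to class logits; that is why setting `tokenizer.pad_token` (and `model.config.pad_token_id`) matters for this architecture.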
@@michelejoshuamaggini3822 As for zero-shot classification, that is something like: "a car is something with 4 wheels and a motorcycle is something with 2; can you please classify car or motorcycle?" What I do here is like zero-shot classification for sentiment analysis: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-nMKYuSALmpc.html
So I literally was about to share the video, but I had a bug and needed to restart. I must wait 24 hours due to an API limit, so I'll send it 25 hours from now lol!
@@salmakhaled-hn6gw No problem. There are a few more things I left out; hopefully we can cover them in another video, like loading the model and merging the QLoRA weights. Does the part about getting the data make sense? You need that to run the notebook!
Being on 0 sleep, I'll quote ChatGPT and get back to answering you later lol...

Turning Llama 3 into an encoder-only transformer like BERT, by removing the attention mask, is theoretically possible but involves more than just altering the attention mechanism. Here are the steps and considerations for this transformation:

1. Modify the attention mechanism: In Llama 3, an autoregressive transformer like GPT-3, each token can only attend to previous tokens. To make it behave like BERT, you need to allow each token to attend to all other tokens in the sequence. This involves changing the attention mask settings in the transformer's layers.
2. Change the training objective: BERT uses a masked language model (MLM) objective, where some percentage of the input tokens are masked and the model predicts these masked tokens. You would need to implement this training objective for the modified Llama 3.
3. Adjust the tokenizer and inputs: BERT is trained with pairs of sentences as inputs (for tasks like next sentence prediction) and uses special tokens (like [CLS] and [SEP]) to distinguish between sentences. You would need to adapt the tokenizer and data preprocessing steps to accommodate these requirements.
4. Retrain the model: Even after these modifications, the model would need to be retrained from scratch or fine-tuned extensively on a suitable dataset, because the pre-existing weights were optimized for a different architecture and objective.
5. Software and implementation: You need to ensure that the transformer library you're using supports these customizations. Libraries like Hugging Face Transformers are quite flexible and might be useful for this purpose.

This transformation essentially creates a new model, leveraging the architecture of Llama 3 but fundamentally changing its operation and purpose. Such a project would be substantial and complex but interesting from a research and development perspective.
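The attention-mask change discussed above can be illustrated in plain Python, a toy sketch of the mask shapes rather than actual Llama code: a causal (decoder-style) mask lets position i attend only to positions j <= i, while an encoder-style mask allows every position to see every other.

```python
def causal_mask(n: int):
    """Decoder-style mask: True where attention is allowed (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n: int):
    """Encoder-style (BERT-like) mask: every token sees every token."""
    return [[True] * n for _ in range(n)]

# With n = 3, the causal mask forbids attending to future positions:
for row in causal_mask(3):
    print(row)
```

Swapping one mask for the other is the easy part; as the list above notes, the pretrained weights were optimized under the causal mask, which is why retraining under a new objective is the real cost.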
@@MLAlgoTrader Thank you so much, I appreciate the response! Since it's a classification task, it makes sense to remove the mask (make it encoder-only) and retrain the model with another objective function. I was just wondering, technically, how would you remove the mask from Llama 3? And maybe also add a feedforward layer? Is it possible to edit the architecture like that?
Hey, thanks for the nice words! This was so long ago I forget where I have it. Their documentation is poor on this; I'll try to find my example, but it might take me a few days.
@@MLAlgoTrader Hello! I hope everything is going well! I'm getting back to you to ask how soon you might be able to find this; my internship ends in a few days, and I still can't load my saved model. Your help would really mean a lot. Thanks again!
Hello! Thank you so much for your tutorial; it is very helpful and easy to follow. I started applying it to my own binary dataset but got stuck on the training step. I get an error on this line of code: labels = inputs.pop("labels").long() KeyError: 'labels'. My inputs look like this: ['input_ids', 'attention_mask'], and I don't understand which "labels" that line refers to. If it is not too difficult for you, could you explain what it means? I would be most grateful! UPD: I renamed the columns of my dataset to "text" and "labels", and that solved the issue! 😀
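A plain-Python sketch of why the rename fixed it: the training step pops a "labels" entry out of the batch dict, so the tokenized dataset must expose a column with exactly that name. The dict below is a stand-in for a real tokenized batch, which would hold tensors:

```python
# Stand-in for a tokenized batch built from the dataset's columns.
batch = {"input_ids": [[1, 2, 3]], "attention_mask": [[1, 1, 1]]}

# Without a "labels" column in the dataset, this is the failing line:
try:
    labels = batch.pop("labels")
except KeyError as err:
    print("KeyError:", err)  # the error quoted in the comment above

# After renaming the label column to "labels", the pop succeeds and
# leaves only the model inputs behind:
batch["labels"] = [0]
labels = batch.pop("labels")
print(labels)         # [0]
print(sorted(batch))  # ['attention_mask', 'input_ids']
```

The collator only carries forward columns that exist in the dataset, which is why a column named e.g. "sentiment" never shows up as "labels" in the batch.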
@@MLAlgoTrader Hi! I actually updated my comment: I found a workaround for the issue, although I still only vaguely understand why it helped. I need to read more documentation, I guess. Anyway, thank you for your tutorial; it helped me with my thesis 😊
Always down for a new idea, but I don't know if I can get to it soon. I had an idea to do text summarization, which can be done with a similar architecture to machine translation, but with different metrics of course.