Model Distillation: Same LLM Power but 3240x Smaller 

Adam Lucek
4K subscribers · 7K views · Published 29 Sep 2024

Comments: 25
@KevinKreger · 1 month ago
Nice to find you! Great topic
@AdamLucek · 1 month ago
Thanks!
@gunaysoni6792 · 1 month ago
Strictly speaking, this is just fine-tuning on synthetic data, not distillation. Distillation for language models trains the student model on the teacher model's entire output probability distribution, rather than doing SFT with next-token prediction.
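To make the distinction this comment draws concrete, here is a minimal PyTorch sketch; the function names and toy tensors are illustrative assumptions, not from the video:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soft targets: match the student's full output distribution to the
    # teacher's via KL divergence on temperature-scaled probabilities.
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature**2

def synthetic_sft_loss(student_logits, teacher_logits):
    # Hard targets: keep only the teacher's argmax label and train with
    # plain cross-entropy, roughly what this comment says the video does.
    hard_labels = teacher_logits.argmax(dim=-1)
    return F.cross_entropy(student_logits, hard_labels)

# Toy example: batch of 4 over a 10-way label/vocab space.
student = torch.randn(4, 10)
teacher = torch.randn(4, 10)
print(distillation_loss(student, teacher), synthetic_sft_loss(student, teacher))
```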
@muhannadobeidat · 1 month ago
He distilled into a fine-tuned model for his own use case. I think it's an excellent use case, albeit specific to a sentiment-classification example.
@gramnegrod · 1 month ago
Wow, very interesting for many use cases! I wonder if using DSPy would help the teacher build an even better dataset to approximate the 65.4?
@AdamLucek · 1 month ago
Certainly! In Moritz Laurer's blog here huggingface.co/blog/synthetic-data-save-costs he uses chain of thought, few-shot, and self-consistency, whereas I only used chain-of-thought and few-shot prompting. Using DSPy for prompting could optimize it further!
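A hypothetical sketch of how self-consistency could sit on top of the chain-of-thought and few-shot prompting described here: sample the teacher several times and keep the majority-vote label. The prompt text, label set, and the ask_teacher stub are assumptions for illustration, not the video's actual code.

```python
import random
from collections import Counter

FEW_SHOT = """Classify the financial sentiment of the sentence.
Think step by step, then answer with one of: positive, neutral, negative.

Sentence: "Profits doubled year over year."
Reasoning: Doubling profits is clearly good news. Answer: positive
"""

def ask_teacher(prompt: str) -> str:
    # Stand-in for a real sampled (temperature > 0) call to the teacher model,
    # e.g. Llama 3.1 405B through whatever API client you use. Replaced with a
    # random choice here so the sketch runs end to end.
    return random.choice(["positive", "neutral", "negative"])

def self_consistent_label(sentence: str, n_samples: int = 5) -> str:
    # Self-consistency: sample several chain-of-thought completions and keep
    # the majority-vote label as the synthetic annotation.
    prompt = f'{FEW_SHOT}\nSentence: "{sentence}"\nReasoning:'
    votes = [ask_teacher(prompt) for _ in range(n_samples)]
    return Counter(votes).most_common(1)[0][0]

print(self_consistent_label("Shares fell 10% after the earnings call."))
```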
@cosmockips907 · 1 month ago
Vanaduke is calling your name
@AdamLucek · 1 month ago
What's this? More wolves hungry for the blood of Almire? Our great kingdom shall never fall to the likes of beasts!
@cosmockips907 · 1 month ago
@AdamLucek Glad you're doing well, miss the streams
@fabriziocasula · 1 month ago
Can I use RoBERTa with Ollama? How can I download it? :-)
@drkvaladao776 · 1 month ago
Is there any distilled GGUF model available to use?
@unclecode · 1 month ago
Fascinating. Very well organized and clearly explained. I have a few questions:

1. Have you tried fine-tuning the RoBERTa model using human-annotated labels? It would be interesting to compare the accuracy of a model trained on synthetic labels versus one trained on human-annotated data. Is there a significant difference?

2. I understand we have a dataset labeled by a larger model, which we then use to train a smaller model. But I'm curious whether, instead of just labeling, the large model could generate the entire dataset, especially for customized data that doesn't necessarily exist. For example, instead of tweets, we could generate business data from customer reviews: fine-tune a large model on a sample of customer reviews to teach it the tone and style, use the new model to create, say, 5,000 customer reviews and annotate them, then use this set to fine-tune a smaller model. This would be an extreme version of model distillation where both the data and the labels come from the large model.

3. Have you considered trying this with a smaller model, under 100 million parameters? Since this is sentiment analysis, an even smaller model might yield faster results while keeping accuracy high.
@AdamLucek · 1 month ago
Thanks! And great questions, some thoughts:

1. Yes, if you wanted a lightweight specialized classification model, then just using the human-annotated labels would be the traditional way. There are plenty of RoBERTa-base models trained on the same set I used; your goal then is exactly what you wish to measure, the direct accuracy of your model on the dataset. That's valid, but it's not quite the point of the demonstration here, which used accuracy as a baseline to compare two models, so a higher "accuracy" in this case doesn't actually represent better success from the model training.

2. You're on the right track, and it's definitely a technique many are using, especially for data augmentation. I highly recommend reading through "A Survey of Knowledge Distillation of Large Language Models" arxiv.org/pdf/2402.13116, from which many of the examples in this video come. They cover a plethora of ways, beyond just classification, that this is being used.

3. I have not! I kept it light to turn this video around faster, but many optimizations can be made to this model: different base language models, different fine-tuning hyperparameters, etc. While the accuracy metric is the same, the current model's distribution of labels only matches Llama 3.1 405B's about 75% of the time. Different sizes, methods, and iterations could improve on this!
@HenockTesfaye · 1 month ago
Very clear. Thank you.
@i2c_jason · 1 month ago
Would it be possible to achieve the same end result with a bunch of conditional iteration and API calls to pay-for-play LLMs, if money is not an issue and you are trying to prototype a scalable generative AI application (an SW 3.0 application, let's say) for a pitch deck? I love the distillation idea, as it parallels something I'm working on, but I'm concerned I'll invest too much time into something that won't scale as these capabilities become native to the paid LLMs' APIs. What are your opinions on scaling something as a very small bootstrapped startup?
@muhannadobeidat · 1 month ago
Excellent video, especially the white paper review at the top and how you used knowledge from Llama 3.1 to train a smaller model.
@AdamLucek · 1 month ago
Thanks! 😊
@simonlove99 · 1 month ago
Good insights and intro. One key challenge for me, though, was the conclusion of transfer: without a vanilla RoBERTa rating, we don't know if there was any material influence on the output. How would RoBERTa have scored on the task pre-fine-tuning?
@AdamLucek · 1 month ago
Very good points! My method here is very basic, with many optimizations and evaluations yet to be performed. While the accuracy was similar, the distribution of classifications was only roughly 75% similar on RoBERTa's side, so many improvements can still be made along the way. We do know the RoBERTa model learned something, though, as we used the base model, which cannot perform this task in its original state!
@TheShreyas10 · 1 month ago
Exciting stuff, interesting to see. But does it also support summarization, or only text classification?
@AdamLucek · 1 month ago
My example was classification, but in theory you can do this with anything! Google's Gemma 2B is an entire general-purpose language model trained using distilled data.
@GNARGNARHEAD · 1 month ago
exciting stuff, thanks for sharing
@SussyBacca · 1 month ago
Uh, 125M params for sentiment analysis is HUGE. You don't even need AI for it; you can use Bayesian statistics. Qualitatively speaking, this is a bit of cold oatmeal.
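For reference, the kind of classical Bayesian baseline this comment alludes to can be a few lines of scikit-learn. A minimal sketch; the tiny inline dataset is a placeholder assumption, and in practice you would train on a labeled corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder training data standing in for a real labeled corpus.
texts = [
    "stock surges after record earnings",
    "company misses estimates, shares drop",
    "quarterly report due next week",
]
labels = ["positive", "negative", "neutral"]

# Bag-of-words counts feed the Naive Bayes per-class word likelihoods.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["shares drop on weak earnings"]))
```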
@gramnegrod · 1 month ago
Well, OK... maybe a bad example dataset. But the point is that the micro LLM did it.
@vikphatak · 1 month ago
128M, not 128B