Direct Preference Optimization (DPO) explained: Bradley-Terry model, log probabilities, math 

Umar Jamil
Subscribers: 39K
Views: 10K

Published: 26 Aug 2024

Comments: 58
@Patrick-wn6uj 4 months ago
The legend returns. Always excited for your videos. I am an international student at Shanghai Jiao Tong University. Your videos have given me a very strong foundation in transformers. Many blessings your way.
@umarjamilai 4 months ago
Let's connect on LinkedIn. I have a small WeChat group you can join.
@user-kg9zs1xh3u 4 months ago
@umarjamilai I'd like to join too.
@user-kg9zs1xh3u 4 months ago
@umarjamilai I see you also have an account on Bilibili.
@sauravrao234 4 months ago
I humbly request that you make videos on how to build a career in machine learning and AI. I am a huge fan of your videos, and I thank you for all the knowledge you have shared.
@umarjamilai 4 months ago
Hi! I will for sure make a video in the future about my personal journey. I hope that can help more people navigate their own journeys. Have a nice day!
@amankhurana2154 1 day ago
Awesome, thank you so much for putting this out, super helpful!
@mlloving 4 months ago
Thank you! It's a very clear explanation. It helps with reading the original paper. Looking forward to new topics.
@luxorska5143 4 months ago
Wow, your explanation is so clear and complete. You are a godsend, keep it up. You're phenomenal.
@user-hd7xp1qg3j 4 months ago
The legend is back, the GOAT. If my guess is right, next will be ORPO or Q*.
@umarjamilai 4 months ago
Actually, the next video is going to be a totally new topic not related specifically to language models. Stay tuned!
@olympus8903 4 months ago
@umarjamilai Waiting.
@amanattheedge9056 1 month ago
Very clear explanations! Please continue making such good videos!
@k1tajfar714 22 days ago
Awesome video. Please continue.
@cken27 4 months ago
Thanks for making these videos. Concise and clear.
@nwanted 3 months ago
Thanks so much Umar, I always learn a lot from your videos!
@yinghaohu8784 1 month ago
You explained it very clearly. Thanks!
@vanmira 3 months ago
These lectures are amazing. Thank you!
@olympus8903 4 months ago
A kind request: please increase the volume a little bit, just a little. Otherwise your videos are outstanding. The best, I can say.
@binjianxin7830 1 month ago
I believe the most evident insight of DPO is that it turns an RL problem into an equivalent maximum likelihood estimation (MLE) problem, while the optimal reward model is guaranteed by the human input, by definition. That's the meat. But the efficiency still depends on the human annotators' consistency.
@lukeskywalker7029 4 months ago
New video 🎉 Can't wait to watch, although I have been using DPO in production for a while now!
@kmalhotra3096 4 months ago
Amazing! Great job once again!
@mrsmurf911 3 months ago
Love from India, sir. You are a legend 😊😊
@abdullahalsaadi5991 4 months ago
Amazing explanation. Would it be possible to make a video on the theory and implementation of automatic differentiation (autograd)?
@SaiKiran-jc8yp 4 months ago
Best explanation so far!
@mahdisalmani6955 3 months ago
Thank you very much for this video. Please cover ORPO as well.
@Mortazaghafaripour 2 months ago
Great 👍
@jak-zee 4 months ago
Enjoyed the style in which the video is presented. Which video editor or tools do you use to make your videos? Thanks.
@umarjamilai 4 months ago
I use PowerPoint for the slides and Adobe Premiere for video editing.
@jak-zee 4 months ago
@umarjamilai What do you use to draw on your slides? I am assuming you connected an iPad to your screen.
@AptCyborg 4 months ago
Amazing video! Please do one on SPIN (Self-Play Fine-Tuning) as well.
@koiRitwikHai 9 hours ago
Great explanation, but I have some doubts. Please help.
36:50 In L_DPO, π* was replaced with π_θ. Why is π_θ considered the optimal policy?
44:13 You said "each hidden state contains information about itself and all the tokens that come before it", but this applies only to the decoder part of the transformer. So is this transformer layer actually a decoder layer, like GPT?
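On the 36:50 question, a sketch of the usual argument, written in the notation of the DPO paper (the symbols $r$, $Z(x)$, $\beta$, $\sigma$ and $\mathcal{D}$ are assumed here, since the comment does not define them). $\pi_\theta$ is not assumed to be optimal. The closed form of the optimal policy $\pi^*$ lets the reward be rewritten in terms of the policy, and $\pi_\theta$ is then the trainable model substituted into that expression and fitted to the preference data by maximum likelihood:

```latex
\begin{align*}
r(x, y) &= \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x) \\
P(y_w \succ y_l \mid x) &= \sigma\big(r(x, y_w) - r(x, y_l)\big)
  = \sigma\!\left(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right) \\
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) &=
  -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(
  \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
\end{align*}
```

The $\beta \log Z(x)$ term appears in both rewards and cancels in the difference, which is why the loss never needs the partition function.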
@mohammadsarhangzadeh8820 3 months ago
I love your videos so much. Please make a video about Mamba or Mamba Vision.
@umarjamilai 3 months ago
There's already a video about Mamba, check it out.
@DiegoSilva-dv9uf 4 months ago
Thanks!
@vardhan254 4 months ago
Love your videos, Umar!
@tommysnowy3068 4 months ago
Amazing video. Would it be possible for you to explain video transformers, or give your best guesses at how Sora works? Another exciting idea would be explaining GFlowNets.
@tuanduc4892 3 months ago
Thanks for your lecture. Could you explain vision language models?
@trungquang1581 4 months ago
Thank you so much for your effort! Could you make a video about building tokenizers like BPE and SentencePiece from scratch? I would really appreciate it!
@TemporaryForstudy 4 months ago
Great video. Love from India.
@elieelezra2734 2 months ago
Hello Umar, great as usual. However, why do you say at 46:11 that you need to sum the log probabilities? The objective function is the expectation of the logarithm of the difference of two weighted log-probability ratios. I don't get what exactly you want to sum. Thank you.
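One way to see where the sum comes from: an autoregressive model assigns a probability to each token, and the probability of a whole answer is the product of those, so each log π(y|x) in the loss is a sum of per-token log probabilities. The following is a minimal pure-Python sketch, not the code from the video. The function names and the numbers are hypothetical, and it assumes the per-token log probabilities of the answer tokens have already been extracted from the policy and the reference model.

```python
import math

def sequence_log_prob(token_log_probs):
    # log pi(y | x) = sum over answer tokens of log pi(y_t | x, y_<t)
    return sum(token_log_probs)

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z))
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(policy_w, policy_l, ref_w, ref_l, beta=0.1):
    # Each argument is a list of per-token log probabilities for one answer:
    # *_w for the chosen answer, *_l for the rejected one.
    chosen_ratio = sequence_log_prob(policy_w) - sequence_log_prob(ref_w)
    rejected_ratio = sequence_log_prob(policy_l) - sequence_log_prob(ref_l)
    return -log_sigmoid(beta * (chosen_ratio - rejected_ratio))

# Made-up numbers: the policy likes the chosen answer more than the reference does.
loss = dpo_loss(
    policy_w=[-0.5, -0.4], policy_l=[-2.0, -1.5],
    ref_w=[-1.0, -0.9], ref_l=[-1.2, -1.1],
)
print(loss)
```

The expectation in the objective is then just the average of this per-example loss over the preference dataset.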
@ernestbeckham2921 4 months ago
Thank you. Can you make a video about liquid neural networks?
@user-if9tm1co9e 4 months ago
Great explanation, thanks. How about the recent work "KTO: Model Alignment as Prospect Theoretic Optimization"? Can you compare it with DPO? 😁
@sidward 4 months ago
Thanks for the great video! Very intuitive explanation, and particular thanks for the code examples. Question: at 37:41, how do we know that solving the optimization problem will yield π*? Is there a guaranteed unique solution?
@umarjamilai 4 months ago
Please check the paper I linked in the description for a complete derivation of the formula. It is also done in the DPO paper, but in my opinion the other paper is better suited for this particular derivation.
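For reference, a sketch of why the optimum is unique, following the standard derivation for the KL-constrained reward maximization problem. The notation is assumed from the DPO paper and is not taken from the comments; $J(\pi)$ is a name introduced here for the objective.

```latex
\begin{align*}
J(\pi) &= \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[r(x, y)\big]
  - \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big] \\
  &= -\beta\, \mathbb{E}_{x \sim \mathcal{D}}\Big[
  \mathbb{D}_{\mathrm{KL}}\big[\pi(\cdot \mid x) \,\|\, \pi^*(\cdot \mid x)\big] - \log Z(x)\Big] \\
\pi^*(y \mid x) &= \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x) \exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big),
  \qquad Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x) \exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big)
\end{align*}
```

Since $Z(x)$ does not depend on $\pi$, maximizing $J$ means minimizing the KL divergence to $\pi^*$. That divergence is non-negative and equals zero only when the two distributions coincide, so $\pi = \pi^*$ is the only maximizer for each prompt $x$.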
@samiloom8565 4 months ago
I enjoy your videos, Umar, on my phone while commuting or sitting in a café. Only the small font is tiring on a phone. If you make it a bit bigger, that would be better.
@umarjamilai 4 months ago
Sorry for the trouble, I'll keep it in mind for the next videos!
@ai.mlvprasad 4 months ago
What presentation software are you using, sir?
@lokeshreddypolu250 3 months ago
Thanks for the video. Do you know of any way to create a dataset for DPO training? I currently have only question-answer pairs. Is it fine if I take y_w as the answer and y_l as some random text (which would obviously be less preferred than the answer) and then train on that?
@lokeshreddypolu250 3 months ago
The potential problem I see is that random text may decrease the loss without the policy changing much.
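As an illustration of the idea in this question, the sketch below turns question-answer pairs into preference records, using the answer to a different question as the rejected text. This is only one hypothetical way to pick negatives, not a recommendation from the video. The function name is made up, and the record keys follow the "chosen"/"rejected" naming mentioned elsewhere in this thread.

```python
import random

def build_preference_pairs(qa_pairs, seed=0):
    # qa_pairs: list of (question, answer) tuples.
    # Assumption: an answer written for a different question is a worse
    # response to this question than its own answer.
    rng = random.Random(seed)
    records = []
    for i, (question, answer) in enumerate(qa_pairs):
        others = [a for j, (_, a) in enumerate(qa_pairs) if j != i and a != answer]
        if not others:
            continue  # no usable negative for this question
        records.append({
            "prompt": question,
            "chosen": answer,
            "rejected": rng.choice(others),
        })
    return records

qa = [
    ("What is DPO?", "A method for aligning language models with preference data."),
    ("What is LoRA?", "A low-rank adaptation technique for fine-tuning."),
    ("What is BPE?", "A subword tokenization algorithm."),
]
for record in build_preference_pairs(qa):
    print(record["prompt"], "->", record["rejected"])
```

Negatives this easy are subject to the concern raised in the comment: the model can separate them from the chosen answers with little change, so the resulting training signal may be weak.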
@nguyenhuuuc2311 4 months ago
Hi Umar, if I use LoRA to fine-tune a chat model with the DPO loss, what should I use as the reference model?
- The chat model with LoRA applied
- Or the chat model itself, without LoRA?
@umarjamilai 4 months ago
Considering LoRA is just a way to "store" fine-tuned weights with a smaller computation/memory footprint, the model WITHOUT LoRA should be used as the reference model.
@nguyenhuuuc2311 4 months ago
@umarjamilai With my limited GPU, I can only fine-tune by combining a 4-bit-quantized model + LoRA. Surprisingly, using just the 4-bit model leads to NaN weight updates after one batch. But once LoRA is added, my loss updates smoothly without any problems.
@nguyenhuuuc2311 4 months ago
Thank you SO much for the quick answer and your excellent video. I got the hang of the DPO loss and was able to implement the DPO loss and a training loop in vanilla PyTorch.
@OGIMxGaMeR 4 months ago
Thank you very much for the explanation. I have one question: are preference datasets always made of two and only two answers?
@umarjamilai 4 months ago
According to the Hugging Face library, yes: it looks like you need a dataset with a prompt and two answers, one called the "chosen" one and the other the "rejected" one. I'm pretty sure there are ways to convert more than two preferences into a dataset of two preferences.
@OGIMxGaMeR 4 months ago
@umarjamilai Thank you! Yes, of course. I am just wondering why it wouldn't help to have more than one rejected answer for each accepted one. I guess the formula does not consider this case, but it may add value.
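A minimal sketch of one way such a conversion could look, assuming the goal is the prompt/chosen/rejected record shape described above. Both function names are hypothetical and not part of any library: the first pairs one accepted answer with each rejected one, the second expands a full ranking into every ordered pair.

```python
from itertools import combinations

def expand_one_vs_many(prompt, chosen, rejected_answers):
    # One accepted answer and several rejected ones become
    # one pairwise record per rejected answer.
    return [
        {"prompt": prompt, "chosen": chosen, "rejected": rejected}
        for rejected in rejected_answers
    ]

def ranking_to_pairs(prompt, ranked_answers):
    # ranked_answers is ordered from best to worst; combinations() keeps
    # that order, so the first element of each pair is the preferred one.
    return [
        {"prompt": prompt, "chosen": better, "rejected": worse}
        for better, worse in combinations(ranked_answers, 2)
    ]

pairs = expand_one_vs_many("Explain DPO.", "good answer", ["bad answer 1", "bad answer 2"])
print(len(pairs))  # 2
print(len(ranking_to_pairs("Explain DPO.", ["best", "middle", "worst"])))  # 3
```

Each resulting record can then be scored with the ordinary pairwise loss, so extra rejected answers contribute extra training pairs without changing the formula.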
@kevon217 4 months ago
“Digital biscuits”, lol