Direct Preference Optimization: Your Language Model is Secretly a Reward Model | DPO paper explained

AI Coffee Break with Letitia

Published: 4 Oct 2024

Comments: 82

@hayatisschon 11 hours ago

Letitia, you are an amazing teacher/instructor!

@AICoffeeBreak 10 hours ago

Aww, thank you!

@DerPylz 9 months ago

Wow, two videos in one week? You're spoiling us!!

@kayvanshah994 2 months ago

Great explanation, easy to follow through, pretty simplified to understand

@KPreddiePWSP2 9 months ago

Thanks for the pretty useful and timely explanation.

@AICoffeeBreak 9 months ago

@kevon217 8 months ago

Great overview and comparison!

@AICoffeeBreak 8 months ago

Glad you like it!

@alexkubiesa9073 2 months ago

I’m no expert, but when RLHF was new, the most common justification I heard in explainer articles and videos was that the reward model was smaller than the LLM, so less likely to overfit on the human labels, and could be used to produce more data for the LLM to train on compared to just the expensive human-annotated data. So pretty much your second hypothesis.

@SrikanthIyer 8 months ago

Thanks for the video! Very well explained, I just began looking into DPO and your video gives a great context.

@AICoffeeBreak 8 months ago

@IbrahimSobh 7 months ago

Thank you so much for such clear high-level explanations.

@AICoffeeBreak 7 months ago

Thank you for your visit and wonderful comment!

@xiaobaodabao 9 months ago

Thanks!

@AICoffeeBreak 9 months ago

Wow, thanks!

@guimaraesalysson 9 months ago

Great video

@AICoffeeBreak 9 months ago

Thanks!

@Neomadra 9 months ago

I have not read the paper yet, but this sounds like supervised contrastive learning. If it is, then it's really astonishing that nobody came up with it before. I implemented some supervised contrastive learning myself... missed opportunity 😢

@AICoffeeBreak 9 months ago

Exactly! Accept this coffee for consolation.

@Neomadra 9 months ago

@AICoffeeBreak thanks! ❤

@정래혁-c8y 5 months ago

Good video :) really enjoyed watching it

@AICoffeeBreak 18 days ago

Thanks for your visit!

@poketopa1234 9 months ago

These videos are great! Really well explained, thank you so much for the effort you put into them :)

@AICoffeeBreak 9 months ago

Thanks, this means a lot to us!

@Thomas-gk42 8 months ago

Hi, I just came here because I saw you're a member of Sabine's channel. Wow, I did not expect a successful channel here too. Your videos are very well made, though computer science is not my field. I'll recommend your channel, all the best.

@AICoffeeBreak 8 months ago

Hi, yes I am a huge fan of Sabine's channel as I studied physics in my bachelor's and master's. I watch her channel to stay up to date with what happens in physics. I also love the other topics she addresses, her takes on things, and the humour in her videos. Lovely that you found my channel and I'm glad you wrote to say hi. 😁

@Thomas-gk42 8 months ago

@AICoffeeBreak Thanks for your attention, surely we'll meet again 🙂

@juanmanuelcirotorres6155 9 months ago

Love your videos, thank you

@AICoffeeBreak 9 months ago

@ShihgianLee 9 months ago

I really enjoy your videos. Please keep up the good work! My theory for why no one thought of this is your first reason: they thought there was no closed-form loss function. That is where RL comes in.

@AICoffeeBreak 9 months ago

Thank you!

@TheRyulord 9 months ago

On why RLHF came first, it was invented by OpenAI, which had focused almost exclusively on RL stuff prior to GPT. "When all you have is a hammer..." as the saying goes.

@MaJetiGizzle 9 months ago

Really excellent breakdown of DPO! Given what we’ve seen with gains from the use of DPO in the open source community, it makes complete sense that it was at least one of the runners up at NeurIPS this year. I’m really enjoying your explainer videos! Thank you for taking the time to make them!

@AICoffeeBreak 9 months ago

Thank you so much for your positive feedback!

@paprikar 9 months ago

What about an explanation video about the MoE architecture (Mistral)?

@AICoffeeBreak 9 months ago

Oh, interesting! I thought there were enough blogs and explainers about MoE and decided not to cover it. Now I am considering it again. :)

@rufex2001 9 months ago

I second this, even if there are explanations out there, it’s more about discussing with someone we’re interested in discussing it with

@AICoffeeBreak 9 months ago

@rufex2001

@leeme179 9 months ago

yeah agree with this, IMO since you are more technically grounded than other channels

@mkamp 9 months ago

Great video and wonderful explanation. Thanks for covering the differences and thoughts about the limitations of just using DPO. I am wondering why instruction finetuning was not mentioned? Wouldn’t SFT make the whole DPO process more efficient? Especially when sampling directly from a pre-trained model, it should be hard to even get good samples when the model hasn’t yet learned what questions and answers look like? No?

@AICoffeeBreak 9 months ago

Instruction tuning is a separate procedure. It is normal supervised learning with task descriptions and it is now independent of the DPO vs RLHF discussion. We mentioned it in another video and left it out for this one focusing specifically on DPO.

@8eck 9 months ago

So the main logic lies in the custom loss function, which is calculating higher loss for next token if it is far from the positive example?
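Editor's sketch for the question above: as described in the DPO paper, the loss is not a per-token distance to the positive example. It compares whole-sequence log-probabilities of the preferred and dispreferred answers, each measured relative to a frozen reference model. The minimal Python sketch below assumes those summed log-probabilities have already been computed; the function name, argument names, and the default `beta=0.1` are illustrative assumptions, not the authors' code.

```python
import math


def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one preference pair (hypothetical helper).

    Each *_lp argument is the summed token log-probability of a full answer
    under the trained policy or the frozen reference model.
    """
    # Implicit rewards: scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_lp - ref_chosen_lp)
    rejected_reward = beta * (policy_rejected_lp - ref_rejected_lp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss falls when the policy raises the preferred answer (relative to
# the reference) and rises when it favours the dispreferred one.
baseline = dpo_loss(-10.0, -10.0, -10.0, -10.0)
better = dpo_loss(-5.0, -15.0, -10.0, -10.0)
worse = dpo_loss(-15.0, -5.0, -10.0, -10.0)
```

In practice the log-probabilities would come from a forward pass of both models over the same prompt and answers; that part is omitted here.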

@광광이-i9t 9 months ago

Thank you so much for your clear explanation ~~ it is really helpful :)) hope you review other NeurIPS papers haha Thanks ~~~

@AICoffeeBreak 9 months ago

Glad it was helpful! Do you have a concrete paper recommendation from NeurIPS? :)

@광광이-i9t 9 months ago

@AICoffeeBreak So happy to leave the comment. Hope you're doing awesome! I stumbled upon this paper, 'Abide by the law and follow the flow: conservation laws for gradient flows,' and it's blowing my mind. 🤯 It got spotlighted at DL theory's oral session. I tried diving into it and even watched the presentation, but I'm kinda lost. The concept of conservation laws and the Lie algebra algorithm is intriguing but tricky for me. Thanks for your interest ~~~~

@kristoferkrus 9 months ago

Another question I got when reading the paper subtitle "Your Language Model is Secretly a Reward Model" was, in what way do they mean the language model is a reward model? To me it seems like they're not using a reward model at all, because they figured out that after starting to use a contrastive loss, they don't need one.

@AICoffeeBreak 9 months ago

Great observation! I do not have a good guess on this, especially since even in RLHF, the model itself "is" (rather becomes) a reward model because the reward model is initialized with a copy of the original LLM. Now, they do not do reward modelling at all, so nothing is secretly a reward model. Maybe they just wanted a catchy title?

@AICoffeeBreak 8 months ago

Or maybe "secretly" is a catchy way of saying "implicitly". Because implicitly, the language model itself can increase the reward by maximizing the likelihood of preferred examples and minimizing the likelihood of dispreferred examples. So implicitly, there is a "reward" defined by the likelihood and it comes directly from the language model (not from a reward model).

@kristoferkrus 8 months ago

@AICoffeeBreak Yes, maybe that's what they meant! It would make sense.
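Editor's note on the "implicit reward" reading in this thread, written in the notation of the DPO paper ($\pi_\theta$ is the model being trained, $\pi_{\text{ref}}$ the frozen reference model, $\beta$ the KL strength, $y_w$ and $y_l$ the preferred and dispreferred answers). The implicit reward is a log-ratio against the reference model rather than the raw likelihood, and the prompt-only term $\beta \log Z(x)$ cancels when two answers to the same prompt are compared:

```latex
\begin{align}
r(x, y) &= \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x) \\
p(y_w \succ y_l \mid x) &= \sigma\bigl(r(x, y_w) - r(x, y_l)\bigr) \\
\mathcal{L}_{\text{DPO}} &= -\,\mathbb{E}_{(x, y_w, y_l)}\left[ \log \sigma\left(
  \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
\end{align}
```

Read this way, the language model defines the reward through its own probabilities, which is one plausible sense of "secretly a reward model".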

@learnsomethingnew1651 4 months ago

Great video! Got a follow-up question: what kind of finetuning is the finetuning provided by the OpenAI API, where it finetunes a model based on a training set of Q&A pairs provided by the user?

@AICoffeeBreak 4 months ago

Ask Sam Altman? 😅 I didn't see any recent technical paper about fine-tuning from OpenAI, nor do they explain on their website what they do. They are too open for us to comprehend. Since they have large GPUs, it is safe to assume they are not forced to do parameter-efficient tuning like us noobs with Gaming GPUs.

@Micetticat 9 months ago

It appears to me that this method is superior to RLHF because it requires well-curated, human-annotated examples. Did I miss something?

@AICoffeeBreak 9 months ago

RLHF also trains on human annotations.

@Micetticat 9 months ago

@AICoffeeBreak Thanks for the comment. I see... it is essentially equivalent but more efficient since you don't have to train an extra model.

@AICoffeeBreak 9 months ago

@Micetticat exactly! 🎯

@zyzhang1130 1 month ago

Why is DPO stable to begin with? Continually updating the model will make the reward model (i.e., the model itself) non-stationary too.

@thipoktham5164 9 months ago

I was wondering, can we apply a similar concept of DPO to a more generic ranking task? This made me think about how triplet loss is equivalent to a softmax loss (in Soft Triplet Loss), but DPO seems to only deal with one-dimensional ranking. Instead of only ranking the outputs relative to each other as better or worse, can we apply the model to multiple aspects, i.e. non-binary ranking?

@leeme179 9 months ago

"Exploring connections between DPO and other ranking methods like Borda ranking or Condorcet voting could be fruitful for developing more sophisticated preference learning frameworks." by Google Bard

@AICoffeeBreak 9 months ago

@thipoktham5164 I think it should be possible to adapt the loss function that way.

@AICoffeeBreak 9 months ago

@leeme179 😂 lol

@sfsft11 9 months ago

Does this only apply to transformers or would it also work with Mamba?

@AICoffeeBreak 9 months ago

This training procedure is architecture independent, so I would say it works with Mamba as well. But of course, one would need to try it out as ML is a very empirical science. 🔭

@kristoferkrus 9 months ago

Wow! DPO was really simple. If its loss function is equivalent to RLHF, does that mean it theoretically will give as good results as RLHF, or can we draw some other conclusion from it? Good video explanation btw!

@AICoffeeBreak 9 months ago

Yes, DPO and RLHF should be equivalent as long as you apply them to the same pool of human feedback and you do not use the reward model to annotate more data. This is the upside of training a reward model: after having been trained on human-labelled pairs, it can annotate more data on its own and simulate human feedback. Also, a reward model has the upside that we can train it on more fine-grained human annotations (which have been shown to improve RLHF), such as 3-way ranking. DPO works on just pairwise ranking and would have to be adapted for a more fine-grained setting.

@kristoferkrus 9 months ago

@AICoffeeBreak Thanks for your answer! It makes sense that they are equivalent only if you don't use it to annotate more data, since as you mentioned, RLHF can lead to reward hacking (I guess only in those cases where you do automatically annotate more data). What is 3-way ranking?

@AICoffeeBreak 9 months ago

@kristoferkrus When you have three answers and humans annotated which one is the best, the second best and the third best. It's more fine-grained than just saying which one of two answers is best.

@kristoferkrus 9 months ago

@AICoffeeBreak Makes sense. Thank you! But I guess you could use a modified DPO loss for a three-way ranking where you take into consideration all three likelihoods. If L(A, B) is the DPO loss function when you have two pieces of generated text with likelihoods A and B, respectively, where the first piece is preferred to the second, and now you instead have A, B and C, I guess you could use something like L(A, B, C) = L(A, B) + L(B, C) (+ L(A, C)).
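Editor's sketch of the pairwise-sum idea proposed above, L(A, B, C) = L(A, B) + L(B, C) + L(A, C). It is an illustration of the commenter's suggestion, not a method taken from the paper or the video. The inputs are assumed to be implicit rewards (beta times the log-ratio of policy to reference model) listed from best to worst; all names are hypothetical.

```python
import math
from itertools import combinations


def pair_loss(reward_preferred, reward_other):
    """-log(sigmoid(difference)) for one ordered pair, numerically stable."""
    margin = reward_preferred - reward_other
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def ranked_loss(rewards_best_to_worst):
    """Sum the pairwise loss over every (higher-ranked, lower-ranked) pair.

    For three answers this is L(A, B) + L(A, C) + L(B, C).
    """
    return sum(pair_loss(hi, lo)
               for hi, lo in combinations(rewards_best_to_worst, 2))


# Rewards that agree with the human ranking give a lower loss than
# rewards that reverse it.
agrees = ranked_loss([2.0, 1.0, 0.0])
reversed_order = ranked_loss([0.0, 1.0, 2.0])
```

`itertools.combinations` preserves input order, so each pair is always (better, worse). The same function handles longer rankings, although whether such a sum trains well is an empirical question.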

@rg6960 9 months ago

@AICoffeeBreak Could we train the reward model to get us scores and use them in DPO too? This would let us avoid the messy RL procedure. It would also be interesting to see how a ranking loss instead of a contrastive loss performs with DPO.

@prabhavmorje 7 months ago

I am a newbie. I don't understand how evaluating the paper's results with GPT-4 is considered formal, rigorous analysis. Do you feel comfortable with it? There should be at least a claim about DPO error rate in terms of GPT-4 error rate.

@AICoffeeBreak 7 months ago

I completely understand your concern and there is no need to debate that a thorough human evaluation is preferred. Such human evaluation is costly though, so in the authors' defence: they did a small human study of their GPT-4 evaluation and showed that humans agreed with GPT-4 as much as they agree with each other. Also, LLMs are much better at evaluating things than at generating, in the same way in which it is easier for you to correct a thesis than to produce it.

@djjazzek 4 months ago

Great video! Question: is it possible to train an LLM with human feedback that is more complex than a simple positive/negative? For example, my data has 10 different possible values for feedback

@AICoffeeBreak 4 months ago

Three values would still be possible with a triple loss. But for 10, I wouldn't know how to implement it contrastively. :(

@leeme179 9 months ago

What if an existing apex LLM, e.g. GPT-4, were used to distil a model that can consume content about a topic, such as an article, and generate questions that may or may not be covered in that content? This model would be akin to a reward model, but would help train the LLM to become more logical in its answers about the topic. Together with a reward model, would this allow an end-to-end training process to produce an LLM expert in a specific field? Or I guess similar methods already exist...

@gordoneldest8462 2 months ago

Why not use the number of exchanges during a conversation with a human to identify the appropriate answers? RLHF is safer than DPO; however, yes, RLHF is difficult to control without a classification of the users. Implicit HF, though, is already a model used by search engines.

@AICoffeeBreak 2 months ago

Are you suggesting that the number of interactions in a conversation is a good measure of conversation quality? How would you account for trivial interactions that inflate the exchange count, as opposed to scenarios where a single question is followed by a single, exceptionally good answer (such as a perfect summary of a news article) after which the user leaves being satisfied?

@gordoneldest8462 1 month ago

@AICoffeeBreak Do we agree that HF here means the user does not interact to rate the answers? Otherwise it's a different, much simpler issue. For hidden ratings there are many ways. A simple one is to rate the answers themselves as the number of interactions increases. Another approach is to store the questions or, more precisely, the conceptual difference between successive questions. Above a certain level of difference, the topic is no longer the same and the ratings should be discarded. Otherwise, while still on the same topic, see how the user progressively refines the question using more precise wording and how much is taken from the answers. After a certain number of passes, the first prompted question can be substituted directly by the last prompt. If the user stops interacting, the substitution is rated ++; otherwise it is given a provisional rating, awaiting confirmation that it was really off topic.