
Reinforcement Learning from Human Feedback: From Zero to chatGPT 

HuggingFace
63K subscribers
168K views
Published: 26 Aug 2024

Comments: 47
@nadinelinterman268 · 1 year ago
A wonderfully talented presentation that was easy to listen to. Thank you very much.
@mike_skinner · 1 year ago
I asked ChatGPT questions about a skill I am an expert in. I am from England, but it said that the companies I worked with were American, and it even gave false US states for them. It came up with names of people I knew, but I think it thought England was a state of the US.
@burgermind802 · 1 year ago
ChatGPT was probably trained on text disproportionately based in the United States, so it is biased toward being American-centric. It doesn't know things; it just guesses semantically, which often, but not always, happens to be true.
@Silly.Old.Sisyphus · 1 year ago
The reason it talked complete garbage to you is that it doesn't have a clue about what it's saying, let alone what you said. Its "language" model has nothing to do with language; it is effectively probabilistic statistical regurgitation of fragments of things that have been said before (its "training" data).
@erener7897 · 5 months ago
Thank you for the lecture! But I personally found it very hard to follow. The voice is so monotonic and the material is not catchy or interactive, so it was hard not to fall asleep. Nevertheless, you are doing important work, keep going!
@ashishj2358 · 9 months ago
Offline RL is quite unstable more often than not. PPO is a simple and excellent algorithm that, if tuned well, achieves really great results. Even after many papers proposing other approaches like DPO, OpenAI has stuck with PPO.
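For reference, the DPO objective mentioned above replaces the separately trained reward model and PPO loop with a single supervised loss over preference pairs. In the notation of the DPO paper, with prompt x, preferred response y_w, and rejected response y_l:

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]

Here sigma is the logistic function and beta controls how far the policy may drift from the reference model, playing the same role as the KL coefficient in the PPO setup.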
@user-ch3gs7el5k · 8 months ago
Wonderful presentation. Question: why would the reward function reward gibberish? I thought the reward function was sophisticated enough to reward only human-like speech.
@IoannisNousias · 8 months ago
Given that the reward model is differentiable, why not just backprop through it in a self-supervised manner rather than using RL? Think of the discriminator of a GAN providing gradients to the generator: here the discriminator is the reward model, which can be frozen, and the generator is the LLM. Still use the KL regularizer to keep it in the desired optimization regime.
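A minimal PyTorch-style sketch of that idea, under the assumption of a frozen scalar reward head fed with soft token distributions; every model and shape here is an illustrative placeholder, not a real library API:

import torch
import torch.nn.functional as F

vocab_size, seq_len, batch, hidden_dim = 100, 8, 4, 16

# Frozen "discriminator": a stand-in for a trained reward model.
reward_model = torch.nn.Linear(vocab_size, 1)
for p in reward_model.parameters():
    p.requires_grad_(False)

# Trainable "generator": a stand-in for the LLM's output head.
generator = torch.nn.Linear(hidden_dim, vocab_size)
ref_logits = torch.randn(batch, seq_len, vocab_size)  # frozen reference policy

opt = torch.optim.Adam(generator.parameters(), lr=1e-4)

hidden = torch.randn(batch, seq_len, hidden_dim)  # fake hidden states
logits = generator(hidden)
probs = F.softmax(logits, dim=-1)

# Differentiable "reward": feed soft token distributions straight into the
# frozen reward model, so gradients flow back to the generator, GAN-style.
reward = reward_model(probs).mean()

# Per-token KL(pi_theta || pi_ref) keeps the policy near the reference.
log_probs = F.log_softmax(logits, dim=-1)
ref_log_probs = F.log_softmax(ref_logits, dim=-1)
kl = (probs * (log_probs - ref_log_probs)).sum(dim=-1).mean()

beta = 0.1
loss = -(reward - beta * kl)  # maximize reward, penalize KL drift
opt.zero_grad()
loss.backward()
opt.step()

The catch, and the main reason RL is used instead, is that real generation samples discrete tokens, so gradients cannot flow through the sampling step without relaxations such as Gumbel-softmax or straight-through estimators.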
@johntanchongmin · 1 year ago
I liked the content. I personally do not agree with using RLHF as the main tool for learning because it is just too costly to use human feedback, but perhaps using this with a combination of self-supervised learning could help to scale up the utility of human annotation to a wider domain. I also wonder how robust RLHF is to outliers that are not within the training distribution. Perhaps the key is in more generalizable structural retrieval, i.e. making sure output is coherent according to a knowledge graph in memory, rather than human feedback as a reward for the output text.
@mohammadaqib4275 · 1 year ago
Hey, do you have any hands-on RLHF material? I have a few questions to ask.
@johntanchongmin · 1 year ago
@mohammadaqib4275 Thanks for the question. I personally do not perform my own RLHF; it is too costly. In my opinion, just doing the SFT step may already be enough for most use cases, and I typically just do that.
@mohammadaqib4275 · 1 year ago
Any resources in the form of a blog or guided project that you can suggest?
@johntanchongmin · 1 year ago
@mohammadaqib4275 You can try some of the Hugging Face or Weights & Biases implementations. Maybe take a look at StackLlama, which does RLHF on Llama (the non-2 version).
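To make that concrete, here is a minimal sketch of a single RLHF/PPO step using Hugging Face's TRL library (the classic 0.x PPOTrainer API, which may differ in newer releases; the gpt2 checkpoint and the constant reward are placeholders for a real policy and a trained reward model):

import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

# Trainable policy with a value head, plus a frozen reference copy
# that anchors the KL penalty.
model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

config = PPOConfig(batch_size=1, mini_batch_size=1)
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

# One query -> one sampled response.
query = tokenizer("Explain RLHF in one sentence:", return_tensors="pt").input_ids[0]
output = model.generate(query.unsqueeze(0), max_new_tokens=32, do_sample=True)
response = output[0][query.shape[0]:]  # keep only the generated continuation

# A real pipeline scores the response with a trained reward model;
# this constant is a stand-in for that score.
rewards = [torch.tensor(1.0)]

# One PPO optimization step on (query, response, reward).
stats = ppo_trainer.step([query], [response], rewards)

The StackLlama blog post walks through the same loop end to end, including training the reward model that this sketch stubs out.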
@present-bk2dh · 8 months ago
@mohammadaqib4275 Did you find any?
@1südtiroltechnik · 1 year ago
Now I get it, thank you for the video!
@teddysalas3590 · 1 year ago
I only found out today that there is a reinforcement learning course on Hugging Face. Will there be a reinforcement learning course on Hugging Face after September?
@muskduh · 1 year ago
Thanks for the video.
@akeshagarwal794 · 11 months ago
Can we fine-tune an encoder-decoder model like T5 with RLHF? If we can, please link some source code.
@nlarchive · 1 year ago
Great job! We need people to create content until AI does the content XD
@pariveshplayson · 6 months ago
Theta should be the learnable policy parameters of the language model, not of the reward model. By this point, the parameters of the reward function have already been learned.
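For context, the RL stage of RLHF (as in InstructGPT-style training) optimizes only the policy parameters theta, holding the learned reward model r_phi fixed:

\max_{\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}
  \left[ r_\phi(x, y) \right]
  - \beta \, \mathbb{D}_{\mathrm{KL}}\!\left[ \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \right]

So theta indexes the language-model policy pi_theta, while phi, the reward parameters, were fitted in the preceding preference-modeling stage.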
@abhinandanwadhwa2605 · 1 year ago
How could we add RLHF to our own LLM?
@stevenjordan5795 · 1 year ago
Would you recommend RLHF for Thoroughbred handicapping?
@meghanarb7724 · 1 year ago
Is ChatGPT safe from hacking and cyber threats?
@nitroyetevn · 1 year ago
I find the constant "If you have any questions, post them in the comments" ticker kind of distracting. Maybe just display it every 2 minutes or every 5 minutes.
@dlal8042 · 9 months ago
It would be better if you could show a small implementation.
@RD-AI-ROBOTICS · 1 year ago
You mentioned a ChatGPT paper that was going to be released the next day, i.e. 14 Dec 2022, but I have not been able to find it. Can anyone please point me to it? Thanks.
@parthshah4339 · 1 year ago
He was joking.
@FlipTheTables · 11 months ago
I ran across a video where someone was doing a breakdown of how it was built.
@serkhetreo2489 · 1 year ago
Please, what is the link for the Discord?
@st3ppenwolf · 1 year ago
I thought I was going to see code. Good talk, though.
@weiwuWonderfulLife · 1 year ago
The "write a comment" prompt at the bottom of this video is so annoying.
@KeqiChen-ds2co · 1 year ago
The Discord invitation is invalid now T_T
@preston748159263 · 9 months ago
"People are here to learn about language models." The zeitgeist of language as an insight into human behavior may be functional, but it does not give us a better understanding. Language is too far away from what's going on. Human 1 has to interpret perception, reason about it, incorporate it into their own mental model, articulate it, and communicate it; human 2 (or the AI model) has to perceive it, reason with it, incorporate it, and then produce an action. There is too much room for error, and it's a bold strategy because it's so far removed; it's simply convenient because we have a lot of text data. RL is unique, though, because it is based on operant conditioning, so there is some opportunity here that is not being taken advantage of. We need to go back to a cognitive perspective on AI; that's the only way we can better replicate and model human cognition, and we certainly can't create robust AI if we don't first understand our underlying cognitive mechanisms.
@preston748159263 · 9 months ago
TL;DR: RLHF seems to be a band-aid on a large wound caused by language-based AI.
@Shalaginov_com · 1 year ago
The first 17 minutes are a waste of time.
@franciscofreitas6695 · 1 year ago
Thank you bro
@xasopheno · 1 year ago
Thanks
@seeusoon07 · 8 months ago
Thanks for saving time
@indramal · 1 year ago
Oops, I missed it.
@doulaishamrashikhasan8425 · 1 year ago
You can still watch it lol
@indramal · 1 year ago
@doulaishamrashikhasan8425 But I cannot discuss it with others live, like during the online session.
@antonderoest9462 · 1 year ago
Please use a better microphone and add some damping to your room.
@dontwannabefound · 10 months ago
Too much hand-waving.