
Reinforcement Learning from Human Feedback: From Zero to chatGPT 

HuggingFace
63K subscribers
168K views
Published: 26 Aug 2024

Comments: 47
@nadinelinterman268 · 1 year ago
A wonderfully talented presentation that was easy to listen to. Thank you very much.
@mike_skinner · 1 year ago
I asked ChatGPT questions about a skill I am an expert in. I am from England, but it said that the companies I worked with were American, and it even gave false US states for them. It came up with names of people I knew, but I think it thought England was a state of the US.
@burgermind802 · 1 year ago
ChatGPT was probably trained on text disproportionately based in the United States, so it is biased toward being American-centric. It doesn't know things; it just guesses semantically, which often, but not always, happens to be true.
@Silly.Old.Sisyphus · 1 year ago
The reason it talked complete garbage to you is that it doesn't have a clue about what it's saying, let alone what you said. Its "language" model has nothing to do with language; it is effectively probabilistic statistical regurgitation of fragments of things that have been said before (its "training" data).
@erener7897 · 5 months ago
Thank you for the lecture! But I personally found it very hard to follow. The voice is so monotonic and the material is not catchy or interactive, so it was hard not to fall asleep. Nevertheless, you are doing important work, keep going!
@ashishj2358 · 9 months ago
Offline RL is quite unstable more often than not. PPO is a simple and excellent algorithm that, if tuned well, achieves really great results. Even after many papers proposing other approaches like DPO, OpenAI has stuck with PPO.
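For reference, the DPO objective mentioned above replaces the separately trained reward model and PPO loop with a single supervised loss over preference pairs. In the notation of the DPO paper, with prompt x, preferred response y_w, and rejected response y_l:

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]

Here sigma is the logistic function and beta controls how far the policy may drift from the reference model, playing the same role as the KL coefficient in the PPO setup.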
@user-ch3gs7el5k · 8 months ago
Wonderful presentation. Question: why would the reward function reward gibberish? I thought the reward function was sophisticated enough to reward only human-like speech.
@IoannisNousias · 8 months ago
Given that the reward model is differentiable, why not just backprop through it in a self-supervised manner rather than using RL? Think of the discriminator of a GAN providing gradients to the generator: here the discriminator is the reward model, which can be frozen, and the generator is the LLM. Still use the KL regularizer to keep it in the desired optimization regime.
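A minimal PyTorch-style sketch of that idea, under the assumption of a frozen scalar reward head fed with soft token distributions; every model and shape here is an illustrative placeholder, not a real library API:

import torch
import torch.nn.functional as F

vocab_size, seq_len, batch, hidden_dim = 100, 8, 4, 16

# Frozen "discriminator": a stand-in for a trained reward model.
reward_model = torch.nn.Linear(vocab_size, 1)
for p in reward_model.parameters():
    p.requires_grad_(False)

# Trainable "generator": a stand-in for the LLM's output head.
generator = torch.nn.Linear(hidden_dim, vocab_size)
ref_logits = torch.randn(batch, seq_len, vocab_size)  # frozen reference policy

opt = torch.optim.Adam(generator.parameters(), lr=1e-4)

hidden = torch.randn(batch, seq_len, hidden_dim)  # fake hidden states
logits = generator(hidden)
probs = F.softmax(logits, dim=-1)

# Differentiable "reward": feed soft token distributions straight into the
# frozen reward model, so gradients flow back to the generator, GAN-style.
reward = reward_model(probs).mean()

# Per-token KL(pi_theta || pi_ref) keeps the policy near the reference.
log_probs = F.log_softmax(logits, dim=-1)
ref_log_probs = F.log_softmax(ref_logits, dim=-1)
kl = (probs * (log_probs - ref_log_probs)).sum(dim=-1).mean()

beta = 0.1
loss = -(reward - beta * kl)  # maximize reward, penalize KL drift
opt.zero_grad()
loss.backward()
opt.step()

The catch, and the main reason RL is used instead, is that real generation samples discrete tokens, so gradients cannot flow through the sampling step without relaxations such as Gumbel-softmax or straight-through estimators.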
@johntanchongmin · 1 year ago
I liked the content. I personally do not agree with using RLHF as the main tool for learning because it is just too costly to use human feedback, but perhaps using this with a combination of self-supervised learning could help to scale up the utility of human annotation to a wider domain. I also wonder how robust RLHF is to outliers that are not within the training distribution. Perhaps the key is in more generalizable structural retrieval, i.e. making sure output is coherent according to a knowledge graph in memory, rather than human feedback as a reward for the output text.
@mohammadaqib4275 · 1 year ago
Hey, do you have any hands-on RLHF material? I have a few questions to ask.
@johntanchongmin · 1 year ago
@mohammadaqib4275 Thanks for the question. I personally do not perform my own RLHF; it is too costly. In my opinion, just doing the SFT step may already be enough for most use cases, and I typically just do that.
@mohammadaqib4275 · 1 year ago
Any resources in the form of a blog or guided project that you can suggest?
@johntanchongmin · 1 year ago
@mohammadaqib4275 You can try some of the Hugging Face or Weights & Biases implementations. Maybe take a look at StackLlama, which does RLHF on Llama (the non-2 version).
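To make that concrete, here is a minimal sketch of a single RLHF/PPO step using Hugging Face's TRL library (the classic 0.x PPOTrainer API, which may differ in newer releases; the gpt2 checkpoint and the constant reward are placeholders for a real policy and a trained reward model):

import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

# Trainable policy with a value head, plus a frozen reference copy
# that anchors the KL penalty.
model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

config = PPOConfig(batch_size=1, mini_batch_size=1)
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

# One query -> one sampled response.
query = tokenizer("Explain RLHF in one sentence:", return_tensors="pt").input_ids[0]
output = model.generate(query.unsqueeze(0), max_new_tokens=32, do_sample=True)
response = output[0][query.shape[0]:]  # keep only the generated continuation

# A real pipeline scores the response with a trained reward model;
# this constant is a stand-in for that score.
rewards = [torch.tensor(1.0)]

# One PPO optimization step on (query, response, reward).
stats = ppo_trainer.step([query], [response], rewards)

The StackLlama blog post walks through the same loop end to end, including training the reward model that this sketch stubs out.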
@present-bk2dh · 8 months ago
@mohammadaqib4275 Did you find any?
@1südtiroltechnik · 1 year ago
Now I get it, thank you for the video!
@teddysalas3590 · 1 year ago
I only found out today that there is a reinforcement learning course on Hugging Face. Will there be a reinforcement learning course on Hugging Face after September?
@muskduh · 1 year ago
Thanks for the video.
@akeshagarwal794 · 11 months ago
Can we fine-tune an encoder-decoder model like T5 with RLHF? If we can, please link some source code.
@nlarchive · 1 year ago
Great job! We need people to create content until AI does the content XD
@pariveshplayson · 6 months ago
Theta should be the learnable policy parameters of the language model, not of the reward model. By this point, the parameters of the reward function have already been learned.
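For context, the RL stage of RLHF (as in InstructGPT-style training) optimizes only the policy parameters theta, holding the learned reward model r_phi fixed:

\max_{\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}
  \left[ r_\phi(x, y) \right]
  - \beta \, \mathbb{D}_{\mathrm{KL}}\!\left[ \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \right]

So theta indexes the language-model policy pi_theta, while phi, the reward parameters, were fitted in the preceding preference-modeling stage.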
@abhinandanwadhwa2605 · 1 year ago
How could we add RLHF to our own LLM?
@stevenjordan5795 · 1 year ago
Would you recommend RLHF for Thoroughbred handicapping?
@meghanarb7724 · 1 year ago
Is ChatGPT safe from hacking and cyber threats?
@nitroyetevn · 1 year ago
I find the constant "If you have any questions, post them in the comments" ticker kind of distracting. Maybe just display it every 2 minutes or every 5 minutes.
@dlal8042 · 9 months ago
It would be better if you could show a small implementation.
@RD-AI-ROBOTICS · 1 year ago
You mentioned a ChatGPT paper that was going to be released the next day, i.e. 14 Dec 2022, but I have not been able to find it. Can anyone please point me to it? Thanks.
@parthshah4339 · 1 year ago
He was joking.
@FlipTheTables · 11 months ago
I ran across a video where someone was doing a breakdown of how it was built.
@serkhetreo2489 · 1 year ago
Please, what is the link for the Discord?
@st3ppenwolf · 1 year ago
I thought I was going to see code. Good talk, though.
@weiwuWonderfulLife · 1 year ago
The "write a comment" prompt at the bottom of this video is so annoying.
@KeqiChen-ds2co · 1 year ago
The Discord invitation is invalid now T_T
@preston748159263 · 9 months ago
"People are here to learn about language models." The zeitgeist of language as an insight into human behavior may be functional, but it does not give us a better understanding. Language is too far away from what's going on. Human 1 has to interpret perception, reason about it, incorporate it into their own mental model, articulate it, and communicate it; human 2 (or the AI model) has to perceive it, reason with it, incorporate it, and then produce an action. There is too much room for error, and it's a bold strategy because it's so far removed; it's simply convenient because we have a lot of text data. RL is unique, though, because it is based on operant conditioning, so there is some opportunity here that is not being taken advantage of. We need to go back to a cognitive perspective on AI; that's the only way we can better replicate and model human cognition, and we certainly can't create robust AI if we don't first understand our underlying cognitive mechanisms.
@preston748159263 · 9 months ago
TL;DR: RLHF seems to be a band-aid on a large wound caused by language-based AI.
@Shalaginov_com · 1 year ago
The first 17 minutes are a waste of time.
@franciscofreitas6695 · 1 year ago
Thank you bro
@xasopheno · 1 year ago
Thanks
@seeusoon07 · 8 months ago
Thanks for saving time
@indramal · 1 year ago
Oops, I missed it.
@doulaishamrashikhasan8425 · 1 year ago
You can still watch it lol
@indramal · 1 year ago
@doulaishamrashikhasan8425 But I cannot discuss it with others live, like during the online session.
@antonderoest9462 · 1 year ago
Please use a better microphone and add some damping to your room.
@dontwannabefound · 10 months ago
Too much hand-waving.