
Stanford CS25: V4 I Hyung Won Chung of OpenAI 

Stanford Online

April 11, 2024
Speaker: Hyung Won Chung, OpenAI
Shaping the Future of AI from the History of Transformer
AI is developing at such an overwhelming pace that it is hard to keep up. Instead of spending all our energy catching up with the latest development, I argue that we should study the change itself. The first step is to identify and understand the driving force behind the change. For AI, it is exponentially cheaper compute and the associated scaling. I will provide a highly opinionated view on the early history of Transformer architectures, focusing on what motivated each development and how each became less relevant with more compute. This analysis will help us connect the past and present in a unified perspective, which in turn makes it more manageable to project where the field is heading. Slides here: docs.google.com/presentation/...
0:00 Introduction
2:05 Identifying and understanding the dominant driving force behind AI.
15:18 Overview of Transformer architectures: encoder-decoder, encoder-only and decoder-only
23:29 Differences between encoder-decoder and decoder-only, and rationale for encoder-decoder’s additional structures from the perspective of scaling.
About the speaker:
Hyung Won Chung is a research scientist on the ChatGPT team at OpenAI. He has worked on various aspects of Large Language Models: pre-training, instruction fine-tuning, reinforcement learning from human feedback, reasoning, multilinguality, parallelism strategies, etc. Some of his notable work includes the scaling Flan papers (Flan-T5, Flan-PaLM) and T5X, the training framework used to train the PaLM language model. Before OpenAI, he was at Google Brain, and before that he received a PhD from MIT.
More about the course can be found here: web.stanford.edu/class/cs25/
View the entire CS25 Transformers United playlist: • Stanford CS25 - Transf...

Published: Jul 1, 2024

Comments: 48
@hyungwonchung5222 19 days ago
I very much enjoyed giving this lecture! Here is my summary:

AI is moving so fast that it's hard to keep up. Instead of spending all our energy catching up with the latest development, we should study the change itself. The first step is to identify and understand the dominant driving force behind the change. For AI, a single driving force stands out: exponentially cheaper compute and the scaling of progressively more end-to-end models to leverage that compute.

However, this doesn't mean we should blindly adopt the most end-to-end approach, because such an approach is simply infeasible. Instead, we should find an "optimal" structure to add given the current level of 1) compute, 2) data, 3) learning objectives, and 4) architectures. In other words, what is the most end-to-end structure that has just started to show signs of life? These are more scalable and eventually outperform those with more structure when scaled up. Later on, when one or more of those four factors improves (e.g. we get more compute or find a more scalable architecture), we should revisit the structures we added and remove those that hinder further scaling. Repeat this over and over. As a community we love adding structures but care a lot less about removing them. We need to do more cleanup.

In this lecture, I use the early history of the Transformer architecture as a running example of what structures made sense to add in the past, and why they are less relevant now. I find comparing the encoder-decoder and decoder-only architectures highly informative. For example, the encoder-decoder has a structure where input and output are handled by separate parameters, whereas the decoder-only uses shared parameters for both. Having separate parameters was natural when the Transformer was first introduced with translation as the main evaluation task: the input is in one language and the output is in another. Modern language models used in multi-turn chat interfaces make this assumption awkward. The output of the current turn becomes the input of the next turn. Why treat them separately?

Going through examples like this, my hope is that you will be able to view seemingly overwhelming AI advances from a unified perspective, and from that be able to see where the field is heading. If more of us develop such a unified perspective, we can better leverage the incredible exponential driving force!
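To make the parameter-sharing point in the comment above concrete, here is a minimal PyTorch sketch. It is my own illustration, not code from the lecture; all module choices, sizes, and names are assumptions. It contrasts an encoder-decoder, where input and output go through separate parameter stacks linked by cross-attention, with a decoder-only model, where input and target are simply concatenated and processed by one shared stack under a causal mask.

import torch
import torch.nn as nn

d_model, nhead, vocab = 64, 4, 1000
emb = nn.Embedding(vocab, d_model)

# Encoder-decoder: separate parameter stacks for the input (encoder) and the
# output (decoder), tied together by cross-attention, as in the original
# translation setup.
enc_dec = nn.Transformer(d_model=d_model, nhead=nhead,
                         num_encoder_layers=2, num_decoder_layers=2,
                         batch_first=True)
src = emb(torch.randint(0, vocab, (1, 10)))   # the "input" turn
tgt = emb(torch.randint(0, vocab, (1, 7)))    # the "output" turn
out_ed = enc_dec(src, tgt)                    # decoder attends to the final encoder layer

# Decoder-only: one shared stack; input and output form a single concatenated
# sequence, so this turn's output can become next turn's input with no special casing.
layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
decoder_only = nn.TransformerEncoder(layer, num_layers=2)
seq = torch.cat([src, tgt], dim=1)            # concatenate input and target
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
out_do = decoder_only(seq, mask=causal_mask)

print(out_ed.shape, out_do.shape)             # torch.Size([1, 7, 64]) torch.Size([1, 17, 64])

The specific modules matter less than the parameter accounting: the decoder-only variant has a single set of weights that sees both the input and the output, which is what a multi-turn chat format naturally wants.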
@s.7980 18 days ago
Thank you so much for giving us this lecture, and for enabling us to think from this new perspective!
@frankzheng5221 18 days ago
Nice of u
@shifting6885 18 days ago
very nice explanation, thx a lot!
@ansha2221 18 days ago
Amazing Content! Thank you
@shih-binshih9889 15 days ago
tks a lotssss!!
@kitgary 1 day ago
There are so many geniuses in this field! Really amazing!
@user-qt8pb7pj6f 2 days ago
Wow, great lecture, I really enjoyed it.
@junxunpei 16 days ago
Nice talk. Before this talk, I was confused about the scaling law and the design of GPT. Now I understand the source of this wonderful work.
@g111an 20 days ago
Thanks for giving me a new direction to think in. Learnt something new.
@labsanta 20 days ago
Short Summary for "Stanford CS25: V4 I Hyung Won Chung of OpenAI" ("History and Future of Transformers in AI: Lessons from the Early Days | Stanford CS25 Lecture")
00:07 Hyung Won Chung works on large language models and training frameworks at OpenAI.
02:31 Studying change to understand the future trajectory
07:00 Exponentially cheaper compute is driving AI research
09:25 Challenges in modeling human thinking for AI
13:43 AI research heavily relies on exponentially cheaper compute and the associated scaling up.
15:59 Understanding the Transformer as a sequence model and its interaction mechanism
19:58 Explanation of the cross-attention mechanism
21:55 Decoder-only architecture simplifies sequence generation
25:41 Comparing differences between the decoder-only and encoder-decoder architectures
27:38 Hyung Won Chung discusses the evolution of language models.
31:33 Deep learning hierarchical representation learning discussed
33:34 Comparison between bidirectional and unidirectional fine-tuning for chat applications
---------------------------------
Detailed Summary:
00:07 Hyung Won Chung works on large language models and training frameworks at OpenAI.
- He has worked on various aspects of large language models, including pre-training, instruction fine-tuning, reinforcement learning from human feedback, and reasoning.
- He has also been involved in notable works such as the scaling Flan papers (Flan-T5, Flan-PaLM) and T5X, the training framework used to train the PaLM language model.
02:31 Studying change to understand the future trajectory
- Identifying the dominant driving forces behind the change
- Predicting the future trajectory based on understanding the driving force
07:00 Exponentially cheaper compute is driving AI research
- Compute costs decrease every five years, and this dominates the direction of AI research
- Machines are being taught to think in a general sense thanks to cost-effective computing
09:25 Challenges in modeling human thinking for AI
- Attempting to model human thinking without understanding it is a fundamental flaw.
- AI research has shifted toward scaling up with weaker modeling assumptions and more data.
13:43 AI research heavily relies on exponentially cheaper compute and the associated scaling up.
- The current AI research paradigm is learning-based, letting models choose how they learn, which initially looks chaotic but ultimately improves with more compute.
- The rest of the discussion focuses on the driving force of exponentially cheaper compute and analyzes historical decisions and structures in the Transformer architecture.
15:59 Understanding the Transformer as a sequence model and its interaction mechanism
- A Transformer is a sequence model that represents interactions between sequence elements using dot products.
- The Transformer encoder-decoder architecture is used for tasks like machine translation, encoding input sequences into dense vectors.
19:58 Explanation of the cross-attention mechanism
- The decoder attends to the output of the encoder layers
- The encoder-only architecture is a simplification for specific NLP tasks like sentiment analysis
21:55 Decoder-only architecture simplifies sequence generation
- A decoder-only architecture can be used for supervised learning by concatenating the input with the target
- The self-attention mechanism serves both cross-attention and within-sequence learning, sharing parameters between input and target sequences
25:41 Comparing differences between the decoder-only and encoder-decoder architectures
- Every decoder layer attends to the same (final) layer representation of the encoder
- The encoder-decoder architecture has additional built-in structures compared to the decoder-only architecture
27:38 Hyung Won Chung discusses the evolution of language models.
- Language models have evolved from simple translation tasks to learning broader knowledge.
- Fine-tuning pre-trained models on specific datasets can significantly improve performance.
31:33 Deep learning hierarchical representation learning discussed
- Different levels of information are encoded in the bottom and top layers of deep neural nets
- Questioning the necessity of bidirectional input attention in encoders and decoders
33:34 Comparison between bidirectional and unidirectional fine-tuning for chat applications
- Bidirectional fine-tuning poses engineering challenges for multi-turn chat applications, requiring re-encoding at each turn.
- Unidirectional fine-tuning is more efficient, as it eliminates the need for re-encoding at every turn, making it suitable for modern conversational interfaces.
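The summary above describes the Transformer's core interaction mechanism as dot products between sequence elements, with cross-attention letting the decoder attend to the encoder's output. A minimal PyTorch sketch of that idea (my own illustration under assumed shapes, not code from the lecture):

import torch
import torch.nn.functional as F

def attention(q, k, v, causal=False):
    # q: (T, d); k, v: (S, d). Scores are pairwise dot products between sequence elements.
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    if causal:
        # decoder-style self-attention: each position may only look at the past
        mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

enc_out = torch.randn(10, 64)   # pretend final-layer encoder output (source length 10)
dec_h = torch.randn(7, 64)      # pretend decoder hidden states (target length 7)

self_attn = attention(dec_h, dec_h, dec_h, causal=True)   # attention within the target
cross_attn = attention(dec_h, enc_out, enc_out)           # decoder attends to encoder output
print(self_attn.shape, cross_attn.shape)                  # torch.Size([7, 64]) torch.Size([7, 64])

The same routine covers both cases; only the source of the keys and values changes, which is why a decoder-only model can fold "cross-attention" into ordinary self-attention over the concatenated sequence.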
@luqmanmalik9428 20 days ago
Thank you for these videos. I am learning generative AI and LLMs, and these videos are so helpful ❤
@hayatisschon 20 days ago
Good lecture and insights!
@AoibhinnMcCarthy 20 days ago
Great guest lecture
@kylev.8248 19 days ago
Thank you so much
@claudioagmfilho 13 days ago
🇧🇷🇧🇷🇧🇷🇧🇷👏🏻, Thanks for the info! A little complex for me, but we keep going!
@_rd_kocaman 5 days ago
I keep watching this. This is my eleventh time.
@puahaha07 18 days ago
Fascinating! Really curious what kind of Q&A was exchanged in the classroom after this presentation.
@user-lb2gu7ih5e 9 days ago
By YouSum Live
00:02:00 Importance of studying change itself
00:03:34 Predicting future trajectory in AI research
00:08:01 Impact of exponentially cheaper compute power
00:10:10 Balancing structure and freedom in AI models
00:14:46 Historical analysis of Transformer architecture
00:17:29 Encoder-decoder architecture in Transformers
00:19:56 Cross-attention mechanism between decoder and encoder
00:20:10 All decoder layers attend to final encoder layer
00:20:50 Transition from sequence-to-sequence to classification labels
00:21:30 Simplifying problems for performance gains
00:22:24 Decoder-only architecture for supervised learning
00:22:59 Self-attention mechanism handling cross-attention
00:23:13 Sharing parameters between input and target sequences
00:24:03 Encoder-decoder vs. decoder-only architecture comparison
00:26:59 Revisiting assumptions in architecture design
00:33:10 Bidirectional vs. unidirectional attention necessity
00:35:16 Impact of scaling efforts on AI research
@aniketsakpal4969 15 days ago
Amazing!
@ShowRisk 8 days ago
I keep watching this. This is my fourth time.
@bayptr 19 days ago
Awesome lecture and interesting insights! One small remark: the green and red lines in the Performance vs. Compute graph should probably be monotonically increasing.
@varshneydevansh 15 days ago
He is so well organized in his thoughts, and philosophical too. The way he connected the dropping pen and gravity (the force) to AI (and linear algebra) was great.
@sucim 16 days ago
Good talk, but too short. I love the premise of analysing the rate of change! Unfortunately the short time only permitted the observation of one particular change; it would be great to observe the changes in other details over time (architectural, but also hardware and infra like FlashAttention), as well as more historical changes (depth/ResNets, RNN -> Transformers), and then use this library of changes to make some predictions about possible/likely future directions.
@pratyushparashar1736 12 days ago
Amazing talk! I was wondering why the field has moved closer to decoder-only models lately and whether there's an explanation for it.
@vvc354 18 hours ago
To think that Korea has talent like this!
@xhkaaaa 19 days ago
Go K-Bro!!
@ryan2897 19 days ago
I wonder if GPT-4o is a decoder only model with causal (uni-directional) attention 🤔
@lele_with_dots 15 days ago
Less structure; it is just a huge MLP
@GatherVerse 18 days ago
You should invite Christopher Lafayette to speak
@MJFloof 18 days ago
Awesome
@sortof3337 14 days ago
Okay, who else thought: man named Hyung Won Chung of OpenAI. :D
@pierce8308 19 days ago
I find this talk a bit unsatisfactory. He mentions how, for encoder-decoder models, the decoder only attends to the last encoder layer, and how we treat input and output separately in the encoder-decoder. However, that's not the point of encoder-decoder models at all, right? It's just that the encoder-decoder model has an intermediate encoder objective (to represent the input), that's all. The decoder attending to only the last layer, or separating input and output, is just how the original Transformer did it. Clearly it's possible to attend to layer-wise encodings instead of only the last-layer encodings, just as an example. It's also possible to mimic decoder-style generation by adding new input to the encoder rather than to the decoder.

I would have really liked some experiments, even toy ones, because as presented it's unconvincing. Specifically, he mentions a couple of times that the encoder's final layer is an information bottleneck, but then just attend to layer-wise embeddings if you want, or put an MLP on top of the encoder's last states. I'd argue we are putting more structure into the "decoder-only" model (by which I mean a causal-attention decoder, which is what he describes), because causal attention restricts the model to attend only to the past, both during training and inference, even for the part of the output that has already been generated.
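For what it's worth, the layer-wise alternative raised in the comment above is easy to state in code. A toy PyTorch sketch (my own assumption about what "attend to layer-wise encodings" could look like, not anything from the talk or the comment): decoder layer i cross-attends to encoder layer i's output instead of every decoder layer attending to the final encoder layer.

import torch
import torch.nn as nn

d_model, nhead, n_layers = 64, 4, 3
enc_layers = nn.ModuleList([nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
                            for _ in range(n_layers)])
dec_layers = nn.ModuleList([nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
                            for _ in range(n_layers)])

src = torch.randn(1, 10, d_model)   # already-embedded source sequence
tgt = torch.randn(1, 7, d_model)    # already-embedded target sequence

# Keep every encoder layer's output, not just the last one.
enc_states, h = [], src
for layer in enc_layers:
    h = layer(h)
    enc_states.append(h)

# Pair decoder layer i with encoder layer i as its cross-attention memory.
causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
y = tgt
for dec, memory in zip(dec_layers, enc_states):
    y = dec(y, memory, tgt_mask=causal)
print(y.shape)   # torch.Size([1, 7, 64])

Whether this scales better is an empirical question; the point is only that "which encoder states the decoder sees" is a design choice rather than something forced by the encoder-decoder split.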
@s8x. 19 days ago
this guy's stats are insane 😂
@DivyanshGoyal0506 14 days ago
Where is that shirt from? Nice white shirt.
@gwonchanjasonyoon8087 11 days ago
No PyTorch embedding?
@123jay34 16 days ago
Yep. I’m lost now lol
@misterbeach8826 14 days ago
Nice talk, but physical analogies such as the one at 6:30 are... rather naive and high-school level. He should have focused only on the AI details.
@matheussmoraess 2 days ago
What’s your favorite part?
@niyui8 20 days ago
buy nvda!
@robertthallium6883 12 days ago
show us the git ffs
@AIWorks2040 20 days ago
I keep watching this. This is my fourth time.
@KSSE_Engineer 5 days ago
😂 bad learner.