
Transformers: The best idea in AI | Andrej Karpathy and Lex Fridman 

Lex Clips
1.4M subscribers
385K views

Lex Fridman Podcast full episode: • Andrej Karpathy: Tesla...
Please support this podcast by checking out our sponsors:
- Eight Sleep: www.eightsleep... to get special savings
- BetterHelp: betterhelp.com... to get 10% off
- Fundrise: fundrise.com/lex
- Athletic Greens: athleticgreens... to get 1 month of fish oil
GUEST BIO:
Andrej Karpathy is a legendary AI researcher, engineer, and educator. He's the former director of AI at Tesla, a founding member of OpenAI, and an educator at Stanford.
PODCAST INFO:
Podcast website: lexfridman.com...
Apple Podcasts: apple.co/2lwqZIr
Spotify: spoti.fi/2nEwCF8
RSS: lexfridman.com...
Full episodes playlist: • Lex Fridman Podcast
Clips playlist: • Lex Fridman Podcast Clips
SOCIAL:
- Twitter: / lexfridman
- LinkedIn: / lexfridman
- Facebook: / lexfridman
- Instagram: / lexfridman
- Medium: / lexfridman
- Reddit: / lexfridman
- Support on Patreon: / lexfridman

Published: 30 Sep 2024

Comments: 243
@LexClips
@LexClips Год назад
Full podcast episode: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-cdiD-9MMpb0.html Lex Fridman podcast channel: ru-vid.com Guest bio: Andrej Karpathy is a legendary AI researcher, engineer, and educator. He's the former director of AI at Tesla, a founding member of OpenAI, and an educator at Stanford.
@fezkhanna6900
@fezkhanna6900 Год назад
I hope to get to see Jacob Devlin, or more relevantly Ashish Vaswani, on Lex Clips. I'd love to hear Jacob's foresight on the masking technique
@WALLACE9009
@WALLACE9009 Год назад
Please, interview Vaswani
@mauricemeijers7956
@mauricemeijers7956 Год назад
Andrej speaks at 1.5x speed and Lex, as always, at 3/4x. Yet, somehow they understand each other.
@Pixelarter
@Pixelarter Год назад
And I listen to both at 1.5x (2x is a bit too much to absorb the dense content)
@zinyang8213
@zinyang8213 Год назад
through a transformer
@jaiv
@jaiv Год назад
do you mean lex speaks at like 0.3/0.25x
@frkkful
@frkkful Год назад
golden comment
@WahranRai
@WahranRai Год назад
They used transformer !
@oleglevchenko5772
@oleglevchenko5772 6 месяцев назад
Why doesn't Lex invite an actual inventor of Transformers, e.g. Ashish Vaswani? All these people like Sam Altman and Andrej Karpathy are reaping the harvest of the invention in that paper, "Attention Is All You Need", yet its authors haven't been invited even once to Lex's talks.
@endingalaporte
@endingalaporte 4 дня назад
My guess is Altman and Karpathy are popular with the general public, have experience with industrial applications, and are more "marketable product"-focused, whereas Vaswani is a researcher and might be specialized in the technicalities. Nevertheless, I would love an interview with him :)
@totheknee
@totheknee Год назад
Damn. That last sentence. Transformers are so resilient that they haven't been touched in the past *FIVE YEARS* of AI! I don't think that idea can ever be overstated given how fast this thing is accelerating...
@MrMcSnuffyFluffy
@MrMcSnuffyFluffy Год назад
Optimus Prime would be proud.
@alexforget
@alexforget Год назад
Amazing how one paper can change the course of humanity. I like that kind of return on investment, let’s get more weird ambitious.
@tlz8884
@tlz8884 Год назад
I double checked if i was listening at 1.25x speed when Andrej was speaking
@abramswee
@abramswee Год назад
Need a phd in AI before I can understand what he is saying
@maxflattery968
@maxflattery968 5 месяцев назад
No, no one really understands why it works. For example, we know how an airplane wing works, and that's a theory verifiable through experiment. You will never get a detailed explanation of why these algorithms work; it's just full of jargon.
@diedforurwins
@diedforurwins Год назад
6:30 😂 imagine how fast this sounds to lex
@MyGroo
@MyGroo Год назад
I literally went to check if my playback speed got changed to 1.5x listening to this guy
@omarnomad
@omarnomad Год назад
2:18 Meme your way to greatness
@wasp082
@wasp082 11 месяцев назад
The "attention" name had already been around on other architectures in the past. It was common to see bidirectional recurrent neural networks with "attention" on the encoder side. That's where the name "Attention Is All You Need" comes from: the Transformer basically removes the need for a recurrent or sequential architecture.
@SMH1776
@SMH1776 Год назад
It's amazing to have a podcast where the host can hold their own with Kanye West in a manic state and also have serious conversations about state-of-the-art deep learning architectures. Lex is one of one.
@1anre
@1anre Год назад
The fact that he's from both parts of the world helps, I'd assume.
@kamalmanzukie
@kamalmanzukie Год назад
1/ n
@2ndfloorsongs
@2ndfloorsongs Год назад
Lex is one of one, two of one, one to one, and two to one all at the same time.
@totheknee
@totheknee Год назад
Okay, but tbh Kayne is a racist, incompetent d-bag. So the only people who couldn't "hold their own" would be even more incompetent rubes like Trump or Bill O'Reilly who cry their way into and out of every situation like the snowflakes they are.
@ZombieLincoln666
@ZombieLincoln666 4 месяца назад
He’s not asking any technical questions lol
@amarnamarpan
@amarnamarpan Год назад
Dr. Ashish Vaswani is a pioneer and nobody is talking about him. He is a scientist from Google Brain and the first author of the paper that introduced TRANSFORMERS, which are the backbone of all other recent models.
@GlagadineKapisce
@GlagadineKapisce Год назад
Agreed!
@dotnet364
@dotnet364 Год назад
It's political.
@GlagadineKapisce
@GlagadineKapisce Год назад
@@dotnet364 agreed
@dotnet364
@dotnet364 11 месяцев назад
Even Ilya admitted that "Attention Is All You Need" was the breakthrough. 2 hrs of that produced more results than 2 years of their own work. Now these guys are worth $90B because of Vaswani
@AnOzymandias
@AnOzymandias 9 месяцев назад
@@dotnet364 I'm not disagreeing, but when I read the paper I'm pretty sure it said the author order was decided randomly, so I think Vaswani just got lucky and was part of a super important team of researchers
@baqirhusain5652
@baqirhusain5652 Год назад
My professor Dr Sageeve Oore gave a very good intuition about residual connections. He told me that residual connections allow a network to learn the simplest possible function: no matter how many complex layers there are, we start by learning a linear function, and the complex layers add in non-linearity as needed to learn the true function. A fascinating advantage of this connection is that it provides great generalisation. (Don't know why, I just felt the need to share this.)
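(A minimal sketch of that intuition, assuming PyTorch; the module name and sizes below are illustrative, not anything from the podcast. With the residual branch zero-initialised, the block starts out as the identity map, and the nonlinear branch only adds complexity as training demands it.)

```python
import torch
import torch.nn as nn

class ResidualMLPBlock(nn.Module):
    """y = x + f(x): if f starts at zero, the block starts as the identity."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, dim),
        )
        # Zero-init the last layer so the residual branch contributes nothing at first.
        nn.init.zeros_(self.f[-1].weight)
        nn.init.zeros_(self.f[-1].bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.f(x)

x = torch.randn(4, 64)
block = ResidualMLPBlock(64, 256)
print(torch.allclose(block(x), x))  # True at initialization: the block is the identity map
```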
@xorenpetrosyan2879
@xorenpetrosyan2879 Год назад
Residual connections were first proposed as an elegant solution for training very deep networks. Before ResNets, the deepest networks researchers managed to train were around 20 layers; go deeper and training would become too unstable, and adding depth could even hurt accuracy. But ResNets enabled much more stable training of much deeper networks (up to 150 layers!) that generalised better. This was achieved by those linear residual connections, which send the gradient signal unchanged to the first layers of a very deep network, because older DNNs had trouble passing any useful signal that deep. Fascinating that this worked so well and turned out to be helpful not only in ConvNets (which the original ResNet was) but also in architectures that didn't even exist at the time (Transformers).
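(A rough illustration of that gradient-flow point, again assuming PyTorch; the tiny weight below just simulates a branch whose gradient has almost vanished, it isn't a real network.)

```python
import torch

x = torch.randn(8, requires_grad=True)
w = torch.full((8,), 1e-6)          # stand-in for a deep branch with a vanishing gradient

y_plain    = (w * x).sum()          # no skip connection
y_residual = (x + w * x).sum()      # skip connection: y = x + f(x)

g_plain,    = torch.autograd.grad(y_plain, x, retain_graph=True)
g_residual, = torch.autograd.grad(y_residual, x)

print(g_plain[:3])     # ~1e-6: almost no signal reaches x
print(g_residual[:3])  # ~1.0:  the identity path carries the gradient through untouched
```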
@ArtOfTheProblem
@ArtOfTheProblem Год назад
thanks for sharing. another question about the heads. would you agree that in simple terms a transformer "looks at everything at every step" (and absorbs what's relevant), and so on the one hand it's a very naive, brute-force approach, where if the model were learned from scratch it might do 'much less' (guess at what it needs and from where, etc.) to achieve sparseness among other things
@xorenpetrosyan2879
@xorenpetrosyan2879 Год назад
@@ArtOfTheProblem if you knew in advance what your model should look at, yes, but you don't, and the SOTA results Transformers produce are proof of that. It's a perfect example of The Bitter Lesson: more compute and data >>> hand-tuned features and architectures
@ArtOfTheProblem
@ArtOfTheProblem Год назад
@@xorenpetrosyan2879 makes sense,
@offchan
@offchan Год назад
Basically it's Occam's razor at work. Simple explanations are usually more generalizable than complex ones, when both fit the evidence.
@StasBar
@StasBar Год назад
The way this guy thinks and speaks reminds me of Vitalik Buterin. What do they have in common? High intelligence is not the only factor here.
@bmatichuk
@bmatichuk Год назад
Karpathy has some great insights. Transformers seem to solve the NN architecture problem without hyperparameter tuning. The "next" for transformers is going to be neurosymbolic computing, i.e. integrating logic with neural processing. Right now transformers have trouble with deep reasoning. It's remarkable that reasoning automatically arises in transformers based on pretext structure. I believe there is a deeper concept of AI waiting to be discovered. If the mechanism for auto-generated logic pathways in transformers could be discovered, then it could be scaled up to produce general AI.
@taowroland8697
@taowroland8697 Год назад
Kanye is not manic or crazy. He just saw a pattern and talked about it.
@alexforget
@alexforget Год назад
Doesn't excel at sequential reasoning? Our brain seems to be divided in two for that reason, allowing parallel processing on one side and sequential on the other, with very few connections in between. The two modes of thinking are antagonistic; they cannot coexist in the same structure, or they need to be only lightly connected so as not to confuse the other part. It's still a problem humans struggle with.
@marbin1069
@marbin1069 Год назад
They are able to reason and every new model (> size) seems to reason better. Right now, there is no need for neurosymbolic AI.
@Paul-rs4gd
@Paul-rs4gd Год назад
I have great hopes for Transformers. It seems like the forward pass is 'system 1' reasoning (intuitive perception/pattern recognition), but the autoregressive, sequential output from the decoder is like 'system 2' reasoning (Daniel Kahneman). The decoder produces a token, and then considers all the input plus all the output-so-far in order to produce the next token. This should be able to create a chain of reasoning. i.e. Look at the input facts plus all the conclusions so far in order to produce the next conclusion. Tokens may well be the basis for symbol-like reasoning too, but still in differentiable form.
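(A toy sketch of that decode loop in Python; `model` and `tokenizer` here are hypothetical placeholders rather than a real API. The point is just that each new token is produced by conditioning on the input plus everything generated so far.)

```python
def generate(model, tokenizer, prompt: str, max_new_tokens: int = 64) -> str:
    """Greedy autoregressive decoding: every step re-reads input + all output so far."""
    tokens = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        logits = model(tokens)                     # scores for the next token, given the whole sequence
        next_token = max(range(len(logits)), key=lambda i: logits[i])   # greedy pick
        tokens.append(next_token)
        if next_token == tokenizer.eos_token_id:   # stop once the model emits end-of-sequence
            break
    return tokenizer.decode(tokens)
```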
@bmatichuk
@bmatichuk Год назад
@@marbin1069 That is a very intriguing idea and one that is being tossed about in various ML circles, but I'm skeptical. Full logical reasoning requires thinking about variables, loops, conditionals and induction. Complex problem solving requires careful planning and step by step assessment. The Transformer architecture is not recursive, and its memory is limited. It's not clear to me how scale alone addresses this problem. There is also the problem of stored knowledge. Transformers generate knowledge at run-time, which means that the entire output is probabilistic. In some context you want your agent to look up knowledge in a database. Transformers can't do that right now.
@Halopend
@Halopend Год назад
Self-attention. Transforming. It's all about giving the AI more parameters to optimize what the important internal representations of the interconnections within the data itself are. We've supplied first-order interconnections. What about second order? Third? Or is that expected to be covered by the sliding-window technique itself? It would seem the more early representations we can add, the more of "the data's" complexity/nuance we can couple to. At the other end, the more we couple to the output, the closer to alignment we can achieve. But input/output are fuzzy concepts in a sliding-window technique. There is no temporal component to the information. The information is represented by large "thinking spaces" of word connections. It's somewhere between a CNN-like technique that parses certain subsections of the entire thing at once, and a fully connected space between all the inputs. That said, sliding is convenient, as it removes the hard limit on what can be generated and makes for an easy-to-understand parameter we can increase at fairly small cost to increase our ability to generate long-form representations exhibiting deeper-level nuance/accuracy. The ability to just change the size of the window and have the network adjust seems a fairly nice way to flexibly scale the models, though there is a "cost" to moving around, i.e. network stability, meaning you can only scale up or down so much at a time to maintain most of the knowledge gained from previous trainings.

Anyway, the key ingredient is that we purposefully encode the spatial information (of the words themselves) to the depth we desire. Or at least that's a possible extension. The next question, of course, is in which areas of representation we can supply more data that easily encodes, within the mathematics of the information, what we think is important to be represented (what isn't covered by the processes of the system itself; having the same thing represented in multiple ways, i.e. the data plus the system, is a path to overly complicated systems in terms of growth/addendums). The easiest path is to just represent it in the data itself, and patch it. But you can do stages of processing/filtering along multiple fronts and incorporate them into a larger model more easily, as long as the encodings are compatible (which I imagine will most greatly affect the growth and swappability of these systems once standardized). Ideally this is information that is further self-represented within the data itself. FFTs are a great approximation we can use to bridge continuous vs discrete knowledge. Though calculating them on word encodings feels a poor fit, we could break the "data signal" into an individually chosen subset of wavelengths. Note this doesn't help with the next-word-prediction "component" of the data representation, but it is a past-knowledge-based encoding that can be used in unison with the spatial/self-attention and parser encodings to represent the info (I'm actually not sure of the balance between spatial and self-attention, except that the importance of each token to the generation of each word relative to the previous word, along with possibly a higher order of interconnections between the tokens, is contained within the input stream). If it is higher order, then FFTs may already be represented and I've talked myself in a circle. I wonder what results dropouts tied to categorization would yield on the swappability of different components between systems? Or the ability to turn various bits and bobs on/off in a way tied to the data?
I think that's how one can understand the partial-derivative reverse-flow loss functions as well: by turning off all but one path at a time to split the parts considered, though that depends on the loss function being used. I imagine categorizing subsections of data to then split off into distinct areas would allow finer control over the representations of subsystems, to increase scores on specific tests without affecting other testing areas as much. That could be antithetical to AGI-style understanding, but it allows for field-specific interpretation of information in a sense. Heck, what if we encoded each word as its dictionary definition?
@ColinGrym
@ColinGrym Год назад
In a rarity for this channel, the title approaches clickbait. I will cede the point that it's 100% accurate to the conversation, but I'm bitterly disappointed that I didn't get to hear expert opinion about benevolent, shapeshifting AI platforms protecting sentient organics from the Megatrons of the universe.
@_PatrickO
@_PatrickO Год назад
This title has nothing to do with clickbait. They are talking about transformers. The scope of AI was even in the title. It is about as un-clickbaity as you can get.
@moajjem04
@moajjem04 Год назад
satire
@Skynet_the_AI
@Skynet_the_AI Год назад
Nice
@oldtools6089
@oldtools6089 Год назад
PRIME!!!! YOU NEEED ME!
@danparish1344
@danparish1344 7 месяцев назад
“Attention is all you need” is great. It’s like a book title that you can’t forget.
@jeff__w
@jeff__w Год назад
1:56 “I don’t think anyone used that kind of title before, right?” Well, maybe not as a title, but I can’t imagine that the authors of the paper were unaware of the lyric “Love is all you need” from The Beatles’ 1967 song “All You Need is Love.”
@mikeiavelli
@mikeiavelli Год назад
I am surprised that they could find it surprising, as there is a long tradition of "meme" titles for papers in computer science, some of which are now classics. E.g., off the top of my head:
- Lively linear Lisp: "look ma, no garbage!"
- Functional Programming with Bananas, Lenses, Envelopes and Barbed Wire
- A Short Cut to Deforestation
- Clowns to the left of me, jokers to the right (pearl): dissecting data structures
- I am not a number -- I am a free variable
There's also the "considered harmful" pattern, which originated from the classic paper 'GOTO Considered Harmful'. The opposing view was expressed in a paper titled "'GOTO Considered Harmful' Considered Harmful", and then a paper considering both views came along: "'GOTO Considered Harmful' Considered Harmful" Considered Harmful? Many papers now use the "[insert concept here] considered harmful" template to critique some concept in CS. "Attention is all you need" is in the same spirit. I like it.
@marcin8112
@marcin8112 Год назад
5:05 he lost me after series of blocks
@RyanAlba
@RyanAlba Год назад
Incredibly interesting and thought provoking. But still disappointed this wasn't about Optimus Prime
@aangeli702
@aangeli702 Год назад
Andrej's influence on the development of the field is so underrated. He's not only actively contributing academically (i.e. through research and co-founding OpenAI), but he also communicates ideas so well to the public (for free, by the way) that he not only helps others contribute academically to the field, but also encourages many people to get into it, simply because he manages to take an overwhelmingly complex topic (at least it used to be for me) such as the Transformer and strip it down to something that can be more easily digested. Or maybe that's just me, as my undergrad professor came nowhere near an explanation of Transformers as good and intuitive as Andrej's videos (don't get me wrong, [most] professors know their stuff very well, but Andrej is just on a whole other level).
@fearrp6777
@fearrp6777 Год назад
You have to remove his nuts from your mouth and breathe before you type.
@adicandra9940
@adicandra9940 9 месяцев назад
Can you point me to the video where Andrej talks about Transformers in depth? Or the other Andrej videos you mentioned. Thanks
@aangeli702
@aangeli702 9 месяцев назад
@@adicandra9940 ru-vid.com/group/PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ He's got more videos in his channel and also many videos on YT where he gives talks/lectures (e.g. a talk at Tesla)
@revimfadli4666
@revimfadli4666 9 месяцев назад
To think that this guy made evolving fish simulator at about 12...
@ZombieLincoln666
@ZombieLincoln666 4 месяца назад
underrated? tf are you talking about?
@dianes6245
@dianes6245 8 месяцев назад
Yes, the TF is great, however:
1. Next-word prediction only gets the most popular ideas, not the right ones.
2. Hinton wants to improve on backprop.
3. Hallucinations in = hallucinations out. Attribute that to Plato.
4. Will it get lost on real-world data, the info in matter?
5. How bad is the compute wall? Factorial? N**4 does not consider the parameters and data that must also increase.
6. How will the measurement problem and non-locality in physics affect AI?
7. Does the entropy of physics eat your models alive if you don't engineer them perfectly?
@ReflectionOcean
@ReflectionOcean 10 месяцев назад
- Understanding the Transformer architecture (0:28)
- Recognizing the convergence of different neural network architectures towards Transformers for multiple sensory modalities (0:38)
- Appreciating the Transformer's efficiency on modern hardware (0:57)
- Reflecting on the paper's title and its meme-like quality (1:58)
- Considering the expressive, optimizable, and efficient nature of Transformers (2:42)
- Discussing the learning process of short algorithms in Transformers and the stability of the architecture (4:56)
- Contemplating future discoveries and improvements in Transformers (7:38)
@arifulislamleeton
@arifulislamleeton Год назад
Let me introduce myself: my name is Ariful Islam Leeton, I'm a software developer and OpenAI developer
@MsStone-ue6ek
@MsStone-ue6ek 11 месяцев назад
Great interview. Engaging and dynamic. Thank you.
@shinskabob8464
@shinskabob8464 Год назад
Best idea seems to be to make sure the robot needs to be plugged in so it can't chase you
@JM-gw1sg
@JM-gw1sg Год назад
The model YOLO (you only look once) is also an example of a "meme title"
@ajitkirpekar4251
@ajitkirpekar4251 3 дня назад
I know one of the authors of Attention Is All You Need on a personal level. Even that person is rather cagey about talking about its impact on the field. I would agree - I don't think they collectively knew just what this was going to do to the industry.
@rajatavaghosh1913
@rajatavaghosh1913 7 месяцев назад
I read the paper and was wondering whether the Transformer was just another kind of LLM for generative tasks, since they refer to it as a model and compare it with other models at the end of the paper. But after watching this explanation by Andrej, I finally understood that it is a kind of architecture that learns the relationships within each sequence.
@rohitsaxena22
@rohitsaxena22 Год назад
We as a field are stuck in this local maximum called Transformers (for now!)
@xpkareem
@xpkareem Год назад
I looked up some of the words he said. I still have no idea what he is talking about.
@kazedcat
@kazedcat Год назад
Transformers are easier if you look at them as a math equation. But you need to know how to multiply matrices first.
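(For what it's worth, the core equation really is compact. A minimal numpy sketch, with made-up shapes, of scaled dot-product attention, softmax(Q K^T / sqrt(d)) V:)

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)              # (seq, seq): how much each token attends to every other token
    return softmax(scores, axis=-1) @ V        # weighted mix of value vectors

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 8)) for _ in range(3))
print(attention(Q, K, V).shape)                # (5, 8)
```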
@michaelfletcher940
@michaelfletcher940 Год назад
Pay this man his money (rounders)
@the_primal_instinct
@the_primal_instinct Год назад
Ah, so that's why they called their bot Optimus.
@michaelcalmeyerhentschel8304
well, this clip was great until it ended abruptly in mid-sentence. PLEASE! I signed up for clips, but I am frustrated to have missed a punchline here from Karpathy explaining how GPT architecture has been held back for 5 years. I have SO little time to now scan the full version! But at least you linked it. Your sponsor has left me sleepless.
@yusuf4433
@yusuf4433 9 месяцев назад
And he's known for explaining stuff in simple terms...
@ankitbatra270
@ankitbatra270 9 месяцев назад
that was probably one of the best high level explanations of the transformer i have come across
@leimococ
@leimococ Год назад
I had to check a few times if my video speed was not set to 1.5x
@oldtools6089
@oldtools6089 Год назад
Question: without the optimization algorithms designing the hardware manufacturing is there reason to believe that the fundamental nature of these mechanisms reflect the inherent medium of computation? Nope. I guess not. They're watching it.
@TheKaiB18
@TheKaiB18 2 месяца назад
Bro made me double check if my playback speed was still at 1x
@brunotvrs
@brunotvrs Год назад
Saving that one to see if I can understand wth Andrej is talking in a year.
@toompi1
@toompi1 3 месяца назад
hey
@guepardiez
@guepardiez Год назад
What is the best idea in AI and why is it transformers?
@imranq9241
@imranq9241 Год назад
The weird thing about transformers is that they're just so random. There is a huge space of potential architectures, but it's not clear why transformers are so good
@SMH1776
@SMH1776 Год назад
I feel like the success of the transformer stems from their improvement over LSTMs. There is a huge range of problems that can be solved with sequence-to-(something) and transformers do a better job of encoding sequences because the attention mechanism makes it easier to look further back (or forward) in the input sequence, and they're orders of magnitude more parallelizable. LSTMs being so bad makes transformers look so good. I agree 100% that there are bound to be much better architectures in the set of all possible unknown model architectures, and I'm excited to see the next leap forward.
@ArtOfTheProblem
@ArtOfTheProblem Год назад
i don't think it's random though, i think of it as "look at everything at every step" it's a kind of brute force in a way
@farrael004
@farrael004 Год назад
It's not random if you understand how to design neural networks. For example, why did they choose not to use a bias in the key/query single forward layer in the attention head? Because they want it to be a lookup table for what the inputs represent to each other, and that's exactly what a single forward layer does. ResNets were a recent development when the Transformer paper came out, so it makes sense for them to take advantage of their demonstrated capabilities. If you read enough neural architecture research, you'll start seeing patterns of how different components affect each other and start piecing together what the ideal architecture would look like. Then you can write a paper about it.
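(A rough PyTorch sketch of the bias-free projection idea described above, for a single head; the class name and dimensions are illustrative, not from the paper's code.)

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadAttention(nn.Module):
    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        # Plain linear projections with bias=False: attention scores come only
        # from learned dot products between token representations.
        self.q = nn.Linear(d_model, d_head, bias=False)
        self.k = nn.Linear(d_model, d_head, bias=False)
        self.v = nn.Linear(d_model, d_head, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # x: (seq, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.T / math.sqrt(k.shape[-1])               # (seq, seq)
        return F.softmax(scores, dim=-1) @ v                    # (seq, d_head)

x = torch.randn(10, 64)
print(SingleHeadAttention(64, 16)(x).shape)                     # torch.Size([10, 16])
```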
@sdsfgfhrfdgebsfv4556
@sdsfgfhrfdgebsfv4556 Год назад
@@farrael004 that's like saying that reading about computer chip architecture from the 80s will help you figure out what the architecture was like in the 90s or in a new generation
@clray123
@clray123 Год назад
It's not random, but the authors of the paper did a really shitty job explaining their design choices.
@ClaudioPascual
@ClaudioPascual Год назад
Transformers is not so good a movie
@johnlucich5026
@johnlucich5026 10 месяцев назад
DR ILYA IS TRANSFORMING CULTURE SOCIETY & WORLD ! ?
@123cache123
@123cache123 4 месяца назад
"English! Do you speak it?!"
@ashutoshzade5480
@ashutoshzade5480 9 месяцев назад
Great short video. What are some of the limitations of the transformer architecture you can think of?
@alexbabich2698
@alexbabich2698 3 месяца назад
Expensive to train, expensive to run, they need a lot of data, and it's hard to understand how they work, so they're potentially risky, especially because of the way they learn: it's not clear what they actually understand about the world
@jabowery
@jabowery Год назад
Recurrence is all you need.
@johnlucich5026
@johnlucich5026 10 месяцев назад
Most people can Walk academically-But-ILYA IS “RUNNING” INTELLECTUALLY ! ?
@johnlucich5026
@johnlucich5026 10 месяцев назад
Recall Football; SOMETIMES ONE NEEDS TO GO BACK A LITTLE TO RUN ALL THE WAY WAY FORWARD ! ?
@Skynet_the_AI
@Skynet_the_AI Год назад
My fav Lex clip byfar
@KGS922
@KGS922 Год назад
I'm not sure I understand the other guy
@johnlucich5026
@johnlucich5026 10 месяцев назад
Dr ILYA; consider that it’s not Altman- it’s “FAULT-MAN” ! ?
@ytubeanon
@ytubeanon 6 месяцев назад
5:04 while the effort is appreciated, I don't think the blocks analogy simplifies things enough to help the layman
@jetspalt9550
@jetspalt9550 Год назад
What absolute rubbish!! Transformers are either Autobots or Decepticons. I don’t know what he was talking about but get a clue dude!
@johnlucich5026
@johnlucich5026 10 месяцев назад
Someone should tell Altman; “IF YOU CANT RUN WITH BIG DOGS STAY ON PORCH”
@RalphDratman
@RalphDratman Год назад
What is meant by "making the.evaluation much bigger"? I do not understand "evaluation" in this context.
@generichuman_
@generichuman_ Год назад
"Don't touch the transformer"... good advise regardless of what kind of transformer you're taking about.
@surecom12
@surecom12 5 месяцев назад
@1:15 Well they were probably AWARE since they named the paper : "Attention is all you need" 🤭
@censura1210
@censura1210 10 месяцев назад
4:41 I think his brain hallucinated here, because sequential and serial are the same thing. I think he meant to say "in parallel".
@MellowMonkStudios
@MellowMonkStudios Год назад
I concur
@johnlucich5026
@johnlucich5026 10 месяцев назад
Altman TALKS-But-ILYA IS THE ONLY ONE “SAYING” ANYTHING ! ?
@johnlucich5026
@johnlucich5026 10 месяцев назад
If one gave Dr ILYA an Enima you would get an ALTMAN ! ?
@flwi
@flwi Год назад
Did anyone learn about transformers recently and can recommend a video that made it easy for them? I'm quite new to ML and would appreciate recommendations.
@boraoku
@boraoku Год назад
Transformers will land here soon - AK's RU-vid Playlist: m.ru-vid.com/group/PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ
@bourgeois_radical9027
@bourgeois_radical9027 Год назад
Sebastian Raschka. And his book Machine Learning with PyTorch.
@peterfireflylund
@peterfireflylund Год назад
This one did it for me -- by Leo Dirac (grandson of P.A.M. Dirac): ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-S27pHKBEp30.html
@blondisbarrios7454
@blondisbarrios7454 8 месяцев назад
Minute 6:31, your face when your friend tells you she made 25,000 frames for Ghibli in one week 😄
@markcounseling
@markcounseling Год назад
If I understand correctly, transformers = good. N'est pa?
@datasciyinfo5133
@datasciyinfo5133 Год назад
Best explanation of the essence of the Transformer architecture. I think the title is a red herring because it makes it more difficult to understand: you need much more than attention, you need all the smart tweaks. And it keeps making my mind think of Megatron from the movies, and I'm not sure what the relationship is, if any. I like "generalized differentiable program" as the best description of a Transformer model today. But that could change. The description is from Yann LeCun in the 2017-19 time period. Jennifer
@NicolaiCzempin
@NicolaiCzempin Год назад
Attention is all you need? No, 🎼🎶All You Need Is Love 🎶
@theeagle7054
@theeagle7054 5 месяцев назад
How many of you saw the entire video and understood nothing?? 😂😂😂
@gabrielepi.3208
@gabrielepi.3208 Год назад
So the last 5 years in AI are "just" dataset size changes while keeping the original Transformer architecture (almost)?
@Muslim-uc2bh
@Muslim-uc2bh Год назад
Does anyone have a simple definition for "general differentiable computer"?
@kazedcat
@kazedcat Год назад
There is no simple definition. Computers can be abstracted into a system of mathematical functions. If you have a general-purpose computer whose representative function can be differentiated using differential calculus, that is what he meant.
@strictnonconformist7369
@strictnonconformist7369 Год назад
The most interesting aspect of the Transformer architecture is that it can integrate as well as differentiate, in effect: it can take a given input and generate an output based on it, and that output can add to the size of the total (input + output), or it can be reduced in size if desired. In essence, it can rewrite equations in either direction, integrating and differentiating. It's also not a purely binary computation; probability plays into it, as does rounding error. When you think about it, when we feed code into a typical von Neumann architecture and have it work on data, the computation and output are based on the input (data) together with the transformation (instructions) generating data of some kind as output, where part of that data is the next flow of execution; the biggest difference is it isn't probabilistic, and it's much more discrete and deterministic in nature. Weirdly, it's the imprecision of representation in Transformers that is their greatest value in generalization, in that it enables things that are similar to have similar values and fit into patterns in a more generalized manner.
@SevenDeMagnus
@SevenDeMagnus Год назад
Coolness
@moritzsur997
@moritzsur997 Год назад
I loved them as well but after the first 2 films it really got boring
@priyankajain1691
@priyankajain1691 2 месяца назад
This is truly amazing. Many many thanks 😊
@Statevector
@Statevector Год назад
When he says the transformer is a general-purpose differentiable computer, does he mean in the sense that it is Turing complete?
@samario_torres
@samario_torres 7 месяцев назад
Do a podcast with him and George at the same time
@jzuni001
@jzuni001 Год назад
How many people thought they were going to talk about Megatron? Lmao
@AndreaDavidEdelman
@AndreaDavidEdelman Год назад
It’s self attention ⚠️
@thekittenfreakify
@thekittenfreakify Год назад
...i am dissapointed it's not about optimus prime
@michaelfletcher940
@michaelfletcher940 Год назад
Lex is the translator between genius ideas and us normal folk
@TheJordanK
@TheJordanK Год назад
I like your funny words magic man
@WALLACE9009
@WALLACE9009 Год назад
Any guess why Vaswani is ignored?
@PeterParker-mg7cb
@PeterParker-mg7cb Месяц назад
Idk this
@mayetesla
@mayetesla Год назад
Maye the force
@dancar2537
@dancar2537 Год назад
and the greatest thing about it is that you do not need it
@EmilyStewart-dh8gf
@EmilyStewart-dh8gf Год назад
Lex Fridman, you seem bored and uninterested. Holding your head up with your hand. You have Andrej in front of you, be professional. ;-)
@derinko
@derinko Год назад
Lex posted today that he's been suffering from depression since last year, so it might be that...
@cit0110
@cit0110 Год назад
make your own podcast and have your neck perpendicular to the ground
@MaximilianBerkmann
@MaximilianBerkmann Год назад
Long time no see badmephisto!
@axelanderson2030
@axelanderson2030 Год назад
Attention is truly all you need.
@chenwilliam5176
@chenwilliam5176 Год назад
I don't think so 😕
@JDNicoll
@JDNicoll Год назад
Robots in disguise....
@Skynet_the_AI
@Skynet_the_AI Год назад
Genius in disguise...
@vulkanosaure
@vulkanosaure Год назад
Father of robots in disguise
@markcounseling
@markcounseling Год назад
What?
@cmilkau
@cmilkau Год назад
The paper is called "Attention Is All You Need", and IMHO attention is what made transformers so successful, not its application in the transformer architecture.
@michaelfletcher940
@michaelfletcher940 Год назад
This guy is semi-Elon’s brain but articulate
@neofusionstylx
@neofusionstylx Год назад
Lol in terms of AI knowledge, this guy runs circles around Elon.
@SJ-eu7em
@SJ-eu7em Год назад
Yeah coz Elon sometimes talks a bit tarded way like he's lost in his brain...
@kazedcat
@kazedcat Год назад
This guy is smarter than Elon.
@TheChannelWithNoReason
@TheChannelWithNoReason Год назад
@@kazedcaton this subject
@TheEsotericProgrammer
@TheEsotericProgrammer 7 месяцев назад
​No he's just smarter... @@TheChannelWithNoReason
@techpiller2558
@techpiller2558 Год назад
"Once you understand the way broadly, you can see it in all things." -- This could be the name for the AGI paper, whoever will write that. Just give a credit for "AGI technology enthusiast TechPiller in a RU-vid comment". :)
@abir95571
@abir95571 Год назад
This makes me slightly freaked out that we don’t really understand what we’re developing….
@oldtools6089
@oldtools6089 Год назад
Relax: They're watching it. The optimization and transformer algorithms are just calculation patterns in a medium and life adapts.
@abir95571
@abir95571 Год назад
@@oldtools6089 I'm an ML engineer by profession. We don't understand why the attention mechanism works so well. We just observe its effects and agree on a common consensus
@randomusername6
@randomusername6 Год назад
​@@oldtools6089whatever helps you sleep at night
@oldtools6089
@oldtools6089 Год назад
@@abir95571 this correspondence has distracted me from my existential dread, so whatever you're doing is working. Thanks.
@lachainesophro8418
@lachainesophro8418 Год назад
Is this guy a robot...
@robbie_
@robbie_ Год назад
I studied neural networks in the 1990s as part of my undergrad degree in AI. I did a paper on simulated annealing... anyway, there's been a lot of "progress" since then, but also unfortunately a lot of drivel written and bollocks spoken. I might trust an ANN bot to hoover my carpet, within reason, but I wouldn't trust it to make me a cup of tea. It's not clear to me that I ever will, either. Something important is missing...
@jimj2683
@jimj2683 Год назад
Dude, you are a nobody. You are on the lower end of it-people. You have no idea where things are heading or even how the cutting edge today works. You are NOT an expert. Stop pretending to know anything.
@robbie_
@robbie_ Год назад
​@@jimj2683 The "cutting edge" today is still spectacularly stupid.
@SMH1776
@SMH1776 Год назад
Specialized transformer architectures can outperform humans in a variety of specific language and visual tasks, but there is nothing approaching a general AI that rivals biology. Even if we had the code (we don't), we lack the hardware to make it feasible.
@oldtools6089
@oldtools6089 Год назад
affordable precision engineering and perhaps very little materials-science innovation is all that really stands in the way of a fully articulated and self-powered humanoid AI using what we know exists today...using open-source off-the-shelf components for jank-points.
@oldtools6089
@oldtools6089 Год назад
@@SMH1776 The bizarre thing about life is that it adapts to its environment.
@ryansimon9855
@ryansimon9855 Год назад
i know absolutely nothing about transformer architecture or ai. only recently did i find these clips and they are very interesting to watch. i immediately recognized the arrangement and flow in the transformer model architecture as a fractal design already present in nature. it was an instant reaction i had to observing the transformer model architecture. this "recognition" of the design is likely only due to human pattern recognition trying to convince me i am seeing similar shapes, but interesting nonetheless when you look at it this way. not sure if this random youtube comment helps anyone or anyone will see it. im speaking from a purely basic understanding and google searches lol fun video to watch.
@vasylivanchukdoesntdeserveus
Transformers are more similar to compression algorithms: basically you have a certain encoder-decoder setup where the model attempts to continuously encode whatever is fed as its input into some multi-dimensional vector representation (vectors are basically lists of numbers of a fixed size) so it can derive it back unchanged. They then introduce noise at multiple points in the process to "keep the model on its toes", so it doesn't learn the exact composition of the data, but rather how it's meant to typically relate to itself. This leads into a very interesting proposition that training is ultimately about introducing noise, and inference (prediction) is about reducing that noise; the digital signal processing people were right all along, kind of!
@tadeohepperle
@tadeohepperle Год назад
I don't see how transformers relate to fractals in any way.
@ryansimon9855
@ryansimon9855 Год назад
@@tadeohepperle they dont at all. i typed that whole comment just to say they have a slightly similar shape lol
@ArtOfTheProblem
@ArtOfTheProblem Год назад
​@@vasylivanchukdoesntdeserveus interesting can you say more about where the noise is introduced ?
@ArtOfTheProblem
@ArtOfTheProblem Год назад
@@vasylivanchukdoesntdeserveus just the regularization aspect?
@dainionwest831
@dainionwest831 8 месяцев назад
Optimizing NNs based on just the output layer never really made sense to me; it's really cool knowing the transformer has solved that!
@schuylerhaussmann6877
@schuylerhaussmann6877 Год назад
Any top AI expert that is not a member of the tribe?
@vulkanosaure
@vulkanosaure Год назад
I think the human brain has 2 modes of working: an efficient mode (non-Turing-complete) and an expensive one (Turing-complete). As a simplified illustration, there are 2 ways of doing a simple multiplication: memorising it, or running the multiplication and counting on your fingers. Our brain is capable of both, and transformers are probably only able to do it the non-Turing-complete way. It still produces impressive results, which teaches us something about how much seemingly complex reasoning we can produce just by memorizing/interpolating data. But it has a fundamental limitation compared to the Turing-complete way of reasoning (which is why ChatGPT fails pretty fast with simple math problems). (This is just my take on the subject, and I obviously have big respect/admiration for what Karpathy achieved with ChatGPT and am excited to see the way it will evolve.)
@ycombinator765
@ycombinator765 Год назад
wow, I love this perspective.! by the way as I am typing this, I realize that typing out a comment via laptop is way more satisfying than just tapping on my android lol