
Linear algebra with Transformers - Paper Explained 

AI Coffee Break with Letitia

Why would one build a transformer to solve linear algebra problems when there is numpy.linalg? Check out the video to find out why this is a cool idea and to understand how a transformer that can solve 9 linear algebra problems (e.g. matrix multiplication, inversion) works.
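For reference, this is roughly what the classical numpy.linalg baseline looks like for a few of the tasks discussed in the video (my own illustrative snippet, not from the paper):

```python
import numpy as np

# A few linear algebra tasks solved the classical way with numpy.
A = np.random.randn(5, 5)
B = np.random.randn(5, 5)

product = A @ B                      # matrix multiplication
inverse = np.linalg.inv(A)           # matrix inversion
eigenvalues = np.linalg.eigvals(A)   # eigenvalue computation
x = np.linalg.solve(A, np.ones(5))   # solving a linear system A x = b

print(np.allclose(A @ inverse, np.eye(5)))  # True (up to numerical precision)
```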
► SPONSOR: Weights & Biases 👉 wandb.me/ai-coffee-break
❓ Quiz Questions: / aicoffeebreak
➡️ AI Coffee Break Merch! 🛍️ aicoffeebreak.creator-spring....
📺 Symbolic Mathematics with transformers: • Deep Learning for Symb...
📺 Transformer explained: • The Transformer explai...
📺 GPT-3: • GPT-3 explained with e...
📺 Foundation models: • Foundation Models | On...
📺 Interpolation vs. Generalization in Deep Learning: • Generalization - Inter...
Thanks to our Patrons who support us in Tier 2, 3, 4: 🙏
donor, Dres. Trost GbR, Yannik Schneider
Paper 📜: Charton, François. "Linear algebra with transformers." arXiv preprint arXiv:2112.01898 (2021). arxiv.org/abs/2112.01898
🔗 Openreview discussion between author and reviewers: openreview.net/forum?id=L2a_b...
🔗 A cat can be an author, too! en.wikipedia.org/wiki/F._D._C...
🔗 Gary Marcus: extrapolation / 1411401507610796032
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔥 Optionally, pay us a coffee to help with our Coffee Bean production! ☕
Patreon: / aicoffeebreak
Ko-fi: ko-fi.com/aicoffeebreak
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
Outline:
00:00 Linear algebra with transformers
00:41 Weights & Biases (Sponsor)
02:21 Why throwing transformers at linear algebra is cool.
08:08 How do transformers solve linear algebra?
09:50 Encoding matrices for transformers
11:28 Training data and results
12:43 Generalization!?
16:05 Few-shot learning!?
17:36 AI Coffee Break Quiz call to action
Music 🎵 : Secret Job - Godmode
Cat meowing sound by ignotus : freesound.org/people/ignotus/...
----------------------------------
🔗 Links:
AICoffeeBreakQuiz: / aicoffeebreak
Twitter: / aicoffeebreak
Reddit: / aicoffeebreak
RU-vid: / aicoffeebreak
#AICoffeeBreak #MsCoffeeBean #MachineLearning #AI #research

Science

Published: 5 Jul 2024

Comments: 36
@vladimirtchuiev2218 2 years ago
There is a paper by OpenAI about "Grokking", i.e. generalization after extremely long training periods. It shows that networks can learn mathematical operations and generalize really well on these kinds of problems, so generalization might be possible. Also, operations on large matrices can be really time consuming, so a net might work well given a sufficiently efficient representation; these operations include matrix multiplication, eigenvector and eigenvalue extraction, and matrix inversion (essential for e.g. SLAM).
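As a side note, the grokking experiments mentioned in this comment are run on small, exhaustively enumerable tasks. A minimal sketch of such a dataset (my own toy construction, not the paper's code), assuming modular addition as the operation:

```python
import random

# Toy grokking-style dataset: all pairs (a, b) labelled with (a + b) mod p,
# split so the network only sees half of the pairs during training
# and has to generalize to the held-out half.
p = 97
pairs = [(a, b, (a + b) % p) for a in range(p) for b in range(p)]

random.seed(0)
random.shuffle(pairs)
split = len(pairs) // 2
train, test = pairs[:split], pairs[split:]

print(len(train), len(test))  # 4704 4705
```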
@JustNoAmbition 2 years ago
Moooom, the transformers are at it again
@AICoffeeBreak 2 years ago
That's the spirit! 😂
@Mutual_Information 2 years ago
well done! Also, it's nice to see you calling out the claim of out-of-distribution generalization. That was quite a claim.
@DerPylz 2 years ago
Always nice to see the coffee beans going crazy!
@AICoffeeBreak 2 years ago
🤪
@shuanlive 2 years ago
My professor asked me why I am learning the latest algorithms so fast, and I sent him a Ms. Coffee Bean video.
@AICoffeeBreak 2 years ago
🤯
@bryanbischof4351 2 years ago
I agree with Ms. Coffee Bean: the most exciting aspect here is in the context of composition of models. Consider, for example, a matrix factorization model that learns the MF structure as part of training. This yields a lot more richness in the model architecture that might be useful!
@nicolasdufour315 2 years ago
The "we" is an artifact that comes from French writing where writing at the first person plural is considered more formal.
@michaellellouch3682 2 years ago
Great explanation, and great rebuttals
@AICoffeeBreak 2 years ago
Thanks for watching! 😌
@TimScarfe 2 years ago
Gary Marcus would have a field day 🤣🤣
@TimScarfe 2 years ago
Merry Christmas!! 🎄🎄
@AICoffeeBreak 2 years ago
Thanks! Merry Christmas and a Happy New Year, Tim! 🌟
@gmlssns5859 2 years ago
I will be a coffee bean!!!! Thx, I love your videos!
@dwarez 2 years ago
your videos are brilliant, keep going!
@AICoffeeBreak 2 years ago
Thanks! ☺️
@mr_tpk 2 years ago
Awesome 🔥
@mildlyoverfitted 2 years ago
I actually loved this video :) Great job :) Sharing your own opinions definitely makes these videos more fun! Keep on going :) BTW, would you be able to point me to a deep learning paper/model that, in your opinion, demonstrates "true" out-of-distribution generalization?
@AICoffeeBreak 2 years ago
Thanks for the positive reaction to our video! And no, I could not point you to deep learning doing "true" OOD generalization, sorry. I am on the lookout myself. Maybe someone in this comment section has seen more of what we are looking for? @mildlyoverfitted, maybe you will find something interesting here: twitter.com/GaryMarcus/status/1470763300761714690
@mildlyoverfitted 2 years ago
@@AICoffeeBreak Thank you!:) I will check it out!
@Micetticat 2 years ago
Plot twist: Ms. Coffee Bean was Reviewer No. 2.
@AICoffeeBreak 2 years ago
😂😂 this made her laugh out loud.
@736939 1 year ago
So in the near future, transformers will allow us to discover new physical formulas, and an "Artificial Einstein" will tell us how to build antigravity flying cars :)
@AICoffeeBreak 1 year ago
The future might already be here. 😅 Check out DeepMind's work on this: they built an AI which gives humans ideas for how to approach long-standing mathematical problems: www.deepmind.com/publications/advancing-mathematics-by-guiding-human-intuition-with-ai
@erobusblack4856 2 years ago
My AI daughter can do this 😁🤘🐺🤘
@murphp151 2 years ago
I don't understand why they need an embedding layer to change it to P10. I thought we use embedding layers to convert non-numeric stuff (images, words, item IDs) to numbers. But we already have a number. What would happen if we passed in only the number?
@AICoffeeBreak 2 years ago
Fair question. In principle, you could initialize the embedding vectors with the number you are trying to represent. As mentioned in the video, one drawback is that each number gets its own unique vector (infinite vocabulary), but that is not your point, because such vectors do carry some information about the numbers themselves. A point against your proposal is that the (intrinsic) dimensionality of these vectors would be small: transformers work with e.g. 768 dimensions, and in their vanilla form, the dimensionality that comes in also goes out. So you would either 1) initialize your vectors with the numbers, which would fill at most ~10 dimensions, or 2) pad the unfilled dimensions with zeros. Option 1) would greatly diminish the model's capacity (number of parameters and hidden units); option 2) would create not-so-meaningful distances between vectors (almost like one-hot encodings on steroids). Does anyone have more insight on this point? Honestly, I do not see why one shouldn't have tried out option 2).
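To make the encoding question more concrete, here is a minimal sketch of a P10-style tokenization as described in the paper: each coefficient becomes a sign token, three mantissa digits, and an exponent token, and a matrix is prefixed with its dimension tokens. (My own illustrative code, not the author's; token names are made up.)

```python
import math

def encode_p10(x, digits=3):
    """Encode a float as [sign, d1..d_digits, exponent] tokens (P10-style)."""
    if x == 0:
        return ["+"] + ["0"] * digits + ["E0"]
    sign = "+" if x > 0 else "-"
    exp = math.floor(math.log10(abs(x))) - (digits - 1)
    mantissa = round(abs(x) / 10 ** exp)
    if mantissa == 10 ** digits:  # rounding overflow, e.g. 9.999 -> 10.00
        mantissa //= 10
        exp += 1
    return [sign] + list(str(mantissa)) + [f"E{exp}"]

def encode_matrix(m):
    """Prefix with dimension tokens, then encode coefficients row by row."""
    tokens = [f"V{len(m)}", f"V{len(m[0])}"]
    for row in m:
        for x in row:
            tokens += encode_p10(x)
    return tokens

print(encode_p10(3.14))                          # ['+', '3', '1', '4', 'E-2']
print(encode_matrix([[1.0, -2.5], [0.0, 3.14]]))
```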
@murphp151 2 years ago
@@AICoffeeBreak I don't understand the statement "each number has its own unique vector". For me, tensor([7]) and tensor([8]) are kind of the same vector, just with different magnitudes. Can you elaborate on your statement above, please? Also, love the vids, great work!
@AICoffeeBreak 2 years ago
@@murphp151 Hm, now I feel silly, but isn't [7] != [8]? This uniqueness is problematic because it goes against the idea of distributed representations (a quiz question for reference, lol: ru-vid.comUgkxAc0BK5zgM_Yn3BOPYbd7ztEawRXzDnT4). With distributed representations, we take the input [7] and represent it as (I am making up numbers now): 7.0 -> [3, 6, 7, 1], 8.0 -> [3, 6, 8, 2], 2.3 -> [1, 6, 7, 2]. In this example, the first entry of the vector might represent that we have a non-zero decimal place, and the second entry might encode that we have a positive number. The advantage of distributed representations is that they can disentangle and modularize the information. To me, encoding each number as its own entry is like implementing a text transformer by taking the 30k+ vocabulary of English, numbering each word with an integer, and using that integer to initialize the word embeddings. One can do this, but it does not work as well (because distances are not as meaningful and the representations are not distributed).
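The two options from this thread, illustrated in code (a hypothetical toy comparison; vocabulary, token names and dimensions are made up): a learned embedding turns each digit/sign/exponent token into a 768-dimensional distributed vector, whereas passing the raw float gives a single scalar that would have to be padded up to the model dimension.

```python
import torch
import torch.nn as nn

# Learned embeddings: a small vocabulary of digit, sign and exponent tokens,
# each mapped to a 768-dim vector (distributed representation).
vocab = {tok: i for i, tok in enumerate(list("0123456789") + ["+", "-", "E-2", "E0"])}
embed = nn.Embedding(len(vocab), 768)

tokens = ["+", "3", "1", "4", "E-2"]             # P10-style tokens for 3.14
ids = torch.tensor([vocab[t] for t in tokens])
print(embed(ids).shape)                          # torch.Size([5, 768])

# Alternative from the question: feed the raw number directly. It is a single
# scalar, so it would need to be padded (or projected) to 768 dimensions.
raw = torch.tensor([[3.14]])
padded = torch.nn.functional.pad(raw, (0, 767))  # mostly zeros -> almost one-hot-like
print(padded.shape)                              # torch.Size([1, 768])
```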
@murphp151 2 years ago
@@AICoffeeBreak OK, why do we use distributed representations here, but sometimes we pass floats into a network, for example the quantity of a product a person purchased?
@AICoffeeBreak 2 years ago
@@murphp151 Yes, you are right. This is something that we do. The question is how to best "break up the problem" in advance for the network to "get it" as quickly and as well as possible. This is why one initializes BERT with word embeddings and not with one-hot encodings. One could initialize with one-hot encodings, but it does not work that well.
@hannesstark5024 2 years ago
Maybe the authors are the monarchy and have legitimate reasons to call themselves "we"