
DeBERTa: Decoding-enhanced BERT with Disentangled Attention (Machine Learning Paper Explained) 

Yannic Kilcher
263K subscribers
20K views

Published: 5 Oct 2024

Comments: 85
@anshul5243 3 years ago
The old format seemed better, mainly because of the space wastage in this one. The title seems redundant, since the YouTube video title already has the name of the paper, and the logo could be better off as a watermark in a smaller size.
@randomisedrandomness 3 years ago
I don't understand most of your videos, yet I keep watching them.
@anonymous6713 3 years ago
hahaha
@timdernedde993 3 years ago
The old layout was better, especially as a mobile user, where the screen is smaller and there is a completely unnecessary black bar on the right.
@G12GilbertProduction 3 years ago
And so sumptuous.
@rbain16 3 years ago
I came to the comments to say something similar. Not on mobile, but I'd still rather not have screen space taken up by your Twitter pic (what if it changes, too?).
@willemwestra 3 years ago
Hi Yannic, absolutely love your videos, but I find the new recording setup quite a bit worse. The simple paper-only setup was very nice, clean and distraction-free. Moreover, the font rendering is also quite a bit worse. It varied throughout your videos, possibly because you switched software, so just pick the editor and recording program combination that gives the best PDF rendering quality. Distraction-free viewing, crystal-clear rendering and a good microphone are the things I appreciate the most. Your voice and microphone are great, but I long for the old clean and crisp setup :)
@YannicKilcher 3 years ago
Thanks a lot :)
@CosmiaNebula 2 years ago
Unlike what you stated, the positional encoding vectors do not get added to the token encoding vectors. The new token encoding vectors after each attention layer are merely a weighted sum of the previous token encoding vectors. See the paper's Equation (4).
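To make the comment concrete, here is an illustrative NumPy sketch of the paper's disentangled attention score (Equation (4)); the weight names, dimensions, and random initialization are made up for the example, and the real implementation differs in details (multiple heads, shared relative embeddings, vectorized indexing):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d, k = 4, 8, 2                   # sequence length, hidden size, max relative distance
H = rng.normal(size=(n, d))         # content states from the previous layer
P = rng.normal(size=(2 * k, d))     # relative-position embeddings

# Hypothetical single-head projection matrices (no biases).
Wqc, Wkc, Wvc, Wqr, Wkr = (rng.normal(size=(d, d)) for _ in range(5))

Qc, Kc, Vc = H @ Wqc, H @ Wkc, H @ Wvc   # content projections
Qr, Kr = P @ Wqr, P @ Wkr                # position projections

def delta(i, j):
    # clipped relative distance, mapped into [0, 2k)
    return int(np.clip(i - j, -k, k - 1)) + k

A = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        A[i, j] = (Qc[i] @ Kc[j]                 # content-to-content
                   + Qc[i] @ Kr[delta(i, j)]     # content-to-position
                   + Kc[j] @ Qr[delta(j, i)])    # position-to-content

# Position enters only through the attention weights; the values are content-only,
# so the output is a weighted sum of content vectors, as the comment says.
out = softmax(A / np.sqrt(3 * d)) @ Vc
```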
@quebono100 3 years ago
I like the old setting more; now it's smaller.
@adamrak7560 3 years ago
I always feed the positional information before every attention stage. That seemed better, and always converged faster for me.
@haukurpalljonsson8233 3 years ago
The positional information does not leak into later layers, at least not directly. The positional information is only in the attention, which is then softmaxed and multiplied with the content information.
@avatar098 3 years ago
Thank you so much for doing these videos. Helps me keep current with NLP!
@rohankashyap2252 3 years ago
Really awesome YouTube channel, lucky to get access to such great content. Thanks a lot!
@null4598 2 years ago
Cool. You make it easy. Thank you, Yannic.
@nghiapham1632 1 year ago
Thanks for your great explanation.
@biesseti 3 years ago
Found this video and subscribed. You do it well.
@alpers.2123 3 years ago
How much do you practice to pronounce the author names? :)
@burakyildiz8921 3 years ago
It seems that he made it up :)
@lloydgreenwald954 1 year ago
Very well done.
@rezas2626 3 years ago
Awesome video! Do RANDOM FEATURE ATTENTION next please!
@CppExpedition 1 year ago
BRILLIANT! :)
@etiennetiennetienne 3 years ago
I am not sure that position being fed at the first layer means their architecture already "agglomerates" position. Values are produced only from content, but are "weighted" by the hybrid attention. In itself a value is just a clever mix of other values, but the positional encoding is not really part of the vector itself, as it would be with direct summation or concatenation.
@sergiomanuel2206 3 years ago
Hello! First of all, thanks for the video. I don't like the new setup; the image takes up a lot of screen space, although the title is okay.
@andres_pq 3 years ago
Next do a video about "A straight forward framework for Video Retrieval using CLIP" 👀
@dr.mikeybee 3 years ago
Yannic is all you need!
@mschnell75 3 years ago
Why isn't this called ERNIE?
@MehrdadNsr 3 years ago
I think we already have a model named ERNIE!
@peterrobinson7748 3 years ago
Because then it'll rhyme with BERNIE, one of the greatest enemies of America.
@ChlorieHCl 3 years ago
Because there're already at least 2 models named ERNIE...
@Kram1032 3 years ago
3:13 OMG so that's why I am hungry!
@frenchmarty7446 2 years ago
I'm not sure feeding information into the model at the beginning is necessarily better than at the end. Like you said yourself, the model would have to learn how to propagate that information through. That might be more of a bottleneck than just waiting until the end. There might also be a useful inductive bias here that's close to how humans read (you don't read a word and keep both its relative and absolute position in mind).
@cerebralm 3 years ago
The only thing I didn't like about the old layout was that it rendered PDFs in light mode. Not sure if there's a good way to run a vote on YouTube to see which of your audience prefers light mode and which would prefer dark mode, but that would be the only thing I would change if it was up to me :)
@susantaghosh504 2 years ago
Awesome
@drozen214 3 years ago
This paper makes me wonder whether we really need to use a whole vector to represent a position
@frenchmarty7446 2 years ago
What is the alternative?
@mathematicalninja2756 3 years ago
Every day we arrive at the future
@herp_derpingson 3 years ago
3:25 THICC vector :)
10:10 Yeah, this addition thing always felt dirty in the original "Attention is all you need" paper. I am glad I was not the only one who felt so.
14:13 Never mind, we end up adding them anyways, just with extra steps.
24:11 Is P learnt? Are we using the same P for all languages? There are two main types of languages, subject-verb-object languages and subject-object-verb languages. I don't think we should use the same learnt values of P for all languages, as position works completely differently in the two types.
35:30 Never mind, we end up using absolute positions anyways.
41:35 Fermat's last theorem: "I could prove it but I don't have enough battery"
I thought you accidentally recorded your video in the wrong aspect ratio LOL
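The "addition thing" in the original Transformer sums position and content into one shared vector, which is exactly what DeBERTa avoids. A minimal sketch of those classic sinusoidal encodings (dimensions chosen arbitrarily for the example):

```python
import numpy as np

def sinusoidal_pe(n, d):
    # "Attention is all you need" encodings: sin/cos at geometric frequencies
    pos = np.arange(n)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d))
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)   # even dims: sine
    pe[:, 1::2] = np.cos(angles)   # odd dims: cosine
    return pe

tokens = np.random.default_rng(0).normal(size=(16, 64))  # token embeddings
x = tokens + sinusoidal_pe(16, 64)  # position is summed into the content vector
```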
@Hank-y4u 3 years ago
Hi Yannic, big fan here. Would you make a video about Meta Pseudo Labels?
@MrJaggy123 3 years ago
So GLUE was deprecated when submissions surpassed human performance. This paper's submission has done the same for SuperGLUE (alongside another submission which also does so). Is it time for a new benchmark again? What are your thoughts on what "the benchmark after SuperGLUE" would look like?
@sandraviknander7898 3 years ago
If content and position were truly disentangled all the way through the network, how would the network be able to learn the transformations to the positional vectors it needs to route context information? 🤔
@sajjadayobi688 3 years ago
from youtube import Yannic
paper = 'any complex architect'
easy_to_learn = Yannic(paper)
@andres_pq 3 years ago
Does anyone know an advanced PyTorch course? One that includes things like creating custom layers, custom training loops and handling weird stuff. I have some research ideas that I don't know how to implement.
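For the "custom layers and custom training loops" part, a toy sketch of the pattern is below; the layer (`ScaledLinear`), the task, and all hyperparameters are invented for illustration, not from any course:

```python
import torch
from torch import nn

# A hypothetical custom layer: linear map with one extra learnable global scale.
class ScaledLinear(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.02)
        self.scale = nn.Parameter(torch.ones(()))

    def forward(self, x):
        return (x @ self.weight.T) * self.scale

torch.manual_seed(0)
model = ScaledLinear(4, 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
x = torch.randn(256, 4)
y = x.sum(dim=1, keepdim=True)  # toy target: sum of the inputs

# A fully manual training loop: no Trainer, no callbacks.
for _ in range(500):
    opt.zero_grad()
    loss = ((model(x) - y) ** 2).mean()
    loss.backward()
    opt.step()
```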
@snippletrap 3 years ago
FastAI. The second half of the course is all custom implementations.
@andres_pq 3 years ago
@@snippletrap thanks
@谢安-k6t 3 years ago
@@snippletrap I was using the fast.ai framework about two years ago but quit because the framework was harder to customize than PyTorch and has a bad API. Do you think its course is worth learning?
@snippletrap 3 years ago
@@谢安-k6t Yes, I agree, I prefer vanilla PyTorch. I don't use the FastAI library because it's difficult to read the source when most functions rely on callbacks. I still highly recommend the course, you will learn a lot.
@谢安-k6t 3 years ago
@@snippletrap Get it, thanks a lot.
@yaaank6725 3 years ago
Why is the Flash, traveling back through time, talking about a paper from half a year ago?
@anonymous6713 3 years ago
So why did you say in the beginning that the "worst" case is a disentangled embedding (half for position, half for content), when this paper proposes exactly the disentangled one?
@frenchmarty7446 2 years ago
He meant the worst case would be the model learning its own disentangled embedding, which would mean some chunk of the "content" vector is being occupied by position information.
@kimchi_taco 3 years ago
Disentangled attention is already handled by Transformer-XL, which introduces relative positional embeddings. In my opinion, there is no new contribution here.
@G12GilbertProduction 3 years ago
New shade of BERT v2, but more metronomical.
@Dynidittez 3 years ago
Hi, looking at the models, they seem to have normal versions and versions fine-tuned on MNLI. Do the ones fine-tuned on MNLI perform better on most benchmarks? Also, on their git repo they show scores like 85.6/86.9. Is the second score there meant to represent the fine-tuned MNLI version's score?
@paveltarashkevich8387 3 years ago
The old layout was better. Text resolution was better. Screen space usage was better.
@sedenions 3 years ago
You are doing a good job. Talk about biologically plausible neural networks next, please.
@GreenManorite 3 years ago
Why is that an interesting topic? Not being snarky, just trying to understand motivation for the biological parallelism.
@sedenions 3 years ago
@@GreenManorite I'm biased, I majored in neuroscience and am currently switching careers. I guess this sentiment of mine comes from an interest in how researchers can better build cognitive AI. It seems like many of the early neural networks were 'neural' in name only. We are getting closer and closer to biologically plausible nets, but like you said, they're not that interesting to most.
@willrazen 3 years ago
Watch his video on "predictive coding"
@zhangshaojie9790 3 years ago
Can anyone explain to me the difference between a transformer encoder and decoder? Other than bidirectionality, autoencoding, and the extra FFW layer, the two architectures look the same to me. I keep hearing people say decoders are better at scaling. Do people actually mean BERT and GPT?
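One concrete architectural difference, beyond the points listed in the comment, is the attention mask; a toy NumPy illustration (random logits, not from either model):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n = 5
scores = np.random.default_rng(0).normal(size=(n, n))  # raw attention logits

# Encoder-style (BERT): every token attends to every other token.
enc_attn = softmax(scores)

# Decoder-style (GPT): a causal mask hides future positions before the softmax,
# so token i can only attend to tokens j <= i.
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
dec_attn = softmax(np.where(mask, -np.inf, scores))
```

The causal mask is what lets a decoder be trained on next-token prediction, which is the usual meaning behind "decoders scale better" claims about GPT-style models.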
@frenchmarty7446 2 years ago
The encoder and decoder are two components of the same model.
@pensiveintrovert4318 3 years ago
Too many layers pollute information that may have been decisive when pristine.
@bishalsantra 3 years ago
Just curious, what app are you using to annotate?
@timdernedde993 3 years ago
OneNote
@florianjug 3 years ago
@@timdernedde993 Is this also true for the new setup used in this video?
@gavinmc5285 3 years ago
Content - OK. Positioning - OK. What about context?
@frenchmarty7446 2 years ago
What exactly do you mean by "context"? Like some kind of additional information not in the word vectors themselves? That would probably be something the model should learn on its own.
@gavinmc5285 2 years ago
@@frenchmarty7446 OK, well, around the 30-minute mark there is a breakdown of the merits of relative and absolute positioning, and the strength of either technique (or both) seems to be correlated with context. Leaving aside computational or processing power (if that is even relevant), the paper's analysis seems to highlight the before-or-after options for adding absolute positioning (in this paper, at the end of the process). Nonetheless, the context 'factor', or 'solving' the context (so as to allow accurate word embedding or prediction), remains, and surely the optimum solution (approximate or precise) would be to have, in a positional and hierarchical (content) vector or matrix set with relative values, some form of absolute feed within which absolute values could be accessed without necessarily having to position them as a priority before or after the relative value calculations are processed.
@frenchmarty7446 2 years ago
@@gavinmc5285 You didn't actually answer my question, but OK... When you say "hierarchical" information, I assume you mean some kind of graph. Unstructured graphs have actually been tried before (BP-Transformer) with some success. If you mean some kind of structured graph based on grammar rules, then that is a bad idea. The entire purpose of the self-attention mechanism is to learn the relationships between tokens. The attention mechanism *creates* its own graph at every layer. Transformers are powerful (with large amounts of data) because they impose very little inductive bias. We don't tell the network what is or isn't important; it learns that on its own. Feeding extra information that isn't in the data itself is just extra effort that only biases the network towards one particular way of looking at the data.
@gavinmc5285 2 years ago
@@frenchmarty7446 OK then, to be more definitive: by 'context' I would understand concepts such as 'thrust', 'gist', 'essence' or 'meaning'. To interpret and apply context as relevant to the subject matter is a function of intelligence. To some extent a lack of supervision may be appropriate, depending on the instance, although it is unlikely that any algorithm (unsupervised or reinforced) that wanders too far from the context within which it is operating (or is supposed to be operating) is going to suddenly stumble on the parameters it needs to accurately determine values that require the appropriate context ('store / mall' is used here in the paper's analysis as an example). Not consistently, time and again, anyway.
@frenchmarty7446 2 years ago
@@gavinmc5285 That is literally *more* vague than just saying "context"; you are being less definitive. I also don't know what you mean by "stumble" on the correct parameters. We don't stumble on parameters, we train them, and we do so very consistently. What do you mean by "wander outside the context"? You mean outside the data distribution? That's a different meaning of "context", and we train for that as well. Where exactly are you unsatisfied? You say (paraphrasing) "it is unlikely that any algorithm... is going to stumble on the right parameters to accurately determine the right values". Accurately based on what? What specifically does the network have to output to meet your standard of understanding context?
@TechyBen 3 years ago
I came for the old Amiga game... I stayed for the new AI algorithm.