Masked Autoencoders Are Scalable Vision Learners - Paper explained and animated! 

AI Coffee Break with Letitia

“Masked Autoencoders Are Scalable Vision Learners” paper explained by Ms. Coffee Bean. Say goodbye to contrastive learning and say hello (again) to autoencoders in #ComputerVision! Love the simple, yet elegant idea!
► Check out our sponsor: Weights & Biases 👉 wandb.me/ai-coffee-break
📺 Vision Transformer explained: • Vision Transformers ex...
Thanks to our Patrons who support us in Tier 2, 3, 4: 🙏
donor, Dres. Trost GbR, Yannik Schneider
➡️ AI Coffee Break Merch! 🛍️ aicoffeebreak.creator-spring....
Paper 📜: He, Kaiming, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross B. Girshick. “Masked Autoencoders Are Scalable Vision Learners.” (2021). arxiv.org/abs/2111.06377
References:
🔗 blog.keras.io/building-autoen...
🔗 www.deeplearningbook.org/
🔗 / 1462446494766837773
📺 ViT video: • An image is worth 16x1...
📺 DeiT: • Data-efficient Image T...
📺 Swin Transformer: • Swin Transformer paper...
Outline:
00:00 Intro
00:41 Weights & Biases (Sponsor)
02:10 What are autoencoders?
05:03 Differences between vision and language masked autoencoding
07:02 How does masked autoencoding work for images?
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔥 Optionally, pay us a coffee to help with our Coffee Bean production! ☕
Patreon: / aicoffeebreak
Ko-fi: ko-fi.com/aicoffeebreak
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
----------------
🔗 Links:
AICoffeeBreakQuiz: / aicoffeebreak
Twitter: / aicoffeebreak
Reddit: / aicoffeebreak
RU-vid: / aicoffeebreak
#AICoffeeBreak #MsCoffeeBean #MachineLearning #AI #research

Comments: 54
@user-js9qb7hz5e (2 years ago)
I have been procrastinating reading the paper until now and you just made a video, perfect.
@AICoffeeBreak (2 years ago)
You were not procrastinating. You were waiting for us to make the video. 😂
@harumambaru (2 years ago)
55 views, I am an early bird! I hope you get enough money for coffee from sponsors :) I am not mocking, I am really happy that even young channels are supported by sponsors, and so happy that this sponsor can be helpful for most of the viewers.
@AICoffeeBreak (2 years ago)
Thanks! I can totally relate to your point. I feel the same when it comes to small YouTubers I love.
@harumambaru (2 years ago)
@@AICoffeeBreak Could you list a couple of small YouTubers you love? I am into 3blue1brown, Yannik and 2min papers, but they all are pretty huge.
@AICoffeeBreak (2 years ago)
Small but sponsored? No (except Sabine Hossenfelder, but she is not small). Just small: Machine Learning Street Talk, Alfredo Canziani, Henry AI Labs, Jay Alammar, The AI Epiphany, Aladdin Persson, Gradient Dude, vcubingx
@harumambaru (2 years ago)
@@AICoffeeBreak Wow, you made my weekend. Instead of watching Monster Hunter with Milla Jovovich, I am going to watch Sabine Hossenfelder's protein folding videos.
@beizhou4025 (2 years ago)
The animation is awesome. Thank you for taking the effort!
@AICoffeeBreak (2 years ago)
Glad you liked it!
@michaellellouch3682 (2 years ago)
Cool stuff. Thanks for keeping us up to date on papers outside of our domain.
@prajwalsood1350 (2 years ago)
Can't thank you enough, I have to present this paper in my class and this helps me a lot.
@deoabhijit5935 (2 years ago)
Wonderful explanation, amazing narration, elegant editing.
@DerPylz (2 years ago)
I'm old enough!
@cipritom (2 years ago)
In addition, I love the sound effects of the layer growing! Nice video!
@AICoffeeBreak (2 years ago)
Thanks! Including sound effects doesn't mean much most of the time. But at the right spots, it can trigger a sort of 3D effect.
@nilsmuller9286 (2 years ago)
Awesome video! :) Didn't have the paper on my radar yet, now I'll have to read it.
@mattcoleman2819 (2 years ago)
Great video, thanks! I'm a bit confused how the transfer learning/downstream tasks will work with the encoder if its sequence length now needs to be increased? Or is the encoder sequence length set to the total # patches, and attention masking/padding is used during pretraining?
@Mrbits01 (2 years ago)
The first time I heard the sound effects you used when expanding stuff (parameters, encoder size), I literally thought it was my stomach growling. Darn it, right when it was getting serious :D
@AICoffeeBreak (2 years ago)
Lol 😂 You are nominated for the funniest comment award.
@soumyasarkar4100 (2 years ago)
Your content organisation is very good.
@AICoffeeBreak (2 years ago)
Thanks! Glad we did a thing right.
@MengJiun_Chiou (2 years ago)
Awesome explanation :)
@AICoffeeBreak (2 years ago)
Thanks!
@sadface7457 (2 years ago)
Certified classic
@Agrover112 (2 years ago)
That's a certified hood classic
@terryr9052 (2 years ago)
I am curious why non-overlapping patches were chosen. I would think that would lead to reconstruction errors.
@AICoffeeBreak (2 years ago)
Thanks for the question. But could you please elaborate a little on why this would cause errors and why overlapping patches would ameliorate the problem? The patches are non-overlapping but tile the entire image. And attention allows for patches to be informed about their fellow patches.
@terryr9052 (2 years ago)
@@AICoffeeBreak I don't really have a rigorous answer, but my intuition is telling me that forcing the model to predict every boundary between patches is less accurate than a model that actually gets to see the boundary as data. Thinking more about it, I do understand that more patches means more work for the attention and thus would counter the advantage gained from removing patches through masking...
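To make the tiling point above concrete, here is a minimal sketch of non-overlapping patchification; the 224x224 image size and 16x16 patch size are the standard ViT defaults, assumed here purely for illustration:

```python
import torch

# Toy example: one 224x224 RGB image split into non-overlapping 16x16 patches
# (standard ViT defaults, assumed here for illustration).
img = torch.randn(1, 3, 224, 224)   # (batch, channels, height, width)
patch = 16

# stride == kernel_size means the patches do not overlap, yet together they
# tile the whole image: (224 // 16) ** 2 = 196 patches.
patches = torch.nn.functional.unfold(img, kernel_size=patch, stride=patch)
patches = patches.transpose(1, 2)    # (1, 196, 16 * 16 * 3) = (1, 196, 768)
print(patches.shape)                 # torch.Size([1, 196, 768])
```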
@arigato39000 (2 years ago)
thank you from japan
@AICoffeeBreak (2 years ago)
Hey, it's you! Missed your comments. :)
@garyhuntress6871 (1 year ago)
I've been working on ViTMAE for 2 days. Thanks for this video, very interesting.
@AICoffeeBreak (1 year ago)
Glad it was helpful! Keen to share what you are planning to do with it? :)
@garyhuntress6871 (1 year ago)
@@AICoffeeBreak I'm very interested in processing audio, particularly spectrograms. Ideally I think we need the equivalent of an LLM for acoustics. A really good embedding model for time series.
@pohsoonchang6127 (2 years ago)
👍
@Tondo95 (2 years ago)
05:18 Are there any references where one can look in more detail into the artifacts introduced by using masking in CNN autoencoders? At first glance I couldn't see the authors taking care to highlight this fact. P.S. The animations are great as always.
@antoinegar.638 (4 months ago)
Hey there, thanks for the video! I'm late to the party, but I don't understand something: how is this architecture useful for downstream tasks like classification? I understand you can ditch the decoder and put your downstream classifier instead. However, the encoder only reads 25% of the input (75% being masked). Won't this seriously lower the quality of the system compared to a classical autoencoder?
@AICoffeeBreak (4 months ago)
Hmm, you wouldn't do the masking for classification tasks where one is interested in representations, would you? The masking is just for training.
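To illustrate the answer above, here is a minimal, hypothetical sketch of the two regimes; the module names, sizes, and the 0.75 ratio are illustrative stand-ins, not code from the paper. Heavy random masking is applied only for the reconstruction pretraining, while fine-tuning and inference run the same encoder on all patches:

```python
import torch
import torch.nn as nn

def random_mask(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a random subset of patch tokens (MAE-style masking, simplified)."""
    batch, num_tokens, dim = tokens.shape
    num_keep = int(num_tokens * (1.0 - mask_ratio))
    idx = torch.rand(batch, num_tokens).argsort(dim=1)[:, :num_keep]  # random keep-indices per image
    kept = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, dim))
    return kept, idx

# Hypothetical stand-ins for the real encoder and classification head.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True), num_layers=2)
classifier_head = nn.Linear(768, 1000)

patch_tokens = torch.randn(4, 196, 768)              # 4 images, 196 patch embeddings each

# Pretraining: the encoder only sees the ~25% visible tokens;
# its output would go to the MAE decoder for pixel reconstruction.
visible, keep_idx = random_mask(patch_tokens, mask_ratio=0.75)
latent_visible = encoder(visible)                    # (4, 49, 768)

# Fine-tuning / inference: no masking at all, the encoder reads every patch.
latent_full = encoder(patch_tokens)                  # (4, 196, 768)
logits = classifier_head(latent_full.mean(dim=1))    # e.g. mean-pool, then classify
```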
@Youkouleleh (2 years ago)
Thanks for the video. Do you know why BERT would not use this strategy and just give the encoder the non-masked words?
@AICoffeeBreak (2 years ago)
Because the masked words have to be predicted, meaning a representation has to be computed there, which in transformers (as much as goes in, goes out again) means that BERT has to process the mask words too. Not even the paper presented in the video gets away from that curse, because the decoder has to see the masks again.
@Youkouleleh (2 years ago)
@@AICoffeeBreak Ok, and could BERT do this like in this paper (or why do they not use this same strategy)? I.e. give the non-masked/swapped words to the encoder, and in the decoder give the embedded words + the masked words (which would be learned, like in this paper). This would also allow a bigger encoder during training.
@AICoffeeBreak (2 years ago)
@@Youkouleleh Ah, now I see the confusion: BERT does not actually have a (heavyweight) decoder. The "decoder" is just an MLP performing classification *on the MASK tokens* after they have been encoded. The decoder you just described is, in a sense, already the BERT encoder. See the first answer to this question: stackoverflow.com/questions/60382793/what-are-the-inputs-to-the-transformer-encoder-and-decoder-in-bert
@AICoffeeBreak (2 years ago)
But it also might be that I am confused. Or Ms. Coffee Bean. If I am right, it is me. If I am wrong, it is Ms. Coffee Bean. 😅
@Youkouleleh (2 years ago)
@@AICoffeeBreak Thanks for your answer. I had this idea that BERT was some kind of autoencoder, but not really. It is quite close to an AE + the matching-sentence task, though. If the classification of the non-masked words also counted in the loss, I think it would be an autoencoder + matching-sentence task.
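To make the BERT side of this thread concrete, here is a minimal masked-language-modelling sketch; sizes and module names are illustrative, not BERT's real implementation. The encoder must compute a representation at every [MASK] position, and the "decoder" is only a light prediction head on top of those positions:

```python
import torch
import torch.nn as nn

vocab_size, dim = 30522, 256              # BERT-base vocab size; dim is just illustrative
embed = nn.Embedding(vocab_size, dim)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True), num_layers=4)
mlm_head = nn.Linear(dim, vocab_size)     # BERT's "decoder": a light prediction head

MASK_ID = 103                             # [MASK] id in the standard BERT vocab
input_ids = torch.randint(1000, 2000, (2, 10))   # 2 toy sentences of 10 tokens
masked_pos = torch.tensor([2, 5, 7])
labels = input_ids[:, masked_pos].clone()        # the words to recover

# The encoder still has to process the masked positions (as much as goes in, goes out),
# so the sequence keeps its full length and [MASK] embeddings fill the gaps.
input_ids[:, masked_pos] = MASK_ID
hidden = encoder(embed(input_ids))               # (2, 10, dim)

logits = mlm_head(hidden[:, masked_pos])         # predict only where tokens were masked
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), labels.reshape(-1))
```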
@nicolettileo (1 year ago)
Thank you for your work, but nonetheless I still struggle to capture the idea of mask tokens, which seems crucial. I'm new to the field of transformers, but used to good old CNN autoencoders, and what bothers me is: how can the masked tokens be directly fed into the decoder even though their latent representations haven't been computed? From what I understood, it isn't the masked tokens that are fed but some learnable shared vector. Am I right?
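Regarding the question above: yes, in the paper each masked position is filled with one shared, learned mask token, and the positional embedding added on top is what tells the decoder which patch that token stands for. A tiny sketch of just that detail (names and sizes are illustrative):

```python
import torch
import torch.nn as nn

dim, num_patches = 256, 196
mask_token = nn.Parameter(torch.zeros(1, 1, dim))       # the single shared, learnable vector
decoder_pos_embed = torch.randn(1, num_patches, dim)     # the paper uses fixed sin-cos embeddings

encoded_visible = torch.randn(1, 49, dim)                 # encoder output for the 25% visible patches
num_masked = num_patches - encoded_visible.shape[1]

# Every masked slot receives the *same* vector; after the tokens are put back into
# the original patch order (unshuffling omitted here), the positional embedding is
# the only thing distinguishing "which patch am I supposed to reconstruct?".
decoder_tokens = torch.cat(
    [encoded_visible, mask_token.expand(1, num_masked, -1)], dim=1)
decoder_input = decoder_tokens + decoder_pos_embed        # (1, 196, 256), fed to the small decoder
```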
@aishik11 (2 years ago)
Any assistance on how to use this model for just encoding without masking, like she suggests at 12:02? The huggingface implementation seems to be performing some masking.
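One way to do this, assuming I remember the Hugging Face ViTMAE API correctly (the class names and the mask_ratio override below should be verified against the current transformers docs): the model applies random masking internally according to config.mask_ratio, so setting it to 0.0 lets the encoder see every patch.

```python
import torch
from transformers import AutoImageProcessor, ViTMAEModel

# Assumption: ViTMAEModel and the mask_ratio config override behave as described;
# please verify against the current transformers documentation.
processor = AutoImageProcessor.from_pretrained("facebook/vit-mae-base")
model = ViTMAEModel.from_pretrained("facebook/vit-mae-base", mask_ratio=0.0)  # keep all patches

pixel_values = torch.randn(1, 3, 224, 224)   # or processor(images=pil_image, return_tensors="pt").pixel_values
with torch.no_grad():
    outputs = model(pixel_values=pixel_values)

# Embeddings for all patches (plus the CLS token). The model may still shuffle the
# patch order internally; outputs.ids_restore can be used to undo that if needed.
features = outputs.last_hidden_state
```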
@youssefprojects7757 (1 year ago)
The video is informative and supported by good animations, but you need to speak a little more slowly and have some breaks in your speech, because sometimes there is too much information in one sentence. Thank you for your effort and I hope you will take this feedback. I discovered your channel today and I subscribed.
@Easyy-Peasyy-Cooking (2 years ago)
Thank you for your nice explanation, but I would like to point out that MAE is not the first to propose this idea. In April 2021, much earlier than MAE, we proposed "SiT: Self-supervised Vision Transformers" and showed its merit on small datasets, because as a small group we cannot afford training on ImageNet. Despite the fact that we contacted the authors of MAE to acknowledge the original research, they did not respond to us! Similarly, Microsoft also used the same idea in "SimMIM - A Simple Framework for Masked Image Modelling" and they did not acknowledge us. I would really appreciate it if you support the original research and mention this story on your channel. Nowadays, research is only acceptable and acknowledged if it is coming from these tech giants, and there is no place for small groups anymore.
@AICoffeeBreak (2 years ago)
As a member of a small group myself, I really feel your pain. I usually do criticize in my videos that the huge companies are dominating. Oftentimes they just use larger resources and not much in terms of ideas, and it looks more like engineering at scale and less like research. It's a pity they did not cite you even after pointing this out. This is bad practice.
@Phenix66 (2 years ago)
Feels so bad hearing about this... It hurts enough to think of something and see that it already exists, but this is worse. In general it really feels like David vs Goliath at some point... Even aside from not getting visibility, not having the resources sucks, especially when (as it seems) most of the recent cool papers (Pathways, DALL-E 2, etc.) seem to stem from having vast amounts of data and computation power, not from having cool new ideas :( When even evaluation is so bloody expensive, even on simple datasets, it can completely knock you out of the competition...
@Agrover112 (2 years ago)
Idk what will happen by the time I get into a PhD; AI will be crazy.
@AICoffeeBreak (2 years ago)
Where are you at the moment?
@Agrover112 (2 years ago)
@@AICoffeeBreak Bachelors lol
@AICoffeeBreak (2 years ago)
I pity you.
@AICoffeeBreak (2 years ago)
Hold on there.