
#55 Dr. ISHAN MISRA - Self-Supervised Vision Models 

Machine Learning Street Talk
126K subscribers
23K views

Patreon: / mlst
Dr. Ishan Misra is a Research Scientist at Facebook AI Research, where he works on computer vision and machine learning. His main research interest is reducing the need for human supervision, and indeed human knowledge, in visual learning systems. He finished his PhD at the Robotics Institute at Carnegie Mellon and has done stints at Microsoft Research, INRIA and Yale. His bachelor's degree is in computer science, in which he achieved the highest GPA in his cohort.
Ishan is fast becoming a prolific scientist, already with more than 3,000 citations under his belt and co-authoring with Yann LeCun, the godfather of deep learning. Today, though, we will be focusing on an exciting cluster of recent papers on unsupervised representation learning for computer vision released from FAIR. These are: DINO: Emerging Properties in Self-Supervised Vision Transformers; Barlow Twins: Self-Supervised Learning via Redundancy Reduction; and PAWS: Semi-Supervised Learning of Visual Features by Non-Parametrically Predicting View Assignments with Support Samples. All of these papers are hot off the press, having been officially released only in the last month or so. Many of you will remember PIRL: Self-Supervised Learning of Pretext-Invariant Representations, of which Ishan was the primary author in 2019.
Pod: anchor.fm/machinelearningstre...
Panel: Dr. Yannic Kilcher, Sayak Paul (sayak.dev/), Dr. Tim Scarfe
Self supervised learning [00:00:00]
Lineage of SSL methods [00:04:08]
Better representations [00:06:24]
Data Augmentation [00:07:15]
Mode Collapse [00:08:43]
Ishan Intro [00:09:30]
Dino [00:12:40]
PAWS [00:14:19]
Barlow Twins [00:15:09]
Dark matter of intelligence article [00:15:36]
Main show kick off [00:16:51]
Why Ishan is doing work in self-supervised learning [00:19:49]
We don't know what tasks we want to do [00:21:57]
Should we try to get rid of human knowledge? [00:23:58]
Augmentations are knowledge via the back door [00:26:56]
Conceptual abstraction in vision [00:35:17]
Common sense is the dark matter of intelligence [00:38:14]
Are abstract categories (natural kinds) universal? [00:40:42]
Why do these vision algorithms actually work? [00:42:58]
Universality of representations, "semantics of similarity" [00:46:16]
Images on the internet are not uniformly random [00:49:41]
Quality of representations semi vs pure self-supervised [00:54:19]
Scaling laws for self-supervised learning and quality control [00:57:42]
Amazon Mechanical Turk thought experiment [01:00:42]
Architecture developments in SSL [01:03:01]
Architecture improvements - contrastive / SimCLR [01:05:33]
Architecture improvements - projector heads idea [01:07:08]
Architecture improvements - objective functions [01:09:15]
Mode collapse strategies (contrastive, clustering, prototypes, self-distillation) [01:09:48]
DINO [01:15:43]
How SSL in vision differs from SSL in language [01:18:20]
Dark matter paper and latent predictive models [01:22:05]
Energy Based Models [01:25:56]
Any big lessons learned? [01:28:24]
AVID paper (Video) [01:30:17]
DepthContrast paper (point clouds) [01:33:36]
References:
Shuffle and Learn - arxiv.org/abs/1603.08561
DepthContrast - arxiv.org/abs/2101.02691
DINO - arxiv.org/abs/2104.14294
Barlow Twins - arxiv.org/abs/2103.03230
SwAV - arxiv.org/abs/2006.09882
PIRL - arxiv.org/abs/1912.01991
AVID - arxiv.org/abs/2004.12943 (best paper candidate at CVPR'21, just announced over the weekend - cvpr2021.thecvf.com/node/290)
Alexei (Alyosha) Efros
people.eecs.berkeley.edu/~efros/
www.cs.cmu.edu/~tmalisie/proje...
Exemplar networks
arxiv.org/abs/1406.6909
The bitter lesson - Rich Sutton
www.incompleteideas.net/IncIde...
Machine Teaching: A New Paradigm for Building Machine Learning Systems
arxiv.org/abs/1707.06742
POET
arxiv.org/pdf/1901.01753.pdf
Music credit: / ambient-electronic-1
Visual clips credit: • Video
(Note MLST is 100% non commercial, non-monetised)

Published: 27 May 2024

Comments: 55
@ChaiTimeDataScience 2 years ago
I can never get enough of the Epic Tim intros! :D
@MachineLearningStreetTalk 2 years ago
@talk2yuvraj 2 years ago
It was like a small literature review section in itself.
@rogerfreitasramirezjordan7188 2 years ago
This is what YouTube is for. Clear explanations and a beautiful intro! Tim's intro is fundamental for understanding what comes later.
@MachineLearningStreetTalk 2 years ago
Thanks!
@AICoffeeBreak 2 years ago
Thanks, this episode is 🔥! You ask many questions I've had in mind lately.
@aurelius2515 2 years ago
This was definitely one of the better episodes - covered a lot of ground in some good detail with excellent content and good guiding questions and follow-up questions.
@beliefpropagation6877 2 years ago
Thank you for acknowledging the serious problems of calling images from Instagram "random", as is claimed in the SEER paper!
@tinyentropy 1 year ago
You guys are so incredible. Thank you so much. We appreciate this every single second. ☺️☺️☺️
@maltejensen7392 2 years ago
Such high quality content, so happy I found this channel!
@sugamtyagi9144 2 years ago
An agent always has a goal. No matter how broad or big, the data samples it collects from the real world will be skewed towards that broader goal. So data samples collected by such an agent will also have an inductive bias; the collection of data is never completely disentangled from the task. Even if you put a camera on a monkey or a snail, there will be a pattern to the data that is collected (i.e. a bias). In contrast, taking completely random samples of images, say generated by a camera whose position in the world and view direction are drawn from a random number generator, would give a very uniform distribution. But in that sense, is that even intelligence? I think any form of intelligence ultimately imbues some sort of intrinsic bias. Human beings, being the most general intelligence machines, with goals that are also learnt over time, likewise collect visual data in a converging fashion with age. Though still very general, humans too have a direction. PS: Excellent video. Thanks for picking this up.
@mfpears 2 years ago
23:00 The tendency of mass to clump together and increase spatial and temporal continuity...
@minma02262 2 years ago
My gawd. I love this episode!!!
@talk2yuvraj 2 years ago
Here from Lex Fridman's shout out in his latest interview with Ishan Misra.
@MachineLearningStreetTalk 2 years ago
@ayushthakur736 2 years ago
Loved the episode. :)
@strategy_gal 2 years ago
What a very interesting topic! It's amazing to know why these vision algorithms actually work!
@abby5493 2 years ago
Amazing video 😍
@drpchankh 2 years ago
Great episode and discussion! I think this discussion should also have included GAN latent discovery. Unsupervised learning is every data scientist's nirvana in production. On a side note, modern GANs can potentially span multiple domains, though current work is mainly centered on single-domain datasets like faces, bedrooms, etc. The latent variables or feature spaces are discovered by the networks in an unsupervised fashion, though much work remains to be done on better encoder and generator/discriminator architectures. The current best models can reconstruct a scene with different view angles, lighting, colours, etc., BUT they still CANNOT conjure up a structurally meaningful texture/structure of the scene, e.g. a bed, table or curtain gets contorted beyond being a bed or table. It will be interesting to see if latent features discovered in GANs can help in unsupervised learning too.
@drpchankh 2 years ago
GANs are unsupervised learning algorithms that use a supervised loss as part of the training :)
@valkomilev9238 2 years ago
I was wondering if quantum computing will help with the latent variables mentioned at 1:24:54
@LidoList 1 year ago
Correction: at 13:29 you expanded BYOL as Bring Your Own Latent. Actually, it should be Bootstrap Your Own Latent (BYOL).
@MachineLearningStreetTalk 1 year ago
Yep sorry
@tfaktas 2 years ago
What software are you using for annotating/presenting the papers?
@nathanaelmercaldo2198 2 years ago
Splendid video! Really like the intro music. Would anyone happen to know where to find the music used?
@MachineLearningStreetTalk 2 years ago
soundcloud.com/unseenmusic/sets/ambient-electronic-1
@angelvictorjuancomuller809 2 years ago
Hi, awesome episode! Can I ask which paper the figure at 1:15:51 is from? It's supposed to be DINO but I can't find it in the DINO paper. Thanks in advance!
@MachineLearningStreetTalk 2 years ago
Page 2 of the DINO paper. Note that the "DINO" paper's full title is "Emerging Properties in Self-Supervised Vision Transformers", arXiv:2104.14294v2
@angelvictorjuancomuller809 2 years ago
​@@MachineLearningStreetTalk Thanks! I was looking to another DINO paper (arXiv:2102.09281 ).
@akshayshrivastava97 2 years ago
Great discussion! A follow-up question, one thing I didn't quite understand (perhaps I'm missing something obvious): with reference to 6:36, from what I heard/read in the video/paper, these attention masks were gathered from the last self-attention layer of a ViT. The DINO paper showed that one of the heads in the last self-attention layer is paying attention to areas that correspond to actual objects in the original image. Kinda seems weird; I'd think that by the time you reach the last few layers, the image representation would have been altered in ways that make the original image irrecoverable. Would it be accurate to say this implies the original image representation either makes it through to the last layer(s) or is somehow recovered?
@dmitryplatonov 2 years ago
It is recovered. It traces back to the inputs which trigger the most attention.
@akshayshrivastava97 2 years ago
@@dmitryplatonov thanks.
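To make the thread above concrete, here is a minimal sketch, in PyTorch with toy tensors standing in for a real ViT's last-layer queries and keys, of how such attention maps are typically obtained: compute the final layer's attention weights, take the row belonging to the [CLS] query, and reshape its attention over the patch tokens back into a 2-D grid. (The official DINO repository exposes a similar helper, something like get_last_selfattention; the tensors here are only stand-ins for illustration.)

import torch

# Toy dimensions: a 224x224 image with 16x16 patches -> 14x14 = 196 patch tokens,
# plus one [CLS] token, and 6 attention heads of dimension 64.
num_heads, head_dim, grid = 6, 64, 14
tokens = 1 + grid * grid  # [CLS] + patch tokens

# Stand-ins for the queries and keys of the *last* self-attention layer.
q = torch.randn(num_heads, tokens, head_dim)
k = torch.randn(num_heads, tokens, head_dim)

# Scaled dot-product attention weights: (heads, tokens, tokens).
attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim ** 0.5, dim=-1)

# Row 0 is the [CLS] query; columns 1..196 are its attention over the patch tokens.
cls_to_patches = attn[:, 0, 1:]                            # (heads, 196)
attn_maps = cls_to_patches.reshape(num_heads, grid, grid)  # one 14x14 map per head
print(attn_maps.shape)                                     # torch.Size([6, 14, 14])

In the DINO paper these per-head maps are then upsampled to the image resolution for visualisation, so nothing needs to be "recovered" beyond the attention weights themselves.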
@sabawalid 2 years ago
Is a "cartoon banana" and a "real banana" subtypes of the same category, namely a "banana"? There's obviously some relation between the two, but Ishan Misra is absolutely right, a "cartoon banana" is a different category and is not a subtype of a "banana" (it cannot be eaten, it does not smell or taste like a banana, etc...) Interesting episode, as usual, Tim Scarfe
@himanipku22 2 years ago
44:23 Is there a paper somewhere that I can read on this?
@MachineLearningStreetTalk 2 years ago
You mean the statement from Ishan that you could randomly initialise a CNN and it would already know cats are more similar to each other than dogs? Hmm. The first paper which comes to mind is this arxiv.org/abs/2003.00152 but I think there must be something more fundamental. Can anyone think of a paper?
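As a rough illustration of how one might probe that statement, here is a minimal sketch (assuming a recent PyTorch/torchvision; the image filenames are hypothetical placeholders) that compares cosine similarities of features from a randomly initialised, untrained ResNet:

import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import models, transforms

model = models.resnet18(weights=None)   # random initialisation, no pretrained weights
model.fc = torch.nn.Identity()          # expose the 512-d pooled features
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

def features(path):
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return F.normalize(model(x), dim=-1)

# Hypothetical image files: two cats and one dog.
cat1, cat2, dog = features("cat1.jpg"), features("cat2.jpg"), features("dog.jpg")
print("cat-cat similarity:", (cat1 @ cat2.T).item())
print("cat-dog similarity:", (cat1 @ dog.T).item())

If the claim about random priors holds, the cat-cat similarity should tend to be higher than the cat-dog similarity even before any training.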
@zahidhasan6990 2 years ago
It doesn't matter when I am not around, i.e. what happens in 100 years. - Modified from Misra.
@_ARCATEC_ 2 years ago
It's interesting how useful simple edits like crop, rotation, contrast, edge and curve adjustments, plus the appearance of dirty pixels within intentionally low-resolution images, are while self-supervised learning is being applied. 🍌🍌🍌😂 So true 💓 the map is not the territory.
@rubyabdullah9690 2 years ago
What if you create a simulation of an early world (when there is no technology etc.), then create an agent that learns about the environment, making the agent and the world's rules as close as possible to the real world, and then try to learn like the monster architecture of Tesla, but unlabelled? It's kinda super duper hard to make, but I think that's the best approach to creating an Artificial General Intelligence :v
@MadlipzMarathi 2 years ago
here from lex.
@massive_d 2 years ago
Lex gang
@MachineLearningStreetTalk 2 years ago
We are humbled to get the shout-out from Lex!
@shivarajnidavani5930 2 years ago
The fake blur is very irritating. It hurts to see.
@fast_harmonic_psychedelic 2 years ago
There's a lot of emphasis on this "us vs. them", "humans vs. the machine" theme in your introduction, which I think is excessive and biased. It's not man and machine. It's just us. They are us. We're them.
@SimonJackson13 2 years ago
Radix sort O(n)
@SimonJackson13 2 years ago
When k < log(n) it's fantastic.
@SimonJackson13 2 years ago
For a cube root of bits in range a 6n FILO stack list sort time is indicated.
@MachineLearningStreetTalk 2 years ago
We meant that O(N log N) is the provably fastest comparison sort, but great call-out on radix 😀
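For readers who want the distinction spelled out: comparison sorts are bounded below by O(N log N), while radix sort side-steps that bound by never comparing keys, running in O(k·N) for N keys of k digits. A minimal sketch in Python (an illustration, not code from the discussion):

def radix_sort(keys, base=10):
    """Least-significant-digit radix sort for non-negative integers."""
    if not keys:
        return keys
    max_key, exp = max(keys), 1
    while max_key // exp > 0:
        buckets = [[] for _ in range(base)]
        for key in keys:
            buckets[(key // exp) % base].append(key)  # stable pass on the current digit
        keys = [key for bucket in buckets for key in bucket]
        exp *= base
    return keys

print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
# -> [2, 24, 45, 66, 75, 90, 170, 802]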
@fast_harmonic_psychedelic 2 years ago
Machines are just an extension of nature, just like a tree, a beehive, or a baby.
@MachineLearningStreetTalk 2 years ago
For those who want to learn more from Ishan and more academic detail on the topics covered in the show today, Alfredo Canziani just released another show twitter.com/alfcnz/status/1409481710618693632 😎
@Self-Duality 2 years ago
Diving deep into this topic myself! So complex yet elegant… 🤔🤩