10L - Self-supervised learning in computer vision 

Alfredo Canziani
31K views

Course website: bit.ly/DLSP21-web
Playlist: bit.ly/DLSP21-RU-vid
Speaker: Ishan Misra
Slides: bit.ly/DLSP21-10L
Chapters
00:00 - Welcome to class
01:05 - Self-supervised learning in computer vision
15:20 - Pretext-invariant representation learning (PIRL)
27:08 - Swapping assignments between views (SwAV)
48:39 - Audiovisual instance discrimination with cross-modal agreement (AVID + CMA)
58:24 - Barlow Twins: self-supervised learning via redundancy reduction
1:26:17 - Live chat

Published: 7 Jul 2024

Comments: 63
@khushpatelmd 3 years ago
One of the best lectures on SSL ever. Thank you, Alfredo and Ishan for making this available for everyone.
@alfcnz 3 years ago
🥳🥳🥳
@AIwithAniket 3 years ago
Awesome lecture covering all the different methods of unsupervised learning! Thank you for making these videos public.
@alfcnz 3 years ago
💪🏻💪🏻💪🏻
@IdiotDeveloper 3 years ago
A really informative lecture on self-supervised learning.
@jetnew_sg 3 years ago
Thanks for this! Can't wait to see how the best of all worlds can be combined for SSL!
@alfcnz 3 years ago
🔥🔥🔥
@SY-me5rk 2 years ago
Excellent presentation. Thanks
@buoyrina9669 2 years ago
Thanks, Ishan. This is excellent.
@alfcnz 2 years ago
🥳🥳🥳
@QuanNguyen-oq6lm 3 years ago
This is actually what my research is focusing on. Hopefully I can finish it in time to apply for a PhD at NYU and join you, Alfredo.
@alfcnz 3 years ago
😍😍😍
@aristoi 2 years ago
Thanks for this. Really terrific content.
@alfcnz 2 years ago
🥳🥳🥳
@filippograzioli3641 3 years ago
Beautiful lecture! Thanks :)
@alfcnz 3 years ago
Prego 😇😇😇
@saeednuman 3 years ago
Very informative as usual. Thank you @Alfredo
@alfcnz 3 years ago
🤓🤓🤓
@sami9323 1 year ago
This is excellent, thank you!
@alfcnz 1 year ago
You're very welcome! 😇😇😇
@AdityaSanjivKanadeees 2 years ago
I have a question regarding Barlow Twins. Q1: For a batch of B samples, the output of the projector networks will be BxD. We have two such projections A and B. We know that rank(AxB)
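For context on the question above, here is a minimal PyTorch sketch of the Barlow Twins redundancy-reduction loss it refers to. It is only an illustration, not the official implementation: z_a and z_b stand for the two B×D projector outputs mentioned in the question, and the hyperparameter name lambda_offdiag is an assumption.

import torch

def barlow_twins_loss(z_a, z_b, lambda_offdiag=5e-3):
    # z_a, z_b: the two B x D projector outputs for the same batch of images.
    B, D = z_a.shape
    # Standardise each embedding dimension over the batch.
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-6)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-6)
    # D x D cross-correlation matrix between the two views.
    c = (z_a.T @ z_b) / B
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()               # matching dimensions should correlate
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # all other pairs should decorrelate
    return on_diag + lambda_offdiag * off_diag

Note that the cross-correlation matrix c is D×D and its rank is at most min(B, D), which is presumably where the truncated rank question above was heading.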
@bisheshworneupane7996 3 years ago
Great video!
@alfcnz 3 years ago
😎😎😎
@dpetrini_ 1 year ago
I am sure students from all over the world thank you a lot.
@alfcnz 1 year ago
🤗🤗🤗
@ashishgor2163 1 year ago
Thank you so much sir
@alfcnz 1 year ago
You're welcome 🤗🤗🤗
@hedu5303 3 years ago
Awesome!
@alfcnz 3 years ago
😻😻😻
@charchitsharma8902 3 years ago
Thanks Alf :)
@alfcnz 3 years ago
You're welcome 😺😺😺
@prakharthapak4229 3 years ago
Basically an informative video :-)
@alfcnz 3 years ago
🤓🤓🤓
@vadimborisov4824 2 years ago
basically agree :)
@alfcnz 2 years ago
🤓🤓🤓
@NS-te8jx 2 years ago
Every two minutes he references others' work and his own. I was shocked and overwhelmed by the number of papers referenced in just this one lecture, haha.
@alfcnz 2 years ago
😅😅😅
@NS-te8jx 2 years ago
@@alfcnz I enjoyed this session, very good content. Thanks for organizing it.
@user-co6pu8zv3v 3 years ago
Thank you, Alfredo :) It will be very difficult for me to read all the materials mentioned in the video within a week. :)
@alfcnz 3 years ago
I'm now aiming at two videos per week. Haha, sorry 😅😅😅
@user-co6pu8zv3v 3 years ago
So it will be very, very, very difficult for me to read everything, but I will try. Thank you for the videos, Alfredo :)
@tchlux 3 years ago
People really need to stop using linear classifiers to gauge the “correctness” of representations learned at different layers!! Use something like a Silhouette score, or anything that measures *local* consistency of the representation (you could also use a k-fold Delaunay interpolant approximation if you're attached to things being locally linear). Neural networks (ReLU) are capturing linearly separable subsets of data at each layer, which means even the layer two before the output could have a highly nonlinear representation of the data that is easily transformed with the right set of selections. You won't succumb to this problem if you just measure local continuity of a representation with respect to your target output instead of using a global linear approximation.
@imisra_ 3 years ago
Thanks for the comment! I agree that evaluating representations with linear classifiers is not sufficient. Like you suggest, there are many different ways to evaluate them, and each of them tests different aspects of representations. Depending on the comparison/final application, the methodology for evaluating them will change.
@khushpatelmd 3 years ago
How would you implement it? Do you have any use case? I understand the rationale but don't understand how to use something like a Silhouette score here.
@tchlux 3 years ago
@@khushpatelmd Great questions. A simplified example: consider a binary classification problem where the model outputs a single number (the truth is either 0 or 1). Suppose we want to evaluate the amount of information captured by an embedding relative to this downstream prediction task.

Option 1: We could measure the mean squared error of a best-fit linear function over the embedded data. In effect, this measures how "linearly separable" our embedded data is for this classification problem.

Option 2: We compute the average distance to the nearest point of a different class (for all points) minus the average distance to the nearest point of the same class (similar to the concept behind Silhouette scores, which answer the question "how near is each point to its own cluster relative to other clusters?").

Now imagine that the embedding has data placed perfectly in a separated "three stripe" pattern, where the left stripe is all 0's, the middle stripe is all 1's, and the right stripe is all 0's. The pure linear evaluation (option 1) will tell us that the embedding yields about ~66% accuracy (not so good). However, a nearness approach (option 2) would tell us that the embedding is very good and that every point's nearest neighbours are in the same class (distance to other class - distance to same class >> 0). Realistically, option 2 is correct here, because there is a very simple 2-hidden-node MLP that can *perfectly* capture the binary classification problem given this embedding.

I realize that some people might say, "well, option 2 is irrelevant if you always know you're going to use a linear last layer." But that's beside the point. In general, we are trying to evaluate how representative the newly learned geometry is for downstream tasks. Restricting ourselves to only linearly-good geometries for evaluation is unnecessary and can be misleading. In the end, most people care how difficult it would be to take an embedding and train a new, accurate model given the embedding. I assume few people will arbitrarily restrict themselves to linear models in practice.
@khushpatelmd 3 years ago
@@tchlux Thanks a lot Thomas. This is so clearly explained by you.
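To make the two options in the exchange above concrete, here is a small NumPy sketch of the "three stripe" example. The stripe positions, sample counts, and the 0.5 threshold are illustrative assumptions, not anything from the lecture or the comments.

import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.uniform(0, 1, 100),    # left stripe,   class 0
                    rng.uniform(2, 3, 100),    # middle stripe, class 1
                    rng.uniform(4, 5, 100)])   # right stripe,  class 0
y = np.concatenate([np.zeros(100), np.ones(100), np.zeros(100)])

# Option 1: best-fit linear probe (least squares), thresholded at 0.5.
X = np.stack([x, np.ones_like(x)], axis=1)
w, *_ = np.linalg.lstsq(X, y, rcond=None)
acc = ((X @ w > 0.5) == y).mean()              # ~0.67: the embedding looks mediocre

# Option 2: average margin = distance to nearest other-class point
#                            minus distance to nearest same-class point.
d = np.abs(x[:, None] - x[None, :])
np.fill_diagonal(d, np.inf)                    # exclude each point's distance to itself
same  = np.where(y[:, None] == y[None, :], d, np.inf).min(axis=1)
other = np.where(y[:, None] != y[None, :], d, np.inf).min(axis=1)
margin = (other - same).mean()                 # >> 0: locally, the classes are perfectly separated

print(f"linear-probe accuracy: {acc:.2f}, nearest-neighbour margin: {margin:.2f}")

As in the comment, the linear probe scores the embedding at roughly two-thirds accuracy, while the local nearest-neighbour margin correctly reports that every point's neighbourhood is pure, i.e. a small nonlinear head could classify it perfectly.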
@XX-vu5jo 2 years ago
Wait for our CVPR paper that will solve the memory problem. We hope it will be accepted.
@alfcnz 2 years ago
🤞🏻🤞🏻🤞🏻
@harrypotter1155 3 years ago
Hi, are you planning to add subtitles or enable automatic captions?
@alfcnz 3 years ago
Automatic captions should be enabled by default. I'll check later whether and why this isn't working. Thanks for the feedback. 🙏🏻🙏🏻🙏🏻
@alfcnz 3 years ago
I'm in touch with the YouTube support team. They have identified the issue and are currently working on it. I'll let you know when there is any update. Thank you for your patience. 😇😇😇
@harrypotter1155 3 years ago
@@alfcnz THANK YOU VERY MUCH!! I really appreciate the lengths you're going to just to make sure the auto captions are on 😭 Once again, thank you very much!
@alfcnz 3 years ago
😇😇😇
@alfcnz 3 years ago
They replied and… I'm losing my patience. YouTube support is not cooperating. I'm escalating this soon. I'm not sure what part of “feed the audio stream to your speech-to-text model” is hard to comprehend.
@ChuanChihChou 3 years ago
So I was watching the "Scaling machine learning on graphs" @Scale talk the other day, in which they used the contrastive method with massive parallelism and negative sampling to prevent collapse to the trivial solution: fb.watch/v/1pqXNP5au/. After this lecture, I now wonder if we can use the other options in the arsenal instead (clustering, distillation, and redundancy reduction). Has anyone at Facebook tried those for graph embedding training yet?
@alfcnz 3 years ago
Yup, we can indeed use the other techniques, where the positive pairs are defined by the adjacency matrix (connectivity defined by the graph). For the question about whether FB has tried these, I'll let Adam reply. (Let me ping him.)
@alerer1 3 years ago
Hi Chuan-Chih, thanks for watching my talk! I don't know of anyone at Facebook who has applied these unsupervised methods to the problem of learning node features for graphs. The graph embedding problem is a little different from learning unsupervised image features, so I don't immediately see how these methods would apply, but I wouldn't be surprised if there was a way!

In the type of unsupervised learning described in this talk, you are learning a function f that converts a high-dimensional feature vector x_i into a low-dimensional semantic feature z_i. In the graph embedding setting, the nodes don't have input features - you *learn the input features* in order to approximate the adjacency matrix. There are probably ways to apply these methods if you think of the one-hot edge list (aka each node's row of the adjacency matrix) as the features, but I haven't thought about it.

Maybe a better place to start is the graph neural network setting, where nodes *do* have input features and you're learning a function f that combines the features over the graph neighborhood to predict some supervised labels. I haven't seen any work on unsupervised graph neural networks, but there probably is some, and some of these same approaches may work well!
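Following up on the thread above, here is a speculative PyTorch sketch of what "positive pairs defined by the adjacency matrix" could look like in the setting where nodes do have input features. The toy graph, feature sizes, and encoder are made-up assumptions, and the joint-embedding loss is left as a plug-in point rather than a claim about what would actually work for graphs.

import torch
import torch.nn as nn

def sample_positive_pairs(adj, num_pairs):
    # Connected node pairs (i, j) play the role of two "views" of the same entity.
    edges = adj.nonzero(as_tuple=False)          # [num_edges, 2]
    idx = torch.randint(len(edges), (num_pairs,))
    return edges[idx, 0], edges[idx, 1]

adj   = (torch.rand(500, 500) < 0.01).float()    # toy random graph (assumed, for illustration)
feats = torch.randn(500, 32)                     # toy node features
encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64))

i, j = sample_positive_pairs(adj, num_pairs=256)
z_a, z_b = encoder(feats[i]), encoder(feats[j])  # embeddings of the two "views"
# z_a and z_b could now be fed to any of the joint-embedding objectives from the lecture
# (contrastive, SwAV-style clustering, BYOL-style distillation, or Barlow Twins).
print(z_a.shape, z_b.shape)                      # torch.Size([256, 64]) torch.Size([256, 64])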
@dexlee7277 3 years ago
That bear, he knows everything.
@alfcnz 3 years ago
Indeed he does. He's been present to all my lessons! 🐻🐻🐻
@hoseinhashemi3680 3 years ago
I was wondering about something. In contrastive learning, if one uses a self-attention transformer encoder within the batch dimension before feeding the representations to the contrastive loss, will it ruin the objective of contrastive learning? I am asking because the transformer encoder over the batch will basically reweight the representation of each sample according to the dot-product similarity between samples. Thank you for the wonderful introduction, by the way.
@alfcnz 3 years ago
Why would you want to use a transformer “within the batch dimension” (whatever this means)? Can you clarify what you're trying to do? 🤔🤔🤔
@hoseinhashemi3680 3 years ago
@@alfcnz I sent you an email. Thanks!
@alfcnz 3 years ago
I don't have the bandwidth to reply to emails, I'm sorry. I haven't checked them in a few months now, I think.