
Fast Zero Shot Object Detection with OpenAI CLIP 

James Briggs

Zero-shot object detection is made easy with OpenAI CLIP, a state-of-the-art multi-modal deep learning model. Here we will learn about zero-shot object detection (and object localization) and how to implement it in practice with OpenAI's CLIP.
ILSVRC was a world-changing competition hosted annually from 2010 until 2017. It was the catalyst for the Renaissance of deep learning and was the place to find state-of-the-art image classification, object localization, and object detection.
Researchers fine-tuned better-performing computer vision (CV) models to achieve ever more impressive results year after year. But there was an unquestioned assumption causing problems.
We assumed that every new task required model fine-tuning; fine-tuning required a lot of data, and gathering that data took both time and capital.
It wasn't until very recently that this assumption was questioned and proven wrong.
The astonishing rise of multi-modal models has made the impossible possible across various domains and tasks. One of those is zero-shot object detection and localization.
Zero-shot means applying a model without the need for fine-tuning: we take a multi-modal model, use it to detect objects in images from one domain, then switch to an entirely different domain without the model seeing a single training example from the new domain.
Not needing a single training example means we completely skip the hard part of data annotation and model training. We can focus solely on the application of our models.
In this chapter, we will explore how to apply OpenAI's CLIP to this task, using CLIP for localization and detection across domains with zero fine-tuning.
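
As a rough illustration of the idea, the sketch below scores sliding-window crops of an image against a text prompt with CLIP. It assumes the Hugging Face transformers implementation, and the image path, prompt, window size, and stride are illustrative choices rather than the exact patch-and-occlusion setup from the video.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Minimal sketch: score sliding-window crops of an image against a prompt
# and keep a per-window score map.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")        # hypothetical local image
prompt = "a fluffy cat"              # illustrative prompt
win, stride = 224, 112               # illustrative window size and stride

width, height = image.size
tops = list(range(0, height - win + 1, stride))
lefts = list(range(0, width - win + 1, stride))
scores = torch.zeros(len(tops), len(lefts))

for i, top in enumerate(tops):
    for j, left in enumerate(lefts):
        crop = image.crop((left, top, left + win, top + win))
        inputs = processor(text=[prompt], images=crop, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs)
        # Similarity between this crop and the prompt.
        scores[i, j] = out.logits_per_image[0, 0]

# High-scoring windows indicate where the prompt is localized; thresholding
# this map gives a crude region for zero-shot detection.
print(scores)

Averaging or thresholding such a score map over overlapping windows is the general pattern behind the localization and detection visuals discussed in the chapters below.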
🌲 Pinecone article:
pinecone.io/learn/zero-shot-o...
🤖 AI Dev Studio:
aurelio.ai/
👾 Discord:
/ discord
00:00 Early Progress in Computer Vision
02:03 Classification vs. Localization and Detection
03:55 Zero Shot with OpenAI CLIP
05:23 Zero Shot Object Localization with OpenAI CLIP
06:40 Localization with Occlusion Algorithm
07:44 Zero Shot Object Detection with OpenAI CLIP
08:34 Data Preprocessing for CLIP
13:55 Initializing OpenAI CLIP in Python
17:05 Clipping the Localization Visual
18:32 Applying Scores for Visual
20:25 Object Localization with New Prompt
20:52 Zero Shot Object Detection in Python
21:20 Creating Bounding Boxes with Matplotlib
25:15 Object Detection Code
27:11 Object Detection Results
28:29 Trends in Multi-Modal ML
#machinelearning #python #openai

Science

Published: 5 Aug 2024

Comments: 33
@BradNeuberg 9 months ago
Since this video was released, it looks like the image rescaling assumptions of the CLIP model being used have changed. In the existing code in this video's notebook, when the image is fed to the processor() function, its values have already been scaled to 0-1. Unfortunately this breaks some newer CLIP assumptions and everything will break for you, so you should add big_patches*255. before passing it into the processor() call for things to work correctly.
@drewholmes9946 7 months ago
@jamesbriggs Can you pin this and/or update the Pinecone article?
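
For anyone hitting this, a minimal sketch of the fix described in the comment above, assuming big_patches is a float tensor of patches already scaled to 0-1 (as in the video's notebook) and that the Hugging Face CLIPProcessor is in use; the stand-in tensor and prompt are illustrative.

import torch
from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Stand-in for the notebook's patches: a batch of RGB patches in the 0-1 range.
big_patches = torch.rand(6, 3, 224, 224)

# Newer processor versions assume 0-255 pixel values and rescale by 1/255
# internally, so multiply by 255 before calling the processor.
inputs = processor(
    images=big_patches * 255.0,
    text=["a fluffy cat"],            # illustrative prompt
    return_tensors="pt",
    padding=True,
)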
@rogerganga 1 year ago
Hey James! As someone with zero coding experience in computer vision and new to OpenAI's CLIP, I found this video incredibly valuable. Thank you so much!
@jamesbriggs 1 year ago
glad it was helpful!
@ceegee9064 1 year ago
What an incredibly approachable breakdown of a very complicated topic, thank you!
@jamesbriggs 1 year ago
Thanks!
@khalilsabri7978 9 months ago
thanks for the video, amazing work !!!
@hridaymehta893 1 year ago
Thanks for your efforts James! Amazing video!!
@jamesbriggs 1 year ago
thanks a ton!
@henkhbit5748 1 year ago
Really amazing, the advances in AI. Thanks for showing the hybrid approach for "object detection" using text👍
@jamesbriggs 1 year ago
Glad you liked it, I'm always impressed with how quick things are moving in AI, it's fascinating
@manumaminta6131 1 year ago
Hi! Love the content. I was just wondering, since we are passing patches of images to the CLIP visual encoder (and each patch has its own dimensions), does that mean we have to resize the patches so that they fit the input dimension of the CLIP visual encoder? :) Looking forward to your reply
@lorenzoleongutierrez7927 1 year ago
Great tutorial ! 👏
@AthonMillane 1 year ago
Hi James, thanks for the fantastic tutorial. How do you think this would work for, e.g., drawing bounding boxes around multiple books on a bookshelf? They are next to each other, and so the image patches will all probably correspond to "book", but which individual book is not clear. Would making the patches smaller improve things? Any ideas on how to address this use case would be much appreciated. Cheers!
@hariypotter8 1 year ago
Using your code line for line, I'm having trouble with this: no matter what prompt I use, my output image looks exactly the same with regard to localization and the dimming of patches based on score. It looks like I'm only seeing the most frequently visited patches rather than those with the highest CLIP score. Any ideas?
@ITAbbravo 1 year ago
I might be a bit late to the party, but it seems that the major issue is that the variable "runs" is initialized with torch.ones instead of torch.zeros. The localization is still not as good as the one in the video, though...
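
To make the point above concrete, here is a hedged sketch of the score-accumulation step, assuming the notebook keeps a per-patch score map plus a visit counter; the variable names follow the comment, while the grid shape and values are illustrative.

import torch

grid_h, grid_w = 12, 18                 # illustrative patch-grid size
scores = torch.zeros(grid_h, grid_w)    # accumulated CLIP scores per patch
runs = torch.zeros(grid_h, grid_w)      # starts at zero, as the comment suggests

# Inside the sliding-window loop: a window covering rows y0:y1 and
# columns x0:x1 scored `sim` against the prompt (dummy values here).
y0, y1, x0, x1, sim = 0, 6, 0, 6, 0.73
scores[y0:y1, x0:x1] += sim
runs[y0:y1, x0:x1] += 1

# Average per patch; initializing `runs` with ones instead of zeros skews
# this average, which is the bug the comment describes.
avg = scores / torch.clamp(runs, min=1)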
@stevecoxiscool 1 year ago
What models and technology would one use to "scan" a directory of images and then output text describing what the model found in each image?
@andy111007 1 year ago
Hi James, how did you create the dataset? Did you need to annotate the images or convert them to YOLO or COCO format before forming the dataset? Would love to hear more. Thanks, Ankush Singal
@andy111007 1 year ago
The code does not work for forming a bounding box around the localized object.
@andrer.6127 1 year ago
I have been trying to figure out how to change it from one class and one instance to one class and many instances, but I can't seem to work it out. What should I do?
@papzgaming9412 8 months ago
Thanks
@shaheerzaman620 1 year ago
Fascinating!
@jamesbriggs 1 year ago
thanks as always Shaheer!
@TheArkLade 1 year ago
Does anyone know why [IndexError: list index out of range] appears when trying to detect more than 2 objects? For example: detect(["cat eye", "butterfly", "cat ear"], img, window=6, stride=1)
@abhishekchintagunta8731 1 year ago
Excellent explanation, kudos James
@jamesbriggs 1 year ago
glad it helped!
@SinanAkkoyun 1 year ago
Thank you! Can you get the vectors right out of CLIP without supplying a prompt? So that you get embeddings for every patch and can then derive what is being detected?
@jamesbriggs 1 year ago
You can get embeddings, but they come from after the CLIP encoder stage. The image patches are what is fed into the model and aren't very easily interpretable; it's the CLIP encoding layers that encode 'meaning' into them.
@Helkier 1 year ago
Hello James, the Colab link is no longer available in your Pinecone article.
@hchautrung 1 year ago
Might I know the total runtime if we put this into production?
@AIfitty-xs7qn 1 year ago
Hello James! I have a use case for CLIP. I think. If it works. I am not a computer programmer and have never used Colab, but I have a few months to learn, if learning all that can be done in that amount of time. I also have about 30k-40k photos that I would like to tag every day in the summer - tagged as either blue shirt or white shirt (sports). Every tutorial I have seen uses a dataset that is located online. Can I direct CLIP to my local server to perform object detection? Do the photos need to be in any particular format for optimum results? Well, let me back up. Can you direct me to a resource that will give me the background I need to be able to follow along with you in these videos? After that, I should be able to ask more relevant questions. Thank you for the videos!
@Ahmad-H5 1 year ago
Hello, thank you so much for creating this video, it is quite easy to follow for a beginner like me ☺. I was also wondering if CLIP can connect images to text instead of text to images.
@jamesbriggs 1 year ago
Yes 100% - after you process the images and text with CLIP it just outputs vectors, and with vector search it doesn't matter whether those were produced from text or images, see here: ru-vid.com/video/%D0%B2%D0%B8%D0%B4%D0%B5%D0%BE-fGwH2YoQkDM.html Hope that helps!
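
A minimal sketch of the reply above, assuming the Hugging Face CLIP implementation: embed one image and a handful of candidate captions, then pick the caption the image scores highest against. The captions and image path are illustrative.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")                       # hypothetical image
captions = ["a dog on a beach", "a city skyline", "a bowl of fruit"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# logits_per_image holds the image's similarity to every caption, so the
# image -> text direction is just an argmax over the same scores.
best = out.logits_per_image.softmax(dim=-1).argmax(dim=-1).item()
print(captions[best])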