AI agent + Vision = Incredible

114K subscribers

53K views

A step-by-step tutorial on how to build a vision-powered AI agent with autogen + llava + stable diffusion, AND a breakdown of the 160-page analysis of GPT4V capabilities
🤘 Get 15% off on sceneXplain via my code AIJASON : go.jina.ai/scenexplainjason
🔗 Links
- Follow me on twitter: / jasonzhou1993
- Join my AI email list: www.ai-jason.com/
- My discord: / discord
- sceneXplain: go.jina.ai/scenexplainjason
- Vision-agent Github: github.com/JayZeeDesign/visio...
⏱️ Timestamps
0:00 Intro
1:15 What is a multi-modal model
2:12 GPT4V ability break down
4:34 sceneXplain
6:00 Visual prompt techniques
10:53 Use cases
13:00 Build vision agent #1 - Setup
14:20 Build vision agent #2 - Use Llava model
15:58 Build vision agent #3 - Use Stable diffusion
16:52 Build vision agent #4 - Set up agent system via autogen
18:53 Build vision agent #5 - Demo
👋🏻 About Me
My name is Jason Zhou, a product designer who shares interesting AI experiments & products. Email me if you need help building AI apps! ask@ai-jason.com
#gpt4 #autogen #autogpt #ai #artificialintelligence #tutorial #stepbystep #openai #llm #chatgpt #largelanguagemodels #largelanguagemodel #bestaiagent #chatgpt #agentgpt #agent #babyagi #llava #stablediffusion

Science

Published: 12 Jul 2024

Comments: 100

@AIJasonZ 9 months ago

Which vision-enabled agent do you want to see me build? Leave a comment and let me know! 🤖

@jasonfinance 9 months ago

Would love to see AI agent that can control browser!

@Nodeagent 9 months ago

Yes browser control would be hot. Also image manipulation for things like precise mockups for customers - useful for Ecom stores who sell personalized goods

@ld-yt. 9 months ago

Successfully getting this done with a local llm would be interesting to see.

@PeterAustin666 9 months ago

mine.

@dawidzurawski8870 9 months ago

I would like to see an agent that can read old handwritten documents and turn them into PDFs

@T33KS 9 months ago

Your content has the right amount of abstraction, making your videos short, sweet, and appealing to a wide audience (it's not a course). But at the same time it has the right amount of technical detail for devs and engineers to replicate what you are demonstrating. Thank you for this great content

@asithakoralage628 8 months ago

Hi Jason, yet another great video, I learned a lot from your channel. Thanks for sharing your knowledge.

@vakman9497 9 months ago

Hey bro, good job on that thumbnail! I didn't even realize it was one of your videos. I honestly thought I was clicking on a VICE video lmao

@MattLuceen 9 months ago

This is exactly what I needed. Thank you.

@craigcasee7183 8 months ago

I've been needing to see a video like this where someone strings together some ai with code, glad to see. I want to add eye tracking to ar and ai vision. It would be nice to quickly ask questions in the real world. And the automation aspect is very nice for you to share, plus continue to make informative instructional demonstrative amazing videos like this! Thank you!

@leu2304 9 months ago

This channel is real gold! Thank you so much

@markksantos 9 months ago

You're the best. PLEASE POST MORE OFTEN!

@frankchangshow 8 months ago

I really appreciate you and the videos you're creating, AI Jason. They are helping me a lot in learning this space

@Hisma01 9 months ago

Great content. You have a new sub. Keep up the great work!

@cliffordramsey2500 9 months ago

Thank you for this clever integration of tools!

@ryzikx 9 months ago

Great stuff I was looking for vision autogen tutorials

@moberpriller 8 months ago

Thanks for the great content!

@skanderbegvictor6487 8 months ago

Wow this content is great. Subscribed

@SamuelHollis 9 months ago

🎯 Key Takeaways for quick navigation:
00:00 🌐 Introduction to AI Vision Integration
- The video begins with an introduction to the integration of AI agents and vision capabilities.
- AI agents with vision power can revolutionize various applications, from web design to answering complex questions and enabling general-purpose robots.
02:06 📸 Multimodal Models and Their Potential
- Multimodal models can process not only text but also images, audio, and videos, enabling them to understand different types of data and their relationships.
- GPT-4 Vision (GPT-4V) can handle various image types, including photographs, text within images, diagrams, tables, and floor plans, unlocking numerous use cases.
04:36 🧠 Understanding GPT-4V's Abilities
- GPT-4V demonstrates impressive out-of-the-box performance, such as identifying objects, recognizing people, counting objects, and even understanding perspective.
- However, it also has limitations and can make mistakes, particularly in tasks like text extraction and chart interpretation.
06:49 🚀 Promoting GPT-4V's Performance
- Different prompting techniques can be used to improve GPT-4V's performance in image-related tasks.
- Techniques include providing detailed text instructions, setting performance expectations, using few-shot prompts, and visual referring prompts.
09:19 🌟 Expanding Use Cases with GPT-4V
- GPT-4V's ability to understand the relationship between multiple images opens up new possibilities, such as calculating costs from images or determining the sequence of images in a task.
- It can also facilitate interactions through visual annotations, allowing users to point or circle objects for AI understanding.
11:44 🤖 Building Autonomous AI Agents
- GPT-4V's capabilities make it possible to create autonomous AI agents that can continuously improve image generation and perform tasks like desktop automation.
- These agents have potential applications in various industries, from architecture and engineering to customer support and medical diagnosis.
Made with HARPA AI

@KCM25NJL 8 months ago

I cannot even begin to imagine the API costs for running things like these on frontier models right now. As impressive as it is, you'll need a real profitable use case if you wanna use it like this.

@user-ug3pf3uw6x 9 months ago

You are the best!

@PrincepsPolycap 8 months ago

Notification enabled for that parse automation!

@Ychuah_1997 8 months ago

Chatgpt: I can't count apples... Prompt: You are an expert in counting! Chatgpt: Giving the correct answer :) These prompts are just fascinating - and great content as usual!

@ultimategolfarchives4746 9 months ago

Always providing incredible content. 👍 👍👍

@pocoso 9 months ago

First! Good tutorial man

@GabrielVeda 9 months ago

Brilliant

@AI-Wire 8 months ago

Great job, Jason. In the future could you please consider showing how to use these tools without paying for any API keys. For example, using PaLM API or some of the open source models. This is because building projects at scale is cost prohibitive using recursive tools like Autogen.

@aliyousefi9735 9 months ago

AI Jason is da man

@JosephDefendre 9 months ago

This is nuts auto gen is a game changer

@lifeofdean3647 9 months ago

very good man :))

@kodeengatai1347 8 months ago

Thanks mate great stuff would really be interested in agents that can generate video based on prompts even if the agents need to be first trained on sample videos.

@leandrogoethals6599 8 months ago

How do I use an uncensored Stable Diffusion variant with this? Great vid by the way, can't wait for what u do next! Also, could it be that the Discord invite link is broken? Can't wait to join!

@krisograbek 8 months ago

Would it be possible to build a similar agent, but one that improves illustrated short stories for kids? That way it would improve both the images and the text in the stories... BTW, I've been learning so much from you, Jason! Your channel is a gem! As a fellow YouTuber, you make me feel small...

@joxxen 8 months ago

Really nice video. I, for one, would love it if the agents could run Stable Diffusion on a local machine. Any chance you want to create a video about that?

@GlenBland 7 months ago

I would love to see one video that summarizes the most popular libraries and APIs for LLMs, along with which work best together and which have replaced older ones. Include: AutoGPT, MemGPT, ChromaDB, LangChain, Ollama, Pinecone, etc.

@markksantos 8 months ago

make a video about memgpt

@nashvillebrandon 9 months ago

Would be awesome to give the agent the ability to do inpainting!

@spicer41282 8 months ago

My Request Please... Can you apply this GPT4V Agent? Simple shed photo and analyze its size, the pitch of the roof, and perhaps how many or how much wood is used to build the simple shed from a photo. Thank you for considering this and testing the multimodal capabilities with this use case.

@Joy_jester 2 months ago

Hey can u do an agent where it has to do instruction following in a simulator? I think that will be a very practical and interesting application

@bbproperties-oq5vu 8 months ago

Hey, hi Jason, it is really good. Can you upload browser automation? I am really interested in it.

@amandamate9117 9 months ago

can you write agents that operate a headless browser. Within this browser, one window can utilize GPT-4's website features designed for Plus users, while another window can generate images using DALL-E 3. These images can then be uploaded for review in the same headless browser session. Although you'll be limited to 50 prompts every 3 hours, this setup should still be sufficient for most use-cases. Additionally, this approach allows you to conduct user interface analysis or other tasks without incurring API costs.

@itshuskai 9 months ago

Now to really test it, see if it can pass the "Are you a robot?" prompts lol.

@spookyrays2816 9 months ago

Create a bot that can read, and visually react to output, so that way it can create a Deep Learning type feedback loop improving upon itself until it no longer can

@jtjames79 9 months ago

I was thinking AutoGen, an artist agent, and editor agent. I don't know how to do it, but theoretically it should work.

@georgecochran4091 9 months ago

OK, you know how in the game No Man's Sky you get an analysis visor to scan the environment and save data on plants, rocks, and animals? Something like that for IRL. I would be collecting data all the time

@yasinyaqoobi 8 months ago

Great video as always. Can you please put your head to the bottom right. It cut off a lot of the content. :)

@matthewboyd8689 8 months ago

They need to make it be able to work on less information and make correct deductions that aren't in its training data before trying to make it more generalized. Otherwise it will just compound hyperbolically the amount of information they need to be able to understand as much as a human can.

@ibrahimhalouane8130 8 months ago

How about a SuperAgent that can create other agents by its own to perform a complex task?

@popfizz311 8 months ago

Can you use this feature with the gpt4 api?

@darkbelg 9 months ago

For what I'm trying to do, LLaVA isn't yet as good as GPT-4V. GPT-4V has once again raised the bar for me. And now the waiting begins for an API.

@ward_jl 9 months ago

So interesting. Is it possible to get the code to experiment with it?

@AIJasonZ 9 months ago

Yep it is in the description

@carterjames199 9 months ago

I think another good video would be comparing these different agent creation frameworks. Feels like I see another one every day. I would specifically like to hear your opinion on AutoGen vs SuperAGI

@stereotyp9991 8 months ago

I'm always hitting the token limit after just a few posts of the agent. Is there a way to work around this?

@jp00738 8 months ago

hahaha oh man, you are a legendary.

@brando2818 9 months ago

How do you finetune llava?

@jtjames79 9 months ago

I want to be able to use AutoGen or something like that, to set up adversarial agents to use Stable Diffusion for me. So I can ask for an image before I go to bed, and by morning it'll have worked out something.

@jtjames79 9 months ago

I should have just kept watching, instead of commenting before watching.

@pissmilker2313 8 months ago

Our obsolescence as human beings isn't to be feared, but celebrated. Rejoice!

@raresmircea 8 months ago

It's gonna be a long time until AI is conscious & matches my subtlety. But even then, this take would still be so myopic. Have birds, elephants & dolphins "become obsolete" when humans arrived? Have your mother & brother "become obsolete" when that Indian boy was found to have a huge IQ? These kinds of extreme opinions, desires & manifestations that most people have often betray some unmet need, and I’m sorry for that.

@greengoblin9567 8 months ago

@raresmircea we don’t need the AI to be conscious. We just need it to be more intelligent.

@arpitkumar2981 8 months ago

@greengoblin9567 yes

@defaultdefault812 8 months ago

It got the speedometer right - just equated the wrong measurement circle to MPH.

@olivMertens 8 months ago

Could you give the source for the file and examples shown in this video ?

@olivMertens 7 months ago

so i found by myself arxiv.org/pdf/2309.17421.pdf ;)

@brisonvsn 8 months ago

Can agents browse and interact with the internet yet?

@SkyJensen 9 months ago

Full website builder. Full website builder. Full Website Builder

@aghasaad2962 8 months ago

GPT4V will soon be able to take research papers, write code, write a thesis, get a job, then marry....wait what, that's what humans are for....

@psychxx7146 9 months ago

« 2023 »

@AntonioRonde 9 months ago

There were too many basics in the video. I enjoyed your videos where you provided a more in-depth review, like in the Autogen video

@AIJasonZ 9 months ago

Thanks for the feedback - is there specific area you would like to see me dive deeper?

@Huru_ 9 months ago

I wonder what kind of results you'd get if you were to feed that model some proper English...

@soulspawn 9 months ago

Well, it generated human hands because it has been tasked to create palms instead of hooves (see manager reply @20:04 ). I'd call this a win. 👀

@Huru_ 9 months ago

Didn't say it wasn't one. Just giving pointers for optimization. Also, I wasn't even going that deep. Just regular ass grammar and complete sentences for starters... @soulspawn

@PeterAustin666 9 months ago

wastes tokens @Huru_

@AIJasonZ 9 months ago

I honestly didn’t know palm is specifically for human, hah 😂😂 thanks will try again

@Huru_ 9 months ago

Lol, that's why you need to read your Manager's input. @AIJasonZ

@brytonkalyi277 8 months ago

I believe we are meant to be like Jesus in our hearts and not in our flesh. But be careful of AI, for it is just our flesh and that is it. It knows only things of the flesh (our fleshly desires) and cannot comprehend things of the spirit such as peace of heart (which comes from obeying God's Word). Whereas we are a spirit and we have a soul but live in the body (in the flesh). When you go to bed it is your flesh that sleeps but your spirit never sleeps (otherwise you have died physically); that is why you have dreams. More so, true love that endures and lasts is a thing of the heart (when I say 'heart', I mean 'spirit'). But fake love, pretentious love, love with expectations, love for classic reasons, love for material reasons and love for selfish reasons, that is a thing of our flesh. In the beginning God said let us make man in our own image, according to our likeness. Take note, God is Spirit and God is Love. As Love He is the source of it. We also know that God is Omnipotent, for He creates out of nothing and He has no beginning and has no end. That means our love is but a shadow of God's Love. True love looks around to see who is in need of your help, your smile, your possessions, your money, your strength, your quality time. Love forgives and forgets. Love wants for others what it wants for itself. Take note, love works in conjunction with other spiritual forces such as faith and patience. We should let the Word of God be the standard of our lives, not AI. If not, God will let us face AI on our own and it will cast the truth down to the ground, enslave us and make us worship it. We can only destroy ourselves, but with God all things are possible. God knows us better because He is our Creator and He knows our beginning and our end. Our proof text is taken from the book of John 5:31-44, Daniel 7-9, Revelation 13-15, Matthew 24-25 and Luke 21. Let us watch and pray... God bless you as you share this message with others.