
Fine-tune Multi-modal LLaVA Vision and Language Models 

Trelis Research

➡️ ADVANCED Vision Fine-tuning Repo: trelis.com/advanced-vision/
➡️ ADVANCED-inference Repo: trelis.com/enterprise-server-...
➡️ ADVANCED-fine-tuning Repo: trelis.com/advanced-fine-tuni...
➡️ Trelis Function-calling Models and Scripts: trelis.com/function-calling/
➡️ ADVANCED Transcription Repo: trelis.com/advanced-transcrip...
➡️ One-click Fine-tuning & Inference Templates: github.com/TrelisResearch/one...
➡️ Trelis Newsletter: Trelis.Substack.com
➡️ Trelis Resources and Support: Trelis.com/About
Affiliate Links (support the channel):
- Vast AI - cloud.vast.ai/?ref_id=98762
- RunPod - tinyurl.com/4b6ecbbn
*Video Resources*
Slides: docs.google.com/presentation/...
One-click RunPod / VastAI Templates: github.com/TrelisResearch/ins...
IDEFICS: huggingface.co/HuggingFaceM4/...
LLaVA: llava.hliu.cc/
Trelis Newsletter: Trelis.Substack.com
Chapters:
0:00 Fine-tuning Multi-modal Models
0:16 Overview
1:30 LLaVA vs ChatGPT
4:53 Applications
5:37 Multi-modal model architecture
9:05 Vision Encoder architecture
14:00 LLaVA 1.5 architecture
16:30 LLaVA 1.6 architecture
18:30 IDEFICS architecture
22:00 Data creation
24:11 Dataset creation
25:29 Fine-tuning
34:25 Inference and Evaluation
37:34 Data loading
40:00 LoRA setup
42:52 Recap so far
43:25 Evaluation pre-training
44:26 Training
45:40 Evaluation post-training
46:45 Technical clarifications
50:29 Summary

Science

Published: 8 Jun 2024

Comments: 82
@TrelisResearch · 1 month ago
UPDATE APRIL 24th 2024: VRAM requirements have been greatly reduced by adding gradient checkpointing (all figures below are for 16-bit training):
LLaVA 1.5
- liuhaotian/llava-v1.5-7b takes a minimum of 24 GB to train and will run on a single A6000.
- liuhaotian/llava-v1.5-13b REQUIRES VRAM OF
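For readers wondering what the update refers to, below is a minimal, hedged sketch of enabling gradient checkpointing with the Hugging Face port of LLaVA 1.5; the model ID and API are assumptions, and the pinned figures above were measured with the original liuhaotian training scripts rather than this code.

```python
# Hedged sketch: enable gradient checkpointing to cut activation memory.
# Assumes the Hugging Face port of LLaVA 1.5 (llava-hf/llava-1.5-7b-hf).
import torch
from transformers import LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    torch_dtype=torch.bfloat16,
)
# Recompute activations during the backward pass instead of storing them,
# trading extra compute time for a large reduction in VRAM.
model.gradient_checkpointing_enable()
model.config.use_cache = False  # the KV cache is incompatible with checkpointing
```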
@TemporaryForstudy · 1 month ago
Nice video. Where do you work in Dublin? I am from India and I want to work at your company. I have a master's degree in AI. I am currently working at an Indian company, but they do not offer remote work and the pay is also low. So please let me know if there is something for me.
@sam_joshua_s · 3 months ago
Most underrated YouTube channel.
@ForTheEraOfLove · 3 months ago
Reminds me of the Person of Interest episode called "If-Then-Else", where "The Machine" has to make a choice among nearly infinite possibilities. Great show for ML enthusiasts.
@user-my1tx4dc2w · 3 months ago
Amazing video! Thank you for sharing! ❤
@Tsardoz · 1 month ago
Very well explained.
@lourdarunraj9967 · 2 months ago
Amazing content!
@NametVevo · 25 days ago
Thank you for your video! I'm just starting out in AI and it helps me a lot.
@Cloudvenus666 · 1 month ago
One thing to note: it took 9x A6000s for me, as 7 caused CUDA to run out of memory. Nevertheless, this is the best channel to learn how to fine-tune models, and it is worth buying the repos.
@TrelisResearch · 1 month ago
Interesting. Which model - the 34B? And did you change batch size or context length?
@Cloudvenus666 · 1 month ago
@TrelisResearch I used the 34B and didn't change the configuration. I'm sure I could have gotten away with 8 GPUs, but 7 ran a bit short.
@jacekb4057 · 25 days ago
Man, this helps me a lot. Thanks ❤
@3169aaaa · 25 days ago
@jacekb4057 Hi, did you create a notebook related to this?
@3169aaaa · 25 days ago
@jacekb4057
@danieldemillard9412 · 3 months ago
Thanks again for another great video and tutorial. How much effort would it take to swap out your code to work with Mixtral 8x7B? I assume it isn't as trivial as swapping out the model name and fine-tuning. Do you foresee any issues with combining these with Instruct models instead of the base chat models?
@TrelisResearch · 3 months ago
Good Q. I don't think it would take much work, although Mixtral doesn't quite fit on a single A100, so training will be slower - maybe 24 hours on 8 A100s. Btw, I'm also just fine-tuning here, so if you wanted to swap in Mixtral it's maybe better to use the original code.
@unsaturated8482 · 3 months ago
Very informative.
@imranullah3097 · 3 months ago
❤❤❤❤❤ Kindly also create a video on HiFi-GAN, to fine-tune a model for natural speech synthesis.
@unclecode · 1 month ago
Worth 51 minutes of your life. Kudos, I learned a lot. Quick question - got any vids or tips on making something like LLaVA from scratch? Moondream is a good example. I have watched your other vids, but those are more about fine-tuning, like this one. I want to grasp the whole process of merging models, building the adapter, training it, and releasing a new multi-modal version of the original language model. Thx again.
@TrelisResearch · 1 month ago
I guess you watched the Moondream video I made, right? That's a start. Yeah, building from scratch is a bit more involved, as you have to make the loading scripts. Again, the Moondream model repo is a good place to look and get inspiration. I may get around to building from scratch at some point.
@lalpremi · 3 months ago
Thank you for sharing, very interesting. Wow, your trained model summarizing given pictures is very impressive and fast. What type of hardware is behind the scenes handling your site? Have a great day. 🙂
@TrelisResearch · 3 months ago
I'm running on A6000s on RunPod! See: github.com/TrelisResearch/install-guides/blob/main/llm-notebook-setup.md
@UtoobNam · 1 month ago
Hey! Are you making something similar for the multimodal-output LLaVA (interactive)?
@user-im4mt4ce1x · 2 months ago
Hi, love the content btw. Do you think fine-tuning Phi-2 with this approach might be a good idea, like what Moondream does? And will this same script work for Phi-2?
@TrelisResearch · 2 months ago
Yes, in principle that would work, although you would need to instantiate the model correctly, swapping in Phi for Mistral/Llama.
@mirai5749 · 1 month ago
Hello. Embeddings are expert resamplers? Just read about the Prismer VLM.
@luce_yliu7524 · 1 month ago
Great materials! Do you have this repo on your GitHub?
@TrelisResearch · 1 month ago
Yup, this is in the ADVANCED-vision repo, available for purchase from trelis.com/ADVANCED-vision
@user-gp5wb6cz2v · 3 months ago
Great video! I have fine-tuned a Llama 2 model on a V100 previously, but I'm wondering whether a model like llava-v1.6-mistral-7b on Hugging Face would be too large to fit in the 16 GB available on the V100? Any suggestions on how to figure out how much VRAM a model requires? It often doesn't seem obvious from the documentation.
@TrelisResearch · 3 months ago
Yeah, so Llama 7B has 7B parameters, and in 16-bit that's two bytes per parameter, so you need about 14 GB of VRAM to load the model, plus some headroom for the KV cache over the context length. For LLaVA you additionally need space for the image model AND for the KV cache for the images. The vision model is quite small - a few hundred GB in size - so that shouldn't add much. I see on the repo that the files are around 16 GB in total for model plus vision. However, the vision model is cast up to 32 bits, so that can also double its size. All in all, in 16-bit it won't be possible to fit in 16 GB of VRAM unless you do quantization. There's a flag to set that, but it's not stable and I had issues trying it. Basically, the LLaVA 1.6 model is not well supported in Hugging Face, so custom scripts are needed, like I showed in the video here. However, you can train LLaVA 1.5 with 4-bit quantization and that should fit on your V100.
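As a rough illustration of the arithmetic above, plus a hedged sketch of 4-bit loading for LLaVA 1.5; the model ID and quantization settings are assumptions for illustration, not the exact repo configuration.

```python
# Back-of-the-envelope weight memory from the reply above (weights only;
# KV cache, activations and optimizer state come on top):
params = 7e9
print(f"16-bit: ~{params * 2 / 1e9:.0f} GB")    # ~14 GB
print(f" 4-bit: ~{params * 0.5 / 1e9:.1f} GB")  # ~3.5 GB

# Hedged sketch: load LLaVA 1.5 in 4-bit so a 16 GB V100 has headroom left
# for the vision tower, image KV cache and LoRA adapters.
import torch
from transformers import BitsAndBytesConfig, LlavaForConditionalGeneration

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # V100s do not support bfloat16
)
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
```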
@user-gp5wb6cz2v · 3 months ago
Thank you for taking the time to reply! I assume you meant a few hundred MB for the vision model? That's interesting regarding the differences between training 1.5 vs 1.6 currently. Do you think there might be some more out-of-the-box approaches to fine-tuning 1.5, or would it still require more custom scripts like yours?
@TrelisResearch · 3 months ago
@user-gp5wb6cz2v Oops, yes, hundreds of MB. Actually, I just tested 1.6 again yesterday and I think it should be OK with about 24 GB of VRAM. Regarding more out-of-the-box options, I'm a bit puzzled why this hasn't happened, and it's been a month or so now; perhaps we'll just have to look towards the next model.
@nguyenhoangnam · 2 months ago
@TrelisResearch Correct me if I'm wrong: from what you stated above, you mean your script can fine-tune 1.6 on a 24 GB 3090?
@TrelisResearch · 2 months ago
@nguyenhoangnam In principle it should be possible, but in practice the scripts for 1.6 take quite a bit more. There are some notes on trelis.com/advanced-vision
@divyagarh · 1 month ago
Hi Ronan, once the model is trained, can we ask the model to give an image of a wooden rook or a black/white rook? Or is this model just classifying whether it is a rook or a king piece?
@TrelisResearch · 1 month ago
Nice question. The model is just classifying/describing. To go the other direction (generation) you need a diffusion model, which basically starts with features and then renders and smooths those out.
@LukeDupin · 3 months ago
Awesome
@user-io1jn5ob1p · 3 months ago
Amazing and very informative. Can you please also show us how to fine-tune LLaVA 1.5?
@TrelisResearch · 3 months ago
Same approach! I used the same script!
@khalilbezrati8638 · 1 month ago
Thank you for this video. I have a question and would be happy if you could answer it: do you think these multimodal AIs like LLaVA can be fine-tuned for fraud detection in identity documents (passports, ID cards, driver's licenses)?
@TrelisResearch · 1 month ago
Yes, this sounds like a good use case.
@xtu373 · 2 months ago
How many examples are needed to fine-tune LLaVA to get better results? 100 examples? What's the minimum number?
@TrelisResearch · 1 month ago
It depends on how broad the concepts you are aiming to build into the model are. For a very narrow fine-tune, just 25 images might be enough. You can get a rough sense from the video here and this application. If you additionally wanted to train on other board games, you'd need quite a few more examples.
@AlexBerg1 · 3 months ago
On a first watch-through, my impression is that fine-tuning LLaVA is a much longer script than fine-tuning Llama.
@TrelisResearch · 3 months ago
Yeah, it's much longer because you can't use out-of-the-box trainers with default data preparation (the preparation of prompts for a model with images and vision is different). Out-of-the-box support will probably come, but it will take some time.
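The repo's collator isn't reproduced here, but a minimal sketch (using the Hugging Face LLaVA 1.5 processor as an assumed stand-in) shows why the data preparation is longer than for a text-only model: every example has to interleave an image placeholder with the chat prompt and produce pixel values alongside token IDs.

```python
# Illustrative only - not the Trelis repo's code. Field names ("question",
# "answer", "image") and the chat template are assumptions.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

def collate(examples):
    prompts = [
        f"USER: <image>\n{ex['question']} ASSISTANT: {ex['answer']}"
        for ex in examples
    ]
    images = [ex["image"] for ex in examples]
    batch = processor(text=prompts, images=images, padding=True, return_tensors="pt")
    # Simplistic labelling: compute loss over the whole sequence. A real script
    # would mask the prompt and padding tokens with -100.
    batch["labels"] = batch["input_ids"].clone()
    return batch
```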
@sillystuff6247 · 3 months ago
Is there a way to upload images to an OpenAI model via the API?
@sherpya · 3 months ago
Yes, you need to use the GPT-4 vision model.
@TrelisResearch · 3 months ago
platform.openai.com/docs/api-reference/chat - click on "image input" to the right of the screen.
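For reference, a minimal sketch of image input via the OpenAI chat API; the model name and image URL below are placeholders, so check the linked API reference for current options.

```python
# Hedged sketch: passing an image URL to an OpenAI vision-capable chat model.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the chess position in this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/board.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```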
@xtu373 · 2 months ago
Hi! Can I get the notebook repo for fine-tuning multi-modal LLaVA?
@TrelisResearch · 2 months ago
Check out Trelis.com/ADVANCED-vision
@ayushsinghal28 · 3 months ago
Can it work with multiple images in a single prompt?
@TrelisResearch · 3 months ago
It can!
@jonatan01i · 1 month ago
Wouldn't it be easier to load the model however it comes and then loop through all the modules, setting them to bfloat16?
@TrelisResearch · 1 month ago
Yeah, now that you say it, I don't see why not. Sounds better.
@TrelisResearch · 1 month ago
UPDATE: Yeah, I had forgotten that the main reason not to do this is that you need more VRAM to first load everything in float32 (or whatever the default is), so you may OOM.
@jonatan01i · 1 month ago
@TrelisResearch Oh wow, I hadn't thought of that. Feels like a lot of hassle; hats off that you pushed through to make it happen. But upon more thinking: can you not change the number of gpus aft... No, I'll do one better: either send it in fp16, or if that doesn't work, loop through on the CPU, send one set of parameters at a time to the GPU and convert it to bfloat16, then go to the next, and so on.
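A hedged sketch of the idea above: keep the model on the CPU in its default dtype, then cast and move it one module's worth of tensors at a time, so the full float32 copy never sits in VRAM at once. Purely illustrative; tied weights, integer buffers and quantized layers would need extra care in real code.

```python
import torch

def cast_and_move(model, device="cuda", dtype=torch.bfloat16):
    # Visit every submodule and convert only its own tensors (recurse=False),
    # so at most one module's float32 copy is in flight at a time.
    for module in model.modules():
        for param in module.parameters(recurse=False):
            target = dtype if param.is_floating_point() else param.dtype
            param.data = param.data.to(device=device, dtype=target)
        for name, buf in module.named_buffers(recurse=False):
            target = dtype if buf.is_floating_point() else buf.dtype
            module._buffers[name] = buf.to(device=device, dtype=target)
    return model
```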
@TheYephers · 3 months ago
Will these fine-tuning projects run on Colab Pro (A100) as is?
@TrelisResearch · 2 months ago
LLaVA 1.5 will, but LLaVA 1.6 won't for now; the memory requirement to fine-tune is 100 GB. It should be a lot lower, but there is an open comment on the GitHub repo about that high memory usage. So you need 2x A100 or 3x A6000.
@DeviGoneMad · 2 months ago
@TrelisResearch But we can use 4-bit quantization to fine-tune LLaVA 1.6, and that will run on Google Colab, right?
@TrelisResearch · 2 months ago
@DeviGoneMad In principle yes, but I haven't been able to get quantization working with the 1.6 models (as opposed to 1.5). :(
@semigoso7274 · 4 days ago
Did you run into this error when checking LoraConfig?
ValueError: Target module Sequential( (0): Linear(in_features=1024, out_features=4096, bias=True) (1): GELU(approximate='none') (2): Linear(in_features=4096, out_features=4096, bias=True) ) is not supported. Currently, only the following modules are supported: `torch.nn.Linear`, `torch.nn.Embedding`, `torch.nn.Conv2d`, `transformers.pytorch_utils.Conv1D`.
peft version 0.11.2.dev0, running on an NVIDIA A10G. Everything ran correctly until that part. Great video!
@TrelisResearch · 3 days ago
It seems like you tried to set a module as trainable that is not a linear layer. Just look at your LoRA modules and try commenting them out one by one, or comment them all out and then include them one by one. Use print(model) to see the list of modules.
@semigoso7274 · 3 days ago
@TrelisResearch The layer is the mm_projector (the adapter), which is composed of Sequential( (0): Linear(in_features=1024, out_features=4096, bias=True) (1): GELU(approximate='none') (2): Linear(in_features=4096, out_features=4096, bias=True) ). Did you train the adapter as a whole without any issues, or just the linear parts of the adapter?
@TrelisResearch · 2 days ago
@semigoso7274 Ah yes, you can't do that, because the GELU isn't a linear layer. You have to target "model.mm_projector.0" and "model.mm_projector.2".
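Based on the reply above, a hedged sketch of a LoraConfig that targets the projector's individual Linear layers instead of the Sequential wrapper; the rank, alpha and attention target names are illustrative assumptions, not the repo's exact settings.

```python
# Sketch only: target the two Linear layers inside the mm_projector
# ("model.mm_projector.0" and "model.mm_projector.2"); index 1 is the GELU,
# which PEFT cannot wrap. r/alpha and the attention targets are assumptions.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # language-model attention
        "model.mm_projector.0",                  # first Linear of the adapter
        "model.mm_projector.2",                  # second Linear of the adapter
    ],
)
model = get_peft_model(model, lora_config)  # `model` loaded beforehand
```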
@pacograciadonat8885 · 1 month ago
Hello, I'm having problems with the image.py file when I try to use a raw image URL. What can I do?
@pacograciadonat8885 · 1 month ago
This is the error I have: cannot identify image file
@TrelisResearch · 1 month ago
Howdy! If you purchased repo access, it's best to post an issue there. If you're using a URL, then put the relevant portion of the image.py code into ChatGPT and ask it to adjust the code to allow either an image OR a URL to be passed as the input.
@pacograciadonat8885 · 1 month ago
@TrelisResearch Thank you, really. And one more thing: what is the entire code for fine-tuning on the dataset?
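Following up on the URL issue in this thread, a generic sketch (not the repo's image.py) of accepting either a local path or a raw image URL; "cannot identify image file" usually means PIL was handed something that isn't image bytes, such as an HTML error page.

```python
# Illustrative helper, not the repo's image.py. Downloads the URL bytes
# explicitly before opening them, so PIL only ever sees real image data.
from io import BytesIO

import requests
from PIL import Image

def load_image(source: str) -> Image.Image:
    if source.startswith(("http://", "https://")):
        response = requests.get(source, timeout=30)
        response.raise_for_status()  # fail loudly on 404s etc.
        return Image.open(BytesIO(response.content)).convert("RGB")
    return Image.open(source).convert("RGB")
```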
@xiaojinyusaudiobookswebnov4951 · 3 months ago
Can you show how to fine-tune Google's Gemma models?
@TrelisResearch · 3 months ago
Same approach as in the embeddings vs fine-tuning videos. Btw, I'm unsure Gemma is that good compared to Mistral or OpenChat.
@tami9154 · 2 months ago
Can I do all this on Windows?
@TrelisResearch · 2 months ago
You can do it on Windows if you have a GPU. If you don't have a separate GPU, then you won't have enough RAM.
@fuba44 · 3 months ago
If you reversed the axis, the queen would be on h5; maybe it's not a standard chess board? I'm not a big chess guy.
@TrelisResearch · 3 months ago
Yeah, it's possible that's the mix-up.
@xtu373 · 2 months ago
Why did you post this video on YouTube when you are trying to sell the repo? Please change the video title.
@TrelisResearch · 2 months ago
Howdy! Hopefully you can learn quite a lot without buying the repo. I don't have ads on this channel, and those who do buy repos help to support the channel. That's the business model.
@robxmccarthy · 1 month ago
I appreciate everything you share, @TrelisResearch.
@matbeedotcom · 3 months ago
Oh hell yeah
@Yo-rw7mq · 1 month ago
Hey!! How can I estimate the GPU requirements for fine-tuning the following model using LoRA? Model name: llava-hf/llava-v1.6-mistral-7b-hf
@TrelisResearch · 1 month ago
I've just pinned a comment showing memory requirements; you'll see it there.
@Yo-rw7mq · 1 month ago
Thank you so much.