
vLLM - Turbo Charge your LLM Inference 

15K views

vLLM - Turbo Charge your LLM Inference
Blog post: vllm.ai/
Github: github.com/vllm-project/vllm
Docs: vllm.readthedocs.io/en/latest/getting_started/quickstart.html
Colab: drp.li/5ugU2
My Links:
Twitter - Sam_Witteveen
Linkedin - www.linkedin.com/in/samwitteveen/
Github:
github.com/samwit/langchain-tutorials
github.com/samwit/llm-tutorials
Timestamps:
00:00 Intro
01:17 vLLM Blog
04:27 vLLM Github
05:40 Code Time
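
For reference, a minimal sketch of vLLM's offline-inference API along the lines of the quickstart linked above; the model id and prompts are only examples, and a CUDA GPU plus `pip install vllm` are assumed.

from vllm import LLM, SamplingParams

# Example prompts; any list of strings works.
prompts = [
    "The capital of France is",
    "Explain PagedAttention in one sentence:",
]
# Sampling settings roughly in line with the quickstart defaults.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Downloads the weights from the HuggingFace Hub on first use.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)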

Science

Published: 7 Jul 2023

Comments: 57
@rajivmehtapy 1 year ago
As always, you are one of the few people who hit this topic on YouTube.
@user-ew8ld1cy4d 1 year ago
Sam, I love your videos but this one takes the cake. Thank you!!
@sp4yke 1 year ago
Thanks Sam for this video. It would be interesting to dedicate a video to comparing OpenAI API emulation options such as LocalAI, Oobabooga, and vLLM.
@g-program-it 1 year ago
Finally AI models that don't take a year to give a response. Cheers for sharing this Sam.
@clray123 1 year ago
Uhh... you already get instant responses from GGML/llama.cpp (apart from the model weight loading time, but that is not anything PagedAttention improves on). The deal with PagedAttention is that it prevents the KV cache from wasting memory: instead of overallocating the entire context length at once, it allocates in chunks as the sequence keeps growing (and possibly shares chunks among different inference beams or users). This allows the same model to serve more users (throughput) - of course, provided that they generate sequences shorter than the context length. It should not affect the response time for any individual user (if anything, it makes it worse because of the overhead of mapping virtual to physical memory blocks). So if it improves on HF in that respect, it just demonstrates that either HF's KV cache implementation sucks or Sam is comparing non-KV-cached generation with KV-cached generation.
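
For readers who want the idea above in code, here is a toy sketch in plain Python (not vLLM's actual internals): KV-cache memory is handed out in fixed-size blocks as a sequence grows, instead of reserving the full context length up front. All names and sizes are illustrative.

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

class BlockAllocator:
    """Hands out physical block ids from a fixed pool of GPU blocks."""
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free_blocks.pop()  # raises IndexError when the pool is exhausted

class Sequence:
    """Grows its block table on demand instead of pre-allocating the whole context."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Only grab a new block when the current one is full (or on the first token).
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=1024)
seq = Sequence(allocator)
for _ in range(40):
    seq.append_token()
print(len(seq.block_table))  # 3 blocks for 40 tokens, not a full-context reservation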
@wilfredomartel7781 1 year ago
Finally we can achieve fast responses.
@MultiSunix 1 year ago
Talked to its core developers; they don't have plans to support quantized models yet, hence you really need powerful GPU(s) to run it.
@henkhbit5748 11 months ago
Hmm, did not know that Red Bull and Verstappen are in the race for turbocharging LLMs 😉 thanks for demonstrating vLLM in combination with an open-source model 👍
@Rems766 11 months ago
Thanks mate, I am going to try to add that to LangChain so it can integrate seamlessly into my product.
@akiempaul1117 8 months ago
Great Great Great
@mayorc 1 year ago
This looks very useful.
@mayorc 1 year ago
A very good test you could make a video about would be to use the OpenAI-compatible server functionality with a well-performing, optimized local model trained for coding, and test it with great new tools like GPT-Engineer or Aider, to see how it performs compared to GPT-4 in real-world application-writing scenarios.
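
As a starting point for that kind of test, here is a rough sketch of talking to vLLM's OpenAI-compatible server with the pre-1.0 openai Python client. The entrypoint name, flags, and model id are examples and may differ between vLLM versions.

# Start the server first, for example:
#   python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-hf
import openai

openai.api_key = "EMPTY"                      # the local server ignores the key
openai.api_base = "http://localhost:8000/v1"  # vLLM's default port

completion = openai.Completion.create(
    model="meta-llama/Llama-2-7b-hf",         # must match the model the server loaded
    prompt="Write a Python function that reverses a string.",
    max_tokens=128,
    temperature=0.2,
)
print(completion.choices[0].text)

Tools that speak the OpenAI API can then, in principle, be pointed at the same base URL instead of api.openai.com.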
@guanjwcn 1 year ago
This is very interesting. Thanks for sharing this. It would be nicer, I guess, if LangChain could do the same.
@jasonwong8934 1 year ago
I'm surprised the bottleneck was due to memory inefficiency in the attention mechanism and not the volume of matrix multiplications.
@MeanGeneHacks 1 year ago
Thank you for this nugget. Very useful information for speeding up inference. Does it support the bitsandbytes library for loading in 8-bit or 4-bit? Edit: noticed no Falcon support.
@samwitteveenai 11 months ago
AFAIK they are not supporting bitsandbytes etc., which doesn't surprise me, as they are mainly using it for comparing models, and low-resolution quantization is not ideal for that.
@NickAubert 1 year ago
It looks like vLLM itself is CUDA-based, but I wonder if these techniques could apply to CPU-based models like llama.cpp? Presumably any improvements wouldn't be as dramatic if the bottleneck is processing cycles rather than memory.
@harlycorner 11 months ago
Thanks for this video. Although I should mention that, at least on my RTX 3090 Ti, the GPTQ 13B models with the exllama loader are absolutely flying. Faster than ChatGPT-3.5 Turbo. But I'll definitely take a look.
@shishirsinha6344 10 months ago
Where is the model comparison in terms of execution time with respect to HuggingFace?
@MrRadziu86 11 months ago
@Sam Witteveen do you know by any chance how this compares to other recent techniques for speeding up models? I don't remember exactly, but sometimes it is just a setting, a parameter nobody used until somebody shared it, as well as other techniques. Also, if you happen to know, which ones are better suited for Falcon, Llama, etc.?
@samwitteveenai 11 months ago
For many of the options I have looked at, this compares well for the models that it works with.
@MariuszWoloszyn 1 year ago
vLLM is great but lacks support for some models (and some are still buggy, like mpt-30b with streaming, but MPT was added like 2 days ago so expect that to be fixed soon). For example, there's little chance it will support Falcon-40B soon. In that case use huggingface/text-generation-inference, which can load Falcon-40B in 8-bit flawlessly!
@samwitteveenai 1 year ago
Yes, none of these are flawless. I might make a video about hosting with HF Text Generation Inference as well.
@Gerald-xg3rq 3 months ago
Can you run this on AWS SageMaker too? Does it also work with the Llama 2 models with 7 and 13 billion parameters?
@user-hf3fu2xt2j 1 year ago
Now I wonder if it is possible to launch this on CPU. Some models would work tolerably.
@frazuppi4897 1 year ago
Not sure, since they compared with HF Transformers, and HF doesn't use Flash Attention to my knowledge, so it is quite slow by default.
@samwitteveenai 11 months ago
They also compared to TGI, which does have Flash Attention (huggingface.co/text-generation-inference), and vLLM is still quite a bit faster.
@asmac001nolastname6 1 year ago
Can this package be used with quantized 4-bit models? I don't see any support for them in the docs.
@samwitteveenai 1 year ago
No, I don't think it will work with that.
@clray123 1 year ago
It should be noted that for whatever reason it does not work with CUDA 12.x (yet).
@samwitteveenai 1 year ago
My guess is just that their setup is not using that yet, and it will come. I actually just checked my Colab and that seems to be running CUDA 12.0, but maybe that is not optimal.
@io9021 1 year ago
I'm wondering how vLLM compares against conversion to ONNX (e.g. with Optimum) in terms of speed and ease of use. I'm struggling a bit with ONNX 😅
@s0ckpupp3t 11 months ago
Does ONNX have streaming ability? I can't see any mention of WebSocket or HTTP/2.
@io9021 11 months ago
@@s0ckpupp3t Not that I know of. I converted bloom-560 to ONNX and got similar latency to vLLM. I guess with ONNX one could optimise it a bit further, but I'm impressed by vLLM because it's much easier to use.
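
For anyone trying the Optimum route mentioned above, here is a minimal sketch of the ONNX export path, assuming pip install optimum[onnxruntime] transformers; the model id is only an example.

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

model_id = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# export=True converts the PyTorch checkpoint to ONNX on the fly.
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)

inputs = tokenizer("The quick brown fox", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))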
@TheNaive 7 months ago
Could you show how to add any Hugging Face model to vLLM? Also, the above Colab isn't working.
@chenqu773 1 year ago
I am wondering if it works with HuggingFace 8-bit and 4-bit quantization.
@samwitteveenai 1 year ago
If you are talking about bitsandbytes, I don't think it does just yet.
@navneetkrc 11 months ago
So can I use this with models downloaded from HuggingFace directly? Context: in my office setup I can only use model weights downloaded separately.
@samwitteveenai 11 months ago
Yes, totally; the Colab I show was downloading a model from HuggingFace. Not all of the LLMs are compatible, but most of the popular ones are.
@navneetkrc 11 months ago
@@samwitteveenai In my office setup these models cannot be downloaded (blocked), so I download them separately and use their weights via HuggingFace pipelines as the LLM for LangChain and other use cases. I will try a similar approach for vLLM, hoping it works.
@samwitteveenai 11 months ago
@@navneetkrc Yes, totally; you will just need to load it locally.
@navneetkrc 11 months ago
@@samwitteveenai thanks a lot for the quick replies. You are the best 🤗
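
Following the exchange above, a rough sketch of pointing vLLM at weights that were downloaded separately; the local path is hypothetical and just needs to contain the usual HF files (config, tokenizer files, and weight shards).

from vllm import LLM, SamplingParams

# A local directory instead of a Hub id; the path is only an example.
llm = LLM(model="/data/models/llama-2-7b-hf")
params = SamplingParams(max_tokens=64)
print(llm.generate(["Hello, my name is"], params)[0].outputs[0].text)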
@andrewdang3401 11 months ago
Is this possible with LangChain and a GUI?
@ColinKealty 1 year ago
Is this usable as a model in LangChain for tool use?
@samwitteveenai 1 year ago
You can use it as an LLM in LangChain. Whether it will work with tools will depend on which model you serve.
@ColinKealty 1 year ago
@@samwitteveenai I assume it doesn't support quants? I don't see any mention.
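
One way to wire this into LangChain, per the thread above, is to point LangChain's OpenAI wrapper at a running vLLM OpenAI-compatible server (the one from the earlier sketch). This is a sketch only; parameter names follow the LangChain versions current in mid-2023 and may differ in newer releases.

from langchain.llms import OpenAI

llm = OpenAI(
    model_name="meta-llama/Llama-2-7b-hf",      # whatever model the server is serving
    openai_api_base="http://localhost:8000/v1",  # the local vLLM server
    openai_api_key="EMPTY",
    temperature=0.0,
)
print(llm("Name three practical uses for vLLM."))

Whether agent or tool use works on top of this still depends on the instruction quality of the model being served, as noted above.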
@napent 1 year ago
What about data privacy?
@samwitteveenai 1 year ago
You are running it on a machine you control. What are the privacy issues?
@napent 1 year ago
@@samwitteveenai I thought it was cloud-based 🎩
@keemixvico975 11 months ago
It doesn't work... damn it. I don't want to use Docker to make this work, so I'm stuck.
@samwitteveenai 11 months ago
What model are you trying to get to work? It also doesn't support quantized models, if you are trying for that.
@saraili3971 11 months ago
@@samwitteveenai Hi Sam, thanks for sharing (a life-saver for newbies). What would you recommend for quantized models?
@seinaimut 11 months ago
Can it be used with GGML models?
@samwitteveenai 11 months ago
No, so far this is for full-resolution models only.
@sherryhp10 1 year ago
still very slow
@eyemazed 8 months ago
It doesn't work on Windows, folks. Trash.
@eljefea2802 8 months ago
They have a Docker image. That's what I'm using right now.