
vLLM - Turbo Charge your LLM Inference 

15K views

vLLM - Turbo Charge your LLM Inference
Blog post: vllm.ai/
Github: github.com/vllm-project/vllm
Docs: vllm.readthedocs.io/en/latest/getting_started/quickstart.html
Colab: drp.li/5ugU2
My Links:
Twitter - Sam_Witteveen
Linkedin - www.linkedin.com/in/samwitteveen/
Github:
github.com/samwit/langchain-tutorials
github.com/samwit/llm-tutorials
Timestamps:
00:00 Intro
01:17 vLLM Blog
04:27 vLLM Github
05:40 Code Time
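
For reference, a minimal sketch of vLLM's offline-inference API along the lines of the quickstart linked above; the model id and prompts are only examples, and a CUDA GPU plus `pip install vllm` are assumed.

from vllm import LLM, SamplingParams

# Example prompts; any list of strings works.
prompts = [
    "The capital of France is",
    "Explain PagedAttention in one sentence:",
]
# Sampling settings roughly in line with the quickstart defaults.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Downloads the weights from the HuggingFace Hub on first use.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)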

Science

Published: 7 Jul 2023

Comments: 57
@rajivmehtapy 1 year ago
As always, you are one of the few people who hit this topic on YouTube.
@user-ew8ld1cy4d 1 year ago
Sam, I love your videos but this one takes the cake. Thank you!!
@sp4yke 1 year ago
Thanks Sam for this video. It would be interesting to dedicate a video to comparing OpenAI API emulation options such as LocalAI, Oobabooga, and vLLM.
@g-program-it 1 year ago
Finally AI models that don't take a year to give a response. Cheers for sharing this Sam.
@clray123 1 year ago
Uhh... you already get instant responses from GGML/llama.cpp (apart from the model weight loading time, but that is not anything PagedAttention improves on). The deal with PagedAttention is that it prevents the KV cache from wasting memory: instead of overallocating the entire context length at once, it allocates in chunks as the sequence keeps growing (and possibly shares chunks among different inference beams or users). This allows the same model to serve more users (throughput) - of course, provided that they generate sequences shorter than the context length. It should not affect the response time for any individual user (if anything, it makes it worse because of the overhead of mapping virtual to physical memory blocks). So if it improves on HF in that respect, it just demonstrates that either HF's KV cache implementation sucks or Sam is comparing non-KV-cached generation with KV-cached generation.
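
For readers who want the idea above in code, here is a toy sketch in plain Python (not vLLM's actual internals): KV-cache memory is handed out in fixed-size blocks as a sequence grows, instead of reserving the full context length up front. All names and sizes are illustrative.

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

class BlockAllocator:
    """Hands out physical block ids from a fixed pool of GPU blocks."""
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free_blocks.pop()  # raises IndexError when the pool is exhausted

class Sequence:
    """Grows its block table on demand instead of pre-allocating the whole context."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Only grab a new block when the current one is full (or on the first token).
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=1024)
seq = Sequence(allocator)
for _ in range(40):
    seq.append_token()
print(len(seq.block_table))  # 3 blocks for 40 tokens, not a full-context reservation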
@wilfredomartel7781 1 year ago
Finally we can achieve fast responses.
@MultiSunix 1 year ago
Talked to its core developers; they don't have plans to support quantized models yet, hence you really need powerful GPU(s) to run it.
@henkhbit5748 11 months ago
Hmm, did not know that Red Bull and Verstappen are in the race for turbocharging LLMs 😉 thanks for demonstrating vLLM in combination with an open-source model 👍
@Rems766 11 months ago
Thanks mate, I am going to try to add that to LangChain so it can integrate seamlessly into my product.
@akiempaul1117 8 months ago
Great Great Great
@mayorc 1 year ago
This looks very useful.
@mayorc 1 year ago
A very good test you could make a video about would be to use the OpenAI-compatible server functionality with a well-performing, optimized local model trained for coding, and test it with great new tools like GPT-Engineer or Aider, to see how it performs compared to GPT-4 in real-world application-writing scenarios.
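
As a starting point for that kind of test, here is a rough sketch of talking to vLLM's OpenAI-compatible server with the pre-1.0 openai Python client. The entrypoint name, flags, and model id are examples and may differ between vLLM versions.

# Start the server first, for example:
#   python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-hf
import openai

openai.api_key = "EMPTY"                      # the local server ignores the key
openai.api_base = "http://localhost:8000/v1"  # vLLM's default port

completion = openai.Completion.create(
    model="meta-llama/Llama-2-7b-hf",         # must match the model the server loaded
    prompt="Write a Python function that reverses a string.",
    max_tokens=128,
    temperature=0.2,
)
print(completion.choices[0].text)

Tools that speak the OpenAI API can then, in principle, be pointed at the same base URL instead of api.openai.com.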
@guanjwcn 1 year ago
This is very interesting. Thanks for sharing this. It would be nicer, I guess, if LangChain could do the same.
@jasonwong8934 1 year ago
I'm surprised the bottleneck was due to memory inefficiency in the attention mechanism and not the volume of matrix multiplications.
@MeanGeneHacks 1 year ago
Thank you for this nugget. Very useful information for speeding up inference. Does it support the bitsandbytes library for loading in 8-bit or 4-bit? Edit: noticed no Falcon support.
@samwitteveenai 11 months ago
AFAIK they are not supporting bitsandbytes etc., which doesn't surprise me, as they are mainly using it for comparing models, and low-resolution quantization is not ideal for that.
@NickAubert 1 year ago
It looks like vLLM itself is CUDA-based, but I wonder if these techniques could apply to CPU-based models like llama.cpp? Presumably any improvements wouldn't be as dramatic if the bottleneck is processing cycles rather than memory.
@harlycorner 11 months ago
Thanks for this video. Although I should mention that, at least on my RTX 3090 Ti, the GPTQ 13B models with the exllama loader are absolutely flying. Faster than ChatGPT-3.5 Turbo. But I'll definitely take a look.
@shishirsinha6344 10 months ago
Where is the model comparison in terms of execution time with respect to HuggingFace?
@MrRadziu86 11 months ago
@Sam Witteveen do you know by any chance how this compares to other recent techniques for speeding up models? I don't remember exactly, but sometimes it is just a setting, a parameter nobody used until somebody shared it, as well as other techniques. Also, if you happen to know, which ones are better suited for Falcon, Llama, etc.?
@samwitteveenai 11 months ago
For many of the options I have looked at, this compares well for the models that it works with.
@MariuszWoloszyn 1 year ago
vLLM is great but lacks support for some models (and some are still buggy, like mpt-30b with streaming, but MPT was added like 2 days ago so expect that to be fixed soon). For example, there's little chance it will support Falcon-40B soon. In that case use huggingface/text-generation-inference, which can load Falcon-40B in 8-bit flawlessly!
@samwitteveenai 1 year ago
Yes, none of these are flawless. I might make a video about hosting with HF Text Generation Inference as well.
@Gerald-xg3rq 3 months ago
Can you run this on AWS SageMaker too? Does it also work with the Llama 2 models with 7 and 13 billion parameters?
@user-hf3fu2xt2j 1 year ago
Now I wonder if it is possible to launch this on CPU. Some models would work tolerably.
@frazuppi4897 1 year ago
Not sure, since they compared with HF Transformers, and HF doesn't use Flash Attention to my knowledge, so it is quite slow by default.
@samwitteveenai 11 months ago
They also compared to TGI, which does have Flash Attention (huggingface.co/text-generation-inference), and vLLM is still quite a bit faster.
@asmac001nolastname6 1 year ago
Can this package be used with quantized 4-bit models? I don't see any support for them in the docs.
@samwitteveenai 1 year ago
No, I don't think it will work with that.
@clray123 1 year ago
It should be noted that for whatever reason it does not work with CUDA 12.x (yet).
@samwitteveenai 1 year ago
My guess is just that their setup is not using that yet, and it will come. I actually just checked my Colab and that seems to be running CUDA 12.0, but maybe that is not optimal.
@io9021 1 year ago
I'm wondering how vLLM compares against conversion to ONNX (e.g. with Optimum) in terms of speed and ease of use. I'm struggling a bit with ONNX 😅
@s0ckpupp3t 11 months ago
Does ONNX have streaming ability? I can't see any mention of WebSocket or HTTP/2.
@io9021 11 months ago
@@s0ckpupp3t Not that I know of. I converted bloom-560 to ONNX and got similar latency to vLLM. I guess with ONNX one could optimise it a bit further, but I'm impressed by vLLM because it's much easier to use.
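
For anyone trying the Optimum route mentioned above, here is a minimal sketch of the ONNX export path, assuming pip install optimum[onnxruntime] transformers; the model id is only an example.

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

model_id = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# export=True converts the PyTorch checkpoint to ONNX on the fly.
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)

inputs = tokenizer("The quick brown fox", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))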
@TheNaive 7 months ago
Could you show how to add any Hugging Face model to vLLM? Also, the above Colab isn't working.
@chenqu773 1 year ago
I am wondering if it works with HuggingFace 8-bit and 4-bit quantization.
@samwitteveenai 1 year ago
If you are talking about bitsandbytes, I don't think it does just yet.
@navneetkrc 11 months ago
So can I use this with models downloaded from HuggingFace directly? Context: in my office setup I can only use model weights downloaded separately.
@samwitteveenai 11 months ago
Yes, totally; the Colab I show was downloading a model from HuggingFace. Not all of the LLMs are compatible, but most of the popular ones are.
@navneetkrc 11 months ago
@@samwitteveenai In my office setup these models cannot be downloaded (blocked), so I download them separately and use their weights via HuggingFace pipelines as the LLM for LangChain and other use cases. I will try a similar approach for vLLM, hoping it works.
@samwitteveenai 11 months ago
@@navneetkrc Yes, totally; you will just need to load it locally.
@navneetkrc 11 months ago
@@samwitteveenai thanks a lot for the quick replies. You are the best 🤗
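
Following the exchange above, a rough sketch of pointing vLLM at weights that were downloaded separately; the local path is hypothetical and just needs to contain the usual HF files (config, tokenizer files, and weight shards).

from vllm import LLM, SamplingParams

# A local directory instead of a Hub id; the path is only an example.
llm = LLM(model="/data/models/llama-2-7b-hf")
params = SamplingParams(max_tokens=64)
print(llm.generate(["Hello, my name is"], params)[0].outputs[0].text)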
@andrewdang3401 11 months ago
Is this possible with LangChain and a GUI?
@ColinKealty 1 year ago
Is this usable as a model in LangChain for tool use?
@samwitteveenai 1 year ago
You can use it as an LLM in LangChain. Whether it will work with tools will depend on which model you serve.
@ColinKealty 1 year ago
@@samwitteveenai I assume it doesn't support quants? I don't see any mention.
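
One way to wire this into LangChain, per the thread above, is to point LangChain's OpenAI wrapper at a running vLLM OpenAI-compatible server (the one from the earlier sketch). This is a sketch only; parameter names follow the LangChain versions current in mid-2023 and may differ in newer releases.

from langchain.llms import OpenAI

llm = OpenAI(
    model_name="meta-llama/Llama-2-7b-hf",      # whatever model the server is serving
    openai_api_base="http://localhost:8000/v1",  # the local vLLM server
    openai_api_key="EMPTY",
    temperature=0.0,
)
print(llm("Name three practical uses for vLLM."))

Whether agent or tool use works on top of this still depends on the instruction quality of the model being served, as noted above.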
@napent 1 year ago
What about data privacy?
@samwitteveenai 1 year ago
You are running it on a machine you control. What are the privacy issues?
@napent 1 year ago
@@samwitteveenai I thought it was cloud-based 🎩
@keemixvico975 11 months ago
It doesn't work... damn it. I don't want to use Docker to make this work, so I'm stuck.
@samwitteveenai 11 months ago
What model are you trying to get to work? It also doesn't support quantized models, if you are trying for that.
@saraili3971 11 months ago
@@samwitteveenai Hi Sam, thanks for sharing (a life-saver for newbies). What would you recommend for quantized models?
@seinaimut 11 months ago
Can it be used with GGML models?
@samwitteveenai 11 months ago
No, so far this is for full-resolution models only.
@sherryhp10 1 year ago
still very slow
@eyemazed 8 months ago
It doesn't work on Windows, folks. Trash.
@eljefea2802 8 months ago
They have a Docker image. That's what I'm using right now.