
Quantize LLMs with AWQ: Faster and Smaller Llama 3 

AI Anytime
30K subscribers
4.6K views

Explore how to make LLMs faster and more compact with my latest tutorial on Activation Aware Quantization (AWQ)! In this video, I demonstrate how to apply AWQ to quantize Llama 3, achieving a model that's not only quicker but also smaller than its non-quantized counterpart. Dive into the details of the process and see the benefits in real-time. If you found this video helpful, don't forget to like, comment, and subscribe for more insightful content like this!
Join this channel to get access to perks:
/ @aianytime
To further support the channel, you can contribute via the following methods:
Bitcoin Address: 32zhmo5T9jvu8gJDGW3LTuKBM1KPMHoCsW
UPI: sonu1000raw@ybl
GitHub: github.com/AIA...
Activation Aware Quantization Research paper: arxiv.org/pdf/...
Quantized Model on HF here: huggingface.co...
#llama3 #genai #ai
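
For readers following along, here is a minimal sketch of the AWQ quantization workflow the video walks through, based on the AutoAWQ library's documented API. The model ID, output path, and quantization settings below are illustrative assumptions, not necessarily the exact values used in the video.

```python
# pip install autoawq transformers
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder: any Llama 3 checkpoint
quant_path = "llama-3-8b-instruct-awq"              # where the quantized model is saved

# 4-bit AWQ with group size 128 is a common default configuration
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the full-precision model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Run activation-aware quantization (uses a small calibration set internally)
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized weights and tokenizer
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```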

Published: 29 Sep 2024

Comments: 22
@mehmetbakideniz 5 months ago
Fantastic video! I will watch the other videos. Definitely a very talented tutor here!
@lazypunk794 5 months ago
AWQ has lower throughput than the unquantized model when serving with vLLM. Do you know if there are quantization methods that can also increase throughput?
@nashtashasaint-pier7404 5 months ago
+1
@ShaunPrince 5 months ago
This is only true in suboptimal setups, e.g. where you don't have FlashAttention compiled or you're using an old GPU like the Colab T4. Try to avoid the pre-made Docker images and make sure all your hardware is enabled to its best ability. Always use the latest Python 3.11.x and the latest CUDA developer toolkit 12.x; don't use the generic CUDA GPU drivers, use drivers you build yourself or that are meant for your operating system. Then the argument that unquantized vLLM is faster no longer holds. Not many people want to take the time to learn about and properly prepare their inference systems. AWQ is meant to save memory; EXL2 is more about tailoring a model to your available VRAM with its variable bpw and hb settings.
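
For context on the vLLM discussion above, loading an AWQ checkpoint for serving looks roughly like the sketch below; the model path is a placeholder, and actual throughput still depends on the GPU, batch size, and kernel support.

```python
from vllm import LLM, SamplingParams

# Placeholder path: point this at the AWQ-quantized checkpoint
llm = LLM(model="llama-3-8b-instruct-awq", quantization="awq", dtype="half")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain activation-aware quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```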
@nguyentrong0603 1 month ago
Nice video!
@Sowmya_D 3 months ago
I'm getting this error while loading your model: RuntimeError: Failed to import transformers.integrations.bitsandbytes because of the following error (look up to see its traceback): name 'torch' is not defined
@ShaileshSarda-m6z 3 months ago
Amazing video. Just one suggestion: instead of jumping straight into the code, please try to explain the paper and how it works first, so that people can connect the theory with the practice. Really great video content. Thanks.
@ragibhasan2.0 5 months ago
Is the "Fine Tuning of LLMs" playlist enough for fine-tuning any Llama model?
@AIAnytime 5 months ago
Yes!
@ragibhasan2.0 5 months ago
@AIAnytime Thanks for creating this type of playlist ☺
@criticalnodecapital 4 months ago
Can we collab on a project? Also CUDA vs Triton, and inference evaluations. How do you turn research into code? Can we work together?
@IdealVijay- 4 months ago
Does quantizing a model make it less accurate? How many parameters will the quantized model have? If it is 13B, then how does quantizing the model make it faster?
@maitreyazalte6971 5 months ago
Doubt: in this case we are downloading the entire model first and then quantizing it. Is there any way to quantize a model on the fly during loading? Since I'm GPU poor, I might not be able to download the entire model, and hence can't quantize. Please suggest something...
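
One possible answer to the question above, offered as an assumption rather than something covered in the video: bitsandbytes can quantize weights on the fly while loading into GPU memory (the full-precision checkpoint still has to be downloaded, but it never needs to fit in VRAM unquantized). Note that this is a different method from AWQ.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NOTE: this uses bitsandbytes NF4, not AWQ; weights are quantized on the fly at load time
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
```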
@IdPreferNot1 5 months ago
1.58-bit seems so promising, but I understand it has to be part of the original training; you can't quantize post-training. Have you heard of anyone actually training models with this?
@Suparious 5 months ago
Great video! Thank you for sharing.
@BeegBrain-zy7qc 5 months ago
New sub from the USA. May I suggest in-depth guides on ontologies, knowledge graphs, and query analysis? Many thanks for the great info.
@sneharoy3566 5 months ago
Noice
@thisurawz 5 months ago
How do you quantize multimodal LLMs?
@joserfjunior8940 4 months ago
Cool!
@cristianaguilar4253 5 months ago
Thanks