Very exciting work! The speed the paper reports won't break any land speed records (2-3 tokens per second), but in my experience one of the most productive and practical applications of LLMs is prompting them with multiple-choice questions, which require only a single output token. This paper (and the provided code!) enabling GPT-3.5-level inference to run locally on consumer hardware is a huge breakthrough, and I'm excited to give it a try!
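For anyone curious, the single-token trick is roughly: after an MCQ prompt, you only need the model's scores for the answer-letter tokens and can take the argmax instead of decoding free-form text. A minimal sketch (the `next_token_logits` dict here is a hypothetical stand-in for one forward pass of a local model, not an API from the paper's code):

```python
def pick_choice(next_token_logits, choices=("A", "B", "C", "D")):
    """Pick the answer letter whose token got the highest score.

    next_token_logits: mapping from token string to logit, standing in
    for the model's next-token distribution after an MCQ prompt.
    Choices missing from the mapping score -inf so they can't win.
    """
    return max(choices, key=lambda c: next_token_logits.get(c, float("-inf")))


# Mock scores a model might assign after a multiple-choice prompt.
logits = {"A": 1.2, "B": 3.7, "C": -0.5, "D": 0.9}
print(pick_choice(logits))  # → B
```

Since only one decoding step is needed, even 2-3 tokens/second is plenty for this use case.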
I believe this is applicable only to a single request? Since the active experts change between tokens, serving multiple requests would most likely keep many experts active at once. Is my understanding correct? Thank you.