
Optimizing FastAPI for Concurrent Users when Running Hugging Face ML Models 

Andrej Baranovskij
4K views

To serve multiple concurrent users on a FastAPI endpoint that runs Hugging Face inference, you must start the FastAPI app with several workers. This ensures that incoming requests are not blocked while another request is still being processed. I show and explain this in the video.
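For illustration, here is a minimal sketch of such a setup; the module name, endpoint, and model below are assumptions, not the exact code from the video:

# main.py -- hypothetical example
from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()

# Each worker process loads its own copy of the model at startup.
classifier = pipeline("sentiment-analysis")

@app.post("/predict")
def predict(text: str):
    # CPU-bound Hugging Face inference: with a single worker this call
    # would block other requests, so start the app with several workers:
    #   uvicorn main:app --workers 4
    return classifier(text)[0]

With --workers 4, uvicorn starts four independent processes, so a long-running inference in one worker leaves the other three free to answer new requests.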
Sparrow - data extraction from documents with ML:
github.com/katanaml/sparrow
0:00 Introduction
0:30 Concurrency
2:50 Problem Example
4:10 Code and Solution
6:10 Summary
CONNECT:
- Subscribe to this YouTube channel
- Twitter: / andrejusb
- LinkedIn: / andrej-baranovskij
- Medium: / andrejusb
#python #fastapi #machinelearning

Published: 27 Jul 2024

Comments: 9
@shaheerzaman620 (1 year ago)
these fastapi + ml videos are great!
@marka9424 (9 months ago)
Great video - how do you scale this to handle 500 requests per second with only 4 workers?
@AndrejBaranovskij (9 months ago)
Depends on request processing time, complexity, and hardware.
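(For scale: 500 requests/second across 4 workers is 125 requests/second per worker, i.e. roughly 8 ms per request on average; typical ML inference is far slower, so hitting that target would take batching, more hardware, or many more workers.)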
@juvewan (4 months ago)
FastAPI is multi-threaded by default: plain endpoints run in a threadpool. If you change your endpoints from "async def" to just normal "def", then while you are running inference (the Hugging Face API call), the get-stats endpoint should return instantly.
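A hypothetical sketch of the behavior this comment describes (the endpoints are illustrative):

from fastapi import FastAPI
import time

app = FastAPI()

@app.get("/slow")
def slow():
    # Plain "def": FastAPI runs this in a worker thread, so the event
    # loop stays free while it executes.
    time.sleep(10)  # stand-in for a long inference call
    return {"done": True}

@app.get("/stats")
async def stats():
    # "async def": runs on the event loop and returns immediately,
    # even while /slow is still busy in the threadpool.
    return {"status": "ok"}

Note that time.sleep releases the GIL, so this demo behaves as the comment predicts; CPU-bound model code that holds the GIL can still stall other threads, which matches the reply below.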
@AndrejBaranovskij (4 months ago)
Based on my tests, this was not happening. Hugging Face code blocks all threads.
@johnvick8861 (1 month ago)
Use asyncio for concurrency.
@AndrejBaranovskij (1 month ago)
@johnvick8861 It wasn't working with the Hugging Face libs. I need to test again.
@hodiks (7 months ago)
Hello, what about running another Python subprocess that extracts the data, and waiting for its response? That shouldn't block the current thread. Or is it a bad idea?
@AndrejBaranovskij (7 months ago)
Hey, using a Python subprocess can help avoid blocking, but it introduces extra complexity and resource usage. Leveraging FastAPI's asynchronous features or scaling with more workers is a simpler and more efficient solution. Thanks for your suggestion!
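For comparison, a hypothetical sketch of offloading work to a separate process from inside FastAPI (the function and endpoint names are illustrative):

import asyncio
from concurrent.futures import ProcessPoolExecutor
from fastapi import FastAPI

app = FastAPI()

# A process pool sidesteps the GIL for CPU-bound inference.
executor = ProcessPoolExecutor(max_workers=2)

def run_inference(text: str) -> str:
    # Placeholder for a real Hugging Face call; in practice the model
    # should be loaded once per pool process, not once per request.
    return text.upper()

@app.post("/extract")
async def extract(text: str):
    loop = asyncio.get_running_loop()
    # The event loop awaits the result without blocking other requests.
    result = await loop.run_in_executor(executor, run_inference, text)
    return {"result": result}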