How to Optimize FastAPI for ML Model Serving

If you do I/O alongside ML model serving, this can make your FastAPI service significantly faster.

Luis Sena
Sep 14, 2023

Nowadays, if you’re serving machine learning models, there’s a high chance you’ll be using FastAPI.

It has matured into a really robust framework and most people have migrated from the old and trusty Flask+gunicorn combo.

One of the main gotchas when moving to this framework is that you’re also moving from sync to async. This means your concurrency is now based on a cooperative model built on coroutines, instead of threads, which use a preemptive model.

As a refresher:

  • async uses a cooperative model: coroutines run on an event loop, and each coroutine runs uninterrupted until it releases control, which usually happens when it awaits I/O or finishes its task (the sketch below shows this in action).
  • sync uses a preemptive model with threads, which can be paused even if they haven’t released control. In CPython, this happens by default after a thread has been running for about 5ms (the interpreter’s switch interval).
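
Here’s that sketch: a minimal, standalone snippet (not part of the serving code) showing the difference between a coroutine that yields control and one that blocks the event loop:

import asyncio
import time


async def polite(name: str):
    # cooperative: await hands control back to the event loop
    await asyncio.sleep(0.1)
    print(f"{name} done")


async def hog(name: str):
    # blocking call: nothing else on the loop can run during this 0.1s
    time.sleep(0.1)
    print(f"{name} done")


async def main():
    # two polite coroutines overlap (~0.1s total)
    ts = time.time()
    await asyncio.gather(polite("a"), polite("b"))
    print(f"cooperative: {time.time() - ts:.2f}s")

    # two hogs run back to back (~0.2s total)
    ts = time.time()
    await asyncio.gather(hog("c"), hog("d"))
    print(f"blocking: {time.time() - ts:.2f}s")

asyncio.run(main())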

If you want to read more about this, check my article about sync and async workers.

The critical difference here is that you’re now running your workloads on top of an event loop, and if you’re not careful, you might degrade your API by letting a task block that event loop for too long.

Take this simple example:

import asyncio
import time

from fastapi import FastAPI, Request
from sentence_transformers import SentenceTransformer

app = FastAPI()

sbertmodel = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')


def model_predict():
    return sbertmodel.encode('How big is London')


async def vector_search(vector):
    # simulate I/O call (e.g. Vector Similarity Search using a VectorDB)
    await asyncio.sleep(0.005)


@app.get("/")
async def entrypoint(request: Request):
    ts = time.time()
    vector = model_predict()
    print(f"Model : {int((time.time() - ts) * 1000)}ms")
    ts = time.time()
    await vector_search(vector)
    print(f"io task: {int((time.time() - ts) * 1000)}ms")
    return "ok"

In the above example, we call a BERT model to generate an embedding/vector. After that, we do an I/O call (in this case it is a simple 5ms sleep, but in the real world this could be a DB query or an API call).

If you just do a single call, everything will look normal:

fastapi_pytorch-app-1  | Model  : 6ms
fastapi_pytorch-app-1 | io task: 5ms

The model takes 6ms and the I/O call is 5ms as it should be.

But what happens if we add concurrency (i.e. multiple requests being made at the same time)?

fastapi_pytorch-app-1  | Model  : 6ms
fastapi_pytorch-app-1 | Model : 5ms
fastapi_pytorch-app-1 | Model : 5ms
fastapi_pytorch-app-1 | Model : 5ms
fastapi_pytorch-app-1 | Model : 5ms
fastapi_pytorch-app-1 | Model : 5ms
fastapi_pytorch-app-1 | Model : 5ms
fastapi_pytorch-app-1 | io task: 47ms <----
fastapi_pytorch-app-1 | io task: 41ms <----
fastapi_pytorch-app-1 | io task: 35ms <----
fastapi_pytorch-app-1 | io task: 29ms <----
fastapi_pytorch-app-1 | io task: 23ms <----
fastapi_pytorch-app-1 | io task: 17ms <----
fastapi_pytorch-app-1 | io task: 11ms
fastapi_pytorch-app-1 | io task: 6ms
fastapi_pytorch-app-1 | Model : 11ms
fastapi_pytorch-app-1 | io task: 12ms

In this scenario (similar to an actual production workload), we see a completely different picture.

The model runs as expected, but suddenly our I/O task latency increases dramatically (up to 10x). This happens because the model is hogging the event loop and won’t let anything else run until it finishes, leaving a long queue of tasks waiting for execution.
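
If you want to reproduce this, any HTTP load tool will do; here’s a minimal sketch of a concurrent client using httpx (an assumption on my part, the exact tool doesn’t matter), firing ten requests at the service at once:

import asyncio

import httpx


async def main():
    # assumes the FastAPI app is running locally on port 8000
    async with httpx.AsyncClient(base_url="http://localhost:8000") as client:
        await asyncio.gather(*(client.get("/") for _ in range(10)))

asyncio.run(main())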

To solve this, we can run the model in an executor that won’t block our event loop:

import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

from fastapi import FastAPI, Request
from sentence_transformers import SentenceTransformer

# if you try to run all predictions concurrently, it will result in CPU thrashing.
pool = ThreadPoolExecutor(max_workers=1)

app = FastAPI()

sbertmodel = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')


def model_predict():
    ts = time.time()
    vector = sbertmodel.encode('How big is London')
    print(f"Inner model : {int((time.time() - ts) * 1000)}ms")
    return vector


async def vector_search(vector):
    # simulate I/O call (e.g. Vector Similarity Search using a VectorDB)
    await asyncio.sleep(0.005)


@app.get("/")
async def entrypoint(request: Request):
    loop = asyncio.get_event_loop()
    ts = time.time()
    vector = await loop.run_in_executor(pool, model_predict)
    # vector = model_predict()
    print(f"Model : {int((time.time() - ts) * 1000)}ms")
    ts = time.time()
    await vector_search(vector)
    print(f"io task: {int((time.time() - ts) * 1000)}ms")
    return "ok"

Two lines deserve special attention:

pool = ThreadPoolExecutor(max_workers=1)
vector = await loop.run_in_executor(pool, model_predict)

It’s important to create your own executor with a single worker; otherwise you’ll use the loop’s default executor, which has multiple workers.
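
For reference, this is what falling back to the default executor would look like (don’t do this here); passing None makes the event loop use its default ThreadPoolExecutor, which on recent Python versions is sized to min(32, os.cpu_count() + 4) threads:

# multiple worker threads -> multiple predictions can run at the same time
vector = await loop.run_in_executor(None, model_predict)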

Since most ML model libs already use threads under the hood, you’ll end up with terrible latencies due to multiple predictions happening at the same time, resulting in CPU thrashing.

The result is:

fastapi_pytorch-app-1  | Model  : 26ms
fastapi_pytorch-app-1 | io task: 5ms
fastapi_pytorch-app-1 | Model : 55ms
fastapi_pytorch-app-1 | io task: 5ms
fastapi_pytorch-app-1 | Model : 34ms
fastapi_pytorch-app-1 | io task: 5ms
fastapi_pytorch-app-1 | Model : 21ms
fastapi_pytorch-app-1 | io task: 5ms
fastapi_pytorch-app-1 | Model : 20ms
fastapi_pytorch-app-1 | io task: 5ms

The good news is that we’ve improved our I/O latency a lot. We no longer take more than the expected 5ms for each I/O call.

The bad news is that our model prediction time has degraded by more than 4x. This is unacceptable!

This is a behaviour you’ll usually see when mixing I/O and CPU work with threads in Python: the event loop’s thread and the prediction thread compete for the interpreter (the GIL) and for CPU time, so both get slower.
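
Here’s a tiny, self-contained illustration of the effect (numbers are indicative and will vary per machine): two threads doing pure-Python CPU work take roughly as long as running the same work sequentially, because the GIL only lets one thread execute Python bytecode at a time.

import time
from concurrent.futures import ThreadPoolExecutor


def cpu_work(n: int) -> int:
    # pure-Python CPU-bound loop; it holds the GIL while it runs
    total = 0
    for i in range(n):
        total += i
    return total

# sequential: run the work twice, one after the other
ts = time.time()
cpu_work(2_000_000)
cpu_work(2_000_000)
print(f"sequential:  {time.time() - ts:.2f}s")

# "parallel": two threads still share one GIL, so this takes about as long
ts = time.time()
with ThreadPoolExecutor(max_workers=2) as ex:
    list(ex.map(cpu_work, [2_000_000, 2_000_000]))
print(f"two threads: {time.time() - ts:.2f}s")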

To solve this, we need to use a process executor instead of a thread executor:

import asyncio
import time
from concurrent.futures import ProcessPoolExecutor

from fastapi import FastAPI, Request
from sentence_transformers import SentenceTransformer

app = FastAPI()
sbertmodel = None


def create_model():
    global sbertmodel
    sbertmodel = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')


# if you try to run all predictions concurrently, it will result in CPU thrashing.
pool = ProcessPoolExecutor(max_workers=1, initializer=create_model)


def model_predict():
    return sbertmodel.encode('How big is London')


async def vector_search(vector):
    # simulate I/O call (e.g. Vector Similarity Search using a VectorDB)
    await asyncio.sleep(0.005)


@app.get("/")
async def entrypoint(request: Request):
    loop = asyncio.get_event_loop()
    ts = time.time()
    # worker should be initialized outside endpoint to avoid cold start
    vector = await loop.run_in_executor(pool, model_predict)
    print(f"Model : {int((time.time() - ts) * 1000)}ms")
    ts = time.time()
    await vector_search(vector)
    print(f"io task: {int((time.time() - ts) * 1000)}ms")
    return "ok"

One caveat: we need to load the model inside the worker process to avoid IPC and serialization issues. The initializer param handles that, as shown in the code snippet above.

The result:

fastapi_pytorch-app-1  | Model  : 7ms
fastapi_pytorch-app-1 | io task: 5ms
fastapi_pytorch-app-1 | Model : 8ms
fastapi_pytorch-app-1 | io task: 5ms
fastapi_pytorch-app-1 | Model : 9ms
fastapi_pytorch-app-1 | io task: 5ms
fastapi_pytorch-app-1 | Model : 8ms
fastapi_pytorch-app-1 | io task: 4ms
fastapi_pytorch-app-1 | Model : 7ms
fastapi_pytorch-app-1 | io task: 5ms
fastapi_pytorch-app-1 | Model : 8ms
fastapi_pytorch-app-1 | io task: 5ms
fastapi_pytorch-app-1 | Model : 9ms
fastapi_pytorch-app-1 | io task: 5ms
fastapi_pytorch-app-1 | Model : 9ms
fastapi_pytorch-app-1 | io task: 5ms
fastapi_pytorch-app-1 | Model : 7ms
fastapi_pytorch-app-1 | io task: 6ms

This is much better! We still see some extra latency (1–3ms), but that can be considered negligible and expected given the level of concurrency we’re handling.
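
One optional tidy-up, sketched on top of the final snippet above (assuming a FastAPI version that supports the lifespan parameter): shut the pool down when the app stops so the worker process exits cleanly.

from contextlib import asynccontextmanager


@asynccontextmanager
async def lifespan(app: FastAPI):
    yield
    # terminate the worker process cleanly on app shutdown
    pool.shutdown(wait=True)

# replaces the plain `app = FastAPI()` line from the snippet above
app = FastAPI(lifespan=lifespan)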

Conclusion

When running I/O tasks alongside CPU-bound work like model prediction, offload the CPU-bound work to a ProcessPoolExecutor so it doesn’t block the event loop and degrade your I/O latency (and, unlike a thread pool, it doesn’t slow the predictions down either).

Want to learn more about Python? Check these out!
