Using Redis to Build a Realtime “NIKE Sneakers Drop App” Backend

How to build a backend that can handle millions of concurrent users efficiently and consistently.

Context

Brands like Nike, Adidas, or Supreme created a new market trend called “drops”, where they release a finite number of items. A drop is usually a limited run or a limited pre-release offer ahead of the general release.

This poses some special challenges: every sale is basically a “Black Friday”, with thousands (or millions) of users trying to buy a very limited number of items at the exact same instant.

Main Requirements

  • All clients can see item changes in realtime (stock, description, etc.);
  • Handle sudden increases in load;
  • Minimize client interactions with the backend;
  • Deal with concurrency while avoiding race conditions;
  • Handle thousands or millions of concurrent requests to the same item;
  • Fault-tolerant;
  • Auto Scaling;
  • High Availability;
  • Cost-efficient.

Infrastructure Architecture

This is the high-level infrastructure architecture, using mainly AWS services. You can adapt it to your preferred cloud provider.

The main component that can be troublesome is the Load Balancer. We’ll see why later on.

Software Architecture

Backoffice

It all starts with our back office. We need to save this drop information and available items in the proper data structure.

We will use two systems: PostgreSQL and Redis.

PostgreSQL, MySQL, Cassandra, or any other durable Database Engine will work. We need a durable and consistent source of truth to store our data.

Redis will be our workhorse and take the bulk of our traffic.
The chosen data structures will be a Redis HASH for each item and a LIST for each checkout queue.

By using a HASH instead of a plain STRING value, we can update a single attribute of an item, such as its stock, without rewriting the whole item. Integer fields can even be incremented or decremented atomically (HINCRBY, with a negative amount to decrement) without reading the attribute first.

This will be very handy since it will allow non-blocking concurrent access to our item information.
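As a minimal sketch of this idea, assuming the redis-py client and a hypothetical `item:{id}` key layout (neither is specified in this post), storing an item as a HASH and changing its stock could look like this:

```python
def item_key(item_id):
    # Hypothetical key layout; the post does not prescribe one.
    return f"item:{item_id}"


def save_item(r, item_id, name, description, stock):
    """Store every attribute of an item as a field of one Redis HASH.

    `r` is assumed to be a redis-py client, e.g. redis.Redis().
    """
    r.hset(item_key(item_id), mapping={
        "name": name,
        "description": description,
        "stock": stock,
    })


def change_stock(r, item_id, delta):
    """Atomically add `delta` (may be negative) to the stock field.

    HINCRBY runs on the Redis server, so the caller never has to read
    the current value first. Returns the new stock value.
    """
    return r.hincrby(item_key(item_id), "stock", delta)
```

Calling `change_stock(r, 7, -1)` would take one unit off item 7 and return the remaining stock, while the other fields of the HASH stay untouched.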

Every time a back-office user makes a change, Redis needs to be updated as well, so that it stays an exact replica of the DROP data in our PostgreSQL database.
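One possible way to keep both stores aligned is to write to the durable database first and mirror the change to Redis right after. The sketch below assumes a redis-py client and a hypothetical `save_to_db` callable that commits to PostgreSQL and raises on failure. Both names are illustrative and not part of the original design.

```python
def update_item(save_to_db, r, item_id, changes):
    """Persist `changes` (a dict of field -> value), then mirror them to Redis.

    `save_to_db` is a hypothetical callable that commits the change to the
    durable database and raises on failure. `r` is assumed to be a redis-py
    client. The database is written first, so if it fails the exception
    propagates and Redis is never updated with data the source of truth lacks.
    """
    save_to_db(item_id, changes)
    r.hset(f"item:{item_id}", mapping=changes)
```

If the Redis write fails after the database commit, Redis is briefly stale, so a real back office would likely retry or invalidate the key.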

Redis is not just a simple in-memory key-value database. It has specialized data structures like HASH and LIST that can increase system performance by orders of magnitude when they match the access pattern.

Initial Connection

When the user launches the app, the first thing it needs to do is request the DROP payload and connect to a WebSocket.

This way, it only needs to request the information once. From then on, it receives updates in realtime whenever something relevant changes, without constantly polling the server.
This is called a push architecture, and it’s particularly useful here since it enables “realtime” while keeping network noise to a minimum.
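To make the push model concrete, here is a rough, library-free sketch of the server-side fan-out. Each `asyncio.Queue` stands in for one connected WebSocket client. The `DropHub` class and the message shape are assumptions made for illustration, and a real server would forward each queued message over an actual WebSocket connection.

```python
import asyncio
import json


class DropHub:
    """Fan-out hub: keeps one queue per connected client (hypothetical)."""

    def __init__(self):
        self._clients = set()

    def connect(self):
        # Called when a client opens its connection; returns its queue.
        queue = asyncio.Queue()
        self._clients.add(queue)
        return queue

    def disconnect(self, queue):
        self._clients.discard(queue)

    def broadcast(self, item_id, changes):
        """Push one update to every connected client.

        Returns the number of clients the message was queued for.
        """
        message = json.dumps({"item_id": item_id, "changes": changes})
        for queue in self._clients:
            queue.put_nowait(message)
        return len(self._clients)
```

Clients fetch the full DROP payload once, then only receive small messages such as `{"item_id": 7, "changes": {"stock": 9}}` when something changes.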

Aside from the checkout during a drop, the endpoint that returns our drop information is the one that will be hit the hardest, so it should be highly efficient. To achieve this, we will serve it exclusively from Redis in order to protect our main database.

But we need to take into account that something might have happened to Redis, so we can’t trust that it will always have our data.
It is good practice to always treat Redis as a cache, even though it behaves like a normal database.

A simple approach is to check whether Redis returned any data. If it didn’t, immediately load the data from PostgreSQL and save it to Redis, and only then return it to the user. This way we keep interactions with the main database to a minimum while making sure we can always serve the correct data.
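This fallback is commonly known as the cache-aside pattern. A minimal sketch, assuming a redis-py client created with `decode_responses=True`, a hypothetical `drop:{id}` key, and a hypothetical `load_from_db` function that returns the drop as a flat dict:

```python
def get_drop(r, drop_id, load_from_db):
    """Return the drop payload, reading Redis first and the database on a miss.

    `r` is assumed to be a redis-py client; `load_from_db` is a hypothetical
    callable that reads the drop from the durable database.
    """
    key = f"drop:{drop_id}"
    data = r.hgetall(key)  # empty dict when the key does not exist
    if not data:
        data = load_from_db(drop_id)
        if data:
            # Repopulate the cache before answering, so later requests hit Redis.
            r.hset(key, mapping=data)
    return data
```

Under heavy load, many requests may miss at the same moment and all hit the database, so a production version would probably add some protection against that.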

Our traffic is usually not evenly distributed. Identify the endpoints that are hit the most and make sure they have the architecture to scale.

Cop Item AKA Buy The Item

Now that the client has all the information it needs, as soon as the drop starts, the user will try to cop an item for themselves.

This part is the most crucial and dangerous part of our whole system.

We need to take into consideration that the market is flooded with bots, built specially to exploit these systems. A whole underground economy thrives on buying and reselling these items.

For this post, we’ll focus on availability and consistency and won’t dive into the multiple options we have to try and secure this system (let me know in the comments if you would like a post describing those!).

As soon as the DROP starts, the checkout endpoint will be flooded with requests. This means each request needs to be handled as efficiently as possible, ideally with an in-memory database and O(1) operations only, while guaranteeing order, since this is a first-come, first-served business model.

This is where Redis shines. It ticks all the boxes:

  • O(1) operations for reading and writing;
  • In-memory database;
  • Single-threaded command execution, which guarantees ordering without extra mutex complexity in our code;
  • Extreme throughput and very low latencies;
  • Specialized data structures like HASH and LIST.

At a very high level, when a user calls the checkout endpoint, we just need to check whether the drop has already started (Redis), whether the item is still available (Redis), and whether the user is already in the queue (Redis). If the request passes all the checks, we save it to the checkout queue (Redis).

This means we can serve a checkout request in a few milliseconds, using almost exclusively O(1) operations on an in-memory database, without being worried about concurrency and consistency.
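The checks above could be sketched as follows, assuming a redis-py client. The key names, the `starts_at` field (a Unix timestamp), and the companion SET used for the "already queued" check are assumptions of this sketch: testing membership directly on a LIST is O(N), whereas SADD is O(1) and reports whether the user was newly added.

```python
import time


def try_checkout(r, drop_id, item_id, user_id, now=None):
    """Run the checkout checks and enqueue the user if all of them pass.

    `r` is assumed to be a redis-py client. Returns a short status string.
    """
    now = time.time() if now is None else now

    # 1. Has the drop started? (HGET is O(1))
    starts_at = float(r.hget(f"drop:{drop_id}", "starts_at") or "inf")
    if now < starts_at:
        return "not_started"

    # 2. Is the item still available? (HGET is O(1))
    if int(r.hget(f"item:{item_id}", "stock") or 0) <= 0:
        return "sold_out"

    # 3. Is the user already queued? SADD returns 0 if already a member.
    if r.sadd(f"queued:{item_id}", user_id) == 0:
        return "already_queued"

    # 4. Append to the per-item checkout queue; RPUSH preserves arrival order.
    r.rpush(f"queue:{item_id}", user_id)
    return "queued"
```

Each command is atomic on its own, but the four steps together are not a single atomic unit. If that matters for your case, wrapping them in a server-side Lua script is one option.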

Try to use only O(1) operations for critical endpoints, even if you need to do extra “write” work on your backoffice system.

Charge Credit Card and Finish the Transaction

Now that we have a single queue per item, and all users who tried or are currently trying to buy the item are stored in order in those queues, we have “all the time in the world” to process them asynchronously.

We can run as many servers as necessary and use an auto-scaling group to scale in and out while processing all the queues, as long as we guarantee that no queue is handled by more than one worker process at a time.
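One common way to enforce the "one worker per queue" rule is a lock key created with `SET ... NX EX`. This is an assumption of the sketch below, not something the design above mandates. It uses a redis-py client, a hypothetical `lock:queue:{id}` key, and a random token so a worker only releases a lock it owns.

```python
import uuid


def acquire_queue_lock(r, item_id, ttl_seconds=30):
    """Try to become the only worker for this item's queue.

    Returns a token on success, or None if another worker holds the lock.
    The TTL makes the lock expire on its own if the worker crashes.
    """
    token = uuid.uuid4().hex
    if r.set(f"lock:queue:{item_id}", token, nx=True, ex=ttl_seconds):
        return token
    return None


def release_queue_lock(r, item_id, token):
    """Release the lock only if we still own it.

    Note: GET followed by DELETE is not atomic; a stricter version would
    do the compare-and-delete in a server-side script.
    """
    key = f"lock:queue:{item_id}"
    current = r.get(key)
    if current in (token, token.encode()):
        r.delete(key)
        return True
    return False
```

A worker that processes a queue for longer than the TTL would also need to refresh the expiry periodically.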

This simple architecture allows us to try to charge the credit cards, and even have some wiggle room for retries and other checks.

If charging the first user in the queue fails due to credit card problems or other issues, we can move on to the next user in the queue.

It also allows unique items and items with stock to be processed in exactly the same way (for a unique item, just set the stock to 1).
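A worker loop for one queue could be sketched like this, assuming a redis-py client, the hypothetical `item:{id}` and `queue:{id}` keys, a hypothetical `charge` callable that returns True when the payment succeeds, and a single worker per queue as required above:

```python
def process_queue(r, item_id, charge):
    """Serve the checkout queue in order until stock or queue runs out.

    `r` is assumed to be a redis-py client. `charge` is a hypothetical
    callable(user_id) -> bool wrapping the payment provider.
    Returns the list of users who got the item.
    """
    stock_key, queue_key = f"item:{item_id}", f"queue:{item_id}"
    winners = []
    while int(r.hget(stock_key, "stock") or 0) > 0:
        user_id = r.lpop(queue_key)  # oldest request first
        if user_id is None:
            break  # queue is empty
        if charge(user_id):
            r.hincrby(stock_key, "stock", -1)
            winners.append(user_id)
        # On a failed charge we simply move on to the next user.
    return winners
```

With the stock set to 1, the same loop handles a unique item: it stops after the first successful charge.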

Technical Challenges

Load Balancer

If you’re using an AWS managed Load Balancer, you’ll probably run into a problem. Before the drop starts, traffic will be very low compared to when it starts. This means your load balancer has only a few computing units (nodes) allocated to it and it will take a few minutes to be able to scale to the needed traffic during the DROP.

Well, we don’t have a few minutes, right? … and we don’t want our Load Balancer to return 502 errors to our users either.

We have at least two options here: warm up your Load Balancer with simulated traffic before the drop starts (using AWS Lambda, for example), or run your own Load Balancer cluster (using HAProxy, for example).

Both are valid, and the choice will depend on your team size and experience with those systems.

A third option is to contact AWS so they can pre-warm the LB, but since this is a manual process, I don’t recommend it.

Redis Scalability

Regarding Redis, it is a good idea to scale out once you have a large number of items and/or users participating.

The best way is to use multiple write nodes (cluster mode) instead of a main/replica architecture. This is mainly to avoid replication lag issues. Remember, we want consistency without too much code complexity.

You can distribute the drop items across those nodes using a modulo or deterministic hash function.
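A minimal sketch of such a mapping, using CRC-32 from the standard library as the deterministic hash (my choice for illustration; any stable hash works). Python's built-in `hash()` is deliberately avoided because it is randomized per process for strings, so two servers could disagree on the node.

```python
import zlib


def node_for_item(item_id, node_count):
    """Map an item id to a node index in the range [0, node_count).

    The same item id always maps to the same node, on every server.
    """
    digest = zlib.crc32(str(item_id).encode("utf-8"))
    return digest % node_count
```

Note that with a plain modulo, changing `node_count` remaps most items, so the number of nodes should be settled before the drop starts.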

Scaling out our Redis using an active-active cluster

It’s very important to query the nodes asynchronously here. That way, our total latency will only be the maximum latency among our Redis nodes instead of the sum of the latencies of all the nodes used.
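The effect can be illustrated without a real Redis cluster. In the sketch below, `fake_node_call` is a stand-in that just sleeps for a given latency; with real nodes it would be an async Redis call. The function names and the latency values are made up for the example.

```python
import asyncio
import time


async def fake_node_call(node, latency_s):
    # Stand-in for an async request to one Redis node.
    await asyncio.sleep(latency_s)
    return node, latency_s


async def query_all(latencies):
    """Query every node concurrently.

    `latencies` maps node name -> simulated latency in seconds.
    Returns (results, elapsed_seconds).
    """
    started = time.perf_counter()
    results = await asyncio.gather(
        *(fake_node_call(node, lat) for node, lat in latencies.items())
    )
    return dict(results), time.perf_counter() - started
```

Running `asyncio.run(query_all({"a": 0.05, "b": 0.10, "c": 0.15}))` should take roughly as long as the slowest node (about 0.15 s) rather than the 0.30 s a sequential loop would need.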

For our use case, using the item id as our partition key works very well since the load will be evenly distributed across our keyspace.

Second Part

How does this all sound? Is there anything you’d like me to expand on? Let me know your thoughts in the comments section below (and hit the clap if this was useful)!

Stay tuned for the next post. Follow so you won’t miss it!

Principal Engineer @ Farfetch https://www.linkedin.com/in/luis-sena/