Multi-Region: the final frontier — How Redis and atomic clocks save the day

Active-Active multi-region is challenging and expensive, but sometimes it’s the only choice.

6 min readMay 4, 2021

In my previous article (you can read it here), I showed the architecture used to handle a large-scale sneakers drop backend.

There was an essential part missing though, especially in our case with the strong requirement of “first come, first served”.

If the machines are in the USA and you’re trying to cop an item in Japan, the chances of winning will be slim to none just because of network latency. By the time your request hits the backend, chances are, you’re already behind someone else in the queue.

Since trusting the client clock is not an option, especially in this business swarming with bots and hackers, we needed a stronger solution for this. Here’s what I came up with…

General Architecture

Main Issues when scaling to multi-region

CAP theorem
Consistency across regions
Managing multiple deployments
No longer able to use a simple Redis LIST to keep consistency
Cost
Multi-Region failover
Complexity

CAP THEOREM

Every time we need to deal with distributed systems, we need to make a choice/compromise as stated by the CAP Theorem.

In our case, we need Availability and Consistency

Our two main requirements are Availability and Consistency. We need to make sure we don’t oversell our items, since most of them have limited stock.
On top of that, people are placing their bets on a specific item — which means they won’t take it nicely if we tell them later the shoes they bought are not available after all.

In practice, this means that if a region can’t contact the main region, it will block the “checkout”.
This also means that if the main region is offline, every region will be unable to work properly (we’ll go back to this point later in this article to see how we can minimize our risk).

CONSISTENCY

We decided we want the A and C from the CAP Theorem but, unfortunately, that doesn’t mean we get them for free just because we dropped the P requirement …

If only we could get away with just using Cassandra or CockroachDB!

Drawing inspiration from Google Spanner and their external consistency concept, we decided to “use atomic clocks and GPS” as our source of truth. Fortunately, AWS released a free service called Time Sync that fits our use case quite nicely!

Using this AWS service, all our machines across the globe are in sync.

Relying on the machine clock removes the need to get a timestamp from an API during the transaction, reducing our latency and removing the need to have a circuit breaker — as you should when dealing with external calls.

When an order arrives, we just need to get the current time and send it to the main Redis instance.
Using an async model is important for handling these requests since that “long-range” request can take a while to complete, potentially creating a bottleneck when you’re trying to serve thousands of concurrent requests.

Sorted Set used to keep track of an item orders

Previously, we kept the “order of our orders” using one Redis LIST per item. Now that we’re using a timestamp to maintain our consistency across the globe, a LIST is no longer the best data structure. …Unless you want to reshuffle it every time you get a packet out of order or doing an O(N) operation when you need to pop the first order.

Sorted Sets (ZSET) to the rescue!
Using the timestamp as a score, Redis will keep everything in proper order for us. Adding new orders to an item is as simples as this:

ZADD orders:<item_id> <timestamp> <user_token>

To go through your orders for each item, you can just:

ZPOPMIN orders:<item_id>

COST

Running a multi-region setup is usually more expensive, and our case is no exception.

Replicating PostgreSQL, Redis, and EC2 across the globe doesn’t come cheap, forcing us to iterate on our solution a bit.

In the end, we need to understand what we need to optimize and where we can compromise. I think this duality rings true for almost everything.

The user flow that needs the lowest latency is loading the app and places an order in the queue. Everything else is relatively lag tolerant.
This means we can focus on that path and have less strict requirements for the other user interactions.

The most important thing we did was dropping our local PostgreSQL instances, adapting the backend to only use Redis for the critical path, and be tolerant to eventual consistency where we could get away with it.
This also helped lower the API computing needs (double win!).

We also made use of AWS spot instances for the API and async workers.

And finally, we optimized our PostgreSQL cluster to the bone to get away with using smaller instances than usual.

All in all, our Redis is being used as:

Cache
Checkout Queue
Async Task Queue
Websockets Backend using Channels

This represents big savings on cost and overall complexity.

Final Iteration for the Multi-Region Architecture

MULTI-REGION FAILOVER

A multi-region failover can take different forms. Unfortunately, not a single one can be called perfect. We will need to choose where we want to compromise:

RTO (Recovery time objective)
RPO (Recovery point objective)
Cost
Complexity

Each use case will be different and the solution is almost always “it depends”.

For this use case, we used 3 mechanisms:

RDS automated backups — multi-az single region point in time recovery
Scripted manual snapshots using AWS lambda — multi-region recovery
Cross-Region Read Replica — multi-region availability and best data durability

To avoid rampant costs for a mostly idle PostgreSQL replica, this instance is a smaller machine and can be scaled up and promoted to “main” in case of emergency. It means we need to be tolerant to a few minutes of offline time, but I feel like it’s a good compromise for this use case.

Promoting a read replica will usually be faster than restoring a snapshot. Plus, you also have fewer chances of losing data this way.

Just make sure that the smaller instance can keep up with the write load. You can keep an eye on that by monitoring its replica lag (or better yet, configure an alert for it!).
A restart will “fix” the lag, but if the difference in size is the origin of the problem, you’re only pushing the issue and you should just up the instance size.

The manual snapshots give you an extra layer of safety and “data versioning” at the cost of S3 storage.

AWS has a really nice article about DR here.

As a final note on this, I recommend ensuring consistency and durability, even at the cost of having some manual interactions for this process.
If you have the engineering time and capacity to safely automate the entire process and run that drill from time to time to make sure all the automation is reliable, go for it. It’s the holy grail advertised by the likes of Netflix.

If not, losing an entire region is a very unlikely scenario and as long as you have alerts, the process of failing over to a new region can be done manually in less than 30 minutes (if the previous steps are in order).

First Part

Using Redis to Build a Realtime “NIKE Sneakers Drop App” Backend

Let’s learn how to build a backend that can handle millions of users efficiently

luis-sena.medium.com

Extra Reads

CAP Theorem:

https://en.wikipedia.org/wiki/CAP_theorem

AWS Time Sync Service:

https://www.abhishek-tiwari.com/rise-of-truetime/

Elasticache Multi Region:

https://aws.amazon.com/about-aws/whats-new/2020/03/amazon-elasticache-for-redis-announces-global-datastore/

https://redis.io/topics/replication

PostgreSQL Replication:

https://wiki.postgresql.org/wiki/Replication,_Clustering,_and_Connection_Pooling

https://wiki.postgresql.org/wiki/Streaming_Replication

Route53 Routing:

https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/routing-policy.html#routing-policy-latency

Netflix Multi-Region Resiliency:

https://netflixtechblog.com/active-active-for-multi-regional-resiliency-c47719f6685b

https://netflixtechblog.com/project-nimble-region-evacuation-reimagined-d0d0568254d4

How does this all sound? Is there anything you’d like me to expand on? Let me know your thoughts in the comments section below (and hit the clap if this was useful)!

Stay tuned for the next post. Follow so you won’t miss it!