Multi-Region: the final frontier — How Redis and atomic clocks save the day

Active-Active multi-region is challenging and expensive, but sometimes it’s the only choice.

In my previous article (you can read it here), I showed the architecture used to handle a large-scale sneakers drop backend.

There was an essential part missing though, especially in our case with the strong requirement of “first come, first served”.

If the machines are in the USA and you’re trying to cop an item in Japan, the chances of winning will be slim to none just because of network latency. By the time your request hits the backend, chances are, you’re already behind someone else in the queue.

Since trusting the client clock is not an option, especially in this business swarming with bots and hackers, we needed a stronger solution for this. Here’s what I came up with…

General Architecture

First Iteration for Multi-Region General Architecture

Main Issues when scaling to multi-region

  • CAP theorem
  • Consistency across regions
  • Managing multiple deployments
  • No longer able to use a simple Redis LIST to keep consistency
  • Cost
  • Multi-Region failover
  • Complexity

CAP THEOREM

Every time we need to deal with distributed systems, we need to make a choice/compromise as stated by the CAP Theorem.

In our case, we need Availability and Consistency

Our two main requirements are Availability and Consistency. We need to make sure we don’t oversell our items, since most of them have limited stock.
On top of that, people are placing their bets on a specific item — which means they won’t take it nicely if we tell them later the shoes they bought are not available after all.

In practice, this means that if a region can’t contact the main region, it will block the “checkout”.
This also means that if the main region is offline, every region will be unable to work properly (we’ll go back to this point later in this article to see how we can minimize our risk).

CONSISTENCY

We decided we want the A and C from the CAP Theorem but, unfortunately, that doesn’t mean we get them for free just because we dropped the P requirement …

If only we could get away with just using Cassandra or CockroachDB!

Drawing inspiration from Google Spanner and their external consistency concept, we decided to “use atomic clocks and GPS” as our source of truth. Fortunately, AWS released a free service called Time Sync that fits our use case quite nicely!

Local Cluster Architecture

Using this AWS service, all our machines across the globe are in sync.

Relying on the machine clock removes the need to get a timestamp from an API during the transaction, reducing our latency and removing the need to have a circuit breaker — as you should when dealing with external calls.

When an order arrives, we just need to get the current time and send it to the main Redis instance.
Using an async model is important for handling these requests since that “long-range” request can take a while to complete, potentially creating a bottleneck when you’re trying to serve thousands of concurrent requests.

Sorted Set used to keep track of an item orders

Previously, we kept the “order of our orders” using one Redis LIST per item. Now that we’re using a timestamp to maintain our consistency across the globe, a LIST is no longer the best data structure. …Unless you want to reshuffle it every time you get a packet out of order or doing an O(N) operation when you need to pop the first order.

Sorted Sets (ZSET) to the rescue!
Using the timestamp as a score, Redis will keep everything in proper order for us. Adding new orders to an item is as simples as this:

ZADD orders:<item_id> <timestamp> <user_token>

To go through your orders for each item, you can just:

ZPOPMIN orders:<item_id>

COST

Running a multi-region setup is usually more expensive, and our case is no exception.

Replicating PostgreSQL, Redis, and EC2 across the globe doesn’t come cheap, forcing us to iterate on our solution a bit.

In the end, we need to understand what we need to optimize and where we can compromise. I think this duality rings true for almost everything.

The user flow that needs the lowest latency is loading the app and places an order in the queue. Everything else is relatively lag tolerant.
This means we can focus on that path and have less strict requirements for the other user interactions.

The most important thing we did was dropping our local PostgreSQL instances, adapting the backend to only use Redis for the critical path, and be tolerant to eventual consistency where we could get away with it.
This also helped lower the API computing needs (double win!).

We also made use of AWS spot instances for the API and async workers.

And finally, we optimized our PostgreSQL cluster to the bone to get away with using smaller instances than usual.

All in all, our Redis is being used as:

  • Cache
  • Checkout Queue
  • Async Task Queue
  • Websockets Backend using Channels

This represents big savings on cost and overall complexity.

Final Iteration for the Multi-Region Architecture

MULTI-REGION FAILOVER

A multi-region failover can take different forms. Unfortunately, not a single one can be called perfect. We will need to choose where we want to compromise:

  • RTO (Recovery time objective)
  • RPO (Recovery point objective)
  • Cost
  • Complexity

Each use case will be different and the solution is almost always “it depends”.

For this use case, we used 3 mechanisms:

  • RDS automated backups — multi-az single region point in time recovery
  • Scripted manual snapshots using AWS lambda — multi-region recovery
  • Cross-Region Read Replica — multi-region availability and best data durability

To avoid rampant costs for a mostly idle PostgreSQL replica, this instance is a smaller machine and can be scaled up and promoted to “main” in case of emergency. It means we need to be tolerant to a few minutes of offline time, but I feel like it’s a good compromise for this use case.

Promoting a read replica will usually be faster than restoring a snapshot. Plus, you also have fewer chances of losing data this way.

Just make sure that the smaller instance can keep up with the write load. You can keep an eye on that by monitoring its replica lag (or better yet, configure an alert for it!).
A restart will “fix” the lag, but if the difference in size is the origin of the problem, you’re only pushing the issue and you should just up the instance size.

The manual snapshots give you an extra layer of safety and “data versioning” at the cost of S3 storage.

AWS has a really nice article about DR here.

As a final note on this, I recommend ensuring consistency and durability, even at the cost of having some manual interactions for this process.
If you have the engineering time and capacity to safely automate the entire process and run that drill from time to time to make sure all the automation is reliable, go for it. It’s the holy grail advertised by the likes of Netflix.

If not, losing an entire region is a very unlikely scenario and as long as you have alerts, the process of failing over to a new region can be done manually in less than 30 minutes (if the previous steps are in order).

How does this all sound? Is there anything you’d like me to expand on? Let me know your thoughts in the comments section below (and hit the clap if this was useful)!

Stay tuned for the next post. Follow so you won’t miss it!

Principal Engineer @ Farfetch https://www.linkedin.com/in/luis-sena/