How do you choose between different analyzers and queries to get the best search performance? By benchmarking, of course!

Photo by Arie Wubben on Unsplash

Deploying a large-scale full-text search engine can be very hard. Elasticsearch makes the job much easier, but it’s not one size fits all. Quite the contrary.

Elasticsearch has many configuration options and features, which also means there are many ways to achieve the same goal, and it’s not always obvious which one is best for the product you’re building.

Let’s start by looking at the main ways to find users by their username or name, and measuring the performance, advantages, and drawbacks of each.
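To make the comparison concrete, here is a minimal sketch of three query bodies one might benchmark for a username/name lookup. The field names (`username`, `name`, `name.autocomplete`) and the idea that `name.autocomplete` is indexed with an edge n-gram analyzer are assumptions for illustration, not details taken from the experiment. The functions only build request bodies as plain dicts; sending them to a cluster and timing the responses is left out.

```python
def prefix_query(text):
    """Prefix query against a keyword-style field (hypothetical field: username)."""
    return {"query": {"prefix": {"username": {"value": text}}}}


def match_phrase_prefix_query(text):
    """Phrase match where the last term is treated as a prefix (hypothetical field: name)."""
    return {"query": {"match_phrase_prefix": {"name": {"query": text}}}}


def edge_ngram_match_query(text):
    """Plain match against a subfield assumed to be indexed with edge n-grams."""
    return {
        "query": {
            "match": {"name.autocomplete": {"query": text, "operator": "and"}}
        }
    }


# Candidate queries to run through the same benchmark, keyed by a label.
candidates = {
    "prefix": prefix_query("lui"),
    "match_phrase_prefix": match_phrase_prefix_query("luis se"),
    "edge_ngram_match": edge_ngram_match_query("luis se"),
}
```

Keeping each candidate behind a small function makes it easy to run the same set of search terms through every variant and compare latencies side by side.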

Experiment Stats


How some simple changes can result in lower latency and lower memory usage.

Redis Strings are probably the most used (and abused) Redis data structure.

One of their main advantages is that they are binary-safe, which means you can store any type of binary data in Redis.

But as it turns out, most Redis users are serializing objects to JSON strings and storing them inside Redis.

What’s the problem, you might ask?

  • JSON serialization/deserialization is incredibly inefficient and costly
  • You end up using more space in storage (which is expensive in Redis since it’s an in-memory database)
  • You increase…
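As a rough illustration of the storage point, the sketch below encodes the same record twice: once as a JSON string and once as a fixed binary layout using Python’s `struct` module. The record and its field layout are hypothetical, and `struct` is just one possible binary encoding, not necessarily the one this article goes on to recommend.

```python
import json
import struct

# Hypothetical user record, used only for illustration.
user = {"id": 123456, "age": 31, "score": 87.5, "active": True}

# Option 1: JSON text. Field names are repeated inside every stored value.
as_json = json.dumps(user).encode("utf-8")

# Option 2: fixed binary layout with no padding ("<" = little-endian):
# "I" unsigned 32-bit int, "B" unsigned byte, "d" 64-bit float, "?" boolean.
RECORD = struct.Struct("<IBd?")
as_binary = RECORD.pack(user["id"], user["age"], user["score"], user["active"])


def decode(blob):
    """Rebuild the dict from the packed bytes."""
    user_id, age, score, active = RECORD.unpack(blob)
    return {"id": user_id, "age": age, "score": score, "active": active}


json_size = len(as_json)
binary_size = len(as_binary)  # 4 + 1 + 8 + 1 bytes
```

Because Redis Strings are binary-safe, either byte string could be stored as a value. The trade-off is that the packed form is no longer self-describing, so every reader has to agree on the layout.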


In this post, I’ll describe the building blocks of a resilient self-hosted transcoding platform using open source tools and AWS.

In part two, I’ll share a sample Python project that lets you bootstrap this in minutes.

General principles

When building a system like this, you should never compromise on these:

  • Self-healing (AWS ASG)
  • Retry failed jobs (SQS)
  • Instrumentation
  • Cost efficiency
  • Auto Scaling (AWS ASG)
  • Error logging (Sentry.io)
  • Central logging (AWS CloudWatch)

Most of those can be attained effortlessly through SaaS solutions and/or your cloud provider services.

Infrastructure diagram

AWS Infra diagram

You can replace the compute layer with AWS Lambda, for example; just bear in mind that…


Building a plugin to filter large lists of numbers and get 10x performance on an Elasticsearch cluster.

A few years ago, I faced a bottleneck in Elasticsearch when trying to filter on a big list of integer ids. I ended up writing a simple plugin that used Roaring Bitmaps to encode the list of ids, and ran some tests with promising results.

…unfortunately, it never went into production. We were using AWS Elasticsearch at the time and that doesn’t allow custom plugins.
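To give a feel for the idea, here is a minimal sketch that packs a list of ids into a compact, base64-encoded set and then tests membership per document. Note the swap: it uses a plain uncompressed bitset rather than Roaring Bitmaps, and the function names are hypothetical. It is not the plugin’s actual code, only an illustration of shipping one encoded blob instead of a long list of terms.

```python
import base64


def encode_ids(ids):
    """Pack non-negative integer ids into a bitset: bit i is set when id i is present."""
    bits = 0
    for i in ids:
        bits |= 1 << i
    # At least one byte, so an empty list still encodes to something decodable.
    size = (bits.bit_length() + 7) // 8 or 1
    return base64.b64encode(bits.to_bytes(size, "little")).decode("ascii")


def decode_bitset(encoded):
    """Turn the base64 payload back into an integer used as a bitset."""
    return int.from_bytes(base64.b64decode(encoded), "little")


def contains(bits, doc_id):
    """Constant-time membership test for a single document id."""
    return (bits >> doc_id) & 1 == 1


allowed = [3, 64, 1000]
payload = encode_ids(allowed)       # what a client would send with the query
bitset = decode_bitset(payload)     # what the filter would decode once per request
matches = [doc_id for doc_id in range(1001) if contains(bitset, doc_id)]
```

A plain bitset grows with the largest id, not with the number of ids, so it wastes space on sparse sets. That is the problem compressed formats such as Roaring Bitmaps are designed to address.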

The other day I came across this post, which made me realize that I wasn’t the only one with this…


Increase your Python code performance and security without changing the project source code.

Table of Contents

  • Introduction
  • Motivation
  • Benchmarks
  • Further Optimizations
  • The Perfect Dockerfile for Python
Photo by SpaceX on Unsplash

Introduction

Having a reliable Dockerfile as your base can save you hours of headaches and bigger problems down the road.

This post shares the “perfect” Python Dockerfile. Of course, there is no such thing as perfection, and I’ll gladly accept feedback on any issues you find.

TL;DR

Skip to the end to find a Dockerfile that is +20% faster than the default one on Docker Hub. It also includes special optimizations for gunicorn and for faster, safer builds.

Motivation

In a previous project, I built an elastic…


Active-Active multi-region is challenging and expensive, but sometimes it’s the only choice.

In my previous article (you can read it here), I showed the architecture used to handle a large-scale sneakers drop backend.

There was an essential part missing though, especially in our case with the strong requirement of “first come, first served”.

If the machines are in the USA and you’re trying to cop an item in Japan, the chances of winning will be slim to none just because of network latency. By the time your request hits the backend, chances are, you’re already behind someone else in the queue.

Since trusting the client clock is not an option, especially in…


A constant flow of document updates can bring an Elasticsearch cluster to its knees. Fortunately, there are ways to avoid that scenario.

As we’ve seen in my previous article, Elasticsearch doesn’t really support updates. In Elasticsearch, an update always means delete+create.

In a previous project, we were using Elasticsearch for full-text search and needed to save some signals, like new followers, along with the user document.

That was a big problem, since thousands of new signals for a single user could be generated in seconds, which meant thousands of sequential updates to the same document.

Going for the naive solution of just issuing those updates is a good way to set an Elasticsearch cluster on fire :)
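One generic way to avoid that, sketched below, is to coalesce signals in a buffer and write one combined update per document at a fixed interval. This is an illustration only, not necessarily the approach this article goes on to describe: the `UpdateBuffer` class and its method names are hypothetical, and it assumes the application can live with a short delay before new signals show up in search.

```python
from collections import defaultdict


class UpdateBuffer:
    """Accumulates counter increments per document until the next flush."""

    def __init__(self):
        # doc_id -> field -> accumulated increment
        self._pending = defaultdict(lambda: defaultdict(int))

    def add(self, doc_id, field, delta=1):
        """Record a signal in memory instead of issuing an update right away."""
        self._pending[doc_id][field] += delta

    def flush(self):
        """Return one combined update per document and reset the buffer."""
        batch = [
            {"id": doc_id, "increments": dict(fields)}
            for doc_id, fields in self._pending.items()
        ]
        self._pending.clear()
        return batch


buffer = UpdateBuffer()
for _ in range(1000):
    buffer.add("user-1", "new_followers")   # 1000 signals for the same user
buffer.add("user-2", "new_followers", 5)

batch = buffer.flush()   # two updates to send instead of 1001
```

In a real system the flushed batch would be handed to whatever writes to the cluster, and the buffer would need to be durable (a queue or a store outside the process) so pending signals survive a crash.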

We had tolerance for…


Multiple strategies that you can use to increase Elasticsearch write capacity for batch jobs and/or online transactions.

Over the last few years, I’ve faced bottlenecks and made many mistakes with the write capacity of different ES clusters, especially when one of the requirements was to write into a live index with strict SLAs for read operations.

If you use Elasticsearch in production environments, chances are, you’ve faced these issues too and maybe even made some of the same mistakes I did in the past!

I think having a clear, high-level picture of how ES works under the covers will help a lot when you’re trying to get the best performance…


How to build a backend that can handle millions of concurrent users efficiently and consistently.

Photo by Hermes Rivera on Unsplash

Context

Brands like Nike, Adidas, or Supreme created a new trend in the market called “drops”, where they release a limited number of items. It’s usually a limited run or a limited pre-release offer before the real release.

This poses some special challenges since every sale is basically a “Black Friday”, and you have thousands (or millions) of users trying to buy a very limited number of items at the exact same instant.

Main Requirements

  • All clients can see item changes in real time (stock, description, etc.);
  • Handle sudden increases…


How you can make the most out of this powerful database

Photo by Hoover Tung on Unsplash

Table of Contents

  • Common Issues
  • General principles
  • Indexes
  • Index Types
  • Improving queries
  • Locks
  • Rules of thumb
  • PG Config
  • BULK Updates/Inserts

Assumptions

  • You know basic SQL
  • You’ve already used PostgreSQL in the past
  • You have a basic understanding of what indexes and constraints are

Common issues you might have faced in the past

  • A query is slow periodically
  • A query is slow at times, but only affects certain users
  • High memory usage
  • High query latency, even for simple queries
  • The database is not responding
  • My server code is not able to connect but I’m able to connect with my root account

Luis Sena

Principal Engineer @ Farfetch https://www.linkedin.com/in/luis-sena/
