Improve ElasticSearch Filtering Performance 10x Using This Plugin

Building a plugin to filter large lists of numbers and get 10x performance on Elasticsearch cluster.

A few years ago, I faced a bottleneck in ElasticSearch when trying to filter on a big list of integer ids. I ended up writing a simple plug-in that used Roaringbitmaps to encode the list of ids and ran some tests with promising results.

…unfortunately, it never went into production. We were using AWS Elasticsearch at the time and that doesn’t allow custom plugins.

The other day I came across this post, which made me realize that I wasn’t the only one with this problem (you’re never the only one on the internet, I know…)

That realization gave me the motivation to dig up that old code, clean it up, and publish it as an open-source plug-in for ElasticSearch.

Choosing The Correct Data Structure

Feel free to skip this part since it won’t contribute much to the final result.

Our main requirements are:

Enter RoaringBitmap, it just fits like a glove into our requirements:

  • Very fast random access
  • Good compression ratio
  • Fast computation
  • Fast serialization

Roaring solves this problem. It works in the following manner. It divides the data into chunks of 2^16 integers (e.g., [0, 2^16), [2^16, 2 x 2^16), …). Within a chunk, it can use an uncompressed bitmap, a simple list of integers, or a list of runs. Whatever format it uses, they all allow you to check for the presence of a value quickly (e.g., with a binary search).

You can read a bit more here.

How this can improve your overall system:

  • Fewer data to transmit — not only from client to the server but also between ES nodes.
  • Less memory usage — our client will use less memory, but most importantly, so will the ES nodes.
  • Less disk space — if those lists need to be kept somewhere in a database, saving it using RoaringBitmaps can save a lot of space.
  • Efficient lookup — Roaringbitmaps can be incredibly efficient and fast for in-memory lookups

Benchmarks

Objects size in memory (Bytes)

Since the whole list needs to be sent to every Elasticsearch node and be kept in memory, these savings can be multiplied by the number of nodes used to serve the query.

Filtering to include a list of ids (lower is better)
Filtering to exclude a list of ids (lower is better)

And the most important result, our latency.

When we have less than 100 ids to filter (to include or exclude), using our plugin might even be slower — especially if you serialize before issuing the query instead of having it serialized beforehand.

After we try to send a list of 100+ ids, this plugin starts to shine as you can see in the previous images, going so far as reducing query time by some orders of magnitude.

That being said, I recommend doing some benchmarks with your production data and cluster to find if you need to set a “sweet spot” for the number of ids needed to use the plugin.

Implementation

The following snippet of code shows the entire logic. As you can see, it’s a very simple plugin! This is very good news for maintenance :)

In fact, most of my time was consumed trying to get ES integration tests to work properly… The official documentation is lacking, to say the least.

Note: Let me know in the comments if you’d like an article/tutorial on how to properly add unit and integration tests, building the plugin, and having a CI running all of those.

code snippet from the plugin responsible for the filtering

Python example

Usage is also very simple as you can see in this python example.

Using python to issue a query to Elasticsearch

The “operation” field greatly increases the plugin versatility since you can use it to exclude or include a big list of ids in Elasticsearch.

Github Project:

Into Elasticsearch? Check these out:

References:

https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-scripting-engine.html

http://www.technocratsid.com/how-to-create-an-elasticsearch-6-4-1-plugin/

https://www.elastic.co/guide/en/elasticsearch/plugins/7.11/plugin-authors.html

https://richardstartin.github.io/posts/roaringbitmap-performance-tricks

https://richardstartin.github.io/posts/a-quick-look-at-roaringbitmap

https://github.com/carrot2/elasticsearch-carrot2

https://github.com/vvanholl/elasticsearch-prometheus-exporter

Principal Engineer @ Farfetch https://www.linkedin.com/in/luis-sena/