Building a plugin to filter large lists of numbers and get 10x performance on Elasticsearch cluster.
A few years ago, I faced a bottleneck in ElasticSearch when trying to filter on a big list of integer ids. I ended up writing a simple plug-in that used Roaringbitmaps to encode the list of ids and ran some tests with promising results.
…unfortunately, it never went into production. We were using AWS Elasticsearch at the time and that doesn’t allow custom plugins.
The other day I came across this post, which made me realize that I wasn’t the only one with this problem (you’re never the only one on the internet, I know…)
That realization gave me the motivation to dig up that old code, clean it up, and publish it as an open-source plug-in for ElasticSearch.
Choosing The Correct Data Structure
Feel free to skip this part since it won’t contribute much to the final result.
Our main requirements are:
- Deterministic — this drops Bloom filter
- Less memory/bandwidth usage over ES approach — this drops Bit arrays
- Very fast random access — this drops Compressed Bit arrays
Enter RoaringBitmap, it just fits like a glove into our requirements:
- Very fast random access
- Good compression ratio
- Fast computation
- Fast serialization
Roaring solves this problem. It works in the following manner. It divides the data into chunks of 2^16 integers (e.g., [0, 2^16), [2^16, 2 x 2^16), …). Within a chunk, it can use an uncompressed bitmap, a simple list of integers, or a list of runs. Whatever format it uses, they all allow you to check for the presence of a value quickly (e.g., with a binary search).
You can read a bit more here.
How this can improve your overall system:
- Fewer data to transmit — not only from client to the server but also between ES nodes.
- Less memory usage — our client will use less memory, but most importantly, so will the ES nodes.
- Less disk space — if those lists need to be kept somewhere in a database, saving it using RoaringBitmaps can save a lot of space.
- Efficient lookup — Roaringbitmaps can be incredibly efficient and fast for in-memory lookups
Since the whole list needs to be sent to every Elasticsearch node and be kept in memory, these savings can be multiplied by the number of nodes used to serve the query.
And the most important result, our latency.
When we have less than 100 ids to filter (to include or exclude), using our plugin might even be slower — especially if you serialize before issuing the query instead of having it serialized beforehand.
After we try to send a list of 100+ ids, this plugin starts to shine as you can see in the previous images, going so far as reducing query time by some orders of magnitude.
That being said, I recommend doing some benchmarks with your production data and cluster to find if you need to set a “sweet spot” for the number of ids needed to use the plugin.
The following snippet of code shows the entire logic. As you can see, it’s a very simple plugin! This is very good news for maintenance :)
In fact, most of my time was consumed trying to get ES integration tests to work properly… The official documentation is lacking, to say the least.
Note: Let me know in the comments if you’d like an article/tutorial on how to properly add unit and integration tests, building the plugin, and having a CI running all of those.
Usage is also very simple as you can see in this python example.
The “operation” field greatly increases the plugin versatility since you can use it to exclude or include a big list of ids in Elasticsearch.
Filter large lists with Elasticsearch using Roaringbitmap - lsena/fastfilter-elasticsearch-plugin
Into Elasticsearch? Check these out:
The Complete Guide to Increase Your Elasticsearch Write Throughput
Multiple strategies that you can use to increase Elasticsearch write capacity for batch jobs and/or online…
Using Event Sourcing to Increase Elasticsearch Performance
A constant flow of document updates can bring an Elasticsearch cluster to its knees. Fortunately, there are ways to do…