Building a Plugin to Improve Elasticsearch Filter Performance Pt. 2
Using RoaringBitmaps to include or exclude big lists of integers in Elasticsearch
In a previous article, I shared a plugin I wrote that lets you filter on big lists of integers using RoaringBitmaps.
Today I bring you the latest version of that plugin, now properly using doc_values in order to improve its performance.
Let’s start with the best-case scenario for this plugin: excluding a big list of integers when searching inside Elasticsearch.
Exclude a List of Integers
In this graph, the only thing we can clearly see is that the two approaches are orders of magnitude apart: you can barely notice the red bars.
Let’s break this apart to understand the numbers a bit better:
Terms Query Latency
Plugin Latency
By the way, if you try to filter on more than 65536 terms, you will get this error from Elasticsearch:
failed to create query: The number of terms [XXXXXX] used in the Terms Query request has exceeded the allowed maximum of [65536]. This maximum can be set by changing the [index.max_terms_count] index level setting.
You can change that with the following command:
PUT /index-name/_settings
{
  "index" : {
    "max_terms_count" : 10000000
  }
}
But as we saw in the previous charts, it’s probably a good idea to keep that limit, both to avoid huge latencies and to avoid compromising Elasticsearch stability.
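For reference, the baseline being compared can be sketched in Python. This is a minimal sketch, assuming the exclusion is expressed as a terms query inside a bool `must_not` clause. The field name `user_id` and the helper `build_terms_exclusion` are hypothetical, and the guard simply mirrors the default `index.max_terms_count` of 65536 on the client side.

```python
import json

# Default value of index.max_terms_count, as quoted in the error message above.
DEFAULT_MAX_TERMS = 65536


def build_terms_exclusion(field, ids, max_terms=DEFAULT_MAX_TERMS):
    """Build a search body that excludes documents whose `field` is in `ids`.

    Hypothetical helper: it fails early on the client instead of waiting for
    Elasticsearch to reject an oversized terms list.
    """
    ids = list(ids)
    if len(ids) > max_terms:
        raise ValueError(
            f"{len(ids)} terms exceed the allowed maximum of {max_terms}"
        )
    return {"query": {"bool": {"must_not": [{"terms": {field: ids}}]}}}


if __name__ == "__main__":
    body = build_terms_exclusion("user_id", range(50_000))
    # Every ID travels as JSON text, so the payload grows with the list.
    print(len(json.dumps(body)), "bytes of JSON")
```

Because each ID is sent as text, the request body keeps growing with the list, which is part of why the terms query gets expensive for very large lists.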
In the following table, we can see the latency in milliseconds for each list size.
An important detail to note here: this is the server-side latency only. I’m reporting only this latency in an attempt to be fair. According to my measurements, the end-to-end difference would be even bigger, since the payload size increases dramatically when using JSON, but that number is more susceptible to the programming language and environment.
Even when filtering out 1 million ids, we still get latencies well below 10ms.
We can also use it to include a specific list of integers, as you’ll see next. This scenario doesn’t always perform better than the native terms query, possibly because of some performance tricks Elasticsearch does under the hood.
Filter on a List of Integers
Terms Query Latency
Plugin Latency
Here we can see the plugin loses miserably against the terms query for smaller lists. In this test environment, it only starts being faster with a list size of around 50k items.
This is probably due to the use of Elasticsearch Filter BitSets (and their cache), allowing Elasticsearch to skip checking every single document at the filter phase. This plugin, on the other hand, doesn’t have that advantage.
Filter on a List of Integers and Matching Text Strings
For all iterations, the queries with the plugin were always below 1 ms, making the bar invisible in this chart.
Plugin Implementation
The first part uses the RoaringBitmap and Base64 packages to deserialize the message.
I’m doing it inside the newFactory method in order to run this only once per request. If this is done at a later stage, you’ll end up running this for each Lucene segment, increasing your query latency.
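The once-per-request idea can be illustrated outside of Java. The sketch below is plain Python, not the plugin’s code: `new_factory` plays the role of `newFactory`, `decode_ids` is a hypothetical stand-in that unpacks Base64-encoded 32-bit integers instead of a real RoaringBitmap, and each Lucene segment is modeled as a plain list of integer values.

```python
import base64
import struct

# Counts how often the (comparatively expensive) decoding step runs.
stats = {"decodes": 0}


def encode_ids(ids):
    """Stand-in serializer: pack IDs as little-endian uint32 values, then Base64."""
    ids = list(ids)
    return base64.b64encode(struct.pack(f"<{len(ids)}I", *ids)).decode("ascii")


def decode_ids(encoded):
    """Stand-in for the Base64 + RoaringBitmap deserialization step."""
    stats["decodes"] += 1
    raw = base64.b64decode(encoded)
    return frozenset(struct.unpack(f"<{len(raw) // 4}I", raw))


def new_factory(encoded):
    """Decode once per request, then hand out cheap per-segment filters."""
    ids = decode_ids(encoded)  # runs once, no matter how many segments exist

    def new_instance(segment_values):
        # Runs once per segment and only reuses the already decoded set.
        return [value for value in segment_values if value in ids]

    return new_instance
```

Calling `decode_ids` inside `new_instance` instead would return the same results, but it would decode the payload once per segment rather than once per request.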
After you have your RoaringBitmap in memory, you can use it like a set/map to check if a specific integer is present.
In the following code snippet, we use DocValues to check all the values for the specified field.
I enforce the use of DocValues because of their performance, especially when you need to read a specific field across many documents. You can read more about their performance here.
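The per-document check is easy to picture with a Python stand-in. This is a sketch under assumptions rather than the plugin’s Java code: a `frozenset` replaces the RoaringBitmap, `doc_values` is the list of integers stored in the field for one document, and the rule that a single matching value decides the outcome is an assumption made for illustration.

```python
def keep_document(doc_values, ids, exclude=False):
    """Decide whether one document passes the filter.

    doc_values: every integer stored in the filtered field for this document.
    ids:        set-like container of the integers sent with the request
                (a frozenset here, a RoaringBitmap in the plugin).
    exclude:    False keeps documents with a listed value,
                True drops documents with a listed value.
    """
    found = any(value in ids for value in doc_values)
    return not found if exclude else found


if __name__ == "__main__":
    ids = frozenset({3, 7, 42})
    print(keep_document([1, 7], ids))                # include mode, 7 is listed
    print(keep_document([1, 7], ids, exclude=True))  # exclude mode, 7 is listed
```

A RoaringBitmap supports the same membership test as the `frozenset`, so the check costs one lookup per stored value.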
As you can see, the plugin continues to be very simple, and without too much code or complexity!
Python Client Example
If you want to include this in a project, it should be quite simple since most popular languages already have a RoaringBitmap package.
A Python example:
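As a minimal sketch of what such a client could look like, the function below Base64-encodes an already serialized bitmap and wraps it in an Elasticsearch script query. The script name `fast_filter` and the parameter names `field`, `operation`, and `terms` are assumptions made for illustration, so check the GitHub project for the actual names.

```python
import base64
import json


def build_bitmap_filter(serialized_bitmap, field, operation="exclude"):
    """Wrap serialized RoaringBitmap bytes in a script query.

    ASSUMPTION: the script name/lang "fast_filter" and the parameter names
    are placeholders; the real names are defined by the plugin.
    """
    if operation not in ("include", "exclude"):
        raise ValueError("operation must be 'include' or 'exclude'")
    terms = base64.b64encode(serialized_bitmap).decode("ascii")
    return {
        "query": {
            "bool": {
                "filter": {
                    "script": {
                        "script": {
                            "source": "fast_filter",  # placeholder name
                            "lang": "fast_filter",    # placeholder name
                            "params": {
                                "field": field,
                                "operation": operation,
                                "terms": terms,
                            },
                        }
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    # Stand-in bytes; a real client would serialize an actual bitmap.
    print(json.dumps(build_bitmap_filter(b"\x01\x02\x03", "user_id"), indent=2))
```

The bytes would come from whichever RoaringBitmap package your language offers. In Python, for example, the `pyroaring` package exposes a `BitMap` type with a `serialize()` method.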
GitHub Project:
Feel free to check the entire project!