Building a Plugin to Improve Elasticsearch Filter Performance Pt. 2

Using RoaringBitmaps to include or exclude big lists of integers in Elasticsearch

Luis Sena
Jan 4, 2023

In a previous article, I shared a plugin I built that lets you filter on big lists of integers using RoaringBitmaps.

Today I bring you the latest version of that plugin, now properly using doc_values in order to improve its performance.

Let’s start with the best-case scenario for this plugin: excluding a big list of integers when searching inside Elasticsearch.

Exclude a List of Integers

Excluding a list of integers. Terms Query vs Plugin Latency (lower is better)

In this graph, the only thing we can clearly see is that the two are orders of magnitude apart: you can barely notice the red bars.

Let’s break this apart to understand the numbers a bit better:

Terms Query Latency

After 100k integers, the Elasticsearch Terms query stops being usable in real-world scenarios.

Plugin Latency

With the plugin, the latency stays at ~0ms even when using 10k integers.

By the way, if you try to filter on more than 65536 terms, you will get this error from Elasticsearch:

failed to create query: The number of terms [XXXXXX] used in the Terms Query request has exceeded the allowed maximum of [65536]. This maximum can be set by changing the [index.max_terms_count] index level setting.

You can change that with the following command:

PUT /index-name/_settings
{
  "index": {
    "max_terms_count": 10000000
  }
}

But as we saw in the previous charts, it’s probably a good idea to keep that limit to avoid huge latencies and compromising Elasticsearch stability.
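For reference, here is a minimal sketch of the kind of terms-based exclusion being measured, with a client-side guard that mirrors the default limit. The field name `user_id` and the helper `build_exclude_query` are hypothetical; only the `bool`/`must_not`/`terms` query shape and the 65536 default come from Elasticsearch itself.

```python
import json

# Default value of index.max_terms_count, as quoted in the error message above.
DEFAULT_MAX_TERMS_COUNT = 65536


def build_exclude_query(field, ids, max_terms=DEFAULT_MAX_TERMS_COUNT):
    """Return a search body that excludes documents whose `field` is in `ids`."""
    ids = list(ids)
    if len(ids) > max_terms:
        # Fail early on the client instead of waiting for Elasticsearch to reject it.
        raise ValueError(
            f"{len(ids)} terms exceeds index.max_terms_count ({max_terms})"
        )
    return {"query": {"bool": {"must_not": [{"terms": {field: ids}}]}}}


# "user_id" is a hypothetical field name used only for illustration.
body = build_exclude_query("user_id", range(1000))
print(json.dumps(body)[:80])
```

Checking the size on the client keeps the failure cheap and explicit, rather than raising the index limit and paying for it in latency.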

In the following table, we can see the latency in milliseconds for each list size.

An important detail to note here: this is the server-side latency only. I’m reporting only this latency in an attempt to be fair. According to my measurements, the end-to-end difference would be even bigger, since the payload size increases dramatically when using JSON, but that number is more susceptible to the programming language and environment.

Even when filtering out 1 million ids, we still get latencies well below 10ms.
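To get a feel for the JSON payload growth mentioned above, a rough sketch like the following measures the request body size of a terms query for a few list sizes. The field name and the use of sequential ids are assumptions for illustration; real ids are usually larger numbers, which makes the JSON even longer.

```python
import json


def terms_payload_bytes(n_ids, field="user_id"):
    """Size in bytes of a must_not terms query body carrying n_ids integers."""
    # Sequential ids are a simplification; "user_id" is a hypothetical field.
    body = {
        "query": {"bool": {"must_not": [{"terms": {field: list(range(n_ids))}}]}}
    }
    return len(json.dumps(body).encode("utf-8"))


sizes = {n: terms_payload_bytes(n) for n in (1_000, 10_000, 100_000)}
for n, size in sizes.items():
    print(f"{n:>7} ids -> {size:>9} bytes of JSON")
```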

We can also use it to include a specific list of integers, as you’ll see next. This scenario doesn’t always perform better than Elasticsearch, possibly because of some optimizations Elasticsearch applies under the hood.

Filter on a list of integers

Terms Query Latency

Plugin Latency

Here we can see the plugin loses miserably against the terms query. In this test environment, it only starts being faster with a list size of around 50k items.

This is probably due to the use of Elasticsearch Filter BitSets (and their cache), allowing Elasticsearch to skip checking every single document at the filter phase. This plugin, on the other hand, doesn’t have that advantage.

Filter on a list of integers and matching text strings

For all iterations, the queries with the plugin were always below 1 ms, making the bar invisible in this chart.

Plugin Implementation

The first part uses the RoaringBitmap and Base64 packages to deserialize the message.

I’m doing it inside the newFactory method in order to run this only once per request. If this is done at a later stage, you’ll end up running this for each Lucene segment, increasing your query latency.

After you have your RoaringBitmap in memory, you can use it like a set/map to check if a specific integer is present.

In the following code snippet, we use DocValues to check all the values for the specified field.

I enforce the use of DocValues because of their performance, especially when you need to go through a specific field in multiple documents. You can read more about their performance here.
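The flow described above (decode once per request, then check each document's values segment by segment) can be sketched in Python as follows. This is an illustration of the structure only, not the plugin's code: a `frozenset` decoded from packed 32-bit integers stands in for the RoaringBitmap, a dict stands in for a segment's DocValues, and the parameter names `terms` and `operation` are hypothetical. Keeping documents with no values when excluding is also an assumption.

```python
import base64
import struct


def new_factory(params):
    """Runs once per request: decode the payload a single time."""
    raw = base64.b64decode(params["terms"])
    # Stand-in for RoaringBitmap deserialization: little-endian uint32 values.
    ids = frozenset(struct.unpack(f"<{len(raw) // 4}I", raw))
    exclude = params["operation"] == "exclude"

    def new_leaf(segment_doc_values):
        """Runs once per segment; maps doc id -> list of field values."""

        def execute(doc_id):
            # Check every value stored for this document against the decoded set.
            values = segment_doc_values.get(doc_id, [])
            hit = any(v in ids for v in values)
            return not hit if exclude else hit

        return execute

    return new_leaf


payload = base64.b64encode(struct.pack("<3I", 2, 5, 9)).decode("ascii")
new_leaf = new_factory({"terms": payload, "operation": "exclude"})

segment = {0: [1], 1: [5], 2: [7, 9], 3: []}  # toy "doc values" for one segment
execute = new_leaf(segment)
kept = [doc for doc in segment if execute(doc)]
print(kept)  # [0, 3]
```

Because the decoding happens in the outermost function, every segment reuses the same in-memory set instead of paying the deserialization cost again.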

As you can see, the plugin continues to be very simple, and without too much code or complexity!

Python Client Example

If you want to include this in a project, it should be quite simple since most popular languages already have a RoaringBitmap package.

A Python example:
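A minimal sketch of what such a client could look like, assuming the plugin is exposed as a script filter that receives a Base64-encoded serialized bitmap. The `lang`, `source`, and parameter names below are placeholders, not the plugin's actual API, so check the GitHub project for the real ones. To keep the sketch dependency-free, the bitmap bytes are faked here.

```python
import base64
import json


def build_plugin_query(serialized_bitmap, field, operation="exclude"):
    """Wrap serialized bitmap bytes in a script-filter search body."""
    terms = base64.b64encode(serialized_bitmap).decode("ascii")
    return {
        "query": {
            "bool": {
                "filter": [
                    {
                        "script": {
                            "script": {
                                "lang": "my_roaring_plugin",  # hypothetical name
                                "source": "roaring_filter",   # hypothetical name
                                "params": {                   # hypothetical params
                                    "field": field,
                                    "operation": operation,
                                    "terms": terms,
                                },
                            }
                        }
                    }
                ]
            }
        }
    }


# Placeholder bytes; a real client would pass a serialized RoaringBitmap here.
fake_bitmap_bytes = bytes(range(16))
body = build_plugin_query(fake_bitmap_bytes, "user_id")
print(json.dumps(body, indent=2))
```

In a real project the bytes would come from a RoaringBitmap package, for example something like `pyroaring.BitMap(ids).serialize()`, and the resulting body would be sent with your usual Elasticsearch client.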

GitHub Project:

Feel free to check the entire project!

How does this all sound? Is there anything you’d like me to expand on? Let me know your thoughts in the comments section below (and hit the clap if this was useful)!

Stay tuned for the next post. Follow so you won’t miss it!
