Stop using the _id field in Elasticsearch

These simple changes will dramatically improve Elasticsearch's performance.

Luis Sena
4 min read · Oct 24, 2022

Oftentimes, you only need to retrieve a few fields from Elasticsearch. If that’s the case, using doc_values for those fields can drastically improve your query performance.

But one field is retrieved by default: _id. It might look harmless, but as we’ll see later on, it can hinder your system’s performance.

Another typical Elasticsearch use is as an auxiliary database, especially for full-text search where you’ll use Elasticsearch to retrieve the ids of the relevant documents.

Figure: an example where Elasticsearch is used for full-text search and PostgreSQL is the source of truth for retrieving the final data.

If you’re like a good part of the Elasticsearch user base, you’re probably saving your external id in the _id field to make things simple and avoid duplicating data. Or maybe you don’t even use _id but since it’s returned by default you end up using it indirectly.

The problem is that _id is saved using stored_fields, which can have a higher read overhead than doc_values.
This is especially true from Elasticsearch 7.10 onwards, where stored_fields started being written and read in blocks of 80KB (previously 16KB). Lucene (the search engine underlying Elasticsearch) made this change to improve document compression, especially for smaller documents.

More about this format later in this article.

How to increase Elasticsearch read performance

The solution is simple. You just need to start using doc_values and disable the retrieval of stored_fields (this includes the _id field!).
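For this to work, the external id has to live in a field of its own that is backed by doc_values. The sketch below builds an index body and a document as plain Python dictionaries. The index name test and the field my_id_field come from the queries in this article; the title field and the helper names are hypothetical and only there for illustration.

```python
import json

# Names reused from the queries in this article.
INDEX_NAME = "test"
ID_FIELD = "my_id_field"


def build_index_body(id_field: str = ID_FIELD) -> dict:
    """Return an index body whose id field is a keyword backed by doc_values."""
    return {
        "mappings": {
            "properties": {
                # keyword fields have doc_values enabled by default;
                # it is spelled out here only to make the intent explicit.
                id_field: {"type": "keyword", "doc_values": True},
                # Hypothetical full-text field, not taken from the article.
                "title": {"type": "text"},
            }
        }
    }


def build_document(external_id: str, title: str, id_field: str = ID_FIELD) -> dict:
    """Return a document that carries the external id in its own field."""
    return {id_field: external_id, "title": title}


if __name__ == "__main__":
    # Print the requests in console form so they can be pasted into Kibana.
    print(f"PUT {INDEX_NAME}")
    print(json.dumps(build_index_body(), indent=2))
    print(f"POST {INDEX_NAME}/_doc")
    print(json.dumps(build_document("42", "hello world"), indent=2))
```

Whether you also keep the same value in _id is a separate decision; the point here is only that the value you want to read back is available as a doc_values field.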

So a query like this:

GET test/_search
{
  "query": {
    "match_all": {}
  },
  "_source": false
}

Becomes this:

GET test/_search
{
  "query": {
    "match_all": {}
  },
  "_source": false,
  "stored_fields": "_none_",
  "docvalue_fields": ["my_id_field"]
}

  • "stored_fields": "_none_": This will disable the retrieval of all stored_fields like _source and _id.
  • "docvalue_fields": ["my_id_field"]: This allows you to retrieve only the selected doc_values fields.
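With stored fields disabled, the hits no longer carry _id, so client code has to read the id from the fields section of each hit, where every doc_values field comes back as a list of values. Here is a minimal sketch of that step, assuming the field is called my_id_field as in the query above. The helper names are hypothetical and the sample response is hand-written and trimmed, not captured from a real cluster.

```python
ID_FIELD = "my_id_field"  # field name used in the query above


def build_search_body(query: dict, id_field: str = ID_FIELD) -> dict:
    """Wrap a query so that only the doc_values id field is returned."""
    return {
        "query": query,
        "_source": False,
        "stored_fields": "_none_",
        "docvalue_fields": [id_field],
    }


def extract_ids(response: dict, id_field: str = ID_FIELD) -> list:
    """Collect the id values from the 'fields' section of each hit."""
    ids = []
    for hit in response.get("hits", {}).get("hits", []):
        # doc_values fields are returned as a list, even for single values.
        ids.extend(hit.get("fields", {}).get(id_field, []))
    return ids


# Hand-written, trimmed sample of a search response (an assumption, for illustration).
SAMPLE_RESPONSE = {
    "hits": {
        "hits": [
            {"_index": "test", "_score": 1.0, "fields": {"my_id_field": ["a1"]}},
            {"_index": "test", "_score": 1.0, "fields": {"my_id_field": ["b2"]}},
        ]
    }
}

if __name__ == "__main__":
    print(build_search_body({"match_all": {}}))
    print(extract_ids(SAMPLE_RESPONSE))
```

The extracted ids can then be used to fetch the full rows from the source of truth, such as PostgreSQL in the setup described earlier.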

Benchmark

Index Size: 1.3 million documents
Document Size: ~4KB
Queries: 10 000 queries with random match clauses

Figure: the average latency of 10 000 random queries. Lower is better.

The results are as expected. Even for relatively small document sizes, we can see that excluding the _id from our results and retrieving only a doc_values field dramatically improves our latency. Expect to see even better results for bigger documents!
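The harness behind this benchmark is not shown in the article. If you want to run a comparable measurement on your own data, it could be sketched along these lines. Everything here is an assumption: run_query stands for whatever function sends a search body to your cluster, and the title field and term list are placeholders.

```python
import random
import statistics
import time
from typing import Callable, Iterable, Iterator, Sequence


def random_match_bodies(
    terms: Sequence[str],
    count: int,
    field: str = "title",          # hypothetical text field
    doc_values_only: bool = False,
    seed: int = 0,
) -> Iterator[dict]:
    """Yield search bodies with a random match clause, in either variant."""
    rng = random.Random(seed)  # fixed seed so both variants see the same terms
    for _ in range(count):
        body = {"query": {"match": {field: rng.choice(terms)}}, "_source": False}
        if doc_values_only:
            body["stored_fields"] = "_none_"
            body["docvalue_fields"] = ["my_id_field"]
        yield body


def average_latency_ms(run_query: Callable[[dict], object], bodies: Iterable[dict]) -> float:
    """Run every body through run_query and return the mean wall-clock time in ms."""
    timings = []
    for body in bodies:
        start = time.perf_counter()
        run_query(body)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(timings)


if __name__ == "__main__":
    # Stand-in for a real client call; replace with a function that hits your cluster.
    fake_run_query = lambda body: None
    baseline = average_latency_ms(fake_run_query, random_match_bodies(["foo", "bar"], 100))
    optimized = average_latency_ms(
        fake_run_query, random_match_bodies(["foo", "bar"], 100, doc_values_only=True)
    )
    print(f"baseline: {baseline:.4f} ms, doc_values only: {optimized:.4f} ms")
```

Using the same seed for both variants keeps the comparison fair, since each run issues the same sequence of match terms.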

Deep dive into Lucene internals

We’ve seen that it can be more efficient to skip _id and retrieve fields from doc_values instead, and we know how to apply this in Elasticsearch.

For many of you reading this article, that’s probably enough. If you want to better understand what’s happening behind the scenes to make this possible, keep reading!

Stored Fields

For stored fields, each document is saved as a row that contains all the stored fields consecutively.
The first field is “_id”, followed by the other individual stored fields, and “_source” is the last stored field in the row.

How Lucene retrieves stored fields

Earlier, I oversimplified how those blocks are saved in Lucene, and so does the diagram above. In fact, each 80KB block is split into 10 sub-blocks of 8KB each, which share a dictionary.

The benefits are twofold. You get better compression if you have small documents since you’ll have multiple documents inside each block.

For reading, since sub-blocks can be decompressed individually, you might get away with decompressing a single sub-block to retrieve a specific field or document.
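To get a feel for how many sub-blocks a read touches, here is a small arithmetic sketch. It is a simplified model, not Lucene's actual reader: it assumes a document sits at a known byte offset inside one uncompressed 80KB block and ignores the shared dictionary and documents larger than a block. The function name is hypothetical.

```python
SUB_BLOCK_SIZE = 8 * 1024                            # 8KB per sub-block
SUB_BLOCKS_PER_BLOCK = 10
BLOCK_SIZE = SUB_BLOCK_SIZE * SUB_BLOCKS_PER_BLOCK   # 80KB per block


def sub_blocks_to_decompress(offset_in_block: int, length: int) -> range:
    """Return the indices of the sub-blocks covering [offset, offset + length)."""
    if length <= 0:
        raise ValueError("length must be positive")
    if offset_in_block < 0 or offset_in_block + length > BLOCK_SIZE:
        raise ValueError("document does not fit inside one block")
    first = offset_in_block // SUB_BLOCK_SIZE
    last = (offset_in_block + length - 1) // SUB_BLOCK_SIZE
    return range(first, last + 1)


if __name__ == "__main__":
    # A 4KB document at the start of the block fits in a single sub-block.
    print(list(sub_blocks_to_decompress(0, 4 * 1024)))
    # The same document starting 6KB into the block straddles two sub-blocks.
    print(list(sub_blocks_to_decompress(6 * 1024, 4 * 1024)))
```

In this model, a ~4KB document like the ones in the benchmark touches one or two sub-blocks, depending on where it falls in the block.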

To retrieve a value, Lucene first needs to read the Field Index (.fdx) file. This file stores two monotonic (ascending) arrays: one with the first doc ID of each block of compressed documents, and another with the corresponding offsets on disk. The array of doc IDs is binary-searched to find the block that contains the requested doc ID, and the associated offset on disk is then read from the second array.

This results in potentially two disk seeks for each stored field value you need to retrieve.
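The lookup described above can be illustrated with two plain Python lists and the standard bisect module. This is a sketch of the idea, not Lucene's on-disk format: the doc IDs and offsets are made up, and find_block is a hypothetical helper.

```python
from bisect import bisect_right
from typing import Sequence, Tuple


def find_block(
    first_doc_ids: Sequence[int], offsets: Sequence[int], doc_id: int
) -> Tuple[int, int]:
    """Return (block index, disk offset) of the block that contains doc_id.

    first_doc_ids[i] is the first doc ID stored in block i and offsets[i] is
    where that block starts on disk. Both sequences must be ascending.
    """
    if len(first_doc_ids) != len(offsets):
        raise ValueError("both arrays must have the same length")
    if not first_doc_ids or doc_id < first_doc_ids[0]:
        raise ValueError("doc_id is not covered by any block")
    # Rightmost block whose first doc ID is <= doc_id.
    block = bisect_right(first_doc_ids, doc_id) - 1
    return block, offsets[block]


if __name__ == "__main__":
    # Made-up index: four blocks starting at doc IDs 0, 20, 41 and 63.
    first_doc_ids = [0, 20, 41, 63]
    offsets = [0, 31_000, 64_500, 97_200]
    print(find_block(first_doc_ids, offsets, 45))
```

The binary search only tells Lucene where the block starts. Reading and decompressing the block is a second step, which is where the second potential disk seek comes from.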

It’s worth noting that those Doc IDs are internal to Lucene and have nothing to do with Elasticsearch _id. They are unique only inside each Lucene segment and are incremented with each new document.

DocValues

Lucene stores all the values of a DocValues field together, one after another. This layout is known as columnar (column-oriented) storage.

Those values can be stored in different formats that optimize their space usage and can be divided into blocks that each can use a different compression format.

Since they are stored in docId order, Lucene just needs to do a sequential read in docId order to retrieve the values of the field for the matched documents.
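The difference between the two layouts can be shown with a toy model. The rows below stand in for stored fields (one row per document, with _id first and _source last), and the column stands in for a DocValues field indexed by doc ID. The data and helper names are made up for illustration, and real Lucene encodes and compresses both structures.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

# Row-oriented: each document keeps all its stored fields together.
ROWS: List[Dict[str, str]] = [
    {"_id": "a1", "_source": '{"title": "first"}'},
    {"_id": "b2", "_source": '{"title": "second"}'},
    {"_id": "c3", "_source": '{"title": "third"}'},
    {"_id": "d4", "_source": '{"title": "fourth"}'},
]


def to_column(rows: Sequence[Dict[str, str]], field: str) -> List[str]:
    """Column-oriented: all values of one field, in doc ID order."""
    return [row[field] for row in rows]


def read_doc_values(column: Sequence[str], matched_doc_ids: Iterable[int]) -> List[Tuple[int, str]]:
    """Read the column values of the matched documents in ascending doc ID order."""
    return [(doc_id, column[doc_id]) for doc_id in sorted(matched_doc_ids)]


if __name__ == "__main__":
    id_column = to_column(ROWS, "_id")
    # Only the id column is touched; the bulky _source values are never read.
    print(read_doc_values(id_column, [3, 1]))
```

Visiting the matches in doc ID order is what turns the reads into a single forward pass over the column.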
