Following our discussion of the relative merits of dynamic mapping, and how using explicit mappings can help you manage your data, we move on to the second feature to highlight: index aliases.
Index aliases may appear to be just a naming convenience, but they act as an important abstraction layer between your Elasticsearch indices and your application code.
We will review multi-index queries and the mechanics of index aliases, then present two real-world use cases for aliases: managing cluster data volume, and performing downtime-free maintenance operations.
Complete example source for this article can be found here.
You can query multiple Elasticsearch indices in one search operation. The indices can be specified with wildcard patterns, or by listing multiple indices using commas as separators.
Let’s create two indices to store visitor logs, one for records from 2017 and one for 2018:
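A minimal sketch of the two create-index requests, written in Python with only the standard library. The index name visitor_logs_2018, the field names (user_id, ip, timestamp, url) and the typeless mapping format of Elasticsearch 7 or later are assumptions for illustration. The ip field is deliberately mapped as text, which is the mistake corrected in the maintenance use case. The code only builds and prints the requests; send them with whichever HTTP client you prefer.

```python
import json

# Assumed mapping for illustration; the field names are hypothetical.
# "ip" is mapped as text here, the mistake that is fixed later by reindexing.
VISITOR_LOG_MAPPING = {
    "mappings": {
        "properties": {
            "user_id": {"type": "keyword"},
            "ip": {"type": "text"},
            "timestamp": {"type": "date"},
            "url": {"type": "keyword"},
        }
    }
}


def create_index_request(index_name):
    """Return (method, path, body) for an Elasticsearch create-index call."""
    return ("PUT", f"/{index_name}", VISITOR_LOG_MAPPING)


requests_to_send = [
    create_index_request("visitor_logs_2017"),
    create_index_request("visitor_logs_2018"),
]

for method, path, body in requests_to_send:
    print(method, path)
    print(json.dumps(body, indent=2))
```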
Add some data to each of these indices. These log data represent two separate log events generated by one user, identified by user-id 30c1b62a:
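As a rough sketch, the two log events could be indexed with requests like the ones built below. Only the user id 30c1b62a comes from the article; the other field values, the field names and the index name visitor_logs_2018 are made up for illustration.

```python
import json

USER_ID = "30c1b62a"  # user id taken from the article

# Hypothetical log events; every field other than user_id is invented.
events = [
    ("visitor_logs_2017", {
        "user_id": USER_ID,
        "ip": "10.0.0.5",
        "timestamp": "2017-11-03T08:15:00Z",
        "url": "/pricing",
    }),
    ("visitor_logs_2018", {
        "user_id": USER_ID,
        "ip": "10.0.0.5",
        "timestamp": "2018-02-14T17:42:00Z",
        "url": "/signup",
    }),
]


def index_document_request(index_name, document):
    """Return (method, path, body); Elasticsearch assigns the document id."""
    return ("POST", f"/{index_name}/_doc", document)


index_requests = [index_document_request(name, doc) for name, doc in events]

for method, path, body in index_requests:
    print(method, path, json.dumps(body))
```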
Then we search the indices for the log events for that user. Note how multiple indices are specified in the query, separated by a comma:
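The multi-index search might look like the following sketch. It assumes a keyword field named user_id (a hypothetical name) and the index name visitor_logs_2018; the comma-separated index list in the URL path is the part that matters.

```python
import json


def search_user_events(index_names, user_id):
    """Return (method, path, body) for a search across several indices."""
    target = ",".join(index_names)  # comma-separated list in the URL path
    body = {"query": {"term": {"user_id": user_id}}}
    return ("GET", f"/{target}/_search", body)


method, path, body = search_user_events(
    ["visitor_logs_2017", "visitor_logs_2018"], "30c1b62a"
)
print(method, path)
print(json.dumps(body, indent=2))
```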
The _index field of each result also shows that the data come from multiple source indices.
Being able to search multiple indices with one query is very useful. Instead of listing each index in the query URL, we could have achieved identical results by specifying a wildcard pattern, e.g.:
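A sketch of the wildcard form, assuming both indices share the prefix visitor_logs_ and that user_id is a keyword field (both hypothetical). The fnmatchcase call is only a local approximation of how a trailing * selects index names, and orders_2018 is an invented index included to show a non-match.

```python
from fnmatch import fnmatchcase

INDEX_PATTERN = "visitor_logs_*"  # assumed naming convention


def wildcard_search_request(pattern, user_id):
    """Return (method, path, body) for a search over a wildcard pattern."""
    return ("GET", f"/{pattern}/_search",
            {"query": {"term": {"user_id": user_id}}})


method, path, body = wildcard_search_request(INDEX_PATTERN, "30c1b62a")
print(method, path)

# Local approximation of which index names the "*" wildcard would select.
existing_indices = ["visitor_logs_2017", "visitor_logs_2018", "orders_2018"]
matched = [name for name in existing_indices
           if fnmatchcase(name, INDEX_PATTERN)]
print(matched)
```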
Index aliases are another way to work with multiple indices at the same time. An index alias is simply a grouping of a number of indices under a single logical alias name. Index aliases have their own API, allowing you to create, manage and delete aliases. The typical operation is to add an index to the alias or remove one from it, and a number of operations can be grouped into a single atomic API call.
Here we create an index alias named “visitor_logs” and add both the 2017 and 2018 visitor logs indices to it. The alias name can be substituted anywhere that an index list or pattern match can be used, e.g. in a search:
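The alias creation and the search through it can be sketched as below. The _aliases endpoint and its add action are part of the Elasticsearch alias API; the index name visitor_logs_2018 and the user_id field are assumptions.

```python
import json


def add_to_alias_request(alias, index_names):
    """Return one POST /_aliases request that adds every index to the alias."""
    actions = [{"add": {"index": name, "alias": alias}} for name in index_names]
    return ("POST", "/_aliases", {"actions": actions})


def alias_search_request(alias, user_id):
    """Search through the alias exactly as if it were an index name."""
    return ("GET", f"/{alias}/_search",
            {"query": {"term": {"user_id": user_id}}})


alias_request = add_to_alias_request(
    "visitor_logs", ["visitor_logs_2017", "visitor_logs_2018"]
)
search_request = alias_search_request("visitor_logs", "30c1b62a")

for method, path, body in (alias_request, search_request):
    print(method, path)
    print(json.dumps(body, indent=2))
```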
Because we don’t need to know the actual index names for the operation, and we can transparently change the indices referenced by the alias without impacting users of the alias, this turns out to be an incredibly useful feature. We will focus on two main benefits: long-term data management, and structural maintenance.
Use case: data volume management
Expunging data from data stores is frequently necessary. Certain data sets can get very large over time, but the value of the data decreases with its age (e.g. log streams). You may need to implement a time-based data retention policy, and remove old data from your systems. We can use index aliases to simplify removing data from Elasticsearch.
Continuing the visitor logs indices example above, we consider what happens over time as the indices grow. When the time comes to reduce the total data store size, we can remove the oldest data from Elasticsearch without any downtime, without any query interruptions, and without any client-side changes.
For a better example, we can add more data to the indices:
It’s not an enormous amount of data, but the _cat APIs show us how many documents and how much data we have indexed:
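One way to ask the _cat API for per-index document counts and store sizes is sketched below, using the format=json option so the response is easy to total. The sample rows are illustrative values chosen to be consistent with the totals quoted in this article, not captured cluster output, and the visitor_logs_* pattern is an assumed naming convention.

```python
from urllib.parse import urlencode


def cat_indices_request(pattern):
    """Return (method, path) listing doc counts and sizes for matching indices."""
    query = urlencode(
        {"format": "json", "h": "index,docs.count,store.size"}, safe=","
    )
    return ("GET", f"/_cat/indices/{pattern}?{query}")


def total_documents(cat_rows):
    """Sum docs.count over the rows of a JSON-formatted _cat response."""
    return sum(int(row["docs.count"]) for row in cat_rows)


method, path = cat_indices_request("visitor_logs_*")
print(method, path)

# Illustrative rows only; _cat returns every value as a string.
sample_rows = [
    {"index": "visitor_logs_2017", "docs.count": "51", "store.size": "55kb"},
    {"index": "visitor_logs_2018", "docs.count": "14", "store.size": "29kb"},
]
print(total_documents(sample_rows))
```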
i.e. a total of 65 documents and 84kB of storage.
If we now want to reduce our cluster’s data size, we would likely want to remove the oldest visitor logs index. First, remove visitor_logs_2017 from the alias:
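Removing the index from the alias uses the same _aliases endpoint with a remove action. A minimal sketch of that request:

```python
import json


def remove_from_alias_request(alias, index_name):
    """Return a POST /_aliases request that detaches one index from the alias."""
    actions = [{"remove": {"index": index_name, "alias": alias}}]
    return ("POST", "/_aliases", {"actions": actions})


method, path, body = remove_from_alias_request(
    "visitor_logs", "visitor_logs_2017"
)
print(method, path)
print(json.dumps(body, indent=2))
```

The index itself still exists after this call; only the alias membership changes, which is why queries through the alias stop seeing the 2017 data immediately.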
Immediately, when we re-query the index alias, we get fewer results:
Now we are safe to delete the old 2017 index, and verify that space has been freed:
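The deletion is a plain DELETE on the index name. The sketch below adds a small client-side guard, which is an illustrative convention rather than anything Elasticsearch enforces: it refuses to build the request while the index is still listed as an alias member. The alias_members set is a hypothetical local record of the alias contents.

```python
def delete_index_request(index_name, alias_members):
    """Return (method, path) for deleting an index that no alias still uses."""
    if index_name in alias_members:
        raise ValueError(f"{index_name} is still referenced by the alias")
    return ("DELETE", f"/{index_name}")


# Hypothetical alias contents after the 2017 index has been removed from it.
alias_members = {"visitor_logs_2018"}

method, path = delete_index_request("visitor_logs_2017", alias_members)
print(method, path)
```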
i.e. down to just 14 documents and 29kB storage, with just one index remaining.
This date-based scheme is easy to adapt to many kinds of operational requirements.
Use case: maintenance
Once you have an index alias in place, you will find it considerably easier to restructure the underlying indices without affecting users. We have found this very helpful for the routine reindexing operations that are needed as an Elasticsearch index design evolves over time.
Specifically, we often use this technique to adjust index mappings to enable new queries or make existing query loads more efficient, change index shard counts (or other index parameters) to optimize cluster performance, split an index into two, or run a script to correct bad data.
In general, the approach is to reindex from an index in the alias to an index (or indices) not in the alias. Once this operation is complete, the source index and destination index are swapped in the alias, and the new data are immediately available for query.
When we set up our index mappings earlier, we made a mistake and mapped the IP address field as text instead of as the ip type. Let’s fix that, and do so without any downtime.
First, we create a new index with a corrected mapping:
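A sketch of the corrected index definition, assuming the remaining 2018 index is the one being fixed. The new index name visitor_logs_2018_v2 and all field names are hypothetical, and the typeless mapping format assumes Elasticsearch 7 or later. The only change from the earlier mapping is that ip now uses the ip field type.

```python
import json

# Hypothetical field names; "ip" now uses the dedicated ip field type.
CORRECTED_MAPPING = {
    "mappings": {
        "properties": {
            "user_id": {"type": "keyword"},
            "ip": {"type": "ip"},
            "timestamp": {"type": "date"},
            "url": {"type": "keyword"},
        }
    }
}

NEW_INDEX = "visitor_logs_2018_v2"  # assumed name for the replacement index

create_request = ("PUT", f"/{NEW_INDEX}", CORRECTED_MAPPING)

method, path, body = create_request
print(method, path)
print(json.dumps(body, indent=2))
```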
Next, we use the reindex API to re-index our data into the index with the correct mapping:
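The reindex call copies documents from a source index to a destination index. In this sketch the source is assumed to be visitor_logs_2018 and the destination is a hypothetical visitor_logs_2018_v2 that already exists with the corrected mapping.

```python
import json


def reindex_request(source_index, dest_index):
    """Return a POST /_reindex request copying documents between two indices."""
    body = {"source": {"index": source_index}, "dest": {"index": dest_index}}
    return ("POST", "/_reindex", body)


method, path, body = reindex_request("visitor_logs_2018", "visitor_logs_2018_v2")
print(method, path)
print(json.dumps(body, indent=2))
```

Because the destination index is not yet in the alias, queries through the alias keep reading the old index while the copy runs.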
Then we can adjust the index alias to use the new correct index, and to stop using the old index:
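The swap can be expressed as a single _aliases call containing both an add and a remove action, so the alias never points at zero indices or at both copies of the data. The index names visitor_logs_2018 and visitor_logs_2018_v2 are assumptions for illustration.

```python
import json


def swap_alias_request(alias, old_index, new_index):
    """Return one POST /_aliases request that swaps old_index for new_index."""
    actions = [
        {"add": {"index": new_index, "alias": alias}},
        {"remove": {"index": old_index, "alias": alias}},
    ]
    return ("POST", "/_aliases", {"actions": actions})


method, path, body = swap_alias_request(
    "visitor_logs", "visitor_logs_2018", "visitor_logs_2018_v2"
)
print(method, path)
print(json.dumps(body, indent=2))
```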
At this point the old index can be deleted to reclaim disk space:
Finally, we can show that the reindex was successful by running a specialized IP range query against the newly mapped IP fields, but still using the same index alias:
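One form such a query could take is a term query with CIDR notation, which Elasticsearch accepts on fields of type ip. The field name ip, the address block 10.0.0.0/8 and the sample addresses are assumptions; the ipaddress checks below run locally and only illustrate which addresses fall inside the block.

```python
import ipaddress
import json


def ip_range_search_request(alias, cidr):
    """Return (method, path, body) matching documents whose ip is in the block."""
    ipaddress.ip_network(cidr)  # raises ValueError if the CIDR block is malformed
    return ("GET", f"/{alias}/_search", {"query": {"term": {"ip": cidr}}})


method, path, body = ip_range_search_request("visitor_logs", "10.0.0.0/8")
print(method, path)
print(json.dumps(body, indent=2))

# Local illustration of CIDR membership for two made-up addresses.
network = ipaddress.ip_network("10.0.0.0/8")
sample_ips = ["10.0.0.5", "192.168.1.20"]
in_range = [ip for ip in sample_ips if ipaddress.ip_address(ip) in network]
print(in_range)
```

A query like this does not behave as an address-range match against a text-mapped field, which is what makes it a useful check that the new mapping is in effect.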
All of this is possible without an index alias, but you would lose the continuity of querying and the opportunity to test the results of your reindex before swapping the indices in the alias.
For consideration when using index aliases…
The index alias API has a number of options, and it’s worth being familiar with the complete set. The most important thing to know is that a single API call is atomic, and that you can perform multiple changes to an alias in a single POST. Use this to your advantage for zero-downtime maintenance.
Most of the time, you want the indices behind an alias to have consistent mappings and data shape. It will require effort to keep all the indices consistent when you’re trying to make fast changes, but it will help avoid problems later.
Index aliases can’t be used everywhere. An alias that points to more than one index cannot, by default, accept writes, so when you index a document you must know which physical index to write to (Elasticsearch 6.4 and later let you designate one index in an alias as the write index with the is_write_index option). The same is true of update operations. This abstraction leakage and subsequent duplication of naming logic is a little unfortunate, but a standardized index naming convention will help you enormously.
Check in next time as we look at index templates. They are another strong feature of Elasticsearch that can help manage both index mappings and index aliases, and they clearly bring value to your data store configuration.
Ian Truslove is a co-founder of Cambium Consulting. He specializes in building large-scale resilient data processing systems using tools like Clojure and Elasticsearch. When not hunched over an Emacs terminal, you might find him on a bike in the wilds of Colorado.