Our Favorite Elasticsearch Features: Part 1 - Explicit Mapping
This article is the first in a series covering some basic yet really useful features of Elasticsearch. If you are new to Elasticsearch you may not be aware of these features, and knowing these techniques will probably help you design a more maintainable data index.
The features we will outline are:
- explicit mapping (disable dynamic mapping)
- index aliases
- index templates
Today, we will cover the first of these: the pros and cons of dynamic mapping, and how valuable explicitly defining your index mappings can be.
Complete example source for this article can be found here.
What is mapping?
Mapping is the way Elasticsearch defines the underlying data type with which a given field is stored. When a JSON document is indexed into Elasticsearch, the keys in the object are used to look up the data type mapping to use for the data value. Values with the same key are all mapped with the same data type.
You could think of this as a database schema.
What is dynamic mapping?
By default, Elasticsearch indices are created with dynamic mapping enabled. The first time a given key is indexed, Elasticsearch will determine an appropriate data type to use, create a mapping for that key, and index the field.
If you index a field that looks like a date, it gets mapped as a date. Numbers, booleans, and strings are all detected and mapped (see here for reference).
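The detection rules can be approximated in a few lines of Python. This is a rough sketch of the default behavior, not Elasticsearch's implementation: `guess_dynamic_type` is a hypothetical helper name, and the date pattern is an assumption that covers only ISO-8601-style strings.

```python
import re

# Assumption: a simplified stand-in for the default date detection.
# The real rules accept more formats than this ISO-8601-style pattern.
_DATE_LIKE = re.compile(
    r"^\d{4}-\d{2}-\d{2}"
    r"(T\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:\d{2})?)?$"
)


def guess_dynamic_type(value):
    """Approximate the field type dynamic mapping would pick for a JSON value."""
    if value is None:
        return None  # a null value does not create a mapping
    if isinstance(value, bool):  # check before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        # an array takes the type of its first non-null element
        for item in value:
            if item is not None:
                return guess_dynamic_type(item)
        return None
    if isinstance(value, str):
        return "date" if _DATE_LIKE.match(value) else "text"
    raise TypeError(f"not a JSON value: {value!r}")
```

Note that a string such as `"10.1.2.3"` falls through to `text`. By default a dynamically mapped string also gets a `keyword` sub-field, which this sketch leaves out.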
This is fantastic for exploratory work, and for when your data structure isn't necessarily known before indexing. All you have to do to get going is create a new index, index documents, and you can immediately use all of the search and query power of Elasticsearch.
We will use a Docker container to demonstrate the examples that follow:
To show how Elasticsearch handles dynamically mapped fields, let’s create a new index and add some documents:
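A minimal sketch of these steps in Python, using only the standard library. Everything specific here is an assumption made for illustration: the cluster address `http://localhost:9200`, the index name `dynamic-demo`, the helper names, and the three documents, which are made up to match the fields discussed in this article.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local single-node cluster
INDEX = "dynamic-demo"            # hypothetical index name

# Hypothetical documents, keyed by document id.
DOCS = {
    "001": {"source_ip": "10.1.2.3", "note": "first run"},
    "002": {"source_ip": "10.1.2.4", "note": "2019-03-01"},
    "003": {"source_ip": "10.1.2.5", "start": "2019-03-02T08:00:00Z"},
}


def build_request(method, path, body=None):
    """Build (but do not send) a JSON request for the cluster."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        ES_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )


def demo_requests():
    """Create the index with no mapping, then index each document by id."""
    requests = [build_request("PUT", f"/{INDEX}")]
    for doc_id, doc in DOCS.items():
        requests.append(build_request("PUT", f"/{INDEX}/_doc/{doc_id}", doc))
    return requests


def send(request):
    """Send one request and decode the JSON response (needs a running cluster)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a cluster running, `for r in demo_requests(): send(r)` would create the index and let dynamic mapping pick a type for each new key as it first appears.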
These documents are now fully indexed and available for searching:
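As a sketch of what such a search could look like from Python: the body below would be sent as `POST /<index>/_search`. The helper names are hypothetical, and `SAMPLE_RESPONSE` is a hand-written, trimmed illustration of the response shape (assuming Elasticsearch 7 or later), not captured output.

```python
def match_query(field, text):
    """Build a full-text match query body for the _search endpoint."""
    return {"query": {"match": {field: text}}}


def hit_ids(search_response):
    """Pull the document ids out of a search response."""
    return [hit["_id"] for hit in search_response["hits"]["hits"]]


# Illustrative response shape only; the values are made up.
SAMPLE_RESPONSE = {
    "hits": {
        "total": {"value": 1, "relation": "eq"},
        "hits": [
            {
                "_index": "dynamic-demo",
                "_id": "001",
                "_source": {"source_ip": "10.1.2.3", "note": "first run"},
            }
        ],
    }
}
```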
Problems with dynamic mapping
Relying on dynamic mapping also has some practical shortcomings, notably: it only creates simple mappings, mapping errors can be hard to fix, and data errors go unreported.
Dynamic mapping only creates simple mappings for a small number of Elasticsearch's data types. If you need to store IPv6 addresses or geo points, you will not be able to rely on dynamic mapping.
Let’s look at the dynamic mapping created for the “source_ip” field:
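One way to inspect a mapping is to request `GET /<index>/_mapping` and pick the field out of the response. The sketch below assumes the typeless response shape of Elasticsearch 7 or later; `field_mapping` and the index name are hypothetical, and `SAMPLE_MAPPING` is hand-written to show the default mapping a dynamically detected string receives, not captured output.

```python
def field_mapping(mapping_response, index, field):
    """Return the mapping definition of one top-level field."""
    return mapping_response[index]["mappings"]["properties"][field]


# Illustrative shape of a GET /dynamic-demo/_mapping response.
SAMPLE_MAPPING = {
    "dynamic-demo": {
        "mappings": {
            "properties": {
                "source_ip": {
                    "type": "text",
                    "fields": {
                        "keyword": {"type": "keyword", "ignore_above": 256}
                    },
                }
            }
        }
    }
}
```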
“source_ip” got mapped as a text field, not as an IP address field. It will work for search, but you won’t be able to run IP-address-specific range searches against it.
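One reason a string-typed field cannot stand in for an `ip` field is that string ordering is not address ordering. The sketch below shows this with Python's `ipaddress` module, alongside an explicit `ip` mapping and a CIDR-style term query of the kind an `ip` field supports. The field name follows the article; the addresses and constant names are made up for illustration.

```python
import ipaddress

# Explicit mapping that stores source_ip as an IP address.
IP_MAPPING = {"mappings": {"properties": {"source_ip": {"type": "ip"}}}}

# Against an ip field, a term query can take CIDR notation.
CIDR_QUERY = {"query": {"term": {"source_ip": "10.1.0.0/16"}}}

ADDRESSES = ["10.1.2.9", "10.1.2.10", "10.1.2.100"]  # made-up sample values

# Compared as strings, "10.1.2.10" sorts before "10.1.2.9".
as_strings = sorted(ADDRESSES)

# Compared as addresses, the order is numeric.
as_addresses = sorted(ADDRESSES, key=ipaddress.ip_address)


def in_network(address, cidr):
    """True when the address falls inside the CIDR block."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(cidr)
```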
Mappings are a "one time only" deal. On an index-by-index basis, a field mapping must exist for every key in a document in order to index the document, and once a mapping exists, its data type can't be changed without a reindex. The automatic mapping created for the first instance of a key/value pair in an index will be used for every subsequent field with that key. If the mapping is incorrect, your only option to correct the mistake is a reindex.
We can see this in documents 001 and 002:
Document 002 has a date field indexed with a text data type. It’s hard to say what problems this might cause later on, but in our experience, errors in this class of data mismapping are hard to detect and fix.
Once a mapping is set up, Elasticsearch won’t index documents with incompatible data. For example, document 003 had a “start” field with a date. That was dynamically mapped as a date field. If we try to add a new document with a different data type in a “start” field, we get an error:
Dynamic mapping is meant to accept any data. If your data has a regular structure, data shape mistakes will just be silently accepted and indexed with an auto-generated mapping. "Fixing" these mis-mapped data fields can be done with a reindex, but what's even harder is detecting that these errors exist at all.
This error is easy to fix with a reindex, but as a false-negative in a search it can be very hard to detect.
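A reindex of this kind usually has two steps: create a new index with the corrected mapping, then copy the documents across with the `_reindex` API. The sketch below only builds the two requests. The cluster address and the helper names are assumptions, and the caller supplies the index names and the corrected mapping.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local single-node cluster


def build_request(method, path, body=None):
    """Build (but do not send) a JSON request for the cluster."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        ES_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )


def reindex_requests(source_index, dest_index, dest_properties):
    """Requests that create a correctly mapped index and copy documents into it."""
    create = build_request(
        "PUT", f"/{dest_index}", {"mappings": {"properties": dest_properties}}
    )
    copy = build_request(
        "POST",
        "/_reindex",
        {"source": {"index": source_index}, "dest": {"index": dest_index}},
    )
    return [create, copy]
```

Each request can then be sent with `urllib.request.urlopen`. Documents whose values do not fit the corrected mapping would still be rejected during the copy, so the new mapping has to suit the data already stored.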
Of course, Elasticsearch allows you to define mappings ahead of time. You can create them when you create an index (recommended), or you can set them on an existing index (note: doing so won't affect any data in the index, so be careful).
When you create an index mapping you can use any of Elasticsearch's data types, set up complex field definitions, control the analysis of the field, and many other aspects controlling ingest, indexing, and ultimately search.
If you are creating a mapping, you can opt to disable dynamic mapping. You can set Elasticsearch to either not index fields without a mapping, or to raise an error and reject the document.
If you are working with a fairly regular dataset, we recommend disabling dynamic mapping with the strict setting: errors will be raised quickly and obviously at document index time. These are the easiest errors to fix.
Here we set up an index with strict mapping:
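As a sketch, the body of such a create-index request (`PUT /<index>`) could be built as follows. `explicit_mapping` is a hypothetical helper and the field names in the usage line are made up; the `dynamic` setting itself takes `"strict"`, `true`, or `false`.

```python
import json


def explicit_mapping(properties, dynamic="strict"):
    """Build a create-index body with explicit fields and a `dynamic` setting."""
    if dynamic not in ("strict", True, False):
        raise ValueError("dynamic must be 'strict', True, or False")
    return {"mappings": {"dynamic": dynamic, "properties": properties}}


# Hypothetical fields for illustration.
strict_body = explicit_mapping(
    {"name": {"type": "keyword"}, "start": {"type": "date"}}
)
strict_json = json.dumps(strict_body)  # what would be sent as the request body
```

Passing `dynamic=False` to the same helper produces the body for an index that ignores unmapped fields instead of rejecting them.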
Then, index a valid document, and a document with an extra field:
The second document wasn’t indexed.
With dynamic mapping set to false, any fields not explicitly covered by the mapping aren’t indexed. They are still stored in the _source field, and they don’t cause errors.
Create a very similar index to above, but this time with dynamic set to false:
Index two documents, the second having extra unmapped data:
This time we see both documents are accepted without errors being raised.
The effect of unmapped, unindexed fields is clearly demonstrated by the difference between getting a document directly and searching for it:
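The difference can be modelled with a small in-memory toy. This is not how Elasticsearch is implemented; `ToyIndex` is a hypothetical class that only mimics the two behaviors described here: the full document is kept as the source, but only mapped fields are searchable.

```python
class ToyIndex:
    """In-memory toy: keeps the full source, indexes only mapped fields."""

    def __init__(self, mapped_fields, dynamic=False):
        self.mapped_fields = set(mapped_fields)
        self.dynamic = dynamic  # False or "strict"
        self.sources = {}       # doc id -> full original document
        self.inverted = {}      # (field, value) -> set of doc ids

    def index(self, doc_id, doc):
        unmapped = set(doc) - self.mapped_fields
        if unmapped and self.dynamic == "strict":
            # strict: reject the whole document
            raise ValueError(f"unmapped fields rejected: {sorted(unmapped)}")
        self.sources[doc_id] = dict(doc)
        for field, value in doc.items():
            if field in self.mapped_fields:  # unmapped fields are skipped
                self.inverted.setdefault((field, value), set()).add(doc_id)

    def get(self, doc_id):
        """Direct fetch: returns everything that was sent."""
        return self.sources[doc_id]

    def search(self, field, value):
        """Exact-value search: only sees fields that were indexed."""
        return sorted(self.inverted.get((field, value), set()))
```

Indexing `{"name": "beta", "extra": "surprise"}` into `ToyIndex({"name"})` keeps `extra` in what `get` returns, while `search("extra", "surprise")` finds nothing.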
There are some amazing fine-grained controls for dynamic mapping. It can be disabled both at the mapping type level and at a sub-object level. It is even possible to use wildcard patterns to apply mappings semi-dynamically. Refer to the documentation for full details.
We can demonstrate this by creating a new index with strict mapping that includes a “parameters” field which dynamically maps nested data:
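A sketch of what such a create-index body could look like, written as a Python dictionary. The `parameters` field comes from the article; the `name` field and the helper `accepted_at_top_level` are hypothetical, and the helper only mimics the strict check on top-level keys.

```python
import json

# Strict at the top level, dynamic inside the "parameters" object.
NESTED_DYNAMIC_BODY = {
    "mappings": {
        "dynamic": "strict",
        "properties": {
            "name": {"type": "keyword"},  # hypothetical field
            "parameters": {"type": "object", "dynamic": True},
        },
    }
}


def accepted_at_top_level(doc, body=NESTED_DYNAMIC_BODY):
    """Mimic the strict check: every top-level key must already be mapped."""
    allowed = set(body["mappings"]["properties"])
    return set(doc) <= allowed


body_json = json.dumps(NESTED_DYNAMIC_BODY)  # request body for PUT /<index>
```

Under this mapping, new keys inside `parameters` are mapped dynamically as they arrive, while an unknown key next to `name` causes the document to be rejected.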
Add two documents to this new index:
Dynamic mapping based on the incoming “parameters” data will both change the mapping, and make different fields available for search:
In the next article we will discuss how helpful index aliases can be for your Elasticsearch cluster maintenance operations.
Ian Truslove is a co-founder of Cambium Consulting. He specializes in building large-scale resilient data processing systems using tools like Clojure and Elasticsearch. When not hunched over an Emacs terminal, you might find him on a bike in the wilds of Colorado.