- built and tested using Elasticsearch v2.x, v5.x, v6.x and v7.x.
Elasticsearch is a distributed NoSQL document store search-engine and column-oriented database, whose fast (near real-time) reads and powerful aggregation engine make it an excellent choice as an 'analytics database' for R&D, production-use or both. Installation is simple, it ships with sensible default settings that allow it to work effectively out-of-the-box, and all interaction is made via a set of intuitive and extremely well documented RESTful APIs. I've been using it for two years now and I am evangelical.
The elasticsearchr
package implements a simple Domain-Specific Language (DSL) for indexing, deleting, querying, sorting and aggregating data in Elasticsearch, from within R. The main purpose of this package is to remove the labour involved with assembling HTTP requests to Elasticsearch's REST APIs and processing the responses. Instead, users of this package need only send and receive data frames to Elasticsearch resources. Users needing richer functionality are encouraged to investigate the excellent elastic
package from the good people at rOpenSci.
This package is available on CRAN or from this GitHub repository. To install the latest development version from GitHub, make sure that you have the devtools
package installed (this comes bundled with RStudio), and then execute the following on the R command line:
devtools::install_github("alexioannides/elasticsearchr")
Elasticsearch can be downloaded here, where the instructions for installing and starting it can also be found. OS X users (such as myself) can also make use of Homebrew to install it with the command,
$ brew install elasticsearch
And then start it by executing $ elasticsearch
from within any Terminal window. Successful installation and start-up can be checked by navigating any web browser to http://localhost:9200
, where the following message should greet you (give or take the cluster name that changes with every restart),
{
"name" : "RF6t1Gr",
"cluster_name" : "elasticsearch",
"cluster_uuid" : "pag7iIG-TK271EahH0B0yA",
"version" : {
"number" : "6.1.1",
"build_hash" : "bd92e7f",
"build_date" : "2017-12-17T20:23:25.338Z",
"build_snapshot" : false,
"lucene_version" : "7.1.0",
"minimum_wire_compatibility_version" : "5.6.0",
"minimum_index_compatibility_version" : "5.0.0"
},
"tagline" : "You Know, for Search"
}
If you followed the installation steps above, you have just installed a single Elasticsearch 'node'. When not testing on your laptop, Elasticsearch usually comes in clusters of nodes (usually there are at least 3). The easiest easy way to get access to a managed Elasticsearch cluster is by using the Elastic Cloud managed service provided by Elastic (note that Amazon Web Services offer something similar too). For the rest of this brief tutorial I will assuming you're running a single node on your laptop (a great way of working with data that is too big for memory).
In Elasticsearch a 'row' of data is stored as a 'document'. A document is a JSON object - for example, the first row of R's iris
dataset,
# sepal_length sepal_width petal_length petal_width species
# 1 5.1 3.5 1.4 0.2 setosa
would be represented as follows using JSON,
{
"sepal_length": 5.1,
"sepal_width": 3.5,
"petal_length": 1.4,
"petal_width": 0.2,
"species": "setosa"
}
Documents are classified into 'types' and stored in an 'index'. In a crude analogy with traditional SQL databases that is often used, we would associate an index with a database instance and the document types as tables within that database. In practice this example is not accurate - it is better to think of all documents as residing in a single - possibly sparse - table (defined by the index), where the document types represent non-unique sub-sets of columns in the table. This is especially so as fields that occur in multiple document types (within the same index), must have the same data-type - for example, if "name"
exists in document type customer
as well as in document type address
, then "name"
will need to be a string
in both. Note, that 'types' are being slowly phased-out and in Elasticsearch v7.x there will only be indices.
Each document is considered a 'resource' that has a Uniform Resource Locator (URL) associated with it. Elasticsearch URLs all have the following format: http://your_cluster:9200/your_index/your_doc_type/your_doc_id
. For example, the above iris
document could be living at http://localhost:9200/iris/data/1
- you could even point a web browser to this location and investigate the document's contents.
Although Elasticsearch - like most NoSQL databases - is often referred to as being 'schema free', as we have already see this is not entirely correct. What is true, however, is that the schema - or 'mapping' as it's called in Elasticsearch - does not need to be declared up-front (although you certainly can do this). Elasticsearch is more than capable of guessing the types of fields based on new data indexed for the first time.
For more information on any of these basic concepts take a look here
elasticsearchr
is a lightweight client - by this I mean that it only aims to do 'just enough' work to make using Elasticsearch with R easy and intuitive. You will still need to read the Elasticsearch documentation to understand how to compose queries and aggregations. What follows is a quick summary of what is possible.
Elasticsearch resources, as defined by the URLs described above, are defined as elastic
objects in elasticsearchr
. For example,
es <- elastic("http://localhost:9200", "iris", "data")
Refers to documents of type 'data' in the 'iris' index located on an Elasticsearch node on my laptop. Note that:
- it is possible to leave the document type empty if you need to refer to all documents in an index; and,
elastic
objects can be defined even if the underling resources have yet to be brought into existence.
To index (insert) data from a data frame, use the %index%
operator as follows:
elastic("http://localhost:9200", "iris", "data") %index% iris
In this example, the iris
dataset is indexed into the 'iris' index and given a document type called 'data'. Note that I have not provided any document ids here. To explicitly specify document ids there must be a column in the data frame that is labelled id
, from which the document ids will be taken.
Documents can be deleted in three different ways using the %delete%
operator. Firstly, an entire index (including the mapping information) can be erased by referencing just the index in the resource - e.g.,
elastic("http://localhost:9200", "iris") %delete% TRUE
Alternatively, documents can be deleted on a type-by-type basis leaving the index and it's mappings untouched, by referencing both the index and the document type as the resource - e.g.,
elastic("http://localhost:9200", "iris", "data") %delete% TRUE
Finally, specific documents can be deleted by referencing their ids directly - e.g.,
elastic("http://localhost:9200", "iris", "data") %delete% c("1", "2", "3", "4", "5")
Any type of query that Elasticsearch makes available can be defined in a query
object using the native Elasticsearch JSON syntax - e.g. to match every document we could use the match_all
query,
for_everything <- query('{
"match_all": {}
}')
To execute this query we use the %search%
operator on the appropriate resource - e.g.,
elastic("http://localhost:9200", "iris", "data") %search% for_everything
# sepal_length sepal_width petal_length petal_width species
# 1 4.9 3.0 1.4 0.2 setosa
# 2 4.9 3.1 1.5 0.1 setosa
# 3 5.8 4.0 1.2 0.2 setosa
# 4 5.4 3.9 1.3 0.4 setosa
# 5 5.1 3.5 1.4 0.3 setosa
# 6 5.4 3.4 1.7 0.2 setosa
# ...
Sometimes only subset of all the available fields need to be returned, so it is much more efficient for Elasticsearch only to return the required data as opposed to all of it. This can be achieved as follows,
selected_fields <- select_fields('{
"includes": ["sepal_length", "species"]
}')
elastic("http://localhost:9200", "iris", "data") %search% (for_everything + selected_fields)
# species sepal_length
# 1 setosa 4.3
# 2 setosa 5.7
# 3 setosa 5.1
# 4 setosa 5.1
# 5 setosa 4.8
# 6 setosa 5.0
The selected fields are defined using Elasticsearch's source filtering API.
Query results can be sorted on multiple fields by defining a sort
object using the same Elasticsearch JSON syntax - e.g. to sort by sepal_width
in ascending order the required sort
object would be defined as,
by_sepal_width <- sort_on('{"sepal_width": {"order": "asc"}}')
This is then added to a query
object whose results we want sorted and executed using the %search%
operator as before - e.g.,
elastic("http://localhost:9200", "iris", "data") %search% (for_everything + by_sepal_width)
# sepal_length sepal_width petal_length petal_width species
# 1 5.0 2.0 3.5 1.0 versicolor
# 2 6.0 2.2 5.0 1.5 virginica
# 3 6.0 2.2 4.0 1.0 versicolor
# 4 6.2 2.2 4.5 1.5 versicolor
# 5 4.5 2.3 1.3 0.3 setosa
# 6 6.3 2.3 4.4 1.3 versicolor
# ...
Similarly, any type of aggregation that Elasticsearch makes available can be defined in an aggs
object - e.g. to compute the average sepal_width
per-species of flower we would specify the following aggregation,
avg_sepal_width <- aggs('{
"avg_sepal_width_per_species": {
"terms": {
"field": "species",
"size": 3
},
"aggs": {
"avg_sepal_width": {
"avg": {
"field": "sepal_width"
}
}
}
}
}')
(Elasticsearch 5.x and 6.x users please note that when using the out-of-the-box mappings the above aggregation requires that "field": "species"
be changed to "field": "species.keyword"
- see here for more information as to why)
This aggregation is also executed via the %search%
operator on the appropriate resource - e.g.,
elastic("http://localhost:9200", "iris", "data") %search% avg_sepal_width
# key doc_count avg_sepal_width.value
# 1 setosa 50 3.428
# 2 versicolor 50 2.770
# 3 virginica 50 2.974
Queries and aggregations can be combined such that the aggregations are computed on the results of the query. For example, to execute the combination of the above query and aggregation, we would execute,
elastic("http://localhost:9200", "iris", "data") %search% (for_everything + avg_sepal_width)
# key doc_count avg_sepal_width.value
# 1 setosa 50 3.428
# 2 versicolor 50 2.770
# 3 virginica 50 2.974
where the combination yields,
print(for_everything + avg_sepal_width)
# {
# "size": 0,
# "query": {
# "match_all": {
#
# }
# },
# "aggs": {
# "avg_sepal_width_per_species": {
# "terms": {
# "field": "species",
# "size": 0
# },
# "aggs": {
# "avg_sepal_width": {
# "avg": {
# "field": "sepal_width"
# }
# }
# }
# }
# }
# }
For comprehensive coverage of all query and aggregations types please refer to the rather excellent official documentation (newcomers to Elasticsearch are advised to start with the 'Query String' query).
We have also included the ability to create an empty index with a custom mapping, using the %create%
operator - e.g.,
elastic("http://localhost:9200", "iris") %create% mapping_default_simple()
Where in this instance mapping_default_simple()
is a default mapping that I have shipped with elasticsearchr
. It switches-off the text analyser for all fields of type 'string' (i.e. switches off free text search), allows all text search to work with case-insensitive lower-case terms, and maps any field with the name 'timestamp' to type 'date', so long as it has the appropriate string or long format.
We have also added the ability to retrieve basic information from the cluster, using the %info%
operator. For example, to retrieve a list of all avaliable indices in the cluster,
elastic("http://localhost:9200", "*") %info% list_indices()
# [1] "iris"
Or to list all of the available fields in an index,
elastic("http://localhost:9200", "iris") %info% list_fields()
# [1] "petal_length" "petal_width" "sepal_length" "sepal_width" "species"
A big thank you to Hadley Wickham and Jeroen Ooms, the authors of the httr
and jsonlite
packages that elasticsearchr
leans upon heavily. And, to the other contributors and supporters - your efforts are greatly appreciated!