Intro to OpenSearch
OpenSearch is a distributed search and analytics engine based on Apache Lucene. After adding your data to OpenSearch, you can perform full-text searches on it with all of the features you might expect: search by field, search multiple indexes, boost fields, rank results by score, sort results by field, and aggregate results.
Unsurprisingly, people often use search engines like OpenSearch as the backend for a search application---think Wikipedia or an online store. It offers excellent performance and can scale up and down as the needs of the application grow or shrink.
An equally popular, but less obvious use case is log analytics, in which you take the logs from an application, feed them into OpenSearch, and use the rich search and visualization functionality to identify issues. For example, a malfunctioning web server might throw a 500 error 0.5% of the time, which can be hard to notice unless you have a real-time graph of all HTTP status codes that the server has thrown in the past four hours. You can use OpenSearch Dashboards to build these sorts of visualizations from data in OpenSearch.
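As a rough illustration of the status-code graph described above, the following Python sketch builds a search request body that buckets log documents from the past four hours by time and counts each HTTP status code per bucket. The field names `@timestamp` and `status` and the five-minute interval are assumptions made for this example, not something OpenSearch requires. The `error_rate` helper is likewise hypothetical.

```python
import json


def status_code_histogram_query(hours: int = 4, interval: str = "5m") -> dict:
    """Build a search body that counts HTTP status codes over time.

    Assumes each log document has an `@timestamp` date field and a
    numeric `status` field (both names are illustrative).
    """
    return {
        "size": 0,  # return only aggregation results, not individual hits
        "query": {"range": {"@timestamp": {"gte": f"now-{hours}h"}}},
        "aggs": {
            "over_time": {
                "date_histogram": {
                    "field": "@timestamp",
                    "fixed_interval": interval,
                },
                "aggs": {"status_codes": {"terms": {"field": "status"}}},
            }
        },
    }


def error_rate(status_counts: dict, code: int = 500) -> float:
    """Fraction of responses that returned the given status code."""
    total = sum(status_counts.values())
    return status_counts.get(code, 0) / total if total else 0.0


if __name__ == "__main__":
    print(json.dumps(status_code_histogram_query(), indent=2))
    # 50 errors in 10,000 responses is the 0.5% case from the text.
    print(error_rate({200: 9950, 500: 50}))
```

The body would be sent to the `_search` endpoint of the index that holds the logs. OpenSearch Dashboards builds comparable queries for you when you create a visualization.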
OpenSearch's distributed design means that you interact with OpenSearch clusters. Each cluster is a collection of one or more nodes, which are servers that store your data and process search requests.
You can run OpenSearch locally on a laptop---its system requirements are minimal---but you can also scale a single cluster to hundreds of powerful machines in a data center.
In a single-node cluster, such as one running on a laptop, one machine has to do everything: manage the state of the cluster, index and search data, and perform any preprocessing of data prior to indexing it. As a cluster grows, however, you can subdivide responsibilities. Nodes with fast disks and plenty of RAM might be great at indexing and searching data, whereas a node with plenty of CPU power and a tiny disk could manage cluster state. For more information on setting node types, see Cluster formation.
OpenSearch organizes data into indexes. Each index is a collection of JSON documents. If you have a set of raw encyclopedia articles or log lines that you want to add to OpenSearch, you must first convert them to JSON. A simple JSON document for a movie might look like this:
{
  "title": "The Wind Rises",
  "release_date": "2013-07-20"
}
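For raw log lines, the conversion to JSON might look like the following minimal sketch. The log layout (`<timestamp> <method> <path> <status>`) and the resulting field names are assumptions chosen for illustration. Real logs usually need a more forgiving parser.

```python
import json
import re

# Assumed log layout for this example: "<ISO timestamp> <METHOD> <path> <status>"
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\S+) (?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3})$"
)


def log_line_to_document(line: str) -> dict:
    """Convert one raw log line into a dict that serializes to a JSON document."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    doc = match.groupdict()
    doc["status"] = int(doc["status"])  # store the status code as a number
    return doc


if __name__ == "__main__":
    raw = "2013-07-20T12:00:00Z GET /index.html 500"
    print(json.dumps(log_line_to_document(raw), indent=2))
```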
When you add the document to an index, OpenSearch adds some metadata, such as the unique document ID:
{
  "_index": "<index-name>",
  "_type": "_doc",
  "_id": "<document-id>",
  "_version": 1,
  "_source": {
    "title": "The Wind Rises",
    "release_date": "2013-07-20"
  }
}
Indexes also contain mappings and settings:
- A mapping is the collection of fields that documents in the index have. In this case, those fields are `title` and `release_date`.
- Settings include data like the index name, creation date, and number of shards.
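To make mappings and settings concrete, here is a hedged sketch of a request body that could be sent with `PUT /<index-name>` to create an index for the movie document. The field types (`text` and `date`) and the shard and replica counts are illustrative choices. If you omit the mapping, OpenSearch infers field types from the first documents you index. The index name and creation date are recorded by OpenSearch itself, so they do not appear in the body.

```python
import json


def create_index_body(primary_shards: int = 1, replicas: int = 1) -> dict:
    """Build a create-index body with explicit settings and mappings."""
    return {
        "settings": {
            "index": {
                "number_of_shards": primary_shards,
                "number_of_replicas": replicas,
            }
        },
        "mappings": {
            "properties": {
                "title": {"type": "text"},         # analyzed for full-text search
                "release_date": {"type": "date"},  # enables date sorting and ranges
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(create_index_body(primary_shards=1, replicas=1), indent=2))
```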
OpenSearch splits indexes into shards for even distribution across nodes in a cluster. For example, a 400 GB index might be too large for any single node in your cluster to handle, but split into ten shards, each one 40 GB, OpenSearch can distribute the shards across ten nodes and work with each shard individually.
By default, OpenSearch creates a replica shard for each primary shard. If you split your index into ten shards, for example, OpenSearch also creates ten replica shards. These replica shards act as backups in the event of a node failure---OpenSearch distributes replica shards to different nodes than their corresponding primary shards---but they also improve the speed and rate at which the cluster can process search requests. You might specify more than one replica per index for a search-heavy workload.
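The replica arithmetic above can be written down in a few lines. This is only a back-of-the-envelope sketch. The function names are hypothetical, and the node count assumes that no two copies of the same shard are placed on the same node.

```python
def total_shard_copies(primary_shards: int, replicas_per_primary: int = 1) -> int:
    """Total number of shards in the cluster for one index: primaries plus replicas."""
    return primary_shards * (1 + replicas_per_primary)


def min_nodes_for_full_allocation(replicas_per_primary: int = 1) -> int:
    """Fewest nodes needed so every copy of a shard sits on a different node."""
    return 1 + replicas_per_primary


if __name__ == "__main__":
    # Ten primaries with the default single replica: 20 shards in total.
    print(total_shard_copies(10))
    # A search-heavy index with two replicas per primary: 30 shards.
    print(total_shard_copies(10, replicas_per_primary=2))
    print(min_nodes_for_full_allocation(2))
```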
Despite being a piece of an OpenSearch index, each shard is actually a full Lucene index (confusing, we know). This detail is important, though, because each instance of Lucene is a running process that consumes CPU and memory. More shards are not necessarily better. Splitting a 400 GB index into 1,000 shards, for example, would place needless strain on your cluster. A good rule of thumb is to keep shard size between 10 and 50 GB.
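The sizing rule of thumb translates into simple arithmetic. The sketch below uses hypothetical helper names and treats the index size as a fixed, known number, which is a simplification because real indexes grow over time.

```python
import math
from typing import Tuple


def shard_size_gb(index_size_gb: float, primary_shards: int) -> float:
    """Average size of each primary shard."""
    return index_size_gb / primary_shards


def shard_count_range(
    index_size_gb: float, min_shard_gb: float = 10, max_shard_gb: float = 50
) -> Tuple[int, int]:
    """Fewest and most primary shards that keep shards within the size bounds."""
    fewest = max(1, math.ceil(index_size_gb / max_shard_gb))
    most = max(1, math.floor(index_size_gb / min_shard_gb))
    return fewest, max(fewest, most)


if __name__ == "__main__":
    print(shard_size_gb(400, 10))    # 40.0 GB per shard, inside the range
    print(shard_size_gb(400, 1000))  # 0.4 GB per shard, far too small
    print(shard_count_range(400))    # (8, 40)
```

For the 400 GB index, anything from 8 to 40 primary shards keeps each shard within the suggested range, so the ten-shard split qualifies and the 1,000-shard split does not.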
You interact with OpenSearch clusters using the REST API, which offers a lot of flexibility. You can use clients like curl or any programming language that can send HTTP requests. To add a JSON document to an OpenSearch index (that is, to index a document), you send an HTTP request:
PUT https://<host>:<port>/<index-name>/_doc/<document-id>
{
  "title": "The Wind Rises",
  "release_date": "2013-07-20"
}
To run a search for the document:
GET https://<host>:<port>/<index-name>/_search?q=wind
To delete the document:
DELETE https://<host>:<port>/<index-name>/_doc/<document-id>
Using the REST API, you can change most OpenSearch settings, modify indexes, check the health of the cluster, get statistics, and do almost everything else.
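Because the API is plain HTTP, the index, search, and delete requests shown above can be prepared from any language. The following Python sketch uses only the standard library. The base URL `https://localhost:9200` and the index name `movies` are placeholders assumed for this example, and authentication and TLS certificate handling are deliberately left out.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL for illustration; substitute your own host and port.
BASE_URL = "https://localhost:9200"


def index_request(index, doc_id, document, base_url=BASE_URL):
    """Prepare a PUT request that indexes one JSON document."""
    return urllib.request.Request(
        f"{base_url}/{index}/_doc/{doc_id}",
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def search_request(index, term, base_url=BASE_URL):
    """Prepare a GET request that runs a simple query-string search."""
    query = urllib.parse.urlencode({"q": term})
    return urllib.request.Request(f"{base_url}/{index}/_search?{query}", method="GET")


def delete_request(index, doc_id, base_url=BASE_URL):
    """Prepare a DELETE request for one document."""
    return urllib.request.Request(f"{base_url}/{index}/_doc/{doc_id}", method="DELETE")


def send(request):
    """Send a prepared request and decode the JSON response.

    Requires a reachable cluster; credentials and certificates are not handled here.
    """
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    req = index_request("movies", "1", {"title": "The Wind Rises", "release_date": "2013-07-20"})
    print(req.get_method(), req.full_url)
```

Separating request construction from `send` keeps the example inspectable without a running cluster. In practice, a dedicated OpenSearch client library is usually more convenient.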
The following section describes more advanced OpenSearch concepts.
Any index changes, such as document indexing or deletion, are written to disk during a Lucene commit. However, Lucene commits are expensive operations, so they cannot be performed after every change to the index. Instead, each shard records every indexing operation in a transaction log called the translog. When a document is indexed, it is added to the memory buffer and recorded in the translog. After a process or host restart, any data in the memory buffer is lost. Recording the document in the translog ensures durability because the translog is written to disk.
Frequent refresh operations write the documents in the memory buffer to a segment and then clear the memory buffer. Periodically, a flush performs a Lucene commit, which includes writing the segments to disk using `fsync`, purging the old translog, and starting a new translog. Thus, a translog contains all operations that have not yet been flushed.
Periodically, OpenSearch performs a refresh operation, which writes the documents from the in-memory Lucene index to files. These files are not guaranteed to be durable because an `fsync` is not performed. A refresh makes documents available for search.
A flush operation persists the files to disk using `fsync`, ensuring durability. Flushing ensures that the data stored only in the translog is recorded in the Lucene index. OpenSearch performs a flush as needed to ensure that the translog does not grow too large.
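The interplay of the memory buffer, translog, refresh, and flush can be modeled with a small toy class. This is a conceptual sketch only, not how OpenSearch or Lucene is implemented: the class name and attributes are invented for illustration, plain lists stand in for files, and the model assumes that a flush first moves any buffered documents into a segment.

```python
class ToyShard:
    """Toy model of a shard's write path. Lists stand in for memory and disk."""

    def __init__(self):
        self.memory_buffer = []  # indexed, not yet searchable, lost on restart
        self.segments = []       # searchable after refresh, not yet fsynced
        self.committed = []      # made durable by a flush (Lucene commit)
        self.translog = []       # durable record of operations since the last flush

    def index(self, doc):
        self.memory_buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        # Make buffered documents searchable without guaranteeing durability.
        self.segments.extend(self.memory_buffer)
        self.memory_buffer.clear()

    def flush(self):
        # Commit everything, then start a new, empty translog.
        self.refresh()
        self.committed.extend(self.segments)
        self.segments.clear()
        self.translog.clear()

    def searchable(self):
        return self.committed + self.segments

    def crash_and_recover(self):
        # Anything not committed is gone; replay the translog to restore it.
        self.memory_buffer.clear()
        self.segments.clear()
        self.memory_buffer.extend(self.translog)
        self.refresh()


if __name__ == "__main__":
    shard = ToyShard()
    shard.index("doc-a")
    shard.flush()
    shard.index("doc-b")        # only in the buffer and the translog
    shard.crash_and_recover()
    print(shard.searchable())   # both documents survive
```

In this model, `doc-b` survives the simulated restart only because it was recorded in the translog, which is the durability argument made above.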
In OpenSearch, a shard is a Lucene index, which consists of segments (or segment files). Segments store the indexed data and are immutable. Periodically, smaller segments are merged into larger ones. Merging reduces the overall number of segments on each shard, frees up disk space, and improves search performance. Eventually, segments reach a maximum size specified in the merge policy and are no longer merged into larger segments. The merge policy also specifies how often merges are performed.
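To illustrate how merging reduces the segment count until a size cap is reached, here is a deliberately simplified sketch. It is not Lucene's actual merge policy: the greedy strategy, the function name, and the `merge_factor` parameter are assumptions for this example, and it ignores the disk space reclaimed from deleted documents.

```python
def merge_small_segments(sizes_mb, max_segment_mb, merge_factor=2):
    """Greedily merge the smallest segments while the result stays under the cap.

    Returns the sorted segment sizes after no further merge is possible.
    """
    segments = sorted(sizes_mb)
    while len(segments) >= merge_factor:
        merged = sum(segments[:merge_factor])
        if merged > max_segment_mb:
            break  # the smallest possible merge would already exceed the cap
        # Segments are immutable, so a merge replaces the inputs with one new segment.
        segments = sorted(segments[merge_factor:] + [merged])
    return segments


if __name__ == "__main__":
    # Four small segments collapse into one; the large segment is left alone.
    print(merge_small_segments([1, 2, 3, 4, 100], max_segment_mb=50))
```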