[DOCS] Rewrite "What is Elasticsearch?" (Part 1) #112213

Merged
merged 15 commits into from
Aug 29, 2024
202 changes: 72 additions & 130 deletions docs/reference/intro.asciidoc
@@ -1,42 +1,77 @@
[[elasticsearch-intro]]
== What is {es}?
_**You know, for search (and analysis)**_

{es} is the distributed search and analytics engine at the heart of
the {stack}. {ls} and {beats} facilitate collecting, aggregating, and
enriching your data and storing it in {es}. {kib} enables you to
interactively explore, visualize, and share insights into your data and manage
and monitor the stack. {es} is where the indexing, search, and analysis
magic happens.

{es} provides near real-time search and analytics for all types of data. Whether you
have structured or unstructured text, numerical data, or geospatial data,
{es} can efficiently store and index it in a way that supports fast searches.
You can go far beyond simple data retrieval and aggregate information to discover
trends and patterns in your data. And as your data and query volume grows, the
distributed nature of {es} enables your deployment to grow seamlessly right
along with it.

While not _every_ problem is a search problem, {es} offers speed and flexibility
to handle data in a wide variety of use cases:

* Add a search box to an app or website
* Store and analyze logs, metrics, and security event data
* Use machine learning to automatically model the behavior of your data in real
time
* Use {es} as a vector database to create, store, and search vector embeddings
* Automate business workflows using {es} as a storage engine
* Manage, integrate, and analyze spatial information using {es} as a geographic
information system (GIS)
* Store and process genetic data using {es} as a bioinformatics research tool

We’re continually amazed by the novel ways people use search. But whether
your use case is similar to one of these, or you're using {es} to tackle a new
problem, the way you work with your data, documents, and indices in {es} is
the same.

https://github.com/elastic/elasticsearch[{es}] is a distributed, RESTful search and analytics engine, scalable data store, and vector database built in Java on top of the Apache Lucene library.
Use {es} to search, index, store, and analyze data of all shapes and sizes in near real-time.

[TIP]
====
{es} has a lot of features. Explore the full list on the https://www.elastic.co/elasticsearch/features[product webpage^].
====

{es} is the heart of the <<elasticsearch-intro-elastic-stack,{stack}>> and powers the Elastic https://www.elastic.co/enterprise-search[Search], https://www.elastic.co/observability[Observability], and https://www.elastic.co/security[Security] solutions.

{es} is used for a wide and growing range of use cases. Here are a few examples:

* *Monitor log and event data*. Store and analyze logs, metrics, and security event data for operational insights and SIEM.
* *Build search applications*. Add search capabilities to apps or websites and build enterprise search engines over your organization's internal data sources.
* *Vector database*. Store and search vectorized data, and create vector embeddings with built-in and third-party NLP models.
* *Retrieval augmented generation (RAG)*. Use {es} as a retrieval engine to augment Generative AI models.
* *Application and security monitoring*. Monitor and analyze application performance and security data effectively.
* *Machine learning*. Use {ml} to automatically model the behavior of your data in real-time.

This is just a sample of search, observability, and security use cases enabled by {es}.
Refer to our https://www.elastic.co/customers/success-stories[customer success stories] for concrete examples across a range of industry verticals.
// Link to demos, search labs chatbots

[discrete]
[[elasticsearch-intro-elastic-stack]]
.What is the Elastic Stack?
*******************************
The Elastic Stack refers to the suite of products enabled by {es}:
Review comment (Contributor):
"enabled by" is a little confusing. I'm ok with leaving this as is for now, but I'd like this to speak a little more to the idea of products that feed data into es or consume data from es. I think when we port in the stack overview, this can be pivoted to more of a "ES sits in the middle of lots of products" statement and link out, so it doesn't feel so overwhelming.

Reply (Contributor Author):
No, I agree "enabled" isn't right.
* https://www.elastic.co/guide/en/kibana/current/index.html[Kibana]. A UI for visualizing and exploring data in {es}.
* https://www.elastic.co/guide/en/elasticsearch/client/index.html[Client libraries]. Work with {es} in your preferred programming language.
* https://www.elastic.co/guide/en/logstash/current/introduction.html[Logstash]. A server-side data processing pipeline for ingesting and transforming data from multiple sources and indexing into {es}.
* https://www.elastic.co/guide/en/fleet/current/fleet-overview.html[Fleet and Elastic Agent.] Elastic Agent is a single, unified way to add monitoring for logs, metrics, and other types of data to a host. Fleet is a central place to configure and monitor your Elastic Agents.
* https://www.elastic.co/guide/en/beats/libbeat/current/beats-reference.html[Beats]. Lightweight data shippers for sending data from edge machines to {es}.
* https://www.elastic.co/guide/en/observability/current/apm.html[APM]. Monitor the performance of your applications.
* https://www.elastic.co/guide/en/elasticsearch/hadoop/current/float.html[{es} Hadoop]. Use {es} as a Hadoop input/output format.

https://www.elastic.co/guide/en/starting-with-the-elasticsearch-platform-and-its-solutions/current/stack-components.html[Learn more about the Elastic Stack].
*******************************
// TODO: Remove once we've moved Stack Overview to a subpage?

[discrete]
[[elasticsearch-intro-deploy]]
=== Deployment options

To use {es}, you need a running instance of the {es} service.
You can deploy {es} in various ways:
Review comment (Contributor):
I think we need clearer "some of these cost money and some of these are free" messaging. ECE and ECK are especially confusing in comparison to self-managed. Can also be a follow-up.

Reply (Contributor Author):
I think it's OK to just list the options and let devs decide. Free trials imply paid services. And honestly I don't think anyone "gets started" with ES by using ECE or ECK. I like concision here and we have links to learn more.

Reply (Contributor Author):
Maybe we could categorize self-managed, ECE, and ECK into an "advanced deployment options" section.

Reply (Contributor):
That would work for me, although self-managed is also the beginner/builder path.

Reply (Contributor Author):
That's why local dev has its own category and has top billing 😄

Reply (Contributor):
Ah, my bad.

* https://elastic.co/guide/en/cloud/current/ec-getting-started.html[*Elastic Cloud*]. {es} is available as part of our hosted Elastic Stack offering, deployed in the cloud with your provider of choice. Sign up for a https://cloud.elastic.co/registration[14-day free trial].
* https://elastic.co/guide/en/cloud-enterprise/current/Elastic-Cloud-Enterprise-overview.html[*Elastic Cloud Enterprise*]. Deploy Elastic Cloud on public or private clouds, virtual machines, or your own premises.
* https://elastic.co/guide/en/cloud-on-k8s/current/k8s-overview.html[*Elastic Cloud on Kubernetes*]. Deploy Elastic Cloud on Kubernetes.
* https://www.elastic.co/docs/current/serverless[*Elastic Cloud Serverless* (technical preview)]. Create serverless projects for autoscaled and fully managed {es} deployments. Sign up for a https://cloud.elastic.co/serverless-registration[14-day free trial].
* <<elasticsearch-deployment-options,*Self managed*>>. Install, configure, and run {es} on your own premises.
+
[TIP]
====
If you just want to get started quickly with a minimal local setup, refer to <<run-elasticsearch-locally,Run {es} locally>>.
====

[discrete]
[[elasticsearch-next-steps]]
=== Learn more

* <<getting-started, Quickstart>>. A beginner's guide to deploying your first {es} instance, indexing data, and running queries.
* https://elastic.co/webinars/getting-started-elasticsearch[Webinar: Introduction to {es}]. Register for our live webinars to learn directly from {es} experts.
* https://www.elastic.co/search-labs[Elastic Search Labs]. Tutorials and blogs that explore AI-powered search using the latest {es} features.
** Follow our tutorial https://www.elastic.co/search-labs/tutorials/search-tutorial/welcome[to build a hybrid search solution in Python].
** Check out the https://github.com/elastic/elasticsearch-labs?tab=readme-ov-file#elasticsearch-examples--apps[`elasticsearch-labs` repository] for a range of Python notebooks and apps for various use cases.

[[documents-indices]]
=== Data in: documents and indices
=== Documents and indices

{es} is a distributed document store. Instead of storing information as rows of
columnar data, {es} stores complex data structures that have been serialized
@@ -65,8 +100,7 @@
behavior makes it easy to index and explore your data--just start
indexing documents and {es} will detect and map booleans, floating point and
integer values, dates, and strings to the appropriate {es} data types.
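The type detection described above can be pictured with a small sketch. This is purely illustrative (it is not how {es} is implemented, and the date format check is simplified), but it shows the kind of value-to-type inference that dynamic mapping performs:

```python
# Illustrative sketch only: how dynamic mapping might infer Elasticsearch
# field types from a JSON document's values.
from datetime import datetime

def infer_field_type(value):
    """Map a JSON value to the data type dynamic mapping would typically assign."""
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        try:
            # Strings that parse as dates are mapped to the date type.
            datetime.strptime(value, "%Y-%m-%d")
            return "date"
        except ValueError:
            return "text"
    return "object"

doc = {"active": True, "age": 42, "score": 3.14, "hired": "2024-08-29", "name": "Ada"}
mapping = {field: infer_field_type(v) for field, v in doc.items()}
print(mapping)
```

Field names here (`active`, `age`, `hired`, ...) are made up for the example; real dynamic mapping also honors configurable date formats and numeric detection settings.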

Ultimately, however, you know more about your data and how you want to use it
than {es} can. You can define rules to control dynamic mapping and explicitly
You can define rules to control dynamic mapping and explicitly
define mappings to take full control of how fields are stored and indexed.

Defining your own mappings enables you to:
@@ -88,94 +122,6 @@
The analysis chain that is applied to a full-text field during indexing is also
used at search time. When you query a full-text field, the query text undergoes
the same analysis before the terms are looked up in the index.

[[search-analyze]]
=== Information out: search and analyze

While you can use {es} as a document store and retrieve documents and their
metadata, the real power comes from being able to easily access the full suite
of search capabilities built on the Apache Lucene search engine library.

{es} provides a simple, coherent REST API for managing your cluster and indexing
and searching your data. For testing purposes, you can easily submit requests
directly from the command line or through the Developer Console in {kib}. From
your applications, you can use the
https://www.elastic.co/guide/en/elasticsearch/client/index.html[{es} client]
for your language of choice: Java, JavaScript, Go, .NET, PHP, Perl, Python
or Ruby.

[discrete]
[[search-data]]
==== Searching your data

The {es} REST APIs support structured queries, full text queries, and complex
queries that combine the two. Structured queries are
similar to the types of queries you can construct in SQL. For example, you
could search the `gender` and `age` fields in your `employee` index and sort the
matches by the `hire_date` field. Full-text queries find all documents that
match the query string and return them sorted by _relevance_&mdash;how good a
match they are for your search terms.

In addition to searching for individual terms, you can perform phrase searches,
similarity searches, and prefix searches, and get autocomplete suggestions.

Have geospatial or other numerical data that you want to search? {es} indexes
non-textual data in optimized data structures that support
high-performance geo and numerical queries.

You can access all of these search capabilities using {es}'s
comprehensive JSON-style query language (<<query-dsl, Query DSL>>). You can also
construct <<sql-overview, SQL-style queries>> to search and aggregate data
natively inside {es}, and JDBC and ODBC drivers enable a broad range of
third-party applications to interact with {es} via SQL.
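A combined structured and full-text search can be expressed in one Query DSL request body. The sketch below builds such a body as a plain dictionary; the index and field names (`about`, `gender`, `age`, `hire_date`) are hypothetical, chosen to match the employee example above, but the `bool`/`match`/`term`/`range` structure is standard Query DSL:

```python
import json

# A bool query: one full-text clause scored for relevance, plus
# structured filter clauses that do not affect scoring.
query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"about": "search analytics"}}      # full-text, relevance-scored
            ],
            "filter": [
                {"term": {"gender": "F"}},                     # exact structured match
                {"range": {"age": {"gte": 30, "lte": 40}}}     # numeric range
            ]
        }
    },
    "sort": [{"hire_date": "desc"}]
}
print(json.dumps(query, indent=2))
```

In practice you would send this body to the `_search` endpoint via a client library or the {kib} Developer Console.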

[discrete]
[[analyze-data]]
==== Analyzing your data

{es} aggregations enable you to build complex summaries of your data and gain
insight into key metrics, patterns, and trends. Instead of just finding the
proverbial “needle in a haystack”, aggregations enable you to answer questions
like:

* How many needles are in the haystack?
* What is the average length of the needles?
* What is the median length of the needles, broken down by manufacturer?
* How many needles were added to the haystack in each of the last six months?

You can also use aggregations to answer more subtle questions, such as:

* What are your most popular needle manufacturers?
* Are there any unusual or anomalous clumps of needles?

Because aggregations leverage the same data structures used for search, they are
also very fast. This enables you to analyze and visualize your data in real time.
Your reports and dashboards update as your data changes so you can take action
based on the latest information.

What’s more, aggregations operate alongside search requests. You can search
documents, filter results, and perform analytics at the same time, on the same
data, in a single request. And because aggregations are calculated in the
context of a particular search, you’re not just displaying a count of all
size 70 needles, you’re displaying a count of the size 70 needles
that match your users' search criteria--for example, all size 70 _non-stick
embroidery_ needles.
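The needle examples above translate into a single request body that searches and aggregates at once. The index and field names (`manufacturer`, `length_mm`, `added_at`) are hypothetical, but `terms`, `avg`, and `date_histogram` are standard aggregation types:

```python
# One request body: filter to matching documents, then aggregate only
# over those results -- the "size 70 non-stick embroidery" scenario.
request_body = {
    "query": {"match": {"description": "non-stick embroidery"}},
    "aggs": {
        "by_manufacturer": {
            "terms": {"field": "manufacturer"},                 # bucket per manufacturer
            "aggs": {
                "avg_length": {"avg": {"field": "length_mm"}}   # metric inside each bucket
            }
        },
        "added_per_month": {
            "date_histogram": {"field": "added_at", "calendar_interval": "month"}
        }
    }
}
```

Because the aggregations sit alongside the `query` clause, the counts and averages are computed only over documents matching the search, in the same round trip.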

[discrete]
[[more-features]]
===== But wait, there’s more

Want to automate the analysis of your time series data? You can use
{ml-docs}/ml-ad-overview.html[machine learning] features to create accurate
baselines of normal behavior in your data and identify anomalous patterns. With
machine learning, you can detect:

* Anomalies related to temporal deviations in values, counts, or frequencies
* Statistical rarity
* Unusual behaviors for a member of a population

And the best part? You can do this without having to specify algorithms, models,
or other data science-related configurations.
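To give a feel for baseline-and-deviation detection, here is a deliberately simplified sketch: flag points far from a statistical baseline. The {ml} features in {es} use far more sophisticated models than this, so treat it only as an intuition pump:

```python
# Toy anomaly detector: flag values more than `threshold` sample standard
# deviations from the mean. NOT how Elasticsearch ML works internally.
from statistics import mean, stdev

def find_anomalies(values, threshold=2.0):
    """Return indices of values deviating more than threshold * sigma from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mu) > threshold * sigma]

# A spike in request rate stands out against an otherwise steady baseline.
requests_per_minute = [120, 118, 125, 119, 121, 950, 117, 122]
print(find_anomalies(requests_per_minute))
```

The point of the real {ml} features is that you get this kind of result (and much better) without hand-tuning thresholds or choosing models yourself.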

[[scalability]]
=== Scalability and resilience: clusters, nodes, and shards
++++
@@ -255,13 +201,9 @@
create secondary clusters to serve read requests in geo-proximity to your users.
the active leader index and handles all write requests. Indices replicated to
secondary clusters are read-only followers.

[discrete]
[[admin]]
==== Care and feeding

As with any enterprise system, you need tools to secure, manage, and
monitor your {es} clusters. Security, monitoring, and administrative features
that are integrated into {es} enable you to use {kibana-ref}/introduction.html[{kib}]
as a control center for managing a cluster. Features like <<downsampling,
downsampling>> and <<index-lifecycle-management, index lifecycle management>>
help you intelligently manage your data over time.
2 changes: 1 addition & 1 deletion docs/reference/modules/shard-ops.asciidoc
@@ -1,7 +1,7 @@
[[shard-allocation-relocation-recovery]]
=== Shard allocation, relocation, and recovery

Each <<documents-indices,index>> in Elasticsearch is divided into one or more <<scalability,shards>>.
Each index in {es} is divided into one or more <<scalability,shards>>.
Each document in an index belongs to a single shard.

A cluster can contain multiple copies of a shard. Each shard has one distinguished shard copy called the _primary_, and zero or more non-primary copies called _replicas_. The primary shard copy serves as the main entry point for all indexing operations. The operations on the primary shard copy are then forwarded to its replicas.
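Routing a document to a shard can be sketched as a hash of the routing value modulo the primary shard count. {es} documents this as roughly `shard = hash(_routing) % number_of_primary_shards`, where `_routing` defaults to the document's `_id`; the CRC32 hash below is only a stand-in for the murmur3 hash used internally:

```python
import zlib

def route_to_shard(doc_id, num_primary_shards):
    """Deterministically pick a primary shard for a document ID."""
    routing_hash = zlib.crc32(doc_id.encode("utf-8"))  # stand-in for murmur3
    return routing_hash % num_primary_shards

# The same _id always lands on the same shard -- which is also why the
# number of primary shards is fixed when an index is created: changing
# it would invalidate every previously computed route.
for doc_id in ["user-1", "user-2", "user-3"]:
    print(doc_id, "-> shard", route_to_shard(doc_id, 3))
```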
@@ -2,7 +2,7 @@

[[near-real-time]]
=== Near real-time search
The overview of <<documents-indices,documents and indices>> indicates that when a document is stored in {es}, it is indexed and fully searchable in _near real-time_--within 1 second. What defines near real-time search?
When a document is stored in {es}, it is indexed and fully searchable in _near real-time_--within 1 second. What defines near real-time search?
Review comment (Contributor):
I'm now realizing we often have a hyphen in the noun form (there's one here, _near real-time_), and arguably we'd need to add one to "near-real-time search" if we're being picky, which isn't worth it. So it'd be fair to undo that other change I suggested, and just standardize on the hyphenated form for simplicity and readability.

Lucene, the Java library on which {es} is based, introduced the concept of per-segment search. A _segment_ is similar to an inverted index, but the word _index_ in Lucene means "a collection of segments plus a commit point". After a commit, a new segment is added to the commit point and the buffer is cleared.
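The buffer-then-segment behavior described above is what makes search near real-time rather than instantaneous. A toy model (illustrative only, nothing like Lucene's actual data structures):

```python
# Toy model: new documents sit in an in-memory buffer and are invisible
# to search until a refresh turns the buffer into a new segment.
class ToyIndex:
    def __init__(self):
        self.segments = []   # searchable segments
        self.buffer = []     # indexed but not yet searchable

    def add(self, doc):
        self.buffer.append(doc)

    def refresh(self):
        """Flush the buffer into a new searchable segment."""
        if self.buffer:
            self.segments.append(list(self.buffer))
            self.buffer.clear()

    def search(self, term):
        return [d for seg in self.segments for d in seg if term in d]

idx = ToyIndex()
idx.add("quick brown fox")
print(idx.search("fox"))   # empty: the document is buffered, not yet searchable
idx.refresh()
print(idx.search("fox"))   # found: refresh made it visible
```

In {es}, this refresh happens automatically (by default once per second), which is where the "searchable within 1 second" figure comes from.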
