
Add getting started content #6834

Merged
merged 24 commits into from
Apr 8, 2024

Conversation

kolchfa-aws
Collaborator

Closes #6533

Checklist

  • By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developers Certificate of Origin.
    For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Fanit Kolchina <[email protected]>
Contributor

@vagimeli vagimeli left a comment


Well done! Clear, simple instructions that cover the key concepts and information users need. It clarified my understanding of OpenSearch terms and use cases too :)


You interact with OpenSearch clusters using the REST API, which offers a lot of flexibility. Through the REST API, you can change most OpenSearch settings, modify indexes, check the health of the cluster, get statistics---almost everything. You can use clients like [cURL](https://curl.se/) or any programming language that can send HTTP requests.

You can send HTTP requests in your terminal or in the Dev Tools console in OpenSearch Dashboards.
Contributor


Should "Dev Tools console" hyperlink to the documentation? {{site.url}}{{site.baseurl}}/dashboards/dev-tools/index-dev/
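The REST interaction described above can be sketched without a running cluster by building (but not sending) the equivalent HTTP request in Python. This assumes a hypothetical local cluster at `http://localhost:9200`, OpenSearch's default REST port; it is an illustration, not part of the tutorial.

```python
import urllib.request

# A sketch only: construct, but do not send, a cluster health-check request,
# assuming a hypothetical local cluster at http://localhost:9200.
req = urllib.request.Request(
    url="http://localhost:9200/_cluster/health?pretty",
    method="GET",
)
print(req.get_method())  # GET
print(req.full_url)      # http://localhost:9200/_cluster/health?pretty
# To actually send it against a running cluster: urllib.request.urlopen(req)
```

The same request in curl would simply be `curl "http://localhost:9200/_cluster/health?pretty"`.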


For more information about `pretty` and other useful query parameters, see [Common REST parameters]({{site.url}}{{site.baseurl}}/opensearch/common-parameters/).

For requests that contain a body, specify the `Content-Type` header and provide the request payload in the `-d` (data) oprion:
Contributor


Suggested change
For requests that contain a body, specify the `Content-Type` header and provide the request payload in the `-d` (data) oprion:
For requests that contain a body, specify the `Content-Type` header and provide the request payload in the `-d` (data) option:
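The corrected sentence describes supplying a `Content-Type` header and a body via `-d`. A minimal Python sketch of the same request follows; the `students` index, document ID, and fields are illustrative only, and the request is built but not sent.

```python
import json
import urllib.request

# Sketch mirroring: curl -X PUT -H 'Content-Type: application/json' -d '{...}'
# The index name `students` and the document contents are illustrative.
doc = {"name": "John Doe", "gpa": 3.89, "grad_year": 2022}
req = urllib.request.Request(
    url="http://localhost:9200/students/_doc/1",
    data=json.dumps(doc).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
print(req.get_method())                # PUT
print(req.get_header("Content-type"))  # application/json
```

Note that `urllib` normalizes header capitalization internally; the wire format is the same as the curl form.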

kolchfa-aws and others added 2 commits April 3, 2024 15:32
Co-authored-by: Melissa Vagi <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Co-authored-by: Melissa Vagi <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
}
```

Both `John Doe` and `Jane Doe` matched the word `doe`, but `John Doe` is scored higher because it also matched `john`.
Contributor


It may be worth mentioning that the match query type uses OR as an operator by default, so the query is functionally doe OR john.
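The default-OR behavior mentioned in this comment can be made explicit in the query DSL. A sketch of the two equivalent bodies, using the `name` field from the tutorial's examples:

```python
import json

# The match query analyzes "john doe" into the terms ["john", "doe"] and, by
# default, combines them with OR, so the two bodies below are equivalent.
implicit = {"query": {"match": {"name": "john doe"}}}
explicit = {
    "query": {"match": {"name": {"query": "john doe", "operator": "or"}}}
}
print(json.dumps(explicit, indent=2))
```

Setting `"operator": "and"` instead would require both terms to match.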


## Search methods

Along with the traditional BM25 search described in this tutorial, OpenSearch supports a range of machine learning (ML)-powered search methods, including k-NN, semantic, multimodal, sparse, hybrid, and conversational search. For information about all search methods, see [Search]({{site.url}}{{site.baseurl}}/search-plugins/).
Contributor


It looks like the latest commit removed the description of BM25. (I think this is the only mention of BM25 in the page now.)

Maybe "Along with the traditional full-text search described..." ?

Comment on lines 133 to 135
Any index changes, such as document indexing or deletion, are written to disk during a Lucene commit. However, Lucene commits are expensive operations, so they cannot be performed after every change to the index. Instead, each shard records every indexing operation in a transaction log called _translog_. When a document is indexed, it is added to the memory buffer and recorded in the translog. After a process or host restart, any data in the in-memory buffer is lost. Recording the document in the translog ensures durability because the translog is written to disk.

Frequent refresh operations write the documents in the memory buffer to a segment and then clear the memory buffer. Periodically, a [flush](#flush) performs a Lucene commit, which includes writing the segments to disk using `fsync`, purging the old translog, and starting a new translog. Thus, a translog contains all operations that have not yet been flushed.
Contributor


I feel like this might be getting too detailed while muddying the key thing that users might need to know ("When is my data durable? When is my data searchable?").

I think we could say something like:

An indexing or bulk call responds when the documents have been written to the translog and the translog is flushed to disk, so the updates are durable. The updates will be visible from search requests until after a refresh operation (see below).

I almost feel like it would help to document these as steps in the lifecycle of an update, like:

  1. An update is received by a primary shard and gets written to the shard's transaction log, which is flushed to disk (followed by an fsync) before the update is acknowledged. This guarantees durability.
  2. The update is also passed to the Lucene index writer, which adds it to an in-memory buffer.
  3. On refresh, the Lucene index writer flushes the in-memory buffers to disk (with each buffer becoming a new Lucene segment), and a new index reader is opened over the resulting segment files. The updates are now visible for search.
  4. On a flush operation, the shard fsyncs the Lucene segments. Since the segment files are a durable representation of the updates, the translog is no longer needed to provide durability, so the updates can be purged from the translog.

If the OpenSearch process is terminated between the end of step 1 (when the update has been acknowledged) and the end of step 4 (when the updated Lucene segments have been flushed to disk), the updates will be replayed from the translog when the process restarts.

@smacrakis -- we talked briefly about this content. Is the above clearer or still too in-the-weeds?
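The four-step lifecycle proposed above can be modeled as a toy sketch, to make the "durable vs. searchable" distinction concrete. This is not OpenSearch code, just a simulation of when an indexed document becomes visible:

```python
# Toy model of the update lifecycle (steps 1 through 4 above).
# Not OpenSearch code; a sketch of when data is durable vs. searchable.
class Shard:
    def __init__(self):
        self.translog = []   # durable after fsync on each write (step 1)
        self.buffer = []     # Lucene index writer's in-memory buffer (step 2)
        self.segments = []   # searchable segments

    def index(self, doc):
        self.translog.append(doc)  # written and fsynced: durable
        self.buffer.append(doc)    # buffered: not yet searchable

    def refresh(self):
        if self.buffer:            # step 3: buffer becomes a new segment
            self.segments.append(list(self.buffer))
            self.buffer.clear()    # now visible to search

    def flush(self):
        self.refresh()             # step 4: segments fsynced to disk,
        self.translog.clear()      # so the translog can be purged

    def search(self, doc):
        return any(doc in seg for seg in self.segments)

s = Shard()
s.index("doc1")
print(s.search("doc1"))  # False: durable (in translog) but not yet searchable
s.refresh()
print(s.search("doc1"))  # True: visible after refresh
```

If the process dies before `flush()`, replaying the translog recovers `doc1`, which is the durability guarantee of step 1.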

}
```

You cannot change the mappings once the index is created.
Contributor


There are some mapping changes that are allowed. For example, new fields can be added. I believe you can change the search analyzer associated with a field.

Maybe "You cannot change the type of a field once it is created" ?

Collaborator Author


Changed to your suggestion. Also added "Changing a field type requires deleting the index and recreating it with the new mappings."
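A sketch of the distinction settled in this thread: new fields can be added to an existing mapping, while changing a field's type requires recreating the index. The index and field names below are illustrative, not taken from the tutorial.

```python
# Sketch: an initial mapping plus a body for adding a new field later.
# Adding new fields is allowed; changing an existing field's type requires
# deleting and recreating the index. Names here are illustrative.
initial_mappings = {
    "mappings": {
        "properties": {
            "name": {"type": "text"},
            "gpa": {"type": "float"},
        }
    }
}
# A later PUT /students/_mapping request could add a field without
# recreating the index:
add_field = {"properties": {"grad_year": {"type": "integer"}}}
```

Attempting instead to change, say, `gpa` from `float` to `keyword` on a live index would be rejected.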

Collaborator

@natebower natebower left a comment


@kolchfa-aws Great job putting this all together 😄. Please see my comments and changes and let me know if you have any questions. Thanks!

```
{% include copy-curl.html %}

This request returns no hits because the `keyword` fields must be matched exactly.
Collaborator

@natebower natebower Apr 4, 2024


Suggested change
This request returns no hits because the `keyword` fields must be matched exactly.
Then the request returns no hits because the `keyword` fields must exactly match.


This request returns no hits because the `keyword` fields must be matched exactly.

However, you can search for the exact text `John Doe`:
Collaborator


Same comment. Would this be better structured as "However, if you search for the exact text John Doe:

[Example]

Then OpenSearch returns..."?


### Filters

You can add a filter clause to your query for fields with exact values using a Boolean query.
Collaborator


The syntax here is slightly confusing. Do we mean "Using a Boolean query, you can add a filter clause to your query for fields with exact values"?

```
{% include copy-curl.html %}

Range filters support specifying a range of values. For example, the following Boolean query searches for students whose GPA is greater than 3.6:
Collaborator


Either "Range filters specify a range of values", "Range filters allow you to specify a range of values", or "With range filters, you can support a range of values".
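A sketch of the range-filter query the sentence describes, a Boolean query matching students whose GPA is greater than 3.6. The `gpa` field follows the tutorial's example; the exact body shape is an illustration.

```python
# Boolean query with a range filter: gpa > 3.6. The `gt` key is the
# greater-than bound; `gte`, `lt`, and `lte` are the other range operators.
query = {
    "query": {
        "bool": {
            "filter": [
                {"range": {"gpa": {"gt": 3.6}}}
            ]
        }
    }
}
```

Because it is a filter clause, this part of the query does not contribute to relevance scoring.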


## Search methods

Along with the traditional full-text search described in this tutorial, OpenSearch supports a range of machine learning (ML)-powered search methods, including k-NN, semantic, multimodal, sparse, hybrid, and conversational search. For information about all search methods, see [Search]({{site.url}}{{site.baseurl}}/search-plugins/).
Collaborator


Suggested change
Along with the traditional full-text search described in this tutorial, OpenSearch supports a range of machine learning (ML)-powered search methods, including k-NN, semantic, multimodal, sparse, hybrid, and conversational search. For information about all search methods, see [Search]({{site.url}}{{site.baseurl}}/search-plugins/).
Along with the traditional full-text search described in this tutorial, OpenSearch supports a range of machine learning (ML)-powered search methods, including k-NN, semantic, multimodal, sparse, hybrid, and conversational search. For information about all OpenSearch-supported search methods, see [Search]({{site.url}}{{site.baseurl}}/search-plugins/).


- In a database of students, a document might represent one student.
- When you search for information, OpenSearch returns documents related to your search.
- If you're familiar with traditional databases, a document represents a row.
Collaborator Author


Suggested change
- If you're familiar with traditional databases, a document represents a row.
- A document represents a row in a traditional database.


You can think of an index in several ways:

- If you have a collection of encyclopedia articles, an index represents the whole collection.
Collaborator Author


Suggested change
- If you have a collection of encyclopedia articles, an index represents the whole collection.
- In a database of students, an index represents all students in the database.


- If you have a collection of encyclopedia articles, an index represents the whole collection.
- When you search for information, you query data contained in an index.
- If you're familiar with traditional databases, a document represents a database table.
Collaborator Author


Suggested change
- If you're familiar with traditional databases, a document represents a database table.
- An index represents a database table in a traditional database.

- When you search for information, you query data contained in an index.
- If you're familiar with traditional databases, a document represents a database table.

For example, in a school database, an index might contain all students in the school.
Collaborator Author


Suggested change
For example, in a school database, an index might contain all students in the school.
For example, in a school database, an index might contain information about all students in the school.


## Clusters and nodes

OpenSearch is designed to be a distributed search engine. OpenSearch can run on one or more _nodes_---servers that store your data and process search requests. An OpenSearch *cluster* is a collection of nodes.
Collaborator Author


Suggested change
OpenSearch is designed to be a distributed search engine. OpenSearch can run on one or more _nodes_---servers that store your data and process search requests. An OpenSearch *cluster* is a collection of nodes.
OpenSearch is designed to be a distributed search engine, meaning that it can run on one or more _nodes_---servers that store your data and process search requests. An OpenSearch *cluster* is a collection of nodes.


You can run OpenSearch locally on a laptop---its system requirements are minimal---but you can also scale a single cluster to hundreds of powerful machines in a data center.

In a single-node cluster, such as a laptop, one machine has to do everything: manage the state of the cluster, index and search data, and perform any preprocessing of data prior to indexing it. As a cluster grows, however, you can subdivide responsibilities. Nodes with fast disks and plenty of RAM might be great at indexing and searching data, whereas a node with plenty of CPU power and a tiny disk could manage cluster state.
Collaborator Author


Suggested change
In a single-node cluster, such as a laptop, one machine has to do everything: manage the state of the cluster, index and search data, and perform any preprocessing of data prior to indexing it. As a cluster grows, however, you can subdivide responsibilities. Nodes with fast disks and plenty of RAM might be great at indexing and searching data, whereas a node with plenty of CPU power and a tiny disk could manage cluster state.
In a single-node cluster, such as one deployed on a laptop, one machine has to perform every task: manage the state of the cluster, index and search data, and perform any preprocessing of data prior to indexing it. As a cluster grows, however, you can subdivide responsibilities. Nodes with fast disks and plenty of RAM might perform well when indexing and searching data, whereas a node with plenty of CPU power and a tiny disk could manage cluster state.


### Full-text search

You can run a full-text search on fields mapped as `text`. By default, text fields are analyzed by the `default` analyzer. The analyzer splits text into terms and makes it lowercase. For more information about OpenSearch analyzers, see [Analyzers]({{site.url}}{{site.baseurl}}/analyzers/).
Collaborator Author


Suggested change
You can run a full-text search on fields mapped as `text`. By default, text fields are analyzed by the `default` analyzer. The analyzer splits text into terms and makes it lowercase. For more information about OpenSearch analyzers, see [Analyzers]({{site.url}}{{site.baseurl}}/analyzers/).
You can run a full-text search on fields mapped as `text`. By default, text fields are analyzed by the `default` analyzer. The analyzer splits text into terms and changes it to lowercase. For more information about OpenSearch analyzers, see [Analyzers]({{site.url}}{{site.baseurl}}/analyzers/).
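The analyzer behavior described in this paragraph can be roughly simulated. This is only a sketch; real analysis also handles punctuation, Unicode word boundaries, and more.

```python
# Rough simulation of what the default analyzer does to a text field:
# split the text into terms and lowercase them.
def toy_analyze(text: str) -> list:
    return [term.lower() for term in text.split()]

print(toy_analyze("John Doe"))  # ['john', 'doe']
```

This is why the match query for `john doe` in the tutorial finds documents containing `John Doe`: both sides are reduced to the same lowercase terms.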


### Keyword search

The `name` field contains the `name.keyword` subfield, which was added by OpenSearch automatically. You can try to search the `name.keyword` field in a manner similar to the previous request:
Collaborator Author


Suggested change
The `name` field contains the `name.keyword` subfield, which was added by OpenSearch automatically. You can try to search the `name.keyword` field in a manner similar to the previous request:
The `name` field contains the `name.keyword` subfield, which is added by OpenSearch automatically. If you search the `name.keyword` field in a manner similar to the previous request:


This request returns no hits because the `keyword` fields must be matched exactly.

However, you can search for the exact text `John Doe`:
Collaborator Author


Suggested change
However, you can search for the exact text `John Doe`:
However, if you search for the exact text `John Doe`:


### Filters

You can add a filter clause to your query for fields with exact values using a Boolean query.
Collaborator Author


Suggested change
You can add a filter clause to your query for fields with exact values using a Boolean query.
Using a Boolean query, you can add a filter clause to your query for fields with exact values.

```
{% include copy-curl.html %}

Range filters support specifying a range of values. For example, the following Boolean query searches for students whose GPA is greater than 3.6:
Collaborator Author


Suggested change
Range filters support specifying a range of values. For example, the following Boolean query searches for students whose GPA is greater than 3.6:
With range filters, you can specify a range of values. For example, the following Boolean query searches for students whose GPA is greater than 3.6:

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Collaborator

@natebower natebower left a comment


@kolchfa-aws Just a few minor comments/changes.


1. An update is received by a primary shard and is written to the shard's transaction log ([translog](#translog)). The translog is flushed to disk (followed by an fsync) before the update is acknowledged. This guarantees durability.
1. The update is also passed to the Lucene index writer, which adds it to an in-memory buffer.
1. On a [refresh operation](#refresh), the Lucene index writer flushes the in-memory buffers to disk (with each buffer becoming a new Lucene segment), and a new index reader is opened over the resulting segment files. The updates are now visible for search.
Collaborator


Is "over" the right preposition here?

Contributor


It's the word that I've heard Lucene developers use, because an IndexReader is like a moving window providing a view "over" a set of segments.

Maybe "with" would make more sense to a casual reader? That doesn't sound quite right, though...

Collaborator Author


Thanks! I'll keep "over"


### Translog

An indexing or bulk call responds when the documents have been written to the translog and the translog is flushed to disk, so the updates are durable. The updates will be visible from search requests until after a [refresh operation](#refresh).
Collaborator


Is "from" the right preposition here?

Contributor


"to" is probably better

Also, the word "not" is missing -- The updates will not be visible to search requests until after a refresh operation.

kolchfa-aws and others added 4 commits April 4, 2024 10:46
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
@kolchfa-aws kolchfa-aws merged commit 246bb44 into main Apr 8, 2024
6 checks passed
@github-actions github-actions bot deleted the getting-started branch April 8, 2024 13:10
@kolchfa-aws kolchfa-aws added the backport 2.13 PR: Backport label for 2.13 label Apr 8, 2024
opensearch-trigger-bot bot pushed a commit that referenced this pull request Apr 8, 2024
* First iteration
* Add shard and node info
* Communicate section additions
* Change examples
* Remove extraneous files
* Update _getting-started/communicate.md
* Update _getting-started/intro.md
* Update _getting-started/intro.md
* Update _getting-started/intro.md
* Update _getting-started/search-data.md
* Apply suggestions from code review
* Apply suggestions from code review
* Tech review comments
* Add link to compound query section
* Added install types section
* Remove further reading suggestions
* Reorder sections
* Apply suggestions from code review
* Update _getting-started/intro.md
* Fix links
* Reword
* Reword
* Update _getting-started/intro.md

---------

Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Co-authored-by: Melissa Vagi <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
(cherry picked from commit 246bb44)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
github-actions bot pushed a commit that referenced this pull request Apr 8, 2024
Labels
backport 2.13 PR: Backport label for 2.13
Development

Successfully merging this pull request may close these issues.

[DOC] Add documentation on Getting Started in OpenSearch
4 participants