Add expand data corpus instructions #8807

Merged
merged 16 commits into from
Dec 16, 2024
@@ -0,0 +1,83 @@
---
layout: default
title: Expand data corpus
Collaborator

Should this be "Expanding a data corpus"?

Naarcha-AWS marked this conversation as resolved.
nav_order: 20
parent: Optimizing benchmarks
grand_parent: User guide
---

# Expanding the data corpus of a workload
Collaborator

"Expanding a workload data corpus"?

Naarcha-AWS marked this conversation as resolved.

This tutorial shows you how to use the [`expand-data-corpus.py`](https://github.com/opensearch-project/opensearch-benchmark/blob/main/scripts/expand-data-corpus.py) script to increase the size of the data corpus for an OpenSearch Benchmark workload. This is helpful when running time-series workloads like `http_logs` against a large-scale OpenSearch cluster.

Only the `http_logs` workload is currently supported.
{: .warning}

## Prerequisites

To use this tutorial, make sure you fulfill the following prerequisites:

1. You have installed Python 3.x or later.
2. The `http_logs` workload data corpus is already available on the load generation host where OpenSearch Benchmark is running.
Contributor

corpus is already available in your load generation host where OSB is running.

Naarcha-AWS marked this conversation as resolved.

## Understanding the script

The `expand-data-corpus.py` script is designed to generate a larger data corpus by duplicating and modifying existing documents from the `http_logs` workload. It primarily adjusts the timestamp field while keeping other fields intact. It also generates an offset file, which enables OpenSearch Benchmark to start up faster.
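
To make the idea concrete, the following Python sketch mimics the approach on a tiny in-memory sample. It is not the script's actual implementation: the `@timestamp` field name, the `expand_documents` helper, and the (document number, byte offset) layout of the offset table are assumptions made for illustration.

```python
import io
import json


def expand_documents(base_docs, count, start_ts, interval, batch_size):
    """Cycle through base_docs, rewriting only the timestamp field.

    Returns the generated corpus as bytes plus a list of
    (document_number, byte_offset) pairs, one per batch.
    """
    out = io.BytesIO()
    offsets = []
    for i in range(count):
        # Record where every batch starts so a reader can seek to it
        # without scanning the file from the beginning.
        if i % batch_size == 0:
            offsets.append((i, out.tell()))
        doc = dict(base_docs[i % len(base_docs)])  # copy; other fields stay intact
        doc["@timestamp"] = start_ts + i * interval  # hypothetical field name
        out.write((json.dumps(doc) + "\n").encode("utf-8"))
    return out.getvalue(), offsets


# Two sample documents expanded into five, spaced 10 seconds apart.
base = [{"@timestamp": 0, "status": 200}, {"@timestamp": 0, "status": 404}]
corpus, offsets = expand_documents(
    base, count=5, start_ts=1000, interval=10, batch_size=2
)
lines = corpus.decode("utf-8").splitlines()
```

Because each recorded offset points at the first byte of a batch, a consumer can jump directly to a given document range, which is the kind of shortcut an offset file makes possible.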

## Using `expand-data-corpus.py`

To use `expand-data-corpus.py`, use the following syntax:

```bash
./expand-data-corpus.py [options]
```

You can adjust the script with the following options.

- `--corpus-size`: The desired corpus size, in GB.
- `--output-file-suffix`: The suffix for the output file name.

## Example

This example generates a 100 GB corpus.

```bash
./expand-data-corpus.py --corpus-size 100 --output-file-suffix 100gb
```

The script will start generating documents. For a 100 GB corpus, it can take up to 30 minutes to generate the full corpus.

You can generate multiple corpora by running the script multiple times with different output suffixes.
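
If you want to script several runs, a small Python wrapper is one option. The following is a minimal sketch that uses only the two options described above. The `build_command` and `generate_corpora` helpers and the `<size>gb` suffix convention are hypothetical, and the script path assumes that you run the wrapper from the directory containing `expand-data-corpus.py`.

```python
import subprocess

SCRIPT = "./expand-data-corpus.py"  # assumed location of the script


def build_command(size_gb, suffix):
    """Build the argument list for one run of the script."""
    return [
        SCRIPT,
        "--corpus-size", str(size_gb),
        "--output-file-suffix", suffix,
    ]


def generate_corpora(sizes_gb, dry_run=True):
    """Run the script once per size, using '<size>gb' as the suffix."""
    commands = [build_command(size, f"{size}gb") for size in sizes_gb]
    for command in commands:
        if dry_run:
            # Only show what would be executed.
            print(" ".join(command))
        else:
            # Raises CalledProcessError if the script exits with an error.
            subprocess.run(command, check=True)
    return commands


commands = generate_corpora([50, 100])
```

The wrapper defaults to a dry run so that you can review the commands first. Pass `dry_run=False` to execute them one after another.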

## Verifying the documents

After the script completes, check the following locations for new files:

- In the OpenSearch Benchmark data directory for `http_logs`:
  - `documents-100gb.json`: The generated corpus.
  - `documents-100gb.json.offset`: The offset file.

- In the `http_logs` workload directory:
  - `gen-docs-100gb.json`: The metadata for the generated corpus.
  - `gen-idx-100gb.json`: The index specification for the generated corpus.
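
You can also automate this check. The following Python sketch reports which of the four expected files are missing for a given suffix. The `missing_corpus_files` helper is hypothetical, and you must supply the data and workload directory paths for your own installation. The demonstration uses temporary directories as stand-ins.

```python
import tempfile
from pathlib import Path


def missing_corpus_files(data_dir, workload_dir, suffix):
    """Return the expected generated files that do not exist yet."""
    expected = [
        Path(data_dir) / f"documents-{suffix}.json",
        Path(data_dir) / f"documents-{suffix}.json.offset",
        Path(workload_dir) / f"gen-docs-{suffix}.json",
        Path(workload_dir) / f"gen-idx-{suffix}.json",
    ]
    return [path.name for path in expected if not path.is_file()]


# Demonstration with stand-in directories: only the corpus file exists,
# so the other three files are reported as missing.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    workload_dir = Path(tmp) / "workload"
    data_dir.mkdir()
    workload_dir.mkdir()
    (data_dir / "documents-100gb.json").touch()
    missing = missing_corpus_files(data_dir, workload_dir, "100gb")
```

An empty list means that all four files are present.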

## Using the corpus in a test

To use the newly generated corpus in an OpenSearch Benchmark test, use the following syntax:

```bash
opensearch-benchmark execute-test --workload http_logs --workload-params=generated_corpus:t [other_options]
```

The `generated_corpus:t` parameter tells OpenSearch Benchmark to use the expanded corpus.

## Expert-level settings

Use the following expert-level options with caution because they may affect the corpus structure:

- `-f`: Specifies the input file to use as a base for generating new documents.
- `-n`: Sets the number of documents to generate instead of the corpus size.
- `-i`: Defines the interval between consecutive timestamps.
- `-t`: Sets the starting timestamp for the generated documents.
- `-b`: Defines the number of documents per batch when writing to the offset file.
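
If you plan to use `-n` instead of `--corpus-size`, it can help to estimate how many documents correspond to a target size. The following Python sketch derives a rough count from the average size of a few sample lines. The `estimate_document_count` helper is hypothetical, and the assumption that 1 GB equals 10^9 bytes may not match how the script interprets the size.

```python
def estimate_document_count(sample_lines, target_gb, bytes_per_gb=10**9):
    """Estimate how many documents fit into target_gb.

    sample_lines: a few representative corpus lines, including the
    trailing newline, because the newline also takes up space on disk.
    """
    total_bytes = sum(len(line.encode("utf-8")) for line in sample_lines)
    average_bytes = total_bytes / len(sample_lines)
    return int(target_gb * bytes_per_gb // average_bytes)


# Two stand-in lines of exactly 100 bytes each.
sample = ["x" * 99 + "\n", "y" * 99 + "\n"]
estimate = estimate_document_count(sample, target_gb=1)
```

Treat the result as a rough estimate only, because real documents vary in length.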