Add expand data corpus instructions #8807
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,83 @@ | ||
--- | ||
layout: default | ||
title: Expand data corpus | ||
|
||
nav_order: 20 | ||
parent: Optimizing benchmarks | ||
grand_parent: User guide | ||
--- | ||
|
||
# Expanding the data corpus of a workload | ||
Review comment: "Expanding a workload data corpus"?
|
||
|
||
This tutorial shows you how to use the [`expand-data-corpus.py`](https://github.com/opensearch-project/opensearch-benchmark/blob/main/scripts/expand-data-corpus.py) script to increase the size of the data corpus for an OpenSearch Benchmark workload. This is helpful when running time-series workloads like `http_logs` against a large-scale OpenSearch cluster. | ||
Check failure on line 11 in _benchmark/user-guide/optimizing-benchmarks/expand-data-corpus.md GitHub Actions / vale[vale] _benchmark/user-guide/optimizing-benchmarks/expand-data-corpus.md#L11
|
||
|
||
Only the `http_logs` workload is currently supported. | ||
|
||
{: .warning} | ||
|
||
## Prerequisites | ||
|
||
To use this tutorial, make sure you fulfill the following prerequisites: | ||
Check failure on line 18 in _benchmark/user-guide/optimizing-benchmarks/expand-data-corpus.md GitHub Actions / vale[vale] _benchmark/user-guide/optimizing-benchmarks/expand-data-corpus.md#L18
|
||
|
||
1. Python 3.x or later is installed. | ||
|
||
2. The `http_logs` workload data corpus is already available on the load generation host where OpenSearch Benchmark is running. | ||
Review comment: corpus is already available in your load generation host where OSB is running.
|
||
|
||
## Understanding the script | ||
|
||
The `expand-data-corpus.py` script is designed to generate a larger data corpus by duplicating and modifying existing documents from the `http_logs` workload. It primarily adjusts the timestamp field while keeping other fields intact. It also generates an offset file, which enables OpenSearch Benchmark to start up faster. | ||
|
||
|
||
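The duplicate-and-retimestamp idea described above can be illustrated with a small sketch. This is not the script's actual implementation: the sample documents, the `@timestamp` field layout, and the `expand_docs` helper are all assumptions made for illustration.

```bash
# Hypothetical sample documents; the field layout is an assumption.
SAMPLE="$(mktemp)"
cat > "$SAMPLE" <<'EOF'
{"@timestamp": 893964617, "clientip": "40.135.0.0", "status": 200, "size": 24736}
{"@timestamp": 893964653, "clientip": "232.0.0.0", "status": 200, "size": 4889}
EOF

# Hypothetical helper: expand_docs FILE COPIES INTERVAL START
# Emits every document COPIES times and rewrites only the timestamp, so that
# consecutive output documents are INTERVAL seconds apart, beginning at START.
expand_docs() {
  awk -v copies="$2" -v interval="$3" -v start="$4" '
    { docs[NR] = $0 }                      # keep the source documents in memory
    END {
      ts = start
      for (c = 0; c < copies; c++) {
        for (i = 1; i <= NR; i++) {
          line = docs[i]
          sub(/"@timestamp": [0-9]+/, "\"@timestamp\": " ts, line)
          print line                       # all other fields are left intact
          ts += interval
        }
      }
    }' "$1"
}

expand_docs "$SAMPLE" 2 10 900000000
```

Running the sketch prints four documents whose timestamps advance in steps of 10 while the remaining fields repeat unchanged.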
## Using `expand-data-corpus.py` | ||
|
||
To use `expand-data-corpus.py`, use the following syntax: | ||
|
||
```bash | ||
./expand-data-corpus.py [options] | ||
``` | ||
|
||
You can adjust the script with the following options. | ||
|
||
|
||
- `--corpus-size`: The desired corpus size in GB. | ||
- `--output-file-suffix`: The suffix for the output file name. | ||
|
||
## Example | ||
|
||
|
||
This example generates a 100 GB corpus. | ||
|
||
|
||
```bash | ||
./expand-data-corpus.py --corpus-size 100 --output-file-suffix 100gb | ||
``` | ||
|
||
The script will start generating documents. For a 100 GB corpus, it can take up to 30 minutes to generate the full corpus. | ||
|
||
You can generate multiple corpora by running the script multiple times with different output suffixes. | ||
|
||
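Generating several corpora in one pass might look like the following sketch. It uses only the two documented options; the `generate_corpus` helper, the `SCRIPT` and `DRY_RUN` variables, and the chosen sizes are assumptions for illustration. By default, the sketch only prints the commands it would run.

```bash
# Path to the script and dry-run switch (both hypothetical conveniences).
SCRIPT="${SCRIPT:-./expand-data-corpus.py}"
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: builds one corpus of SIZE GB with the suffix "<SIZE>gb".
generate_corpus() {
  size_gb="$1"
  if [ "$DRY_RUN" = "1" ]; then
    # Print the command instead of running it.
    echo "$SCRIPT --corpus-size $size_gb --output-file-suffix ${size_gb}gb"
  else
    "$SCRIPT" --corpus-size "$size_gb" --output-file-suffix "${size_gb}gb"
  fi
}

# Each run uses a distinct suffix so the outputs do not overwrite each other.
for size in 50 100; do
  generate_corpus "$size"
done
```

Set `DRY_RUN=0` to run the script for real. Each corpus can take a long time to generate, so running the sizes sequentially, as shown, keeps disk and CPU usage predictable.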
## Verifying the documents | ||
|
||
After the script completes, check the following locations for new files: | ||
|
||
- In the OSB data directory for `http_logs`: | ||
|
||
- `documents-100gb.json`: The generated corpus. | ||
Check failure on line 57 in _benchmark/user-guide/optimizing-benchmarks/expand-data-corpus.md GitHub Actions / vale[vale] _benchmark/user-guide/optimizing-benchmarks/expand-data-corpus.md#L57
|
||
- `documents-100gb.json.offset`: The offset file. | ||
|
||
|
||
- In the `http_logs` workload directory: | ||
|
||
- `gen-docs-100gb.json`: The metadata for the generated corpus. | ||
Check failure on line 61 in _benchmark/user-guide/optimizing-benchmarks/expand-data-corpus.md GitHub Actions / vale[vale] _benchmark/user-guide/optimizing-benchmarks/expand-data-corpus.md#L61
|
||
- `gen-idx-100gb.json`: The index specification for the generated corpus. | ||
Check failure on line 62 in _benchmark/user-guide/optimizing-benchmarks/expand-data-corpus.md GitHub Actions / vale[vale] _benchmark/user-guide/optimizing-benchmarks/expand-data-corpus.md#L62
|
||
|
||
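The file check above can be scripted. The following is a minimal sketch: the `DATA_DIR` and `WORKLOAD_DIR` defaults are assumed locations that may differ on your installation, and `check_generated_files` is a hypothetical helper, not part of OpenSearch Benchmark.

```bash
# Assumed default locations; override these to match your installation.
DATA_DIR="${DATA_DIR:-$HOME/.benchmark/benchmarks/data/http_logs}"
WORKLOAD_DIR="${WORKLOAD_DIR:-$HOME/.benchmark/benchmarks/workloads/default/http_logs}"
SUFFIX="${SUFFIX:-100gb}"   # the value passed to --output-file-suffix

# Prints one status line per expected file and returns non-zero if any is missing.
check_generated_files() {
  missing=0
  for f in \
    "$DATA_DIR/documents-$SUFFIX.json" \
    "$DATA_DIR/documents-$SUFFIX.json.offset" \
    "$WORKLOAD_DIR/gen-docs-$SUFFIX.json" \
    "$WORKLOAD_DIR/gen-idx-$SUFFIX.json"
  do
    if [ -f "$f" ]; then
      echo "found:   $f"
    else
      echo "missing: $f"
      missing=1
    fi
  done
  return "$missing"
}

check_generated_files || echo "Some generated files were not found."
```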
## Using the corpus in a test | ||
|
||
To use the newly generated corpus in an OpenSearch Benchmark test, use the following syntax: | ||
|
||
```bash | ||
opensearch-benchmark execute-test --workload http_logs --workload-params=generated_corpus:t [other_options] | ||
``` | ||
|
||
The `generated_corpus:t` parameter tells OSB to use the expanded corpus. | ||
|
||
## Expert-level settings | ||
|
||
Be cautious when using the following expert-level options because they may affect the corpus structure: | ||
|
||
|
||
- `-f`: Specifies the input file to use as a base for generating new documents. | ||
|
||
- `-n`: Sets the number of documents to generate instead of the corpus size. | ||
|
||
- `-i`: Defines the interval between consecutive timestamps. | ||
|
||
- `-t`: Sets the starting timestamp for the generated documents. | ||
|
||
- `-b`: Defines the number of documents per batch when writing to the offset file. | ||
|
||
|
Review comment: Should this be "Expanding a data corpus"?