Add Querying Functionality to OSB #409

Merged
merged 13 commits into from
Jun 21, 2022

@jmazanec15 jmazanec15 commented May 23, 2022

Description

Adds the ability to run a query workload from a data set with the OpenSearch Benchmark (OSB) tool for k-NN workloads. Refactors some of the code to better share components across extensions.

In addition, adds unit tests for the custom param sources.

Recall metrics are tracked in opensearch-project/opensearch-benchmark#199 and are not covered in this PR.

Issues Resolved

#373

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Adds random query workloads to both the train and no-train test
procedures. Adds custom parameter source to produce the queries. Add
usage of the parameter source to both json files. Updated documentation.

Signed-off-by: John Mazanec <[email protected]>
Adds custom param source that will allow users to pull queries from a
data set as opposed to using random queries. Along with this, refactored
parameter sources to share common functionality. Updated README

Signed-off-by: John Mazanec <[email protected]>
Reads query vecs from data set in batches to avoid making too many disk
reads. Batch size is hardcoded to 100.

Signed-off-by: John Mazanec <[email protected]>
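
The batching described in the commit message above could look roughly like the following. This is a minimal sketch, not the PR's code: `BatchedVectorReader`, its fields, and the in-memory `source` sequence standing in for the on-disk data set are all hypothetical. Only the batch size of 100 comes from the commit message.

```python
from collections import deque
from typing import Deque, Iterator, List, Sequence

BATCH_SIZE = 100  # the commit message says the batch size is hardcoded to 100


class BatchedVectorReader:
    """Hypothetical reader that serves one vector at a time but fetches in batches."""

    def __init__(self, source: Sequence[List[float]], batch_size: int = BATCH_SIZE):
        self.source = source  # stand-in for an on-disk data set (assumption)
        self.batch_size = batch_size
        self.offset = 0  # index of the next vector to fetch from the source
        self.buffer: Deque[List[float]] = deque()
        self.reads = 0  # number of batch fetches, kept only for illustration

    def __iter__(self) -> Iterator[List[float]]:
        return self

    def __next__(self) -> List[float]:
        if not self.buffer:
            if self.offset >= len(self.source):
                raise StopIteration
            # Fetch the next batch with a single slice instead of one read per vector.
            end = min(self.offset + self.batch_size, len(self.source))
            self.buffer.extend(self.source[self.offset:end])
            self.offset = end
            self.reads += 1
        return self.buffer.popleft()
```

With 250 vectors and the default batch size, the reader touches the source three times rather than 250 times.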
Add custom query recall runner so that we can eventually compute the
recall of queries. Currently, recall value is hard coded but this will
be implemented in the future.

Signed-off-by: John Mazanec <[email protected]>
Add ability to compute recall score for the custom query runner.
Currently, to compute recall, it checks how many of the top k returned
results appear in the ground truth set.

Signed-off-by: John Mazanec <[email protected]>
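
As an illustration of the recall check described in the commit message above, here is a minimal sketch. The function name `recall_at_k`, its parameters, and the choice to normalize by `k` are assumptions, not the code from this PR. Note that a later commit in this PR removes the recall calculation.

```python
from typing import Sequence


def recall_at_k(returned_ids: Sequence[str], ground_truth_ids: Sequence[str], k: int) -> float:
    """Fraction of the top-k returned ids that appear in the ground truth set.

    Hypothetical helper: dividing by k is an assumption about the normalization.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(ground_truth_ids)
    # Count how many of the top-k returned results are in the ground truth set.
    hits = sum(1 for doc_id in returned_ids[:k] if doc_id in truth)
    return hits / k
```

For example, if two of the top four returned ids are in the ground truth set, the function returns 0.5.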
Cleans up documentation and tracks with addition of query and compute
recall functionality.

Signed-off-by: John Mazanec <[email protected]>
@jmazanec15 jmazanec15 added the Infrastructure Changes to infrastructure, testing, CI/CD, pipelines, etc. label May 23, 2022
@jmazanec15 jmazanec15 requested a review from a team May 23, 2022 04:09
@jmazanec15 jmazanec15 marked this pull request as draft May 23, 2022 04:09
@codecov-commenter

codecov-commenter commented May 23, 2022

Codecov Report

Merging #409 (52563b1) into main (a5dd71c) will not change coverage.
The diff coverage is n/a.

@@            Coverage Diff            @@
##               main     #409   +/-   ##
=========================================
  Coverage     84.01%   84.01%           
  Complexity      911      911           
=========================================
  Files           130      130           
  Lines          3879     3879           
  Branches        359      359           
=========================================
  Hits           3259     3259           
  Misses          458      458           
  Partials        162      162           


@travisbenedict

I haven't used this functionality myself but this seems like it should work.

Is any of the data defined here showing up in the results? Is it just recall that's missing?

If you haven't already, you could try configuring OpenSearch Benchmark to write to an OpenSearch cluster. That would give you access to the full set of raw metrics.

@jmazanec15
Member Author

@travisbenedict I've tried a few variations of it, but no, only query latency, throughput, service time, and error rate get output. The Rally docs say that custom metrics are added to the metadata about the operation, but I am not sure how to find or generate those if it is not connected to an OpenSearch cluster. Also, ideally, I would like to get the results in the summary.

Here is a current sample of the results: https://gist.github.com/jmazanec15/82b91eaad4af8acd773fbc97ba25b638.

Removes recall calculation from benchmarking logic as this is delayed
until opensearch-project/opensearch-benchmark#199 can be implemented.

Signed-off-by: John Mazanec <[email protected]>
Removes random query. Random queries may be misleading if the distribution
of the index data is significantly different from that of the randomly
generated queries.

Signed-off-by: John Mazanec <[email protected]>
Signed-off-by: John Mazanec <[email protected]>
Adds unit tests for param sources for benchmarking. In addition, adds a
test utility to create data sets dynamically.

Signed-off-by: John Mazanec <[email protected]>
Signed-off-by: John Mazanec <[email protected]>
Signed-off-by: John Mazanec <[email protected]>
Signed-off-by: John Mazanec <[email protected]>
@jmazanec15 jmazanec15 marked this pull request as ready for review June 20, 2022 22:15
Member

@martin-gaievski martin-gaievski left a comment

A few general questions:

  • How are we going to handle OSB version updates? I think we use officially supported extension points, but I just want to reconfirm that we minimize the chances of breaking things on our end when upgrading to a new OSB version.
  • Do you want to use multiple clients for queries in our benchmarks (say, for a k-NN release)? We probably need to come up with some formula to estimate the number of clients based on cluster configuration.

Returns:
The parameter source for this particular partition
"""
if self.num_vectors % total_partitions != 0:
Member

Does this mean that the data set size must be divisible by the number of parallel clients?
If so, I think in the next revision we need to relax this requirement and divide evenly, except for the last client, which will take the remainder.
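
A rough sketch of the relaxed scheme suggested here, where every client gets an equal share and the last one also takes the remainder. The helper `partition_bounds` and its parameters are hypothetical and are not part of this PR or of OSB.

```python
from typing import Tuple


def partition_bounds(num_vectors: int, partition_index: int, total_partitions: int) -> Tuple[int, int]:
    """Return the [start, end) range of vectors assigned to one partition.

    Hypothetical helper: the last partition absorbs any remainder.
    """
    if total_partitions <= 0:
        raise ValueError("total_partitions must be positive")
    if not 0 <= partition_index < total_partitions:
        raise ValueError("partition_index out of range")
    per_partition = num_vectors // total_partitions
    start = partition_index * per_partition
    if partition_index == total_partitions - 1:
        end = num_vectors  # last client takes the leftover vectors
    else:
        end = start + per_partition
    return start, end
```

With 10 vectors and 3 clients, the ranges are [0, 3), [3, 6), and [6, 10), so no divisibility check is needed.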

Member Author

Yes, that's a good point. I can update this in a future PR. I will create an issue when this PR is merged.

@jmazanec15
Member Author

How are we going to handle OSB version updates? I think we use officially supported extension points, but I just want to reconfirm that we minimize the chances of breaking things on our end when upgrading to a new OSB version.

@martin-gaievski Good question. I think it will most likely be addressed at a later date. Right now, we don't release the benchmarks as artifacts, and we hard-code the OSB dependency in requirements.txt. I think eventually we will want to move things to https://github.com/opensearch-project/opensearch-benchmark-workloads/, and when we do that we can ensure version compatibility.

@jmazanec15
Member Author

Do you want to use multiple clients for queries in our benchmarks (say, for a k-NN release)? We probably need to come up with some formula to estimate the number of clients based on cluster configuration.

Yes, I think this PR will focus more on providing the functionality of the benchmarking tool. In a future PR, we will make decisions on configuration. We need some kind of standardization of performance testing for releases.

Labels
Infrastructure (Changes to infrastructure, testing, CI/CD, pipelines, etc.), v2.1.0