Performance Measurement/Tracking #445

Open · jjhursey opened this issue Jun 30, 2016 · 5 comments

Comments

@jjhursey
Member

We should investigate better integration of performance benchmarks into the new MTT infrastructure. Performance regressions are currently hard to see; automated tracking would give us visibility into performance issues when commits happen, instead of when we are ramping up to a release.

See open-mpi/ompi#1831 for one case where this would be useful for Open MPI.

We need to discuss how to store the data, and how to organize the DB structure and REST interface so that apples-to-apples comparisons are easy to access. We looked at this in the past, and it's harder than one might think.

@gpaulsen
Member

gpaulsen commented Aug 31, 2016

At the face-to-face, we recommended not trusting "stored away" numbers, and instead running both an old and a new version in sequence to mitigate cluster changes, environmental changes, and other temporal abnormalities in the data.
I'd recommend a process of:

  1. BUILD OLD
  2. BUILD NEW
  3. BUILD TEST with OLD
  4. RUN TEST with OLD runtime
  5. RUN TEST with NEW runtime
  6. COMPARE performance results.

Storing these performance diffs in the database might be more useful than storing raw results.
And if we don't recompile the test application, we might also discover any binary compatibility bugs introduced between OLD and NEW.
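
A minimal sketch of that build/run/compare sequence, assuming hypothetical helper scripts (build_ompi.sh, run_benchmark.sh, compare_perf.py) rather than real MTT stages:

```python
# Sketch only: the script names and install prefixes below are placeholders,
# not part of MTT or Open MPI.
import subprocess

def sh(cmd):
    """Run a shell command and fail loudly on error."""
    subprocess.run(cmd, shell=True, check=True)

# 1-2. Build both MPI versions side by side.
sh("./build_ompi.sh --version=OLD --prefix=/opt/ompi-old")
sh("./build_ompi.sh --version=NEW --prefix=/opt/ompi-new")

# 3. Build the benchmark once, against OLD only (this is also what lets
#    step 5 surface binary compatibility bugs).
sh("PATH=/opt/ompi-old/bin:$PATH make -C benchmarks clean all")

# 4-5. Run the same binary under each runtime, back to back, so that
#      cluster/environment drift affects both runs roughly equally.
sh("PATH=/opt/ompi-old/bin:$PATH ./run_benchmark.sh --out old.json")
sh("PATH=/opt/ompi-new/bin:$PATH ./run_benchmark.sh --out new.json")

# 6. Compare the two result sets (comparison logic sketched further down).
sh("./compare_perf.py old.json new.json")
```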

@jsquyres
Member

jsquyres commented Sep 8, 2016

👍 on what @gpaulsen said. Except I'd make the backwards compatibility tests separate from these performance tests (because the performance characteristics may/will be desirable to test over a longer period of time than our backwards compatibility guarantees).

Some possible requirements for the performance testing:

  1. Let's start with 3 easy benchmarks: latency, bandwidth, and message injection rates.
    • Each of these has a simple performance curve: the measured value (Y) vs. message size (X).
    • Disk space is cheap: I'd store the individual X/Y data points, not just the difference between OLD/NEW.
  2. The comparison of OLD vs. NEW can be a simple subtraction (sketched below):
    • For latency: calculate (NEW_y - OLD_y) for each MESSAGE_SIZE_x. If any value is greater than Z% of OLD_y (where Z% is TBD/parameter of the test checker), FAIL the test.
    • For bandwidth and message rate: calculate (OLD_y - NEW_y) for each MESSAGE_SIZE_x. If any value is greater than Z% of OLD_y (where Z% is TBD/parameter of the test checker), FAIL the test.

We as a community just need to determine the versions of OLD that we want to compare against.
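
A minimal sketch of the Z% check above, assuming results arrive as plain message-size → value mappings (not an agreed MTT data format); the numbers in the example are made up:

```python
# Sketch of the Z% regression check described above. The data layout and
# the example values are illustrative assumptions only.
def check_regression(old, new, metric, threshold_pct):
    """old/new map message size (X) to the measured value (Y).
    For latency a regression is NEW above OLD; for bandwidth and
    message rate it is NEW below OLD."""
    failures = []
    for size, old_y in old.items():
        new_y = new[size]
        delta = (new_y - old_y) if metric == "latency" else (old_y - new_y)
        if delta > (threshold_pct / 100.0) * old_y:
            failures.append((size, old_y, new_y))
    return failures

# Example: allow up to a 5% latency regression at each message size.
old_lat = {1: 1.10, 1024: 1.85, 65536: 12.40}   # usec, made-up values
new_lat = {1: 1.12, 1024: 2.10, 65536: 12.45}
print(check_regression(old_lat, new_lat, "latency", threshold_pct=5))
```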

@rhc54
Contributor

rhc54 commented Sep 9, 2016

Let's also remember that we have plugin support in the new MTT. So there is no problem creating a plugin that compares against some stored "good" measurement, another that does old vs. new, and another that does whatever someone wants for their own purposes. If we write the plugins intelligently so that data retrieval can be shared code, it will be relatively easy to add new comparison algorithms.
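
A rough sketch of that split (shared retrieval code, pluggable comparison logic); the class and method names are hypothetical, not the actual MTT plugin API:

```python
# Hypothetical shapes only; the real MTT plugin interfaces will differ.
class PerfDataSource:
    """Shared retrieval code that every comparison plugin can reuse."""
    def fetch(self, version, benchmark):
        """Return {message_size: value} for one version/benchmark pair."""
        raise NotImplementedError  # e.g. read from the MTT database/REST API

class ComparisonPlugin:
    """Base class: subclasses only decide what gets compared."""
    def __init__(self, source, threshold_pct=5.0):
        self.source = source
        self.threshold_pct = threshold_pct

    def within_threshold(self, old, new):
        """True if no per-size delta exceeds the threshold (latency-style)."""
        return all(new[s] - old[s] <= (self.threshold_pct / 100.0) * old[s]
                   for s in old)

class StoredGoodComparison(ComparisonPlugin):
    """Compare a fresh run against a stored 'known good' measurement."""
    def check(self, benchmark, candidate):
        baseline = self.source.fetch("known-good", benchmark)
        return self.within_threshold(baseline, candidate)

class OldVsNewComparison(ComparisonPlugin):
    """Compare two back-to-back runs (OLD vs. NEW) from the same session."""
    def check(self, benchmark, old_results, new_results):
        return self.within_threshold(old_results, new_results)
```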

@jjhursey
Member Author

Below are some thoughts...

Running Tests

Ability to run in the following modes (a version could be a release tarball or a git hash); a command-line sketch follows the list:

  1. Run perf test for the currently installed build only
    • Use case: single data point, developer point test during development
  2. Run perf test for version A only
    • Use case: single data point, bisect through history
  3. Run perf test for the currently installed build and version A
    • Use case: developer progress test against a baseline version
  4. Run perf test for version X and version Y
    • Use case: Compare delta between two points in time
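
A small command-line sketch covering those four modes; the flags are illustrative only, not existing MTT options:

```python
# Sketch of a driver front end; none of these flags exist in MTT today.
import argparse

parser = argparse.ArgumentParser(description="MTT perf test driver (sketch)")
parser.add_argument("--installed", action="store_true",
                    help="include the currently installed build in the run")
parser.add_argument("versions", nargs="*", metavar="VERSION",
                    help="release tarballs or git hashes to build and run")
args = parser.parse_args()

# Mode 1: --installed            -> currently installed build only
# Mode 2: A                      -> version A only (useful for bisecting)
# Mode 3: --installed A          -> installed build vs. baseline version A
# Mode 4: X Y                    -> version X vs. version Y
targets = (["<installed>"] if args.installed else []) + args.versions
print("will run perf tests for:", targets)
```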

Collecting/Reporting Data

  • Raw data is collected and archived by client
    • Use plugins to determine the set of archive methods (e.g., XML, JSON, local/remote database(s), custom formats)
  • Option to push data to the server to share more broadly.
    • It must be possible to run this as a separate, manual step after local review of the results.
    • It may also be part of the automated process.
  • Data is going to be specific to a particular configuration, so the reporting needs to make the following clear (a strawman record layout is sketched after this list):
    • System configuration (e.g., arch, network, ...)
      • User should set a description for the system - human readable
      • Maybe also add additional discovery via config.log and hwloc?
      • Would need any device/hw specific configurations too
    • Build configuration
      • User should set a description for this - human readable
      • Maybe also pull the configure results...
    • Runtime configuration
      • User should set a description for this - human readable
      • We need the command line, plus any environment variables, binding, ...
    • I think a human readable configuration line would help in understanding the testing environment. We will not be able to automatically discover everything that is necessary to report.
  • Ability to delete/hide/promote results that are pushed to the DB.
    • So we can remove known bad results
    • So we can easily share useful comparisons. (current permalinks mechanism could be used here)
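
As a strawman, one archived record could bundle the raw X/Y points with the three human-readable descriptions above; the field names and values are placeholders, not a settled MTT schema:

```python
import json

# Illustrative record only: every field name and value here is a placeholder.
record = {
    "benchmark": "latency",
    "system": {
        "description": "example cluster: 2 nodes, EDR InfiniBand",  # human readable
        "hwloc_xml": None,        # optional automatic discovery
        "config_log": None,       # optional automatic discovery
    },
    "build": {
        "description": "master @ <git hash>, gcc, default optimization",
        "configure_args": [],
    },
    "runtime": {
        "description": "2 ranks, 1 per node, bound to core",
        "command_line": "mpirun -np 2 --map-by node --bind-to core ./latency",
        "environment": {},
    },
    # Raw X/Y data points (disk is cheap), not just an OLD/NEW diff.
    "results": [{"message_size": 1, "value": 1.10, "unit": "usec"}],
}
print(json.dumps(record, indent=2))
```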

Rendering Data

  • Each perf test has a description of how the data should be organized/compared
  • Comparison represented in tabular format
  • Comparison represented in graph format
  • Comparison can be rendered entirely on the client side
  • Alarm / Flag option
    • When a particular comparison is out of the normal range, it is flagged in an obvious way (see the rendering sketch below).
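
A minimal client-side rendering sketch with such an out-of-range flag; the threshold and layout are illustrative assumptions:

```python
# Print an OLD-vs-NEW comparison table and flag out-of-range rows.
def render_table(old, new, threshold_pct=5.0):
    print(f"{'size':>10} {'old':>10} {'new':>10} {'delta %':>8}  flag")
    for size in sorted(old):
        pct = 100.0 * (new[size] - old[size]) / old[size]
        flag = "<== CHECK" if abs(pct) > threshold_pct else ""
        print(f"{size:>10} {old[size]:>10.2f} {new[size]:>10.2f} {pct:>7.1f}%  {flag}")

# Example call with made-up latency numbers (usec):
# render_table({1: 1.10, 1024: 1.85}, {1: 1.12, 1024: 2.10})
```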

Other notes

  • We need to decide on a set of useful perf tests. Start small and gradually grow from there.
  • Must have the option to be selective in what is reported where.
    • Sometimes developers want frequent status updates on the performance impact
    • Some performance numbers should not be made public.
  • A useful rendering might be
    • Every week show me the performance difference between:
      • Release vX.Y.Z and the current HEAD of master
      • Release vX.Y.Z and the current HEAD of release branch
    • Ability to integrate git bisect to find where in history performance changed. This might get tricky...
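
For the git bisect idea, one option is a small wrapper script that git bisect run can call; the build/run helper scripts and the baseline number below are placeholders:

```python
# bisect_perf.py (sketch): exit 0 = good commit, 1 = performance regressed.
# Hypothetical usage:
#   git bisect start <bad> <good>
#   git bisect run python bisect_perf.py
# Note: git bisect run treats exit code 125 as "skip", which is the right
# thing to return if the checked-out commit fails to build.
import subprocess
import sys

BASELINE_USEC = 1.10   # measured once on the known-good commit (placeholder)
THRESHOLD_PCT = 5.0

try:
    subprocess.run("./build_ompi.sh --prefix=/tmp/ompi-bisect",
                   shell=True, check=True)
except subprocess.CalledProcessError:
    sys.exit(125)  # build failure: tell bisect to skip this commit

out = subprocess.run("./run_latency.sh", shell=True, check=True,
                     capture_output=True, text=True)
latency = float(out.stdout.strip())
regressed = latency > BASELINE_USEC * (1.0 + THRESHOLD_PCT / 100.0)
sys.exit(1 if regressed else 0)
```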

@jjhursey
Member Author

This would be great to do one day if someone is interested in tinkering with it.

@jjhursey jjhursey removed their assignment Mar 25, 2021