measure disk, IO, memory and CPU usage in oasis-test-runner #2536

Closed
matevz opened this issue Jan 10, 2020 · 1 comment · Fixed by #2687
Labels: c:instrumentation (Category: metrics and tracing), c:performance (Category: performance), c:testing (Category: testing)

Comments

@matevz
Member

matevz commented Jan 10, 2020

Currently, we use Prometheus to report per-node metrics during the test. Extend oasis-test-runner so that it also measures disk, I/O, memory, and CPU usage of each oasis-node process during execution and reports these via Prometheus.
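As a rough illustration of the per-process measurements, here is a minimal Go sketch of sampling I/O and memory usage on Linux, assuming the test runner knows the PID of each spawned oasis-node; paths and fields follow proc(5), and the helper names are hypothetical, not the final oasis-test-runner code:

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// procIOBytes parses cumulative read_bytes/write_bytes for one process
// from /proc/<pid>/io (see proc(5)).
func procIOBytes(pid int) (read, written uint64, err error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/io", pid))
	if err != nil {
		return 0, 0, err
	}
	for _, line := range strings.Split(string(data), "\n") {
		fields := strings.Fields(line)
		if len(fields) != 2 {
			continue
		}
		v, _ := strconv.ParseUint(fields[1], 10, 64)
		switch fields[0] {
		case "read_bytes:":
			read = v
		case "write_bytes:":
			written = v
		}
	}
	return read, written, nil
}

// procRSSBytes parses resident memory (VmRSS, reported in kB) from
// /proc/<pid>/status.
func procRSSBytes(pid int) (uint64, error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/status", pid))
	if err != nil {
		return 0, err
	}
	for _, line := range strings.Split(string(data), "\n") {
		if strings.HasPrefix(line, "VmRSS:") {
			kb, err := strconv.ParseUint(strings.Fields(line)[1], 10, 64)
			return kb * 1024, err
		}
	}
	return 0, fmt.Errorf("VmRSS not found")
}

func main() {
	pid := os.Getpid() // stand-in for a spawned oasis-node PID
	r, w, _ := procIOBytes(pid)
	rss, _ := procRSSBytes(pid)
	fmt.Printf("pid %d: read=%d B written=%d B rss=%d B\n", pid, r, w, rss)
}

CPU time could be sampled analogously from /proc/<pid>/stat, and datadir space usage by walking the node's data directory.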

Later, this should be used together with other metrics in the long-term tests, where Prometheus results would be compared with those from the previous day and a warning raised if there are any major deviations.

@matevz matevz added the c:performance Category: performance label Jan 10, 2020
@kostko kostko added c:instrumentation Category: metrics and tracing c:testing Category: testing labels Jan 10, 2020
@matevz
Member Author

matevz commented Feb 14, 2020

Extend oasis-test-runner so that:

  • Each scenario will expose its valid parameters, settable from the CLI, for example:
$ oasis-test-runner list
Supported test cases:
  * basic
  * multiple-runtimes (parameters: num_compute_runtimes num_compute_runtime_txns num_compute_workers executor_group_size)
  • The user provides scenario-specific parameters, for example:
    oasis-test-runner -t multiple-runtimes --params.multiple-runtimes.num_compute_workers=2 --params.multiple-runtimes.num_runtimes=32
  • Add support for specifying combinations of parameters, for example:
    oasis-test-runner -t multiple-runtimes --params.multiple-runtimes.num_compute_workers=1,2,4 --params.multiple-runtimes.num_runtimes=16,32 would generate 6 instances of the multiple-runtimes scenario (3 × 2 combinations of the provided values); see the expansion sketch after this list.
  • A new --num_runs n flag which runs the given scenario(s) n times.
  • Prometheus integration:
    • Add --metrics.address to oasis-test-runner for an existing Prometheus server in push mode, and --metrics.push.interval for the push interval in seconds. Forward these to oasis-node and run it in push mode (see the push sketch after this list).
    • Replace --metrics.push.instance_label with a --metrics.push.labels StringToString parameter in oasis-node.
    • In addition, oasis-test-runner should call oasis-node with the following parameters: --metrics.push.job_name=<role of the node, e.g. validator-1> and --metrics.push.labels instance=<oasis-test-runnerXXXXXX>,test=<test_name>,run=<run number>,software_version=<oasis-node software version>,runtime.tee_hardware=<tee_hardware>, plus the key/value pairs of the specific parameter set used for the test.
    • Integrate and report the following new metrics in oasis-node:
      • datadir space usage,
      • I/O read/written bytes,
      • memory usage,
      • CPU usage,
      • network usage.
    • Integrate and report the following new metrics in oasis-test-runner:
      • alive (a binary metric for measuring how long the test took to finish).
  • Benchmark analysis tool:
    • Rewrite the existing bash/python scripts for querying the Prometheus server in Go (see the query sketch after this list).
    • Match source/target benchmark results (e.g. --metrics.target.git_branch=master and --metrics.source.git_branch=matevz/feature/bench compares results from my branch against master).
    • Check benchmarks for abnormalities (e.g. >110% memory used, >110% disk space used, >10% slowdown of the test compared to the previous run).
    • Check benchmarks against the previous day and report major deviations (e.g. peak memory usage for any test should be at most 5% greater than on the previous day).
    • Add a --metrics StringSlice flag to cmp for selecting specific metrics to compare:
      • time for execution time of the test
      • du for disk usage
      • io for I/O work
      • mem for memory usage
      • cpu for CPU work
      • net for network usage
    • Add threshold flags for each metric, e.g. --max_threshold.disk_usage.avg_ratio 1.1. If set to 0, the threshold is ignored.
    • Integrate all test names into ba. If no test name is provided, query them all.
    • Add support for parameter sets:
      • Query by a specific parameter (same flag as in oasis-test-runner).
      • If no parameters are provided, query and correctly compare the parameter sets.
  • Buildkite integration:
    • Add a new benchmarks Buildkite pipeline.
    • Report errors to Slack.
    • Figure out a way to make use of this pipeline when developing new features (in feature branches): the source git branch is set to $BUILDKITE_BRANCH and the target git branch to master.
    • Activate it as a daily cron job on master once it's merged.
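As referenced in the parameter-combination item above, a minimal sketch of expanding per-parameter value lists into scenario instances; the map-based representation and the function name are assumptions, not the final oasis-test-runner types:

package main

import "fmt"

// combinations expands per-parameter value lists into every assignment,
// e.g. {num_compute_workers: [1 2 4], num_runtimes: [16 32]} -> 6 maps.
func combinations(params map[string][]string) []map[string]string {
	out := []map[string]string{{}}
	for name, values := range params {
		var next []map[string]string
		for _, partial := range out {
			for _, v := range values {
				combo := map[string]string{name: v}
				for k, val := range partial {
					combo[k] = val
				}
				next = append(next, combo)
			}
		}
		out = next
	}
	return out
}

func main() {
	sets := combinations(map[string][]string{
		"num_compute_workers": {"1", "2", "4"},
		"num_runtimes":        {"16", "32"},
	})
	fmt.Printf("%d scenario instances\n", len(sets)) // prints: 6 scenario instances
	for _, s := range sets {
		fmt.Println(s)
	}
}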
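For the Prometheus push-mode integration, a sketch using the prometheus/client_golang push package; the gateway address, metric name, and label values are placeholders, and the real runner would derive them from the flags described above:

package main

import (
	"log"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

func main() {
	// "alive"-style gauge owned by the test runner.
	alive := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "oasis_test_runner_alive", // placeholder metric name
		Help: "Set to 1 while the test scenario is running.",
	})
	alive.Set(1)

	// Job name and labels mirror the --metrics.push.job_name and
	// --metrics.push.labels parameters above; all values are examples.
	pusher := push.New("http://127.0.0.1:9091", "validator-1").
		Collector(alive).
		Grouping("instance", "oasis-test-runnerXXXXXX").
		Grouping("test", "multiple-runtimes").
		Grouping("run", "0")

	// Push on an interval, like --metrics.push.interval.
	for i := 0; i < 3; i++ {
		if err := pusher.Push(); err != nil {
			log.Printf("push failed: %v", err)
		}
		time.Sleep(5 * time.Second)
	}
}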
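And for the benchmark analysis tool, a sketch of the source/target comparison with an avg_ratio threshold, using the official Prometheus HTTP API client for Go; the metric name and label matchers are illustrative only:

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

// checkAvgRatio runs two instant queries and fails if source/target
// exceeds the threshold (cf. --max_threshold.<metric>.avg_ratio);
// a threshold of 0 means the check is skipped.
func checkAvgRatio(ctx context.Context, p v1.API, srcQ, dstQ string, threshold float64) error {
	src, _, err := p.Query(ctx, srcQ, time.Now())
	if err != nil {
		return err
	}
	dst, _, err := p.Query(ctx, dstQ, time.Now())
	if err != nil {
		return err
	}
	s, ok1 := src.(model.Vector)
	d, ok2 := dst.(model.Vector)
	if !ok1 || !ok2 || len(s) == 0 || len(d) == 0 || d[0].Value == 0 {
		return fmt.Errorf("missing samples")
	}
	ratio := float64(s[0].Value) / float64(d[0].Value)
	if threshold != 0 && ratio > threshold {
		return fmt.Errorf("avg ratio %.2f exceeds threshold %.2f", ratio, threshold)
	}
	return nil
}

func main() {
	client, err := api.NewClient(api.Config{Address: "http://127.0.0.1:9090"})
	if err != nil {
		log.Fatal(err)
	}
	p := v1.NewAPI(client)
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	src := `avg(oasis_node_mem_bytes{git_branch="matevz/feature/bench"})` // illustrative query
	dst := `avg(oasis_node_mem_bytes{git_branch="master"})`
	if err := checkAvgRatio(ctx, p, src, dst, 1.1); err != nil {
		log.Fatal(err)
	}
	fmt.Println("benchmark within threshold")
}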
