Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI Benchmark discussion #2791

Open
MadVikingGod opened this issue Apr 14, 2022 · 20 comments
Open

CI Benchmark discussion #2791

MadVikingGod opened this issue Apr 14, 2022 · 20 comments
Assignees
Labels
bug Something isn't working pkg:tooling Related to the tooling package

Comments

@MadVikingGod
Copy link
Contributor

The Benchmark job is not giving us consistent results and is generally failing.

We should discuss if this is adding value enough to improve it, or if it should be removed.

Some datapoints:

  • Of the last 20 runs 2 passed
  • The benchmark names are not sufficient, they don't provide what package they are from.. eg: BenchmarkInt64Slice/KeyValue, BenchmarkString/KeyValue, and BenchmarkIterator_4
  • The failing test isn't consistent, of the last 5 runs it failed on:
    • BenchmarkHistogramSearchFloat64_1024
    • BenchmarkString/KeyValue
    • BenchmarkIterator_4
    • BenchmarkGaugeObserverObservationFloat64
    • BenchmarkInt64Slice/KeyValue
    • BenchmarkBoolSlice/Emit
    • BenchmarkInt/KeyValue
    • BenchmarkHistogramSearchInt64_1
@MadVikingGod MadVikingGod added the bug Something isn't working label Apr 14, 2022
@TylerHelmuth
Copy link
Member

I think benchmark tests are super important and are a step in the right direction towards being able to answer customer questions on expected performance and resource requirements. Performance of open telemetry was always a big topic when I was a customer of open telemetry, and it feels like it is still an open question; we ended up stress testing components ourselves.

But if the tests in their current state aren't meeting that goal and are actively prohibiting speedy development on the project then maybe it would be best to temporarily disable (or let them run but don't throw up a red x) while we re-evaluate what to do.

A similar issue was recently opened in the Collector Contrib. There was a suggestion to publish benchmark reports for every build that could then be reviewed instead of failing builds. The python repo has something similar that could be used as an example. Running on Github hardware , though, the python reports also suffer from noisy neighbors like ours do, so I don't think it is a perfect solution, but it is a step in the right direction.

@tigrannajaryan mentioned here that CNCF has provided private compute space to run the tests on, which, if its still true, would solve our noisy neighbor problem. Assuming we have our own space with which to run the tests, I think it would be great to get the benchmarks running in that space and reporting visual, reviewable results every run.

Long term and outside the scope of Go, it would be awesome to have a OpenTelemetry Project-wide performance report that could be shared. If we can standardize as a Project on a benchmark reporting solution, maybe we could have a single link with all the different OpenTelemetry components that are reporting performance results and resource requirements. With more components in OpenTelemetry becoming stable/putting out major releases, I think we need to have solid answers for customers when they ask the performance of traces/metrics/logs instrumentation or the collector.

I would love to work on this problem, whatever direction we decide to go.

@dmathieu
Copy link
Member

I agree with @TylerHelmuth entirely.
As a short term solution, we could stop failing them on pushes to main. PRs can be retried (though for some reason, I'm never seeing any failures in PRs).

@TylerHelmuth
Copy link
Member

For perspective, I went looking at what other SIGs are doing with benchmarks and this is what I found.

  • Java: Has benchmarks but not part of any CI.
  • Dotnet: Has benchmarks but not part of any CI.
  • Collector: Has benchmarks but not part of any CI.

I know Java and Dotnet ask for runs on a PR by PR basis. Dotnet publishes the results as comments in the code.

It wouldn't be unprecedented if we remove the benchmarks from our CI and instead ask for them to be run on a per-PR basis. If we do that I think we should add to the contributing guide the reasons when they would need to be run.

@MrAlias
Copy link
Contributor

MrAlias commented Apr 18, 2022

It wouldn't be unprecedented if we remove the benchmarks from our CI and instead ask for them to be run on a per-PR basis. If we do that I think we should add to the contributing guide the reasons when they would need to be run.

I'm reluctant to add overhead to the development process that in theory could be automated and performed by a computer. Especially since the benchmarks will likely not be comparable due to the unique hardware/OS the benchmarks will be run on for each developer.

If we could get the tests to run on the dedicated EC2 instance mentioned that would be amazing and the solution I would strive for. I wonder if there is a cron job we could setup on the instance and periodically run/publish benchmarks. Maybe from that a trigger could be devised so the job runs when a merge happens.

@TylerHelmuth
Copy link
Member

If we could get the tests to run on the dedicated EC2 instance mentioned that would be amazing and the solution I would strive for

Agreed, I think this is the best end state. In the meantime stopping the benchmarks from blocking CI/CD does feel ok.

@MrAlias
Copy link
Contributor

MrAlias commented Apr 19, 2022

@TylerHelmuth can you get a PR together to stop the benchmark workflow from failing the CI report?

At least for the meantime, while the long-term solution is explored.

@TylerHelmuth
Copy link
Member

Yes I can get that done tomorrow

@TylerHelmuth
Copy link
Member

During the Collector SIG today it was confirmed that the project does have hardware that could be used to run benchmarks. Based on @Aneurysm9's recommendation, I am going to investigate if it is feasible to use the boxes as private Github runners.

@TylerHelmuth
Copy link
Member

So the good news is that self-hosted runners are free, and we could use our boxes to run actions and it sounds like the integration and process for hooking into a self-hosted runner is pretty good.

The horrible news is that GitHub does not recommend using self-hosted runners for public repositories, because forks of the repo will also have their actions run on the self-hosted runner, which is a security risk. There may be ways to protect ourselves, but we're gonna have to get creative if we want to use self-hosted runners.

@TylerHelmuth
Copy link
Member

I think to use the boxes as self-hosted runners the workflows would need to live in a private repository. It seems possible to trigger workflows on one repository from a different repository, so it seems feasible that a workflow in opentelemetry-go could trigger benchmarks (and only benchmarks) in the private repository which would then run the benchmark tests. After the tests run I bet we could use the github API to post the results back to the main repository.

Seems like a lot of overhead to run benchmarks, but at the same time I want the benchmarks run lol. We should also look into options other than self-hosted runners and see if there are any good alternatives.

@tigrannajaryan
Copy link
Member

@jpkrohling can you tell more about the bare-metal boxes you said may be available? How many there, who owns them, etc.

@dmathieu
Copy link
Member

Running benchmarks on a private repository seems like a bad approach. Contributors wouldn't be able to know what we run, or when. Nor would they be able to add new benchmarks when a code change requires it.

An alternative solution could be to enable approval of PR workflows for either first-time contributors, or all of them. That would reduce the risk of having malicious actors running something inappropriate in the custom runner.

@TylerHelmuth
Copy link
Member

I think we could design the private repository in such a way that the Benchmark tests themselves are still in this repo and only the workflow definition would be in the private repo. The private repo workflow could clone this repo to get access to everything it needs, and all the results of the runs could be posted back to this repo somewhere.

But I agree, it's less than optimal. It would be much harder to maintain has only certain people would be able to troubleshoot and make changes to the job. Also would be a pretty complicated setup without a lot of transparency.

There is a section in this blog post that has an interesting read on self-hosted runners. The approval solution could work, but I think we'd have to do it for all runs; a second -time contributor could be malicious as well. I think the approval solution has a lot of human error opportunity though, like if a run-approver misses something malicious. Spread that across all the SIGs and it gets more likely that a mistake will be made.

@TylerHelmuth
Copy link
Member

As discussed during the Go SIG today, before continuing to solve how to use the boxes, we'd like to validate that the boxes will solve what we expect is a noisy neighbor problem. @jpkrohling and @tigrannajaryan, is it possible for me and/or @dmathieu to get temporary access to the boxes so that we can run the benchmark tests for Go (and load tests for collector-contrib) and verify that the benchmarks report consistently?

As part of this benchmark work I am also going to investigate what other large open source projects are doing for benchmarks, starting with Prometheus, which recently detected a memory increase in a new version using their benchmarks.

@jpkrohling
Copy link
Member

I have used a set of bare metal machines from Equinix that could be provisioned on-demand, sponsored by the CNCF. While I can try to get you all temporary access to the project I have there, the appropriate way to request access is by creating an issue here: https://github.com/cncf/cluster

Here's an issue you can use as reference: cncf/cluster#182

@tigrannajaryan
Copy link
Member

It is probably useful to have some preliminary thoughts on these before starting the actual implementation:

  • Who is going to be responsible for the ongoing maintenance? Is the answer "all core and contrib maintainers/approvers" or we will have more restrictions in place?
  • What kind of access there will be for maintainers/approvers/other contributors? Who can SSH into the machines and who cannot?

@TylerHelmuth
Copy link
Member

@jpkrohling is there a way to use this request to get access to the boxes that OpenTelemetry already has? I don't wan't to accidentally request more machines.

@jpkrohling
Copy link
Member

The machines are provisioned manually, on demand. At this moment, we have the ability to have bare metal machines, but we have none in active use.

@TylerHelmuth
Copy link
Member

Who is going to be responsible for the ongoing maintenance? Is the answer "all core and contrib maintainers/approvers" or we will have more restrictions in place?

It would be great to be able to identify a handful of maintainers of the machines, who are responsible for ensuring the boxes are available and capable of running tests. I don't think the maintainers can be responsible for ensuring any particular SIG's tool for benchmark tests are on the machine. Flushing out what "maintain" means in this context will also be important.

What kind of access there will be for maintainers/approvers/other contributors? Who can SSH into the machines and who cannot?

I think the answer to this will depend on whether or not we can find a way to use the machines as part of a workflow. If we can trigger the tests remotely then I think we should limit the number of people who can ssh in. In this scenario, I think only a few people from each SIG (maybe maintainers, maybe some combo of maintainers and approvers) should be able to ssh in to ensure the boxes have the tools necessary for the tests to trigger remotely. If we can't run the tests remotely then maybe more people will need access. I think limiting it only to approvers/maintainers is the safest though.

@TylerHelmuth
Copy link
Member

TylerHelmuth commented Apr 25, 2022

Followup on how others do benchmarking:

I found no examples of using github self-hosted runners (other than the blog I already posted). Other's investigations on this topic would be welcome.

@MrAlias MrAlias added the pkg:tooling Related to the tooling package label Oct 20, 2022
@pellared pellared moved this to Low priority in Go: Triage Nov 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working pkg:tooling Related to the tooling package
Projects
Status: Low priority
Development

No branches or pull requests

6 participants