META: Benchmarks #2919

Open

ajnavarro opened this issue Oct 7, 2024 · 9 comments

Comments

@ajnavarro
Contributor

ajnavarro commented Oct 7, 2024

Description

Benchmark Results

Related issues:

Now that we have a Benchmark MVP working, I'd like to enumerate the requests from several parties regarding the next steps:

Tasks

  • Execute Benchmarks on Pull Requests
    We had to roll back this feature due to concerns from developers about the time it takes for benchmarks to complete. Here are some proposals to mitigate this issue:

    • Run a Subset of Benchmarks: Use the -short flag to execute a smaller set of benchmarks. While this won't completely prevent regressions in the master branch, it will reduce their frequency. (See the sketch after this list.)
    • Run Benchmarks Only on Modified Packages: This approach requires further thought, as it might not be the most effective solution.
    • Limit Execution Time: Ensure that benchmark execution takes less than 10 minutes, at least on PRs.
  • Disable Non-Essential Benchmarks via Flags
    Some benchmarks, like those testing goleveldb (i.e., testing the database itself), don't need to run on every PR. Instead, we could focus on specific benchmarks that test particular cases. Here are some examples:

    • Testing speed when running the examples package
    • Launching the chain
    • Database loading speed
    • Benchdata performance
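
As a rough sketch of the -short proposal in the first task above: a heavy benchmark can gate itself on testing.Short(), so a PR run invoked with -short skips it while the full run on master still executes it. The package, benchmark name, and workload below are placeholders, not code from this repository.

```go
package bench_test

import (
	"crypto/sha256"
	"testing"
)

// Illustrative heavy benchmark. With `go test -bench=. -short` it is skipped,
// so the PR pipeline only runs the cheap subset; the full run still covers it.
func BenchmarkHeavyHashing(b *testing.B) {
	if testing.Short() {
		b.Skip("skipping heavy benchmark in -short mode")
	}
	data := make([]byte, 1<<20) // 1 MiB of input
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sha256.Sum256(data)
	}
}
```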

Benchmark Tools

Currently, two GitHub Actions can run Go benchmarks:

If none of these tools meets all our requirements, we should consider developing a new action that fits our use cases.

This is a call for comments (cc @moul and @thehowl )

@sw360cab
Contributor

sw360cab commented Oct 7, 2024

The option Execute Benchmarks on Pull Requests can be reconsidered by disabling the failOnAlert option.

@sw360cab
Contributor

sw360cab commented Oct 7, 2024

Consider #2915

@thehowl changed the title from "Benchmarks: Uber issue" to "META: Benchmarks" on Oct 7, 2024
@thehowl
Member

thehowl commented Oct 7, 2024

Run Benchmarks Only on Modified Packages: This approach requires further thought, as it might not be the most effective solution.

Yes, I'd rather we start off with a small set of "core" benchmarks which are run on PRs.

Comparisons with master using benchstat would be nice.

@thehowl
Member

thehowl commented Oct 7, 2024

Run a Subset of Benchmarks: Use the -short flag to execute a smaller set of benchmarks. While this won't completely prevent regressions in the master branch, it will reduce their frequency.

IMO this could also be just a list in the workflow YAML file; using -short means that benchmarks are by default included, which is not ideal because I'll definitely forget it when I review a PR adding a benchmark. Instead, having a list of benchmarks ensures they have to be explicitly added.

The non-essential benchmarks we can disable with build tags or flags maybe?
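
For the build-tag variant, a minimal sketch, assuming a tag named ci (the tag name, file, and benchmark are illustrative, not existing code): benchmarks placed in a file guarded by a `!ci` constraint disappear whenever the CI workflow builds with `-tags ci`.

```go
//go:build !ci

// Excluded whenever tests are built with -tags ci, e.g.
// `go test -tags ci -bench=. ./...`, so nothing in this file runs on PRs.
package bench_test

import (
	"bytes"
	"testing"
)

// Illustrative non-essential benchmark; the workload is a placeholder.
func BenchmarkNonEssentialScan(b *testing.B) {
	payload := bytes.Repeat([]byte("gno"), 1<<16)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = bytes.Count(payload, []byte("gno"))
	}
}
```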

Rest LGTM. :)

@ajnavarro
Contributor Author

IMO this could also be just a list in the workflow YAML file

That might be an extra step that not all devs will know about. That is how it was implemented before: you needed to add your benchmark to a YAML file. To reduce the hassle to a minimum, I had to:

To avoid repeating the same mistakes, I propose using a specific flag, similar to -short, that indicates whether a benchmark will run regularly or not. It could be --ci. (See the sketch below.)

@thehowl WDYT?
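
A minimal sketch of what that could look like, assuming a package-level flag named ci; the flag name, wiring, and benchmark are hypothetical, not something already in the repository.

```go
package bench_test

import (
	"flag"
	"testing"
)

// Hypothetical opt-in flag: the CI workflow would invoke
// `go test -bench=. -args -ci`, and each benchmark decides whether it
// belongs to the regular CI subset.
var ci = flag.Bool("ci", false, "run only the benchmarks marked for regular CI runs")

// A benchmark that is not part of the regular subset skips itself when -ci is set.
func BenchmarkNonEssential(b *testing.B) {
	if *ci {
		b.Skip("not part of the CI benchmark subset")
	}
	buf := make([]byte, 1<<16)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		for j := range buf {
			buf[j] = byte(i + j)
		}
	}
}
```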

@moul
Member

moul commented Oct 8, 2024

I believe we should establish two levels of benchmarks:

  1. The first level is enforced quietly and applies to all developers, including those writing contracts or fixing typos in a README. There should be no noise unless a PR triggers a global benchmark issue. In that case, the PR will become a topic of discussion regarding benchmarks. We could implement high-level benchmarks, such as running a blockchain node and processing N transactions. While this may not identify the exact cause of any slowdown, knowing that overall performance has decreased is sufficient for the core team to pause a PR until we investigate further. Developers should not have to check a box or worry about benchmarks; they should be able to continue their work without concern until a CI check or the core team blocks the merge for performance reasons.

  2. The second level is for PRs that introduce features we want to benchmark from the outset, primarily in tm2 and gnovm, or for those focused on adding benchmarks rather than features. These PRs should prioritize benchmark configuration management, including defaults for CI and local setups.

However, upon reviewing the PR history, it’s unclear if we can conclude that everyone ignored benchmarks; most PRs were simply unrelated to them (case 1). Ignoring benchmarks is definitely more problematic in case 2.


A quick win that I suggest includes:

  • Removing the checkbox from the default PR template.
  • Adding a dynamic checkbox using the GitHub Bot we discussed in Torino, which could require core team review for benchmark updates if specific folders are modified (tm2, gnovm, amino).
  • Implementing high-level integration-like benchmarks to prevent PRs from reducing TPS. If this high-level benchmark fails, the PR should be locked until we investigate. This investigation could lead to adding more specific benchmarks, fixing performance issues, or deciding to ignore the perf decrease if it’s expected.
  • Running full benchmarks against the master branch asynchronously (RFC(ci): async infra for slow non-essential checks (fuzz, bench) #2915).
  • Running a short "global" check for PRs synchronously; there’s no need to check by folders. We just need a comprehensive integration test that runs a node, loads contracts, and processes a reasonable number of transactions. This links to another topic we discussed with Morgan about having a top-level integration test that we run regularly on each PR, on master, and in production. This test should fail (often?) without clear reasons, allowing engineers to investigate further. It's a catch-all. Catch-alls are great when we know how to investigate because they are cheap and efficient.
  • Utilizing asynchronous infrastructure to run longer benchmarks and fuzzing tests primarily against the master branch and, eventually, against some PRs, particularly those touching specific folders, or possibly all if it’s efficiently asynchronous.

@ajnavarro
Contributor Author

ajnavarro commented Oct 9, 2024

Let's summarize all the requests:

Requests Summary

  • Remove the checkbox from the PR template
    Done: chore: Remove checkbox with the reminder of adding more Benchmarks. #2927
  • Add a dynamic checkbox using a GitHub Bot
    Require core team review for benchmark updates if specific folders are modified (e.g., tm2, gnovm, amino). (We will need more information about that feature. Maybe @aeddi is on it?)
  • High-level integration benchmarks
    Implement benchmarks that test integration-like scenarios. Tools like supernova could be used for this.
  • Run full benchmarks against the master branch asynchronously
    Keep running full benchmarks on the master branch to monitor ongoing performance.
  • Run a "global" check for each PR
    Integration tests that involve running a node, loading contracts, and processing transactions. (We lack a lot of definition here. I really think this is not going to give us useful information; it will just be another hassle, full of cognitive overload, to get through before code can be merged.)
  • Run async long benchmarks on some PRs eventually
    Focus on PRs touching specific paths or folders. Define how these async benchmarks will be triggered. Maybe a tag?
  • Benchmark Tools Evaluation
    Continue with github-action-benchmark. If no existing tool fits all requirements, consider developing a custom benchmark tool.

Benchmark Execution on Pull Requests

  • Run a Subset of Benchmarks
    Execute a smaller set of core benchmarks (defined explicitly in the workflow YAML rather than selected via the -short flag). I would rather specify a new flag, like -ci; see my previous comment.
  • Limit Execution Time for Benchmarks on PRs
    Ensure benchmark execution time is less than 10 minutes.
  • Comparison with Master Branch
    Use benchstat for comparing current changes with master benchmarks.

Ping me if I missed something, if a point needs more clarification, or if the output I got is wrong.

@aeddi
Contributor

aeddi commented Oct 17, 2024

Working on it right now. Let me get a PoC ready and then we’ll figure out a specific rule to meet this need. :)

@thehowl
Member

thehowl commented Oct 30, 2024

Just an update: since #3007, the current benchmark flow only runs benchdata. The execution time of the pipeline is now under a minute.

I think we can go from here. If somebody has a use for more benchmarks, they can obviously add them. But I think it's good to start from this baseline and gradually add more when a need for a benchmark running on every PR arises, rather than trying to run all the small, meaningless microbenchmarks we have.

Add a dynamic checkbox using a GitHub Bot

I'm not sure about this, I don't remember much of the discussion.

Run a "global" check for each PR

I think more than anything we can find some "crucial" paths whose performance we want to keep track of; the GnoVM benchmarks are one example of them.

We don't need to run a benchmark of all the operations described, but we can benchmark, for instance, how long VMKeeper.AddPackage or VMKeeper.Call takes.
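
To make that shape concrete, a hedged sketch using sub-benchmarks: the case names mirror the crucial paths mentioned above, but the inner workload is a stand-in; a real version would call VMKeeper.AddPackage or VMKeeper.Call with prepared messages.

```go
package vmbench_test

import (
	"strings"
	"testing"
)

// One sub-benchmark per "crucial path"; the payload work is a placeholder
// for the actual keeper calls.
func BenchmarkCrucialPaths(b *testing.B) {
	cases := []struct {
		name string
		size int
	}{
		{"AddPackage", 1 << 12},
		{"Call", 1 << 10},
	}
	for _, c := range cases {
		b.Run(c.name, func(b *testing.B) {
			payload := strings.Repeat("x", c.size)
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				_ = strings.Count(payload, "x") // placeholder for the keeper call
			}
		})
	}
}
```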

For running a node, I don't know what would be good benchmarks. I'll hand it over to the Node Commander-in-Chief @zivkovicmilos to figure out good examples to have in gno.land/tm2.

Run async long benchmarks on some PRs eventually

I don't think this is necessary if we are conservative with the benchmarks we run.

PR checks

We already have a subset, and the execution time is well below 10 minutes. We could start by enabling them again on the PRs, and then add benchstat comparisons as a nice-to-have. :)
