META: Benchmarks #2919

Open

ajnavarro opened this issue Oct 7, 2024 · 9 comments

Comments

@ajnavarro
Contributor

ajnavarro commented Oct 7, 2024

Description

Benchmark Results

Related issues:

Now that we have a Benchmark MVP working, I'd like to enumerate the requests from several parties regarding the next steps:

Tasks

  • Execute Benchmarks on Pull Requests
    We had to roll back this feature due to concerns from developers about the time it takes for benchmarks to complete. Here are some proposals to mitigate this issue:

    • Run a Subset of Benchmarks: Use the -short flag to execute a smaller set of benchmarks. While this won't completely prevent regressions in the master branch, it will reduce their frequency. (See the sketch after this list.)
    • Run Benchmarks Only on Modified Packages: This approach requires further thought, as it might not be the most effective solution.
    • Limit Execution Time: Ensure that benchmark execution takes less than 10 minutes, at least on PRs.
  • Disable Non-Essential Benchmarks via Flags
    Some benchmarks, like those testing goleveldb (i.e., testing the database itself), don't need to run on every PR. Instead, we could focus on specific benchmarks that test particular cases. Here are some examples:

    • Testing speed when running the examples package
    • Launching the chain
    • Database loading speed
    • Benchdata performance
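
As a rough sketch of the -short proposal in the first task above: a heavy benchmark can gate itself on testing.Short(), so a PR run invoked with -short skips it while the full run on master still executes it. The package, benchmark name, and workload below are placeholders, not code from this repository.

```go
package bench_test

import (
	"crypto/sha256"
	"testing"
)

// Illustrative heavy benchmark. With `go test -bench=. -short` it is skipped,
// so the PR pipeline only runs the cheap subset; the full run still covers it.
func BenchmarkHeavyHashing(b *testing.B) {
	if testing.Short() {
		b.Skip("skipping heavy benchmark in -short mode")
	}
	data := make([]byte, 1<<20) // 1 MiB of input
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sha256.Sum256(data)
	}
}
```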

Benchmark Tools

Currently, two GitHub Actions can run Go benchmarks:

If none of these tools meets all our requirements, we should consider developing a new action that fits our use cases.

This is a call for comments (cc @moul and @thehowl )

@sw360cab
Contributor

sw360cab commented Oct 7, 2024

The option Execute Benchmarks on Pull Requests can be reconsidered by disabling the failOnAlert option.

@sw360cab
Contributor

sw360cab commented Oct 7, 2024

Consider #2915

@thehowl changed the title from "Benchmarks: Uber issue" to "META: Benchmarks" on Oct 7, 2024
@thehowl
Member

thehowl commented Oct 7, 2024

Run Benchmarks Only on Modified Packages: This approach requires further thought, as it might not be the most effective solution.

Yes, I'd rather we start off with a small set of "core" benchmarks which are run on PRs.

Comparisons with master using benchstat would be nice.

@thehowl
Member

thehowl commented Oct 7, 2024

Run a Subset of Benchmarks: Use the -short flag to execute a smaller set of benchmarks. While this won't completely prevent regressions in the master branch, it will reduce their frequency.

IMO this could also be just a list in the workflow YAML file; using -short means that benchmarks are by default included, which is not ideal because I'll definitely forget it when I review a PR adding a benchmark. Instead, having a list of benchmarks ensures they have to be explicitly added.

The non-essential benchmarks we can disable with build tags or flags maybe?
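
For the build-tag variant, a minimal sketch, assuming a tag named ci (the tag name, file, and benchmark are illustrative, not existing code): benchmarks placed in a file guarded by a `!ci` constraint disappear whenever the CI workflow builds with `-tags ci`.

```go
//go:build !ci

// Excluded whenever tests are built with -tags ci, e.g.
// `go test -tags ci -bench=. ./...`, so nothing in this file runs on PRs.
package bench_test

import (
	"bytes"
	"testing"
)

// Illustrative non-essential benchmark; the workload is a placeholder.
func BenchmarkNonEssentialScan(b *testing.B) {
	payload := bytes.Repeat([]byte("gno"), 1<<16)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = bytes.Count(payload, []byte("gno"))
	}
}
```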

Rest LGTM. :)

@ajnavarro
Contributor Author

IMO this could also be just a list in the workflow YAML file

That might be an extra step that not all devs will know about. That is how it was implemented before: you needed to add your benchmark to a YAML file. To reduce the hassle to a minimum, I had to:

To avoid repeating the same mistakes, I propose using a specific flag, similar to -short, that indicates whether a benchmark will run regularly or not. It could be --ci. (See the sketch below.)

@thehowl WDYT?
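
A minimal sketch of what that could look like, assuming a package-level flag named ci; the flag name, wiring, and benchmark are hypothetical, not something already in the repository.

```go
package bench_test

import (
	"flag"
	"testing"
)

// Hypothetical opt-in flag: the CI workflow would invoke
// `go test -bench=. -args -ci`, and each benchmark decides whether it
// belongs to the regular CI subset.
var ci = flag.Bool("ci", false, "run only the benchmarks marked for regular CI runs")

// A benchmark that is not part of the regular subset skips itself when -ci is set.
func BenchmarkNonEssential(b *testing.B) {
	if *ci {
		b.Skip("not part of the CI benchmark subset")
	}
	buf := make([]byte, 1<<16)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		for j := range buf {
			buf[j] = byte(i + j)
		}
	}
}
```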

@moul
Member

moul commented Oct 8, 2024

I believe we should establish two levels of benchmarks:

  1. The first level is enforced quietly and applies to all developers, including those writing contracts or fixing typos in a README. There should be no noise unless a PR triggers a global benchmark issue. In that case, the PR will become a topic of discussion regarding benchmarks. We could implement high-level benchmarks, such as running a blockchain node and processing N transactions. While this may not identify the exact cause of any slowdown, knowing that overall performance has decreased is sufficient for the core team to pause a PR until we investigate further. Developers should not have to check a box or worry about benchmarks; they should be able to continue their work without concern until a CI check or the core team blocks the merge for performance reasons.

  2. The second level is for PRs that introduce features we want to benchmark from the outset, primarily in tm2 and gnovm, or for those focused on adding benchmarks rather than features. These PRs should prioritize benchmark configuration management, including defaults for CI and local setups.

However, upon reviewing the PR history, it’s unclear if we can conclude that everyone ignored benchmarks; most PRs were simply unrelated to them (case 1). Ignoring benchmarks is definitely more problematic in case 2.


A quick win that I suggest includes:

  • Removing the checkbox from the default PR template.
  • Adding a dynamic checkbox using the GitHub Bot we discussed in Torino, which could require core team review for benchmark updates if specific folders are modified (tm2, gnovm, amino).
  • Implementing high-level integration-like benchmarks to prevent PRs from reducing TPS. If this high-level benchmark fails, the PR should be locked until we investigate. This investigation could lead to adding more specific benchmarks, fixing performance issues, or deciding to ignore the perf decrease if it’s expected.
  • Running full benchmarks against the master branch asynchronously (RFC(ci): async infra for slow non-essential checks (fuzz, bench) #2915).
  • Running a short "global" check for PRs synchronously; there’s no need to check by folders. We just need a comprehensive integration test that runs a node, loads contracts, and processes a reasonable number of transactions. This links to another topic we discussed with Morgan about having a top-level integration test that we run regularly on each PR, on master, and in production. This test should fail (often?) without clear reasons, allowing engineers to investigate further. It's a catch-all. Catch-alls are great when we know how to investigate because they are cheap and efficient.
  • Utilizing asynchronous infrastructure to run longer benchmarks and fuzzing tests primarily against the master branch and, eventually, against some PRs, particularly those touching specific folders, or possibly all if it’s efficiently asynchronous.

@ajnavarro
Contributor Author

ajnavarro commented Oct 9, 2024

Let's summarize all the requests:

Requests Summary

  • Remove the checkbox from the PR template
    Done: chore: Remove checkbox with the reminder of adding more Benchmarks. #2927
  • Add a dynamic checkbox using a GitHub Bot
    Require core team review for benchmark updates if specific folders are modified (e.g., tm2, gnovm, amino). (We will need more information about that feature. Maybe @aeddi is on it?)
  • High-level integration benchmarks
    Implement benchmarks that test integration-like scenarios. Tools like supernova could be used for this.
  • Run full benchmarks against the master branch asynchronously
    Keep running full benchmarks on the master branch to monitor ongoing performance.
  • Run a "global" check for each PR
    Integration tests that involve running a node, loading contracts, and processing transactions. (We lack a lot of definition here. I really think this is not going to give us useful information; it will just be another hassle, full of cognitive overload, to get through before code can be merged.)
  • Run async long benchmarks on some PRs eventually
    Focus on PRs touching specific paths or folders. Define how these async benchmarks will be triggered. Maybe a tag?
  • Benchmark Tools Evaluation
    Continue with github-action-benchmark. If no existing tool fits all requirements, consider developing a custom benchmark tool.

Benchmark Execution on Pull Requests

  • Run a Subset of Benchmarks
    Execute a smaller set of core benchmarks (defined explicitly in the workflow YAML rather than selected via the -short flag). I would rather specify a new flag, like -ci; see my previous comment.
  • Limit Execution Time for Benchmarks on PRs
    Ensure benchmark execution time is less than 10 minutes.
  • Comparison with Master Branch
    Use benchstat for comparing current changes with master benchmarks.

Ping me if I missed something, if a point needs more clarification, or if the output I got is wrong.

@aeddi
Contributor

aeddi commented Oct 17, 2024

Working on it right now. Let me get a PoC ready and then we’ll figure out a specific rule to meet this need. :)

@thehowl
Member

thehowl commented Oct 30, 2024

Just an update: since #3007, the current benchmark flow only runs benchdata. The execution time of the pipeline is now under a minute.

I think we can go from here. If somebody has a use for more benchmarks, they can obviously add them. But I think it's good to start from this baseline and gradually add more when a need for a benchmark running on every PR arises, rather than trying to run all the small, meaningless microbenchmarks we have.

Add a dynamic checkbox using a GitHub Bot

I'm not sure about this, I don't remember much of the discussion.

Run a "global" check for each PR

I think more than anything we can find some "crucial" paths whose performance we want to keep track of; the GnoVM benchmarks are one example of them.

We don't need to run a benchmark of all the operations described, but we can benchmark, for instance, how long VMKeeper.AddPackage or VMKeeper.Call takes.
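
To make that shape concrete, a hedged sketch using sub-benchmarks: the case names mirror the crucial paths mentioned above, but the inner workload is a stand-in; a real version would call VMKeeper.AddPackage or VMKeeper.Call with prepared messages.

```go
package vmbench_test

import (
	"strings"
	"testing"
)

// One sub-benchmark per "crucial path"; the payload work is a placeholder
// for the actual keeper calls.
func BenchmarkCrucialPaths(b *testing.B) {
	cases := []struct {
		name string
		size int
	}{
		{"AddPackage", 1 << 12},
		{"Call", 1 << 10},
	}
	for _, c := range cases {
		b.Run(c.name, func(b *testing.B) {
			payload := strings.Repeat("x", c.size)
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				_ = strings.Count(payload, "x") // placeholder for the keeper call
			}
		})
	}
}
```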

For running a node, I don't know what would be good benchmarks. I'll hand it over to the Node Commander-in-Chief @zivkovicmilos to figure out good examples to have in gno.land/tm2.

Run async long benchmarks on some PRs eventually

I don't think this is necessary if we are conservative with the benchmarks we run.

PR checks

We already have a subset, and the execution time is well below 10 minutes. We could start by enabling them again on the PRs, and then add benchstat comparisons as a nice-to-have. :)
