
chore: update test-all-versions configs to test fewer package versions #2464

Merged
@pichlermarc merged 11 commits into open-telemetry:main from tm-reduce-tav on Oct 15, 2024

Conversation

@trentm (Contributor) commented Oct 8, 2024

The TAV tests in CI are taking 2.5h - 3h, for a total usage of over 13h
compute time. E.g.: https://github.com/open-telemetry/opentelemetry-js-contrib/actions/runs/11233652488/usage

This change reduces the number of tested versions from 839 to 317
(for a TAV run with Node.js v20). For a few of the .tav.yml files the
tested version *range* was actually increased to capture some versions
that the instrumentation supported, but were not being tested by TAV.
versions: "2.308.0 || 2.556.0 || 2.801.0 || 2.1049.0 || 2.1297.0 || 2.1546.0 || >=2.1548.0"
versions:
include: "^2.308.0"
mode: max-7

TIL that this mode exists. Nice. 👍

@pichlermarc (Member) commented Oct 9, 2024

> I don't recall how to force a TAV run on this PR.

@trentm I see you ended up finding the way to do it (re-assigning a label).

It's kind of a hack, though: the workflow is not triggered directly when we apply labels, as we don't use the token of an actual GitHub user, so the action is then performed by https://github.com/apps/github-actions instead. We could use the token from @opentelemetrybot, but then we'd need to give it triage rights on the repo - I think such a change would need to go through a request in the https://github.com/open-telemetry/community/ repo first.

@pichlermarc (Member) left a comment

Oh, looks like we're hitting some versions that are failing tests, I think we need to take care of these first before merging.

@trentm (Contributor, Author) commented Oct 9, 2024

> Oh, looks like we're hitting some versions that are failing tests

Yup, I kind of expected some possible breakage -- I guess I should have started in draft. There were a few instrumentations that supported a version range larger than what TAV had been testing. I'll work through them.

@trentm marked this pull request as draft October 9, 2024 14:49
trentm added 4 commits October 9, 2024 16:02
instr-restify: bump min-supported from 4.0.0 to 4.1.0 because of some issue with its spdy dep that is not worth digging into

Also fine-tune the restify version at which there was a slight behaviour change in error handling that broke tests: the behaviour change was in v8.2.0, not in v8.0.0.
Stop installing some '@nestjs/*' deps that aren't necessary for testing
with more recent major versions.

Also turn off nestjs logging for testing to make the output much cleaner.
I believe `@aws-sdk/client-sqs` releases 3.363.0 to 3.377.0 were bad
releases (fixed in 3.378.0).
open-telemetry#1828 (comment)
shows how `@aws-sdk/[email protected]` was a bad release.

Version 3.363.0 was when the `client-*` packages were moved to using
"migrated `@smithy` packages"
https://github.com/aws/aws-sdk-js-v3/releases/tag/v3.363.0

The comment above shows a particular failure for client-s3. I don't have
a smoking gun for the client-sqs old versions, but the tests fail on
those versions and it isn't worth digging into the reasons. Let's just
skip those versions.
@trentm (Contributor, Author) commented Oct 10, 2024

Repeating the commit messages from the last 4 commits:

  • instr-aws-sdk TAV tests: skip bad releases of client-sqs

    I believe @aws-sdk/client-sqs releases 3.363.0 to 3.377.0 were bad releases (fixed in 3.378.0). A comment on #1828 ("[ci] test-all-versions failing for @cucumber/[email protected] and @aws-sdk/[email protected]") shows how @aws-sdk/[email protected] was a bad release.

    Version 3.363.0 was when the client-* packages were moved to using "migrated @smithy packages" https://github.com/aws/aws-sdk-js-v3/releases/tag/v3.363.0

    The comment above shows a particular failure for client-s3. I don't have a smoking gun for the client-sqs old versions, but the tests fail on those versions and it isn't worth digging into the reasons. Let's just skip those versions.

  • instr-pg: drop setting IN_TAV, which stopped being used in #1393 ("feat(pg): remove support for pg v7")

  • instr-nestjs-core TAV testing: fix for @nestjs/core@9 and 10

    Stop installing some '@nestjs/*' deps that aren't necessary for testing with more recent major versions.

    Also turn off nestjs logging for testing to make the output much cleaner.

  • instr-restify: bump min-supported from 4.0.0 to 4.1.0 because of some issue with its spdy dep that is not worth digging into

    Also fine-tune the restify version at which there was a slight behaviour change in error handling that broke tests: the behaviour change was in v8.2.0, not in v8.0.0.

- first issue was instr or tests or both don't work with 8.0.0, but *do* with 8.0.3 and later
- second issue was an unrelated flakiness in pg.test.ts
@trentm (Contributor, Author) commented Oct 11, 2024

A couple issues in the pg tests:

The instrumentation (or at least the tests) don't actually work with
pg >=8.0.0 <8.0.3. These are 4yo versions, not worth digging into the failures,
just bump the min version to [email protected].

```
  1) pg
       "before all" hook for "should return an instrumentation":
     Error: Timeout of 2000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (/home/runner/work/opentelemetry-js-contrib/opentelemetry-js-contrib/plugins/node/opentelemetry-instrumentation-pg/test/pg.test.ts)
      at listOnTimeout (node:internal/timers:569:17)
      at processTimers (node:internal/timers:512:7)
```

Then there was some flakiness in pg-pool.test.ts:

```
  1) pg-pool
       "after all" hook in "pg-pool":
     Error: done() called multiple times in hook <pg-pool "after all" hook> of file /Users/trentm/tm/opentelemetry-js-contrib11/plugins/node/opentelemetry-instrumentation-pg/test/pg-pool.test.ts; in addition, done() received error: AssertionError [ERR_ASSERTION]: expected to have 1 used connection
      at process.emit (node:events:519:28)
      at processEmit (/Users/trentm/tm/opentelemetry-js-contrib11/node_modules/signal-exit/index.js:199:34)
      at process.emit (/Users/trentm/tm/opentelemetry-js-contrib11/node_modules/source-map-support/source-map-support.js:516:21)
      at process._fatalException (node:internal/process/execution:188:25)
      at throwUnhandledRejectionsMode (node:internal/process/promises:391:5)
      at processPromiseRejections (node:internal/process/promises:470:17)
      at processTicksAndRejections (node:internal/process/task_queues:96:32)
```

caused by this test:

'should generate `db.client.connection.count` and `db.client.connection.pending_requests` metrics'

It was an async test case that did not return a promise, so it could be running
after the test framework thought it was done.
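For illustration, a minimal sketch of this failure mode in Mocha (the helper and values below are made up for the example, not the actual pg-pool test code):

```ts
import * as assert from 'assert';

// Hypothetical async helper standing in for the real metrics collection.
async function collectMetrics(): Promise<number[]> {
  return new Promise(resolve => setTimeout(() => resolve([1, 0]), 50));
}

describe('pg-pool (sketch)', () => {
  // Broken: the callback neither awaits nor returns the promise, so Mocha
  // considers the test finished immediately. The assertion runs later, and a
  // failure surfaces as an unhandled rejection during a later test or hook
  // (for example an "after all" hook).
  it('flaky version', () => {
    collectMetrics().then(values => {
      assert.strictEqual(values[0], 1, 'expected to have 1 used connection');
    });
  });

  // Fixed: awaiting (or returning) the promise makes Mocha wait for it, so a
  // failure is attributed to this test case.
  it('fixed version', async () => {
    const values = await collectMetrics();
    assert.strictEqual(values[0], 1, 'expected to have 1 used connection');
  });
});
```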

@maryliag (Contributor) commented Oct 11, 2024

I was the one who created the test `should generate 'db.client.connection.count' and 'db.client.connection.pending_requests' metrics`, and saw it fail on some PRs, but I'm having a hard time fixing this locally. If you have any suggestions, I'm happy to try them out and open a PR with the fix.

I also tried to change the approach and use `done` for the test.

From:

```ts
it('should generate `db.client.connection.count` and `db.client.connection.pending_requests` metrics', async () => {
```

To:

```ts
it('should generate `db.client.connection.count` and `db.client.connection.pending_requests` metrics', done => {
  // ....
  assert.strictEqual(
    metrics[2].dataPoints[0].value,
    0,
    'expected to have 0 pending requests'
  );
  done();
```

But I got the same issue.

I created a PR with this change anyway, since I think it is a good change. Then we can discuss anything else that is missing there, and it doesn't lose the focus of this PR.
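A hedged side note on why switching to the `done` callback alone may not help (a generic sketch with a made-up helper, not the actual test): if the assertions still run inside a promise callback, a failing assertion rejects that promise instead of reaching `done`, so it again escapes as an unhandled rejection unless the rejection is forwarded.

```ts
import * as assert from 'assert';

// Hypothetical async helper standing in for the real metrics collection.
async function collectMetrics(): Promise<number[]> {
  return [0];
}

it('done-callback style (sketch)', done => {
  collectMetrics()
    .then(values => {
      assert.strictEqual(values[0], 0, 'expected to have 0 pending requests');
      done();
    })
    // Without this, a failing assertion rejects the promise, done() is never
    // called with the error, and the failure shows up as an unhandled
    // rejection elsewhere, much like before.
    .catch(done);
});
```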

@trentm (Contributor, Author) commented Oct 11, 2024

> I believe @aws-sdk/client-sqs releases 3.363.0 to 3.377.0 were bad releases (fixed in 3.378.0).

I believe we should avoid releases 3.363.0 to 3.377.0 for all @aws-sdk/client-*. The most recent test run failed on [email protected].
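As an illustration of how such a skip could be expressed (the exact range string is an assumption, not the final `.tav.yml` content): the `||` ranges used in the TAV configs are semver ranges, so the exclusion can be sanity-checked with the `semver` package:

```ts
import * as semver from 'semver';

// Hypothetical range that allows 3.x but skips the suspected-bad releases
// 3.363.0 through 3.377.0 (fixed in 3.378.0).
const range = '>=3.0.0 <3.363.0 || >=3.378.0';

console.log(semver.satisfies('3.362.0', range)); // true  (tested)
console.log(semver.satisfies('3.370.0', range)); // false (skipped, suspected-bad release)
console.log(semver.satisfies('3.378.0', range)); // true  (tested, contains the fix)
```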

@trentm marked this pull request as ready for review October 11, 2024 20:08
@trentm requested a review from pichlermarc October 11, 2024 20:08
@trentm (Contributor, Author) commented Oct 11, 2024

Now that the TAV run is completing, we can compare the compute-time usage differences.

Before: 13h 6m 31s


After: 4h 2m 43s


@pichlermarc (Member) left a comment

amazing improvement, thanks for taking the time to work on this 🚀

@pichlermarc merged commit 0341e89 into open-telemetry:main Oct 15, 2024
20 checks passed
@trentm deleted the tm-reduce-tav branch October 15, 2024 16:04