sql/logictest: TestLogic_upsert failing due to hardware overload #119907

Closed
cockroach-teamcity opened this issue Mar 5, 2024 · 8 comments
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. P-3 Issues/test failures with no fix SLA T-sql-queries SQL Queries Team X-duplicate Closed as a duplicate of another issue.

@cockroach-teamcity
Member

cockroach-teamcity commented Mar 5, 2024

pkg/sql/logictest/tests/local-vec-off/local-vec-off_test.TestLogic_upsert failed on master @ bf013ea0a5311726e65d37e8f047ce39ea2d5f10:

=== RUN   TestLogic_upsert
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestLogic_upsert280453658
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:144: cluster virtualization disabled in global scope due to issue: #76378 (expected label: C-bug)
[05:45:05] setting distsql_workmem='90583B';
    logic.go:4307: -- test log scope end --
test logs left over in: outputs.zip/logTestLogic_upsert280453658
--- FAIL: TestLogic_upsert (25.29s)
=== RUN   TestLogic_upsert/regression_35364
[05:45:09] --- progress: /var/lib/engflow/worker/work/1/exec/bazel-out/k8-fastbuild/bin/pkg/sql/logictest/tests/local-vec-off/local-vec-off_test_/local-vec-off_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/upsert: 249 statements
[05:45:16] --- progress: /var/lib/engflow/worker/work/1/exec/bazel-out/k8-fastbuild/bin/pkg/sql/logictest/tests/local-vec-off/local-vec-off_test_/local-vec-off_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/upsert: 260 statements
    logic.go:2964: 
         
        /var/lib/engflow/worker/work/1/exec/bazel-out/k8-fastbuild/bin/pkg/sql/logictest/tests/local-vec-off/local-vec-off_test_/local-vec-off_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/upsert:1262: SELECT count(*) FROM t54456
        expected success, but found
        (XXUUU) context canceled
[05:45:25] --- progress: /var/lib/engflow/worker/work/1/exec/bazel-out/k8-fastbuild/bin/pkg/sql/logictest/tests/local-vec-off/local-vec-off_test_/local-vec-off_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/upsert: 264 statements
    logic.go:2206: 
         /var/lib/engflow/worker/work/1/exec/bazel-out/k8-fastbuild/bin/pkg/sql/logictest/tests/local-vec-off/local-vec-off_test_/local-vec-off_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/upsert:1267: too many errors encountered, skipping the rest of the input
[05:45:25] --- done: /var/lib/engflow/worker/work/1/exec/bazel-out/k8-fastbuild/bin/pkg/sql/logictest/tests/local-vec-off/local-vec-off_test_/local-vec-off_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/upsert with config local-vec-off: 264 tests, 2 failures
[05:45:30] --- total progress: 264 statements
--- total: 264 tests, 2 failures
    --- FAIL: TestLogic_upsert/regression_35364 (18.62s)

Parameters:

  • attempt=1
  • run=29
  • shard=45
Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/sql-queries

This test on roachdash | Improve this report!

Jira issue: CRDB-36379

@cockroach-teamcity cockroach-teamcity added the branch-master, C-test-failure, O-robot, release-blocker, and T-sql-queries labels Mar 5, 2024
@cockroach-teamcity cockroach-teamcity added this to the 24.1 milestone Mar 5, 2024
@github-project-automation github-project-automation bot moved this to Triage in SQL Queries Mar 5, 2024
@yuzefovich
Member

I'm guessing this should be fixed by #119953

@github-project-automation github-project-automation bot moved this from Triage to Done in SQL Queries Mar 5, 2024
@rickystewart
Collaborator

Hi @yuzefovich, that PR has nothing to do with this failure. That PR is about race and deadlock tests. This is not one of those.

@rickystewart rickystewart reopened this Mar 5, 2024
@github-project-automation github-project-automation bot moved this from Done to Triage in SQL Queries Mar 5, 2024
@yuzefovich
Member

Oops, misclick.

Still, this failure doesn't really seem actionable to me, here is what we have in logs:

I240305 05:45:18.244124 25810 5@util/log/event_log.go:32 ⋮ [T1,Vsystem,n1,client=127.0.0.1:34706,hostssl,user=root] 673 ={"Timestamp":1709617517308450268,"EventType":"create_table","Statement":"CREATE TABLE ‹test›.public.‹t54456› (‹c› INT8 PRIMARY KEY)","Tag":"CREATE TABLE","User":"root","DescriptorID":146,"TableName":"‹test.public.t54456›"}
I240305 05:45:24.391026 24556 kv/kvserver/queue.go:613 ⋮ [T1,Vsystem,n1,s1,r52/1:‹/Table/5{0-1}›,raft] 674  rate limited in MaybeAdd (merge): throttled on async limiting semaphore
...
E240305 05:45:25.801364 32955 sql/gcjob/table_garbage_collection.go:263 ⋮ [T1,Vsystem,n1,job=‹SCHEMA CHANGE GC id=948787993275564033›] 676  delete range /Table/145 - /Table/146 failed: ‹context canceled›
E240305 05:45:25.801778 32955 jobs/registry.go:1637 ⋮ [T1,Vsystem,n1] 677  job 948787993275564033: running execution encountered retriable error: non-cancelable: attempted to delete table data: delete range /Table/145 - /Table/146: context canceled
...
E240305 05:45:25.801778 32955 jobs/registry.go:1637 ⋮ [T1,Vsystem,n1] 677 +Wraps: (8) context canceled
E240305 05:45:25.801778 32955 jobs/registry.go:1637 ⋮ [T1,Vsystem,n1] 677 +Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *markers.withMark (4) *withstack.withStack (5) *errutil.withPrefix (6) *withstack.withStack (7) *errutil.withPrefix (8) *errors.errorString
E240305 05:45:25.801968 32955 jobs/adopt.go:461 ⋮ [T1,Vsystem,n1] 678  job 948787993275564033: adoption completed with error non-cancelable: attempted to delete table data: delete range /Table/145 - /Table/146: context canceled
E240305 05:45:25.813995 27994 sql/logictest/logic.go:4431 ⋮ [-] 679 +‹/var/lib/engflow/worker/work/1/exec/bazel-out/k8-fastbuild/bin/pkg/sql/logictest/tests/local-vec-off/local-vec-off_test_/local-vec-off_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/upsert:1262: SELECT count(*) FROM t54456›
E240305 05:45:25.813995 27994 sql/logictest/logic.go:4431 ⋮ [-] 679 +‹expected success, but found›

Table 145 is dropped right before we create t54456 (which is Table 146). This is the local-vec-off config, so we're running only a single node. For some reason, there appears to have been an interruption of unknown origin.

I have not seen this failure mode on logic tests in TeamCity, so I'm inclined to assume it's something about the EngFlow environment.

@yuzefovich yuzefovich removed the release-blocker label Mar 6, 2024
@DrewKimball
Collaborator

"CPUUserPercent":133.9293

@DrewKimball
Collaborator

[05:45:05] setting distsql_workmem='90583B';

Maybe just due to metamorphic settings making things slow?

@mgartner
Collaborator

mgartner commented Mar 6, 2024

We're seeing a few of these "context canceled" failures in the last few days. So it seems like there's some recent change to code or infrastructure that is causing this.

@DrewKimball DrewKimball changed the title from "pkg/sql/logictest/tests/local-vec-off/local-vec-off_test: TestLogic_upsert failed" to "sql/logictest: TestLogic_upsert failing due to hardware overload" Mar 8, 2024
@DrewKimball DrewKimball added the P-3 label Mar 8, 2024
craig bot pushed a commit that referenced this issue Mar 16, 2024
120417: logictest: move large upsert tests to non-metamorphic file r=yuzefovich a=yuzefovich

We've seen a few suspicious failures around a couple of test cases in the `upsert` file that use a large number of rows, so this commit moves them into the non-metamorphic file in hopes of preventing flakes.

Additionally, I'm guessing these test cases were the reason for the recently added skip under race, so that is removed too.

Informs: #119907.

Epic: None

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
@mgartner
Collaborator

It seems like this is a duplicate of #120395, correct?

@yuzefovich
Member

I think so, closing as a duplicate of #120395.

@github-project-automation github-project-automation bot moved this from Triage to Done in SQL Queries Mar 22, 2024
@yuzefovich yuzefovich added the X-duplicate label Mar 22, 2024
@exalate-issue-sync exalate-issue-sync bot removed the X-duplicate label Mar 22, 2024
@mgartner mgartner added the X-duplicate label Mar 22, 2024
5 participants