Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql/sem/builtins: TestSerialNormalizationWithUniqueUnorderedID failed #106829

Closed
cockroach-teamcity opened this issue Jul 14, 2023 · 7 comments · Fixed by #107597
Closed

sql/sem/builtins: TestSerialNormalizationWithUniqueUnorderedID failed #106829

cockroach-teamcity opened this issue Jul 14, 2023 · 7 comments · Fixed by #107597
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). db-cy-23 O-robot Originated from a bot. T-sql-queries SQL Queries Team
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Jul 14, 2023

sql/sem/builtins.TestSerialNormalizationWithUniqueUnorderedID failed with artifacts on master @ 334e261af905918edf3ffe43e07f724e9f9b0b01:

Fatal error:

panic: test timed out after 4m55s

Stack:

goroutine 23148 [running]:
testing.(*M).startAlarm.func1()
	GOROOT/src/testing/testing.go:2036 +0x88
created by time.goFunc
	GOROOT/src/time/sleep.go:176 +0x38
Log preceding fatal error

I230714 14:54:34.610793 3066 13@kv/kvserver/replicate_queue.go:751  [T1,n1,replicate,s1,r6/1:/Table/{0-3}] 717  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
I230714 14:54:34.610885 3066 13@kv/kvserver/replicate_queue.go:751  [T1,n1,replicate,s1,r45/1:/Table/4{3-4}] 718  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
I230714 14:54:34.610975 3066 13@kv/kvserver/replicate_queue.go:751  [T1,n1,replicate,s1,r44/1:/Table/4{2-3}] 719  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
I230714 14:54:34.611078 3066 13@kv/kvserver/replicate_queue.go:751  [T1,n1,replicate,s1,r43/1:/Table/4{1-2}] 720  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
I230714 14:54:34.611178 3066 13@kv/kvserver/replicate_queue.go:751  [T1,n1,replicate,s1,r13/1:/Table/1{1-2}] 721  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
I230714 14:54:34.611276 3066 13@kv/kvserver/replicate_queue.go:751  [T1,n1,replicate,s1,r23/1:/Table/2{1-2}] 722  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
I230714 14:54:34.611369 3066 13@kv/kvserver/replicate_queue.go:751  [T1,n1,replicate,s1,r49/1:/Table/4{7-8}] 723  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
I230714 14:54:34.611462 3066 13@kv/kvserver/replicate_queue.go:751  [T1,n1,replicate,s1,r39/1:/Table/3{7-8}] 724  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
I230714 14:54:34.611553 3066 13@kv/kvserver/replicate_queue.go:751  [T1,n1,replicate,s1,r22/1:/Table/2{0-1}] 725  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
I230714 14:54:34.611643 3066 13@kv/kvserver/replicate_queue.go:751  [T1,n1,replicate,s1,r11/1:/Table/{7-8}] 726  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
I230714 14:54:34.611733 3066 13@kv/kvserver/replicate_queue.go:751  [T1,n1,replicate,s1,r51/1:/Table/5{0-1}] 727  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
I230714 14:54:34.611832 3066 13@kv/kvserver/replicate_queue.go:751  [T1,n1,replicate,s1,r31/1:/{Table/29-NamespaceTab…}] 728  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
I230714 14:54:34.611954 3066 13@kv/kvserver/replicate_queue.go:751  [T1,n1,replicate,s1,r3/1:/System/{NodeLive…-tsd}] 729  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
I230714 14:54:34.612058 3066 13@kv/kvserver/replicate_queue.go:751  [T1,n1,replicate,s1,r63/1:/Table/{62-106}] 730  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
I230714 14:54:34.612152 3066 13@kv/kvserver/replicate_queue.go:751  [T1,n1,replicate,s1,r35/1:/Table/3{3-4}] 731  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
I230714 14:54:34.612381 3066 13@kv/kvserver/replicate_queue.go:751  [T1,n1,replicate,s1,r12/1:/Table/{8-11}] 732  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
I230714 14:54:34.612536 3066 13@kv/kvserver/replicate_queue.go:751  [T1,n1,replicate,s1,r14/1:/Table/1{2-3}] 733  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
I230714 14:54:34.612636 3066 13@kv/kvserver/replicate_queue.go:751  [T1,n1,replicate,s1,r56/1:/Table/5{5-6}] 734  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
I230714 14:54:34.612733 3066 13@kv/kvserver/replicate_queue.go:751  [T1,n1,replicate,s1,r20/1:/Table/1{8-9}] 735  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
I230714 14:54:34.612826 3066 13@kv/kvserver/replicate_queue.go:751  [T1,n1,replicate,s1,r59/1:/Table/5{8-9}] 736  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
I230714 14:54:34.612920 3066 13@kv/kvserver/replicate_queue.go:751  [T1,n1,replicate,s1,r25/1:/Table/2{3-4}] 737  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
I230714 14:54:34.613016 3066 13@kv/kvserver/replicate_queue.go:751  [T1,n1,replicate,s1,r10/1:/Table/{6-7}] 738  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
I230714 14:54:34.613111 3066 13@kv/kvserver/replicate_queue.go:751  [T1,n1,replicate,s1,r42/1:/Table/4{0-1}] 739  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
I230714 14:54:34.613643 3066 13@kv/kvserver/replicate_queue.go:751  [T1,n1,replicate,s1,r1/1:/{Min-System/NodeL…}] 740  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
I230714 14:54:34.613925 3066 13@kv/kvserver/replicate_queue.go:751  [T1,n1,replicate,s1,r57/1:/Table/5{6-7}] 741  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
I230714 14:54:34.614052 3066 13@kv/kvserver/replicate_queue.go:751  [T1,n1,replicate,s1,r41/1:/Table/{39-40}] 742  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
I230714 14:54:34.621581 769 2@gossip/gossip.go:552  [T1,n1] 743  gossip status (ok, 1 node)
I230714 14:54:34.621581 769 2@gossip/gossip.go:552  [T1,n1] 743 +gossip client (0/3 cur/max conns)
I230714 14:54:34.621581 769 2@gossip/gossip.go:552  [T1,n1] 743 +gossip server (0/3 cur/max conns, infos 0/0 sent/received, bytes 0B/0B sent/received)
I230714 14:54:34.754765 772 2@util/log/event_log.go:32  [T1,n1] 744 ={"Timestamp":1689346474754759790,"EventType":"runtime_stats","MemRSSBytes":496705536,"GoroutineCount":605,"MemStackSysBytes":7143424,"GoAllocBytes":281017376,"GoTotalBytes":336684360,"GoStatsStaleness":0.001397958,"HeapFragmentBytes":14394336,"HeapReservedBytes":8183808,"HeapReleasedBytes":24674304,"CGoAllocBytes":86849984,"CGoTotalBytes":119357440,"CGoCallRate":0.09939432,"CPUUserPercent":115.19801,"CPUSysPercent":0.29818296,"GCPausePercent":0.16196814,"GCRunCount":76,"NetHostRecvBytes":23845822,"NetHostSendBytes":23845822}
I230714 14:54:34.754904 772 2@server/status/runtime_log.go:47  [T1,n1] 745  runtime stats: 474 MiB RSS, 605 goroutines (stacks: 6.8 MiB), 268 MiB/321 MiB Go alloc/total(stale) (heap fragmentation: 14 MiB, heap reserved: 7.8 MiB, heap released: 24 MiB), 83 MiB/114 MiB CGO alloc/total (0.1 CGO/sec), 115.2/0.3 %(u/s)time, 0.2 %gc (76x), 23 MiB/23 MiB (r/w)net
I230714 14:54:40.804794 20729 kv/kvserver/replica_command.go:439  [T1,n1,split,s1,r79/1:/{Table/106/1/…-Max}] 746  initiating a split of this range at key /Table/106/1/2031255669139308545 [r80] (load at key /Table/106/1/2031255669139308545 (cpu 1.9s, 13.26 batches/sec, 0.10 raft mutations/sec))
I230714 14:54:44.699811 772 2@util/log/event_log.go:32  [T1,n1] 747 ={"Timestamp":1689346484699806039,"EventType":"runtime_stats","MemRSSBytes":466800640,"GoroutineCount":605,"MemStackSysBytes":9830400,"GoAllocBytes":195329704,"GoTotalBytes":338511176,"GoStatsStaleness":0.001239617,"HeapFragmentBytes":23609688,"HeapReservedBytes":83927040,"HeapReleasedBytes":22847488,"CGoAllocBytes":86849984,"CGoTotalBytes":119357440,"CGoCallRate":0.100552574,"CPUUserPercent":88.98903,"CPUSysPercent":0.6033154,"GCRunCount":76,"NetHostRecvBytes":17837962,"NetHostSendBytes":17837962}
I230714 14:54:44.699944 772 2@server/status/runtime_log.go:47  [T1,n1] 748  runtime stats: 445 MiB RSS, 605 goroutines (stacks: 9.4 MiB), 186 MiB/323 MiB Go alloc/total(stale) (heap fragmentation: 22 MiB, heap reserved: 80 MiB, heap released: 22 MiB), 83 MiB/114 MiB CGO alloc/total (0.1 CGO/sec), 89.0/0.6 %(u/s)time, 0.0 %gc (76x), 17 MiB/17 MiB (r/w)net
I230714 14:54:45.126833 712 kv/kvserver/queue.go:616  [T1,n1,s1,r5/1:/{Systemtse-Table/0},raft] 749  rate limited in MaybeAdd (merge): throttled on async limiting semaphore
I230714 14:54:51.987381 2818 13@kv/kvserver/split/decider.go:271  [T1,n1,s1,r80/1:/{Table/106/1/…-Max}] 750  no split key found: insufficient counters = 0, imbalance = 20, most popular key occurs in 11% of samples
I230714 14:54:54.712744 772 2@util/log/event_log.go:32  [T1,n1] 751 ={"Timestamp":1689346494712739585,"EventType":"runtime_stats","MemRSSBytes":467873792,"GoroutineCount":606,"MemStackSysBytes":11993088,"GoAllocBytes":259840000,"GoTotalBytes":338699592,"GoStatsStaleness":0.003298985,"HeapFragmentBytes":12969984,"HeapReservedBytes":27918336,"HeapReleasedBytes":22724608,"CGoAllocBytes":86849984,"CGoTotalBytes":119357440,"CGoCallRate":0.09987083,"CPUUserPercent":113.45327,"CPUSysPercent":0.69909585,"GCRunCount":76,"NetHostRecvBytes":14234193,"NetHostSendBytes":14234123}
I230714 14:54:54.712899 772 2@server/status/runtime_log.go:47  [T1,n1] 752  runtime stats: 446 MiB RSS, 606 goroutines (stacks: 11 MiB), 248 MiB/323 MiB Go alloc/total(stale) (heap fragmentation: 12 MiB, heap reserved: 27 MiB, heap released: 22 MiB), 83 MiB/114 MiB CGO alloc/total (0.1 CGO/sec), 113.5/0.7 %(u/s)time, 0.0 %gc (76x), 14 MiB/14 MiB (r/w)net
I230714 14:54:54.866710 21344 kv/kvserver/replica_command.go:439  [T1,n1,split,s1,r80/1:/{Table/106/1/…-Max}] 753  initiating a split of this range at key /Table/106/1/2115698162152505345 [r81] (load at key /Table/106/1/2115698162152505345 (cpu 2s, 12.99 batches/sec, 0.09 raft mutations/sec))
I230714 14:54:59.125495 728 kv/kvserver/queue.go:616  [T1,n1,s1,r30/1:/Table/2{8-9},raft] 754  rate limited in MaybeAdd (merge): throttled on async limiting semaphore
I230714 14:55:04.722788 772 2@util/log/event_log.go:32  [T1,n1] 755 ={"Timestamp":1689346504722781405,"EventType":"runtime_stats","MemRSSBytes":462012416,"GoroutineCount":605,"MemStackSysBytes":9732096,"GoAllocBytes":201851480,"GoTotalBytes":330704200,"GoStatsStaleness":0.001327697,"HeapFragmentBytes":21093800,"HeapReservedBytes":72146944,"HeapReleasedBytes":30720000,"CGoAllocBytes":86849984,"CGoTotalBytes":119357440,"CGoCallRate":0.09989969,"CPUUserPercent":119.38013,"CPUSysPercent":0.5993981,"GCPausePercent":0.25646996,"GCRunCount":77,"NetHostRecvBytes":698596,"NetHostSendBytes":698596}
I230714 14:55:04.722937 772 2@server/status/runtime_log.go:47  [T1,n1] 756  runtime stats: 441 MiB RSS, 605 goroutines (stacks: 9.3 MiB), 192 MiB/315 MiB Go alloc/total(stale) (heap fragmentation: 20 MiB, heap reserved: 69 MiB, heap released: 29 MiB), 83 MiB/114 MiB CGO alloc/total (0.1 CGO/sec), 119.4/0.6 %(u/s)time, 0.3 %gc (77x), 682 KiB/682 KiB (r/w)net
I230714 14:55:05.964903 11573 13@kv/kvserver/split/decider.go:271  [T1,n1,s1,r81/1:/{Table/106/1/…-Max}] 757  no split key found: insufficient counters = 1, imbalance = 19, most popular key occurs in 12% of samples
I230714 14:55:10.933247 15932 kv/kvserver/replica_command.go:439  [T1,n1,split,s1,r81/1:/{Table/106/1/…-Max}] 758  initiating a split of this range at key /Table/106/1/2239969364370259969 [r82] (load at key /Table/106/1/2239969364370259969 (cpu 1.9s, 15.58 batches/sec, 0.09 raft mutations/sec))
I230714 14:55:14.709777 772 2@util/log/event_log.go:32  [T1,n1] 759 ={"Timestamp":1689346514709771103,"EventType":"runtime_stats","MemRSSBytes":436539392,"GoroutineCount":605,"MemStackSysBytes":7274496,"GoAllocBytes":143933312,"GoTotalBytes":303908168,"GoStatsStaleness":0.001343227,"HeapFragmentBytes":44744832,"HeapReservedBytes":81944576,"HeapReleasedBytes":57647104,"CGoAllocBytes":86849952,"CGoTotalBytes":119357440,"CGoCallRate":0.100130275,"CPUUserPercent":175.82875,"CPUSysPercent":0.9011724,"GCPausePercent":0.0016358683,"GCRunCount":78,"NetHostRecvBytes":22792290,"NetHostSendBytes":22792290}
I230714 14:55:14.709912 772 2@server/status/runtime_log.go:47  [T1,n1] 760  runtime stats: 416 MiB RSS, 605 goroutines (stacks: 6.9 MiB), 137 MiB/290 MiB Go alloc/total(stale) (heap fragmentation: 43 MiB, heap reserved: 78 MiB, heap released: 55 MiB), 83 MiB/114 MiB CGO alloc/total (0.1 CGO/sec), 175.8/0.9 %(u/s)time, 0.0 %gc (78x), 22 MiB/22 MiB (r/w)net
I230714 14:55:20.124788 747 kv/kvserver/queue.go:616  [T1,n1,s1,r71/1:/Table/106/1/1{23890…-33516…},raft] 761  rate limited in MaybeAdd (merge): throttled on async limiting semaphore
I230714 14:55:21.986332 22818 kv/kvserver/replica_command.go:439  [T1,n1,split,s1,r82/1:/{Table/106/1/…-Max}] 762  initiating a split of this range at key /Table/106/1/2324834069848522753 [r83] (load at key /Table/106/1/2324834069848522753 (cpu 1.9s, 14.54 batches/sec, 0.08 raft mutations/sec))
I230714 14:55:24.725506 772 2@util/log/event_log.go:32  [T1,n1] 763 ={"Timestamp":1689346524725500884,"EventType":"runtime_stats","MemRSSBytes":437600256,"GoroutineCount":605,"MemStackSysBytes":9764864,"GoAllocBytes":209501080,"GoTotalBytes":303908168,"GoStatsStaleness":0.00134913,"HeapFragmentBytes":21660776,"HeapReservedBytes":36970496,"HeapReleasedBytes":57647104,"CGoAllocBytes":86878624,"CGoTotalBytes":119484416,"CGoCallRate":0.09984294,"CPUUserPercent":121.808395,"CPUSysPercent":0.29952884,"GCRunCount":78,"NetHostRecvBytes":18100121,"NetHostSendBytes":18100121}
I230714 14:55:24.725630 772 2@server/status/runtime_log.go:47  [T1,n1] 764  runtime stats: 417 MiB RSS, 605 goroutines (stacks: 9.3 MiB), 200 MiB/290 MiB Go alloc/total(stale) (heap fragmentation: 21 MiB, heap reserved: 35 MiB, heap released: 55 MiB), 83 MiB/114 MiB CGO alloc/total (0.1 CGO/sec), 121.8/0.3 %(u/s)time, 0.0 %gc (78x), 17 MiB/17 MiB (r/w)net

Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/sql-foundations

This test on roachdash | Improve this report!

Jira issue: CRDB-29733

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) labels Jul 14, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.2 milestone Jul 14, 2023
@rafiss
Copy link
Collaborator

rafiss commented Jul 14, 2023

cc @andyyang890 I wonder if the latest change to increase the number of samples makes this more likely to timeout. If so, we can raise the timeout in the BUILD.bazel file.

@andyyang890 andyyang890 self-assigned this Jul 17, 2023
@andyyang890
Copy link
Collaborator

Noting for the record that I am able to reproduce this under stress and it looks like the query here gets stuck/times out sometimes:

tdb.QueryRow(t, `
WITH boundaries AS (
SELECT i << (64 - $1) AS p FROM ROWS FROM (generate_series(0, (1<<($1-1)) -1)) AS t (i)
UNION ALL SELECT (((1 << 62) - 1) << 1)+1 -- int63 max value
),
groups AS (
SELECT *
FROM (SELECT p AS low, lead(p) OVER () AS high FROM boundaries)
WHERE high IS NOT NULL
),
counts AS (
SELECT count(i) AS c
FROM t, groups
WHERE low <= i AND high > i
GROUP BY (low, high)
)
SELECT array_agg(c)
FROM counts;`, prefixBits).Scan(&keyCounts)

Still digging into the root cause for this.

@andyyang890
Copy link
Collaborator

Some observations from the logs below, not sure if any of these are significant.

In the test logs, I see a bunch of messages like this initially, but then they stop showing up for the last few minutes before the test times out:

I230724 21:27:04.441883 3801 sql/stats/automatic_stats.go:910  [T1,n1] 616  automatically executing "CREATE STATISTICS __auto__ FROM [5] WITH OPTIONS THROTTLING 0.9 AS OF SYSTEM TIME '-30s'"
I230724 21:27:04.450343 4892 sql/catalog/lease/descriptor_state.go:237  [T1,n1,intExec=create-stats] 617  release: 1("system") ver=1:1690234263.880271249,0, refcount=1
I230724 21:27:04.450513 4892 sql/catalog/lease/descriptor_state.go:237  [T1,n1,intExec=create-stats] 618  release: 15("jobs") ver=1:1690234255.314125364,0, refcount=0
I230724 21:27:04.515012 4892 sql/catalog/lease/descriptor_state.go:237  [T1,n1,intExec=create-stats] 619  release: 1("system") ver=1:1690234263.880271249,0, refcount=1
I230724 21:27:04.515219 4892 sql/catalog/lease/descriptor_state.go:237  [T1,n1,intExec=create-stats] 620  release: 15("jobs") ver=1:1690234255.314125364,0, refcount=0
I230724 21:27:04.515329 4892 sql/catalog/lease/descriptor_state.go:237  [T1,n1,intExec=create-stats] 621  release: 53("job_info") ver=1:1690234250.863811683,0, refcount=0
I230724 21:27:04.548552 4869 jobs/registry.go:1606  [T1,n1] 622  AUTO CREATE STATS job 885272164599726081: stepping through state running
I230724 21:27:04.667762 4869 jobs/registry.go:1606  [T1,n1] 623  AUTO CREATE STATS job 885272164599726081: stepping through state succeeded

Throughout the logs, there are also a lot of messages like these:

I230724 21:30:51.871643 13887 kv/kvserver/replica_command.go:439  [T1,n1,split,s1,r80/1:/{Table/106/1/…-Max}] 1232  initiating a split of this range at key /Table/106/1/856190933522972673 [r81] (load at key /Table/106/1/856190933522972673 (cpu 1.9s, 5.47 batches/sec, 0.07 raft mutations/sec))
I230724 21:30:54.159493 516 2@util/log/event_log.go:32  [T1,n1] 1233 ={"Timestamp":1690234254159489642,"EventType":"runtime_stats","MemRSSBytes":472907776,"GoroutineCount":547,"MemStackSysBytes":5373952,"GoAllocBytes":184244096,"GoTotalBytes":301674840,"GoStatsStaleness":0.002481387,"HeapFragmentBytes":38627456,"HeapReservedBytes":39591936,"HeapReleasedBytes":105455616,"CGoAllocBytes":86377696,"CGoTotalBytes":103919616,"CGoCallRate":0.100111745,"CPUUserPercent":160.97969,"CPUSysPercent":0.60067046,"GCPausePercent":0.28827727,"GCRunCount":156,"NetHostRecvBytes":50666,"NetHostSendBytes":53270}
I230724 21:30:54.159681 516 2@server/status/runtime_log.go:47  [T1,n1] 1234  runtime stats: 451 MiB RSS, 547 goroutines (stacks: 5.1 MiB), 176 MiB/288 MiB Go alloc/total(stale) (heap fragmentation: 37 MiB, heap reserved: 38 MiB, heap released: 101 MiB), 82 MiB/99 MiB CGO alloc/total (0.1 CGO/sec), 161.0/0.6 %(u/s)time, 0.3 %gc (156x), 50 KiB/52 KiB (r/w)net
I230724 21:28:04.027915 2688 13@kv/kvserver/replicate_queue.go:751  [T1,n1,replicate,s1,r38/1:/Table/3{6-7}] 992  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
I230724 21:28:04.028197 2688 13@kv/kvserver/replicate_queue.go:751  [T1,n1,replicate,s1,r7/1:/Table/{3-4}] 993  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
I230724 21:28:04.028397 2688 13@kv/kvserver/replicate_queue.go:751  [T1,n1,replicate,s1,r63/1:/Table/{62-106}] 994  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster
I230724 21:28:04.028601 2688 13@kv/kvserver/replicate_queue.go:751  [T1,n1,replicate,s1,r41/1:/Table/{39-40}] 995  error processing replica: 0 of 1 live stores are able to take a new replica for the range (1 already has a voter, 0 already have a non-voter); likely not enough nodes in cluster

Pure speculation but could the stats collection be taking a very long time/hanging?

@andyyang890
Copy link
Collaborator

I don't see anything else that stands out in the logs and setting sql.trace.stmt.enable_threshold to 60s did not generate a trace.

Here's the test log for the failure (which includes a stack dump) after I was able to get it to fail under stress:
test.log

@mgartner
Copy link
Collaborator

cc @andyyang890 I wonder if the latest change to increase the number of samples makes this more likely to timeout. If so, we can raise the timeout in the BUILD.bazel file.

Why would this test need more than 5 minutes? It passes in a few seconds when I run it.

Noting for the record that I am able to reproduce this under stress and it looks like the query here gets stuck/times out sometimes:

How do you know this query is timing out?

@andyyang890
Copy link
Collaborator

How do you know this query is timing out?

From the stack dump, this is one of the stuck goroutines:

goroutine 98 [IO wait, 4 minutes]:
internal/poll.runtime_pollWait(0x7fb01401f408, 0x72)
	GOROOT/src/runtime/netpoll.go:305 +0x89
internal/poll.(*pollDesc).wait(0xc005add500?, 0xc005d97500?, 0x0)
	GOROOT/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
	GOROOT/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc005add500, {0xc005d97500, 0x700, 0x700})
	GOROOT/src/internal/poll/fd_unix.go:167 +0x25a
net.(*netFD).Read(0xc005add500, {0xc005d97500?, 0xc0053a2c00?, 0x49baa01?})
	GOROOT/src/net/fd_posix.go:55 +0x29
net.(*conn).Read(0xc003e8a5a8, {0xc005d97500?, 0x5?, 0x700?})
	GOROOT/src/net/net.go:183 +0x45
crypto/tls.(*atLeastReader).Read(0xc00ae24db0, {0xc005d97500?, 0x0?, 0xc005d97500?})
	GOROOT/src/crypto/tls/conn.go:787 +0x3d
bytes.(*Buffer).ReadFrom(0xc0054b2278, {0x62b5f60, 0xc00ae24db0})
	GOROOT/src/bytes/buffer.go:202 +0x98
crypto/tls.(*Conn).readFromUntil(0xc0054b2000, {0x62c1c60?, 0xc003e8a5a8}, 0xc005d97505?)
	GOROOT/src/crypto/tls/conn.go:809 +0xe5
crypto/tls.(*Conn).readRecordOrCCS(0xc0054b2000, 0x0)
	GOROOT/src/crypto/tls/conn.go:616 +0x116
crypto/tls.(*Conn).readRecord(...)
	GOROOT/src/crypto/tls/conn.go:582
crypto/tls.(*Conn).Read(0xc0054b2000, {0xc005a99000, 0x1000, 0xc000820dc0?})
	GOROOT/src/crypto/tls/conn.go:1315 +0x16f
bufio.(*Reader).Read(0xc0052a1620, {0xc005ab3220, 0x5, 0xc00740e8a0?})
	GOROOT/src/bufio/bufio.go:237 +0x1bb
io.ReadAtLeast({0x62b5d40, 0xc0052a1620}, {0xc005ab3220, 0x5, 0x200}, 0x5)
	GOROOT/src/io/io.go:332 +0x9a
io.ReadFull(...)
	GOROOT/src/io/io.go:351
github.com/lib/pq.(*conn).recvMessage(0xc005ab3200, 0xc00740e8c8)
	github.com/lib/pq/external/com_github_lib_pq/conn.go:1004 +0xca
github.com/lib/pq.(*conn).recv1Buf(0xc005ab3200, 0xc0054b2328?)
	github.com/lib/pq/external/com_github_lib_pq/conn.go:1059 +0x2e
github.com/lib/pq.(*conn).recv1(...)
	github.com/lib/pq/external/com_github_lib_pq/conn.go:1086
github.com/lib/pq.(*conn).readBindResponse(0xc005ab3200)
	github.com/lib/pq/external/com_github_lib_pq/conn.go:1856 +0x46
github.com/lib/pq.(*stmt).exec(0xc005d04320, {0xc00ae26510, 0x1, 0x1})
	github.com/lib/pq/external/com_github_lib_pq/conn.go:1448 +0x793
github.com/lib/pq.(*conn).query(0xc005ab3200, {0x50fc7d8, 0x286}, {0xc00ae26510, 0x1, 0x1})
	github.com/lib/pq/external/com_github_lib_pq/conn.go:903 +0x312
github.com/lib/pq.(*conn).QueryContext(0xc00740ec18?, {0x62fbcd8, 0xc000082068}, {0x50fc7d8, 0x286}, {0xc008dfff80, 0x1, 0x100c00ae28060?})
	github.com/lib/pq/external/com_github_lib_pq/conn_go18.go:24 +0xd7
database/sql.ctxDriverQuery({0x62fbcd8?, 0xc000082068?}, {0x7fb00d657218?, 0xc005ab3200?}, {0x0?, 0x0?}, {0x50fc7d8?, 0x4481190?}, {0xc008dfff80, 0x1, ...})
	GOROOT/src/database/sql/ctxutil.go:48 +0xe5
database/sql.(*DB).queryDC.func1()
	GOROOT/src/database/sql/sql.go:1757 +0x175
database/sql.withLock({0x62d01d0, 0xc003309950}, 0xc00740ee60)
	GOROOT/src/database/sql/sql.go:3439 +0x8c
database/sql.(*DB).queryDC(0x1?, {0x62fbcd8, 0xc000082068}, {0x0, 0x0}, 0xc003309950, 0xc00ae26500, {0x50fc7d8, 0x286}, {0xc00ae264f0, ...})
	GOROOT/src/database/sql/sql.go:1752 +0x211
database/sql.(*DB).query(0x49806e?, {0x62fbcd8, 0xc000082068}, {0x50fc7d8, 0x286}, {0xc00ae264f0, 0x1, 0x1}, 0x2?)
	GOROOT/src/database/sql/sql.go:1735 +0xfd
database/sql.(*DB).QueryContext(0xc001359950?, {0x62fbcd8, 0xc000082068}, {0x50fc7d8, 0x286}, {0xc00ae264f0, 0x1, 0x1})
	GOROOT/src/database/sql/sql.go:1708 +0xda
database/sql.(*DB).QueryRowContext(0x7fb0126b5408?, {0x62fbcd8?, 0xc000082068?}, {0x50fc7d8?, 0xc0005c0000?}, {0xc00ae264f0?, 0x0?, 0xc00604f708?})
	GOROOT/src/database/sql/sql.go:1813 +0x37
github.com/cockroachdb/cockroach/pkg/testutils/sqlutils.(*SQLRunner).QueryRow(0xc00740f738, {0x62c2320?, 0xc00062f040}, {0x50fc7d8, 0x286}, {0xc00ae264f0, 0x1, 0x1})
	github.com/cockroachdb/cockroach/pkg/testutils/sqlutils/sql_runner.go:226 +0x114
github.com/cockroachdb/cockroach/pkg/sql/sem/builtins.TestSerialNormalizationWithUniqueUnorderedID.func1()
	github.com/cockroachdb/cockroach/pkg/sql/sem/builtins/builtins_test.go:150 +0x2c5
github.com/cockroachdb/cockroach/pkg/util/retry.WithMaxAttempts.func1({0x62fbcd8?, 0xc000082068?})
	github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:195 +0x1b
github.com/cockroachdb/cockroach/pkg/util/retry.Options.Do({0x0, 0x0, 0x0, 0x1, 0x0, 0x0}, {0x62fbcd8, 0xc000082068}, 0xc00740fec0)
	github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:172 +0x166
github.com/cockroachdb/cockroach/pkg/util/retry.WithMaxAttempts({0x62fbcd8?, 0xc000082068?}, {0x0, 0x0, 0x0, 0x1, 0x0, 0x0}, 0x8df36a0?, 0xc0006de760?)
	github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:194 +0xed
github.com/cockroachdb/cockroach/pkg/sql/sem/builtins.TestSerialNormalizationWithUniqueUnorderedID(0xc00062f040)
	github.com/cockroachdb/cockroach/pkg/sql/sem/builtins/builtins_test.go:110 +0xd0
testing.tRunner(0xc00062f040, 0x5242430)
	GOROOT/src/testing/testing.go:1446 +0x10b
created by testing.(*T).Run
	GOROOT/src/testing/testing.go:1493 +0x35f

Specifically this part of the stack trace (the line number is slightly different from master since I added a little bit of debugging code, but it reproduces under stress without the debugging code too):

github.com/cockroachdb/cockroach/pkg/sql/sem/builtins.TestSerialNormalizationWithUniqueUnorderedID.func1()
	github.com/cockroachdb/cockroach/pkg/sql/sem/builtins/builtins_test.go:150 +0x2c5

Maybe I'm interpreting what this means wrong, but I think it means this query has been running/stuck for 4 minutes?

@github-project-automation github-project-automation bot moved this to Triage in SQL Queries Jul 25, 2023
@andyyang890 andyyang890 added the T-sql-queries SQL Queries Team label Jul 25, 2023
@rafiss rafiss removed the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Jul 25, 2023
@yuzefovich
Copy link
Member

Maybe I'm interpreting what this means wrong, but I think it means this query has been running/stuck for 4 minutes?

I think your interpretation is correct - the query is running for at least 4 minutes. I don't think it's expected - locally for me it ran under 100ms. I'll investigate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). db-cy-23 O-robot Originated from a bot. T-sql-queries SQL Queries Team
Projects
Archived in project
5 participants