
o11y: error 53200 when querying jobs table on DRT cluster #119384

Closed
ajstorm opened this issue Feb 20, 2024 · 4 comments
Labels
A-cluster-observability Related to cluster observability C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-testcluster Issues found or occurred on a test cluster, i.e. a long-running internal cluster P-2 Issues/test failures with a fix SLA of 3 months

Comments


ajstorm commented Feb 20, 2024

On node 4 of the DRT cluster, we're currently unable to query the jobs table:

[screenshot: error querying the jobs table in the DB Console]

In the logs we see

W240220 13:38:07.531620 386241686 sql/crdb_internal.go:1189 ⋮ [T3,Vapplication,client-tenant=‹3›,n4,intExec=‹admin-jobs›] 299597  error closing an iterator: ‹crdb-internal-jobs-table›: ‹system-jobs-scan›: scan with start key /Tenant/3/Table/53/1: root: memory budget exceeded: 10240 bytes requested, 7991296000 currently allocated, 7991304192 bytes in budget
W240220 13:38:07.531620 386241686 sql/crdb_internal.go:1189 ⋮ [T3,Vapplication,client-tenant=‹3›,n4,intExec=‹admin-jobs›] 299597 +HINT: ‹Consider increasing --max-sql-memory startup parameter.›
E240220 13:38:07.536326 386241593 server/admin.go:2291 ⋮ [T3,Vapplication,client-tenant=‹3›,n4] 299598  ‹admin-jobs›: ‹crdb-internal-jobs-table›: ‹system-jobs-scan›: scan with start key /Tenant/3/Table/53/1: root: memory budget exceeded: 10240 bytes requested, 7991296000 currently allocated, 7991304192 bytes in budget
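The numbers in the error line are worth unpacking: the root budget was already effectively exhausted before the 10 KiB request arrived. A quick back-of-the-envelope check (plain Python, using the figures copied from the log above):

```python
# Figures taken verbatim from the "memory budget exceeded" log line above.
requested = 10_240           # bytes the scan tried to allocate
allocated = 7_991_296_000    # bytes already allocated by the root monitor
budget = 7_991_304_192       # total root budget (set via --max-sql-memory)

headroom = budget - allocated  # 8192 bytes: less than the 10240 requested
print(f"budget:    {budget / 2**30:.2f} GiB")   # ~7.44 GiB
print(f"allocated: {allocated / 2**30:.2f} GiB")
print(f"headroom:  {headroom} bytes (request was {requested})")
```

So the failing request only needed 10 KiB, but just 8 KiB of the ~7.44 GiB budget remained; whatever allocated the preceding ~7.44 GiB is the real question.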

The table can be queried successfully from the SQL shell.

I've saved off the logs on node 4 so they don't get overwritten (cockroach.cct-232-0004.ubuntu.2024-02-20T13_03_18Z.1639320.log.saved).

Jira issue: CRDB-36175


ajstorm commented Feb 20, 2024

There were 3 nodes down on the cluster at the time. Now that those nodes are back up, the page renders successfully. Would the above behaviour be expected in a cluster which is partially down? What's odd is that I was able to query the table directly when the cluster was partially down.


stevendanna commented Feb 21, 2024

> What's odd is that I was able to query the table directly when the cluster was partially down.

Were you querying system.jobs or SHOW JOBS? The latter is substantially more work than the former.

Both SHOW JOBS and the admin API that drives the jobs page are driven by crdb_internal.jobs, which results in two full scans of system.job_info. The default query used by the jobs page also sorts by created time, which will probably drive non-trivial memory usage. My hope is that ongoing work to move these from virtual tables to virtual views will help some of this.
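To illustrate the difference between the two paths, here is a sketch (the catalogs named are the real CockroachDB ones discussed above, but the exact column set varies by version):

```sql
-- Cheap: a direct scan of the system table; no payload or progress decoding.
SELECT id, status, created FROM system.jobs LIMIT 10;

-- Expensive: backed by the crdb_internal.jobs virtual table, which scans
-- system.job_info and materializes decoded state for every job. The jobs
-- page issues a similar query with an ORDER BY on the created time.
SHOW JOBS;
```

Querying system.jobs directly succeeding while the jobs page failed is therefore consistent with the memory accounting: only the virtual-table path has to buffer and decode every job.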

In this case, the problem seems to be on node 9, which had 7 GB of SQL memory allocated. Looking at the cluster currently, the typical memory used by crdb_internal.jobs is between 100 and 200 MB according to the SQL Activity page. So if it was just the jobs query that sent us to the moon, it may have to do with our behavior during errors.

[screenshot: SQL Activity page showing typical crdb_internal.jobs memory usage]

Unfortunately, we seem to have no heap profile from that time:

ubuntu@cct-232-0009:~$ ls -lart logs/heap_profiler/ | grep 'Feb 20 1'
-rw-r----- 1 ubuntu ubuntu     314 Feb 20 11:01 memmonitoring.2024-02-20T11_01_22.711.2850358552.txt
-rw-r----- 1 ubuntu ubuntu     314 Feb 20 11:01 memmonitoring.2024-02-20T11_01_32.711.3031072800.txt
-rw-r----- 1 ubuntu ubuntu     315 Feb 20 11:01 memmonitoring.2024-02-20T11_01_52.241.3195858864.txt
-rw-r----- 1 ubuntu ubuntu     315 Feb 20 11:05 memmonitoring.2024-02-20T11_05_02.611.3311617816.txt
-rw-r----- 1 ubuntu ubuntu     315 Feb 20 11:17 memmonitoring.2024-02-20T11_17_23.219.3375931448.txt
-rw-r----- 1 ubuntu ubuntu     315 Feb 20 12:17 memmonitoring.2024-02-20T12_17_26.783.2899812752.txt
-rw-r----- 1 ubuntu ubuntu     315 Feb 20 12:17 memmonitoring.2024-02-20T12_17_56.922.3340891976.txt

And our logs have also rotated away:

ubuntu@cct-232-0009:~$ ls -lart logs/cockroach.cct-232-0009.ubuntu.2024-02-2* | head -1
-rw-r----- 1 ubuntu ubuntu 10485530 Feb 20 13:39 logs/cockroach.cct-232-0009.ubuntu.2024-02-20T13_38_45Z.2020030.log

So that is probably the end of what I can look at quickly.

yuzefovich (Member) commented

This has also been discussed on Slack. I agree with Steven that at this point we don't have the profiles, logs, etc. to say definitively what was using SQL memory, but the error itself seems expected given that some operation used up 7 GiB of RAM on one node.

What would have helped is having the memmonitoring profiler enabled in the application tenant. Currently we disable all "runtime stats" profilers in the shared-process config, which makes sense for the proper runtime-stats profilers, but the memory accounting profiler dumps the state of the SQL memory accounting system, so we only get the system tenant's view, which is insufficient. (This profiler was added recently in #114275.)


ajstorm commented Feb 22, 2024

Thanks for the investigation @stevendanna and @yuzefovich. I'm closing this issue due to the lack of debug info. I've opened #119530 to track the enabling of the memmonitoring profiler in the application tenant.

@ajstorm ajstorm closed this as completed Feb 22, 2024