
o11y: error 53200 when querying jobs table on DRT cluster #119384

Closed
ajstorm opened this issue Feb 20, 2024 · 4 comments
Labels
A-cluster-observability Related to cluster observability C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-testcluster Issues found or occurred on a test cluster, i.e. a long-running internal cluster P-2 Issues/test failures with a fix SLA of 3 months

Comments


ajstorm commented Feb 20, 2024

On node 4 of the DRT cluster, we're currently unable to query the jobs table:

[screenshot: error querying the jobs table in the DB Console]

In the logs we see

W240220 13:38:07.531620 386241686 sql/crdb_internal.go:1189 ⋮ [T3,Vapplication,client-tenant=‹3›,n4,intExec=‹admin-jobs›] 299597  error closing an iterator: ‹crdb-internal-jobs-table›: ‹system-jobs-scan›: scan with start key /Tenant/3/Table/53/1: root: memory budget exceeded: 10240 bytes requested, 7991296000 currently allocated, 7991304192 bytes in budget
W240220 13:38:07.531620 386241686 sql/crdb_internal.go:1189 ⋮ [T3,Vapplication,client-tenant=‹3›,n4,intExec=‹admin-jobs›] 299597 +HINT: ‹Consider increasing --max-sql-memory startup parameter.›
E240220 13:38:07.536326 386241593 server/admin.go:2291 ⋮ [T3,Vapplication,client-tenant=‹3›,n4] 299598  ‹admin-jobs›: ‹crdb-internal-jobs-table›: ‹system-jobs-scan›: scan with start key /Tenant/3/Table/53/1: root: memory budget exceeded: 10240 bytes requested, 7991296000 currently allocated, 7991304192 bytes in budget
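The numbers in the error line are worth unpacking: the root budget was already effectively exhausted before the 10 KiB request arrived. A quick back-of-the-envelope check (plain Python, using the figures copied from the log above):

```python
# Figures taken verbatim from the "memory budget exceeded" log line above.
requested = 10_240           # bytes the scan tried to allocate
allocated = 7_991_296_000    # bytes already allocated by the root monitor
budget = 7_991_304_192       # total root budget (set via --max-sql-memory)

headroom = budget - allocated  # 8192 bytes: less than the 10240 requested
print(f"budget:    {budget / 2**30:.2f} GiB")   # ~7.44 GiB
print(f"allocated: {allocated / 2**30:.2f} GiB")
print(f"headroom:  {headroom} bytes (request was {requested})")
```

So the failing request only needed 10 KiB, but just 8 KiB of the ~7.44 GiB budget remained; whatever allocated the preceding ~7.44 GiB is the real question.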

The table can be queried successfully from the SQL shell.

I've saved off the logs on node 4 so they don't get overwritten (cockroach.cct-232-0004.ubuntu.2024-02-20T13_03_18Z.1639320.log.saved).

Jira issue: CRDB-36175


ajstorm commented Feb 20, 2024

There were 3 nodes down on the cluster at the time. Now that those nodes are back up, the page renders successfully. Would the above behaviour be expected in a cluster which is partially down? What's odd is that I was able to query the table directly when the cluster was partially down.


stevendanna commented Feb 21, 2024

> What's odd is that I was able to query the table directly when the cluster was partially down.

Were you querying system.jobs or SHOW JOBS? The latter is substantially more work than the former.

Both SHOW JOBS and the admin API that drives the jobs page are driven by crdb_internal.jobs, which results in two full scans of system.job_info. The default query used by the jobs page also sorts by created time, which will probably drive non-trivial memory usage. My hope is that ongoing work to move these from virtual tables to virtual views will help some of this.
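To illustrate the difference between the two paths, here is a sketch (the catalogs named are the real CockroachDB ones discussed above, but the exact column set varies by version):

```sql
-- Cheap: a direct scan of the system table; no payload or progress decoding.
SELECT id, status, created FROM system.jobs LIMIT 10;

-- Expensive: backed by the crdb_internal.jobs virtual table, which scans
-- system.job_info and materializes decoded state for every job. The jobs
-- page issues a similar query with an ORDER BY on the created time.
SHOW JOBS;
```

Querying system.jobs directly succeeding while the jobs page failed is therefore consistent with the memory accounting: only the virtual-table path has to buffer and decode every job.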

In this case, the problem seems to be on node 9, which had 7 GB of SQL memory allocated. Looking at the cluster currently, the typical memory used by crdb_internal.jobs is between 100 and 200 MB according to the SQL Activity page. So if it was just the jobs query that sent us to the moon, it may have to do with our behavior during errors.

[screenshot: SQL Activity page showing typical crdb_internal.jobs memory usage]

Unfortunately, we seem to have no heap profile from that time:

ubuntu@cct-232-0009:~$ ls -lart logs/heap_profiler/ | grep 'Feb 20 1'
-rw-r----- 1 ubuntu ubuntu     314 Feb 20 11:01 memmonitoring.2024-02-20T11_01_22.711.2850358552.txt
-rw-r----- 1 ubuntu ubuntu     314 Feb 20 11:01 memmonitoring.2024-02-20T11_01_32.711.3031072800.txt
-rw-r----- 1 ubuntu ubuntu     315 Feb 20 11:01 memmonitoring.2024-02-20T11_01_52.241.3195858864.txt
-rw-r----- 1 ubuntu ubuntu     315 Feb 20 11:05 memmonitoring.2024-02-20T11_05_02.611.3311617816.txt
-rw-r----- 1 ubuntu ubuntu     315 Feb 20 11:17 memmonitoring.2024-02-20T11_17_23.219.3375931448.txt
-rw-r----- 1 ubuntu ubuntu     315 Feb 20 12:17 memmonitoring.2024-02-20T12_17_26.783.2899812752.txt
-rw-r----- 1 ubuntu ubuntu     315 Feb 20 12:17 memmonitoring.2024-02-20T12_17_56.922.3340891976.txt

And our logs have also rotated away:

ubuntu@cct-232-0009:~$ ls -lart logs/cockroach.cct-232-0009.ubuntu.2024-02-2* | head -1
-rw-r----- 1 ubuntu ubuntu 10485530 Feb 20 13:39 logs/cockroach.cct-232-0009.ubuntu.2024-02-20T13_38_45Z.2020030.log

So that is probably the end of what I can look at quickly.

yuzefovich (Member) commented

This has also been discussed on Slack. I agree with Steven that at this point we don't have the profiles, logs, etc. to say definitively what was using SQL memory, but the error itself seems expected given that some operation used up 7 GiB of RAM on one node.

What would have helped is having the memmonitoring profiler enabled in the application tenant. Currently we disable all "runtime stats" profilers in the shared-process config, which makes sense for the proper runtime-stats profilers, but the memory accounting profiler dumps the state of the SQL memory accounting system, so we only get the system tenant's view, which is insufficient. (This profiler was added recently in #114275.)


ajstorm commented Feb 22, 2024

Thanks for the investigation @stevendanna and @yuzefovich. I'm closing this issue due to the lack of debug info. I've opened #119530 to track the enabling of the memmonitoring profiler in the application tenant.

@ajstorm ajstorm closed this as completed Feb 22, 2024