
sql,mon: expose the memory monitors as virtual table #97657

Merged
merged 1 commit into cockroachdb:master from monitor-tree on Mar 1, 2023

Conversation

yuzefovich
Member

@yuzefovich yuzefovich commented Feb 24, 2023

This commit adjusts our memory monitors to be able to traverse the whole
monitor tree starting at the root monitor. In particular, this commit
introduces a doubly-linked list of "siblings" and stores a reference
to the head in the parent. Whenever a monitor is Started, it becomes
the new head of its parent's children list; whenever a monitor is
Stopped, it is removed from that list. The overhead of this additional
tracking should be negligible: the parent's lock only needs to be
acquired twice throughout the lifetime of a monitor (so, assuming
relatively long-lived sessions, this doesn't affect the root monitor),
and the increase in allocations is minor.

This required clarification on how locks on a parent and a child can be
held at the same time. In particular, since the main code path is
acquiring locks "upwards" (meaning when growing the child's budget we
might need to grow the parent's budget, and "growing" locks the
corresponding monitor), whenever we want to traverse the tree from the
root down, we have to unlock the parent's monitor before recursing into
the children. As a result, the traversal might give us an inconsistent
view (a recently stopped child can still contribute to the usage of its
parent even though we no longer recurse into it). This seems acceptable.
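
Continuing the illustrative sketch above, a traversal that respects this locking discipline snapshots the children under the monitor's lock and releases the lock before recursing (the callback signature here is hypothetical):

```go
// traverse visits m and its descendants top-down. The lock is dropped before
// recursing, so the reported values may be mutually inconsistent: a child
// stopped mid-traversal can still have contributed to its parent's usage.
func (m *monitor) traverse(
	level int, visit func(name string, level int, used int64) error,
) error {
	m.mu.Lock()
	used := m.mu.used
	// Snapshot the children while holding the lock...
	var children []*monitor
	for c := m.mu.head; c != nil; c = c.nextSibling {
		children = append(children, c)
	}
	// ...but release it before recursing: the main code path locks "upwards"
	// (child, then parent), so holding m's lock while locking a child could
	// deadlock.
	m.mu.Unlock()
	if err := visit(m.name, level, used); err != nil {
		return err
	}
	for _, c := range children {
		if err := c.traverse(level+1, visit); err != nil {
			return err
		}
	}
	return nil
}
```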

This ability to traverse the whole monitor tree is now exposed as a new
virtual table crdb_internal.node_memory_monitors, which includes a row
for each monitor active at the time of table generation (subject to the
possible inconsistency mentioned above). The table includes the names of
the monitors, which can be suggestive of the activity on the cluster;
thus, access to this table is gated behind the "view activity"
permissions. Exposing the memory monitor information as a virtual table
flattens the tree; however, one of the fields is a "level" (or
"generation") relative to the root, and the rows are emitted in a
specific order, so we can still format the output to see the hierarchy.
We also assign IDs to the monitors (their pointer addresses). Having
this information in a virtual table lets us analyze it with SQL.

Here is one example of visualizing it:

```
root@localhost:26257/defaultdb> SELECT repeat('    ', level) || name || ' ' || crdb_internal.humanize_bytes(used) FROM crdb_internal.node_memory_monitors;
                             ?column?
-------------------------------------------------------------------
  root 0 B
      internal-planner.‹root›.‹resume-job-101› 0 B
      internal-planner.‹node›.‹resume-job-100› 0 B
      internal-planner.‹node›.‹resume-job-842810460057567233› 0 B
      sql 900 KiB
          session root 20 KiB
              txn 10 KiB
                  flow e595eb80 10 KiB
              session 0 B
              txn-fingerprint-id-cache 0 B
          internal SQL executor 0 B
          internal SQL executor 0 B
          internal sql executor 0 B
          conn 105 KiB
          internal SQL executor 70 KiB
          internal SQL executor 60 KiB
          SQLStats 540 KiB
          SQLStats 0 B
      distsql 0 B
      server-cache-mon 0 B
      bulk-mon 0 B
          backup-mon 0 B
          backfill-mon 0 B
      pre-conn 105 KiB
      closed-session-cache 190 KiB
      timeseries-results 0 B
      timeseries-workers 0 B
      kv-mem 20 KiB
          rangefeed-monitor 0 B
          rangefeed-system-monitor 0 B
(30 rows)
```

There are a couple of additional minor improvements:

  • We now include the short FlowID in the flow's memory monitor name.
    Combined with the distsql_flows virtual table, this will let us get the
    stmt fingerprint for the remote flows running on a node.
  • A new crdb_internal.humanize_bytes builtin function is introduced (see
    the example right below).
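
For illustration, the new builtin on its own (an exact power-of-two input is used so the output is unambiguous):

```sql
SELECT crdb_internal.humanize_bytes(10240); -- '10 KiB'
```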

Note that the corresponding cluster_memory_monitors virtual table is
not introduced out of caution. In particular, this would lead to RPCs
issued to all nodes in the cluster, and since each node can have on the
order of hundreds of thousands of monitors, the response to each RPC could
have non-trivial network cost. We can revisit this decision later if we
find that a cluster level view of the memory monitors is desirable, but
for now a node level view seems like a big improvement on its own.

Addresses: #35097.
Fixes: #90551.

Release note (sql change): New internal virtual table
crdb_internal.node_memory_monitors is introduced. It exposes all of the
current reservations with the memory accounting system on a single node.
Access to the table requires VIEWACTIVITY or VIEWACTIVITYREDACTED
permissions.

@cockroach-teamcity
Member

This change is Reviewable

@yuzefovich yuzefovich force-pushed the monitor-tree branch 7 times, most recently from 9062116 to b750279 on February 25, 2023 00:58
@yuzefovich yuzefovich marked this pull request as ready for review February 25, 2023 00:59
@yuzefovich yuzefovich requested a review from a team February 25, 2023 00:59
@yuzefovich yuzefovich requested review from a team as code owners February 25, 2023 00:59
@yuzefovich yuzefovich requested a review from a team February 25, 2023 00:59
@yuzefovich yuzefovich requested a review from a team as a code owner February 25, 2023 00:59
@yuzefovich
Member Author

I ended up exposing the data as a virtual table; thanks @michae2 for the idea! I didn't run any performance benchmarks because I'm pretty confident the overhead should be negligible; however, I can do that if someone thinks it'd be worth it. Also, it seems pretty safe for a backport (perhaps after a short baking period on master).

Member Author

@yuzefovich yuzefovich left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball, @michae2, and @rharding6373)


-- commits line 86 at r1:
I initially included the requirement for VIEWACTIVITY permissions mainly out of performance considerations (in a previous iteration, when traversing the tree we'd hold the parent's lock until all children were fully traversed, which could be quite costly since it would lock the root monitor; however, that turned out to lead to deadlocks, so I refactored the logic), so I'm curious what others think about whether this is necessary. Names of the monitors don't contain any PII but can contain things like "sort" or "hashjoiner", which is suggestive of the type of activity.

Collaborator

@rharding6373 rharding6373 left a comment


Very nice work! :lgtm:

Reviewed 25 of 25 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball, @michae2, and @yuzefovich)


-- commits line 86 at r1:

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I initially included the requirement for VIEWACTIVITY permissions mainly out of performance considerations (in a previous iteration, when traversing the tree we'd hold the parent's lock until all children were fully traversed, which could be quite costly since it would lock the root monitor; however, that turned out to lead to deadlocks, so I refactored the logic), so I'm curious what others think about whether this is necessary. Names of the monitors don't contain any PII but can contain things like "sort" or "hashjoiner", which is suggestive of the type of activity.

This seems like a reasonable permission to have.


pkg/util/mon/bytes_usage.go line 280 at r1 (raw file):

}

// ExportTree exports the current state of the tree of monitors rooted in mm.

nit: s/mm/this BytesMonitor

(imo other uses of mm in the comments are fine, but this one was confusing)


pkg/util/mon/bytes_usage.go line 319 at r1 (raw file):

	entries := append([]ExportEntry{}, entry)
	for _, c := range children {
		entries = append(entries, c.exportTree(level+1)...)

If one of the children is stopped, then we'll append nil here; is that desired behavior? It might make more sense to check for nil and only append non-nil values to entries.

Member Author

@yuzefovich yuzefovich left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @michae2, and @rharding6373)


-- commits line 86 at r1:

Previously, rharding6373 (Rachael Harding) wrote…

This seems like a reasonable permission to have.

Ack, let's keep this gate then.


pkg/util/mon/bytes_usage.go line 319 at r1 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

If one of the children is stopped, then we'll append nil here; is that desired behavior? It might make more sense to check for nil and only append non-nil values to entries.

If a child is stopped, then exportTree will return a nil slice, so this append will become a no-op. I added some more comments and expanded the test to ensure that only non-empty ExportEntry objects are returned by ExportTree.
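
Roughly, the idea, sketched against the illustrative monitor type from the first code sketch above (not the actual code):

```go
// ExportEntry here is a trimmed-down, hypothetical version of the real type.
type ExportEntry struct {
	Level int
	Name  string
	Used  int64
}

// exportTree returns nil for an already-stopped monitor, so the parent's
// append of the returned slice is a no-op.
func (m *monitor) exportTree(level int) []ExportEntry {
	m.mu.Lock()
	if m.mu.stopped {
		m.mu.Unlock()
		return nil
	}
	entry := ExportEntry{Level: level, Name: m.name, Used: m.mu.used}
	var children []*monitor
	for c := m.mu.head; c != nil; c = c.nextSibling {
		children = append(children, c)
	}
	m.mu.Unlock()
	entries := []ExportEntry{entry}
	for _, c := range children {
		entries = append(entries, c.exportTree(level+1)...)
	}
	return entries
}
```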

Collaborator

@michae2 michae2 left a comment


This is great! I think this will make a huge difference for observability of memory usage. Really nice.

Reviewed 1 of 25 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @rharding6373, and @yuzefovich)


pkg/sql/crdb_internal.go line 7503 at r2 (raw file):

	comment: `node-level table listing all currently active memory monitors`,
	schema: `
CREATE TABLE crdb_internal.memory_monitors (

random thought: should this be node_memory_monitors? (And should there be a corresponding cluster_memory_monitors?)


pkg/sql/sem/builtins/builtins.go line 7782 at r2 (raw file):

				return tree.NewDString(string(humanizeutil.IBytes(int64(b)))), nil
			},
			Info:       "Converts integer memory size (in bytes) into the human-readable form.",

nit: remove "memory" (this could be used for disk space or network usage or anything else measured in bytes).


pkg/util/mon/bytes_usage.go line 319 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

If a child is stopped, then exportTree will return a nil slice, so this append will become a no-op. I added some more comments and expanded the test to ensure that only non-empty ExportEntry objects are returned by ExportTree.

Because this is recursive, it will end up copying each entry level times to build the final level-0 slice. And we don't really need the full slice at once to create the virtual table anyway, we just need something to iterate over. For these reasons it might be worth changing ExportTree to take a closure instead of returning a slice, and that closure could be called with each ExportEntry.


pkg/util/mon/bytes_usage.go line 206 at r2 (raw file):

		// code path (reserveBytes() and releaseBytes()) might acquire the
		// parent's lock, so only locking "upwards" is allowed while keeping the
		// current monitor's lock.

Good catch!


pkg/util/mon/bytes_usage.go line 308 at r2 (raw file):

	entry := ExportEntry{
		Level:            level,
		Name:             string(mm.name),

suggestion: It might be nice to more formally expose the hierarchy using an ID and a parent ID, so that we could use recursive CTEs when querying crdb_internal.memory_monitors. (Maybe these IDs could be the addresses of the monitors cast to uint64?)
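
As an example of the access pattern this would enable (a sketch, using the id/parent_id columns suggested here and the table's final name), a recursive CTE can pull out a subtree:

```sql
-- List every monitor under the 'sql' monitor with its usage.
WITH RECURSIVE subtree AS (
    SELECT id, name, level, used
    FROM crdb_internal.node_memory_monitors
    WHERE name = 'sql'
  UNION ALL
    SELECT m.id, m.name, m.level, m.used
    FROM crdb_internal.node_memory_monitors AS m
    JOIN subtree AS s ON m.parent_id = s.id
)
SELECT repeat('  ', level) || name AS monitor,
       crdb_internal.humanize_bytes(used) AS usage
FROM subtree;
```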

Collaborator

@michae2 michae2 left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @rharding6373, and @yuzefovich)


pkg/util/mon/bytes_usage.go line 207 at r2 (raw file):

		// parent's lock, so only locking "upwards" is allowed while keeping the
		// current monitor's lock.
		head *BytesMonitor

random trivia: You might already know about this, but if you like linked lists, Linux has a really nice implementation of an intrusive, circular doubly-linked list which you might enjoy. Because it is circular (and because of various tricks possible in C) it doesn't need any conditionals for most list operations.

@yuzefovich yuzefovich force-pushed the monitor-tree branch 3 times, most recently from 6d40705 to 1af565f on February 28, 2023 22:20
Member Author

@yuzefovich yuzefovich left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @michae2, and @rharding6373)


pkg/sql/crdb_internal.go line 7503 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

random thought: should this be node_memory_monitors? (And should there be a corresponding cluster_memory_monitors?)

Renamed, but I'm a bit afraid of introducing the cluster level table. In particular, since each query can have dozens of monitors, a single node can have on the order of hundreds of thousands of them. Traversing that tree locally is cheap (especially so given the callback implementation of the traversal), but for the cluster level view we'd have to materialize all monitors into a single RPC response (or we'd need to implement some pagination scheme) for each remote node, so the network cost can be substantial. I want this to be backported, so I'm leaning on the side of caution here. In the future if we see that the cluster level view is desirable, we can still add it, but for now node level view seems like a big improvement on its own.


pkg/util/mon/bytes_usage.go line 319 at r1 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Because this is recursive, it will end up copying each entry level times to build the final level-0 slice. And we don't really need the full slice at once to create the virtual table anyway, we just need something to iterate over. For these reasons it might be worth changing ExportTree to take a closure instead of returning a slice, and that closure could be called with each ExportEntry.

That's a very good point. I briefly thought about this too but decided it wasn't a big deal. I just wrote a microbenchmark, and these allocations do add up: given the exponential growth in the number of monitors in a tree, the copying can get huge! I like your idea about the callback, thanks, done.


pkg/util/mon/bytes_usage.go line 207 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

random trivia: You might already know about this, but if you like linked lists, Linux has a really nice implementation of an intrusive, circular doubly-linked list which you might enjoy. Because it is circular (and because of various tricks possible in C) it doesn't need any conditionals for most list operations.

Wow, I couldn't decipher the code without the second link, that was an interesting read, thanks!


pkg/util/mon/bytes_usage.go line 308 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

suggestion: It might be nice to more formally expose the hierarchy using an ID and a parent ID, so that we could use recursive CTEs when querying crdb_internal.memory_monitors. (Maybe these IDs could be the addresses of the monitors cast to uint64?)

Thanks, I like it, done (with the minor modification that it's exposed as int64: the virtual table column has the INT8 type, so it seems cleaner to always use the corresponding int64 Go type).


pkg/util/mon/bytes_usage.go line 479 at r3 (raw file):

	mm.mu.maxAllocated = 0
	mm.mu.curBudget = pool.MakeBoundAccount()
	mm.mu.stopped = false

This line is why the active queries weren't showing up during the demo earlier today: the "txn" monitor is stopped and then started for each txn.

Collaborator

@michae2 michae2 left a comment


Looks good, just a few comments

Reviewed 8 of 25 files at r1, 15 of 17 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @rharding6373, and @yuzefovich)


pkg/cli/zip_table_registry.go line 672 at r3 (raw file):

			"reserved_reserved",
		},
	},

This is going to be so useful!!


pkg/sql/crdb_internal.go line 7503 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Renamed, but I'm a bit afraid of introducing the cluster level table. In particular, since each query can have dozens of monitors, a single node can have on the order of hundreds of thousands of them. Traversing that tree locally is cheap (especially so given the callback implementation of the traversal), but for the cluster level view we'd have to materialize all monitors into a single RPC response (or we'd need to implement some pagination scheme) for each remote node, so the network cost can be substantial. I want this to be backported, so I'm leaning on the side of caution here. In the future if we see that the cluster level view is desirable, we can still add it, but for now node level view seems like a big improvement on its own.

Thank you! Makes sense. (Yes, this is definitely a big improvement on its own!)

It looks like most other node_ tables start with a node_id column. I think we should follow that convention, too.


pkg/util/mon/bytes_usage.go line 340 at r3 (raw file):

	mm.mu.Unlock()
	if err := monitorStateCb(monitorState); err != nil {
		return err

smart


pkg/sql/logictest/testdata/logic_test/table line 602 at r3 (raw file):

leases                                 NULL
lost_descriptors_with_data             NULL
memory_monitors                        NULL

node_memory_monitors

Member Author

@yuzefovich yuzefovich left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @michae2, and @rharding6373)


pkg/sql/crdb_internal.go line 7503 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Thank you! Makes sense. (Yes, this is definitely a big improvement on its own!)

It looks like most other node_ tables start with a node_id column. I think we should follow that convention, too.

node_id seems redundant now since we only have the node-level view. We can always change the schema later if/when adding the cluster-level view. Do you see any benefits from having this column before that?

I think in other cases where we have node- and cluster-level views, we just kept the schema the same for both variants by default, even though we don't need the node_id column in the node-level view. I'm a big fan of consistency with existing precedents, but for some reason I'm leaning towards not following this particular convention since it doesn't make much sense to me and isn't fully applicable to this virtual table. Thoughts?


pkg/sql/logictest/testdata/logic_test/table line 602 at r3 (raw file):

Previously, michae2 (Michael Erickson) wrote…

node_memory_monitors

Done.

Collaborator

@michae2 michae2 left a comment


:lgtm_strong: Nice work!

Reviewed 1 of 17 files at r3, 3 of 3 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball and @rharding6373)


pkg/sql/crdb_internal.go line 7503 at r2 (raw file):

We can always change the schema later if/when adding the cluster-level view. Do you see any benefits from having this column before that?

Good point. The only benefit I see is consistency with other node_ tables. We can leave it.

@yuzefovich
Member Author

TFTRs!

bors r+

@craig
Contributor

craig bot commented Mar 1, 2023

Build succeeded.

@craig craig bot merged commit c3348b0 into cockroachdb:master Mar 1, 2023
@blathers-crl

blathers-crl bot commented Mar 1, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from b6d04fd to blathers/backport-release-22.2-97657: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
