sql,mon: expose the memory monitors as virtual table #97657
Conversation
Force-pushed from 9062116 to b750279.
I ended up exposing the data as a virtual table, thanks @michae2 for the idea! I didn't run any performance benchmarks because I'm pretty confident the overhead should be negligible; however, I can do that if someone thinks it'd be worth it. Also it seems pretty safe for a backport (perhaps after a short baking period on master).
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball, @michae2, and @rharding6373)
-- commits
line 86 at r1:
I initially included the requirement for VIEWACTIVITY permissions mainly out of performance considerations (in a previous iteration, when traversing the tree we'd hold the parent's lock until all children are fully traversed, which can be quite costly since it would lock the root monitor; however, that turned out to lead to deadlocks, so I refactored the logic), so I'm curious what others think whether this is necessary or not. Names of the monitors don't contain any PII but can contain things like "sort" or "hashjoiner" which is suggestive about the type of activity.
Reviewed 25 of 25 files at r1.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball, @michae2, and @yuzefovich)
Previously, yuzefovich (Yahor Yuzefovich) wrote…
I initially included the requirement for VIEWACTIVITY permissions mainly out of performance considerations (in a previous iteration, when traversing the tree we'd hold the parent's lock until all children are fully traversed, which can be quite costly since it would lock the root monitor; however, that turned out to lead to deadlocks, so I refactored the logic), so I'm curious what others think whether this is necessary or not. Names of the monitors don't contain any PII but can contain things like "sort" or "hashjoiner" which is suggestive about the type of activity.
This seems like a reasonable permission to have.
pkg/util/mon/bytes_usage.go
line 280 at r1 (raw file):
```
}

// ExportTree exports the current state of the tree of monitors rooted in mm.
```
nit: s/mm/this BytesMonitor (imo other uses of `mm` in the comments are fine, but this one was confusing)
pkg/util/mon/bytes_usage.go
line 319 at r1 (raw file):
```
entries := append([]ExportEntry{}, entry)
for _, c := range children {
	entries = append(entries, c.exportTree(level+1)...)
```
If one of the children is stopped, then we'll append nil here; is that desired behavior? It might make more sense to check if nil and only append non-nil values to `entries`.
Force-pushed from b750279 to 9439e6c.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @michae2, and @rharding6373)
Previously, rharding6373 (Rachael Harding) wrote…
This seems like a reasonable permission to have.
Ack, let's keep this gate then.
pkg/util/mon/bytes_usage.go
line 319 at r1 (raw file):
Previously, rharding6373 (Rachael Harding) wrote…
If one of the children is stopped, then we'll append nil here, is that desired behavior? It might make more sense to check if nil and only append non-nil values to `entries`.
If a child is stopped, then `exportTree` will return a `nil` slice, so this `append` will become a no-op. I added some more comments and expanded the test to ensure that only non-empty `ExportEntry` objects are returned by `ExportTree`.
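For context, the no-op behavior relies on a Go guarantee: appending a nil slice with `...` adds no elements. A tiny standalone sketch (the helper name is hypothetical, just mirroring the pattern in `exportTree`):

```go
package main

import "fmt"

// appendChildren mirrors the pattern in exportTree: child may be a nil
// slice (e.g. from a stopped child), and appending it with ... is a no-op.
func appendChildren(entries []string, child []string) []string {
	return append(entries, child...)
}

func main() {
	entries := []string{"root"}
	var stopped []string // a stopped child exports nil
	entries = appendChildren(entries, stopped)
	fmt.Println(len(entries)) // still 1: appending a nil slice adds nothing
}
```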
This is great! I think this will make a huge difference for observability of memory usage. Really nice.
Reviewed 1 of 25 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @rharding6373, and @yuzefovich)
pkg/sql/crdb_internal.go
line 7503 at r2 (raw file):
```
comment: `node-level table listing all currently active memory monitors`,
schema: `
CREATE TABLE crdb_internal.memory_monitors (
```
random thought: should this be `node_memory_monitors`? (And should there be a corresponding `cluster_memory_monitors`?)
pkg/sql/sem/builtins/builtins.go
line 7782 at r2 (raw file):
```
return tree.NewDString(string(humanizeutil.IBytes(int64(b)))), nil
},
Info: "Converts integer memory size (in bytes) into the human-readable form.",
```
nit: remove "memory" (this could be used for disk space or network usage or anything else measured in bytes).
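For readers unfamiliar with what the builtin does: it formats a byte count using IEC units. A rough standalone sketch of that behavior (the function name and exact rounding/spacing are assumptions, not the real `humanizeutil.IBytes`):

```go
package main

import "fmt"

// humanizeBytes formats a byte count with IEC units (KiB, MiB, ...),
// approximating what a helper like humanizeutil.IBytes produces.
func humanizeBytes(b int64) string {
	const unit = 1024
	if b < unit {
		return fmt.Sprintf("%d B", b)
	}
	div, exp := int64(unit), 0
	for n := b / unit; n >= unit; n /= unit {
		div *= unit
		exp++
	}
	return fmt.Sprintf("%.1f %ciB", float64(b)/float64(div), "KMGTPE"[exp])
}

func main() {
	fmt.Println(humanizeBytes(512), humanizeBytes(10240), humanizeBytes(1<<20))
}
```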
pkg/util/mon/bytes_usage.go
line 319 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
If a child is stopped, then `exportTree` will return a `nil` slice, so this `append` will become a no-op. I added some more comments and expanded the test to ensure that only non-empty `ExportEntry` objects are returned by `ExportTree`.
Because this is recursive, it will end up copying each entry `level` times to build the final level-0 slice. And we don't really need the full slice at once to create the virtual table anyway, we just need something to iterate over. For these reasons it might be worth changing `ExportTree` to take a closure instead of returning a slice, and that closure could be called with each `ExportEntry`.
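The closure-based traversal being suggested can be sketched like this (`node` and `visitTree` are hypothetical stand-ins, not the real `BytesMonitor` API): each node is visited exactly once with no per-level slice copying, and an error from the callback stops the walk.

```go
package main

import "fmt"

// node is a stand-in for a monitor in the tree.
type node struct {
	name     string
	children []*node
}

// visitTree calls cb once per node instead of materializing a slice,
// avoiding the O(depth) re-copying a slice-returning version incurs.
func visitTree(n *node, level int, cb func(level int, name string) error) error {
	if err := cb(level, n.name); err != nil {
		return err
	}
	for _, c := range n.children {
		if err := visitTree(c, level+1, cb); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	root := &node{name: "root", children: []*node{
		{name: "sql", children: []*node{{name: "txn"}}},
	}}
	_ = visitTree(root, 0, func(level int, name string) error {
		fmt.Printf("%*s%s\n", level, "", name) // indent by level
		return nil
	})
}
```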
pkg/util/mon/bytes_usage.go
line 206 at r2 (raw file):
```
// code path (reserveBytes() and releaseBytes()) might acquire the
// parent's lock, so only locking "upwards" is allowed while keeping the
// current monitor's lock.
```
Good catch!
pkg/util/mon/bytes_usage.go
line 308 at r2 (raw file):
```
entry := ExportEntry{
	Level: level,
	Name:  string(mm.name),
```
suggestion: It might be nice to more formally expose the hierarchy using an ID and a parent ID, so that we could use recursive CTEs when querying `crdb_internal.memory_monitors`. (Maybe these IDs could be the addresses of the monitors cast to `uint64`?)
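The address-as-ID idea can be illustrated with a short sketch (hypothetical `monitor` type; relies on `unsafe`, and uses a signed integer per the later discussion about the table's `INT8` column):

```go
package main

import (
	"fmt"
	"unsafe"
)

// monitor is a stand-in with just enough structure to show the idea.
type monitor struct {
	name   string
	parent *monitor
}

// monitorID derives an ID from the monitor's address; it is unique among
// live monitors and stable for a monitor's lifetime.
func monitorID(m *monitor) int64 {
	if m == nil {
		return 0 // the root's parent ID
	}
	return int64(uintptr(unsafe.Pointer(m)))
}

func main() {
	root := &monitor{name: "root"}
	child := &monitor{name: "sql", parent: root}
	// Each row could expose (id, parent_id), enabling recursive CTEs.
	fmt.Println(monitorID(child) != 0, monitorID(child.parent) == monitorID(root))
}
```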
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @rharding6373, and @yuzefovich)
pkg/util/mon/bytes_usage.go
line 207 at r2 (raw file):
```
// parent's lock, so only locking "upwards" is allowed while keeping the
// current monitor's lock.
head *BytesMonitor
```
random trivia: You might already know about this, but if you like linked lists, Linux has a really nice implementation of an intrusive, circular doubly-linked list which you might enjoy. Because it is circular (and because of various tricks possible in C) it doesn't need any conditionals for most list operations.
Force-pushed from 6d40705 to 1af565f.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @michae2, and @rharding6373)
pkg/sql/crdb_internal.go
line 7503 at r2 (raw file):
Previously, michae2 (Michael Erickson) wrote…
random thought: should this be `node_memory_monitors`? (And should there be a corresponding `cluster_memory_monitors`?)
Renamed, but I'm a bit afraid of introducing the cluster level table. In particular, since each query can have dozens of monitors, a single node can have on the order of hundreds of thousands of them. Traversing that tree locally is cheap (especially so given the callback implementation of the traversal), but for the cluster level view we'd have to materialize all monitors into a single RPC response (or we'd need to implement some pagination scheme) for each remote node, so the network cost can be substantial. I want this to be backported, so I'm leaning on the side of caution here. In the future if we see that the cluster level view is desirable, we can still add it, but for now node level view seems like a big improvement on its own.
pkg/util/mon/bytes_usage.go
line 319 at r1 (raw file):
Previously, michae2 (Michael Erickson) wrote…
Because this is recursive, it will end up copying each entry `level` times to build the final level-0 slice. And we don't really need the full slice at once to create the virtual table anyway, we just need something to iterate over. For these reasons it might be worth changing `ExportTree` to take a closure instead of returning a slice, and that closure could be called with each `ExportEntry`.
That's a very good point. I briefly thought about this too but decided it wasn't a big deal. I just wrote a microbenchmark, and these allocations do add up - the exponential growth in the number of monitors in a tree can make it huge! I like your idea about the callback, thanks, done.
pkg/util/mon/bytes_usage.go
line 207 at r2 (raw file):
Previously, michae2 (Michael Erickson) wrote…
random trivia: You might already know about this, but if you like linked lists, Linux has a really nice implementation of an intrusive, circular doubly-linked list which you might enjoy. Because it is circular (and because of various tricks possible in C) it doesn't need any conditionals for most list operations.
Wow, I couldn't decipher the code without the second link, that was an interesting read, thanks!
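For the curious, the Linux-style structure mentioned above can be rendered in Go as a minimal sketch (not CockroachDB code): an intrusive, circular doubly-linked list with a sentinel head, where insert and remove need no nil checks because the empty list points at itself.

```go
package main

import "fmt"

// listHead is an intrusive, circular doubly-linked list node in the style
// of Linux's list_head: a sentinel points at itself when the list is empty.
type listHead struct {
	prev, next *listHead
}

func (h *listHead) init()       { h.prev, h.next = h, h }
func (h *listHead) empty() bool { return h.next == h }

// addFront inserts n right after the sentinel; no conditionals needed.
func (h *listHead) addFront(n *listHead) {
	n.prev, n.next = h, h.next
	h.next.prev = n
	h.next = n
}

// remove unlinks n; circularity makes the empty/end cases disappear.
func (n *listHead) remove() {
	n.prev.next = n.next
	n.next.prev = n.prev
	n.prev, n.next = n, n
}

func main() {
	var head, a, b listHead
	head.init()
	head.addFront(&a)
	head.addFront(&b) // list order: head -> b -> a
	a.remove()
	fmt.Println(head.empty(), head.next == &b) // false true
}
```

In C the same trick extends further (e.g. `container_of` to recover the enclosing struct), which is part of what makes the kernel version so compact.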
pkg/util/mon/bytes_usage.go
line 308 at r2 (raw file):
Previously, michae2 (Michael Erickson) wrote…
suggestion: It might be nice to more formally expose the hierarchy using an ID and a parent ID, so that we could use recursive CTEs when querying `crdb_internal.memory_monitors`. (Maybe these IDs could be the addresses of the monitors cast to `uint64`?)
Thanks, I like it, done (the only minor modification is that it's exposed as `int64`, since the virtual table has `INT8` type, so it seems cleaner to just always use the corresponding `int64` Go type).
pkg/util/mon/bytes_usage.go
line 479 at r3 (raw file):
```
mm.mu.maxAllocated = 0
mm.mu.curBudget = pool.MakeBoundAccount()
mm.mu.stopped = false
```
This line is why during the demo earlier today the active queries weren't showing up. The `"txn"` monitor is stopped and then started for each txn.
Looks good, just a few comments
Reviewed 8 of 25 files at r1, 15 of 17 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @rharding6373, and @yuzefovich)
pkg/cli/zip_table_registry.go
line 672 at r3 (raw file):
```
"reserved_reserved",
},
},
```
This is going to be so useful!!
pkg/sql/crdb_internal.go
line 7503 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Renamed, but I'm a bit afraid of introducing the cluster level table. In particular, since each query can have dozens of monitors, a single node can have on the order of hundreds of thousands of them. Traversing that tree locally is cheap (especially so given the callback implementation of the traversal), but for the cluster level view we'd have to materialize all monitors into a single RPC response (or we'd need to implement some pagination scheme) for each remote node, so the network cost can be substantial. I want this to be backported, so I'm leaning on the side of caution here. In the future if we see that the cluster level view is desirable, we can still add it, but for now node level view seems like a big improvement on its own.
Thank you! Makes sense. (Yes, this is definitely a big improvement on its own!)

It looks like most other `node_` tables start with a `node_id` column. I think we should follow that convention, too.
pkg/util/mon/bytes_usage.go
line 340 at r3 (raw file):
```
mm.mu.Unlock()
if err := monitorStateCb(monitorState); err != nil {
	return err
```
smart
pkg/sql/logictest/testdata/logic_test/table
line 602 at r3 (raw file):
```
leases                      NULL
lost_descriptors_with_data  NULL
memory_monitors             NULL
```
`node_memory_monitors`
This commit adjusts our memory monitors to be able to traverse the whole monitor tree starting at the root monitor. In particular, this commit introduces a doubly-linked list of "siblings" and stores the reference to the head in the parent. Whenever a monitor is `Start`ed, it is included as the new head of its parent's children list; whenever a monitor is `Stop`ped, it is removed from that list. The overhead of this additional tracking should be negligible since only the parent's lock needs to be acquired twice throughout the lifetime of a monitor (thus, assuming relatively long-lived sessions, this wouldn't affect the root monitor) and the increase in allocations is minor.

This required clarification on how locks on a parent and a child can be held at the same time. In particular, since the main code path is acquiring locks "upwards" (meaning when growing the child's budget we might need to grow the parent's budget, and "growing" locks the corresponding monitor), whenever we want to traverse the tree from the root down, we have to unlock the parent's monitor before recursing into the children. As a result, the traversal might give us an inconsistent view (where a recently stopped child can contribute to the usage of the parent while we don't recurse into that child). This seems acceptable.

This ability to traverse the whole monitor tree is now exposed as a new virtual table `crdb_internal.node_memory_monitors` which includes a line for each monitor active at the time of table generation (subject to the possible inconsistency mentioned above). The table includes the names of the monitors, which can be suggestive about the activity on the cluster; thus, access to this table is gated on the "view activity" permissions.

The usage of the virtual table to expose the memory monitors information results in flattening of the tree; however, one of the fields is a "level" (or "generation") in relation to the root, plus the ordering of rows is very specific, so we can still format the output to see the hierarchy. We also assign IDs to the monitors (which is their pointer address). Exposing this information as a virtual table allows us to use SQL to analyze it. Here is one example of visualizing it:

```
[email protected]:26257/defaultdb> SELECT repeat(' ', level) || name || ' ' || crdb_internal.humanize_bytes(used)
  FROM crdb_internal.node_memory_monitors;
                            ?column?
-------------------------------------------------------------------
  root 0 B
   internal-planner.‹root›.‹resume-job-101› 0 B
   internal-planner.‹node›.‹resume-job-100› 0 B
   internal-planner.‹node›.‹resume-job-842810460057567233› 0 B
   sql 900 KiB
    session root 20 KiB
     txn 10 KiB
      flow e595eb80 10 KiB
    session 0 B
    txn-fingerprint-id-cache 0 B
    internal SQL executor 0 B
    internal SQL executor 0 B
    internal sql executor 0 B
    conn 105 KiB
     internal SQL executor 70 KiB
     internal SQL executor 60 KiB
    SQLStats 540 KiB
    SQLStats 0 B
   distsql 0 B
   server-cache-mon 0 B
   bulk-mon 0 B
    backup-mon 0 B
    backfill-mon 0 B
   pre-conn 105 KiB
   closed-session-cache 190 KiB
   timeseries-results 0 B
   timeseries-workers 0 B
   kv-mem 20 KiB
   rangefeed-monitor 0 B
   rangefeed-system-monitor 0 B
(30 rows)
```

There are a couple of additional minor improvements:
- we now include the short FlowID in the flow's memory monitor name. Combined with the distsql_flows virtual table we'll be able to get the stmt fingerprint for the remote flows running on a node.
- new `crdb_internal.humanize_bytes` builtin function is introduced.

Note that the corresponding `cluster_memory_monitors` virtual table is not introduced, out of caution. In particular, this would lead to RPCs issued to all nodes in the cluster, and since each node can have on the order of hundreds of thousands of monitors, the response to each RPC could have non-trivial network cost. We can revisit this decision later if we find that a cluster level view of the memory monitors is desirable, but for now a node level view seems like a big improvement on its own.

Release note (sql change): New internal virtual table `crdb_internal.node_memory_monitors` is introduced. It exposes all of the current reservations with the memory accounting system on a single node. Access to the table requires VIEWACTIVITY or VIEWACTIVITYREDACTED permissions.
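The sibling-list bookkeeping described in the commit message can be sketched roughly like this (field and method names are hypothetical; the real `BytesMonitor` also carries budgets and per-monitor locks):

```go
package main

import "fmt"

// monNode sketches just the tree-linking part of a memory monitor: each
// parent keeps the head of an intrusive doubly-linked list of children,
// maintained on start/stop.
type monNode struct {
	name             string
	parent           *monNode
	head             *monNode // first child
	prevSib, nextSib *monNode // siblings under the same parent
}

// start links m in as the new head of parent's children list.
func (m *monNode) start(parent *monNode) {
	m.parent = parent
	m.nextSib = parent.head
	if parent.head != nil {
		parent.head.prevSib = m
	}
	parent.head = m
}

// stop unlinks m from its parent's children list; thanks to the prev
// pointer this is O(1), touching only the parent on head updates.
func (m *monNode) stop() {
	if m.prevSib != nil {
		m.prevSib.nextSib = m.nextSib
	} else {
		m.parent.head = m.nextSib
	}
	if m.nextSib != nil {
		m.nextSib.prevSib = m.prevSib
	}
	m.prevSib, m.nextSib, m.parent = nil, nil, nil
}

// names lists the current children, newest first.
func names(parent *monNode) []string {
	var out []string
	for c := parent.head; c != nil; c = c.nextSib {
		out = append(out, c.name)
	}
	return out
}

func main() {
	root := &monNode{name: "root"}
	a := &monNode{name: "sql"}
	b := &monNode{name: "txn"}
	a.start(root)
	b.start(root) // the newest child becomes the head
	fmt.Println(names(root)) // [txn sql]
	a.stop()
	fmt.Println(names(root)) // [txn]
}
```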
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @michae2, and @rharding6373)
pkg/sql/crdb_internal.go
line 7503 at r2 (raw file):
Previously, michae2 (Michael Erickson) wrote…
Thank you! Makes sense. (Yes, this is definitely a big improvement on its own!)

It looks like most other `node_` tables start with a `node_id` column. I think we should follow that convention, too.
`node_id` seems redundant now since we only have the node level view. We can always change the schema later if/when adding the cluster level view. Do you see any benefits from having this column before that?
I think in other cases where we have node and cluster level views, we just kept the schema the same for both variants by default even though we don't need the `node_id` column in the node level view case. I'm a big fan of consistency with existing precedents, but for some reason I'm leaning towards not following this particular convention since it doesn't make much sense to me and is not fully applicable to this virtual table. Thoughts?
pkg/sql/logictest/testdata/logic_test/table
line 602 at r3 (raw file):
Previously, michae2 (Michael Erickson) wrote…
node_memory_monitors
Done.
Reviewed 1 of 17 files at r3, 3 of 3 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball and @rharding6373)
pkg/sql/crdb_internal.go
line 7503 at r2 (raw file):
We can always change the schema later if/when adding the cluster level view. Do you see any benefits from having this column before that?

Good point. The only benefit I see is consistency with other `node_` tables. We can leave it.
TFTRs! bors r+
Build succeeded.
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.

error creating merge commit from b6d04fd to blathers/backport-release-22.2-97657: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Addresses: #35097.
Fixes: #90551.