
sql,mon: expose the memory monitors as virtual table #97657

Merged
merged 1 commit into cockroachdb:master from monitor-tree on Mar 1, 2023

Conversation

yuzefovich
Member

@yuzefovich yuzefovich commented Feb 24, 2023

This commit adjusts our memory monitors to be able to traverse the whole
monitor tree starting at the root monitor. In particular, this commit
introduces a doubly-linked list of "siblings" and stores a reference
to the head in the parent. Whenever a monitor is Started, it becomes
the new head of its parent's children list; whenever a monitor is
Stopped, it is removed from that list. The overhead of this additional
tracking should be negligible: the parent's lock only needs to be
acquired twice throughout the lifetime of a monitor (so, assuming
relatively long-lived sessions, this doesn't affect the root monitor),
and the increase in allocations is minor.

This required clarification on how locks on a parent and a child can be
held at the same time. In particular, since the main code path is
acquiring locks "upwards" (meaning when growing the child's budget we
might need to grow the parent's budget, and "growing" locks the
corresponding monitor), whenever we want to traverse the tree from the
root down, we have to unlock the parent's monitor before recursing into
the children. As a result, the traversal might give us an inconsistent
view (a recently stopped child can still contribute to the usage of its
parent even though we no longer recurse into it). This seems acceptable.
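
Continuing the illustrative sketch above, a traversal that respects this locking discipline snapshots the children under the monitor's lock and releases the lock before recursing (the callback signature here is hypothetical):

```go
// traverse visits m and its descendants top-down. The lock is dropped before
// recursing, so the reported values may be mutually inconsistent: a child
// stopped mid-traversal can still have contributed to its parent's usage.
func (m *monitor) traverse(
	level int, visit func(name string, level int, used int64) error,
) error {
	m.mu.Lock()
	used := m.mu.used
	// Snapshot the children while holding the lock...
	var children []*monitor
	for c := m.mu.head; c != nil; c = c.nextSibling {
		children = append(children, c)
	}
	// ...but release it before recursing: the main code path locks "upwards"
	// (child, then parent), so holding m's lock while locking a child could
	// deadlock.
	m.mu.Unlock()
	if err := visit(m.name, level, used); err != nil {
		return err
	}
	for _, c := range children {
		if err := c.traverse(level+1, visit); err != nil {
			return err
		}
	}
	return nil
}
```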

This ability to traverse the whole monitor tree is now exposed as a new
virtual table crdb_internal.node_memory_monitors, which includes a row
for each monitor active at the time of table generation (subject to the
possible inconsistency mentioned above). The table includes the names of
the monitors, which can be suggestive of the activity on the cluster;
thus, access to this table is gated behind the "view activity"
permissions. Exposing the memory monitor information as a virtual table
flattens the tree; however, one of the fields is a "level" (or
"generation") relative to the root, and the rows are emitted in a
specific order, so we can still format the output to see the hierarchy.
We also assign IDs to the monitors (their pointer addresses). Having
this information in a virtual table lets us analyze it with SQL.

Here is one example of visualizing it:

```
root@localhost:26257/defaultdb> SELECT repeat('    ', level) || name || ' ' || crdb_internal.humanize_bytes(used) FROM crdb_internal.node_memory_monitors;
                             ?column?
-------------------------------------------------------------------
  root 0 B
      internal-planner.‹root›.‹resume-job-101› 0 B
      internal-planner.‹node›.‹resume-job-100› 0 B
      internal-planner.‹node›.‹resume-job-842810460057567233› 0 B
      sql 900 KiB
          session root 20 KiB
              txn 10 KiB
                  flow e595eb80 10 KiB
              session 0 B
              txn-fingerprint-id-cache 0 B
          internal SQL executor 0 B
          internal SQL executor 0 B
          internal sql executor 0 B
          conn 105 KiB
          internal SQL executor 70 KiB
          internal SQL executor 60 KiB
          SQLStats 540 KiB
          SQLStats 0 B
      distsql 0 B
      server-cache-mon 0 B
      bulk-mon 0 B
          backup-mon 0 B
          backfill-mon 0 B
      pre-conn 105 KiB
      closed-session-cache 190 KiB
      timeseries-results 0 B
      timeseries-workers 0 B
      kv-mem 20 KiB
          rangefeed-monitor 0 B
          rangefeed-system-monitor 0 B
(30 rows)
```

There are a couple of additional minor improvements:

  • We now include the short FlowID in the flow's memory monitor name.
    Combined with the distsql_flows virtual table, this will let us get the
    stmt fingerprint for the remote flows running on a node.
  • A new crdb_internal.humanize_bytes builtin function is introduced (see
    the example right below).
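
For illustration, the new builtin on its own (an exact power-of-two input is used so the output is unambiguous):

```sql
SELECT crdb_internal.humanize_bytes(10240); -- '10 KiB'
```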

Note that the corresponding cluster_memory_monitors virtual table is
not introduced out of caution. In particular, this would lead to RPCs
issued to all nodes in the cluster, and since each node can have on the
order of hundreds of thousands of monitors, the response to each RPC could
have non-trivial network cost. We can revisit this decision later if we
find that a cluster level view of the memory monitors is desirable, but
for now a node level view seems like a big improvement on its own.

Addresses: #35097.
Fixes: #90551.

Release note (sql change): New internal virtual table
crdb_internal.node_memory_monitors is introduced. It exposes all of the
current reservations with the memory accounting system on a single node.
Access to the table requires VIEWACTIVITY or VIEWACTIVITYREDACTED
permissions.

@cockroach-teamcity
Member

This change is Reviewable

@yuzefovich yuzefovich force-pushed the monitor-tree branch 7 times, most recently from 9062116 to b750279 on February 25, 2023 00:58
@yuzefovich yuzefovich marked this pull request as ready for review February 25, 2023 00:59
@yuzefovich yuzefovich requested a review from a team February 25, 2023 00:59
@yuzefovich yuzefovich requested review from a team as code owners February 25, 2023 00:59
@yuzefovich yuzefovich requested a review from a team February 25, 2023 00:59
@yuzefovich yuzefovich requested a review from a team as a code owner February 25, 2023 00:59
@yuzefovich
Member Author

I ended up exposing the data as a virtual table; thanks @michae2 for the idea! I didn't run any performance benchmarks because I'm pretty confident the overhead should be negligible; however, I can do that if someone thinks it'd be worth it. Also, it seems pretty safe for a backport (perhaps after a short baking period on master).

Member Author

@yuzefovich yuzefovich left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball, @michae2, and @rharding6373)


-- commits line 86 at r1:
I initially included the requirement for VIEWACTIVITY permissions mainly out of performance considerations (in a previous iteration, when traversing the tree we'd hold the parent's lock until all children were fully traversed, which could be quite costly since it would lock the root monitor; however, that turned out to lead to deadlocks, so I refactored the logic), so I'm curious what others think about whether this is necessary. Names of the monitors don't contain any PII but can contain things like "sort" or "hashjoiner", which is suggestive of the type of activity.

Collaborator

@rharding6373 rharding6373 left a comment


Very nice work! :lgtm:

Reviewed 25 of 25 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball, @michae2, and @yuzefovich)


-- commits line 86 at r1:

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I initially included the requirement for VIEWACTIVITY permissions mainly out of performance considerations (in a previous iteration, when traversing the tree we'd hold the parent's lock until all children were fully traversed, which could be quite costly since it would lock the root monitor; however, that turned out to lead to deadlocks, so I refactored the logic), so I'm curious what others think about whether this is necessary. Names of the monitors don't contain any PII but can contain things like "sort" or "hashjoiner", which is suggestive of the type of activity.

This seems like a reasonable permission to have.


pkg/util/mon/bytes_usage.go line 280 at r1 (raw file):

}

// ExportTree exports the current state of the tree of monitors rooted in mm.

nit: s/mm/this BytesMonitor

(imo other uses of mm in the comments are fine, but this one was confusing)


pkg/util/mon/bytes_usage.go line 319 at r1 (raw file):

	entries := append([]ExportEntry{}, entry)
	for _, c := range children {
		entries = append(entries, c.exportTree(level+1)...)

If one of the children is stopped, then we'll append nil here; is that desired behavior? It might make more sense to check for nil and only append non-nil values to entries.

Member Author

@yuzefovich yuzefovich left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @michae2, and @rharding6373)


-- commits line 86 at r1:

Previously, rharding6373 (Rachael Harding) wrote…

This seems like a reasonable permission to have.

Ack, let's keep this gate then.


pkg/util/mon/bytes_usage.go line 319 at r1 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

If one of the children is stopped, then we'll append nil here; is that desired behavior? It might make more sense to check for nil and only append non-nil values to entries.

If a child is stopped, then exportTree will return a nil slice, so this append will become a no-op. I added some more comments and expanded the test to ensure that only non-empty ExportEntry objects are returned by ExportTree.
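
Roughly, the idea, sketched against the illustrative monitor type from the first code sketch above (not the actual code):

```go
// ExportEntry here is a trimmed-down, hypothetical version of the real type.
type ExportEntry struct {
	Level int
	Name  string
	Used  int64
}

// exportTree returns nil for an already-stopped monitor, so the parent's
// append of the returned slice is a no-op.
func (m *monitor) exportTree(level int) []ExportEntry {
	m.mu.Lock()
	if m.mu.stopped {
		m.mu.Unlock()
		return nil
	}
	entry := ExportEntry{Level: level, Name: m.name, Used: m.mu.used}
	var children []*monitor
	for c := m.mu.head; c != nil; c = c.nextSibling {
		children = append(children, c)
	}
	m.mu.Unlock()
	entries := []ExportEntry{entry}
	for _, c := range children {
		entries = append(entries, c.exportTree(level+1)...)
	}
	return entries
}
```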

Collaborator

@michae2 michae2 left a comment


This is great! I think this will make a huge difference for observability of memory usage. Really nice.

Reviewed 1 of 25 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @rharding6373, and @yuzefovich)


pkg/sql/crdb_internal.go line 7503 at r2 (raw file):

	comment: `node-level table listing all currently active memory monitors`,
	schema: `
CREATE TABLE crdb_internal.memory_monitors (

random thought: should this be node_memory_monitors? (And should there be a corresponding cluster_memory_monitors?)


pkg/sql/sem/builtins/builtins.go line 7782 at r2 (raw file):

				return tree.NewDString(string(humanizeutil.IBytes(int64(b)))), nil
			},
			Info:       "Converts integer memory size (in bytes) into the human-readable form.",

nit: remove "memory" (this could be used for disk space or network usage or anything else measured in bytes).


pkg/util/mon/bytes_usage.go line 319 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

If a child is stopped, then exportTree will return a nil slice, so this append will become a no-op. I added some more comments and expanded the test to ensure that only non-empty ExportEntry objects are returned by ExportTree.

Because this is recursive, it will end up copying each entry level times to build the final level-0 slice. And we don't really need the full slice at once to create the virtual table anyway, we just need something to iterate over. For these reasons it might be worth changing ExportTree to take a closure instead of returning a slice, and that closure could be called with each ExportEntry.


pkg/util/mon/bytes_usage.go line 206 at r2 (raw file):

		// code path (reserveBytes() and releaseBytes()) might acquire the
		// parent's lock, so only locking "upwards" is allowed while keeping the
		// current monitor's lock.

Good catch!


pkg/util/mon/bytes_usage.go line 308 at r2 (raw file):

	entry := ExportEntry{
		Level:            level,
		Name:             string(mm.name),

suggestion: It might be nice to more formally expose the hierarchy using an ID and a parent ID, so that we could use recursive CTEs when querying crdb_internal.memory_monitors. (Maybe these IDs could be the addresses of the monitors cast to uint64?)
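
As an example of the access pattern this would enable (a sketch, using the id/parent_id columns suggested here and the table's final name), a recursive CTE can pull out a subtree:

```sql
-- List every monitor under the 'sql' monitor with its usage.
WITH RECURSIVE subtree AS (
    SELECT id, name, level, used
    FROM crdb_internal.node_memory_monitors
    WHERE name = 'sql'
  UNION ALL
    SELECT m.id, m.name, m.level, m.used
    FROM crdb_internal.node_memory_monitors AS m
    JOIN subtree AS s ON m.parent_id = s.id
)
SELECT repeat('  ', level) || name AS monitor,
       crdb_internal.humanize_bytes(used) AS usage
FROM subtree;
```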

Collaborator

@michae2 michae2 left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @rharding6373, and @yuzefovich)


pkg/util/mon/bytes_usage.go line 207 at r2 (raw file):

		// parent's lock, so only locking "upwards" is allowed while keeping the
		// current monitor's lock.
		head *BytesMonitor

random trivia: You might already know about this, but if you like linked lists, Linux has a really nice implementation of an intrusive, circular doubly-linked list which you might enjoy. Because it is circular (and because of various tricks possible in C) it doesn't need any conditionals for most list operations.

@yuzefovich yuzefovich force-pushed the monitor-tree branch 3 times, most recently from 6d40705 to 1af565f on February 28, 2023 22:20
Member Author

@yuzefovich yuzefovich left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @michae2, and @rharding6373)


pkg/sql/crdb_internal.go line 7503 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

random thought: should this be node_memory_monitors? (And should there be a corresponding cluster_memory_monitors?)

Renamed, but I'm a bit afraid of introducing the cluster level table. In particular, since each query can have dozens of monitors, a single node can have on the order of hundreds of thousands of them. Traversing that tree locally is cheap (especially so given the callback implementation of the traversal), but for the cluster level view we'd have to materialize all monitors into a single RPC response (or we'd need to implement some pagination scheme) for each remote node, so the network cost can be substantial. I want this to be backported, so I'm leaning on the side of caution here. In the future if we see that the cluster level view is desirable, we can still add it, but for now node level view seems like a big improvement on its own.


pkg/util/mon/bytes_usage.go line 319 at r1 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Because this is recursive, it will end up copying each entry level times to build the final level-0 slice. And we don't really need the full slice at once to create the virtual table anyway, we just need something to iterate over. For these reasons it might be worth changing ExportTree to take a closure instead of returning a slice, and that closure could be called with each ExportEntry.

That's a very good point. I briefly thought about this too but decided it wasn't a big deal. I just wrote a microbenchmark, and these allocations do add up: given the exponential growth in the number of monitors in a tree, the copying can get huge! I like your idea about the callback, thanks, done.


pkg/util/mon/bytes_usage.go line 207 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

random trivia: You might already know about this, but if you like linked lists, Linux has a really nice implementation of an intrusive, circular doubly-linked list which you might enjoy. Because it is circular (and because of various tricks possible in C) it doesn't need any conditionals for most list operations.

Wow, I couldn't decipher the code without the second link, that was an interesting read, thanks!


pkg/util/mon/bytes_usage.go line 308 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

suggestion: It might be nice to more formally expose the hierarchy using an ID and a parent ID, so that we could use recursive CTEs when querying crdb_internal.memory_monitors. (Maybe these IDs could be the addresses of the monitors cast to uint64?)

Thanks, I like it, done (with the minor modification that it's exposed as int64: the virtual table column has the INT8 type, so it seems cleaner to always use the corresponding int64 Go type).


pkg/util/mon/bytes_usage.go line 479 at r3 (raw file):

	mm.mu.maxAllocated = 0
	mm.mu.curBudget = pool.MakeBoundAccount()
	mm.mu.stopped = false

This line is why the active queries weren't showing up during the demo earlier today: the "txn" monitor is stopped and then started for each txn.

Collaborator

@michae2 michae2 left a comment


Looks good, just a few comments

Reviewed 8 of 25 files at r1, 15 of 17 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @rharding6373, and @yuzefovich)


pkg/cli/zip_table_registry.go line 672 at r3 (raw file):

			"reserved_reserved",
		},
	},

This is going to be so useful!!


pkg/sql/crdb_internal.go line 7503 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Renamed, but I'm a bit afraid of introducing the cluster level table. In particular, since each query can have dozens of monitors, a single node can have on the order of hundreds of thousands of them. Traversing that tree locally is cheap (especially so given the callback implementation of the traversal), but for the cluster level view we'd have to materialize all monitors into a single RPC response (or we'd need to implement some pagination scheme) for each remote node, so the network cost can be substantial. I want this to be backported, so I'm leaning on the side of caution here. In the future if we see that the cluster level view is desirable, we can still add it, but for now node level view seems like a big improvement on its own.

Thank you! Makes sense. (Yes, this is definitely a big improvement on its own!)

It looks like most other node_ tables start with a node_id column. I think we should follow that convention, too.


pkg/util/mon/bytes_usage.go line 340 at r3 (raw file):

	mm.mu.Unlock()
	if err := monitorStateCb(monitorState); err != nil {
		return err

smart


pkg/sql/logictest/testdata/logic_test/table line 602 at r3 (raw file):

leases                                 NULL
lost_descriptors_with_data             NULL
memory_monitors                        NULL

node_memory_monitors

Member Author

@yuzefovich yuzefovich left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @michae2, and @rharding6373)


pkg/sql/crdb_internal.go line 7503 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Thank you! Makes sense. (Yes, this is definitely a big improvement on its own!)

It looks like most other node_ tables start with a node_id column. I think we should follow that convention, too.

node_id seems redundant now since we only have the node-level view. We can always change the schema later if/when adding the cluster-level view. Do you see any benefits from having this column before that?

I think in other cases where we have node- and cluster-level views, we just kept the schema the same for both variants by default, even though we don't need the node_id column in the node-level view. I'm a big fan of consistency with existing precedents, but for some reason I'm leaning towards not following this particular convention since it doesn't make much sense to me and isn't fully applicable to this virtual table. Thoughts?


pkg/sql/logictest/testdata/logic_test/table line 602 at r3 (raw file):

Previously, michae2 (Michael Erickson) wrote…

node_memory_monitors

Done.

Collaborator

@michae2 michae2 left a comment


:lgtm_strong: Nice work!

Reviewed 1 of 17 files at r3, 3 of 3 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball and @rharding6373)


pkg/sql/crdb_internal.go line 7503 at r2 (raw file):

We can always change the schema later if/when adding the cluster-level view. Do you see any benefits from having this column before that?

Good point. The only benefit I see is consistency with other node_ tables. We can leave it.

@yuzefovich
Member Author

TFTRs!

bors r+

@craig
Contributor

craig bot commented Mar 1, 2023

Build succeeded.

@craig craig bot merged commit c3348b0 into cockroachdb:master Mar 1, 2023
@blathers-crl

blathers-crl bot commented Mar 1, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from b6d04fd to blathers/backport-release-22.2-97657: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
