-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
distsql, ui: query count metrics #17050
distsql, ui: query count metrics #17050
Conversation
It's customary to attach screenshots of the changed admin UI page(s), so could you please do that? Also, @RaduBerinde might have thoughts. Review status: 0 of 7 files reviewed at latest revision, all discussions resolved, all commit checks successful. Comments from Reviewable |
Had some comments on the graphs and the metric structs, the actual usage of the metrics looks good. Review status: 0 of 7 files reviewed at latest revision, 6 unresolved discussions, all commit checks successful. pkg/server/server.go, line 280 at r1 (raw file):
By my eye, these metrics should likely be part of the new DistSQLMetrics struct. It was a little difficult to read this because of the similarity with the new Since we're only using two of the metrics from the MemMetrics struct (out of the six described), maybe it would be more readable to put those two metrics directly on the DistSQLMetrics struct. pkg/sql/distsqlrun/metrics.go, line 1 at r1 (raw file):
2017 pkg/sql/distsqlrun/metrics.go, line 41 at r1 (raw file):
metric Metadata structs are defined in top-level pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 56 at r1 (raw file):
use ${tooltipSelection} here. It is either the string "across all nodes" or "on node X" if a single node is selected. pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 66 at r1 (raw file):
${tooltipSelection}. pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 75 at r1 (raw file):
Don't need to specify sources here, the sources are applied to the individual metrics. Comments from Reviewable |
162af2a
to
d164b9c
Compare
Inline feedback addressed (except for the one TODO comment added). I've also added some questions myself about the charts, would appreciate any opinions there. Will add screenshots when I get into the office, as my setup for making these charts show interesting data is there. Review status: 0 of 7 files reviewed at latest revision, 8 unresolved discussions, all commit checks successful. pkg/server/server.go, line 280 at r1 (raw file): Previously, mrtracy (Matt Tracy) wrote…
Totally agree, that's where I was thinking of going as a next step. Even thought I had added a TODO. Just added one now, will take care of that next. pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 54 at r1 (raw file):
I'm wondering if this should be labeled "in flight" to make it clearer what it's showing? OTOH, "flows in flight" doesn't make much sense, and there's something to be said for keeping them consistent. pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 56 at r1 (raw file): Previously, mrtracy (Matt Tracy) wrote…
Nice. pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 64 at r1 (raw file):
I'm going back and forth on whether sure this chart is all that valuable. Split out by nodes seems to be more useful. The only thing is that it would be nice if it were easier to see totals on the by nodes charts generally (where it makes sense). pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 66 at r1 (raw file): Previously, mrtracy (Matt Tracy) wrote…
Changed, but per my comment right above I may just remove this one. Comments from Reviewable |
These graphs look useful! What do you think about changing "DistSQL Queries Active" to "DistSQL Active Queries"? Also, what about "Flows Active" to "Active Flows"? |
I agree @cuongdo that sounds more natural. |
caa8c40
to
521fbbb
Compare
on the DistSQL side, wait for Matt's LGTM on the visual side as well. These are useful metrics to have! I have a production cluster I'm poking right now where it would be useful to have these metrics, which is a great sign. Reviewed 2 of 7 files at r1, 2 of 4 files at r2, 3 of 3 files at r3. pkg/sql/distsqlrun/metrics.go, line 41 at r1 (raw file): Previously, mrtracy (Matt Tracy) wrote…
Looks done to me. @couchand did you forget to ping this as done? pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 54 at r1 (raw file): Previously, couchand (Andrew Couch) wrote…
"Running Queries" is what I would have called it, but active queries seems fine. pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 64 at r1 (raw file): Previously, couchand (Andrew Couch) wrote…
I agree that split out by nodes is the most useful. Comments from Reviewable |
Review status: all files reviewed at latest revision, 8 unresolved discussions, some commit checks failed. pkg/server/server.go, line 280 at r1 (raw file): Previously, couchand (Andrew Couch) wrote…
Done. pkg/sql/distsqlrun/metrics.go, line 1 at r1 (raw file): Previously, mrtracy (Matt Tracy) wrote…
Done. pkg/sql/distsqlrun/metrics.go, line 41 at r1 (raw file): Previously, arjunravinarayan (Arjun Narayan) wrote…
I clicked the thumbs up, but apparently that doesn't propagate anywhere? In any case, this is done. pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 54 at r1 (raw file): Previously, arjunravinarayan (Arjun Narayan) wrote…
Fair point. Leaving it unless somebody makes a good argument. pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 64 at r1 (raw file): Previously, arjunravinarayan (Arjun Narayan) wrote…
Done. pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 75 at r1 (raw file): Previously, mrtracy (Matt Tracy) wrote…
Done. Comments from Reviewable |
This looks great! The only question from me is whether or not "flows" is a concept people outside of CockroachDB will understand. If it isn't, let's try to use generally understandable terminology versus forcing users to rely on the tooltips. If yes, let's keep as is. Could you also make sure to include a tool tip defining the metric? |
Review status: all files reviewed at latest revision, 6 unresolved discussions, some commit checks failed. pkg/sql/distsqlrun/metrics.go, line 41 at r1 (raw file): Previously, couchand (Andrew Couch) wrote…
The thumbs up is what the raiser of the issue does when they are happy with you doing the thing (and at Cockroach people don't always do this). But when you do the thing, you write an explicit response (there's a shorthand one-button "Done." you can use) so that they know to check the diff. Comments from Reviewable |
@dianasaur323 Great point about the term "flows", it's probably not clear unless you've studied the internals of DistSQL. Do you have a suggestion for what to use instead? The tooltips currently say "The total number of active DistSQL queries (across all nodes/on node #)." for the queries chart and "The number of DistSQL flows running on each node." for the flows one. Are you looking for more details there, and do you have input on what that should say? My guess is we need more clarification for the flows chart, but I'm not sure I know what to write there. |
521fbbb
to
e9313a5
Compare
How about "workers" instead of "flows"? For someone technical, "workers" should be common enough terminology. |
I'm a big fan of the concept of Ubiquitous Language as described by Eric Evans in Domain-Driven Design. I think there's a concern that by having different terms internally and externally for the same thing it can be hard to learn and describe precisely what it means. If we think "flows" is going to be hard to bring people over to and we generally prefer "workers", we should use that everywhere, in the code base as well as in our own discussions. If, however, we think "flows" is a good term (because it's more precise?) to describe them, we should use that everywhere, including in documentation and the Admin UI. We can of course use "workers" or anything else to add clarification for new users in the docs & tooltips, but I'd argue that keeping the terminology consistent will be a big win in the long term. Thoughts? @dianasaur323 @cuongdo @cockroachdb/distsql |
I agree in principle. cc @RaduBerinde for his thoughts, since this would be a (hopefully mostly mechanical) code change |
A flow is not a worker. It is (more or less) a work item. We don't quite have a clearly defined "worker" entity in DistSQL code (we have a "flow scheduler" that manages currently running flows), but if we pretend we do, then a worker processes flows. It's one thing to call the stat "workers" in the UI - which makes sense because the number of currently executing flows is conceptually equivalent to the number of active workers. But renaming "flow" to "worker" in the code makes no sense. I question if having consistency between UI and internal details of our code is something we should strive for in all cases. The concept of the flow is a detail of distsql implementation, and a user should not be expected to understand these details. Alternative wording for "DistSQL Flows Active, by node" is something along the lines of "number of DistSQL queries that are executing (in part) on node X". But "active DistSQL workers on node X" also works. The "total number of active DistSQL flows on all nodes" stat is very hard to explain. It is roughly the sum, for all running queries, of the number of nodes involved in that query. With the worker terminology, it's the total number of active DistSQL workers across all nodes. Either way, I'm not sure how useful is this stat. |
@couchand Sorry for the slow response to your comment. I think the distributed flow description needs to be a bit more accurate. We should either go with a really well defined "flow" or with worker, which I find intuitively easier to understand. Latching onto what Radu said around internal terminology, I actually think DistSQL is also an internal term? Distributed SQL Queries, Total (Number of queries running through distributed SQL) seems fine to me? Regarding flows by node, I'm grasping at straws, but perhaps one reason someone would want this would be to understand if certain nodes were hotter than others, and why? Would exposing average query distribution hide outliers, which is probably what people want to observe? |
@dianasaur323 To be clear, I am proposing keeping the "per node" stat (which is indeed useful to see hot spots and imbalance) and replacing the "total # of flows" stat with "average query distribution". Note that "total # of flows" is different than "total # of distsql queries" (which we should keep). |
If "a flow is not a worker" it seems pretty strange to make a chart showing a count of flows that we title as though it's a count of "workers". It's actually showing something different than it says it's showing, which will be confusing for everyone: Cockroach devs using the chart for development, support when helping a client debugging, and a user-turned-contributor. This all goes to the idea of Ubiquitous Language: "Translation blunts communication and makes knowledge crunching anemic ... A project needs a common language that is more robust than the lowest common denominator". Changing DistSQL to Distributed SQL does seem like a net win. @RaduBerinde the total flows chart has been dropped (see the latest screenshot). |
A flow doesn't have to be a worker for the count of flows to be conceptually equivalent to a count of active workers. A car is not a parking space but the number of cars parked outside my building is the same as the number of occupied parking spaces. I am good with naming it in terms of workers, or in a more generic description like the one I suggested above. I am against trying to document the term "flow" which we made up to talk about internal implementation (and may the "Ubiquitous Language" Gods strike me where I stand). |
CC @andreimatei in case he has some thoughts here. |
If I understand it correctly, the count of active flows won't be the count of active workers, then. If a parking lot is completely full and the number of cars circulating looking for a spot is a large fraction of the total, then representing the number of occupied spots as the number of cars is misleading. |
"Active flows" means flows which are currently executing. It does not include flows that are queued (circling cars). |
(at least that's what I thought, I haven't looked at the code to see how we calculate it). |
e9313a5
to
136a387
Compare
Yes, it means the flow's "currently executing", but since it's running in a goroutine, as I understand it that flow might not be currently executing. I guess without a precise definition for "worker" it's hard to know what the difference would be. If it's not the same as a flow, what is it? The goroutine running the flow, the thread or the processor, the flow scheduler that started it, or the node? Or something else entirely? All this is to say, "flows" has a nice precise definition that actually means something specific in our code. I agree that it's important to explain it in a way that will be meaningful to a user. But I don't think it's wise to introduce a new term that doesn't precisely describe what the statistic is showing. As it is, we can point to a specific place in the code where the metric is calculated, and all the words around there say "flow", at least today. In general I think the bar should be quite high to introduce a disconnect between the code that generates a metric and the label for the metric. I've gone ahead and updated "DistSQL" to say "Distributed SQL" everywhere, and added some clarification in the description of the flows chart. I'm inclined to leave "flows" in the title unless there's a strong opinion that this is bad. |
136a387
to
95ea75b
Compare
If we keep flows, the tool tip should explain what they are. Along the lines of "DistSQL queries run on multiple nodes; the portion that is executed on a specific node is called a flow. The graphs show the number of flows running on each node." |
95ea75b
to
d3e21fa
Compare
That makes sense to add clarification, my last change expanded it a bit but I like your language better. Updated. |
pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 65 at r4 (raw file):
Since we're keeping "flows", why are we also using the "worker" concept you have been arguing against? Comments from Reviewable |
CCing @knz also in case he has better suggestions for the titles/tooltips. |
Reviewed 2 of 7 files at r1, 5 of 5 files at r4. pkg/sql/distsqlrun/metrics.go, line 44 at r4 (raw file):
Number of distributed SQL queries currently active pkg/sql/distsqlrun/metrics.go, line 50 at r4 (raw file):
Number of distributed SQL flows currently active pkg/sql/distsqlrun/metrics.go, line 56 at r4 (raw file):
High water mark of the memory usage by SQL processors for distributed queries pkg/sql/distsqlrun/metrics.go, line 59 at r4 (raw file):
Current memory usage by SQL processors for distributed queries pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 56 at r1 (raw file): Previously, couchand (Andrew Couch) wrote…
The total number of distributed SQL queries currently running ... pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 54 at r4 (raw file):
"Active distributed SQL queries" pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 64 at r4 (raw file):
"Active flows supporting distributed SQL queries" pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 65 at r4 (raw file): Previously, RaduBerinde wrote…
Yeah the word "worker" doesn't work for me at all. I suggest remove the first sentence altogether. Just keep "This is the number of ..." Comments from Reviewable |
d3e21fa
to
8c74895
Compare
Review status: 4 of 7 files reviewed at latest revision, 13 unresolved discussions. pkg/sql/distsqlrun/metrics.go, line 44 at r4 (raw file): Previously, knz (kena) wrote…
Done. pkg/sql/distsqlrun/metrics.go, line 50 at r4 (raw file): Previously, knz (kena) wrote…
Done. pkg/sql/distsqlrun/metrics.go, line 56 at r4 (raw file): Previously, knz (kena) wrote…
These metrics are verbatim from pkg/sql/distsqlrun/metrics.go, line 59 at r4 (raw file): Previously, knz (kena) wrote…
See comment above. pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 56 at r1 (raw file): Previously, knz (kena) wrote…
Done. pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 54 at r4 (raw file): Previously, knz (kena) wrote…
Done. pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 64 at r4 (raw file): Previously, knz (kena) wrote…
How about "for" instead of "supporting"? pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 65 at r4 (raw file): Previously, knz (kena) wrote…
Because I thought it was the way to clarify this. Done. Comments from Reviewable |
The current approach you all have converged on sounds fine to me. Comments from Reviewable |
3dd6ae1
to
2863fdd
Compare
Can I get a final review from somebody on @cockroachdb/distsql please? |
LGTM |
This adds metrics for counts of active DistSQL queries and flows, which will give some visibility into the status of long-running queries.
The previous commit introduced a struct to contain all the DistSQL metrics. This completes that goal by moving the two memory metrics into that struct as well.
2863fdd
to
90e87c7
Compare
Thanks @couchand! |
This adds metrics for counts of active DistSQL queries and flows, which
will give some visibility into the status of long-running queries.
Contributes to #12143.