Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

distsql, ui: query count metrics #17050

Merged
merged 2 commits into from
Jul 31, 2017

Conversation

couchand
Copy link
Contributor

This adds metrics for counts of active DistSQL queries and flows, which
will give some visibility into the status of long-running queries.

Contributes to #12143.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@couchand couchand changed the title distsql: query count metrics distsql, ui: query count metrics Jul 17, 2017
@couchand
Copy link
Contributor Author

cc: @arjunravinarayan @cuongdo @mrtracy

@cuongdo
Copy link
Contributor

cuongdo commented Jul 18, 2017

It's customary to attach screenshots of the changed admin UI page(s), so could you please do that?

Also, @RaduBerinde might have thoughts.


Review status: 0 of 7 files reviewed at latest revision, all discussions resolved, all commit checks successful.


Comments from Reviewable

@mrtracy
Copy link
Contributor

mrtracy commented Jul 18, 2017

Had some comments on the graphs and the metric structs, the actual usage of the metrics looks good.


Review status: 0 of 7 files reviewed at latest revision, 6 unresolved discussions, all commit checks successful.


pkg/server/server.go, line 280 at r1 (raw file):

	rootSQLMemoryMonitor.Start(context.Background(), nil, mon.MakeStandaloneBudget(s.cfg.SQLMemoryPoolSize))

	distSQLMemMetrics := sql.MakeMemMetrics("distsql", cfg.HistogramWindowInterval())

By my eye, these metrics should likely be part of the new DistSQLMetrics struct. It was a little difficult to read this because of the similarity with the new distSQLMetrics variable.

Since we're only using two of the metrics from the MemMetrics struct (out of the six described), maybe it would be more readable to put those two metrics directly on the DistSQLMetrics struct.


pkg/sql/distsqlrun/metrics.go, line 1 at r1 (raw file):

// Copyright 2016 The Cockroach Authors.

2017


pkg/sql/distsqlrun/metrics.go, line 41 at r1 (raw file):

	metaQueriesActive := metric.Metadata{
		Name: "sql.distsql.queries.active",
		Help: "Number of currently active DistSQL queries"}

metric Metadata structs are defined in top-level var blocks everywhere else; they're essentially constants. The one exception was sql.MakeMemMetrics.


pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 56 at r1 (raw file):

      title="DistSQL Queries Active"
      sources={nodeSources}
      tooltip="The total number of active DistSQL queries on all nodes."

use ${tooltipSelection} here. It is either the string "across all nodes" or "on node X" if a single node is selected.


pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 66 at r1 (raw file):

      title="DistSQL Flows Active, total"
      sources={nodeSources}
      tooltip="The total number of active DistSQL flows on all nodes."

${tooltipSelection}.


pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 75 at r1 (raw file):

    <LineGraph
      title="DistSQL Flows Active, by node"
      sources={nodeSources}

Don't need to specify sources here, the sources are applied to the individual metrics.


Comments from Reviewable

@couchand couchand force-pushed the feature/distsql-query-count branch from 162af2a to d164b9c Compare July 18, 2017 11:26
@couchand
Copy link
Contributor Author

Inline feedback addressed (except for the one TODO comment added). I've also added some questions myself about the charts, would appreciate any opinions there.

Will add screenshots when I get into the office, as my setup for making these charts show interesting data is there.


Review status: 0 of 7 files reviewed at latest revision, 8 unresolved discussions, all commit checks successful.


pkg/server/server.go, line 280 at r1 (raw file):

Previously, mrtracy (Matt Tracy) wrote…

By my eye, these metrics should likely be part of the new DistSQLMetrics struct. It was a little difficult to read this because of the similarity with the new distSQLMetrics variable.

Since we're only using two of the metrics from the MemMetrics struct (out of the six described), maybe it would be more readable to put those two metrics directly on the DistSQLMetrics struct.

Totally agree, that's where I was thinking of going as a next step. Even thought I had added a TODO. Just added one now, will take care of that next.


pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 54 at r1 (raw file):

    <LineGraph
      title="DistSQL Queries Active"

I'm wondering if this should be labeled "in flight" to make it clearer what it's showing? OTOH, "flows in flight" doesn't make much sense, and there's something to be said for keeping them consistent.


pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 56 at r1 (raw file):

Previously, mrtracy (Matt Tracy) wrote…

use ${tooltipSelection} here. It is either the string "across all nodes" or "on node X" if a single node is selected.

Nice.


pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 64 at r1 (raw file):

    <LineGraph
      title="DistSQL Flows Active, total"

I'm going back and forth on whether sure this chart is all that valuable. Split out by nodes seems to be more useful. The only thing is that it would be nice if it were easier to see totals on the by nodes charts generally (where it makes sense).


pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 66 at r1 (raw file):

Previously, mrtracy (Matt Tracy) wrote…

${tooltipSelection}.

Changed, but per my comment right above I may just remove this one.


Comments from Reviewable

@couchand
Copy link
Contributor Author

screen shot 2017-07-18 at 9 41 42 am

@cuongdo
Copy link
Contributor

cuongdo commented Jul 18, 2017

These graphs look useful!

What do you think about changing "DistSQL Queries Active" to "DistSQL Active Queries"? Also, what about "Flows Active" to "Active Flows"?

@couchand
Copy link
Contributor Author

I agree @cuongdo that sounds more natural.

@couchand couchand force-pushed the feature/distsql-query-count branch 2 times, most recently from caa8c40 to 521fbbb Compare July 18, 2017 14:41
@couchand
Copy link
Contributor Author

After recent updates. As you can see I've just gone ahead and removed the chart with the total count of flows in favor of the one broken down by node.
screen shot 2017-07-18 at 10 41 06 am

@rjnn
Copy link
Contributor

rjnn commented Jul 18, 2017

:lgtm: on the DistSQL side, wait for Matt's LGTM on the visual side as well. These are useful metrics to have! I have a production cluster I'm poking right now where it would be useful to have these metrics, which is a great sign.


Reviewed 2 of 7 files at r1, 2 of 4 files at r2, 3 of 3 files at r3.
Review status: all files reviewed at latest revision, 8 unresolved discussions, all commit checks successful.


pkg/sql/distsqlrun/metrics.go, line 41 at r1 (raw file):

Previously, mrtracy (Matt Tracy) wrote…

metric Metadata structs are defined in top-level var blocks everywhere else; they're essentially constants. The one exception was sql.MakeMemMetrics.

Looks done to me. @couchand did you forget to ping this as done?


pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 54 at r1 (raw file):

Previously, couchand (Andrew Couch) wrote…

I'm wondering if this should be labeled "in flight" to make it clearer what it's showing? OTOH, "flows in flight" doesn't make much sense, and there's something to be said for keeping them consistent.

"Running Queries" is what I would have called it, but active queries seems fine.


pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 64 at r1 (raw file):

Previously, couchand (Andrew Couch) wrote…

I'm going back and forth on whether sure this chart is all that valuable. Split out by nodes seems to be more useful. The only thing is that it would be nice if it were easier to see totals on the by nodes charts generally (where it makes sense).

I agree that split out by nodes is the most useful.


Comments from Reviewable

@couchand
Copy link
Contributor Author

Review status: all files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.


pkg/server/server.go, line 280 at r1 (raw file):

Previously, couchand (Andrew Couch) wrote…

Totally agree, that's where I was thinking of going as a next step. Even thought I had added a TODO. Just added one now, will take care of that next.

Done.


pkg/sql/distsqlrun/metrics.go, line 1 at r1 (raw file):

Previously, mrtracy (Matt Tracy) wrote…

2017

Done.


pkg/sql/distsqlrun/metrics.go, line 41 at r1 (raw file):

Previously, arjunravinarayan (Arjun Narayan) wrote…

Looks done to me. @couchand did you forget to ping this as done?

I clicked the thumbs up, but apparently that doesn't propagate anywhere? In any case, this is done.


pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 54 at r1 (raw file):

Previously, arjunravinarayan (Arjun Narayan) wrote…

"Running Queries" is what I would have called it, but active queries seems fine.

Fair point. Leaving it unless somebody makes a good argument.


pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 64 at r1 (raw file):

Previously, arjunravinarayan (Arjun Narayan) wrote…

I agree that split out by nodes is the most useful.

Done.


pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 75 at r1 (raw file):

Previously, mrtracy (Matt Tracy) wrote…

Don't need to specify sources here, the sources are applied to the individual metrics.

Done.


Comments from Reviewable

@dianasaur323
Copy link
Contributor

This looks great! The only question from me is whether or not "flows" is a concept people outside of CockroachDB will understand. If it isn't, let's try to use generally understandable terminology versus forcing users to rely on the tooltips. If yes, let's keep as is. Could you also make sure to include a tool tip defining the metric?

@rjnn
Copy link
Contributor

rjnn commented Jul 18, 2017

Review status: all files reviewed at latest revision, 6 unresolved discussions, some commit checks failed.


pkg/sql/distsqlrun/metrics.go, line 41 at r1 (raw file):

Previously, couchand (Andrew Couch) wrote…

I clicked the thumbs up, but apparently that doesn't propagate anywhere? In any case, this is done.

The thumbs up is what the raiser of the issue does when they are happy with you doing the thing (and at Cockroach people don't always do this). But when you do the thing, you write an explicit response (there's a shorthand one-button "Done." you can use) so that they know to check the diff.


Comments from Reviewable

@couchand
Copy link
Contributor Author

couchand commented Jul 18, 2017

@dianasaur323 Great point about the term "flows", it's probably not clear unless you've studied the internals of DistSQL. Do you have a suggestion for what to use instead? The tooltips currently say "The total number of active DistSQL queries (across all nodes/on node #)." for the queries chart and "The number of DistSQL flows running on each node." for the flows one. Are you looking for more details there, and do you have input on what that should say? My guess is we need more clarification for the flows chart, but I'm not sure I know what to write there.

@cuongdo
Copy link
Contributor

cuongdo commented Jul 20, 2017

How about "workers" instead of "flows"? For someone technical, "workers" should be common enough terminology.

@couchand
Copy link
Contributor Author

I'm a big fan of the concept of Ubiquitous Language as described by Eric Evans in Domain-Driven Design. I think there's a concern that by having different terms internally and externally for the same thing it can be hard to learn and describe precisely what it means. If we think "flows" is going to be hard to bring people over to and we generally prefer "workers", we should use that everywhere, in the code base as well as in our own discussions. If, however, we think "flows" is a good term (because it's more precise?) to describe them, we should use that everywhere, including in documentation and the Admin UI. We can of course use "workers" or anything else to add clarification for new users in the docs & tooltips, but I'd argue that keeping the terminology consistent will be a big win in the long term. Thoughts? @dianasaur323 @cuongdo @cockroachdb/distsql

@cuongdo
Copy link
Contributor

cuongdo commented Jul 25, 2017

I agree in principle. cc @RaduBerinde for his thoughts, since this would be a (hopefully mostly mechanical) code change

@RaduBerinde
Copy link
Member

A flow is not a worker. It is (more or less) a work item. We don't quite have a clearly defined "worker" entity in DistSQL code (we have a "flow scheduler" that manages currently running flows), but if we pretend we do, then a worker processes flows. It's one thing to call the stat "workers" in the UI - which makes sense because the number of currently executing flows is conceptually equivalent to the number of active workers. But renaming "flow" to "worker" in the code makes no sense.

I question if having consistency between UI and internal details of our code is something we should strive for in all cases. The concept of the flow is a detail of distsql implementation, and a user should not be expected to understand these details.

Alternative wording for "DistSQL Flows Active, by node" is something along the lines of "number of DistSQL queries that are executing (in part) on node X". But "active DistSQL workers on node X" also works.

The "total number of active DistSQL flows on all nodes" stat is very hard to explain. It is roughly the sum, for all running queries, of the number of nodes involved in that query. With the worker terminology, it's the total number of active DistSQL workers across all nodes. Either way, I'm not sure how useful is this stat.
It can give you an idea of how widely queries are distributed (when comparing against the # of running queries); but then perhaps we want to instead divide this by the number of running queries and expose it as "average query distribution" (meaning that if it's 5, on distsql queries have 5 flows on 5 nodes on average).

@dianasaur323
Copy link
Contributor

@couchand Sorry for the slow response to your comment. I think the distributed flow description needs to be a bit more accurate. We should either go with a really well defined "flow" or with worker, which I find intuitively easier to understand.

Latching onto what Radu said around internal terminology, I actually think DistSQL is also an internal term? Distributed SQL Queries, Total (Number of queries running through distributed SQL) seems fine to me?

Regarding flows by node, I'm grasping at straws, but perhaps one reason someone would want this would be to understand if certain nodes were hotter than others, and why? Would exposing average query distribution hide outliers, which is probably what people want to observe?

@RaduBerinde
Copy link
Member

@dianasaur323 To be clear, I am proposing keeping the "per node" stat (which is indeed useful to see hot spots and imbalance) and replacing the "total # of flows" stat with "average query distribution". Note that "total # of flows" is different than "total # of distsql queries" (which we should keep).

@couchand
Copy link
Contributor Author

If "a flow is not a worker" it seems pretty strange to make a chart showing a count of flows that we title as though it's a count of "workers". It's actually showing something different than it says it's showing, which will be confusing for everyone: Cockroach devs using the chart for development, support when helping a client debugging, and a user-turned-contributor. This all goes to the idea of Ubiquitous Language: "Translation blunts communication and makes knowledge crunching anemic ... A project needs a common language that is more robust than the lowest common denominator".

Changing DistSQL to Distributed SQL does seem like a net win.

@RaduBerinde the total flows chart has been dropped (see the latest screenshot).

@RaduBerinde
Copy link
Member

A flow doesn't have to be a worker for the count of flows to be conceptually equivalent to a count of active workers. A car is not a parking space but the number of cars parked outside my building is the same as the number of occupied parking spaces.

I am good with naming it in terms of workers, or in a more generic description like the one I suggested above. I am against trying to document the term "flow" which we made up to talk about internal implementation (and may the "Ubiquitous Language" Gods strike me where I stand).

@RaduBerinde
Copy link
Member

CC @andreimatei in case he has some thoughts here.

@couchand
Copy link
Contributor Author

If I understand it correctly, the count of active flows won't be the count of active workers, then. If a parking lot is completely full and the number of cars circulating looking for a spot is a large fraction of the total, then representing the number of occupied spots as the number of cars is misleading.

@RaduBerinde
Copy link
Member

"Active flows" means flows which are currently executing. It does not include flows that are queued (circling cars).

@RaduBerinde
Copy link
Member

(at least that's what I thought, I haven't looked at the code to see how we calculate it).
It's a good question whether we want the count to include queued flows (in which case, I agree that "workers" is not a good name).

@couchand couchand force-pushed the feature/distsql-query-count branch from e9313a5 to 136a387 Compare July 25, 2017 18:33
@couchand
Copy link
Contributor Author

Yes, it means the flow's "currently executing", but since it's running in a goroutine, as I understand it that flow might not be currently executing. I guess without a precise definition for "worker" it's hard to know what the difference would be. If it's not the same as a flow, what is it? The goroutine running the flow, the thread or the processor, the flow scheduler that started it, or the node? Or something else entirely?

All this is to say, "flows" has a nice precise definition that actually means something specific in our code. I agree that it's important to explain it in a way that will be meaningful to a user. But I don't think it's wise to introduce a new term that doesn't precisely describe what the statistic is showing. As it is, we can point to a specific place in the code where the metric is calculated, and all the words around there say "flow", at least today. In general I think the bar should be quite high to introduce a disconnect between the code that generates a metric and the label for the metric.

I've gone ahead and updated "DistSQL" to say "Distributed SQL" everywhere, and added some clarification in the description of the flows chart. I'm inclined to leave "flows" in the title unless there's a strong opinion that this is bad.

@couchand couchand force-pushed the feature/distsql-query-count branch from 136a387 to 95ea75b Compare July 25, 2017 18:56
@RaduBerinde
Copy link
Member

If we keep flows, the tool tip should explain what they are. Along the lines of "DistSQL queries run on multiple nodes; the portion that is executed on a specific node is called a flow. The graphs show the number of flows running on each node."

@couchand couchand force-pushed the feature/distsql-query-count branch from 95ea75b to d3e21fa Compare July 27, 2017 16:36
@couchand
Copy link
Contributor Author

That makes sense to add clarification, my last change expanded it a bit but I like your language better. Updated.

@RaduBerinde
Copy link
Member

pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 65 at r4 (raw file):

    <LineGraph
      title="Distributed SQL Active Flows"
      tooltip="Distributed queries run on multiple nodes, and a flow is the worker on a single node.  This is the number of flows on each node contributing to currently running distributed SQL queries."

Since we're keeping "flows", why are we also using the "worker" concept you have been arguing against?


Comments from Reviewable

@RaduBerinde
Copy link
Member

CCing @knz also in case he has better suggestions for the titles/tooltips.

@knz
Copy link
Contributor

knz commented Jul 27, 2017

Reviewed 2 of 7 files at r1, 5 of 5 files at r4.
Review status: all files reviewed at latest revision, 13 unresolved discussions, all commit checks successful.


pkg/sql/distsqlrun/metrics.go, line 44 at r4 (raw file):

	metaQueriesActive = metric.Metadata{
		Name: "sql.distsql.queries.active",
		Help: "Number of currently active distributed SQL queries"}

Number of distributed SQL queries currently active


pkg/sql/distsqlrun/metrics.go, line 50 at r4 (raw file):

	metaFlowsActive = metric.Metadata{
		Name: "sql.distsql.flows.active",
		Help: "Number of currently active distributed SQL flows"}

Number of distributed SQL flows currently active


pkg/sql/distsqlrun/metrics.go, line 56 at r4 (raw file):

	metaMemMaxBytes = metric.Metadata{
		Name: "sql.mem.distsql.max",
		Help: "Memory usage per sql statement for distsql"}

High water mark of the memory usage by SQL processors for distributed queries


pkg/sql/distsqlrun/metrics.go, line 59 at r4 (raw file):

	metaMemCurBytes = metric.Metadata{
		Name: "sql.mem.distsql.current",
		Help: "Current sql statement memory usage for distsql"}

Current memory usage by SQL processors for distributed queries


pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 56 at r1 (raw file):

Previously, couchand (Andrew Couch) wrote…

Nice.

The total number of distributed SQL queries currently running ...


pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 54 at r4 (raw file):

    <LineGraph
      title="Distributed SQL Active Queries"

"Active distributed SQL queries"


pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 64 at r4 (raw file):

    <LineGraph
      title="Distributed SQL Active Flows"

"Active flows supporting distributed SQL queries"


pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 65 at r4 (raw file):

Previously, RaduBerinde wrote…

Since we're keeping "flows", why are we also using the "worker" concept you have been arguing against?

Yeah the word "worker" doesn't work for me at all. I suggest remove the first sentence altogether. Just keep "This is the number of ..."


Comments from Reviewable

@couchand couchand force-pushed the feature/distsql-query-count branch from d3e21fa to 8c74895 Compare July 27, 2017 17:51
@couchand
Copy link
Contributor Author

Review status: 4 of 7 files reviewed at latest revision, 13 unresolved discussions.


pkg/sql/distsqlrun/metrics.go, line 44 at r4 (raw file):

Previously, knz (kena) wrote…

Number of distributed SQL queries currently active

Done.


pkg/sql/distsqlrun/metrics.go, line 50 at r4 (raw file):

Previously, knz (kena) wrote…

Number of distributed SQL flows currently active

Done.


pkg/sql/distsqlrun/metrics.go, line 56 at r4 (raw file):

Previously, knz (kena) wrote…

High water mark of the memory usage by SQL processors for distributed queries

These metrics are verbatim from pkg/sql/mem_metrics.go. If we change them here I recommend that we also change them there to keep everything consistent. Leaving this for now, if we decide to make this change everywhere that can be done in a future PR.


pkg/sql/distsqlrun/metrics.go, line 59 at r4 (raw file):

Previously, knz (kena) wrote…

Current memory usage by SQL processors for distributed queries

See comment above.


pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 56 at r1 (raw file):

Previously, knz (kena) wrote…

The total number of distributed SQL queries currently running ...

Done.


pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 54 at r4 (raw file):

Previously, knz (kena) wrote…

"Active distributed SQL queries"

Done.


pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 64 at r4 (raw file):

Previously, knz (kena) wrote…

"Active flows supporting distributed SQL queries"

How about "for" instead of "supporting"?


pkg/ui/src/views/cluster/containers/nodeGraphs/dashboards/sql.tsx, line 65 at r4 (raw file):

Previously, knz (kena) wrote…

Yeah the word "worker" doesn't work for me at all. I suggest remove the first sentence altogether. Just keep "This is the number of ..."

Because I thought it was the way to clarify this. Done.


Comments from Reviewable

@dianasaur323
Copy link
Contributor

The current approach you all have converged on sounds fine to me.


Comments from Reviewable

@couchand couchand force-pushed the feature/distsql-query-count branch 2 times, most recently from 3dd6ae1 to 2863fdd Compare July 31, 2017 15:33
@couchand
Copy link
Contributor Author

Can I get a final review from somebody on @cockroachdb/distsql please?

@RaduBerinde
Copy link
Member

LGTM

couchand added 2 commits July 31, 2017 12:37
This adds metrics for counts of active DistSQL queries and flows, which
will give some visibility into the status of long-running queries.
The previous commit introduced a struct to contain all the DistSQL
metrics.  This completes that goal by moving the two memory metrics into
that struct as well.
@couchand couchand force-pushed the feature/distsql-query-count branch from 2863fdd to 90e87c7 Compare July 31, 2017 16:37
@CLAassistant
Copy link

CLAassistant commented Jul 31, 2017

CLA assistant check
All committers have signed the CLA.

@couchand couchand merged commit 2fed9e1 into cockroachdb:master Jul 31, 2017
@couchand couchand deleted the feature/distsql-query-count branch July 31, 2017 18:05
@couchand couchand mentioned this pull request Jul 31, 2017
6 tasks
@RaduBerinde
Copy link
Member

Thanks @couchand!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants