explain: Clarify meaning of distribution full vs local #10020

Closed
sheaffej opened this issue Mar 19, 2021 · 0 comments · Fixed by #10240

It would be helpful to clarify the meaning of distribution in explain plans. The possible values appear to be full and local.

On the page: https://www.cockroachlabs.com/docs/v20.2/explain#default-query-plans

It states:

distribution:full
The query plan will be distributed across all nodes on a distributed cluster.

However, it does not explain what local means, and it is not clear what distribution refers to.

Reference internal Slack:
https://cockroachlabs.slack.com/archives/CHKQGKYEM/p1616110866018700
and follow-on
https://cockroachlabs.slack.com/archives/CHKQGKYEM/p1616110946019100

Summarizing the Slack conversation:

distribution: full
Means that the planner chose a distributed execution plan, where execution of the query is performed by multiple nodes in parallel. An example is a distributed join or aggregation, where each node performs a portion of the work in parallel, and the final or downstream results are then returned by the gateway node.

distribution: local
Means that the planner chose a local execution plan, where execution of the query is performed only on the local node (also known as the gateway node). It is important to note that rows may be fetched by one or more remote nodes, but the processing of those rows is performed only by the local node.

Therefore "distribution" refers to the style of processing of the records during query execution (e.g. distributed processing or local processing), rather than to the fetching of the rows that will be processed by the query.
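
The distinction above can be illustrated with a toy model. This is only a sketch and not how CockroachDB is implemented: the `NODES` data, the `GATEWAY` constant, and both function names are hypothetical. In both functions the rows live on several nodes; what differs is where the counting happens.

```python
# Hypothetical data: rows of (city,) stored on four nodes. Not real movr data.
NODES = {
    1: [("Amsterdam",), ("Paris",)],
    2: [("Amsterdam",), ("Amsterdam",)],
    3: [("Rome",), ("Amsterdam",)],
    4: [("Rome",)],
}
GATEWAY = 1  # assumed gateway (local) node for this sketch


def count_local(nodes, city):
    """Toy model of distribution: local.

    Rows are pulled from whichever nodes store them, but all filtering
    and counting is performed on the gateway node only.
    """
    fetched = [row for rows in nodes.values() for row in rows]
    total = sum(1 for (c,) in fetched if c == city)
    return total, {GATEWAY}


def count_full(nodes, city):
    """Toy model of distribution: full.

    Each node holding matching rows computes a partial count in place;
    the gateway only adds up the partial results.
    """
    partials = {}
    for node_id, rows in nodes.items():
        n = sum(1 for (c,) in rows if c == city)
        if n:
            partials[node_id] = n
    return sum(partials.values()), set(partials) | {GATEWAY}


print(count_local(NODES, "Amsterdam"))
print(count_full(NODES, "Amsterdam"))
```

Both calls return the same count, but the second element of the result (the set of nodes that did processing work) differs. In this model, node 4 holds no matching rows and takes no part in the full plan.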

Examples:

distribution: local

[email protected]:59856/movr> explain 
select u.name, u.id
from users u
where u.city = 'Amsterdam' and u.id = 'b3333333-3333-4000-8000-000000000023';
  tree |        field        |                                                 description
-------+---------------------+--------------------------------------------------------------------------------------------------------------
       | distribution        | local
       | vectorized          | false
  scan |                     |
       | estimated row count | 1
       | table               | users@primary
       | spans               | [/'Amsterdam'/'b3333333-3333-4000-8000-000000000023' - /'Amsterdam'/'b3333333-3333-4000-8000-000000000023']
(6 rows)

image

Notice that all query processing occurs on node 1.

distribution: full

[email protected]:59856/movr> explain 
select u.city, count(*) as num
from users u
where u.city = 'Amsterdam' group by u.city;
    tree    |        field        |          description
------------+---------------------+--------------------------------
            | distribution        | full
            | vectorized          | false
  group     |                     |
   └── scan |                     |
            | estimated row count | 0
            | table               | users@primary
            | spans               | [/'Amsterdam' - /'Amsterdam']
(7 rows)

image

Notice that query processing involves node 1 and node 5.

Therefore it is important to note that full does not mean that processing happens on ALL nodes, but that it happens on multiple nodes simultaneously.
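
To check the distribution value programmatically, a small helper can read it out of the tabular output shown in the examples. This is a minimal sketch that assumes the three-column text layout (tree, field, description) pasted in this issue; other versions or output formats may differ. The function name `explain_distribution` and the `SAMPLE` string are hypothetical.

```python
def explain_distribution(explain_output):
    """Return the value of the `distribution` field, or None if absent."""
    for line in explain_output.splitlines():
        # Split into at most three cells: tree, field, description.
        cells = [cell.strip() for cell in line.split("|", 2)]
        if len(cells) == 3 and cells[1] == "distribution":
            return cells[2]
    return None


# Abbreviated sample in the same layout as the output pasted above.
SAMPLE = """\
  tree |        field        | description
-------+---------------------+-------------
       | distribution        | local
       | vectorized          | false
  scan |                     |
"""

print(explain_distribution(SAMPLE))
```

The helper returns None instead of raising when no distribution row is present, so callers can tell "field missing" apart from a parsed value.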

ianjevans pushed a commit that referenced this issue Apr 5, 2021
Fixes #10020, clarify full vs local distributed execution plans.
ianjevans pushed a commit that referenced this issue Apr 5, 2021
ianjevans pushed a commit that referenced this issue Apr 12, 2021
…0240)

* New EXPLAIN ANALYZE PLAN statement.

* Pass to update all EXPLAIN output.

* Fixes #10127 estimated row count in all operators of EXPLAIN.
Fixes #10020, clarify full vs local distributed execution plans.

* Updates to EXPLAIN ANALYZE.

* Adding new demo workload include.

* Adding new statistics, and renaming KV tuple -> KV row.

* Fixes #9582.

* Fixes #9154.

* Fixes #9109.

* Backport fix for #10020.

* Removing insecure performance tuning tutorial.

* Addressed Radu's review comments.

* Fixes #9553.

* Refactoring to use demo cluster with locality flags.

* Incorporated Andy's review comments.

* Incorporated Eric's feedback.

* Generated new `EXPLAIN ANALYZE` diagram.

* Fixes #10262