Skip to content

Commit

Permalink
explain LIR attribution
Browse files Browse the repository at this point in the history
  • Loading branch information
mgree committed Dec 23, 2024
1 parent bc51a40 commit 98a08fd
Showing 1 changed file with 304 additions and 0 deletions.
304 changes: 304 additions & 0 deletions doc/user/content/transform-data/troubleshooting.md
Original file line number Diff line number Diff line change
Expand Up @@ -277,6 +277,310 @@ If you've gone through the dataflow troubleshooting and do not want to make any
changes to your query, consider [sizing up your cluster](/sql/create-cluster/#size).
A larger size cluster will provision more memory and CPU resources.

## Which part of my query runs slowly or uses a lot of memory?

{{< public-preview />}}

You can [`EXPLAIN`](/sql/explain-plan/) a query to see how it will be
run as a dataflow. In particular, `EXPLAIN PHYSICAL PLAN` will show
the concrete, fully optimized plan that Materialize will run. (That
plan is written in our "low-level intermediate representation", which
we call _LIR_.) If you have [created an index](/concepts/indexes) or
[materialized view](/concepts/views#materialized-views), you can use
[`mz_introspection.mz_lir_mapping`](/sql/system-catalog/mz_introspection/#mz_lir_mapping)
to attribute various performance characteristics to the operators
inside your query.

Every time you create an index or materialized view, Materialize uses
[`mz_introspection.mz_lir_mapping`](/sql/system-catalog/mz_introspection/#mz_lir_mapping)
to map the higher-level LIR operators to zero or more lower-level
dataflow operators. You can construct queries that will combine
information from
[`mz_introspection.mz_lir_mapping`](/sql/system-catalog/mz_introspection/#mz_lir_mapping)
and other internal views to discover which parts of your query are
computationally expensive (e.g.,
[`mz_introspection.mz_compute_operator_durations_histogram`](/sql/system-catalog/mz_introspection/#mz_compute_operator_durations_histogram), [`mz_introspection.mz_scheduling_elapsed`](/sql/system-catalog/mz_introspection/#mz_scheduling_elapsed))
or consuming excessive memory (e.g., ).

We show how to use
[`mz_introspection.mz_lir_mapping`](/sql/system-catalog/mz_introspection/#mz_lir_mapping)
by example based on the [`Auction` load
generator](/sql/create-source/load-generator/#auction) from the
[quickstart guide](/get-started/quickstart/#step-2-create-the-source):

```sql
CREATE SOURCE auction_house
FROM LOAD GENERATOR AUCTION
(TICK INTERVAL '1s', AS OF 100000)
FOR ALL TABLES;

CREATE VIEW winning_bids AS
SELECT DISTINCT ON (a.id) b.*, a.item, a.seller
FROM auctions AS a
JOIN bids AS b
ON a.id = b.auction_id
WHERE b.bid_time < a.end_time
AND mz_now() >= a.end_time
ORDER BY a.id, b.amount DESC, b.bid_time, b.buyer;

CREATE INDEX wins_by_item ON winning_bids (item);
```

We attribute four different kinds of performance data to parts of the
`wins_by_item` query: computation time, memory usage, [top-k](/transform-data/idiomatic-materialize-sql/top-k/) query [sizing hints](/transform-data/optimization/#query-hints), and worker
skew.

### Attributing computation time

One way to understand which parts of your query are 'expensive' is to
look at the overall time spent computing in each part. Materialize
reports the time spent in each _dataflow operator_ in
[`mz_introspection.mz_compute_operator_durations_histogram`](/sql/system-catalog/mz_introspection/#mz_compute_operator_durations_histogram). By
joining it with
[`mz_introspection.mz_lir_mapping`](/sql/system-catalog/mz_introspection/#mz_lir_mapping),
we can attribute the time spent in each operator to the higher-level,
more intelligible LIR operators:

```sql
SELECT mo.name AS name, global_id, lir_id, parent_lir_id, REPEAT(' ', nesting * 2) || operator AS operator,
SUM(duration_ns)/1000 * '1 microsecond'::INTERVAL AS duration, SUM(count) AS count
FROM mz_introspection.mz_lir_mapping mlm
LEFT JOIN mz_introspection.mz_compute_operator_durations_histogram mcodh
ON (mlm.operator_id_start <= mcodh.id AND mcodh.id < mlm.operator_id_end)
JOIN mz_catalog.mz_objects mo
ON (mlm.global_id = mo.id)
GROUP BY mo.name, global_id, lir_id, operator, parent_lir_id, nesting
ORDER BY global_id, lir_id DESC;
```

Running this query on an auction generator will produce results that look something like the following (though the specific numbers will vary, of course):


| name | global_id | lir_id | parent_lir_id | operator | duration | count |
| ------------ | --------- | ------ | ------------- | -------------------------------- | --------------- | ------ |
| winning_bids | u148 | 6 | null | TopK::Basic 5 | 00:00:00.74516 | 108650 |
| winning_bids | u148 | 5 | 6 | Join::Differential 2 » 4 | 00:00:00.017005 | 19099 |
| winning_bids | u148 | 4 | 5 | Arrange 3 | 00:00:00.058835 | 11506 |
| winning_bids | u148 | 3 | 4 | Get::PassArrangements u145 | null | null |
| winning_bids | u148 | 2 | 5 | Arrange 1 | 00:00:00.013885 | 7413 |
| winning_bids | u148 | 1 | 2 | Get::Collection u144 | null | null |
| wins_by_item | u149 | 8 | null | Arrange 7 | 00:00:00.013887 | 9347 |
| wins_by_item | u149 | 7 | 8 | Get::PassArrangements u148 | null | null |

The `duration` column lets us see that the `TopK` operator is where we
spend the bulk of the query's computation time.

The query works by finding every dataflow operator in the range
(`mz_lir_mapping.operator_id_start` inclusive to
`mz_lir_mapping.operator_id_end` exclusive) and summing up the time
spent (`SUM(duration_ns)`). The query joins with
[`mz_catalog.mz_objects`](/sql/system-catalog/mz_catalog/#mz_objects)
to find the actual name corresponding to the `global_id`.

Notice that the index is actually _two_ `global_id`s: `u148` is the
`winning_bids` view, while `u149` is the `wins_by_item` index.

The `operator` is indented using [`REPEAT`](/sql/functions/#repeat) and
`mz_lir_mapping.nesting`). The indenting, combined with ordering by
`mz_lir_mapping.lir_id` descending, gives us a tree-shaped view of the
LIR plan.

The LIR operators reported in `mz_lir_mapping.operator` are terser
than those in `EXPLAIN PHYSICAL PLAN`. Each operator is restricted to
a single line, of the form `OPERATOR [INPUT_LIR_ID ...]`. That is,
`lir_id` 4 is the operator `Arrange 3`, the operator that will arrange
in memory the results of reading dataflow `u145` (here, the `bids`
table from the source).

### Attributing memory usage

(If you have not read about [attributing computation time](#attributing-computation-time), please do so first, as it explains some core concepts.)

We can attribute memory usage to our query by joining
[`mz_introspection.mz_lir_mapping`](/sql/system-catalog/mz_introspection/#mz_lir_mapping)
with
[`mz_introspection.mz_arrangement_sizes`](/sql/system-catalog/mz_introspection/#mz_arrangement_sizes):

```sql
SELECT mo.name AS name, global_id, lir_id, parent_lir_id, REPEAT(' ', nesting * 2) || operator AS operator,
pg_size_pretty(SUM(size)) AS size
FROM mz_introspection.mz_lir_mapping mlm
LEFT JOIN mz_introspection.mz_arrangement_sizes mas
ON (mlm.operator_id_start <= mas.operator_id AND mas.operator_id < mlm.operator_id_end)
JOIN mz_catalog.mz_objects mo
ON (mlm.global_id = mo.id)
GROUP BY mo.name, global_id, lir_id, operator, parent_lir_id, nesting
ORDER BY global_id, lir_id DESC;
```

| name | global_id | lir_id | parent_lir_id | operator | size |
| ------------ | --------- | ------ | ------------- | -------------------------------- | ------- |
| winning_bids | u148 | 6 | null | TopK::Basic 5 | 38 MB |
| winning_bids | u148 | 5 | 6 | Join::Differential 2 » 4 | null |
| winning_bids | u148 | 4 | 5 | Arrange 3 | 2008 kB |
| winning_bids | u148 | 3 | 4 | Get::PassArrangements u145 | null |
| winning_bids | u148 | 2 | 5 | Arrange 1 | 900 kB |
| winning_bids | u148 | 1 | 2 | Get::Collection u144 | null |
| wins_by_item | u149 | 8 | null | Arrange 7 | 707 kB |
| wins_by_item | u149 | 7 | 8 | Get::PassArrangements u148 | null |

The `TopK` is overwhelmingly responsible for memory usage: arranging
the `bids` table (`lir_id` 4) and `auctions` table (`lir_id` 2) are
cheap in comparison, as is arranging the final output in the
`wins_by_item` index (`lir_id` 8).

### Attributing `TopK` hints

The
[`mz_introspection.mz_expected_group_size_advice`](/sql/system-catalog/mz_introspection/#mz_expected_group_size_advice)
looks at your running dataflow and suggests parameters you can set. We
can attribute this to particular parts of our query using
[`mz_introspection.mz_lir_mapping`](/sql/system-catalog/mz_introspection/#mz_lir_mapping):

```sql
SELECT mo.name AS name, mlm.global_id AS global_id, lir_id, parent_lir_id, REPEAT(' ', nesting * 2) || operator AS operator,
levels, to_cut, pg_size_pretty(savings), hint
FROM mz_introspection.mz_lir_mapping mlm
JOIN mz_introspection.mz_dataflow_global_ids mdgi
ON (mlm.global_id = mdgi.global_id)
LEFT JOIN mz_introspection.mz_expected_group_size_advice megsa
ON (megsa.dataflow_id = mdgi.id AND
mlm.operator_id_start <= megsa.region_id AND megsa.region_id < mlm.operator_id_end)
JOIN mz_catalog.mz_objects mo
ON (mlm.global_id = mo.id)
ORDER BY mlm.global_id, lir_id DESC;
```

Each `TopK` operator will have associated hints:

| name | global_id | lir_id | parent_lir_id | operator | levels | to_cut | pg_size_pretty | hint |
| ------------ | --------- | ------ | ------------- | -------------------------------- | ------ | ------ | -------------- | ----- |
| winning_bids | u148 | 6 | null | TopK::Basic 5 | 8 | 6 | 27 MB | 255.0 |
| winning_bids | u148 | 5 | 6 | Join::Differential 2 » 4 | null | null | null | null |
| winning_bids | u148 | 4 | 5 | Arrange 3 | null | null | null | null |
| winning_bids | u148 | 3 | 4 | Get::PassArrangements u145 | null | null | null | null |
| winning_bids | u148 | 2 | 5 | Arrange 1 | null | null | null | null |
| winning_bids | u148 | 1 | 2 | Get::Collection u144 | null | null | null | null |
| wins_by_item | u149 | 8 | null | Arrange 7 | null | null | null | null |
| wins_by_item | u149 | 7 | 8 | Get::PassArrangements u148 | null | null | null | null |

Here, the hinted `DISTINCT ON INPUT GROUP SIZE` is `255.0`. We can re-create our view and index using the hint as follows:

```sql
DROP VIEW winning_bids CASCADE;

CREATE VIEW winning_bids AS
SELECT DISTINCT ON (a.id) b.*, a.item, a.seller
FROM auctions AS a
JOIN bids AS b
ON a.id = b.auction_id
WHERE b.bid_time < a.end_time
AND mz_now() >= a.end_time
OPTIONS (DISTINCT ON INPUT GROUP SIZE = 255) -- use hint!
ORDER BY a.id,
b.amount DESC,
b.bid_time,
b.buyer;

CREATE INDEX wins_by_item ON winning_bids (item);
```

Re-running the `TopK`-hints query will show only `null` hints---there
are no hints because our `TopK` is now appropriately sized. But if we
re-run our [query for attributing memory
usage](#attributing-memory-usage), we can see that our `TopK` operator
uses a third of the memory it was using before:

| name | global_id | lir_id | parent_lir_id | operator | size |
| ------------ | --------- | ------ | ------------- | -------------------------------- | ------- |
| winning_bids | u150 | 6 | null | TopK::Basic 5 | 11 MB |
| winning_bids | u150 | 5 | 6 | Join::Differential 2 » 4 | null |
| winning_bids | u150 | 4 | 5 | Arrange 3 | 1996 kB |
| winning_bids | u150 | 3 | 4 | Get::PassArrangements u145 | null |
| winning_bids | u150 | 2 | 5 | Arrange 1 | 575 kB |
| winning_bids | u150 | 1 | 2 | Get::Collection u144 | null |
| wins_by_item | u151 | 8 | null | Arrange 7 | 402 kB |
| wins_by_item | u151 | 7 | 8 | Get::PassArrangements u150 | null |

### Localizing worker skew

[Worker skew](/transform-data/dataflow-troubleshooting/#is-work-distributed-equally-across-workers) occurs when your data do not end up getting evenly
partitioned between workers. Worker skew can only happen when your
cluster has more than one worker. (You can query
[`mz_catalog.mz_cluster_replica_sizes`](/sql/system-catalog/mz_catalog/#mz_cluster_replica_sizes)
to determine how many workers a given cluster size has.)

You can identify worker skew by comparing a worker's time spent to the
overall time spent across all workers:

```sql
SELECT mo.name AS name, global_id, lir_id, REPEAT(' ', 2 * nesting) || operator as operator,
worker_id,
SUM(elapsed_ns) / 1000 * '1 microsecond'::INTERVAL AS elapsed_ns,
SUM(aebi.total_ns) / 1000 * '1 microsecond'::INTERVAL as avg_ns,
SUM(elapsed_ns) / SUM(aebi.total_ns) as ratio
FROM mz_introspection.mz_lir_mapping mlm
CROSS JOIN LATERAL ( SELECT SUM(elapsed_ns) AS total_ns
FROM mz_introspection.mz_scheduling_elapsed_per_worker mse
WHERE mlm.operator_id_start <= id AND id < mlm.operator_id_end
GROUP BY worker_id) aebi
CROSS JOIN LATERAL ( SELECT worker_id, SUM(elapsed_ns) as elapsed_ns
FROM mz_introspection.mz_scheduling_elapsed_per_worker mse
WHERE mlm.operator_id_start <= id AND id < mlm.operator_id_end
GROUP BY worker_id) epw
JOIN mz_catalog.mz_objects mo
ON (mlm.global_id = mo.id)
GROUP BY mo.name, global_id, lir_id, nesting, operator, worker_id
ORDER BY global_id, lir_id DESC;
```

| name | global_id | lir_id | operator | worker_id | elapsed_ns | avg_ns | ratio |
| ------------ | --------- | ------ | -------------------------- | --------- | --------------- | --------------- | ----------------------------------------- |
| winning_bids | u150 | 6 | TopK::Basic 5 | 0 | 00:00:02.359193 | 00:00:02.338105 | 1.00901943619301904208932433787196422606 |
| winning_bids | u150 | 6 | TopK::Basic 5 | 1 | 00:00:02.342375 | 00:00:02.338105 | 1.0018262379840754650921842785877521019 |
| winning_bids | u150 | 6 | TopK::Basic 5 | 2 | 00:00:02.314143 | 00:00:02.338105 | 0.989751590156994710647908375633282330915 |
| winning_bids | u150 | 6 | TopK::Basic 5 | 3 | 00:00:02.336708 | 00:00:02.338105 | 0.999402735665910782170583007907001341123 |
| winning_bids | u150 | 5 | Join::Differential 2 » 4 | 0 | 00:00:00.557381 | 00:00:00.548652 | 1.01590883682808615066920883751285427977 |
| winning_bids | u150 | 5 | Join::Differential 2 » 4 | 1 | 00:00:00.542891 | 00:00:00.548652 | 0.98949878273761917487532841143193434597 |
| winning_bids | u150 | 5 | Join::Differential 2 » 4 | 2 | 00:00:00.546565 | 00:00:00.548652 | 0.99619568016440763325559996932924096555 |
| winning_bids | u150 | 5 | Join::Differential 2 » 4 | 3 | 00:00:00.547773 | 00:00:00.548652 | 0.99839670026988704119986278172597040871 |
| winning_bids | u150 | 4 | Arrange 3 | 0 | 00:00:00.314206 | 00:00:00.338216 | 0.929009114588746221704744632262170530522 |
| winning_bids | u150 | 4 | Arrange 3 | 1 | 00:00:00.306153 | 00:00:00.338216 | 0.905199632876776517170415418238708660282 |
| winning_bids | u150 | 4 | Arrange 3 | 2 | 00:00:00.432691 | 00:00:00.338216 | 1.27933123914530934762928875720815264567 |
| winning_bids | u150 | 4 | Arrange 3 | 3 | 00:00:00.299815 | 00:00:00.338216 | 0.886460013389167913495551192290968163528 |
| winning_bids | u150 | 2 | Arrange 1 | 0 | 00:00:00.258171 | 00:00:00.244214 | 1.05715143156633465139502061854023757049 |
| winning_bids | u150 | 2 | Arrange 1 | 1 | 00:00:00.234322 | 00:00:00.244214 | 0.959493923931760528554652347461354389562 |
| winning_bids | u150 | 2 | Arrange 1 | 2 | 00:00:00.230453 | 00:00:00.244214 | 0.943651272055032566225444196288453924574 |
| winning_bids | u150 | 2 | Arrange 1 | 3 | 00:00:00.25391 | 00:00:00.244214 | 1.03970337244687225382488283770995411537 |
| wins_by_item | u151 | 8 | Arrange 7 | 0 | 00:00:00.202201 | 00:00:00.204736 | 0.987618403431703110437750824708179947656 |
| wins_by_item | u151 | 8 | Arrange 7 | 1 | 00:00:00.216731 | 00:00:00.204736 | 1.05858895341593917863900275768799815376 |
| wins_by_item | u151 | 8 | Arrange 7 | 2 | 00:00:00.220717 | 00:00:00.204736 | 1.07805793303181678638721908696246766799 |
| wins_by_item | u151 | 8 | Arrange 7 | 3 | 00:00:00.179294 | 00:00:00.204736 | 0.875734710120540924536027330641354230589 |

The `ratio` column tells you whether a worker is particularly over- or under-loaded.

### Writing your own attribution queries

Since Materialize maps LIR nodes to ranges of dataflow operators in
[`mz_introspection.mz_lir_mapping`](/sql/system-catalog/mz_introspection/#mz_lir_mapping),
you can freely combine information from
[`mz_catalog`](/sql/system-catalog/mz_catalog/) and
[`mz_introspection`](/sql/system-catalog/mz_introspection/) to
understand your dataflows' behavior. We offer the queries above as a
starting point. When building your own, keep the following in mind.

1. `mz_lir_mapping.operator` is not stable and **should not be
parsed**. If you want to traverse the LIR tree, use
`mz_lir_mapping.parent_lir_id`.
2. Make sure you `GROUP BY global_id`---mixing `lir_id`s from
different `global_id`s will produce nonsense.
3. Use `REPEAT(' ', 2 * nesting) || operator` and `ORDER BY lir_id
DESC` to correctly render the LIR tree.
4. `mz_lir_mapping.operator_id_start` is inclusive;
`mz_lir_mapping.operator_id_end` is exclusive. If they are equal
to each other, that LIR operator does not correspond to any
dataflow operators.

## How do I troubleshoot slow queries?

Materialize stores a (sampled) log of the SQL statements that are issued against
Expand Down

0 comments on commit 98a08fd

Please sign in to comment.