sql: olap query hitting context canceled/can't complete #36842

awoods187 · 2019-04-15T15:42:13Z

I'm trying to run an analytics query on registration cluster backup spun up in AWS (with no other traffic on m4.xlarge) but I keep seeing:
pq: communication error: rpc error: code = Canceled desc = context canceled root@cockroachdb-public:26257/registration> pod default/cockroachdb-22073 terminated (Error)

SELECT
    d.month,
    sum(d.constraints) AS constraints,
    sum(d.lease_preferences) AS lease_preferences,
    sum(d.both) AS both,
    sum(d.neither) AS neither
FROM
    (
        SELECT
            x.month,
            x.cluster,
            max(x.constraints) AS constraints,
            max(x.lease_preferences) AS lease_preferences,
            max(x.both) AS both,
            max(x.neither) AS neither
        FROM
            (
                SELECT
                    z.cluster,
                    substring(z.timestamp::STRING, 1, 7)
                        AS month,
                    IF(
                        z.cfg->'constraints' != 'null'
                        AND json_array_length(
                                z.cfg->'constraints'
                            )
                            > 0
                        AND z.cfg->'constraints'->0->'constraints'
                            != 'null',
                        1,
                        0
                    )
                        AS constraints,
                    IF(
                        z.cfg->'lease_preferences' != 'null'
                        AND json_array_length(
                                z.cfg->'lease_preferences'
                            )
                            > 0
                        AND z.cfg->'lease_preferences'->0->'constraints'
                            != 'null',
                        1,
                        0
                    )
                        AS lease_preferences,
                    IF(
                        (
                            z.cfg->'lease_preferences'
                            != 'null'
                            AND json_array_length(
                                    z.cfg->'lease_preferences'
                                )
                                > 0
                            AND z.cfg->'lease_preferences'->0->'constraints'
                                != 'null'
                        )
                        AND (
                                z.cfg->'constraints'
                                != 'null'
                                AND json_array_length(
                                        z.cfg->'constraints'
                                    )
                                    > 0
                                AND z.cfg->'constraints'->0->'constraints'
                                    != 'null'
                            ),
                        1,
                        0
                    )
                        AS both,
                    IF(
                        NOT
                            (
                                z.cfg->'lease_preferences'
                                != 'null'
                                AND json_array_length(
                                        z.cfg->'lease_preferences'
                                    )
                                    > 0
                                AND z.cfg->'lease_preferences'->0->'constraints'
                                    != 'null'
                            )
                        AND NOT
                                (
                                    z.cfg->'constraints'
                                    != 'null'
                                    AND json_array_length(
                                            z.cfg->'constraints'
                                        )
                                        > 0
                                    AND z.cfg->'constraints'->0->'constraints'
                                        != 'null'
                                ),
                        1,
                        0
                    )
                        AS neither
                FROM
                    registration.zonecfgs AS z
                    INNER JOIN clusters AS c
                    ON z.cluster = c.id
                WHERE
                    (c.lastseen::DATE - c.firstseen::DATE)
                    >= '7'
                    AND (
                            c.internal != true
                            OR c.internal IS NULL
                        )
            )
                AS x
        GROUP BY
            x.cluster, x.month
    )
        AS d
GROUP BY
    d.month;

Here is the explain (opt, env):

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
  Version: CockroachDB CCL v19.1.0-rc.2 (x86_64-unknown-linux-gnu, built 2019/04/05 00:36:18, go1.11.6)

  CREATE TABLE zonecfgs (
      cluster UUID NOT NULL,
      zone INT NOT NULL,
      "timestamp" TIMESTAMP NOT NULL DEFAULT now():::TIMESTAMP,
      cfg JSONB NULL,
      CONSTRAINT "primary" PRIMARY KEY (cluster ASC, zone ASC, "timestamp" DESC),
      FAMILY "primary" (cluster, zone, "timestamp", cfg)
  );

  ALTER TABLE registration.public.zonecfgs INJECT STATISTICS '[
      {
          "columns": [
              "cluster"
          ],
          "created_at": "2019-04-15 14:08:26.858097+00:00",
          "distinct_count": 820850,
          "histo_col_type": "",
          "name": "__auto__",
          "null_count": 0,
          "row_count": 154033788
      },
      {
          "columns": [
              "zone"
          ],
          "created_at": "2019-04-15 14:08:26.858097+00:00",
          "distinct_count": 860,
          "histo_col_type": "",
          "name": "__auto__",
          "null_count": 0,
          "row_count": 154033788
      },
      {
          "columns": [
              "timestamp"
          ],
          "created_at": "2019-04-15 14:08:26.858097+00:00",
          "distinct_count": 152173206,
          "histo_col_type": "",
          "name": "__auto__",
          "null_count": 0,
          "row_count": 154033788
      }
  ]';

  CREATE TABLE clusters (
      id UUID NOT NULL,
      firstseen TIMESTAMP NULL DEFAULT now():::TIMESTAMP,
      lastseen TIMESTAMP NOT NULL DEFAULT now():::TIMESTAMP,
      pings INT NOT NULL DEFAULT 0:::INT,
      insecure BOOL NULL,
      maxnodeid INT NULL,
      licensetype STRING NULL,
      internal BOOL NULL,
      CONSTRAINT "primary" PRIMARY KEY (id ASC),
      FAMILY "primary" (id, firstseen, lastseen, pings, insecure, maxnodeid, licensetype, internal)
  );

  ALTER TABLE registration.public.clusters INJECT STATISTICS '[
      {
          "columns": [
              "id"
          ],
          "created_at": "2019-04-15 13:51:33.119775+00:00",
          "distinct_count": 2865945,
          "histo_col_type": "",
          "name": "__auto__",
          "null_count": 0,
          "row_count": 2862353
      },
      {
          "columns": [
              "firstseen"
          ],
          "created_at": "2019-04-15 13:51:33.119775+00:00",
          "distinct_count": 2845027,
          "histo_col_type": "",
          "name": "__auto__",
          "null_count": 0,
          "row_count": 2862353
      },
      {
          "columns": [
              "lastseen"
          ],
          "created_at": "2019-04-15 13:51:33.119775+00:00",
          "distinct_count": 2457566,
          "histo_col_type": "",
          "name": "__auto__",
          "null_count": 0,
          "row_count": 2862353
      },
      {
          "columns": [
              "pings"
          ],
          "created_at": "2019-04-15 13:51:33.119775+00:00",
          "distinct_count": 6750,
          "histo_col_type": "",
          "name": "__auto__",
          "null_count": 0,
          "row_count": 2862353
      },
      {
          "columns": [
              "insecure"
          ],
          "created_at": "2019-04-15 13:51:33.119775+00:00",
          "distinct_count": 2,
          "histo_col_type": "",
          "name": "__auto__",
          "null_count": 388391,
          "row_count": 2862353
      },
      {
          "columns": [
              "maxnodeid"
          ],
          "created_at": "2019-04-15 13:51:33.119775+00:00",
          "distinct_count": 129,
          "histo_col_type": "",
          "name": "__auto__",
          "null_count": 433074,
          "row_count": 2862353
      },
      {
          "columns": [
              "licensetype"
          ],
          "created_at": "2019-04-15 13:51:33.119775+00:00",
          "distinct_count": 6,
          "histo_col_type": "",
          "name": "__auto__",
          "null_count": 800363,
          "row_count": 2862353
      },
      {
          "columns": [
              "internal"
          ],
          "created_at": "2019-04-15 13:51:33.119775+00:00",
          "distinct_count": 2,
          "histo_col_type": "",
          "name": "__auto__",
          "null_count": 1097500,
          "row_count": 2862353
      }
  ]';

  EXPLAIN (OPT, ENV) SELECT d.month, sum(d.constraints) AS constraints, sum(d.lease_preferences) AS lease_preferences, sum(d."both") AS both, sum(d.neither) AS neither FROM (SELECT x.month, x.cluster, max(x.constraints) AS constraints, max(x.lease_preferences) AS lease_preferences, max(x."both") AS both, max(x.neither) AS neither FROM (SELECT z.cluster, substring(z.timestamp::STRING, 1, 7) AS month, IF((((z.cfg->'constraints') != 'null') AND (json_array_length(z.cfg->'constraints') > 0)) AND ((((z.cfg->'constraints')->0)->'constraints') != 'null'), 1, 0) AS constraints, IF((((z.cfg->'lease_preferences') != 'null') AND (json_array_length(z.cfg->'lease_preferences') > 0)) AND ((((z.cfg->'lease_preferences')->0)->'constraints') != 'null'), 1, 0) AS lease_preferences, IF(((((z.cfg->'lease_preferences') != 'null') AND (json_array_length(z.cfg->'lease_preferences') > 0)) AND ((((z.cfg->'lease_preferences')->0)->'constraints') != 'null')) AND ((((z.cfg->'constraints') != 'null') AND (json_array_length(z.cfg->'constraints') > 0)) AND ((((z.cfg->'constraints')->0)->'constraints') != 'null')), 1, 0) AS both, IF((NOT ((((z.cfg->'lease_preferences') != 'null') AND (json_array_length(z.cfg->'lease_preferences') > 0)) AND ((((z.cfg->'lease_preferences')->0)->'constraints') != 'null'))) AND (NOT ((((z.cfg->'constraints') != 'null') AND (json_array_length(z.cfg->'constraints') > 0)) AND ((((z.cfg->'constraints')->0)->'constraints') != 'null'))), 1, 0) AS neither FROM registration.zonecfgs AS z INNER JOIN clusters AS c ON z.cluster = c.id WHERE ((c.lastseen::DATE - c.firstseen::DATE) >= '7') AND ((c.internal != true) OR (c.internal IS NULL))) AS x GROUP BY x.cluster, x.month) AS d GROUP BY d.month;
  ----
  group-by
   ├── group-by
   │    ├── project
   │    │    ├── inner-join (lookup zonecfgs)
   │    │    │    ├── select
   │    │    │    │    ├── scan c
   │    │    │    │    └── filters
   │    │    │    │         ├── (lastseen::DATE - firstseen::DATE) >= 7
   │    │    │    │         └── (internal != true) OR (internal IS NULL)
   │    │    │    └── filters (true)
   │    │    └── projections
   │    │         ├── substring(timestamp::STRING, 1, 7)
   │    │         ├── CASE (((cfg->'constraints') != 'null') AND (json_array_length(cfg->'constraints') > 0)) AND ((((cfg->'constraints')->0)->'constraints') != 'null') WHEN true THEN 1 ELSE 0 END
   │    │         ├── CASE (((cfg->'lease_preferences') != 'null') AND (json_array_length(cfg->'lease_preferences') > 0)) AND ((((cfg->'lease_preferences')->0)->'constraints') != 'null') WHEN true THEN 1 ELSE 0 END
   │    │         ├── CASE ((((((cfg->'lease_preferences') != 'null') AND (json_array_length(cfg->'lease_preferences') > 0)) AND ((((cfg->'lease_preferences')->0)->'constraints') != 'null')) AND ((cfg->'constraints') != 'null')) AND (json_array_length(cfg->'constraints') > 0)) AND ((((cfg->'constraints')->0)->'constraints') != 'null') WHEN true THEN 1 ELSE 0 END
   │    │         └── CASE (((cfg @> '{"lease_preferences": null}') OR (json_array_length(cfg->'lease_preferences') <= 0)) OR (((cfg->'lease_preferences')->0) @> '{"constraints": null}')) AND (((cfg @> '{"constraints": null}') OR (json_array_length(cfg->'constraints') <= 0)) OR (((cfg->'constraints')->0) @> '{"constraints": null}')) WHEN true THEN 1 ELSE 0 END
   │    └── aggregations
   │         ├── max
   │         │    └── variable: constraints
   │         ├── max
   │         │    └── variable: lease_preferences
   │         ├── max
   │         │    └── variable: both
   │         └── max
   │              └── variable: neither
   └── aggregations
        ├── sum
        │    └── variable: max
        ├── sum
        │    └── variable: max
        ├── sum
        │    └── variable: max
        └── sum
             └── variable: max
(187 rows)

Time: 64.115101ms

The text was updated successfully, but these errors were encountered:

jordanlewis · 2019-04-15T15:46:01Z

Can you pull the logs? This doesn't seem like a planning issue to me.

justinj · 2019-04-15T16:21:06Z

@awoods187 I deleted your comment for now, I don't think we should post logs from the reg cluster on public issues.

awoods187 · 2019-04-15T18:40:21Z

I put them on gdrive instead:
https://drive.google.com/open?id=1XJ9VeeP7GuHoDdPjgxZavZ5JZMHVsqSi

jordanlewis · 2019-04-23T18:26:49Z

Action item here: determine what the root cause of the error is.

asubiotto · 2019-04-23T20:05:43Z

Everything looks fine in the logs until we start getting a bunch of:

E190415 16:05:24.226998 109581 sql/distsqlrun/server.go:614  [n2] query execution canceled

This is accompanied by a breaker trip to n5. Further reconnection attempts fail with a context deadline exceeded error, which is interesting:

I190415 16:05:32.452458 198 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n2] circuitbreaker: rpc <elided>:26257->5 tripped: failed to connect to n5 at <elided>:26257: initial connection heartbeat failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded

There might be a connection between this stuck state and the context cancellation earlier. @awoods187 as discussed it would be interesting to get the goroutine dumps in this stuck state and maybe also n5's logs, which seems to be a problem node. Is this the only query you're running? An EXPLAIN (DISTSQL) might also help.

awoods187 · 2019-04-23T20:12:20Z

https://cockroachdb.github.io/distsqlplan/decode.html#eJzsW11z4sgVfc-v0PTLmKwE6m59V7kiZ-zseGsME8wmu067XAIaTIqVXJJIyZnyf0-1hG1ATF8J1sOQ1YunETrdV-e07j1XaL6gMBrzbvAbT5D3L4SRighSEUUqMpCKTHSrooc4GvEkiWJxSgG4HGfI01U0Cx8WqTh8q6JRFHPkfUHpLJ1z5KFBMJzzPg_GPO6IucY8DWbzfJmHePZbED_6_41CPppME6Si64cgTDylw9AjYxl3Gct0k7EMOwljkx_FMYuxbDj-i_iGMBYylrnijGASPzLUwWaH6NjRsK7pxkB3PNPwDNzGtqO7xo3WYegLY9koOBUQh7HM4WIGfsUYQ-IjafVbjGVj81exLJ-KOW0xp6vpRNPpQCcetT3daRuORe0bpKLeIvUUH6s-VX0D3T6pKFqkS0ZeiRg-KvdBcr9OgY_R7dOtipI0mHLk4Sd1N2bN7cyO5osk5fEKs4IBccGjdiyuj3SzD4IJKv7Ygkc--vufBbc4nDDU-RzzySy7CMdIRX-bzVMee8rJiU897_xscKFoik-KYUthC12n_FSxPc-77A6clnLWPRcnO8q7UyWNF7yl9PqK-Hx5rXR__vSp1Volj-T8OXvxR77K3-s8UTzmMR9vzvODWLjSWVukuOLxlP8UzUIedzBeB835JD3x8Q-t03g2vS-GSEV9Ho4Fm8limKTxLJyeCCqvB_3L7o8qXnKovpCpKh_Ori9OBPdawfT7URQmaRzMwjR57z1DW-9O34eL-Vwc-em61_1r66x7fvLvJArvgjgOHu_mPJym9-A0xZf68_L5LBVW1zZwNWP958eLrtgpg48X3WcOLj5dXzzPd9E930LEnAcJv3uI-YTHPBzxveiQTQaSIgN_Q2r-eOwUi9W-N_6Pbw5_OfkXhkq8M-SJqZ9e5-7195N9dPrKRq-_s-orQa-QUg53yfjaVUoBwPVJZJZcWTWVK19TJX3zgrdaIF-LH61jHs6m05hPgzSKO9jaqHKW6osa1itKoKf44oSz7q933d7gThTuE99qlQ5hcejq7JcTn7yM6MvIeBmZrbolXvXJWpU39qvycjYI3rb8Kht4-6WvHwI5WI6sVw9UGCBD9U3Vt76qsrmjymRT5a9cxvXPV8vo8xF9GRkvo_oKrslnVZBvEW4TENKOlrT7_a7xNX57LX5c3aTrVdufpUnHojXBQ_FnJCy5dZl9ZCybjEUHZPdFl-Liz51fRNtiMtQhZNn7YI3gAaYeoR7FbdcyHNu6UYJwrJhKlN4XDcFbdS34IK7bbkx3Y7ob092Y7sZ0N6b7YKYbMAOrppsctemuXeQBNuxjMt3VVSabKn8npnu7fLuZblLS7u1NN6luummN3xzajGUBT_JfGa4Yy0aTS-GtDXEYMxbMxaeJsOL82Z671Nr41cEaEOrpumcYbZsapPjRweqlAjY5FbM652Ks52jxh4hprXfCzwufjwPGGLoR01rLaYmG9YGOPdP2iN12TEcnhZ8nb-_nyWGeouuNoW8MfWPoG0PfGPrG0B_M0ANGY9XCmkdt6GtXecAU6sdk6KurTDZV_k4M_Xb5dnyKXtLu7Q09rW7oSQ1Dv3ypx3UYy4YTxv4j_hW2fUjOlqY-c_L3iyzlH2IsThyR_In6cPWJuq5RnBtwy9ONNjYM17zRXhfg1oNYRkzHnfz1mg_L95MyBz_Pmy88cfLXba7yTmL40jq4mo414gx06lHXo1bbMFxquoXJt97e5Nfg36r6qtGObxppx_iqET1Ik-Q2PVLTIzU9UtMjNT1S0yMdrEcCvMNqj2QcdY9Uu8gDbLjH1CNVV5lsqvyd9Ejb5dvxR4-Sdm_fIxnVPTqu1SP9zi8aaW_deOVNkfP2TZFxEFPvNKa-MfWNqW9MfWPqG1N_MFMPmI1VG0uP2tTXLvIAG84xmfrqKpNNlb8TU79dvh1NfUm7b2vqd4q_G2nRQ4eSjTO3r-3UUb_Pk4coTHilmXURGx9PeXEhSbSIR_xzHI3y_-FRfOzluPztszFP0uJbp_hwGRZfiQCrg4m9Fxrvg8bGXmvLwbgGZ6QeeIOzumi8D3qDs7pry8FkE6yvgukaWN8EU3nculwuOZq4e6HpPmhs7YMGtrghZdyUM27K4wZuEDmaAne2HE3MfdAYyGhyNCCXJd_jAGn2PplYDoYyMYAG9JKjoUwMrC0HO1LGXfked_fKKnI0lFUANJBVgMiBbSpHA3LhUuVcoxxjOecYKEFQ5ZXDocwCwKHUAgUP3KMAHFANAxUUYk5eS6DQgSoI5BcIDskGVFFoywKrA2h5HcVAIcVAPQKyDACH0gwEB_IMFDy0ZeVwSDZ5OcU2QDxQliDZ5XAw0wBlDco0QPDQ7SqHQ7LJqyqYaeT1BQodqIxQpgHgkGxAZYW2LLC6HE3q9OLlBqtWM14bDnSWABxqLaHVAXSpsK5tWELkmYKUKmutFA3AoRQNwYEUDQUP3OsAHNjvpFRZ14k3AOJLpbVWigbgUIoG4FCKhoIH8hwAh2QrVdZaKZqUSmudFA2goRQNwSHZ5HAoRUOrA-hSYV3f8A6w4UuVtV6mkcPBTAPAoUwDBA9tWTkckq1UWdc3vC4nngLlCZAdgEOZBoBDmQYKHrhdATj0CFJeW6FMQ4GudfPB7-3Tn_4XAAD__8YcHQQ=

awoods187 · 2019-04-23T23:42:41Z

Goroutines https://drive.google.com/open?id=1mhgqp-0YaeNZVZsQcU6SYHGJz7dLTBj1

awoods187 · 2019-04-23T23:54:43Z

When I ran this command in the webui it showed that it killed a node (k8 since restarted it) but I grabbed the logs

https://drive.google.com/open?id=1HLCeHz0X7yXxXLZHlFyOttMZxzzPZnL_

awoods187 · 2019-04-24T13:20:17Z

I restored this to a roachprod cluster and ran the query. Here's the debug.zip while the query was running:https://drive.google.com/open?id=1LSRdaKl7BRMafmnp-bLrXwuennEME-4L

It turns out that this query causes an OOM:

[41126.525429] Out of memory: Kill process 3974 (cockroach) score 984 or sacrifice child
[41126.528327] Killed process 3974 (cockroach) total-vm:48908760kB, anon-rss:31296772kB, file-rss:0kB

Link to OOM heap profile https://drive.google.com/open?id=1FZJDZZeZaspEWO8VyuM1U8824R50hlK3
Link to logs https://drive.google.com/open?id=1SKBY4oLn8RXD1UVr_AUwm-kVl1Vn7uhd
Link to goroutines post OOM https://drive.google.com/open?id=1q8cfsrmdGzvyoMFl_DLy-rVfaRe75sVu

asubiotto · 2019-04-24T18:56:15Z

Looks like a bunch of unaccounted-for memory (unused but kept around by references). Unmarshal refers to remote byte slices received and copyFromSliceVector are local byte slices.

asubiotto · 2019-04-24T22:30:02Z

Actually, the dead node seems to have accounted for (this is through sql.mem.distsql.current) a big chunk of this memory (6.84G at time of crash):

The memory profile above shows around 7.5G total allocated memory at 13:04 which lines up with the crash. It seems, however, that the memory stats show a different picture:

So the unaccounted for memory seems to be garbage + CGO as far as I can tell. Unfortunately I don't think we can do much about fixing this since this is a limitation of our memory monitoring model.

It seems like setting --max-sql-memory to a lower limit will probably at least give you an error instead of crashing a node.

It's interesting that so much memory is being used when most of the plan is supposed to be streaming (due to the ordered results) execution. However, looking at the zonecfgs table shows that it has a potential to generate a significant amount of garbage. Running SELECT COUNT(*) FROM zonecfgs (note that we request kvs but never actually use values in execution land) shows that we use 18GiB to do this:

The table has around 161 million rows. This could potentially generate a fair amount of unaccounted for memory and it seems like each cluster ID has a fair amount of redundant data. Joining this with the clusters table could be an issue. There's probably a better way to write this query. Will try out a grouping before the join.

asubiotto · 2019-04-24T22:56:36Z

That seems to do the trick. No crashes occur when grouping by cluster ID first. Closing this issue and will share the query/result privately.

awoods187 · 2019-04-30T16:26:54Z

Reopening pending further investigation

jordanlewis · 2019-05-20T19:34:49Z

Can confirm that doing this again still causes an OOM error.

jordanlewis · 2019-07-10T00:30:04Z

Okay, I think I've gotten to the bottom of this. I think the root cause is that lookup joins can't impose a limit on the number of rows returned: #35950.

As you can see from the plan, the first join in the query is a lookup join that scans clusters and looks up into zonecfgs. The clusters table has one row per cluster. The zonecfgs has many rows per cluster. At worst, over 5 million!

root@localhost:26257/registration> select count(*) from zonecfgs where cluster='6a560e95-f5b3-4c5c-80f2-7bf16f191e23';
   count
+---------+
  5511378
(1 row)

Time: 2.5449804s

Because of the limitation in #35950, this means that the innocent looking lookup join plan creates a single span that returns over 5 million rows, each of which has a sizable amount of JSON inside. Together with the fact that DistSQL doesn't let Go GC any memory from an MVCCScan-returned batch until nobody references any of that memory anymore (see #26553), and the fact that the query requires that all of the JSON payloads within each of the rows be expanded into memory further, this means that this query requires holding those 5 million rows both in bytes and in expanded JSON form at once. This could easily add up to be too much to fit in memory, hence the explosion.

The resolution to this should be fixing #35950.

awoods187 · 2019-07-22T21:24:23Z

@jordanlewis do you expect that this fixed now?

jordanlewis · 2019-07-22T21:25:37Z

It should be, yes. Retrying it would be useful to me if you have the bandwidth.

awoods187 · 2019-07-22T22:34:36Z

It still kills a node via OOM.

[ 3197.987433] Out of memory: Kill process 4402 (cockroach) score 978 or sacrifice child
[ 3197.990801] Killed process 4402 (cockroach) total-vm:58170312kB, anon-rss:31111708kB, file-rss:0kB

Heap profiles

It's up on andy-secure

solongordon · 2019-09-11T21:05:46Z

Well, it takes an hour, but the query completes now!

   month  | constraints | lease_preferences | both | neither
+---------+-------------+-------------------+------+---------+
  2019-05 |          19 |                 5 |    1 |    6753
  2018-09 |           4 |                 0 |    0 |    2562
  2019-01 |          11 |                 1 |    1 |    4896
  2018-12 |           7 |                 0 |    0 |    4519
  2019-04 |          19 |                 5 |    4 |    6001
  2019-03 |          13 |                 3 |    3 |    5824
  2019-02 |          11 |                 1 |    1 |    5229
  2019-06 |          22 |                 8 |    1 |    6973
  2019-07 |         156 |                 4 |    2 |    7456
  2019-08 |         195 |                 9 |    4 |    6912
  2018-10 |          10 |                 0 |    0 |    4200
  2018-11 |          10 |                 0 |    0 |    4608
(12 rows)

Time: 1h6m1.831880543s

Disk spilling appears to be working as expected. I kept an eye on the heap profile while the query was executing and joinreader usage stayed within reasonable bounds.

A bit concerningly, I did run into an OOM the first time I tried to test this. But that was on a cluster where I swapped out several different CRDB versions trying to diagnose an unrelated RESTORE issue, so it may have been in a bad state to begin with. I think I'll test this once more tomorrow to double-check that the query completes on a happy cluster. But barring further issues I'm considering this closed.

awoods187 added C-investigation Further steps needed to qualify. C-label will change. A-sql-optimizer SQL logical planning and optimizations. labels Apr 15, 2019

cockroachdb deleted a comment from awoods187 Apr 15, 2019

awoods187 assigned jordanlewis Apr 16, 2019

jordanlewis assigned asubiotto and unassigned jordanlewis Apr 23, 2019

awoods187 added the S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. label Apr 24, 2019

asubiotto closed this as completed Apr 24, 2019

awoods187 reopened this Apr 30, 2019

jordanlewis assigned jordanlewis and unassigned asubiotto Apr 30, 2019

jordanlewis mentioned this issue Jul 9, 2019

storage: return byte batches in ScanResponse #26553

Merged

jordanlewis mentioned this issue Jul 23, 2019

distsqlrun: lookup join still unboundedly buffers, but this time in a different way #39044

Closed

jordanlewis assigned solongordon and unassigned jordanlewis Sep 3, 2019

solongordon closed this as completed Sep 11, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sql: olap query hitting context canceled/can't complete #36842

sql: olap query hitting context canceled/can't complete #36842

awoods187 commented Apr 15, 2019

jordanlewis commented Apr 15, 2019

justinj commented Apr 15, 2019

awoods187 commented Apr 15, 2019

jordanlewis commented Apr 23, 2019

asubiotto commented Apr 23, 2019

awoods187 commented Apr 23, 2019 •

edited

Loading

awoods187 commented Apr 23, 2019 •

edited

Loading

awoods187 commented Apr 23, 2019 •

edited

Loading

awoods187 commented Apr 24, 2019

asubiotto commented Apr 24, 2019 •

edited

Loading

asubiotto commented Apr 24, 2019

asubiotto commented Apr 24, 2019

awoods187 commented Apr 30, 2019

jordanlewis commented May 20, 2019

jordanlewis commented Jul 10, 2019

awoods187 commented Jul 22, 2019

jordanlewis commented Jul 22, 2019

awoods187 commented Jul 22, 2019 •

edited

Loading

solongordon commented Sep 11, 2019

sql: olap query hitting context canceled/can't complete #36842

sql: olap query hitting context canceled/can't complete #36842

Comments

awoods187 commented Apr 15, 2019

jordanlewis commented Apr 15, 2019

justinj commented Apr 15, 2019

awoods187 commented Apr 15, 2019

jordanlewis commented Apr 23, 2019

asubiotto commented Apr 23, 2019

awoods187 commented Apr 23, 2019 • edited Loading

awoods187 commented Apr 23, 2019 • edited Loading

awoods187 commented Apr 23, 2019 • edited Loading

awoods187 commented Apr 24, 2019

asubiotto commented Apr 24, 2019 • edited Loading

asubiotto commented Apr 24, 2019

asubiotto commented Apr 24, 2019

awoods187 commented Apr 30, 2019

jordanlewis commented May 20, 2019

jordanlewis commented Jul 10, 2019

awoods187 commented Jul 22, 2019

jordanlewis commented Jul 22, 2019

awoods187 commented Jul 22, 2019 • edited Loading

solongordon commented Sep 11, 2019

awoods187 commented Apr 23, 2019 •

edited

Loading

awoods187 commented Apr 23, 2019 •

edited

Loading

awoods187 commented Apr 23, 2019 •

edited

Loading

asubiotto commented Apr 24, 2019 •

edited

Loading

awoods187 commented Jul 22, 2019 •

edited

Loading