opt: infer lookup join equality conditions from unique key and foreign key #69617

rytaft · 2021-08-30T21:33:29Z

Consider the following example, using ./cockroach demo --global --empty --nodes=9:

CREATE DATABASE db PRIMARY REGION "us-east1" REGIONS "us-west1", "europe-west1";
USE db;

CREATE TABLE accounts (
    account_id  UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name        STRING NOT NULL,
    crdb_region crdb_internal_region NOT NULL,
    UNIQUE INDEX (account_id, crdb_region) -- this shouldn't be needed, but is. See https://github.com/cockroachdb/cockroach/issues/64619.
) LOCALITY GLOBAL;

CREATE TABLE tweets (
    account_id UUID NOT NULL,
    tweet_id   UUID DEFAULT gen_random_uuid(),
    message    STRING NOT NULL,
    crdb_region crdb_internal_region NOT NULL,
    PRIMARY KEY (crdb_region, account_id, tweet_id),
    FOREIGN KEY (account_id, crdb_region) REFERENCES accounts (account_id, crdb_region) ON DELETE CASCADE ON UPDATE CASCADE
) LOCALITY REGIONAL BY ROW;

INSERT INTO accounts VALUES (DEFAULT, 'starburst', 'us-east1') RETURNING account_id;

INSERT INTO tweets VALUES (
    '6f781502-4936-43cc-b384-04e5cf292cc8',
    DEFAULT,
    'this is tweeet 1',
    (SELECT crdb_region FROM accounts WHERE account_id = '6f781502-4936-43cc-b384-04e5cf292cc8')
) RETURNING tweet_id;

INSERT INTO tweets VALUES (
    '6f781502-4936-43cc-b384-04e5cf292cc8',
    DEFAULT,
    'this is tweeet 2',
    (SELECT crdb_region FROM accounts WHERE account_id = '6f781502-4936-43cc-b384-04e5cf292cc8')
) RETURNING tweet_id;

Ideally, the following query would be able to use a lookup join and visit only a single region, but it currently uses a merge join:

EXPLAIN SELECT tweet_id, message
FROM accounts
JOIN tweets USING (account_id)
WHERE account_id = '6f781502-4936-43cc-b384-04e5cf292cc8';
                                                                                                                                                                         info
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  distribution: full
  vectorized: true

  • merge join
  │ estimated row count: 2
  │ equality: (account_id) = (account_id)
  │ left cols are key
  │
  ├── • scan
  │     estimated row count: 1 (100% of the table; stats collected 1 minute ago)
  │     table: accounts@accounts_account_id_crdb_region_key
  │     spans: [/'6f781502-4936-43cc-b384-04e5cf292cc8'/'europe-west1' - /'6f781502-4936-43cc-b384-04e5cf292cc8'/'us-west1']
  │
  └── • scan
        estimated row count: 2 (100% of the table; stats collected 1 minute ago)
        table: tweets@primary
        spans: [/'europe-west1'/'6f781502-4936-43cc-b384-04e5cf292cc8' - /'europe-west1'/'6f781502-4936-43cc-b384-04e5cf292cc8'] [/'us-east1'/'6f781502-4936-43cc-b384-04e5cf292cc8' - /'us-east1'/'6f781502-4936-43cc-b384-04e5cf292cc8'] [/'us-west1'/'6f781502-4936-43cc-b384-04e5cf292cc8' - /'us-west1'/'6f781502-4936-43cc-b384-04e5cf292cc8']
(17 rows)

We can force a lookup join, but it still requires visiting all regions:

EXPLAIN SELECT tweet_id, message
FROM accounts
INNER LOOKUP JOIN tweets USING (account_id)
WHERE account_id = '6f781502-4936-43cc-b384-04e5cf292cc8';
                                                             info
------------------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • lookup join
  │ estimated row count: 2
  │ table: tweets@primary
  │ lookup condition: (account_id = account_id) AND (crdb_region IN ('europe-west1', 'us-east1', 'us-west1'))
  │ pred: account_id = '6f781502-4936-43cc-b384-04e5cf292cc8'
  │
  └── • scan
        estimated row count: 1 (100% of the table; stats collected 2 minutes ago)
        table: accounts@accounts_account_id_crdb_region_key
        spans: [/'6f781502-4936-43cc-b384-04e5cf292cc8'/'europe-west1' - /'6f781502-4936-43cc-b384-04e5cf292cc8'/'us-west1']
(13 rows)

In order for the optimizer to plan a lookup join that visits a single region, it would need to infer that the join condition USING (account_id) is actually equivalent to USING (account_id, crdb_region). This inference should be possible using the fact that account_id is the primary key of accounts as well as the fact that there is a foreign key in tweets that references accounts (account_id, crdb_region). Including crdb_region would allow the optimizer to change the lookup condition from (crdb_region IN ('europe-west1', 'us-east1', 'us-west1')) to another equality condition (crdb_region = crdb_region), and thus visit the single region that contains 'ab887bbc-1a83-4324-8998-d18ebe448fa7'. The resulting plan should look like the following:

EXPLAIN SELECT tweet_id, message
FROM accounts
JOIN tweets USING (crdb_region, account_id)
WHERE account_id = '6f781502-4936-43cc-b384-04e5cf292cc8';
                                                             info
------------------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • lookup join
  │ estimated row count: 2
  │ table: tweets@primary
  │ equality: (crdb_region, account_id) = (crdb_region,account_id)
  │ pred: account_id = '6f781502-4936-43cc-b384-04e5cf292cc8'
  │
  └── • scan
        estimated row count: 1 (100% of the table; stats collected 3 minutes ago)
        table: accounts@accounts_account_id_crdb_region_key
        spans: [/'6f781502-4936-43cc-b384-04e5cf292cc8'/'europe-west1' - /'6f781502-4936-43cc-b384-04e5cf292cc8'/'us-west1']
(13 rows)

cc @nvanbenschoten

Epic CRDB-26292

Jira issue: CRDB-9686

The text was updated successfully, but these errors were encountered:

dain · 2021-08-31T15:46:41Z

Note the explicit UNIQUE INDEX (account_id, crdb_region) on the account table should not be needed either. account_id is already unique, so account_id plus anything will be unique.

electrum · 2021-08-31T16:22:43Z

Thanks for the detailed write up. It perfectly captures how we would like to use regional tables.

One clarification is that the explicit join to accounts should not be necessary, as none of the columns are referenced by query, the account row is guaranteed to exist due to the foreign key, and it’s guaranteed to have exactly one row due to being unique. The join could be removed (either manually or automatically by the optimizer) without changing the query semantics.

rytaft · 2021-09-02T16:09:32Z

Thanks @dain and @electrum for the comments.

@dain, looks like @nvanbenschoten updated the issue description so it mentions that UNIQUE INDEX (account_id, crdb_region) should not be needed (referencing issue #64619). One thing you could do as a workaround for now, is try using our experimental feature UNIQUE WITHOUT INDEX. I think this would actually be a perfect use case for it. Basically, it allows you to declare a unique constraint without building an index. Under the hood, we enforce the constraint by adding checks (similar to foreign key checks) to every update to ensure that the update does not violate the constraint. If uniqueness is already guaranteed, however, as it is in this case, we can remove the checks. We do not have documentation for this feature because we have not found a good use case for it, but this seems like it might be perfect. To use it, you need to enable the feature with the session setting experimental_enable_unique_without_index_constraints or cluster setting sql.defaults.experimental_enable_unique_without_index_constraints.enabled. Then you can declare your unique constraint using the syntax UNIQUE WITHOUT INDEX. For example, this should allow you to specify your foreign key without an error:

SET experimental_enable_unique_without_index_constraints = true;

CREATE TABLE accounts (
    account_id  UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name        STRING NOT NULL,
    crdb_region crdb_internal_region NOT NULL,
    UNIQUE WITHOUT INDEX (account_id, crdb_region)
) LOCALITY GLOBAL;

CREATE TABLE tweets (
    account_id UUID NOT NULL,
    tweet_id   UUID DEFAULT gen_random_uuid(),
    message    STRING NOT NULL,
    crdb_region crdb_internal_region NOT NULL,
    PRIMARY KEY (crdb_region, account_id, tweet_id),
    FOREIGN KEY (account_id, crdb_region) REFERENCES accounts (account_id, crdb_region) ON DELETE CASCADE ON UPDATE CASCADE
) LOCALITY REGIONAL BY ROW;

Note that we do not plan uniqueness checks in this case since they are not necessary due to the uniqueness of account_id:

EXPLAIN INSERT INTO accounts VALUES ('6f781502-4936-43cc-b384-04e5cf292cc8', 'starburst', 'us-east1') RETURNING account_id;
                       info
---------------------------------------------------
  distribution: local
  vectorized: true

  • insert fast path
    estimated row count: 1
    into: accounts(account_id, name, crdb_region)
    auto commit
    size: 4 columns, 1 row
(8 rows)

If you decide to go down this route, I will be very interested to hear your experience.

@electrum thanks for pointing out the opportunity for join elimination in this case. There is an issue for join elimination when there is a foreign key (#47391), but it is closed. Clearly we are not handling all cases, though, so I will reopen it and try to identify what transformation rules we're missing.

electrum · 2021-09-03T00:50:10Z

@rytaft Good to know about UNIQUE WITHOUT INDEX -- it sounds perfect for this use case. Regarding the join, I think this is a different case than the linked issue, since that one was about removing a join that exists in the query. What I meant was that your example query doesn't need to have a join at all (and it would be annoying to have to explicitly add it to the query). The plan should be the same in either case.

nvanbenschoten · 2021-09-03T06:26:16Z

@electrum transparently introducing a join on a table that is not otherwise included in a query in order to make a lookup more efficient is a pretty wild idea. In some sense, it's analogous to secondary index selection on a single table, but taken to the next level. Given your prior experience with SQL engines and optimizers, do you know of any prior art in this area? Have you seen cases where systems, even OLAP systems, consider such query plans?

rytaft · 2021-09-09T15:05:28Z

Agreed, very interesting idea for an optimization! So just to confirm, @electrum, this is what the original query actually should have looked like:

SELECT tweet_id, message
FROM tweets
WHERE account_id = '6f781502-4936-43cc-b384-04e5cf292cc8';

and the idea is that the optimizer should add the join with accounts to avoid visiting multiple regions?

electrum · 2021-09-09T15:39:39Z

@nvanbenschoten I don’t have any specific references, but I remember that Oracle definitely took advantage of constraints and foreign keys in its optimizer.

@rytaft exactly!

Given the foreign key relationship, it might be helpful to think about this as a primary key lookup via scalar subquery (not correlated):

SELECT tweet_id, message
FROM tweets
WHERE (region, account_id) =
  (SELECT region, account_id
   FROM accounts
   WHERE account_id = '6f781502-4936-43cc-b384-04e5cf292cc8');

It is a single row that is guaranteed to exist in the parent table if any child rows exist, not an arbitrary join that could change the cardinality.

rytaft · 2021-09-09T16:31:22Z

Interesting! Thanks for the additional info.

msirek · 2022-08-09T20:23:08Z

Note: maybe open a 2nd issue to add a new transformation rule to add a join between a global and regional by row table.

Fixes cockroachdb#69617 When a unique constraint exists on a subset of the referenced columns in a foreign key constraint, the remaining columns in the constraint can be used to generate equijoin predicates which may enable more efficient use of an index on the lookup side of a lookup join. If the index is a multiregion index, a join predicate may be derived which could potentially eliminate reads of remote rows. Example: ``` CREATE TABLE accounts ( account_id UUID PRIMARY KEY DEFAULT gen_random_uuid(), name STRING NOT NULL, crdb_region crdb_internal_region NOT NULL, UNIQUE INDEX acct_id_crdb_region_idx (account_id, crdb_region) ) LOCALITY GLOBAL; drop table if exists tweets; CREATE TABLE tweets ( account_id UUID NOT NULL, tweet_id UUID DEFAULT gen_random_uuid(), message STRING NOT NULL, crdb_region crdb_internal_region NOT NULL, PRIMARY KEY (crdb_region, account_id, tweet_id), -- The PK of accounts is a subset of the referenced columns in -- the FK constraint. FOREIGN KEY (account_id, crdb_region) REFERENCES accounts (account_id, crdb_region) ON DELETE CASCADE ON UPDATE CASCADE ) LOCALITY REGIONAL BY ROW as crdb_region; -- Join on account_id uses the uniqueness of accounts_pkey and the FK -- constraint to derive tweets.crdb_region = accounts.crdb_region EXPLAIN SELECT * FROM tweets INNER LOOKUP JOIN accounts@acct_id_crdb_region_idx USING (account_id) WHERE account_id = '6f781502-4936-43cc-b384-04e5cf292cc8'; ------------------------------------------------------------------------- distribution: local vectorized: true • lookup join │ table: accounts@accounts_pkey │ equality: (account_id) = (account_id) │ equality cols are key │ └── • lookup join │ table: accounts@acct_id_crdb_region_idx │ equality: (account_id, crdb_region) = (account_id,crdb_region) │ equality cols are key │ pred: account_id = '6f781502-4936-43cc-b384-04e5cf292cc8' │ └── • scan missing stats table: tweets@tweets_pkey spans: [/'ca'/'6f781502-4936-43cc-b384-04e5cf292cc8' - /'ca'/'6f781502-4936-43cc-b384-04e5cf292cc8'] [/'eu'/'6f781502-4936-43cc-b384-04e5cf292cc8' - /'eu'/'6f781502-4936-43cc-b384-04e5cf292cc8'] [/'us'/'6f781502-4936-43cc-b384-04e5cf292cc8' - /'us'/'6f781502-4936-43cc-b384-04e5cf292cc8'] ``` Release note (performance improvement): This patch enables more efficient lookup joins by deriving new join constraints when equijoin predicates exist on the column(s) of a unique constraint on one table which are a proper subset of the referencing columns of a foreign key constraint on the other table. If an index exists on those FK constraint referencing columns, equijoin predicates are derived between the PK and FK columns not currently bound by ON clause predicates.

Fixes cockroachdb#69617 This commit amends the PruneJoinLeftCols and PruneJoinRightCols normalization rules to include potential derived ON clause predicates so that columns not present in the SELECT list are not pruned away from the involved Scans before predicates are derived. Derived ON clause predicate columns are excluded from the set of columns to use for equijoin selectivity estimation. Release note: None

90599: xform: derive implicit predicates from FK constraint for lookup join r=rytaft a=msirek Fixes #69617 When a unique constraint exists on a subset of the referenced columns in a foreign key constraint, the remaining columns in the constraint can be used to generate equijoin predicates which may enable more efficient use of an index on the lookup side of a lookup join. If the index is a multiregion index, a join predicate may be derived which could potentially eliminate reads of remote rows. Example: ``` CREATE TABLE accounts ( account_id UUID PRIMARY KEY DEFAULT gen_random_uuid(), name STRING NOT NULL, crdb_region crdb_internal_region NOT NULL, UNIQUE INDEX acct_id_crdb_region_idx (account_id, crdb_region) ) LOCALITY GLOBAL; drop table if exists tweets; CREATE TABLE tweets ( account_id UUID NOT NULL, tweet_id UUID DEFAULT gen_random_uuid(), message STRING NOT NULL, crdb_region crdb_internal_region NOT NULL, PRIMARY KEY (crdb_region, account_id, tweet_id), -- The PK of accounts is a subset of the referenced columns in -- the FK constraint. FOREIGN KEY (account_id, crdb_region) REFERENCES accounts (account_id, crdb_region) ON DELETE CASCADE ON UPDATE CASCADE ) LOCALITY REGIONAL BY ROW as crdb_region; -- Join on account_id uses the uniqueness of accounts_pkey and the FK -- constraint to derive tweets.crdb_region = accounts.crdb_region EXPLAIN SELECT * FROM tweets INNER LOOKUP JOIN accounts@acct_id_crdb_region_idx USING (account_id) WHERE account_id = '6f781502-4936-43cc-b384-04e5cf292cc8'; ------------------------------------------------------------------------- distribution: local vectorized: true • lookup join │ table: accounts@accounts_pkey │ equality: (account_id) = (account_id) │ equality cols are key │ └── • lookup join │ table: accounts@acct_id_crdb_region_idx │ equality: (account_id, crdb_region) = (account_id,crdb_region) │ equality cols are key │ pred: account_id = '6f781502-4936-43cc-b384-04e5cf292cc8' │ └── • scan missing stats table: tweets@tweets_pkey spans: [/'ca'/'6f781502-4936-43cc-b384-04e5cf292cc8' - /'ca'/'6f781502-4936-43cc-b384-04e5cf292cc8'] [/'eu'/'6f781502-4936-43cc-b384-04e5cf292cc8' - /'eu'/'6f781502-4936-43cc-b384-04e5cf292cc8'] [/'us'/'6f781502-4936-43cc-b384-04e5cf292cc8' - /'us'/'6f781502-4936-43cc-b384-04e5cf292cc8'] ``` Release note (performance improvement): This patch enables more efficient lookup joins by deriving new join constraints when equijoin predicates exist on the column(s) of a unique constraint on one table which are a proper subset of the referencing columns of a foreign key constraint on the other table. If an index exists on those FK constraint referencing columns, equijoin predicates are derived between the PK and FK columns not currently bound by ON clause predicates. norm: do not prune scan columns which may show up in derived join terms Fixes #69617 This commit amends the PruneJoinLeftCols and PruneJoinRightCols normalization rules to include potential derived ON clause predicates so that columns not present in the SELECT list are not pruned away from the involved Scans before predicates are derived. Derived ON clause predicate columns are excluded from the set of columns to use for equijoin selectivity estimation. Release note: None Co-authored-by: Mark Sirek <[email protected]>

rytaft added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-sql-optimizer SQL logical planning and optimizations. labels Aug 30, 2021

blathers-crl bot added the T-sql-queries SQL Queries Team label Aug 30, 2021

rytaft mentioned this issue Sep 2, 2021

FK reference requires redundant constraint if subset is unique #64619

Closed

nvanbenschoten mentioned this issue Dec 23, 2021

opt: permit foreign key constraint on full primary index of PARTITION ALL BY tables #74244

Closed

rytaft mentioned this issue Jul 6, 2022

opt: support option to error instead of visiting remote regions when querying multi-region database #83819

Closed

rytaft assigned msirek Jul 6, 2022

rytaft mentioned this issue Jul 20, 2022

opt: make the optimizer "distribution aware" #47226

Open

8 tasks

msirek mentioned this issue Oct 25, 2022

xform: derive implicit predicates from FK constraint for lookup join #90599

Merged

msirek mentioned this issue Nov 1, 2022

opt: add implicit join with global table using FK constraint to avoid scanning remote rows #91084

Open

craig bot closed this as completed in aaa6876 Nov 9, 2022

mgartner added this to SQL Queries Jul 24, 2023

mgartner moved this to Done in SQL Queries Jul 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

opt: infer lookup join equality conditions from unique key and foreign key #69617

opt: infer lookup join equality conditions from unique key and foreign key #69617

rytaft commented Aug 30, 2021 •

edited

Loading

dain commented Aug 31, 2021 •

edited

Loading

electrum commented Aug 31, 2021

rytaft commented Sep 2, 2021

electrum commented Sep 3, 2021

nvanbenschoten commented Sep 3, 2021

rytaft commented Sep 9, 2021

electrum commented Sep 9, 2021

rytaft commented Sep 9, 2021

msirek commented Aug 9, 2022

opt: infer lookup join equality conditions from unique key and foreign key #69617

opt: infer lookup join equality conditions from unique key and foreign key #69617

Comments

rytaft commented Aug 30, 2021 • edited Loading

dain commented Aug 31, 2021 • edited Loading

electrum commented Aug 31, 2021

rytaft commented Sep 2, 2021

electrum commented Sep 3, 2021

nvanbenschoten commented Sep 3, 2021

rytaft commented Sep 9, 2021

electrum commented Sep 9, 2021

rytaft commented Sep 9, 2021

msirek commented Aug 9, 2022

rytaft commented Aug 30, 2021 •

edited

Loading

dain commented Aug 31, 2021 •

edited

Loading