Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

opt: infer lookup join equality conditions from unique key and foreign key #69617

Closed
rytaft opened this issue Aug 30, 2021 · 9 comments · Fixed by #90599
Closed

opt: infer lookup join equality conditions from unique key and foreign key #69617

rytaft opened this issue Aug 30, 2021 · 9 comments · Fixed by #90599
Assignees
Labels
A-sql-optimizer SQL logical planning and optimizations. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-sql-queries SQL Queries Team

Comments

@rytaft
Copy link
Collaborator

rytaft commented Aug 30, 2021

Consider the following example, using ./cockroach demo --global --empty --nodes=9:

CREATE DATABASE db PRIMARY REGION "us-east1" REGIONS "us-west1", "europe-west1";
USE db;

CREATE TABLE accounts (
    account_id  UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name        STRING NOT NULL,
    crdb_region crdb_internal_region NOT NULL,
    UNIQUE INDEX (account_id, crdb_region) -- this shouldn't be needed, but is. See https://github.com/cockroachdb/cockroach/issues/64619.
) LOCALITY GLOBAL;

CREATE TABLE tweets (
    account_id UUID NOT NULL,
    tweet_id   UUID DEFAULT gen_random_uuid(),
    message    STRING NOT NULL,
    crdb_region crdb_internal_region NOT NULL,
    PRIMARY KEY (crdb_region, account_id, tweet_id),
    FOREIGN KEY (account_id, crdb_region) REFERENCES accounts (account_id, crdb_region) ON DELETE CASCADE ON UPDATE CASCADE
) LOCALITY REGIONAL BY ROW;

INSERT INTO accounts VALUES (DEFAULT, 'starburst', 'us-east1') RETURNING account_id;

INSERT INTO tweets VALUES (
    '6f781502-4936-43cc-b384-04e5cf292cc8',
    DEFAULT,
    'this is tweeet 1',
    (SELECT crdb_region FROM accounts WHERE account_id = '6f781502-4936-43cc-b384-04e5cf292cc8')
) RETURNING tweet_id;

INSERT INTO tweets VALUES (
    '6f781502-4936-43cc-b384-04e5cf292cc8',
    DEFAULT,
    'this is tweeet 2',
    (SELECT crdb_region FROM accounts WHERE account_id = '6f781502-4936-43cc-b384-04e5cf292cc8')
) RETURNING tweet_id;

Ideally, the following query would be able to use a lookup join and visit only a single region, but it currently uses a merge join:

EXPLAIN SELECT tweet_id, message
FROM accounts
JOIN tweets USING (account_id)
WHERE account_id = '6f781502-4936-43cc-b384-04e5cf292cc8';
                                                                                                                                                                         info
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  distribution: full
  vectorized: true

  • merge join
  │ estimated row count: 2
  │ equality: (account_id) = (account_id)
  │ left cols are key
  │
  ├── • scan
  │     estimated row count: 1 (100% of the table; stats collected 1 minute ago)
  │     table: accounts@accounts_account_id_crdb_region_key
  │     spans: [/'6f781502-4936-43cc-b384-04e5cf292cc8'/'europe-west1' - /'6f781502-4936-43cc-b384-04e5cf292cc8'/'us-west1']
  │
  └── • scan
        estimated row count: 2 (100% of the table; stats collected 1 minute ago)
        table: tweets@primary
        spans: [/'europe-west1'/'6f781502-4936-43cc-b384-04e5cf292cc8' - /'europe-west1'/'6f781502-4936-43cc-b384-04e5cf292cc8'] [/'us-east1'/'6f781502-4936-43cc-b384-04e5cf292cc8' - /'us-east1'/'6f781502-4936-43cc-b384-04e5cf292cc8'] [/'us-west1'/'6f781502-4936-43cc-b384-04e5cf292cc8' - /'us-west1'/'6f781502-4936-43cc-b384-04e5cf292cc8']
(17 rows)

We can force a lookup join, but it still requires visiting all regions:

EXPLAIN SELECT tweet_id, message
FROM accounts
INNER LOOKUP JOIN tweets USING (account_id)
WHERE account_id = '6f781502-4936-43cc-b384-04e5cf292cc8';
                                                             info
------------------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • lookup join
  │ estimated row count: 2
  │ table: tweets@primary
  │ lookup condition: (account_id = account_id) AND (crdb_region IN ('europe-west1', 'us-east1', 'us-west1'))
  │ pred: account_id = '6f781502-4936-43cc-b384-04e5cf292cc8'
  │
  └── • scan
        estimated row count: 1 (100% of the table; stats collected 2 minutes ago)
        table: accounts@accounts_account_id_crdb_region_key
        spans: [/'6f781502-4936-43cc-b384-04e5cf292cc8'/'europe-west1' - /'6f781502-4936-43cc-b384-04e5cf292cc8'/'us-west1']
(13 rows)

In order for the optimizer to plan a lookup join that visits a single region, it would need to infer that the join condition USING (account_id) is actually equivalent to USING (account_id, crdb_region). This inference should be possible using the fact that account_id is the primary key of accounts as well as the fact that there is a foreign key in tweets that references accounts (account_id, crdb_region). Including crdb_region would allow the optimizer to change the lookup condition from (crdb_region IN ('europe-west1', 'us-east1', 'us-west1')) to another equality condition (crdb_region = crdb_region), and thus visit the single region that contains 'ab887bbc-1a83-4324-8998-d18ebe448fa7'. The resulting plan should look like the following:

EXPLAIN SELECT tweet_id, message
FROM accounts
JOIN tweets USING (crdb_region, account_id)
WHERE account_id = '6f781502-4936-43cc-b384-04e5cf292cc8';
                                                             info
------------------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • lookup join
  │ estimated row count: 2
  │ table: tweets@primary
  │ equality: (crdb_region, account_id) = (crdb_region,account_id)
  │ pred: account_id = '6f781502-4936-43cc-b384-04e5cf292cc8'
  │
  └── • scan
        estimated row count: 1 (100% of the table; stats collected 3 minutes ago)
        table: accounts@accounts_account_id_crdb_region_key
        spans: [/'6f781502-4936-43cc-b384-04e5cf292cc8'/'europe-west1' - /'6f781502-4936-43cc-b384-04e5cf292cc8'/'us-west1']
(13 rows)

cc @nvanbenschoten

Epic CRDB-26292

Jira issue: CRDB-9686

@rytaft rytaft added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-sql-optimizer SQL logical planning and optimizations. labels Aug 30, 2021
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Aug 30, 2021
@dain
Copy link

dain commented Aug 31, 2021

Note the explicit UNIQUE INDEX (account_id, crdb_region) on the account table should not be needed either. account_id is already unique, so account_id plus anything will be unique.

@electrum
Copy link

Thanks for the detailed write up. It perfectly captures how we would like to use regional tables.

One clarification is that the explicit join to accounts should not be necessary, as none of the columns are referenced by query, the account row is guaranteed to exist due to the foreign key, and it’s guaranteed to have exactly one row due to being unique. The join could be removed (either manually or automatically by the optimizer) without changing the query semantics.

@rytaft
Copy link
Collaborator Author

rytaft commented Sep 2, 2021

Thanks @dain and @electrum for the comments.

@dain, looks like @nvanbenschoten updated the issue description so it mentions that UNIQUE INDEX (account_id, crdb_region) should not be needed (referencing issue #64619). One thing you could do as a workaround for now, is try using our experimental feature UNIQUE WITHOUT INDEX. I think this would actually be a perfect use case for it. Basically, it allows you to declare a unique constraint without building an index. Under the hood, we enforce the constraint by adding checks (similar to foreign key checks) to every update to ensure that the update does not violate the constraint. If uniqueness is already guaranteed, however, as it is in this case, we can remove the checks. We do not have documentation for this feature because we have not found a good use case for it, but this seems like it might be perfect. To use it, you need to enable the feature with the session setting experimental_enable_unique_without_index_constraints or cluster setting sql.defaults.experimental_enable_unique_without_index_constraints.enabled. Then you can declare your unique constraint using the syntax UNIQUE WITHOUT INDEX. For example, this should allow you to specify your foreign key without an error:

SET experimental_enable_unique_without_index_constraints = true;

CREATE TABLE accounts (
    account_id  UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name        STRING NOT NULL,
    crdb_region crdb_internal_region NOT NULL,
    UNIQUE WITHOUT INDEX (account_id, crdb_region)
) LOCALITY GLOBAL;

CREATE TABLE tweets (
    account_id UUID NOT NULL,
    tweet_id   UUID DEFAULT gen_random_uuid(),
    message    STRING NOT NULL,
    crdb_region crdb_internal_region NOT NULL,
    PRIMARY KEY (crdb_region, account_id, tweet_id),
    FOREIGN KEY (account_id, crdb_region) REFERENCES accounts (account_id, crdb_region) ON DELETE CASCADE ON UPDATE CASCADE
) LOCALITY REGIONAL BY ROW;

Note that we do not plan uniqueness checks in this case since they are not necessary due to the uniqueness of account_id:

EXPLAIN INSERT INTO accounts VALUES ('6f781502-4936-43cc-b384-04e5cf292cc8', 'starburst', 'us-east1') RETURNING account_id;
                       info
---------------------------------------------------
  distribution: local
  vectorized: true

  • insert fast path
    estimated row count: 1
    into: accounts(account_id, name, crdb_region)
    auto commit
    size: 4 columns, 1 row
(8 rows)

If you decide to go down this route, I will be very interested to hear your experience.

@electrum thanks for pointing out the opportunity for join elimination in this case. There is an issue for join elimination when there is a foreign key (#47391), but it is closed. Clearly we are not handling all cases, though, so I will reopen it and try to identify what transformation rules we're missing.

@electrum
Copy link

electrum commented Sep 3, 2021

@rytaft Good to know about UNIQUE WITHOUT INDEX -- it sounds perfect for this use case. Regarding the join, I think this is a different case than the linked issue, since that one was about removing a join that exists in the query. What I meant was that your example query doesn't need to have a join at all (and it would be annoying to have to explicitly add it to the query). The plan should be the same in either case.

@nvanbenschoten
Copy link
Member

@electrum transparently introducing a join on a table that is not otherwise included in a query in order to make a lookup more efficient is a pretty wild idea. In some sense, it's analogous to secondary index selection on a single table, but taken to the next level. Given your prior experience with SQL engines and optimizers, do you know of any prior art in this area? Have you seen cases where systems, even OLAP systems, consider such query plans?

@rytaft
Copy link
Collaborator Author

rytaft commented Sep 9, 2021

Agreed, very interesting idea for an optimization! So just to confirm, @electrum, this is what the original query actually should have looked like:

SELECT tweet_id, message
FROM tweets
WHERE account_id = '6f781502-4936-43cc-b384-04e5cf292cc8';

and the idea is that the optimizer should add the join with accounts to avoid visiting multiple regions?

@electrum
Copy link

electrum commented Sep 9, 2021

@nvanbenschoten I don’t have any specific references, but I remember that Oracle definitely took advantage of constraints and foreign keys in its optimizer.

@rytaft exactly!

Given the foreign key relationship, it might be helpful to think about this as a primary key lookup via scalar subquery (not correlated):

SELECT tweet_id, message
FROM tweets
WHERE (region, account_id) =
  (SELECT region, account_id
   FROM accounts
   WHERE account_id = '6f781502-4936-43cc-b384-04e5cf292cc8');

It is a single row that is guaranteed to exist in the parent table if any child rows exist, not an arbitrary join that could change the cardinality.

@rytaft
Copy link
Collaborator Author

rytaft commented Sep 9, 2021

Interesting! Thanks for the additional info.

@msirek
Copy link
Contributor

msirek commented Aug 9, 2022

Note: maybe open a 2nd issue to add a new transformation rule to add a join between a global and regional by row table.

msirek pushed a commit to msirek/cockroach that referenced this issue Oct 25, 2022
Fixes cockroachdb#69617

When a unique constraint exists on a subset of the referenced columns in
a foreign key constraint, the remaining columns in the constraint can be
used to generate equijoin predicates which may enable more efficient use
of an index on the lookup side of a lookup join. If the index is a
multiregion index, a join predicate may be derived which could
potentially eliminate reads of remote rows.

Example:
```
CREATE TABLE accounts (
    account_id  UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name        STRING NOT NULL,
    crdb_region crdb_internal_region NOT NULL,
    UNIQUE INDEX acct_id_crdb_region_idx (account_id, crdb_region)
) LOCALITY GLOBAL;

drop table if exists tweets;
CREATE TABLE tweets (
    account_id UUID NOT NULL,
    tweet_id   UUID DEFAULT gen_random_uuid(),
    message    STRING NOT NULL,
    crdb_region crdb_internal_region NOT NULL,
    PRIMARY KEY (crdb_region, account_id, tweet_id),
    -- The PK of accounts is a subset of the referenced columns in
    -- the FK constraint.
    FOREIGN KEY (account_id, crdb_region)
    REFERENCES accounts (account_id, crdb_region)
    ON DELETE CASCADE ON UPDATE CASCADE
) LOCALITY REGIONAL BY ROW as crdb_region;

-- Join on account_id uses the uniqueness of accounts_pkey and the FK
-- constraint to derive tweets.crdb_region = accounts.crdb_region
EXPLAIN SELECT *
    FROM tweets
    INNER LOOKUP JOIN accounts@acct_id_crdb_region_idx USING (account_id)
WHERE account_id = '6f781502-4936-43cc-b384-04e5cf292cc8';
-------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • lookup join
  │ table: accounts@accounts_pkey
  │ equality: (account_id) = (account_id)
  │ equality cols are key
  │
  └── • lookup join
      │ table: accounts@acct_id_crdb_region_idx
      │ equality: (account_id, crdb_region) = (account_id,crdb_region)
      │ equality cols are key
      │ pred: account_id = '6f781502-4936-43cc-b384-04e5cf292cc8'
      │
      └── • scan
            missing stats
            table: tweets@tweets_pkey
            spans: [/'ca'/'6f781502-4936-43cc-b384-04e5cf292cc8' - /'ca'/'6f781502-4936-43cc-b384-04e5cf292cc8'] [/'eu'/'6f781502-4936-43cc-b384-04e5cf292cc8' - /'eu'/'6f781502-4936-43cc-b384-04e5cf292cc8'] [/'us'/'6f781502-4936-43cc-b384-04e5cf292cc8' - /'us'/'6f781502-4936-43cc-b384-04e5cf292cc8']
```

Release note (performance improvement): This patch enables more
efficient lookup joins by deriving new join constraints when
equijoin predicates exist on the column(s) of a unique constraint on one
table which are a proper subset of the referencing columns of a foreign
key constraint on the other table. If an index exists on those FK
constraint referencing columns, equijoin predicates are derived between
the PK and FK columns not currently bound by ON clause predicates.
msirek pushed a commit to msirek/cockroach that referenced this issue Oct 25, 2022
Fixes cockroachdb#69617

This commit amends the PruneJoinLeftCols and PruneJoinRightCols
normalization rules to include potential derived ON clause predicates so
that columns not present in the SELECT list are not pruned away from the
involved Scans before predicates are derived. Derived ON clause
predicate columns are excluded from the set of columns to use for
equijoin selectivity estimation.

Release note: None
msirek pushed a commit to msirek/cockroach that referenced this issue Nov 5, 2022
Fixes cockroachdb#69617

This commit amends the PruneJoinLeftCols and PruneJoinRightCols
normalization rules to include potential derived ON clause predicates so
that columns not present in the SELECT list are not pruned away from the
involved Scans before predicates are derived. Derived ON clause
predicate columns are excluded from the set of columns to use for
equijoin selectivity estimation.

Release note: None
msirek pushed a commit to msirek/cockroach that referenced this issue Nov 8, 2022
Fixes cockroachdb#69617

This commit amends the PruneJoinLeftCols and PruneJoinRightCols
normalization rules to include potential derived ON clause predicates so
that columns not present in the SELECT list are not pruned away from the
involved Scans before predicates are derived. Derived ON clause
predicate columns are excluded from the set of columns to use for
equijoin selectivity estimation.

Release note: None
msirek pushed a commit to msirek/cockroach that referenced this issue Nov 8, 2022
Fixes cockroachdb#69617

This commit amends the PruneJoinLeftCols and PruneJoinRightCols
normalization rules to include potential derived ON clause predicates so
that columns not present in the SELECT list are not pruned away from the
involved Scans before predicates are derived. Derived ON clause
predicate columns are excluded from the set of columns to use for
equijoin selectivity estimation.

Release note: None
craig bot pushed a commit that referenced this issue Nov 8, 2022
90599: xform: derive implicit predicates from FK constraint for lookup join r=rytaft a=msirek

Fixes #69617

When a unique constraint exists on a subset of the referenced columns in
a foreign key constraint, the remaining columns in the constraint can be
used to generate equijoin predicates which may enable more efficient use
of an index on the lookup side of a lookup join. If the index is a
multiregion index, a join predicate may be derived which could
potentially eliminate reads of remote rows.

Example:
```
CREATE TABLE accounts (
    account_id  UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name        STRING NOT NULL,
    crdb_region crdb_internal_region NOT NULL,
    UNIQUE INDEX acct_id_crdb_region_idx (account_id, crdb_region)
) LOCALITY GLOBAL;

drop table if exists tweets;
CREATE TABLE tweets (
    account_id UUID NOT NULL,
    tweet_id   UUID DEFAULT gen_random_uuid(),
    message    STRING NOT NULL,
    crdb_region crdb_internal_region NOT NULL,
    PRIMARY KEY (crdb_region, account_id, tweet_id),
    -- The PK of accounts is a subset of the referenced columns in
    -- the FK constraint.
    FOREIGN KEY (account_id, crdb_region)
    REFERENCES accounts (account_id, crdb_region)
    ON DELETE CASCADE ON UPDATE CASCADE
) LOCALITY REGIONAL BY ROW as crdb_region;

-- Join on account_id uses the uniqueness of accounts_pkey and the FK
-- constraint to derive tweets.crdb_region = accounts.crdb_region
EXPLAIN SELECT *
    FROM tweets
    INNER LOOKUP JOIN accounts@acct_id_crdb_region_idx USING (account_id)
WHERE account_id = '6f781502-4936-43cc-b384-04e5cf292cc8';
-------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • lookup join
  │ table: accounts@accounts_pkey
  │ equality: (account_id) = (account_id)
  │ equality cols are key
  │
  └── • lookup join
      │ table: accounts@acct_id_crdb_region_idx
      │ equality: (account_id, crdb_region) = (account_id,crdb_region)
      │ equality cols are key
      │ pred: account_id = '6f781502-4936-43cc-b384-04e5cf292cc8'
      │
      └── • scan
            missing stats
            table: tweets@tweets_pkey
            spans: [/'ca'/'6f781502-4936-43cc-b384-04e5cf292cc8' - /'ca'/'6f781502-4936-43cc-b384-04e5cf292cc8'] [/'eu'/'6f781502-4936-43cc-b384-04e5cf292cc8' - /'eu'/'6f781502-4936-43cc-b384-04e5cf292cc8'] [/'us'/'6f781502-4936-43cc-b384-04e5cf292cc8' - /'us'/'6f781502-4936-43cc-b384-04e5cf292cc8']
```

Release note (performance improvement): This patch enables more
efficient lookup joins by deriving new join constraints when
equijoin predicates exist on the column(s) of a unique constraint on one
table which are a proper subset of the referencing columns of a foreign
key constraint on the other table. If an index exists on those FK
constraint referencing columns, equijoin predicates are derived between
the PK and FK columns not currently bound by ON clause predicates.

norm: do not prune scan columns which may show up in derived join terms

Fixes #69617

This commit amends the PruneJoinLeftCols and PruneJoinRightCols
normalization rules to include potential derived ON clause predicates so
that columns not present in the SELECT list are not pruned away from the
involved Scans before predicates are derived. Derived ON clause
predicate columns are excluded from the set of columns to use for
equijoin selectivity estimation.

Release note: None

Co-authored-by: Mark Sirek <[email protected]>
@craig craig bot closed this as completed in aaa6876 Nov 9, 2022
@mgartner mgartner moved this to Done in SQL Queries Jul 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-optimizer SQL logical planning and optimizations. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-sql-queries SQL Queries Team
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

5 participants