Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

colexec: implement index join #65905

Closed
jordanlewis opened this issue May 31, 2021 · 2 comments · Fixed by #67450
Closed

colexec: implement index join #65905

jordanlewis opened this issue May 31, 2021 · 2 comments · Fixed by #67450
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-sql-queries SQL Queries Team

Comments

@jordanlewis
Copy link
Member

See #65904 for more background.

Index join is the simplest of the joinReader algorithms. We should implement it in the vectorized engine. Besides the reasons that we already know for why the vectorized model is better (batch processing, no Datum allocations, colocated memory, no expensive materializer/columnarizer pairs), span generation is one of the most expensive parts of joinReader, and it can be sped significantly up with vectorized.

Span generation requires one type switch and interface lookup per row per key column, and requires at minimum two array allocations (one start key, one end key, and any resizing) per row. Switching to vectorized would remove the type switches altogether, since we can statically plan the encoders and run them sequentially one at a time per batch. I think the allocation behavior could conceivably be improved with flat bytes, but I'm not sure.

For example, a primary key (int, date) that gets index joined would be planned by 2 span generation ops (encodeSpanIntOpt, encodeSpanDateOp) followed by a modified colBatchScan whose spans are selected from a projected column.

Perhaps the optimizer should be involved here? Should we plan index/lookup joins as a projection of the spans to lookup first, followed by some lookup op?

@jordanlewis jordanlewis added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label May 31, 2021
@jordanlewis jordanlewis added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) and removed C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. labels May 31, 2021
@rytaft
Copy link
Collaborator

rytaft commented Jun 1, 2021

cc @awoods187 @kevin-v-ngo

@awoods187
Copy link
Contributor

This seems pretty valuable and we should add it in to our work opportunistically in this release if we have the time.

@jlinder jlinder added the T-sql-queries SQL Queries Team label Jun 16, 2021
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Jul 10, 2021
This patch provides a vectorized implementation of the index join
operator. Span generation is accomplished using two utility operators.
`spanEncoder` operates on a single index key column and fills a `Bytes`
column with the encoding of that column for each row. `spanAssembler`
takes the output of each `spanEncoder` and generates spans, accounting
for table/index prefixes and possibly splitting the spans over column
families. Finally, the `ColIndexJoin` operator uses the generated spans
to perform a lookup on the table's primary index, returns all batches
resulting from the lookup, and repeats until the input is fully consumed.

The `spanAssembler` operators queue up spans until the memory allocated
for the span keys reaches a preset limit (default 4MB). This allows the
cost of starting a scan to be amortized.

Addresses cockroachdb#65905

Release note (sql change): The vectorized execution engine can now
perform a scan over an index, and then join on the primary index to
retrieve the required columns.
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Jul 10, 2021
This patch provides a vectorized implementation of the index join
operator. Span generation is accomplished using two utility operators.
`spanEncoder` operates on a single index key column and fills a `Bytes`
column with the encoding of that column for each row. `spanAssembler`
takes the output of each `spanEncoder` and generates spans, accounting
for table/index prefixes and possibly splitting the spans over column
families. Finally, the `ColIndexJoin` operator uses the generated spans
to perform a lookup on the table's primary index, returns all batches
resulting from the lookup, and repeats until the input is fully consumed.

The `spanAssembler` operators queue up spans until the memory allocated
for the span keys reaches a preset limit (default 4MB). This allows the
cost of starting a scan to be amortized.

Addresses cockroachdb#65905

Release note (sql change): The vectorized execution engine can now
perform a scan over an index, and then join on the primary index to
retrieve the required columns.
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Jul 12, 2021
This patch provides a vectorized implementation of the index join
operator. Span generation is accomplished using two utility operators.
`spanEncoder` operates on a single index key column and fills a `Bytes`
column with the encoding of that column for each row. `spanAssembler`
takes the output of each `spanEncoder` and generates spans, accounting
for table/index prefixes and possibly splitting the spans over column
families. Finally, the `ColIndexJoin` operator uses the generated spans
to perform a lookup on the table's primary index, returns all batches
resulting from the lookup, and repeats until the input is fully consumed.

The `spanAssembler` operators queue up spans until the memory allocated
for the span keys reaches a preset limit (default 4MB). This allows the
cost of starting a scan to be amortized.

Addresses cockroachdb#65905

Release note (sql change): The vectorized execution engine can now
perform a scan over an index, and then join on the primary index to
retrieve the required columns.
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Jul 12, 2021
This patch provides a vectorized implementation of the index join
operator. Span generation is accomplished using two utility operators.
`spanEncoder` operates on a single index key column and fills a `Bytes`
column with the encoding of that column for each row. `spanAssembler`
takes the output of each `spanEncoder` and generates spans, accounting
for table/index prefixes and possibly splitting the spans over column
families. Finally, the `ColIndexJoin` operator uses the generated spans
to perform a lookup on the table's primary index, returns all batches
resulting from the lookup, and repeats until the input is fully consumed.

The `spanAssembler` operators queue up spans until the memory allocated
for the span keys reaches a preset limit (default 4MB). This allows the
cost of starting a scan to be amortized.

Addresses cockroachdb#65905

Release note (sql change): The vectorized execution engine can now
perform a scan over an index, and then join on the primary index to
retrieve the required columns.
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Jul 14, 2021
This patch provides a vectorized implementation of the index join
operator. Span generation is accomplished using two utility operators.
`spanEncoder` operates on a single index key column and fills a `Bytes`
column with the encoding of that column for each row. `spanAssembler`
takes the output of each `spanEncoder` and generates spans, accounting
for table/index prefixes and possibly splitting the spans over column
families. Finally, the `ColIndexJoin` operator uses the generated spans
to perform a lookup on the table's primary index, returns all batches
resulting from the lookup, and repeats until the input is fully consumed.

The `spanAssembler` operators queue up spans until the memory allocated
for the span keys reaches a preset limit (default 4MB). This allows the
cost of starting a scan to be amortized.

Addresses cockroachdb#65905

Release note (sql change): The vectorized execution engine can now
perform a scan over an index, and then join on the primary index to
retrieve the required columns.
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Jul 15, 2021
This patch provides a vectorized implementation of the index join
operator. Span generation is accomplished using two utility operators.
`spanEncoder` operates on a single index key column and fills a `Bytes`
column with the encoding of that column for each row. `spanAssembler`
takes the output of each `spanEncoder` and generates spans, accounting
for table/index prefixes and possibly splitting the spans over column
families. Finally, the `ColIndexJoin` operator uses the generated spans
to perform a lookup on the table's primary index, returns all batches
resulting from the lookup, and repeats until the input is fully consumed.

The `spanAssembler` operators queue up spans until the memory allocated
for the span keys reaches a preset limit (default 4MB). This allows the
cost of starting a scan to be amortized.

Addresses cockroachdb#65905

Release note (sql change): The vectorized execution engine can now
perform a scan over an index, and then join on the primary index to
retrieve the required columns.
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Jul 15, 2021
This patch provides a vectorized implementation of the index join
operator. Span generation is accomplished using two utility operators.
`spanEncoder` operates on a single index key column and fills a `Bytes`
column with the encoding of that column for each row. `spanAssembler`
takes the output of each `spanEncoder` and generates spans, accounting
for table/index prefixes and possibly splitting the spans over column
families. Finally, the `ColIndexJoin` operator uses the generated spans
to perform a lookup on the table's primary index, returns all batches
resulting from the lookup, and repeats until the input is fully consumed.

The `spanAssembler` operators queue up spans until the memory allocated
for the span keys reaches a preset limit (default 4MB). This allows the
cost of starting a scan to be amortized.

Addresses cockroachdb#65905

Release note (sql change): The vectorized execution engine can now
perform a scan over an index, and then join on the primary index to
retrieve the required columns.
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Jul 15, 2021
This patch provides a vectorized implementation of the index join
operator. Span generation is accomplished using two utility operators.
`spanEncoder` operates on a single index key column and fills a `Bytes`
column with the encoding of that column for each row. `spanAssembler`
takes the output of each `spanEncoder` and generates spans, accounting
for table/index prefixes and possibly splitting the spans over column
families. Finally, the `ColIndexJoin` operator uses the generated spans
to perform a lookup on the table's primary index, returns all batches
resulting from the lookup, and repeats until the input is fully consumed.

The `spanAssembler` operators queue up spans until the memory allocated
for the span keys reaches a preset limit (default 4MB). This allows the
cost of starting a scan to be amortized.

Addresses cockroachdb#65905

Release note (sql change): The vectorized execution engine can now
perform a scan over an index, and then join on the primary index to
retrieve the required columns.
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Jul 15, 2021
This patch provides a vectorized implementation of the index join
operator. Span generation is accomplished using two utility operators.
`spanEncoder` operates on a single index key column and fills a `Bytes`
column with the encoding of that column for each row. `spanAssembler`
takes the output of each `spanEncoder` and generates spans, accounting
for table/index prefixes and possibly splitting the spans over column
families. Finally, the `ColIndexJoin` operator uses the generated spans
to perform a lookup on the table's primary index, returns all batches
resulting from the lookup, and repeats until the input is fully consumed.

The `spanAssembler` operators queue up spans until the memory allocated
for the span keys reaches a preset limit (default 4MB). This allows the
cost of starting a scan to be amortized.

Addresses cockroachdb#65905

Release note (sql change): The vectorized execution engine can now
perform a scan over an index, and then join on the primary index to
retrieve the required columns.
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Jul 15, 2021
This patch provides a vectorized implementation of the index join
operator. Span generation is accomplished using two utility operators.
`spanEncoder` operates on a single index key column and fills a `Bytes`
column with the encoding of that column for each row. `spanAssembler`
takes the output of each `spanEncoder` and generates spans, accounting
for table/index prefixes and possibly splitting the spans over column
families. Finally, the `ColIndexJoin` operator uses the generated spans
to perform a lookup on the table's primary index, returns all batches
resulting from the lookup, and repeats until the input is fully consumed.

The `spanAssembler` operators queue up spans until the memory allocated
for the span keys reaches a preset limit (default 4MB). This allows the
cost of starting a scan to be amortized.

Addresses cockroachdb#65905

Release note (sql change): The vectorized execution engine can now
perform a scan over an index, and then join on the primary index to
retrieve the required columns.
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Jul 16, 2021
This patch provides a vectorized implementation of the index join
operator. Span generation is accomplished using two utility operators.
`spanEncoder` operates on a single index key column and fills a `Bytes`
column with the encoding of that column for each row. `spanAssembler`
takes the output of each `spanEncoder` and generates spans, accounting
for table/index prefixes and possibly splitting the spans over column
families. Finally, the `ColIndexJoin` operator uses the generated spans
to perform a lookup on the table's primary index, returns all batches
resulting from the lookup, and repeats until the input is fully consumed.

The `spanAssembler` operators queue up spans until the memory allocated
for the span keys reaches a preset limit (default 4MB). This allows the
cost of starting a scan to be amortized.

Addresses cockroachdb#65905

Release note (sql change): The vectorized execution engine can now
perform a scan over an index, and then join on the primary index to
retrieve the required columns.
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Jul 17, 2021
This patch provides a vectorized implementation of the index join
operator. Span generation is accomplished using two utility operators.
`spanEncoder` operates on a single index key column and fills a `Bytes`
column with the encoding of that column for each row. `spanAssembler`
takes the output of each `spanEncoder` and generates spans, accounting
for table/index prefixes and possibly splitting the spans over column
families. Finally, the `ColIndexJoin` operator uses the generated spans
to perform a lookup on the table's primary index, returns all batches
resulting from the lookup, and repeats until the input is fully consumed.

The `spanAssembler` operators queue up spans until the memory allocated
for the span keys reaches a preset limit (default 4MB). This allows the
cost of starting a scan to be amortized.

Addresses cockroachdb#65905

Release note (sql change): The vectorized execution engine can now
perform a scan over an index, and then join on the primary index to
retrieve the required columns.
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Aug 9, 2021
This patch provides a vectorized implementation of the index join
operator. Span generation is accomplished using two utility operators.
`spanEncoder` operates on a single index key column and fills a `Bytes`
column with the encoding of that column for each row. `spanAssembler`
takes the output of each `spanEncoder` and generates spans, accounting
for table/index prefixes and possibly splitting the spans over column
families. Finally, the `ColIndexJoin` operator uses the generated spans
to perform a lookup on the table's primary index, returns all batches
resulting from the lookup, and repeats until the input is fully consumed.

The `spanAssembler` operators queue up spans until the memory allocated
for the span keys reaches a preset limit (default 4MB). This allows the
cost of starting a scan to be amortized.

Addresses cockroachdb#65905

Release note (sql change): The vectorized execution engine can now
perform a scan over an index, and then join on the primary index to
retrieve the required columns.
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Aug 10, 2021
This patch provides a vectorized implementation of the index join
operator. Span generation is accomplished using two utility operators.
`spanEncoder` operates on a single index key column and fills a `Bytes`
column with the encoding of that column for each row. `spanAssembler`
takes the output of each `spanEncoder` and generates spans, accounting
for table/index prefixes and possibly splitting the spans over column
families. Finally, the `ColIndexJoin` operator uses the generated spans
to perform a lookup on the table's primary index, returns all batches
resulting from the lookup, and repeats until the input is fully consumed.

The `spanAssembler` operators queue up spans until the memory allocated
for the span keys reaches a preset limit (default 4MB). This allows the
cost of starting a scan to be amortized.

Addresses cockroachdb#65905

Release note (sql change): The vectorized execution engine can now
perform a scan over an index, and then join on the primary index to
retrieve the required columns.
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Aug 10, 2021
This patch provides a vectorized implementation of the index join
operator. Span generation is accomplished using two utility operators.
`spanEncoder` operates on a single index key column and fills a `Bytes`
column with the encoding of that column for each row. `spanAssembler`
takes the output of each `spanEncoder` and generates spans, accounting
for table/index prefixes and possibly splitting the spans over column
families. Finally, the `ColIndexJoin` operator uses the generated spans
to perform a lookup on the table's primary index, returns all batches
resulting from the lookup, and repeats until the input is fully consumed.

The `spanAssembler` operators queue up spans until the memory allocated
for the span keys reaches a preset limit (default 4MB). This allows the
cost of starting a scan to be amortized.

Addresses cockroachdb#65905

Release note (sql change): The vectorized execution engine can now
perform a scan over an index, and then join on the primary index to
retrieve the required columns.
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Aug 10, 2021
This patch provides a vectorized implementation of the index join
operator. Span generation is accomplished using two utility operators.
`spanEncoder` operates on a single index key column and fills a `Bytes`
column with the encoding of that column for each row. `spanAssembler`
takes the output of each `spanEncoder` and generates spans, accounting
for table/index prefixes and possibly splitting the spans over column
families. Finally, the `ColIndexJoin` operator uses the generated spans
to perform a lookup on the table's primary index, returns all batches
resulting from the lookup, and repeats until the input is fully consumed.

The `spanAssembler` operators queue up spans until the memory allocated
for the span keys reaches a preset limit (default 4MB). This allows the
cost of starting a scan to be amortized.

Addresses cockroachdb#65905

Release note (sql change): The vectorized execution engine can now
perform a scan over an index, and then join on the primary index to
retrieve the required columns.
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Aug 11, 2021
This patch provides a vectorized implementation of the index join
operator. Span generation is accomplished using two utility operators.
`spanEncoder` operates on a single index key column and fills a `Bytes`
column with the encoding of that column for each row. `spanAssembler`
takes the output of each `spanEncoder` and generates spans, accounting
for table/index prefixes and possibly splitting the spans over column
families. Finally, the `ColIndexJoin` operator uses the generated spans
to perform a lookup on the table's primary index, returns all batches
resulting from the lookup, and repeats until the input is fully consumed.

The `spanAssembler` operators queue up spans until the memory allocated
for the span keys reaches a preset limit (default 4MB). This allows the
cost of starting a scan to be amortized.

Addresses cockroachdb#65905

Release note (sql change): The vectorized execution engine can now
perform a scan over an index, and then join on the primary index to
retrieve the required columns.
DrewKimball added a commit to DrewKimball/cockroach that referenced this issue Aug 11, 2021
This patch provides a vectorized implementation of the index join
operator. Span generation is accomplished using two utility operators.
`spanEncoder` operates on a single index key column and fills a `Bytes`
column with the encoding of that column for each row. `spanAssembler`
takes the output of each `spanEncoder` and generates spans, accounting
for table/index prefixes and possibly splitting the spans over column
families. Finally, the `ColIndexJoin` operator uses the generated spans
to perform a lookup on the table's primary index, returns all batches
resulting from the lookup, and repeats until the input is fully consumed.

The `ColIndexJoin` operator queues up input rows until the memory footprint
of the rows reaches a preset limit (default 4MB for parity with the row
engine). This allows the cost of starting a scan to be amortized.

Addresses cockroachdb#65905

Release note (sql change): The vectorized execution engine can now
perform a scan over an index, and then join on the primary index to
retrieve the required columns.
@craig craig bot closed this as completed in 6b2e2c4 Aug 11, 2021
@mgartner mgartner moved this to Done in SQL Queries Jul 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-sql-queries SQL Queries Team
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants