Support reading deletion vectors in Delta Lake #17477

Merged
merged 1 commit into from
Sep 12, 2023

Conversation

ebyhr
Member

@ebyhr ebyhr commented May 12, 2023

Description

Fixes #16903

Release notes

(x) Release notes are required, with the following suggested text:

# Delta Lake
* Support reading tables with deletion vectors. ({issue}`16903`)

@cla-bot cla-bot bot added the cla-signed label May 12, 2023
@github-actions github-actions bot added delta-lake Delta Lake connector tests:hive labels May 12, 2023
@ebyhr ebyhr self-assigned this May 14, 2023
@ebyhr ebyhr force-pushed the ebi/delta-deletion-vectors branch 3 times, most recently from 57ed69a to 104ab4c Compare May 16, 2023 04:23
@ebyhr ebyhr marked this pull request as ready for review May 16, 2023 12:29
@ebyhr ebyhr force-pushed the ebi/delta-deletion-vectors branch 2 times, most recently from 7cb1ee1 to f65a908 Compare May 16, 2023 22:16
@ebyhr ebyhr force-pushed the ebi/delta-deletion-vectors branch from f65a908 to b87932b Compare June 6, 2023 04:46
Contributor

@findinpath findinpath left a comment

LGTM

@ebyhr ebyhr force-pushed the ebi/delta-deletion-vectors branch from b87932b to 7cd6d32 Compare June 9, 2023 04:31
@ebyhr ebyhr requested a review from findepi June 9, 2023 04:32
@ebyhr ebyhr force-pushed the ebi/delta-deletion-vectors branch from 7cd6d32 to 9de92c9 Compare June 13, 2023 08:04
Member

@alexjo2144 alexjo2144 left a comment

The changes relating to row_id are a little confusing to me here. Row id is used for write operations, specifically merge, but we're looking at read-only support here so I wouldn't expect the two to interact.

int actualSize = inputStream.readInt();
if (actualSize != expectedSize) {
// TODO: Investigate why these size differ
log.warn("The size of deletion vector %s expects %s but got %s", inputFile.location(), expectedSize, actualSize);
Member

Do you need to resize the array to the real size?

Member Author

Unfortunately, resize doesn't help.

Member

I don't think we can just ignore this.
It is something to investigate. We should throw here rather than risk correctness (e.g. if we read from the wrong offset, or the wrong number of bytes).

Member Author

The cause was a misuse of TrinoDataInputStream. Switching to DataInputStream resolved the size difference.

Member

This is very interesting. Can you elaborate on that, @ebyhr?
Do you know in which situations TrinoDataInputStream should be used and where it mustn't?
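
Editorial note for readers of this thread: the discussion does not say why the two stream classes disagreed on the size. One plausible explanation, which is an assumption and not something confirmed here, is byte order: `java.io.DataInputStream.readInt()` always reads big-endian, so a reader that interprets the same four bytes as little-endian reports a different value. The sketch below uses hypothetical class and method names (it is not Trino code) to show that difference, and it fails fast on a mismatch, as requested above, instead of logging a warning.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration; the names are not taken from the Trino code base.
public class SizePrefixCheck
{
    // Reads a 4-byte size prefix as big-endian (the behavior of java.io.DataInputStream)
    // and throws when it does not match the expected size.
    public static int readAndVerifySize(byte[] data, int expectedSize)
    {
        try (DataInputStream input = new DataInputStream(new ByteArrayInputStream(data))) {
            int actualSize = input.readInt();
            if (actualSize != expectedSize) {
                throw new IllegalStateException(String.format(
                        "Deletion vector size mismatch: expected %s but got %s", expectedSize, actualSize));
            }
            return actualSize;
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Interprets the same 4 bytes as little-endian, for comparison only.
    public static int readLittleEndianInt(byte[] data)
    {
        return ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    public static void main(String[] args)
    {
        byte[] prefix = {0, 0, 0, 34}; // big-endian encoding of 34
        System.out.println("big-endian:    " + readAndVerifySize(prefix, 34));
        System.out.println("little-endian: " + readLittleEndianInt(prefix));
    }
}
```

Throwing here means a wrong offset or byte count surfaces as a query failure rather than as silently incorrect results.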

@ebyhr
Member Author

ebyhr commented Jul 4, 2023

Going to resolve conflicts.

@ebyhr ebyhr force-pushed the ebi/delta-deletion-vectors branch from 9de92c9 to 865c973 Compare July 4, 2023 07:46
@findepi
Member

findepi commented Jul 5, 2023

@ebyhr please split the base85codec and roaringbitmap parts into their own prep PRs.
This is the part I focused on first, and I want to merge it and remove it from view.

return new UUID(highBits, lowBits);
}

// This method will be used when supporting https://github.com/trinodb/trino/issues/17063
Member

( #17063 )

@ebyhr ebyhr force-pushed the ebi/delta-deletion-vectors branch from 865c973 to 0c6cff5 Compare July 12, 2023 10:21
@ebyhr
Member Author

ebyhr commented Jul 12, 2023

Addressed comments partially. Let me take another look tomorrow.

@findepi
Member

findepi commented Jul 12, 2023

> Addressed comments partially. Let me take another look tomorrow.

Are you planning on splitting this (per #17477 (comment)), or should I be reviewing this PR?

@@ -158,6 +159,11 @@ public List<DeltaLakeTransactionLogEntry> getJsonTransactionLogEntries()
return logTail.getFileEntries();
}

public Map<Long, List<DeltaLakeTransactionLogEntry>> getJsonTransactionLogVersionAndEntries()
Member

Do we rely on map ordering?

Member Author

Updated activeAddEntries to use sorted(comparingByKey()).

Member

I think we don't need to sort this, since it comes sorted. Maybe we just keep it as a list of something:
List<Transaction>
where "Transaction" has long transactionId and List<DeltaLakeTransactionLogEntry>?
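
A minimal sketch of the shape suggested above, assuming a Java record is acceptable. `LogEntry` is a hypothetical stand-in for `DeltaLakeTransactionLogEntry`, and `isAscending` is an illustrative helper that is not part of the PR; it only makes the "comes sorted" assumption checkable instead of relying on map ordering.

```java
import java.util.List;

// Hypothetical illustration; only the Transaction shape comes from the review comment.
public class TransactionListSketch
{
    // Stand-in for DeltaLakeTransactionLogEntry, so the example is self-contained.
    public record LogEntry(String description) {}

    // One transaction log version together with its entries.
    public record Transaction(long transactionId, List<LogEntry> entries)
    {
        public Transaction
        {
            entries = List.copyOf(entries); // defensive, immutable copy
        }
    }

    // Returns true when transaction ids are strictly increasing in list order.
    public static boolean isAscending(List<Transaction> transactions)
    {
        for (int i = 1; i < transactions.size(); i++) {
            if (transactions.get(i - 1).transactionId() >= transactions.get(i).transactionId()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args)
    {
        List<Transaction> transactions = List.of(
                new Transaction(0, List.of(new LogEntry("add a.parquet"))),
                new Transaction(1, List.of(new LogEntry("remove a.parquet"))));
        System.out.println(isAscending(transactions));
    }
}
```

A list keeps the order in which the log was read explicit, whereas a `Map<Long, ...>` leaves the iteration order dependent on the chosen map implementation.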

}
for (Map.Entry<Long, List<DeltaLakeTransactionLogEntry>> deltaLakeTransactionLogEntries : jsonEntries.entrySet()) {
// Deletion vector registers both 'add' & 'remove' entries in any order. The 'add' entry should be kept.
Set<String> dependOnDeletionVector = new HashSet<>();
Member

If we process removals before additions, can we remove the dependOnDeletionVector set?

Member Author

Updated to process removals first.
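
The idea in this thread can be sketched as follows. This is a simplified illustration under stated assumptions, not the PR's implementation: `Entry`, `Action`, and `activePaths` are hypothetical names, and real log entries carry much more than a path. Within one transaction, every 'remove' is applied before any 'add', so a file that is both removed and re-added (for example with a new deletion vector) stays active regardless of the order in which the two entries appear.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of "process removals first" within each transaction.
public class ActiveFilesSketch
{
    public enum Action { ADD, REMOVE }

    // Simplified log entry: an action and the data file path it refers to.
    public record Entry(Action action, String path) {}

    // Replays transactions in order and returns the paths that are still active.
    public static Set<String> activePaths(List<List<Entry>> transactions)
    {
        Set<String> active = new LinkedHashSet<>();
        for (List<Entry> transaction : transactions) {
            // First pass: apply all removals of this transaction.
            for (Entry entry : transaction) {
                if (entry.action() == Action.REMOVE) {
                    active.remove(entry.path());
                }
            }
            // Second pass: apply all additions, so an 'add' always wins over a
            // 'remove' of the same path inside the same transaction.
            for (Entry entry : transaction) {
                if (entry.action() == Action.ADD) {
                    active.add(entry.path());
                }
            }
        }
        return active;
    }

    public static void main(String[] args)
    {
        List<List<Entry>> transactions = List.of(
                List.of(new Entry(Action.ADD, "a.parquet")),
                // 'add' listed before 'remove' in the same transaction
                List.of(new Entry(Action.ADD, "a.parquet"), new Entry(Action.REMOVE, "a.parquet")));
        System.out.println(activePaths(transactions));
    }
}
```

With two passes there is no need to track which paths depend on a deletion vector, which is what made the extra set unnecessary.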

@ebyhr ebyhr force-pushed the ebi/delta-deletion-vectors branch 2 times, most recently from 41b37bb to a069fed Compare August 10, 2023 12:26
@ebyhr ebyhr force-pushed the ebi/delta-deletion-vectors branch from a069fed to 3291116 Compare August 18, 2023 06:18
@ebyhr
Member Author

ebyhr commented Aug 18, 2023

Rebased on master to resolve conflicts.

@ebyhr ebyhr force-pushed the ebi/delta-deletion-vectors branch from 3291116 to 2f200a2 Compare August 25, 2023 06:42
@ebyhr
Member Author

ebyhr commented Aug 25, 2023

Rebased on master to resolve conflicts.

@ebyhr ebyhr force-pushed the ebi/delta-deletion-vectors branch from 2f200a2 to a527e66 Compare August 25, 2023 06:53
@findinpath
Contributor

suite-delta-lake-databricks122 timed out -> #18805

@ebyhr ebyhr force-pushed the ebi/delta-deletion-vectors branch from a527e66 to 3e79fa1 Compare August 31, 2023 06:16
@ebyhr
Member Author

ebyhr commented Aug 31, 2023

Rebased on master to resolve conflicts.

@ebyhr ebyhr force-pushed the ebi/delta-deletion-vectors branch 3 times, most recently from fb7d8e5 to 6f309df Compare September 1, 2023 03:09
@findinpath
Contributor

I was curious whether the changes for dealing with the Databricks product test timeouts are effective, and stumbled over this failure:

tests               | 2023-09-02 17:21:00 INFO: FAILURE     /    io.trino.tests.product.deltalake.TestDeltaLakeDeleteCompatibility.testDeletionVectors (Groups: profile_specific_tests, delta-lake-exclude-91, delta-lake-databricks, delta-lake-exclude-104, delta-lake-exclude-113, delta-lake-oss) took 18.2 seconds
tests               | 2023-09-02 17:21:00 SEVERE: Failure cause:
tests               | java.lang.AssertionError: Expected row count to be <3>, but was <4>; rows=[[0, CREATE TABLE], [1, WRITE], [2, DELETE], [3, OPTIMIZE]]

https://github.com/trinodb/trino/actions/runs/6057878380/job/16439570813

Apparently the OPTIMIZE is done in the background by Databricks.
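
One way an assertion like this could be made tolerant of background maintenance is to filter the history before comparing it. The sketch below is an assumption about a possible approach, not what the PR or the product test actually does, and `operationsIgnoring` is a hypothetical helper name.

```java
import java.util.List;
import java.util.Set;

// Hypothetical illustration; not taken from TestDeltaLakeDeleteCompatibility.
public class HistoryFilterSketch
{
    // Returns the operations in their original order, dropping any that are in 'ignored'.
    public static List<String> operationsIgnoring(List<String> operations, Set<String> ignored)
    {
        return operations.stream()
                .filter(operation -> !ignored.contains(operation))
                .toList();
    }

    public static void main(String[] args)
    {
        // The operations reported in the failure above
        List<String> history = List.of("CREATE TABLE", "WRITE", "DELETE", "OPTIMIZE");
        System.out.println(operationsIgnoring(history, Set.of("OPTIMIZE")));
    }
}
```

Comparing the filtered list against the expected operations keeps the test focused on the commits it issued itself.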

@ebyhr ebyhr force-pushed the ebi/delta-deletion-vectors branch 2 times, most recently from 6295280 to f0f59bf Compare September 5, 2023 23:21
@ebyhr
Member Author

ebyhr commented Sep 5, 2023

@findepi @alexjo2144 Could you take another look when you have time?

@ebyhr ebyhr force-pushed the ebi/delta-deletion-vectors branch 2 times, most recently from 08567c1 to 9637cd7 Compare September 6, 2023 09:16
@findepi
Member

findepi commented Sep 12, 2023

cc @radek-starburst

@ebyhr
Member Author

ebyhr commented Sep 12, 2023

(Just squashed commits into one)

@ebyhr ebyhr merged commit eb26565 into master Sep 12, 2023
3 of 12 checks passed
@ebyhr ebyhr deleted the ebi/delta-deletion-vectors branch September 12, 2023 13:56
@github-actions github-actions bot added this to the 427 milestone Sep 12, 2023
Labels
cla-signed delta-lake Delta Lake connector
Development

Successfully merging this pull request may close these issues.

Support reading deletion vectors in Delta Lake tables
6 participants