-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support reading deletion vectors in Delta Lake #17477
Conversation
57ed69a
to
104ab4c
Compare
...o-delta-lake/src/main/java/io/trino/plugin/deltalake/transactionlog/DeletionVectorEntry.java
Outdated
Show resolved
Hide resolved
7cb1ee1
to
f65a908
Compare
...duct-tests/src/main/java/io/trino/tests/product/deltalake/TestDeltaLakeDatabricksDelete.java
Outdated
Show resolved
Hide resolved
f65a908
to
b87932b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/util/DeletionVectors.java
Outdated
Show resolved
Hide resolved
...in/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/transactionlog/AddFileEntry.java
Show resolved
Hide resolved
b87932b
to
7cd6d32
Compare
plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/delete/DeletionVectors.java
Outdated
Show resolved
Hide resolved
...duct-tests/src/main/java/io/trino/tests/product/deltalake/TestDeltaLakeDatabricksDelete.java
Outdated
Show resolved
Hide resolved
7cd6d32
to
9de92c9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes relating to row_id
are a little confusing to me here. Row id is used for write operations, specifically merge, but we're looking at read-only support here so I wouldn't expect the two to interact.
int actualSize = inputStream.readInt(); | ||
if (actualSize != expectedSize) { | ||
// TODO: Investigate why these size differ | ||
log.warn("The size of deletion vector %s expects %s but got %s", inputFile.location(), expectedSize, actualSize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need to resize the array to the real size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, resize doesn't help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't think we can just ignore this.
it something to investigate. we should rather throw here, than risk correctness (if eg we read from wrong offset, or wrong number of bytes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cause was a misuse of TrinoDataInputStream
. Switching to DataInputStream
resolved the size difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is very interesting. Can you elaborate on that @ebyhr?
do you know what are the situations where TrinoDataInputStream
should be used and where it mustn't?
plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/delete/DeletionVectors.java
Outdated
Show resolved
Hide resolved
Going to resolve confilcts. |
9de92c9
to
865c973
Compare
...c/main/java/io/trino/plugin/deltalake/transactionlog/checkpoint/CheckpointEntryIterator.java
Outdated
Show resolved
Hide resolved
...c/main/java/io/trino/plugin/deltalake/transactionlog/checkpoint/CheckpointEntryIterator.java
Outdated
Show resolved
Hide resolved
...c/main/java/io/trino/plugin/deltalake/transactionlog/checkpoint/CheckpointEntryIterator.java
Outdated
Show resolved
Hide resolved
plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeSplit.java
Outdated
Show resolved
Hide resolved
plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/delete/DeletionVectors.java
Outdated
Show resolved
Hide resolved
plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/delete/DeletionVectors.java
Outdated
Show resolved
Hide resolved
...in/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakePageSourceProvider.java
Outdated
Show resolved
Hide resolved
...in/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakePageSourceProvider.java
Outdated
Show resolved
Hide resolved
...in/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakePageSourceProvider.java
Outdated
Show resolved
Hide resolved
plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/delete/DeletionVectors.java
Outdated
Show resolved
Hide resolved
@ebyhr please split base85codec and roaringbitmap stuff to own prep PRs |
return new UUID(highBits, lowBits); | ||
} | ||
|
||
// This method will be used when supporting https://github.com/trinodb/trino/issues/17063 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
( #17063 )
865c973
to
0c6cff5
Compare
Addressed comments partially. Let me take another look tomorrow. |
are you planning on splitting this (per #17477 (comment)), or should i be reviewing this PR? |
@@ -158,6 +159,11 @@ public List<DeltaLakeTransactionLogEntry> getJsonTransactionLogEntries() | |||
return logTail.getFileEntries(); | |||
} | |||
|
|||
public Map<Long, List<DeltaLakeTransactionLogEntry>> getJsonTransactionLogVersionAndEntries() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we rely on map ordering?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated activeAddEntries
to use sorted(comparingByKey())
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we don't need to sort this, since it comes sorted. maybe we just keep it as a list of something
List<Transaction>
where "Transaction" has long transactionId
and List<DeltaLakeTransactionLogEntry>
?
} | ||
for (Map.Entry<Long, List<DeltaLakeTransactionLogEntry>> deltaLakeTransactionLogEntries : jsonEntries.entrySet()) { | ||
// Deletion vector registers both 'add' & 'remove' entries in any order. The 'add' entry should be kept. | ||
Set<String> dependOnDeletionVector = new HashSet<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we process removals before additions, can we remove dependOnDeletionVector
set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to process removals first.
41b37bb
to
a069fed
Compare
a069fed
to
3291116
Compare
Rebased on master to resolve conflicts. |
3291116
to
2f200a2
Compare
Rebased on master to resolve conflicts. |
2f200a2
to
a527e66
Compare
|
a527e66
to
3e79fa1
Compare
Rebased on master to resolve conflicts. |
fb7d8e5
to
6f309df
Compare
I was curious whether the changes for dealing with on the Databricks product tests timeouts are effective and stumbled over this failure:
https://github.com/trinodb/trino/actions/runs/6057878380/job/16439570813 Apparently the |
6295280
to
f0f59bf
Compare
@findepi @alexjo2144 Could you take another look when you have time? |
08567c1
to
9637cd7
Compare
cc @radek-starburst |
9637cd7
to
0693b94
Compare
(Just squashed commits into one) |
Description
Fixes #16903
Release notes
(x) Release notes are required, with the following suggested text: