GH-36924: [Java] support offset/length and filter in scan option #36967
base: main
Conversation
This use case makes sense to me. However, I think the dataset API may not be the right place to support file splitting. IIUC, the dataset may only accept options that apply to all file formats. If we instead add a split size (or something similar) and make splits based on that, it sounds more reasonable to me. I think you need a Java wrapper around the C++ Parquet reader rather than the dataset Parquet reader, am I correct?
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java (review thread resolved)
@mapleFU thanks for the review. I am trying to come up with some tests for this, but I need some help running the tests in my CLion IDE; there isn't much documentation on this. Could you give me some context and information? I was expecting the green "run test" arrows in the IDE, but they aren't there. I guess the command line also works, but the documentation there is thin as well.
https://arrow.apache.org/docs/dev/developers/cpp/building.html#optional-components — maybe you need to ensure the relevant optional components are enabled.
/// \param[in] start_offset start offset for the scan
///
/// \return Failure if the start_offset is not greater than 0
Status StartOffset(int64_t start_offset);
For changes in the dataset API, I'd request @westonpace to take a look.
java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanOptions.java (further review threads resolved)
You may need to at least enable the following options. Make sure you have also set the required environment variables.
Hi @zinking, for testing these steps will be helpful:
All of this is based on https://arrow.apache.org/docs/dev/developers/java/building.html
@mapleFU @wgtmac @davisusanibar thanks all for the help, I've managed to add a test case.
@westonpace @pitrou would you mind taking a look at this interface? It splits a Parquet file scan by offset/length.
cpp/src/arrow/dataset/scanner.h (outdated)
int64_t start_offset = kDefaultStartOffset;
int64_t length;
- Can you please add docstrings?
- In which unit is this expressed? Number of rows? Something else?
- Why doesn't `length` have a default value?
- added
- in bytes
- it isn't really a default value; I was just using that value to check whether the parameter is set
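A minimal sketch of how the Java side could document these two values and make "length not set" explicit rather than relying on a sentinel default (the existing `ScanOptions` already uses `Optional` for projected columns, so `OptionalLong` would be consistent). The class and member names below are illustrative assumptions, not the PR's actual fields:

```java
import java.util.OptionalLong;

/** Illustrative byte-range parameters for a scan; names are hypothetical. */
public final class ByteRange {
  /** Byte offset in the file at which the scan starts; 0 means the beginning of the file. */
  private final long startOffsetBytes;

  /**
   * Number of bytes to scan from {@code startOffsetBytes}. An empty value means
   * "read to the end of the file", so no sentinel default is needed.
   */
  private final OptionalLong lengthBytes;

  public ByteRange(long startOffsetBytes, OptionalLong lengthBytes) {
    if (startOffsetBytes < 0) {
      throw new IllegalArgumentException("startOffsetBytes must be >= 0");
    }
    lengthBytes.ifPresent(len -> {
      if (len <= 0) {
        throw new IllegalArgumentException("lengthBytes must be > 0 when set");
      }
    });
    this.startOffsetBytes = startOffsetBytes;
    this.lengthBytes = lengthBytes;
  }

  public long getStartOffsetBytes() {
    return startOffsetBytes;
  }

  public OptionalLong getLengthBytes() {
    return lengthBytes;
  }
}
```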
Dataset returns results unordered, so does it make sense to ask for a specific offset at all?
@pitrou, it's actually common in the Java/Hadoop ecosystem. When a Parquet file is big, it is split into multiple pieces, and multiple scanners read them simultaneously to increase throughput. Within a split, whether the results are ordered doesn't matter in this case.
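To make the split pattern concrete, here is a rough sketch of fanning one big Parquet file out across several byte-range scans with the Java Dataset API. The four-argument `ScanOptions` constructor (offset and length in bytes) is an assumption about what this PR would expose; the surrounding calls (`FileSystemDatasetFactory`, `Dataset#newScan`, `Scanner#scanBatches`) are the existing API.

```java
import java.util.Optional;

import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
import org.apache.arrow.dataset.jni.NativeMemoryPool;
import org.apache.arrow.dataset.scanner.ScanOptions;
import org.apache.arrow.dataset.scanner.Scanner;
import org.apache.arrow.dataset.source.Dataset;
import org.apache.arrow.dataset.source.DatasetFactory;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.ipc.ArrowReader;

public class SplitScanExample {

  /** Scans one byte range [startOffset, startOffset + length) of a Parquet file. */
  static void scanSplit(String uri, long startOffset, long length) throws Exception {
    try (BufferAllocator allocator = new RootAllocator();
         DatasetFactory factory = new FileSystemDatasetFactory(
             allocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri);
         Dataset dataset = factory.finish()) {

      // Hypothetical constructor: batch size, projected columns, plus the
      // startOffset/length pair proposed in this PR (the real signature may differ).
      ScanOptions options =
          new ScanOptions(/*batchSize=*/ 32768, Optional.empty(), startOffset, length);

      try (Scanner scanner = dataset.newScan(options);
           ArrowReader reader = scanner.scanBatches()) {
        while (reader.loadNextBatch()) {
          // Only the part of the file covered by this byte range is read;
          // process reader.getVectorSchemaRoot() here.
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    String uri = "file:///tmp/big.parquet";  // placeholder path
    long fileSize = 512L * 1024 * 1024;      // assume a 512 MiB file
    long splitSize = 128L * 1024 * 1024;     // 4 splits of 128 MiB

    // In a Hadoop-style engine each split would run in its own task or thread;
    // they are run sequentially here to keep the sketch short.
    for (long offset = 0; offset < fileSize; offset += splitSize) {
      scanSplit(uri, offset, Math.min(splitSize, fileSize - offset));
    }
  }
}
```

Each worker only needs all the rows of its own byte range, so the unordered delivery of results within a split is not a problem, which matches the point above.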
java/dataset/src/main/java/org/apache/arrow/dataset/jni/JniWrapper.java (outdated; review thread resolved)
@zinking do you need some help with this PR?
Long forgotten; let me pick it up and address your comments.
@@ -607,6 +607,29 @@ TEST_P(TestParquetFileFormatScan, PredicatePushdown) {
                   kNumRowGroups - 5);
}

TEST_P(TestParquetFileFormatScan, RangeScan) {
  constexpr int64_t kNumRowGroups = 16;
  constexpr int64_t kTotalNumRows = kNumRowGroups * (kNumRowGroups + 1) / 2;
Any specific reason for this exact number?
Following the existing tests.
Noted.
Rationale for this change
Currently the dataset scan API doesn't support specifying a file start offset and length in the scan options.
What changes are included in this PR?
Are these changes tested?
Added a C++ test exercising the range options.
Are there any user-facing changes?