
GH-36924: [Java] support offset/length and filter in scan option #36967

Open: zinking wants to merge 8 commits into main

Conversation

@zinking commented Aug 1, 2023

Rationale for this change

Currently the dataset scan API doesn't support specifying a file start offset and length as scan options.

What changes are included in this PR?

  • Supported filtering by start offset and length in the Parquet reader
  • Added an interface in the dataset API to specify these options

Are these changes tested?

Added a C++ test covering the range options.

Are there any user-facing changes?

  • Added an offset/length option to the dataset scan option interface

@github-actions bot commented Aug 1, 2023

⚠️ GitHub issue apache/arrow-java#177 has been automatically assigned in GitHub to PR creator.

@wgtmac (Member) left a comment


This use case makes sense to me. However, I think dataset api may not be the right place to support file splitting. IIUC, dataset may only accept options that apply to all file formats. If we add a split size or something similar to it and make splits based on that, it sounds more reasonable to me. I think you need a java wrapper around the C++ parquet reader instead of the dataset parquet reader, am I correct?

@zinking (author) commented Aug 1, 2023

@mapleFU thanks for the review. I am trying to come up with some tests for this, but I need some help running the tests in my CLion IDE; there isn't much documentation on this. Could you help me with some context and information?

I was expecting the green run-test arrows in the IDE, but they aren't there. I guess the command line also works, but there isn't enough documentation.

@mapleFU (Member) commented Aug 1, 2023

https://arrow.apache.org/docs/dev/developers/cpp/building.html#optional-components

Maybe you need to ensure ARROW_DATASET and ARROW_BUILD_TESTS are on?

/// \param[in] start_offset start offset for the scan
///
/// \return Failure if the start_offset is not greater than 0
Status StartOffset(int64_t start_offset);
Review comment from a Member:

For changes in the dataset api, I'd request @westonpace to take a look.

@wgtmac (Member) commented Aug 1, 2023

> @mapleFU thanks for the review. I am trying to come up with some tests for this, but I need some help running the tests in my CLion IDE; there isn't much documentation on this. Could you help me with some context and information?
>
> I was expecting the green run-test arrows in the IDE, but they aren't there. I guess the command line also works, but there isn't enough documentation.

You should at least enable the following options:
-DARROW_PARQUET=ON -DARROW_BUILD_TESTS=ON -DARROW_DATASET=ON

Make sure you have also set the environment variables ARROW_TEST_DATA and PARQUET_TEST_DATA to point to the arrow-testing/data and parquet-testing/data submodules respectively.
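Taken together, the flags and environment variables suggested here could be assembled into a configure-and-test sequence like the sketch below; the checkout layout and the ctest filter are assumptions for illustration, not commands taken verbatim from this thread.

```shell
# Hypothetical build-and-test sequence for the C++ dataset tests,
# combining the flags suggested above. Adjust paths to your checkout.
cmake -S arrow/cpp -B arrow/cpp/build \
      -DARROW_PARQUET=ON \
      -DARROW_DATASET=ON \
      -DARROW_BUILD_TESTS=ON
cmake --build arrow/cpp/build

# Point the tests at the data submodules (paths assume a standard
# apache/arrow checkout with submodules initialized).
export ARROW_TEST_DATA=$PWD/arrow/testing/data
export PARQUET_TEST_DATA=$PWD/arrow/cpp/submodules/parquet-testing/data

# Run only the dataset tests.
ctest --test-dir arrow/cpp/build -R dataset
```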

@davisusanibar (Contributor) commented:

> @mapleFU thanks for the review. I am trying to come up with some tests for this, but I need some help running the tests in my CLion IDE; there isn't much documentation on this. Could you help me with some context and information?
> I was expecting the green run-test arrows in the IDE, but they aren't there. I guess the command line also works, but there isn't enough documentation.

> You should at least enable the following options: -DARROW_PARQUET=ON -DARROW_BUILD_TESTS=ON -DARROW_DATASET=ON
>
> Make sure you have also set the environment variables ARROW_TEST_DATA and PARQUET_TEST_DATA to point to the arrow-testing/data and parquet-testing/data submodules respectively.

Hi @zinking, in case of testing, these steps will be helpful:

  1. Generate dependency resources for macOS/Linux with:
$ mvn generate-resources -Pgenerate-libs-jni-macos-linux -N
...
Libraries will be generated at:
$ ls -latr <absolute_path>/arrow/java-dist/x86_64/
|__ libarrow_dataset_jni.dylib
|__ libarrow_orc_jni.dylib
|__ libgandiva_jni.dylib
  2. Then, you can run the dataset module tests using these resources:
$ cd arrow/java/dataset
$ mvn clean test -Dtest="TestFileSystemDataset#testBaseJsonRead" -Darrow.cpp.build.dir=<absolute_path>/arrow/java-dist
  3. If you need to run tests from the IDE, don't forget to pass -Darrow.cpp.build.dir=<absolute_path>/arrow/java-dist

All of this is based on https://arrow.apache.org/docs/dev/developers/java/building.html

@zinking (author) commented Aug 2, 2023

@mapleFU @wgtmac @davisusanibar thanks all for the help; I've managed to add a test case.

@mapleFU (Member) commented Aug 23, 2023

@westonpace @pitrou would you mind taking a look at this interface? It splits a Parquet file scan by offset/length.

Comment on lines 141 to 147
int64_t start_offset = kDefaultStartOffset;
int64_t length;
Review comment from a Member:

  1. Can you please add docstrings?
  2. In which unit is this expressed? Number of rows? Something else?
  3. Why doesn't length have a default value?

Reply from the author:

  1. added
  2. in bytes
  3. it isn't really a default value; I was just using that value to check whether the parameter is set

@pitrou (Member) commented Aug 23, 2023

Dataset returns results unordered, so does it make sense to ask for a specific offset at all?

@zinking (author) commented Aug 24, 2023

> Dataset returns results unordered, so does it make sense to ask for a specific offset at all?

@pitrou, it's actually common in the Java/Hadoop ecosystem: when a Parquet file is big, it is split into multiple pieces, and multiple scanners read them simultaneously to increase throughput.

Within a split, whether the results are ordered doesn't matter in this case.
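The splitting scheme described here can be sketched with a small, self-contained example. `SplitPlanner` and its selection rule (a row group belongs to the split that contains its first byte, the usual Hadoop convention) are illustrative assumptions, not Arrow's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: given the byte offsets at which each row group
// starts in a Parquet file, pick the row groups a scanner assigned the byte
// range [start, start + length) should read. With this rule, readers whose
// ranges tile the file read every row group exactly once.
public class SplitPlanner {
    public static List<Integer> rowGroupsForSplit(long[] rowGroupStarts,
                                                  long start, long length) {
        List<Integer> selected = new ArrayList<>();
        long end = start + length;
        for (int i = 0; i < rowGroupStarts.length; i++) {
            long s = rowGroupStarts[i];
            // A row group is selected iff its first byte lies in the range.
            if (s >= start && s < end) {
                selected.add(i);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Four row groups starting at bytes 0, 100, 250, 400; two readers
        // split the file at byte 200 and no row group is read twice.
        long[] starts = {0, 100, 250, 400};
        System.out.println(rowGroupsForSplit(starts, 0, 200));   // [0, 1]
        System.out.println(rowGroupsForSplit(starts, 200, 300)); // [2, 3]
    }
}
```

Because each row group has exactly one first byte, non-overlapping splits that cover the whole file partition the row groups cleanly, which is why ordering within a split is irrelevant.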

@vibhatha (Collaborator) commented:

@zinking need some help with this PR?

@zinking (author) commented Dec 13, 2023

> @zinking need some help with this PR?

Long forgotten; let me pick it up and address your comments.

@@ -607,6 +607,29 @@ TEST_P(TestParquetFileFormatScan, PredicatePushdown) {
kNumRowGroups - 5);
}

TEST_P(TestParquetFileFormatScan, RangeScan) {
constexpr int64_t kNumRowGroups = 16;
constexpr int64_t kTotalNumRows = kNumRowGroups * (kNumRowGroups + 1) / 2;
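A note on the constant above: the fixture apparently assigns row group i exactly i rows (i = 1..kNumRowGroups), so the total is a triangular number. A minimal check of that arithmetic, under that assumption:

```java
// Assumed fixture shape (mirroring the existing tests this one follows):
// row group i holds i rows, so the total is the triangular number
// kNumRowGroups * (kNumRowGroups + 1) / 2.
public class RowCount {
    public static long totalRows(int numRowGroups) {
        return (long) numRowGroups * (numRowGroups + 1) / 2;
    }

    public static void main(String[] args) {
        System.out.println(totalRows(16)); // 136, matching kTotalNumRows
    }
}
```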
Review comment from a Collaborator:

any specific reason for this exact number?

Reply from the author:

Following the existing tests.

Reply from the Collaborator:

Noted.


⚠️ GitHub issue #36924 has no components, please add labels for components.

Development

Successfully merging this pull request may close these issues.

[Java] Dataset support specifying offset and length when read file
6 participants