
GH-36924: [Java] support offset/length and filter in scan option #36967

Open: zinking wants to merge 8 commits into main

Conversation

@zinking commented Aug 1, 2023

Rationale for this change

Currently the dataset scan API doesn't support specifying a file start offset and length as scan options.

What changes are included in this PR?

  • Supported filtering by start offset and length in the Parquet reader
  • Added an interface in the dataset API to specify these options

Are these changes tested?

Added a C++ test covering the range options.

Are there any user-facing changes?

  • Added an offset/length option to the dataset scan option interface

@github-actions bot commented Aug 1, 2023

⚠️ GitHub issue apache/arrow-java#177 has been automatically assigned in GitHub to PR creator.

@wgtmac (Member) left a comment


This use case makes sense to me. However, I think dataset api may not be the right place to support file splitting. IIUC, dataset may only accept options that apply to all file formats. If we add a split size or something similar to it and make splits based on that, it sounds more reasonable to me. I think you need a java wrapper around the C++ parquet reader instead of the dataset parquet reader, am I correct?

@zinking (author) commented Aug 1, 2023

@mapleFU thanks for the review. I am trying to come up with some tests for this, but I need some help running the tests in my CLion IDE; there isn't much documentation on this. Could you help me with some context and information?

I was expecting the green run-test arrows in the IDE, but they aren't there. I guess the command line also works, but there isn't enough documentation.

@mapleFU (Member) commented Aug 1, 2023

https://arrow.apache.org/docs/dev/developers/cpp/building.html#optional-components

Maybe you need to ensure ARROW_DATASET and ARROW_BUILD_TESTS are on?

/// \param[in] start_offset start offset for the scan
///
/// \return Failure if the start_offset is not greater than 0
Status StartOffset(int64_t start_offset);
Review comment from a Member:

For changes in the dataset api, I'd request @westonpace to take a look.

@wgtmac (Member) commented Aug 1, 2023

> @mapleFU thanks for the review. I am trying to come up with some tests for this, but I need some help running the tests in my CLion IDE; there isn't much documentation on this. Could you help me with some context and information?
>
> I was expecting the green run-test arrows in the IDE, but they aren't there. I guess the command line also works, but there isn't enough documentation.

You should at least enable the following options:
-DARROW_PARQUET=ON -DARROW_BUILD_TESTS=ON -DARROW_DATASET=ON

Make sure you have also set the environment variables ARROW_TEST_DATA and PARQUET_TEST_DATA to point to the arrow-testing/data and parquet-testing/data submodules respectively.
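Taken together, the flags and environment variables suggested here could be assembled into a configure-and-test sequence like the sketch below; the checkout layout and the ctest filter are assumptions for illustration, not commands taken verbatim from this thread.

```shell
# Hypothetical build-and-test sequence for the C++ dataset tests,
# combining the flags suggested above. Adjust paths to your checkout.
cmake -S arrow/cpp -B arrow/cpp/build \
      -DARROW_PARQUET=ON \
      -DARROW_DATASET=ON \
      -DARROW_BUILD_TESTS=ON
cmake --build arrow/cpp/build

# Point the tests at the data submodules (paths assume a standard
# apache/arrow checkout with submodules initialized).
export ARROW_TEST_DATA=$PWD/arrow/testing/data
export PARQUET_TEST_DATA=$PWD/arrow/cpp/submodules/parquet-testing/data

# Run only the dataset tests.
ctest --test-dir arrow/cpp/build -R dataset
```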

@davisusanibar (Contributor) commented:

> @mapleFU thanks for the review. I am trying to come up with some tests for this, but I need some help running the tests in my CLion IDE; there isn't much documentation on this. Could you help me with some context and information?
> I was expecting the green run-test arrows in the IDE, but they aren't there. I guess the command line also works, but there isn't enough documentation.

> You should at least enable the following options: -DARROW_PARQUET=ON -DARROW_BUILD_TESTS=ON -DARROW_DATASET=ON
>
> Make sure you have also set the environment variables ARROW_TEST_DATA and PARQUET_TEST_DATA to point to the arrow-testing/data and parquet-testing/data submodules respectively.

Hi @zinking, in case of testing, these steps will be helpful:

  1. Generate dependency resources for macOS/Linux with:
$ mvn generate-resources -Pgenerate-libs-jni-macos-linux -N
...
Libraries will be generated at:
$ ls -latr <absolute_path>/arrow/java-dist/x86_64/
|__ libarrow_dataset_jni.dylib
|__ libarrow_orc_jni.dylib
|__ libgandiva_jni.dylib
  2. Then, you can run the dataset module tests using these resources:
$ cd arrow/java/dataset
$ mvn clean test -Dtest="TestFileSystemDataset#testBaseJsonRead" -Darrow.cpp.build.dir=<absolute_path>/arrow/java-dist
  3. If you need to run tests from the IDE, don't forget to pass -Darrow.cpp.build.dir=<absolute_path>/arrow/java-dist

All of this is based on https://arrow.apache.org/docs/dev/developers/java/building.html

@zinking (author) commented Aug 2, 2023

@mapleFU @wgtmac @davisusanibar thanks all for the help; I've managed to add a test case.

@mapleFU (Member) commented Aug 23, 2023

@westonpace @pitrou would you mind taking a look at this interface? It splits a Parquet file scan by offset/length.

Comment on lines 141 to 147
int64_t start_offset = kDefaultStartOffset;
int64_t length;
Review comment from a Member:

  1. Can you please add docstrings?
  2. In which unit is this expressed? Number of rows? Something else?
  3. Why doesn't length have a default value?

Reply from the author:

  1. added
  2. in bytes
  3. it isn't really a default value; I was just using that value to check whether the parameter is set

@pitrou (Member) commented Aug 23, 2023

Dataset returns results unordered, so does it make sense to ask for a specific offset at all?

@zinking (author) commented Aug 24, 2023

> Dataset returns results unordered, so does it make sense to ask for a specific offset at all?

@pitrou, it's actually common in the Java/Hadoop ecosystem: when a Parquet file is big, it is split into multiple pieces, and multiple scanners read them simultaneously to increase throughput.

Within a split, whether the results are ordered doesn't matter in this case.
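The splitting scheme described here can be sketched with a small, self-contained example. `SplitPlanner` and its selection rule (a row group belongs to the split that contains its first byte, the usual Hadoop convention) are illustrative assumptions, not Arrow's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: given the byte offsets at which each row group
// starts in a Parquet file, pick the row groups a scanner assigned the byte
// range [start, start + length) should read. With this rule, readers whose
// ranges tile the file read every row group exactly once.
public class SplitPlanner {
    public static List<Integer> rowGroupsForSplit(long[] rowGroupStarts,
                                                  long start, long length) {
        List<Integer> selected = new ArrayList<>();
        long end = start + length;
        for (int i = 0; i < rowGroupStarts.length; i++) {
            long s = rowGroupStarts[i];
            // A row group is selected iff its first byte lies in the range.
            if (s >= start && s < end) {
                selected.add(i);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Four row groups starting at bytes 0, 100, 250, 400; two readers
        // split the file at byte 200 and no row group is read twice.
        long[] starts = {0, 100, 250, 400};
        System.out.println(rowGroupsForSplit(starts, 0, 200));   // [0, 1]
        System.out.println(rowGroupsForSplit(starts, 200, 300)); // [2, 3]
    }
}
```

Because each row group has exactly one first byte, non-overlapping splits that cover the whole file partition the row groups cleanly, which is why ordering within a split is irrelevant.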

@vibhatha (Collaborator) commented:

@zinking need some help with this PR?

@zinking (author) commented Dec 13, 2023

> @zinking need some help with this PR?

Long forgotten; let me pick it up and address your comments.

@@ -607,6 +607,29 @@ TEST_P(TestParquetFileFormatScan, PredicatePushdown) {
kNumRowGroups - 5);
}

TEST_P(TestParquetFileFormatScan, RangeScan) {
constexpr int64_t kNumRowGroups = 16;
constexpr int64_t kTotalNumRows = kNumRowGroups * (kNumRowGroups + 1) / 2;
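A note on the constant above: the fixture apparently assigns row group i exactly i rows (i = 1..kNumRowGroups), so the total is a triangular number. A minimal check of that arithmetic, under that assumption:

```java
// Assumed fixture shape (mirroring the existing tests this one follows):
// row group i holds i rows, so the total is the triangular number
// kNumRowGroups * (kNumRowGroups + 1) / 2.
public class RowCount {
    public static long totalRows(int numRowGroups) {
        return (long) numRowGroups * (numRowGroups + 1) / 2;
    }

    public static void main(String[] args) {
        System.out.println(totalRows(16)); // 136, matching kTotalNumRows
    }
}
```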
Review comment from a Collaborator:

any specific reason for this exact number?

Reply from the author:

Following the existing tests.

Reply from the Collaborator:

Noted.


⚠️ GitHub issue #36924 has no components, please add labels for components.

Development

Successfully merging this pull request may close these issues.

[Java] Dataset support specifying offset and length when read file
6 participants