Read "arrow" (IPC and streaming) files using org.apache.arrow.dataset.jni.NativeDatasetFactory in Java API #13760

igor-suhorukov · 2022-07-31T21:55:09Z

How can I read "arrow" (IPC and streaming) files using org.apache.arrow.dataset.jni.NativeDatasetFactory in the Java API? This topic is missing from the documentation. It is possible in the C++ and Python APIs.

lwhite1 · 2022-08-02T19:47:44Z

Hi @igor-suhorukov,

I would need to check, but I don't think this is implemented in Java Dataset yet.

Do you need to use Datasets for your application? (In other words, is your data spread over multiple files or too big to fit in memory?) If not, you should be able to use a FileReader to load the data.
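For the single-file case mentioned above, a minimal sketch might look like the following. It assumes the "FileReader" meant here is `org.apache.arrow.vector.ipc.ArrowFileReader` from the `arrow-vector` module; the class name `ReadArrowFile` and the command-line path argument are hypothetical and only there for illustration.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileReader;

// Hypothetical helper class, not part of Arrow.
public class ReadArrowFile {

    // Loads every record batch of one Arrow IPC file and returns the total row count.
    static long countRows(Path path) throws IOException {
        long rows = 0;
        try (BufferAllocator allocator = new RootAllocator();
             FileChannel channel = FileChannel.open(path, StandardOpenOption.READ);
             ArrowFileReader reader = new ArrowFileReader(channel, allocator)) {
            // The same root is reused; each loadNextBatch() call refills it.
            VectorSchemaRoot root = reader.getVectorSchemaRoot();
            while (reader.loadNextBatch()) {
                rows += root.getRowCount();
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: ReadArrowFile <file.arrow>");
            return;
        }
        System.out.println("rows: " + countRows(Paths.get(args[0])));
    }
}
```

This reads the IPC file format (with footer). For the streaming format, `ArrowStreamReader` in the same package takes an `InputStream` instead of a seekable channel.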

igor-suhorukov · 2022-08-03T09:50:38Z

This functionality is required to implement an Arrow file/stream input format for my use case: processing a large amount of existing geospatial data in Arrow format through an Apache Spark data source. The Optimized Analytics Package (OAP) for Spark could also leverage this Dataset feature on the JVM; it uses FileSystemDatasetFactory in its Spark gazelle_plugin adapter.

Jira ticket for this improvement: ARROW-17303.

…tory (#13760) (#13811)

This PR allows developers to create a Dataset from Arrow IPC files in JVM code like:

`FileSystemDatasetFactory factory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(), FileFormat.ARROW_IPC, arrowDatasetURL);`

It is the foundation for an Apache Spark Arrow data source that processes huge existing partitioned datasets in the Arrow file format without additional data format conversion.

Lead-authored-by: Igor Suhorukov <[email protected]>
Co-authored-by: igor.suhorukov <[email protected]>
Signed-off-by: David Li <[email protected]>
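A fuller sketch of how the factory from this PR might be used end to end is shown below. It assumes an Arrow Java release that includes `FileFormat.ARROW_IPC` and `Scanner.scanBatches()` (earlier releases expose scan tasks instead), plus the `arrow-dataset` module with its native library on the classpath. The class name `ScanArrowIpcDataset`, the batch size, and the URI argument are hypothetical.

```java
import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
import org.apache.arrow.dataset.jni.NativeMemoryPool;
import org.apache.arrow.dataset.scanner.ScanOptions;
import org.apache.arrow.dataset.scanner.Scanner;
import org.apache.arrow.dataset.source.Dataset;
import org.apache.arrow.dataset.source.DatasetFactory;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.ipc.ArrowReader;

// Hypothetical helper class, not part of Arrow.
public class ScanArrowIpcDataset {

    // Scans an Arrow IPC dataset at the given URI (e.g. "file:///data/points.arrow")
    // and returns the total number of rows seen.
    static long countRows(String uri) throws Exception {
        // Batch size is an arbitrary illustrative value.
        ScanOptions options = new ScanOptions(/* batchSize= */ 32768);
        long rows = 0;
        try (BufferAllocator allocator = new RootAllocator();
             DatasetFactory factory = new FileSystemDatasetFactory(
                     allocator, NativeMemoryPool.getDefault(), FileFormat.ARROW_IPC, uri);
             Dataset dataset = factory.finish();
             Scanner scanner = dataset.newScan(options);
             ArrowReader reader = scanner.scanBatches()) {
            // Each loadNextBatch() call refills the reader's VectorSchemaRoot.
            while (reader.loadNextBatch()) {
                rows += reader.getVectorSchemaRoot().getRowCount();
            }
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: ScanArrowIpcDataset <file-uri>");
            return;
        }
        System.out.println("rows: " + countRows(args[0]));
    }
}
```

The try-with-resources block closes the reader, scanner, dataset, factory, and allocator in reverse order, which matters because the allocator reports a leak if buffers are still held when it closes.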

igor-suhorukov mentioned this issue Aug 3, 2022

How to save org.apache.arrow.vector.VectorSchemaRoot into parquet file in Java API #13759

Open

lidavidm linked a pull request Aug 8, 2022 that will close this issue

ARROW-17303: [Java][Dataset] Read Arrow IPC files by NativeDatasetFactory (#13760) #13811

Merged

lidavidm closed this as completed in #13811 Aug 8, 2022
