ARROW-12480: [Java][Dataset] FileSystemDataset: Support reading from a directory #10114

Closed
wants to merge 6 commits into from

zhztheplayer
Member

No description provided.

@zhztheplayer
Member Author

@emkornfield - Could you review this patch? It adds basic functionality that lets the Java Dataset API read files from a directory.

@zhztheplayer
Member Author

Ping @kou @liyafan82 @kszucs - Are there any other active Java committers who could review this?

@emkornfield
Contributor

Sorry, I'm not sure when I will be able to take a look (possibly this weekend); I'm generally behind on reviews.

@kou
Member

kou commented Dec 7, 2021

I'm not familiar with Java. @kiszk Can you review this?

@kiszk
Member

kiszk commented Dec 7, 2021

Sure, I will do a review.

datum.forEach(batch -> assertEquals(1, batch.getLength()));
checkParquetReadResult(schema, expectedJsonUnordered, datum);

AutoCloseables.close(datum);
Member

Can we use try (... datum = ...) { ... } at line 153? Then we could remove this line.

Member Author

@zhztheplayer zhztheplayer Dec 15, 2021

It seems that datum is of type List, which is not AutoCloseable.

Contributor

Maybe place this in a finally block?
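
The suggestion above can be sketched as follows. Because `datum` is a `List` rather than an `AutoCloseable`, try-with-resources does not apply to it directly, but a `finally` block still guarantees cleanup when an assertion throws. This is a minimal standalone sketch, not the code from this PR: `Batch`, `closeAll`, and `checkAndClose` are hypothetical names, and `closeAll` only stands in for Arrow's `AutoCloseables.close` without reproducing its exact behavior.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a record batch; it only tracks whether it was closed. */
class Batch implements AutoCloseable {
  private final int length;
  private boolean closed = false;

  Batch(int length) {
    this.length = length;
  }

  int getLength() {
    return length;
  }

  boolean isClosed() {
    return closed;
  }

  @Override
  public void close() {
    closed = true;
  }
}

class FinallyCloseSketch {
  /** Closes every resource; keeps the first failure and attaches later ones as suppressed. */
  static void closeAll(List<? extends AutoCloseable> resources) {
    RuntimeException first = null;
    for (AutoCloseable resource : resources) {
      try {
        resource.close();
      } catch (Exception e) {
        if (first == null) {
          first = new IllegalStateException("failed to close resource", e);
        } else {
          first.addSuppressed(e);
        }
      }
    }
    if (first != null) {
      throw first;
    }
  }

  /** Runs the length checks, then closes the batches even if a check throws. */
  static void checkAndClose(List<Batch> datum, int expectedLength) {
    try {
      for (Batch batch : datum) {
        if (batch.getLength() != expectedLength) {
          throw new AssertionError("unexpected batch length: " + batch.getLength());
        }
      }
    } finally {
      closeAll(datum);
    }
  }

  public static void main(String[] args) {
    List<Batch> datum = new ArrayList<>();
    datum.add(new Batch(1));
    datum.add(new Batch(2)); // deliberately wrong length so the check fails
    try {
      checkAndClose(datum, 1);
    } catch (AssertionError expected) {
      // The check failed, but the finally block still closed both batches.
    }
    System.out.println(datum.get(0).isClosed() && datum.get(1).isClosed());
  }
}
```

With the close call outside a `finally` block, a failing assertion would skip it and leave the batches open, which in Arrow typically surfaces later as an allocator leak report rather than at the failing line.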

  uri = "file://" + path;
- writer = AvroParquetWriter.<GenericRecord>builder(new org.apache.hadoop.fs.Path(path))
+ writer = AvroParquetWriter
Member

nit: Do we need this format change?

Member Author

It's just a trivial change to keep code readable.



  public ParquetWriteSupport(String schemaName, File outputFolder) throws Exception {
    avroSchema = readSchemaFromFile(schemaName);
-   path = outputFolder.getPath() + File.separator + "generated.parquet";
+   path = outputFolder.getPath() + File.separator + "generated-" + random.nextLong() + ".parquet";
Member

I think this change is meant to produce a file name that stays unique for a short period.

How about using System.currentTimeMillis() instead?

Member Author

Wouldn't millisecond timestamps cause collisions? ParquetWriteSupport is instantiated rapidly during unit test execution.

By the way, is there a particular reason to prefer a timestamp over a random number? The files can be deleted once the unit tests finish, and users should never need to know about them.
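
To make the trade-off in this thread concrete, here is a small standalone sketch that generates file names with three strategies and counts how many distinct names come out of a tight loop. The class and method names are hypothetical and not part of the PR. The third strategy, `File.createTempFile`, is an alternative not raised above: the JDK guarantees a new file name on each call, but it also creates an empty file on disk, so a writer that refuses to overwrite existing files would need that handled separately.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;
import java.util.function.Supplier;

class UniqueFileNameSketch {
  private static final Random RANDOM = new Random();

  /** Timestamp-based name: calls within the same millisecond produce the same name. */
  static String timestampName(File folder) {
    return folder.getPath() + File.separator + "generated-" + System.currentTimeMillis() + ".parquet";
  }

  /** Random-based name, mirroring the approach in the diff above. */
  static String randomName(File folder) {
    return folder.getPath() + File.separator + "generated-" + RANDOM.nextLong() + ".parquet";
  }

  /** Lets the JDK pick the name; note that this creates an empty file on disk. */
  static String tempFileName(File folder) {
    try {
      File file = File.createTempFile("generated-", ".parquet", folder);
      file.deleteOnExit();
      return file.getPath();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Creates a scratch folder that is removed when the JVM exits. */
  static File newTempFolder() {
    try {
      File folder = Files.createTempDirectory("parquet-sketch").toFile();
      folder.deleteOnExit();
      return folder;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Calls the strategy repeatedly and returns how many distinct names it produced. */
  static int countDistinct(Supplier<String> strategy, int calls) {
    Set<String> names = new HashSet<>();
    for (int i = 0; i < calls; i++) {
      names.add(strategy.get());
    }
    return names.size();
  }

  public static void main(String[] args) {
    File folder = newTempFolder();
    int calls = 1000;
    System.out.println("timestamp: " + countDistinct(() -> timestampName(folder), calls) + " distinct of " + calls);
    System.out.println("random:    " + countDistinct(() -> randomName(folder), calls) + " distinct of " + calls);
    System.out.println("temp file: " + countDistinct(() -> tempFileName(folder), calls) + " distinct of " + calls);
  }
}
```

On most machines the timestamp count should come out far below the number of calls, since many iterations complete within one millisecond. The exact figure depends on the hardware, so no expected output is shown.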

@kiszk
Member

kiszk commented Dec 11, 2021

Looks good to me with a few comments.

I would like to get another review, for example, by @emkornfield or @liyafan82

@zhztheplayer zhztheplayer force-pushed the ARROW-12480 branch 2 times, most recently from 222ce24 to 1a6353a Compare December 17, 2021 03:49
@kszucs kszucs requested a review from kiszk January 14, 2022 13:03
@kszucs kszucs requested a review from liyafan82 January 14, 2022 13:03
@kszucs
Member

kszucs commented Jan 17, 2022

Thanks everyone, merging!

@kszucs kszucs closed this in e12a454 Jan 17, 2022
@ursabot

ursabot commented Jan 17, 2022

Benchmark runs are scheduled for baseline = bbbe668 and contender = e12a454. e12a454 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.22% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java


7 participants