-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-12480: [Java][Dataset] FileSystemDataset: Support reading from a directory #10114
Conversation
94d94c0
to
c805908
Compare
c805908
to
86aab3f
Compare
86aab3f
to
d318abc
Compare
e96e1d1
to
49dd067
Compare
@emkornfield - Are you able to have a review for this patch? Just a fundamental functionality to make Java dataset API able to read from files in folder. |
Ping @kou @liyafan82 @kszucs - If we have other active Java code committers? |
Sorry, not sure when I will be able to take a look (possibility of this weekend) but I'm generally behind on reviews. |
I'm not familiar with Java. @kiszk Can you review this? |
Sure, I will do a review. |
datum.forEach(batch -> assertEquals(1, batch.getLength())); | ||
checkParquetReadResult(schema, expectedJsonUnordered, datum); | ||
|
||
AutoCloseables.close(datum); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use try (... datum = ...) { ... }
at line 153? So, we can remove this line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that datum
is of type List that is not AutoClosable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe place this in a finally block?
uri = "file://" + path; | ||
writer = AvroParquetWriter.<GenericRecord>builder(new org.apache.hadoop.fs.Path(path)) | ||
writer = AvroParquetWriter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Do we need this format change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just a trivial change to keep code readable.
|
||
|
||
public ParquetWriteSupport(String schemaName, File outputFolder) throws Exception { | ||
avroSchema = readSchemaFromFile(schemaName); | ||
path = outputFolder.getPath() + File.separator + "generated.parquet"; | ||
path = outputFolder.getPath() + File.separator + "generated-" + random.nextLong() + ".parquet"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this change wants to get a unique name for a short period.
How about using System.currentTimeMillis()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't milli cause collisions? ParquetWriteSupport
is instantiated rapidly during ut execution.
By the way any special reason to use time rather than random? The files can be dropped after finishing ut. Users should not be aware of the files.
Looks good to me with a few comments. I would like to get another review, for example, by @emkornfield or @liyafan82 |
java/dataset/src/test/java/org/apache/arrow/dataset/ParquetWriteSupport.java
Show resolved
Hide resolved
java/dataset/src/test/java/org/apache/arrow/dataset/file/TestFileSystemDataset.java
Outdated
Show resolved
Hide resolved
222ce24
to
1a6353a
Compare
Thanks everyone, merging! |
Benchmark runs are scheduled for baseline = bbbe668 and contender = e12a454. e12a454 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
No description provided.