-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-777: Add Parquet CLI. #384
Conversation
There are still Cloudera copyright headers, which @tomwhite is planning to change to the Apache header on Cloudera's behalf. Thanks, Tom! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks really good. I'll definitely going to use this for debugging my parquet-cpp generated files.
parquet-cli/README.md
Outdated
Then, run the command-line using the `target/dependencies` for the classpath: | ||
|
||
``` | ||
java -cp 'target/parquet-cli-1.9.1-runtime.jar:target/dependency/*' org.apache.parquet.cli.Main |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make this a version independent string? Directly run straight into while trying it out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. Thanks for taking a look!
@rdblue is |
@sadikovi, the plan is for parquet-cli to be available through maven in the next release. I'm not sure about removing parquet-tools. I don't use them and I think that this is going to be easier to maintain, but I'm not going to insist they be removed. |
Let's not remove parquet-tools right now since the commands are not the same. If people shift to the new tool then we can deprecate and then remove it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a high level look. I like the direction this PR is going, the revamped cli looks nice! Few comments at a high level:
- As-is this PR is really massive :-). I think it would be great if we could break it up to make it iterative. That would allow the community to provide more through reviews as well. One suggestion to do this would be to start with the skeleton Command, BaseCommand list of proposed commands. We could then follow up with concrete implementations of subsets of individual commands. What do you think?
- Seems like we're missing thrift / protobuf support right? Would be nice to have thrift / proto equivalents to some of the avro command variants.
- If possible it would be nice to add some unit tests for each of the commands.
<groupId>org.apache.parquet</groupId> | ||
<artifactId>parquet</artifactId> | ||
<relativePath>../pom.xml</relativePath> | ||
<version>1.9.1-SNAPSHOT</version> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{project.version} ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This matches the other module POM files and is updated by Maven's release plugin. I don't have much of an opinion either way on best practice, but I think that this should match the others for now. Feel free to open a follow-up issue to change over to ${project.version}
if you think we should.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the one place where you can't put project.version since project.version come from the parent
case PARQUET: | ||
Configuration conf = new Configuration(getConf()); | ||
// TODO: add these to the reader builder | ||
AvroReadSupport.setRequestedProjection(conf, projection); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't we also need thrift / protobuf support? Or am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't really need thrift and protobuf support. It would be nice to be able to translate those formats into Parquet in the CLI tool, but it doesn't have to do everything yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed
import java.util.List; | ||
import java.util.NoSuchElementException; | ||
|
||
public abstract class BaseCommand implements Command, Configurable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we decouple things in this class a bit to make it more readable? Seems like there are a few things, all the FS support (create file, path resolution etc) which could be pulled out into a separate class. Classloader stuff could be pulled out, parquet avro record reader could be broken out, SeekableFSDataInputStream could be broken out too.
Ideally we should only define the methods here that are going to be overridden but specific command classes. Other pieces of code that specific sub commands need could be used via composition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The FS operations are convenience methods to help commands. That's basically the purpose of this base class and they require access to the Configuration set on the command instance. I think it makes sense to keep those here, so they're easily accessed by commands.
I agree that some of the static classes can be refactored and placed in a util package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the static embedded classes.
for (String cmd : helpCommands) { | ||
JCommander commander = jc.getCommands().get(cmd); | ||
if (commander == null) { | ||
console.error("Unknown command: {}", cmd); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe list the set of known commands here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. It now prints the generic help after the log message.
private static final long GB = 1024 * MB; | ||
|
||
public static String humanReadable(float bytes) { | ||
if (bytes > GB) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TB too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added. It won't get used since this is mostly for printing sizes inside of a row group, but it will be good to have if we end up printing total file size later.
import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.BINARY; | ||
|
||
@Parameters(commandDescription = "Check Parquet files for corrupt page and column stats (PARQUET-251)") | ||
public class CheckParquet251Command extends BaseCommand { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about naming this check corrupt stats instead of check parquet 251?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This only checks for PARQUET-251, where min and max aren't the actual min and max values for some String columns. It doesn't check, for example, that the correct sort order was used or that the min and max for numeric columns are valid. Because it is limited to checking for PARQUET-251, I'd leave the name as it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM
} | ||
} | ||
|
||
public static String humanReadable(long bytes) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should method be like snippet below instead of copying logic (there is also a possible typo around %.02f B
in above case and %d B
below)?
public static String humanReadable(long bytes) {
float fbytes = (float) bytes;
return humanReadable(fbytes);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bytes argument in the one above is a float because it is typically used as an average, showing the average number of bytes for values in a row group or page. That's why the float version uses %.02f
and not %d
. This function is called when the number of bytes is exact, so it uses %d
.
@piyushnarang, thanks for taking the time to review. I'll fix a few things that you pointed out and I've responded to some comments to clarify.
I don't think I have the time to break this into a series of commits. Perhaps it would be helpful to review a section at a time along the lines of the commits you suggested? It probably isn't very valuable to break it across commits in the long run, unless there are a huge number of changes that are needed.
This shouldn't hold up what's already done. It would be great if someone added the ability to read thrift and protobuf, but that can be done later.
It is possible to test, and a good idea in the long run. I wanted to get this out to users sooner than I would be able to find time to thoroughly test. Is it okay to focus on areas that must be tested to in order to get this committed, and add tests as it evolves later? |
I want to echo what @piyushnarang has commented here, if not for this PR but for future ones. This PR is really large. The larger a PR is, the less likely it is to be thoroughly reviewed. I think we really need to aim for smaller / more incremental PRs than this in general. I don't know if it's worth the time to split this one up now that it's been done, but I think we should avoid setting a precedent of PRs this large being common, especially for all new code (as opposed to large refactors that sometimes have to be large). Additionally, we need to also make sure that new code is accompanied by matching new tests. Wanting to get something into the hands of users early w/o the matching tests is also not something I think we should set a precedent for either. I don't think that means 100% test coverage, but there should be test coverage for the most important parts. Was this a project developed separately, and now it's being imported into the parquet codebase? That sort of situation is difficult, as it may have been built incrementally and reviewed thoroughly, and it's odd to break it up and import it piece by piece after the fact, so I do understand that. |
@isnotinvain, this was developed as a separate project. I originally thought it would be better as a separate Parquet repository since it isn't really a part of Parquet MR (but does depend on it). I think it's more important to get this project released than to add thorough unit tests at this point because we've been using it internally and know that it works for the most part. It also provides some great help for people experimenting with Parquet that you can't get elsewhere, like CSV, JSON, and Avro file conversions. I completely agree that we should aim for smaller PRs with test coverage. I think that's more the norm, like #390 that is a series of commits and has thorough tests for the ByteBufferInputStream classes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main comment would be that there should be tests for at least the utility methods (dynamic methods, schema stuff etc) If some code was borrowed from another project, probably the tests should come along.
/** | ||
* A wrapper for FSDataInputStream that implements Avro's SeekableInput. | ||
*/ | ||
public class SeekableFSDataInputStream extends InputStream implements SeekableInput { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be in parquet-avro?
<groupId>org.apache.parquet</groupId> | ||
<artifactId>parquet</artifactId> | ||
<relativePath>../pom.xml</relativePath> | ||
<version>1.9.1-SNAPSHOT</version> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the one place where you can't put project.version since project.version come from the parent
<relocations> | ||
<relocation> | ||
<!-- relocate Avro in the runtime jar to avoid conflicts with | ||
on-cluster Avro versions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and guava?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guava shouldn't be necessary because Hadoop uses 11.0.2 and this specifies 11.0.
case PARQUET: | ||
Configuration conf = new Configuration(getConf()); | ||
// TODO: add these to the reader builder | ||
AvroReadSupport.setRequestedProjection(conf, projection); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed
console.info("\n {}", example); | ||
} else { | ||
console.info(" {} {} {}", | ||
new Object[] {programName, cmd, example}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
console doesn't have varargs methods?
import java.util.regex.Pattern; | ||
|
||
|
||
public class Expressions { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
javadoc?
public class Expressions { | ||
private static final Pattern NUMERIC_RE = Pattern.compile("^\\d+$"); | ||
|
||
public static Object select(Schema schema, Object datum, String path) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
javadoc? It would make it easier to follow what this does
case MAP: | ||
Preconditions.checkArgument(datum instanceof Map, | ||
"Not a map: %s", datum); | ||
Map<Object, Object> map = (Map<Object, Object>) datum; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can also use Map<?,?>
and cast on get to avoid warnings
} | ||
} | ||
|
||
public static CodecFactory avroCodec(String codec) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in parquet-avro?
|
||
@Override | ||
public Iterator<E> iterator() { | ||
return this; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the contract of iterable would be to get a new one everytime
@isnotinvain @piyushnarang In general I agree that we should avoid large patches and get smaller ones. However this particular patch doesn't touch any of the existing parquet code and by itself has a lot of independent parts that are easy to review independently in one patch (various commands, some utility packages, etc) and I don't think that it would help much to have this particular patch split into smaller ones since it would basically be split by files. It would be more work on Ryan's side with very little benefit on the reviewer end. So I think we can make an exception here and review as is. I did a first pass review above. |
we should merge this. |
+1, I'm already using this regularly! |
Great! I'll rebase and commit it shortly. I think I was working on tests when I last had a chance... so I'll make sure that's reasonably complete. |
a7515ba
to
ce807af
Compare
ce807af
to
de49eff
Compare
This adds a new parquet-cli module with an improved command-line tool. The parquet-cli/README.md file has instructions for building and testing locally.