Add parquet test file for all supported types in 2.5.0 format #40
Labels
arrow
Changes to the arrow crate
Comments
Comment from Dr. Christoph Jung (doc_schorsch) @ 2021-04-24T07:08:11.717+0000:

I'm willing to volunteer for this one. [https://github.com/drcgjung]

{quote}Being generally interested in contributing to the datafusion/ballista development stream. ~5 years of professional experience in Apache Spark (RDD & dataframe) for large-scale measurement data. ~4 years of open source contributions to JBoss (aka "Wildfly") a while back.{quote}

Some obligatory questions:
* Java API = [https://github.com/apache/parquet-mr] ?
* Parquet Format = [https://github.com/apache/parquet-format] ?
* Is it sufficient to restrict to the 2.6.0 format (Java API >= 1.11)? There really was no Java API release using 2.5.0, and [https://github.com/apache/arrow/blob/master/rust/parquet/README.md] refers to 2.6.0.
* "arrow-testing" = [https://github.com/apache/arrow-testing] ?
* New folder there, data/parquet/types ?
* Where to put the Java generator project, also there?
* Better to have a single parquet file with all the types, or one parquet file per basic type (there can be many derived ones, see below)?
* Would it be good to include the format version in the test parquet file name (for later additions when rust/parquet upgrades the format)?
* I count 14 "plain" logical, parameterized types.
* I count 29 relevant basic type instantiations, each of which could be represented as mandatory and optional (=> 58 test types):
** string
** enum
** uuid
** int_8, ... uint_64
** decimal_32, decimal_64 (maybe additional precision tests?)
** date
** time_utc_millis, time_utc_micros, time_utc_nanos, time_local_millis, time_local_micros, time_local_nanos
** timestamp_utc_millis, ... timestamp_local_nanos
** interval
** json, bson
* Nested types could be derived in arbitrary combinations, but I guess it's ok to have one LIST and two MAP types per basic test type (one as required key and one as value). Again, each nested type could be mandatory and optional. (=> 58*2 + 29*2 + 58*2 = 290 nested test types)
* There will be two PRs necessary because of the two repositories involved (arrow hard-links to a version in the arrow-testing repo). The arrow PR will have to change the version link to the arrow-testing repo (which is maybe not safe for other arrow subprojects). Is that ok?

Thanks if/for considering me ;)
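The counts in the comment above (29 basic types, 58 flat test types, 290 nested test types) can be checked by enumerating the combinations. The following is a minimal sketch, not part of any generator project: the type names are the informal labels from the comment rather than parquet-mr or parquet-format identifiers, the int range is assumed to expand to 8/16/32/64-bit signed and unsigned, and the helper names (`flat_columns`, `nested_columns`) are hypothetical.

```python
# Informal type labels taken from the comment; NOT real Parquet API identifiers.
# Assumption: "int_8, ... uint_64" means signed and unsigned at 8/16/32/64 bits.
INT_TYPES = [f"{sign}int_{bits}" for sign in ("", "u") for bits in (8, 16, 32, 64)]
ADJUSTMENTS = ("utc", "local")
UNITS = ("millis", "micros", "nanos")
TIME_TYPES = [f"time_{a}_{u}" for a in ADJUSTMENTS for u in UNITS]
TIMESTAMP_TYPES = [f"timestamp_{a}_{u}" for a in ADJUSTMENTS for u in UNITS]

BASIC_TYPES = [
    "string", "enum", "uuid",
    *INT_TYPES,
    "decimal_32", "decimal_64",
    "date",
    *TIME_TYPES,
    *TIMESTAMP_TYPES,
    "interval",
    "json", "bson",
]

REPETITIONS = ("required", "optional")


def flat_columns():
    """Every basic type, once as required and once as optional."""
    return [(rep, t) for t in BASIC_TYPES for rep in REPETITIONS]


def nested_columns():
    """One LIST and two MAP variants per basic type, each nested column
    itself being required or optional."""
    cols = []
    for outer in REPETITIONS:
        # LIST whose element is any flat test type: 58 per outer repetition.
        cols += [(outer, "list", rep, t) for rep, t in flat_columns()]
        # MAP with the basic type as key; keys are always required: 29.
        cols += [(outer, "map_key", "required", t) for t in BASIC_TYPES]
        # MAP with any flat test type as value: 58.
        cols += [(outer, "map_value", rep, t) for rep, t in flat_columns()]
    return cols


if __name__ == "__main__":
    print(len(BASIC_TYPES), len(flat_columns()), len(nested_columns()))
```

Under these assumptions the enumeration reproduces the comment's arithmetic: 58*2 LIST columns, 29*2 MAP-key columns, and 58*2 MAP-value columns.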
@andygrove I'm comfortable that we support all the 2.6.0 data types to the extent that we can. Can we close this?
@alamb can we close this?
I agree we can close this -- the repo now uses data from the parquet-testing repo (which is largely generated by java parquet)
Thanks @ByteBaker
Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-4936
Suggested A/C