Add parquet test file for all supported types in 2.5.0 format #40

Closed
alamb opened this issue Apr 26, 2021 · 5 comments
Labels
arrow Changes to the arrow crate

Comments

@alamb
Contributor

alamb commented Apr 26, 2021

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-4936

Suggested A/C

  • Generate a Parquet file using the Java API and check it into the arrow-testing repo
  • Write unit tests in the parquet crate for reading all types
  • Write unit tests in the datafusion crate for reading all types
@alamb alamb added the arrow Changes to the arrow crate label Apr 26, 2021
@alamb
Contributor Author

alamb commented Apr 26, 2021

Comment from Dr. Christoph Jung(doc_schorsch) @ 2021-04-24T07:08:11.717+0000:

I'm willing to volunteer for this one.

[https://github.com/drcgjung]
{quote}Being generally interested in contributing to the datafusion/ballista development stream.
 ~5 years of professional experience in Apache Spark (RDD & dataframe) for large-scale measurement data.
 ~4 years of open source contributions to JBoss (aka "Wildfly") a while back.
{quote}
Some obligatory questions:
 * Java API = [https://github.com/apache/parquet-mr] ?

 * Parquet Format = [https://github.com/apache/parquet-format] ?

 * Is it sufficient to restrict to the 2.6.0 format (Java API >= 1.11)? There really was no Java API release using 2.5.0, and
 [https://github.com/apache/arrow/blob/master/rust/parquet/README.md] refers to 2.6.0.

 * "arrow-testing" = [https://github.com/apache/arrow-testing] ?
 * New folder there, data/parquet/types ?
 * Where to put the java generator project, also there? 

 * Is it better to have a single Parquet file with all the types, or one Parquet file per basic type (there can be many derived ones, see below)?

 * Would it be good to include the format version in the test Parquet file name (for later additions when rust/parquet upgrades the format)?

 * I count 14 "plain" logical, parameterized types.

 * I count 29 relevant basic type instantiations; each could be represented as mandatory and as optional (=> 58 test types)
 ** string
 ** enum
 ** uuid
 ** int_8, ... uint_64
 ** decimal_32, decimal_64 (maybe additional precision tests?)
 ** date
 ** time_utc_millis, time_utc_micros, time_utc_nanos, time_local_millis, time_local_micros, time_local_nanos
 ** timestamp_utc_millis, ... timestamp_local_nanos
 ** interval
 ** json, bson

 * Nested types could be derived in arbitrary combinations, but I guess it's OK to have
 one LIST and two MAP types per basic test type (one with it as the required key and one with it as the value). Again,
 each nested type could be mandatory and optional. (=> 58*2 + 29*2 + 58*2 = 290 nested test types)

 * There will be two PRs necessary because of the two repositories involved (arrow hard-linking to a version in the arrow-testing repo). The
 arrow PR will have to change the version link to the arrow-testing repo (which is maybe not safe for other arrow subprojects). Is that ok?
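
The counts in the bullets above can be checked with a small enumeration. The following is only a sketch in Python, not part of the original comment: the column-naming scheme is hypothetical, and it assumes that "int_8, ... uint_64" expands to the eight signed and unsigned widths (8, 16, 32, 64) and that the elided timestamp entries mirror the six time entries.

```python
from itertools import product

# Assumption: "int_8, ... uint_64" means signed and unsigned 8/16/32/64-bit ints.
INT_TYPES = [f"{sign}int_{bits}" for sign in ("", "u") for bits in (8, 16, 32, 64)]

# Assumption: timestamps cover the same utc/local x millis/micros/nanos grid as times.
ADJUSTMENTS = ("utc", "local")
UNITS = ("millis", "micros", "nanos")
TIME_TYPES = [f"time_{a}_{u}" for a, u in product(ADJUSTMENTS, UNITS)]
TIMESTAMP_TYPES = [f"timestamp_{a}_{u}" for a, u in product(ADJUSTMENTS, UNITS)]

BASIC_TYPES = (
    ["string", "enum", "uuid"]
    + INT_TYPES
    + ["decimal_32", "decimal_64", "date"]
    + TIME_TYPES
    + TIMESTAMP_TYPES
    + ["interval", "json", "bson"]
)

REPETITIONS = ("required", "optional")


def basic_columns():
    """Every basic type once as required and once as optional."""
    return [f"{rep}_{t}" for t in BASIC_TYPES for rep in REPETITIONS]


def nested_columns():
    """One LIST and two MAP variants per basic type; hypothetical column names."""
    cols = []
    for outer in REPETITIONS:  # the nested column itself is required or optional
        # LIST whose element is a required or optional basic type: 58 per outer
        for rep, t in product(REPETITIONS, BASIC_TYPES):
            cols.append(f"{outer}_list_of_{rep}_{t}")
        # MAP with the basic type as key; keys are always required: 29 per outer
        for t in BASIC_TYPES:
            cols.append(f"{outer}_map_key_{t}")
        # MAP with a required or optional basic type as value: 58 per outer
        for rep, t in product(REPETITIONS, BASIC_TYPES):
            cols.append(f"{outer}_map_value_{rep}_{t}")
    return cols


if __name__ == "__main__":
    print(len(BASIC_TYPES), len(basic_columns()), len(nested_columns()))
```

Under those assumptions the script prints `29 58 290`, matching the totals given above.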

Thanks if/for considering me ;)

@nevi-me
Contributor

nevi-me commented May 8, 2021

@andygrove I'm comfortable that we support all the 2.6.0 data types to the extent that we can. Can we close this?

@ByteBaker
Contributor

@alamb can we close this?

@alamb
Contributor Author

alamb commented Oct 3, 2024

I agree we can close this -- the repo now uses data from the parquet-testing repo (which is largely generated by java parquet)

@alamb
Contributor Author

alamb commented Oct 3, 2024

Thanks @ByteBaker

@alamb alamb closed this as completed Oct 3, 2024

3 participants