Implementing ak._v2.to_parquet. #1440
…ng 'OSError: Data size too small for number of values (corrupted file?)' on read-back.
for more information, see https://pre-commit.ci
Please see here for the list of options we thought would be important when we talked this through.
…ikit-hep/awkward-1.0 into jpivarski/start-v2-to_parquet
Per-leaf-column selection is implemented: all ParquetWriter arguments that accept a dict or list can be selected via a dict of Awkward column selectors. Selectors are strings or iterables of strings that get passed to the form's select_columns method:
>>> array = ak._v2.Array([[{"x": 1.1, "y": [1], "z": "one"}, {"x": 2.2, "y": [1, 2], "z": "two"}], [], [{"x": 3.3, "y": [1, 2, 3], "z": "three"}]])
>>> print(array.layout.form.select_columns("x"))
{
"class": "ListOffsetArray",
"offsets": "i64",
"content": {
"class": "RecordArray",
"contents": {
"x": "float64"
}
}
}
>>> print(array.layout.form.select_columns("y"))
{
"class": "ListOffsetArray",
"offsets": "i64",
"content": {
"class": "RecordArray",
"contents": {
"y": {
"class": "ListOffsetArray",
"offsets": "i64",
"content": "int64"
}
}
}
}
>>> print(array.layout.form.select_columns("z"))
{
"class": "ListOffsetArray",
"offsets": "i64",
"content": {
"class": "RecordArray",
"contents": {
"z": {
"class": "ListOffsetArray",
"offsets": "i64",
"content": {
"class": "NumpyArray",
"primitive": "uint8",
"parameters": {
"__array__": "char"
}
},
"parameters": {
"__array__": "string"
}
}
}
}
}
>>> print(array.layout.form.select_columns(["x", "y"]))
{
"class": "ListOffsetArray",
"offsets": "i64",
"content": {
"class": "RecordArray",
"contents": {
"x": "float64",
"y": {
"class": "ListOffsetArray",
"offsets": "i64",
"content": "int64"
}
}
}
}
They're wildcard-friendly and don't have the " Once sliced,
>>> array.layout.form.column_types()
(dtype('float64'), dtype('int64'), 'string')
Again, it ignores whatever lists or option-types it encountered on the way down to the leaves, and this
They were to be used here:
But Arrow is saying that nested column buffers made this way are too short, that the file is possibly corrupted. If I'm not misunderstanding something, this looks like a bug in pyarrow.
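To make the selection and leaf-type behavior above concrete, here is a minimal standalone sketch that works on plain dicts shaped like the printed forms. It is not Awkward's implementation: the function bodies, the `FORM` literal, and the use of `fnmatch` for wildcards are assumptions for illustration, and this `column_types` returns primitive names as strings rather than NumPy dtypes.

```python
from fnmatch import fnmatchcase

# Hypothetical stand-in for the form printed above, written as plain dicts.
FORM = {
    "class": "ListOffsetArray", "offsets": "i64",
    "content": {"class": "RecordArray", "contents": {
        "x": "float64",
        "y": {"class": "ListOffsetArray", "offsets": "i64", "content": "int64"},
        "z": {"class": "ListOffsetArray", "offsets": "i64",
              "content": {"class": "NumpyArray", "primitive": "uint8",
                          "parameters": {"__array__": "char"}},
              "parameters": {"__array__": "string"}},
    }},
}

def _is_string(form):
    return isinstance(form, dict) and form.get("parameters", {}).get("__array__") == "string"

def _is_leaf(form):
    # Primitives, NumpyArray nodes, and strings all count as one leaf column.
    return isinstance(form, str) or form["class"] == "NumpyArray" or _is_string(form)

def select_columns(form, selectors, path=()):
    """Return a pruned copy of `form` keeping only matching leaves, else None."""
    patterns = [selectors] if isinstance(selectors, str) else list(selectors)
    if _is_leaf(form):
        name = ".".join(path)
        return form if any(fnmatchcase(name, p) for p in patterns) else None
    if form["class"] == "RecordArray":
        kept = {}
        for field, sub in form["contents"].items():
            pruned = select_columns(sub, patterns, path + (field,))
            if pruned is not None:
                kept[field] = pruned
        return {**form, "contents": kept} if kept else None
    # List-like node: descend without adding anything to the column name.
    pruned = select_columns(form["content"], patterns, path)
    return {**form, "content": pruned} if pruned is not None else None

def column_types(form):
    """Collect leaf types, ignoring list nodes met on the way down."""
    if isinstance(form, str):
        return (form,)
    if _is_string(form):
        return ("string",)
    if form["class"] == "NumpyArray":
        return (form["primitive"],)
    if form["class"] == "RecordArray":
        return tuple(t for sub in form["contents"].values() for t in column_types(sub))
    return column_types(form["content"])
```

With this sketch, `select_columns(FORM, "x")` keeps only the `x` field under the outer list, and `select_columns(FORM, "[yz]")` shows how a wildcard selector could pick several leaves at once.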
See above.
If the
I haven't tried it out yet.
That would be But the apparent bug I mentioned above is making me hesitate about that.
This is pretty much done. I think the options might change; in particular, I think it would be much better to be writing compliant list types, but there's an issue with that.
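For the per-leaf-column options discussed earlier in this thread, the rough idea of turning a dict of column selectors into one value per leaf column could look like the sketch below. The helper name `expand_option`, the first-match-wins rule, and the example codec names are all assumptions for illustration, not the behavior of this PR.

```python
from fnmatch import fnmatchcase

def expand_option(columns, option, default=None):
    """Resolve a {selector: value} dict into one value per leaf column.

    `columns` is a list of leaf column names; `option` is either a single
    value applied to every column or a dict whose keys are a string selector
    or a tuple of string selectors (wildcards allowed).
    """
    if not isinstance(option, dict):
        return {col: option for col in columns}
    resolved = {}
    for col in columns:
        resolved[col] = default
        for selector, value in option.items():
            patterns = [selector] if isinstance(selector, str) else list(selector)
            if any(fnmatchcase(col, p) for p in patterns):
                resolved[col] = value
                break  # first matching selector wins (an arbitrary choice here)
    return resolved

# Hypothetical usage: one codec for "x", another for "y" and "z".
per_column = expand_option(["x", "y", "z"], {"x": "ZSTD", ("y", "z"): "SNAPPY"})
```

A dict resolved this way has the same shape as the per-column dicts that writer arguments accept, which is why a selector-to-value mapping is a convenient front end for them.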
I meant directory name partitioning, i.e., following group-by.
(I forgot to mention that we discussed what decent defaults might be, per data type.) I know nothing about the possible pyarrow bug.
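As an illustration of the directory-name partitioning mentioned above, here is a small sketch that groups records by key and derives one subdirectory per group. It assumes the common `key=value` directory naming convention; the function name `partition_paths`, the `root` argument, and the sample records are hypothetical, and nothing is written to disk.

```python
from collections import defaultdict
from pathlib import PurePosixPath

def partition_paths(records, keys, root="dataset"):
    """Group records by the given keys and map each group to a directory."""
    groups = defaultdict(list)
    for rec in records:
        # One "key=value" path component per partitioning key, in order.
        subdir = PurePosixPath(root, *(f"{k}={rec[k]}" for k in keys))
        groups[str(subdir)].append(rec)
    return dict(groups)

# Hypothetical sample data, grouped by "year".
records = [
    {"year": 2021, "x": 1.1},
    {"year": 2022, "x": 2.2},
    {"year": 2021, "x": 3.3},
]
partitions = partition_paths(records, ["year"])
```

Each resulting group would then be written as its own file under its directory, so the partitioning key is recoverable from the path alone.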
The variable named