ak.Record should have the same Pandas-style constructor that ak.Array has #1978

Closed
jpivarski opened this issue Dec 8, 2022 · 1 comment · Fixed by #1981

Labels
bug The problem described is something that must be fixed policy Choice of behavior

Comments

@jpivarski

From @jpata on https://gitter.im/Scikit-HEP/awkward-array

sorry to bother, I'm trying to figure out if/how it's possible to create an empty Record with a specific datatype, to be able to save in parquet.

basically, I have something like this, which works

j1 = awkward.from_numpy(np.ones(1, np.int32))
r = awkward.Record({"d": j1})
awkward.to_parquet(r, "test.parquet")

but sometimes, depending on the data, the array j1 is empty, in which case, the export to parquet fails

j1 = awkward.from_numpy(np.empty(0, np.int32))
r = awkward.Record({"d": j1})
awkward.to_parquet(r, "test.parquet")
#fails with "NullType Arrow field must be nullable"

what's the right way to fix this?

maybe this is a bit clearer

#works
j1 = awkward.from_iter([[1], [2]])
awkward.to_parquet({"d": j1}, "test.parquet")
#how to do this, specifying the datatype as above?
j1 = awkward.from_iter([[], []])
awkward.to_parquet({"d": j1}, "test.parquet")

My response

@jpata You've found some quirks in how ak.Records get constructed that should get fixed before the API gets frozen today or tomorrow (in the 2.0.0 release). So, good timing!

What's weird about these records is their data type. You want it to be integer type with zero entries, but it's an unknown type. The reason for that is that the ak.Record(dict(...)) constructor is iterating over the data in the dict because it sees it as generic Python objects to be interpreted with ak.from_iter. With generic Python objects, if a list is empty, the type of the data in that list is unknown.

>>> j1 = ak.from_numpy(np.empty(0, np.int32))
>>> j1
<Array [] type='0 * int32'>
>>> ak.Record({"d": j1})
<Record {d: []} type='{d: var * unknown}'>
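To illustrate why iterating over generic Python objects loses the type of an empty list, here is a toy, pure-Python sketch. The function `infer_type` is hypothetical and is not how ak.from_iter is implemented; it only mimics the idea that an item type can be learned solely by looking at items.

```python
def infer_type(obj):
    """Toy type inference over plain Python objects.

    Hypothetical helper for illustration only; this is not Awkward Array's
    actual algorithm, it just produces similar-looking type strings.
    """
    if isinstance(obj, bool):  # check bool first: bool is a subclass of int
        return "bool"
    if isinstance(obj, int):
        return "int64"
    if isinstance(obj, float):
        return "float64"
    if isinstance(obj, list):
        # The item type can only be discovered from the items themselves.
        item_types = {infer_type(item) for item in obj}
        if not item_types:
            # An empty list carries no evidence about its item type.
            return "var * unknown"
        if len(item_types) > 1:
            raise TypeError("mixed item types are not handled in this sketch")
        return "var * " + item_types.pop()
    if isinstance(obj, dict):
        fields = ", ".join(f"{name}: {infer_type(value)}" for name, value in obj.items())
        return "{" + fields + "}"
    raise TypeError(f"unsupported object: {obj!r}")


print(infer_type({"d": [1]}))  # {d: var * int64}
print(infer_type({"d": []}))   # {d: var * unknown}
```

An array that already knows its dtype (such as the int32 array j1 above) only keeps that information if the constructor treats it as an array rather than iterating over it.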

By contrast, the ak.Array constructor recognizes "dict of arrays" as a special case, in which the arrays are taken to be columns. We call this the "Pandas-style constructor" because it's what you'd expect when constructing a Pandas DataFrame. Arbitrary data in an ak.Array constructor (neither an array nor a dict of arrays, but some other Python objects, including lists) invokes ak.from_iter.

>>> ak.Array({"d": j1})
<Array [] type='0 * {d: int32}'>

So you could get an ak.Record with a field that is a length-zero list of integers by

>>> ak.Array({"d": j1[np.newaxis]})[0]
<Record {d: []} type='{d: 0 * int32}'>

But we should add a special case to the ak.Record constructor to match the one in the ak.Array constructor, so that you can do this with ak.Record({"d": j1}). The case for doing this for ak.Record is even stronger than for ak.Array: the Pandas-style ak.Array constructor takes data in struct-of-arrays (SOA) form and makes it (virtually) array-of-structs (AOS), a change in structure, whereas an equivalent ak.Record constructor would not change the structure at all (there's no "A" here).
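The SOA (struct of arrays) versus AOS (array of structs) distinction can be sketched with plain Python containers. The helper `columns_to_rows` is a hypothetical name used only for illustration; Awkward Array does not materialize rows like this, which is why the conversion is only "virtual" there.

```python
def columns_to_rows(columns):
    """Turn a dict of equal-length columns (SOA) into a list of records (AOS).

    Hypothetical helper for illustration; not part of Awkward Array.
    """
    lengths = {len(column) for column in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    names = list(columns)
    # zip(*columns) walks the columns in lockstep, yielding one row at a time.
    return [dict(zip(names, values)) for values in zip(*columns.values())]


# Array case: the dict of columns changes structure, becoming a list of records.
soa = {"x": [1, 2, 3], "y": [1.1, 2.2, 3.3]}
aos = columns_to_rows(soa)
print(aos)

# Record case: a single record is already a dict of fields, so each value
# (here a whole list) can be used as that field's content without restructuring.
record = {"x": [1, 2, 3], "y": [1.1, 2.2, 3.3]}
print(record)
```

Note that a zero-length column produces zero rows in the array case, while in the record case the field simply holds an empty list.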

The next step, actually writing this to Parquet, works:

>>> ak.to_parquet(ak.Array({"d": j1[np.newaxis]})[0], "/tmp/test.parquet")
<pyarrow._parquet.FileMetaData object at 0x7fa2b81772c0>
  created_by: parquet-cpp-arrow version 9.0.0
  num_columns: 1
  num_rows: 1
  num_row_groups: 1
  format_version: 2.6
  serialized_size: 0

but the subsequent step, reading it back with ak.from_parquet, fails with a pyarrow.lib.ArrowInvalid error. This might be a missing case in pyarrow: Parquet files with only one record in them are unusual. That's another thing that I'll look into, though the fix might land in version 2.0.1 or 2.0.2. (It doesn't change the API.)

@jpivarski added the bug, policy, and pr-next-release labels Dec 8, 2022
@jpivarski jpivarski assigned jpivarski and agoose77 and unassigned jpivarski and agoose77 Dec 8, 2022
@jpivarski

I'm stealing this back because I had already started and it will be very quick.

@jpivarski linked a pull request Dec 8, 2022 that will close this issue
@jpivarski removed the pr-next-release label Feb 15, 2023