ak.Record should have the same Pandas-style constructor that ak.Array has #1978

jpivarski · 2022-12-08T15:51:59Z

From @jpata on https://gitter.im/Scikit-HEP/awkward-array

sorry to bother, I'm trying to figure out if/how it's possible to create an empty Record with a specific datatype, to be able to save in parquet.

basically, I have something like this, which works

import awkward
import numpy as np

j1 = awkward.from_numpy(np.ones(1, np.int32))
r = awkward.Record({"d": j1})
awkward.to_parquet(r, "test.parquet")

but sometimes, depending on the data, the array j1 is empty, in which case, the export to parquet fails

j1 = awkward.from_numpy(np.empty(0, np.int32))
r = awkward.Record({"d": j1})
awkward.to_parquet(r, "test.parquet")
# fails with "NullType Arrow field must be nullable"

what's the right way to fix this?

maybe this is a bit clearer

# works
j1 = awkward.from_iter([[1], [2]])
awkward.to_parquet({"d": j1}, "test.parquet")

# how to do this while specifying the datatype, as above?
j1 = awkward.from_iter([[], []])
awkward.to_parquet({"d": j1}, "test.parquet")

My response

@jpata You've found some quirks in how ak.Records get constructed that should get fixed before the API gets frozen today or tomorrow (in the 2.0.0 release). So, good timing!

What's weird about these records is their data type. You want the field to be a list of integers with zero entries, but its item type is unknown. That's because the ak.Record(dict(...)) constructor iterates over the data in the dict, treating it as generic Python objects to be interpreted with ak.from_iter. With generic Python objects, if a list is empty, the type of the data in that list is unknown.

>>> j1 = ak.from_numpy(np.empty(0, np.int32))
>>> j1
<Array [] type='0 * int32'>
>>> ak.Record({"d": j1})
<Record {d: []} type='{d: var * unknown}'>

By contrast, the ak.Array constructor recognizes "dict of arrays" as a special case, in which the arrays are taken to be columns. We call this the "Pandas-style constructor" because it's what you'd expect when constructing a Pandas DataFrame. Arbitrary data in an ak.Array constructor (neither an array nor a dict of arrays, but some other Python objects, including lists) invokes ak.from_iter.

>>> ak.Array({"d": j1})
<Array [] type='0 * {d: int32}'>

So you could get an ak.Record with a field that is a length-zero list of integers by

>>> ak.Array({"d": j1[np.newaxis]})[0]
<Record {d: []} type='{d: 0 * int32}'>

But we should add a special case to the ak.Record constructor to match the special case in the ak.Array constructor, so that you can do this with ak.Record({"d": j1}). The case for ak.Record is even stronger than the case for ak.Array: the Pandas-style ak.Array constructor takes data in struct-of-arrays (SOA) form and presents it as a (virtual) array of structs (AOS), which is a change in structure, whereas an equivalent ak.Record constructor would change nothing structurally (there's no "A" here).

The next step, actually writing this to Parquet, works:

>>> ak.to_parquet(ak.Array({"d": j1[np.newaxis]})[0], "/tmp/test.parquet")
<pyarrow._parquet.FileMetaData object at 0x7fa2b81772c0>
  created_by: parquet-cpp-arrow version 9.0.0
  num_columns: 1
  num_rows: 1
  num_row_groups: 1
  format_version: 2.6
  serialized_size: 0

but the subsequent step, reading it back with ak.from_parquet, fails with a pyarrow.lib.ArrowInvalid error. It might be a missing case in pyarrow: Parquet files with only one record in them are weird. That's another thing that I'll look into, though the fix might land in version 2.0.1 or 2.0.2. (It's not an API-changing thing.)


jpivarski · 2022-12-08T19:37:03Z

I'm stealing this back because I had already started and it will be very quick.

jpivarski added the bug (The problem described is something that must be fixed), policy (Choice of behavior), and pr-next-release (Required for the next release) labels Dec 8, 2022

jpivarski assigned jpivarski and agoose77 and unassigned jpivarski and agoose77 Dec 8, 2022

jpivarski linked a pull request Dec 8, 2022 that will close this issue

fix: ak.Record dict constructor should retain type. #1981

Merged

agoose77 closed this as completed in #1981 Dec 8, 2022

jpivarski removed the pr-next-release (Required for the next release) label Feb 15, 2023
