Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Iterating over Records and ak.from_iter #1582

Closed
jpivarski opened this issue Aug 10, 2022 · 6 comments · Fixed by #1725
Closed

Iterating over Records and ak.from_iter #1582

jpivarski opened this issue Aug 10, 2022 · 6 comments · Fixed by #1725
Labels
policy Choice of behavior

Comments

@jpivarski
Copy link
Member

jpivarski commented Aug 10, 2022

Currently, ak.Record.__iter__ returns an iterator over its field names, which may be what a user expects if a Record were a Mapping/MutableMapping like a dict, but it's not. It's probably for the best that it's not, because we don't want keys, values, items methods, particularly since we're trying to consolidate the nomenclature of field names as "fields", rather than "keys". And we especially don't want the non-vectorized __eq__ and __ne__ that Mapping has.

To throw a little more confusion in, ak.Record.__len__ always returns 1 and not the number of fields (the length of what it iterates over). That's a weird enough choice to be considered a bug.

Since v2 is coming up and we can make little backward-incompatible changes like this, perhaps we should just ensure that v2 ak.Record has no __iter__ and __len__ methods. If they're really needed for something, we can be driven by the use-case as to whether ak.Record iteration should be over field names, field-value pairs, or something else. At the very least, __len__ should return the length of the __iter__ iterable.

Related to this is how ak.Record is treated by ak.from_iter: it naively iterates, which means it gets the field names:

>>> import awkward as ak
>>> ak._v2.from_iter(ak._v2.Record({"a": 1, "b": 2}))
<Array ['a', 'b'] type='2 * string'>
>>> ak._v2.from_iter([ak._v2.Record({"a": 1, "b": 2})])
<Array [['a', 'b']] type='1 * var * string'>

Although iteration is a slow way to build an array and coming from a collection of ak.Record is suggestive that it's being used when a faster way is possible, the above is almost certainly not what the user had in mind. Changing v2 ak.from_iter to recognize ak.Record and build an equivalent thing is an improvement that would be independent of fixing the ak.Record.__iter__ and __len__ behavior, but both should be done.

Edit: also check ArrayBuilder.append; make sure it's consistent with ak.from_iter (it might be how ak.from_iter is implemented...)

@jpivarski jpivarski added the policy Choice of behavior label Aug 10, 2022
@agoose77
Copy link
Collaborator

I am inclined to agree that we should not push the record-as-mapping protocol here. We'd start getting into the weeds with how NumPy defines e.g. __contains__ and how a pure-Python mapping should etc. So, 👍 from me on making Record less clever.

@chernals
Copy link

The way I read this, this issue contained 2 things: disabling the iteration over records (done now) and fixing the way arrays are built from collections of records while iterating.

I would suggest that we reopen as the second one is not fixed.

@agoose77
Copy link
Collaborator

agoose77 commented Sep 26, 2022

@chernals you're correct; this issue did raise two closely related bugs / behavioural problems, and #1725 itself only addressed one of them. However, the building of records during iteration has also been fixed by #1721, which replaced v1 with our v2 submodule. The PR that closed this issue landed after #1721, so it's fixed in main.

There is an argument to be made for backporting this fix (and the separate array-of-records fix in #1725) to the v2 submodule in 1.10.x, but I don't know yet what our criteria are for doing this. In fact, your reply is a good opportunity for @jpivarski to clarify that for me :)

@chernals could you clarify the version of Awkward that you are using? Are you using the v2 submodule from our latest minor 1.10 release?

@jpivarski
Copy link
Member Author

This change was sufficient to fix both issues in v2 (this is from main/2.0.0rc1):

>>> import awkward as ak   # which is to say, v2
>>> ak.from_iter(ak.Record({"a": 1, "b": 2}))
<Record {a: 1, b: 2} type='{a: int64, b: int64}'>
>>> ak.from_iter([ak.Record({"a": 1, "b": 2})])
<Array [{a: 1, b: 2}] type='1 * {a: int64, b: int64}'>
>>> ak.from_iter([[ak.Record({"a": 1, "b": 2})]])
<Array [[{a: 1, b: 2}]] type='1 * var * {a: int64, b: int64}'>

The question, then, is whether it should be or can be backported into v1. Our stance is that we're only making essential bug-fixes in v1 now. A change in behavior like this might be considered a bug-fix (though the iteration over field names had once been intentional—it just had unforeseen consequences), but it's "not essential" in the sense that there's a work-around.

In fact, we've been making very few changes to v1 in practice for about a year now. (It would be interesting to make a plot of how often the v1 directories were changed in git relative to the v2 directories.)

@chernals
Copy link

Thanks for confirming that this is all available in v2 RC, I'll have a look.

@agoose77
Copy link
Collaborator

agoose77 commented Sep 28, 2022

We don't yet have an RC out for v2. If you've used pipx and have docker installed, you can clone main and run

# Change XX for e.g. 39 (Python 3.9)
pipx run cibuildwheel . --only cpXX-manylinux_x86_64 

Here, cpXX-manylinux_x86_64 is the wheel tag to build.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
policy Choice of behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants