Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Awkward v2 update #620

Merged
merged 6 commits into from
Jul 8, 2022
Merged

Awkward v2 update #620

merged 6 commits into from
Jul 8, 2022

Conversation

kkothari2001
Copy link
Collaborator

No description provided.

@kkothari2001
Copy link
Collaborator Author

kkothari2001 commented Jun 22, 2022

The above commit has all instances of awkward.layout removed and replaced with awkward.contents and awkward.index. Classes which were separate in akv1 but are now combined in akv2, for eg ListOffsetArray64 have been replaced by their corresponding classes in akv2.

@jpivarski There are 6 places in uproot.interpretation.library where I need some help in deciding what to do. I have marked all these locations with # PLEASE SEE: ...
2 of them were in the form cls = getattr(awkward.layout, form["class"]). Now, I'm not really sure if I should have replaced it with awkward.contents or awkward.index. With some minor context from other functions, I decided to go with awkward.contents for now. Please suggest. For the other 4, some tests are failing because we aren't initialising the RecordArray class right. Changing keys to fields didn't seem to work. Please suggest.

Aryan's work had a few instances of awkward.layout so I changed that too, I'll revert before merging if those changes were not to be done, or if they can possibly interfere with his work.

Lastly as discussed, the typeparser was added to v2. The release hasn't made it to pypi at the time of writing, so I just did ip install -U git+https://github.com/scikit-hep/awkward@main to test it out, which gave me the correct 1.9.0rc6 version. But despite that, the number of failing tests just went from 57 to 52. A lot of tests are failing due to TypeError: object of type 'awkward._ext.ArrayType' has no len() that is called somewhere down the stacktrace of the newly added from_datashape() function.

Essentially this had the same effect as directly importing ak_v1 inline and using that v1 parser, as the same tests fail, and the same tests are resolved (the 5 which were failing earlier). Please have a look at the last few tests in the CI for the error.

@kkothari2001
Copy link
Collaborator Author

kkothari2001 commented Jun 22, 2022

I found 7 major ways in which the tests are failing:

  1. awkward.operations.convert.to_list() throws TypeError: use ak._v2.operations.convert.to_list for v2 arrays (for now)
    Edit: I just realised this is because of the test importing ak_v1 and then using ak.to_list(). We can just remove these tests for now.

  2. awkward.operations.describe.type() throws

E           TypeError: unrecognized array type: <Array [{x: 1.1, y: 1, z: 97}, ..., {...}] type='5 * {x: float64, y: int32,...'>
E
E           (https://github.com/scikit-hep/awkward-1.0/blob/1.9.0rc6/src/awkward/operations/describe.py#L191)

I believe this is because somewhere in the tests we might be using ak_v1 arrays and that is conflicting.

  1. tests/test_0034-generic-objects-in-ttrees.py is particularly troublesome because it is comparing 2 json strings where one actually has ak_v1-specific classes like ListOffsetArray64. So all the tests here fail somewhere along _v2.types.numpytype.py.

  2. Another error is AttributeError: type object 'Form' has no attribute 'fromjson' for which I'm sure there is a good replacement. Please suggest.

E       AssertionError: assert False
E        +  where False = isinstance(<Array [{i8: -15, f8: -14.9}, ..., {...}] type='420 * {i8: int64, f8: float64}'>, <class 'awkward.highlevel.Array'>)
E        +    where <class 'awkward.highlevel.Array'> = <module 'awkward' from '/home/kmk/uprootall/uproot5/venv/lib/python3.8/site-packages/awkward/__init__.py'>.Array

I'm not sure what this is, but you provided a few notes on the highlevel submodule, I'll go through that, some code and get back to this test.

  1. The changed API for RecordArray is causing issues, as mentioned in the comment above. TypeError: __init__() got an unexpected keyword argument 'keys'

  2. Lastly the various TypeErrors caused during parsing through the from _datashape() function.
    Examples are:
    TypeError: not a NumPy dtype or an Awkward datashape: int32
    TypeError: not a NumPy dtype or an Awkward datashape: {"x": float64, "y": 3 * float64} etc...

But somewhere along the stacktrace the typeparser code in awkward is trying to use PrimitiveType and RecordType which as you explained were replaced with corresponding Numpy types.
Examples of errors thrown are:
TypeError: object of type 'awkward._ext.RecordType' has no len()
TypeError: object of type 'awkward._ext.PrimitiveType' has no len()

Comment on lines 433 to 435
cls = getattr(awkward.layout, form["class"])
cls = getattr(
awkward.contents, form["class"]
) # PLEASE SEE: I'M NOT SURE WHAT TO DO HERE! PLS CONFIRM
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This form is apparently a dict that came from JSON-parsing the JSON serialization of a Form. Every JSON object representing a node of a Form tree has a "class" attribute, which names the associated Content class.

So, for instance, you know that a particular JSON node represents a ListOffsetForm if the "class" is "ListOffsetArray32", "ListOffsetArray64", or an unqualified name like "ListOffsetArray". (This JSON has to be backward compatible, so it really can have the "64" on it or not, regardless of whether it's v1 or v2.)

This code would only have worked for the "ListOffsetArray32", "ListOffsetArray64" kinds of strings in v1, so it's not completely valid anyway. What it did was getattr the Content subclass by name, such as awkward.layout.ListOffsetArray64.

Maybe the right thing to do is to have a helper function:

def _content_cls_from_name(name):
    if name.endswith("32") or name.endswith("64"):
        name = name[-2:]
    elif name.endswith("U32"):
        name = name[-3:]
    elif name.endswith("8_32") or name.endswith("8_64"):
        name = name[-4:]
    elif name.endswith("8_U32"):
        name = name[-5:]
    return getattr(awkward.contents, name)

These are all of the possible suffixes that have to be ignored in v2. Since it's a fixed set of classes, it could have been a dict mapping names to class objects, but then that dict would have to be updated if a new Content subclass is ever added to Awkward Array (unlikely). This, at least, is open-minded about new Content subclasses, as long as they stay with a limited set of suffixes.

Comment on lines 444 to 448
cls = getattr(awkward.layout, form["class"])
cls = getattr(
awkward.contents, form["class"]
) # PLEASE SEE: AND HERE, PLS CONFIRM
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same.

Comment on lines 624 to 626
return awkward.Array(
awkward.contents.RecordArray([], keys=[])
) # PLEASE SEE: RECORD ARRAY API seems to have changed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes: "keys" is now named "fields" and it is no longer an optional argument.

Suggested change
return awkward.Array(
awkward.contents.RecordArray([], keys=[])
) # PLEASE SEE: RECORD ARRAY API seems to have changed
return awkward.Array(awkward.contents.RecordArray([], []))

Comment on lines 663 to 665
out = awkward.Array(
awkward.contents.RecordArray([], keys=[])
) # PLEASE SEE: RECORD ARRAY API seems to have changed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same.

Suggested change
out = awkward.Array(
awkward.contents.RecordArray([], keys=[])
) # PLEASE SEE: RECORD ARRAY API seems to have changed
out = awkward.Array(awkward.contents.RecordArray([], []))

Comment on lines 691 to 693
awkward.contents.RecordArray(
[], keys=[]
) # PLEASE SEE: RECORD ARRAY API seems to have changed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same.

Suggested change
awkward.contents.RecordArray(
[], keys=[]
) # PLEASE SEE: RECORD ARRAY API seems to have changed
awkward.contents.RecordArray([], [])

Comment on lines 703 to 705
awkward.contents.RecordArray(
[], keys=[]
) # PLEASE SEE: RECORD ARRAY API seems to have changed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same.

Suggested change
awkward.contents.RecordArray(
[], keys=[]
) # PLEASE SEE: RECORD ARRAY API seems to have changed
awkward.contents.RecordArray([], [])

@jpivarski
Copy link
Member

I need some help in deciding what to do. I have marked all these locations with # PLEASE SEE: ...

Done.

Lastly as discussed, the typeparser was added to v2. The release hasn't made it to pypi at the time of writing, so I just did ip install -U git+https://github.com/scikit-hep/awkward@main to test it out, which gave me the correct 1.9.0rc6 version. But despite that, the number of failing tests just went from 57 to 52. A lot of tests are failing due to TypeError: object of type 'awkward._ext.ArrayType' has no len() that is called somewhere down the stacktrace of the newly added from_datashape() function.

I'll be fixing (mostly replacing) the new from_datashape pretty soon.

This error doesn't sound like it's related, since awkward._ext.* is a bunch of v1 classes (with the exception of the ForthMachine that Aryan is using).

If you need to make pytest's output less noisy, you can put a

@pytest.mark.skip(reason="FIXME: remove this mark before end of PR")

decorator on every test that fails except the one you're working on. The pytest output will list a lot of skipped tests at the end. Then, you can leave the ones related to from_datashape until the end.

  1. awkward.operations.convert.to_list() throws TypeError: use ak._v2.operations.convert.to_list for v2 arrays (for now)
    Edit: I just realised this is because of the test importing ak_v1 and then using ak.to_list(). We can just remove these tests for now.

Better yet: replace the v1 to_list with a v2 to_list. The feature of having v1 functions raise an exception when they encounter a v2 object was to help find these issues. Not all of the tests have such checks, unfortunately, so we can't rely on it as an absolute indicator, but it helps with the ones that it does catch.

  1. awkward.operations.describe.type() throws

I believe this is because somewhere in the tests we might be using ak_v1 arrays and that is conflicting.

This awkward.operations.describe.type is a v1 function; it should be replaced with v2.

On Zoom, I said that the submodules within awkward._v2.operations were flattened, so

awkward.operations.describe.type

becomes

awkward._v2.operations.type

But for type specifically, ak.Arrays and ak.Records now have a type property that can be used directly; no need to use the function.

  1. tests/test_0034-generic-objects-in-ttrees.py is particularly troublesome because it is comparing 2 json strings where one actually has ak_v1-specific classes like ListOffsetArray64. So all the tests here fail somewhere along _v2.types.numpytype.py.

If the new JSON string doesn't have the "64"s, you can drop them from the test. It's changing the test, but we understand why.

  1. Another error is AttributeError: type object 'Form' has no attribute 'fromjson' for which I'm sure there is a good replacement. Please suggest.
awkward._v2.forms.from_json
E       AssertionError: assert False
E        +  where False = isinstance(<Array [{i8: -15, f8: -14.9}, ..., {...}] type='420 * {i8: int64, f8: float64}'>, <class 'awkward.highlevel.Array'>)
E        +    where <class 'awkward.highlevel.Array'> = <module 'awkward' from '/home/kmk/uprootall/uproot5/venv/lib/python3.8/site-packages/awkward/__init__.py'>.Array

Is this isinstance being given a awkward.highlevel.Array when it should be given a awkward._v2.highlevel.Array?

(It's also possible to drop the word "highlevel." from both of the above.)

  1. The changed API for RecordArray is causing issues, as mentioned in the comment above. TypeError: __init__() got an unexpected keyword argument 'keys'

The argument named "keys" has become "fields". (In v1, several different words were used for this concept; in v2, they were all consolidated to just one: "fields".)

  1. Lastly the various TypeErrors caused during parsing through the from _datashape() function.
    Examples are:
    TypeError: not a NumPy dtype or an Awkward datashape: int32
    TypeError: not a NumPy dtype or an Awkward datashape: {"x": float64, "y": 3 * float64} etc...

But somewhere along the stacktrace the typeparser code in awkward is trying to use PrimitiveType and RecordType which as you explained were replaced with corresponding Numpy types. Examples of errors thrown are: TypeError: object of type 'awkward._ext.RecordType' has no len() TypeError: object of type 'awkward._ext.PrimitiveType' has no len()

I don't see a way for v1 types to get into the v2 from_datashape, but I'll be looking into that soon.

@kkothari2001
Copy link
Collaborator Author

kkothari2001 commented Jul 5, 2022

@jpivarski
I reset the entire branch (thus deleting all my commits) and then merged the branch we were working on. Yet a test seems to be failing. Currently, there is no difference between this branch and the branch we were working on during the meeting.

Is it possible, we might not have pushed a change? I remember you correcting the test that is currently failing.

@jpivarski
Copy link
Member

The failure is in the thing I thought I'd have to change in Awkward (and is therefore temporary):

E             - 1 * [var * (int64, BDSOutputROOTGeant4Data::ParticleInfo[name: string, charge: int64, mass: float64]), parameters={"__array__": "sorted_map"}]
E             + 1 * [var * (int64, struct[{name: string, charge: int64, mass: float64}, parameters={"__record__": "BDSOutputROOTGeant4Data::ParticleInfo"}]), parameters={"__array__": "sorted_map"}]

The question is whether a record name like "BDSOutputROOTGeant4Data::ParticleInfo" can be put before the square brackets, in place of "struct". The version of Awkward that CI is using has a different opinion, so just change this test to what Awkward wants it to be. We might have to change it again because I don't think a name with "::" in it should be allowed to do that. (So I guess that means that the latest-released Awkward is correct, and the main branch needs to be reverted, at least for cases like this.)

@kkothari2001
Copy link
Collaborator Author

kkothari2001 commented Jul 7, 2022

@jpivarski
So while solving the erronous test, I realised that a few more tests that depended on ROOT were also failing.

In order to solve them, I had to change the arguments of from_datashape. As already discussed, in ak v1 the default for high-level was False, and in v2 the default is True. Due to this change, the instances of from_datashape in uproot now have to explicitly mention highlevel=False.

Now, based on our meet, we did talk about how maybe it should have been highlevel=True, but assuming the code before the update to v2 was correct, I think that the outermost length of the awkward array has already been dealt with in this case. Please have a look and suggest accordingly.

@kkothari2001 kkothari2001 marked this pull request as ready for review July 7, 2022 09:15
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved!

But it will get merged after #644 because we want to do a performance test with that one (today), and not updating to v2 yet will make it easier to do that right away.

After the #644 merger into main, I'll help merge that into this branch, and then this can be merged. That should happen later today (I can do that alone) or tomorrow.

@jpivarski
Copy link
Member

As I said in Slack, we're having trouble with #644: it doesn't pass all tests, and so we won't merge it yet. The trouble has been traced back to Awkward Array, and it will take a long time to get an Awkward Array fix into PyPI so that Uproot's CI can see it.

Therefore, we'll merge this one first, and do #644 later.

@jpivarski jpivarski enabled auto-merge (squash) July 8, 2022 22:33
@jpivarski jpivarski merged commit abdb5e3 into main Jul 8, 2022
@jpivarski jpivarski deleted the awkward-v2-update branch July 8, 2022 22:47
@jpivarski
Copy link
Member

@all-contributors please add @kkothari2001 for code

@allcontributors
Copy link
Contributor

@jpivarski

I've put up a pull request to add @kkothari2001! 🎉

@jpivarski
Copy link
Member

@all-contributors please add @kkothari2001 for code

@allcontributors
Copy link
Contributor

@jpivarski

@kkothari2001 already contributed before to code

Moelf pushed a commit to Moelf/uproot5 that referenced this pull request Aug 1, 2022
* Changed everything we _think_ we need to change.

* All tests pass.

* Change test 0034

* Alter instances of from_datashape

* Removed 'v1/v2 insensitivity' checks. We're all-in for v2 now.

Co-authored-by: Jim Pivarski <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants