Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

C++ refactoring: handle datetime and timedelta #1149

Merged
merged 7 commits into from
Nov 15, 2021

Conversation

ianna
Copy link
Collaborator

@ianna ianna commented Nov 15, 2021

  • test reducers on datetime and timedelta at axis None

  • sort and argsort datetime and timedelta types

  • unique and is_unique for datetime and timedelta types

@codecov
Copy link

codecov bot commented Nov 15, 2021

Codecov Report

Merging #1149 (0fb8452) into main (f795658) will increase coverage by 0.35%.
The diff coverage is 77.61%.

Impacted Files Coverage Δ
src/awkward/_v2/forms/bitmaskedform.py 75.64% <ø> (ø)
src/awkward/_v2/forms/bytemaskedform.py 76.05% <ø> (ø)
src/awkward/_v2/forms/emptyform.py 76.92% <ø> (ø)
src/awkward/_v2/forms/indexedoptionform.py 76.81% <ø> (ø)
src/awkward/_v2/forms/listform.py 75.94% <ø> (ø)
src/awkward/_v2/forms/listoffsetform.py 80.55% <ø> (ø)
src/awkward/_v2/forms/recordform.py 66.46% <ø> (ø)
src/awkward/_v2/forms/regularform.py 75.34% <ø> (ø)
src/awkward/_v2/forms/unionform.py 76.19% <ø> (ø)
src/awkward/_v2/forms/unmaskedform.py 74.57% <ø> (ø)
... and 50 more

@ianna ianna changed the title C++ refactoring: test datetime and timedelta reducers at axis None C++ refactoring: handle datetime and timedelta Nov 15, 2021
@ianna
Copy link
Collaborator Author

ianna commented Nov 15, 2021

@jpivarski - I think, datetime/timedelta is done. Please, have a look. Thanks!

Unrelated to this PR: I'm not sure about is the reducers default axis. Should it be -1 or None?

@ianna ianna requested a review from jpivarski November 15, 2021 13:19
@jpivarski
Copy link
Member

Unrelated to this PR: I'm not sure about is the reducers default axis. Should it be -1 or None?

Unfortunately, it needs to be None. We'd like it to be -1 because that's our most common use-case, but they override NumPy functions and those NumPy functions have a default of None:

>>> np.sum([[1, 2], [3, 4]])
10

Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR looks fine—removing NumpyArray wrappers from intermediate arrays is conceptually simpler.

This doesn't add any tests, though. The PR I took over last Friday had some un-activated tests, and I thought you were planning on activating them (uncommenting, I think—either that or removing pytest.mark.skip) and making them work. When I took over the PR, I did not do that. I don't want them to slip through the cracks!

@jpivarski
Copy link
Member

When I get back in half an hour, I can point out which tests those are. This PR can be merged, if it's not going to cover them.

@ianna
Copy link
Collaborator Author

ianna commented Nov 15, 2021

This PR looks fine—removing NumpyArray wrappers from intermediate arrays is conceptually simpler.

This doesn't add any tests, though. The PR I took over last Friday had some un-activated tests, and I thought you were planning on activating them (uncommenting, I think—either that or removing pytest.mark.skip) and making them work. When I took over the PR, I did not do that. I don't want them to slip through the cracks!

Yes, three tests have been enabled and the other three need more work.

  1. Awkward 1.0 converted the following strings to datetime:
array0 = ak._v2.contents.NumpyArray(
            ["2019-09-02T09:30:00", "2019-09-13T09:30:00", "2019-09-21T20:00:00"]
        )

while Awkward 2.0 treats them as an unsupported dtype: dtype('<U19')

  1. The following Array's layout is a ListOffsetArray of a UnionArray and _completely_flatten does not handle it:
    array = ak._v2.highlevel.Array(
        [
            [
                np.datetime64("2020-03-27T10:41:11"),
                np.datetime64("2020-01-27T10:41:11"),
                np.datetime64("2020-05"),
                np.datetime64("2020-01-27T10:41:11"),
                np.datetime64("2020-04-27T10:41:11"),
            ],
            [
                np.datetime64("2020-04-27"),
                np.datetime64("2020-02-27T10:41:11"),
                np.datetime64("2020-01-27T10:41:11"),
                np.datetime64("2020-06-27T10:41:11"),
            ],
            [
                np.datetime64("2020-02-27T10:41:11"),
                np.datetime64("2020-03-27T10:41:11"),
                np.datetime64("2020-01-27T10:41:11"),
            ],
        ]
    )
>>> array.layout
<ListOffsetArray len='3'>
    <offsets><Index dtype='int64' len='4'>[ 0  5  9 12]</Index></offsets>
    <content><UnionArray len='12'>
        <tags><Index dtype='int8' len='12'>[0 0 1 0 0 2 0 0 0 0 0 0]</Index></tags>
        <index><Index dtype='int64' len='12'>[0 1 0 2 3 0 4 5 6 7 8 9]</Index></index>
        <content index='0'>
            <NumpyArray dtype='datetime64[s]' len='10'>
                ['2020-03-27T10:41:11' '2020-01-27T10:41:11'
                 '2020-01-27T10:41:11' '2020-04-27T10:41:11'
                 '2020-02-27T10:41:11' '2020-01-27T10:41:11'
                 '2020-06-27T10:41:11' '2020-02-27T10:41:11'
                 '2020-03-27T10:41:11' '2020-01-27T10:41:11']
            </NumpyArray>
        </content>
        <content index='1'>
            <NumpyArray dtype='datetime64[M]' len='1'>['2020-05']</NumpyArray>
        </content>
        <content index='2'>
            <NumpyArray dtype='datetime64[D]' len='1'>['2020-04-27']</NumpyArray>
        </content>
    </UnionArray></content>
</ListOffsetArray>
>>> 
  1. a test that uses ak.to_numpy

@jpivarski
Copy link
Member

  1. Awkward 1.0 converted the following strings to datetime:
array0 = ak._v2.contents.NumpyArray(
            ["2019-09-02T09:30:00", "2019-09-13T09:30:00", "2019-09-21T20:00:00"]
        )

while Awkward 2.0 treats them as an unsupported dtype: dtype('<U19')

The connection to NumPy is more direct now—we no longer have to do the .asint64().view(np.datetime64) dance because NumpyArray's data does not go through the Python buffer protocol (which lacks datetimes/timedeltas). If satisfying old tests would mean putting in extra rules to modify NumPy's behavior, then we should change the old tests.

In this case, NumPy interprets the data as strings:

>>> np.array(["2019-09-02T09:30:00", "2019-09-13T09:30:00", "2019-09-21T20:00:00"])
array(['2019-09-02T09:30:00', '2019-09-13T09:30:00',
       '2019-09-21T20:00:00'], dtype='<U19')

>>> np.asarray(["2019-09-02T09:30:00", "2019-09-13T09:30:00", "2019-09-21T20:00:00"])
array(['2019-09-02T09:30:00', '2019-09-13T09:30:00',
       '2019-09-21T20:00:00'], dtype='<U19')

And it should, because making the output type depend on the formatting of the strings would make it less predictable. (Suppose someone didn't know about ISO time formatting and they just wanted strings that look like this.) Even Pandas won't interpret strings as dates unless a specific option is passed (parse_dates).

So this old test was wrong: it has to be given a NumPy array with datetime dtype.

Even Python datetime objects don't work:

>>> from datetime import datetime
>>> np.array([datetime.now(), datetime.now(), datetime.now()])
array([datetime.datetime(2021, 11, 15, 9, 0, 13, 584111),
       datetime.datetime(2021, 11, 15, 9, 0, 13, 584115),
       datetime.datetime(2021, 11, 15, 9, 0, 13, 584116)], dtype=object)

We won't (ever) support dtype=object. The NumpyArray class is relatively low-level, it doesn't have to have a nice, convenient interface. It would be more reasonable for ak.Array to recognize datetime objects, but a nicety like that doesn't need to be implemented right away.

  1. The following Array's layout is a ListOffsetArray of a UnionArray and _completely_flatten does not handle it:
    array = ak._v2.highlevel.Array(
        [
            [
                np.datetime64("2020-03-27T10:41:11"),
                np.datetime64("2020-01-27T10:41:11"),
                np.datetime64("2020-05"),
                np.datetime64("2020-01-27T10:41:11"),
                np.datetime64("2020-04-27T10:41:11"),
            ],
            [
                np.datetime64("2020-04-27"),
                np.datetime64("2020-02-27T10:41:11"),
                np.datetime64("2020-01-27T10:41:11"),
                np.datetime64("2020-06-27T10:41:11"),
            ],
            [
                np.datetime64("2020-02-27T10:41:11"),
                np.datetime64("2020-03-27T10:41:11"),
                np.datetime64("2020-01-27T10:41:11"),
            ],
        ]
    )

This is the first test of completely_flatten for UnionArray. Here's what was missing:

--- a/src/awkward/_v2/contents/unionarray.py
+++ b/src/awkward/_v2/contents/unionarray.py
@@ -921,7 +921,7 @@ class UnionArray(Content):
     def _completely_flatten(self, nplike, options):
         out = []
         for content in self._contents:
-            out.extend(content[: self._length]._completely_flatten(nplike, options))
+            out.extend(content[: len(self._tags)]._completely_flatten(nplike, options))
         return out
 
     def _recursively_apply(

Then you can

>>> array.layout.completely_flatten()
array(['2020-03-27T10:41:11', '2020-01-27T10:41:11',
       '2020-01-27T10:41:11', '2020-04-27T10:41:11',
       '2020-02-27T10:41:11', '2020-01-27T10:41:11',
       '2020-06-27T10:41:11', '2020-02-27T10:41:11',
       '2020-03-27T10:41:11', '2020-01-27T10:41:11',
       '2020-05-01T00:00:00', '2020-04-27T00:00:00'],
      dtype='datetime64[s]')

and work with the single NumPy array that results.

  1. a test that uses ak.to_numpy

ak._v2.operations.convert.to_numpy hasn't been implemented (refactored) yet, but there have been a few places where I've wanted it. Getting this updated would be great if there's enough context to see how to do that.

I don't think to_numpy should use recursively_apply because it does something unique at every node level. (It's not like these functions that descend until they get to an axis and then wrap up the result into the original structure: to_numpy turns every level into one NumPy array.) The current implementation might have a long if-elif-elif chain. That would be better as methods on all the Content subclasses. (Let Python's dispatch do the "else if isinstance".) This is similar to the to_arrow refactoring—it used to be an if-elif-elif chain, but now it's a method on all the Content subclasses. from_arrow and from_numpy aren't, since they're going in the opposite direction.

@ianna
Copy link
Collaborator Author

ianna commented Nov 15, 2021

@jpivarski - thanks! I've enabled all tests and started on to_numpy here - or should it be a separate PR?

@jpivarski
Copy link
Member

@jpivarski - thanks! I've enabled all tests and started on to_numpy here - or should it be a separate PR?

If it's easy enough to separate into another PR, then let's do it. (The only reason I sometimes don't is because there's so much to do and waiting for a prerequisite to finish testing and merge before starting the next would slow down development. This may be the end of the day for you; if you're planning to start working on to_numpy tomorrow, then a new PR makes a lot of sense.)

Is this one ready to merge as-is?

Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've scanned it again and it looks good to me!

@ianna
Copy link
Collaborator Author

ianna commented Nov 15, 2021

@jpivarski - thanks! I've enabled all tests and started on to_numpy here - or should it be a separate PR?

If it's easy enough to separate into another PR, then let's do it. (The only reason I sometimes don't is because there's so much to do and waiting for a prerequisite to finish testing and merge before starting the next would slow down development. This may be the end of the day for you; if you're planning to start working on to_numpy tomorrow, then a new PR makes a lot of sense.)

Is this one ready to merge as-is?

Yes, I think so. I'll open a new PR tomorrow with to_numpy as I'd like to add more tests.

@jpivarski jpivarski enabled auto-merge (squash) November 15, 2021 16:05
@jpivarski jpivarski merged commit 06dfff9 into main Nov 15, 2021
@jpivarski jpivarski deleted the ianna/datetime64-timedelta64-in-v2 branch November 15, 2021 16:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants