C++ refactoring: handle datetime and timedelta #1149

ianna · 2021-11-15T10:47:01Z

test reducers on datetime and timedelta at axis None
sort and argsort datetime and timedelta types
unique and is_unique for datetime and timedelta types

codecov · 2021-11-15T10:56:54Z

Codecov Report

Merging #1149 (0fb8452) into main (f795658) will increase coverage by 0.35%.
The diff coverage is 77.61%.

Impacted Files	Coverage Δ
src/awkward/_v2/forms/bitmaskedform.py	`75.64% <ø> (ø)`
src/awkward/_v2/forms/bytemaskedform.py	`76.05% <ø> (ø)`
src/awkward/_v2/forms/emptyform.py	`76.92% <ø> (ø)`
src/awkward/_v2/forms/indexedoptionform.py	`76.81% <ø> (ø)`
src/awkward/_v2/forms/listform.py	`75.94% <ø> (ø)`
src/awkward/_v2/forms/listoffsetform.py	`80.55% <ø> (ø)`
src/awkward/_v2/forms/recordform.py	`66.46% <ø> (ø)`
src/awkward/_v2/forms/regularform.py	`75.34% <ø> (ø)`
src/awkward/_v2/forms/unionform.py	`76.19% <ø> (ø)`
src/awkward/_v2/forms/unmaskedform.py	`74.57% <ø> (ø)`
... and 50 more

ianna · 2021-11-15T13:19:09Z

@jpivarski - I think, datetime/timedelta is done. Please, have a look. Thanks!

Unrelated to this PR: I'm not sure about is the reducers default axis. Should it be -1 or None?

jpivarski · 2021-11-15T13:38:41Z

Unrelated to this PR: I'm not sure about is the reducers default axis. Should it be -1 or None?

Unfortunately, it needs to be None. We'd like it to be -1 because that's our most common use-case, but they override NumPy functions and those NumPy functions have a default of None:

>>> np.sum([[1, 2], [3, 4]])
10

jpivarski

This PR looks fine—removing NumpyArray wrappers from intermediate arrays is conceptually simpler.

This doesn't add any tests, though. The PR I took over last Friday had some un-activated tests, and I thought you were planning on activating them (uncommenting, I think—either that or removing pytest.mark.skip) and making them work. When I took over the PR, I did not do that. I don't want them to slip through the cracks!

jpivarski · 2021-11-15T13:43:35Z

When I get back in half an hour, I can point out which tests those are. This PR can be merged, if it's not going to cover them.

ianna · 2021-11-15T14:38:51Z

This PR looks fine—removing NumpyArray wrappers from intermediate arrays is conceptually simpler.

This doesn't add any tests, though. The PR I took over last Friday had some un-activated tests, and I thought you were planning on activating them (uncommenting, I think—either that or removing pytest.mark.skip) and making them work. When I took over the PR, I did not do that. I don't want them to slip through the cracks!

Yes, three tests have been enabled and the other three need more work.

Awkward 1.0 converted the following strings to datetime:

array0 = ak._v2.contents.NumpyArray(
            ["2019-09-02T09:30:00", "2019-09-13T09:30:00", "2019-09-21T20:00:00"]
        )

while Awkward 2.0 treats them as an unsupported dtype: dtype('<U19')

The following Array's layout is a ListOffsetArray of a UnionArray and _completely_flatten does not handle it:

    array = ak._v2.highlevel.Array(
        [
            [
                np.datetime64("2020-03-27T10:41:11"),
                np.datetime64("2020-01-27T10:41:11"),
                np.datetime64("2020-05"),
                np.datetime64("2020-01-27T10:41:11"),
                np.datetime64("2020-04-27T10:41:11"),
            ],
            [
                np.datetime64("2020-04-27"),
                np.datetime64("2020-02-27T10:41:11"),
                np.datetime64("2020-01-27T10:41:11"),
                np.datetime64("2020-06-27T10:41:11"),
            ],
            [
                np.datetime64("2020-02-27T10:41:11"),
                np.datetime64("2020-03-27T10:41:11"),
                np.datetime64("2020-01-27T10:41:11"),
            ],
        ]
    )

>>> array.layout
<ListOffsetArray len='3'>
    <offsets><Index dtype='int64' len='4'>[ 0  5  9 12]</Index></offsets>
    <content><UnionArray len='12'>
        <tags><Index dtype='int8' len='12'>[0 0 1 0 0 2 0 0 0 0 0 0]</Index></tags>
        <index><Index dtype='int64' len='12'>[0 1 0 2 3 0 4 5 6 7 8 9]</Index></index>
        <content index='0'>
            <NumpyArray dtype='datetime64[s]' len='10'>
                ['2020-03-27T10:41:11' '2020-01-27T10:41:11'
                 '2020-01-27T10:41:11' '2020-04-27T10:41:11'
                 '2020-02-27T10:41:11' '2020-01-27T10:41:11'
                 '2020-06-27T10:41:11' '2020-02-27T10:41:11'
                 '2020-03-27T10:41:11' '2020-01-27T10:41:11']
            </NumpyArray>
        </content>
        <content index='1'>
            <NumpyArray dtype='datetime64[M]' len='1'>['2020-05']</NumpyArray>
        </content>
        <content index='2'>
            <NumpyArray dtype='datetime64[D]' len='1'>['2020-04-27']</NumpyArray>
        </content>
    </UnionArray></content>
</ListOffsetArray>
>>>

a test that uses ak.to_numpy

jpivarski · 2021-11-15T15:13:19Z

Awkward 1.0 converted the following strings to datetime:
array0 = ak._v2.contents.NumpyArray(
            ["2019-09-02T09:30:00", "2019-09-13T09:30:00", "2019-09-21T20:00:00"]
        )
while Awkward 2.0 treats them as an unsupported dtype: dtype('<U19')

The connection to NumPy is more direct now—we no longer have to do the .asint64().view(np.datetime64) dance because NumpyArray's data does not go through the Python buffer protocol (which lacks datetimes/timedeltas). If satisfying old tests would mean putting in extra rules to modify NumPy's behavior, then we should change the old tests.

In this case, NumPy interprets the data as strings:

>>> np.array(["2019-09-02T09:30:00", "2019-09-13T09:30:00", "2019-09-21T20:00:00"])
array(['2019-09-02T09:30:00', '2019-09-13T09:30:00',
       '2019-09-21T20:00:00'], dtype='<U19')

>>> np.asarray(["2019-09-02T09:30:00", "2019-09-13T09:30:00", "2019-09-21T20:00:00"])
array(['2019-09-02T09:30:00', '2019-09-13T09:30:00',
       '2019-09-21T20:00:00'], dtype='<U19')

And it should, because making the output type depend on the formatting of the strings would make it less predictable. (Suppose someone didn't know about ISO time formatting and they just wanted strings that look like this.) Even Pandas won't interpret strings as dates unless a specific option is passed (parse_dates).

So this old test was wrong: it has to be given a NumPy array with datetime dtype.

Even Python datetime objects don't work:

>>> from datetime import datetime
>>> np.array([datetime.now(), datetime.now(), datetime.now()])
array([datetime.datetime(2021, 11, 15, 9, 0, 13, 584111),
       datetime.datetime(2021, 11, 15, 9, 0, 13, 584115),
       datetime.datetime(2021, 11, 15, 9, 0, 13, 584116)], dtype=object)

We won't (ever) support dtype=object. The NumpyArray class is relatively low-level, it doesn't have to have a nice, convenient interface. It would be more reasonable for ak.Array to recognize datetime objects, but a nicety like that doesn't need to be implemented right away.

The following Array's layout is a ListOffsetArray of a UnionArray and _completely_flatten does not handle it:

    array = ak._v2.highlevel.Array(
        [
            [
                np.datetime64("2020-03-27T10:41:11"),
                np.datetime64("2020-01-27T10:41:11"),
                np.datetime64("2020-05"),
                np.datetime64("2020-01-27T10:41:11"),
                np.datetime64("2020-04-27T10:41:11"),
            ],
            [
                np.datetime64("2020-04-27"),
                np.datetime64("2020-02-27T10:41:11"),
                np.datetime64("2020-01-27T10:41:11"),
                np.datetime64("2020-06-27T10:41:11"),
            ],
            [
                np.datetime64("2020-02-27T10:41:11"),
                np.datetime64("2020-03-27T10:41:11"),
                np.datetime64("2020-01-27T10:41:11"),
            ],
        ]
    )

This is the first test of completely_flatten for UnionArray. Here's what was missing:

--- a/src/awkward/_v2/contents/unionarray.py
+++ b/src/awkward/_v2/contents/unionarray.py
@@ -921,7 +921,7 @@ class UnionArray(Content):
     def _completely_flatten(self, nplike, options):
         out = []
         for content in self._contents:
-            out.extend(content[: self._length]._completely_flatten(nplike, options))
+            out.extend(content[: len(self._tags)]._completely_flatten(nplike, options))
         return out
 
     def _recursively_apply(

Then you can

>>> array.layout.completely_flatten()
array(['2020-03-27T10:41:11', '2020-01-27T10:41:11',
       '2020-01-27T10:41:11', '2020-04-27T10:41:11',
       '2020-02-27T10:41:11', '2020-01-27T10:41:11',
       '2020-06-27T10:41:11', '2020-02-27T10:41:11',
       '2020-03-27T10:41:11', '2020-01-27T10:41:11',
       '2020-05-01T00:00:00', '2020-04-27T00:00:00'],
      dtype='datetime64[s]')

and work with the single NumPy array that results.

a test that uses ak.to_numpy

ak._v2.operations.convert.to_numpy hasn't been implemented (refactored) yet, but there have been a few places where I've wanted it. Getting this updated would be great if there's enough context to see how to do that.

I don't think to_numpy should use recursively_apply because it does something unique at every node level. (It's not like these functions that descend until they get to an axis and then wrap up the result into the original structure: to_numpy turns every level into one NumPy array.) The current implementation might have a long if-elif-elif chain. That would be better as methods on all the Content subclasses. (Let Python's dispatch do the "else if isinstance".) This is similar to the to_arrow refactoring—it used to be an if-elif-elif chain, but now it's a method on all the Content subclasses. from_arrow and from_numpy aren't, since they're going in the opposite direction.

ianna · 2021-11-15T15:47:58Z

@jpivarski - thanks! I've enabled all tests and started on to_numpy here - or should it be a separate PR?

jpivarski · 2021-11-15T15:53:37Z

@jpivarski - thanks! I've enabled all tests and started on to_numpy here - or should it be a separate PR?

If it's easy enough to separate into another PR, then let's do it. (The only reason I sometimes don't is because there's so much to do and waiting for a prerequisite to finish testing and merge before starting the next would slow down development. This may be the end of the day for you; if you're planning to start working on to_numpy tomorrow, then a new PR makes a lot of sense.)

Is this one ready to merge as-is?

jpivarski

I've scanned it again and it looks good to me!

ianna · 2021-11-15T15:57:09Z

@jpivarski - thanks! I've enabled all tests and started on to_numpy here - or should it be a separate PR?

If it's easy enough to separate into another PR, then let's do it. (The only reason I sometimes don't is because there's so much to do and waiting for a prerequisite to finish testing and merge before starting the next would slow down development. This may be the end of the day for you; if you're planning to start working on to_numpy tomorrow, then a new PR makes a lot of sense.)

Is this one ready to merge as-is?

Yes, I think so. I'll open a new PR tomorrow with to_numpy as I'd like to add more tests.

enable tests for datetime and timedelta reducers at axis None

86db943

argsort/sort datetime and timedelta

c2bd77f

ianna changed the title ~~C++ refactoring: test datetime and timedelta reducers at axis None~~ C++ refactoring: handle datetime and timedelta Nov 15, 2021

final cleanup

e750b09

ianna requested a review from jpivarski November 15, 2021 13:19

jpivarski approved these changes Nov 15, 2021

View reviewed changes

forgot to add

36f309e

enable two more tests

206419f

ianna added 2 commits November 15, 2021 16:39

address Jim's comments

14b2bce

ceanup

0fb8452

jpivarski approved these changes Nov 15, 2021

View reviewed changes

jpivarski enabled auto-merge (squash) November 15, 2021 16:05

jpivarski merged commit 06dfff9 into main Nov 15, 2021

jpivarski deleted the ianna/datetime64-timedelta64-in-v2 branch November 15, 2021 16:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

C++ refactoring: handle datetime and timedelta #1149

C++ refactoring: handle datetime and timedelta #1149

ianna commented Nov 15, 2021 •

edited

Loading

codecov bot commented Nov 15, 2021 •

edited

Loading

ianna commented Nov 15, 2021

jpivarski commented Nov 15, 2021

jpivarski left a comment

jpivarski commented Nov 15, 2021

ianna commented Nov 15, 2021 •

edited

Loading

jpivarski commented Nov 15, 2021

ianna commented Nov 15, 2021

jpivarski commented Nov 15, 2021

jpivarski left a comment

ianna commented Nov 15, 2021

C++ refactoring: handle datetime and timedelta #1149

C++ refactoring: handle datetime and timedelta #1149

Conversation

ianna commented Nov 15, 2021 • edited Loading

codecov bot commented Nov 15, 2021 • edited Loading

Codecov Report

ianna commented Nov 15, 2021

jpivarski commented Nov 15, 2021

jpivarski left a comment

Choose a reason for hiding this comment

jpivarski commented Nov 15, 2021

ianna commented Nov 15, 2021 • edited Loading

jpivarski commented Nov 15, 2021

ianna commented Nov 15, 2021

jpivarski commented Nov 15, 2021

jpivarski left a comment

Choose a reason for hiding this comment

ianna commented Nov 15, 2021

ianna commented Nov 15, 2021 •

edited

Loading

codecov bot commented Nov 15, 2021 •

edited

Loading

ianna commented Nov 15, 2021 •

edited

Loading