DataFrame interchange protocol: should NaT be like NaN or a sentinel? #64

jorisvandenbossche · 2021-09-13T15:40:10Z

In the describe_null we currently list the following options:

0 : non-nullable
1 : NaN/NaT
2 : sentinel value
3 : bit mask
4 : byte mask
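
For reference, the options above can be written as a small enum. This is only a sketch: the class and member names are illustrative and not taken from this thread, and I am assuming that `describe_null` returns a `(kind, value)` pair, as the later `describe_null[0]` pseudo-code suggests.

```python
from enum import IntEnum


class NullKind(IntEnum):
    """Illustrative names; the integer values mirror the list above."""
    NON_NULLABLE = 0
    USE_NAN = 1        # currently documented as NaN/NaT
    USE_SENTINEL = 2
    USE_BITMASK = 3
    USE_BYTEMASK = 4


# What is documented today for a datetime column using NaT (value unused):
nat_today = (NullKind.USE_NAN, None)

# What this issue proposes instead. Assumption: NaT is stored as the
# smallest int64 value, which is what numpy does.
nat_proposed = (NullKind.USE_SENTINEL, -(2**63))
```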

While looking at the pandas implementation, I was wondering if we shouldn't treat NaT differently from NaN and see it as a sentinel value (option 2 in the list above).

While NaN could also be seen as a kind of sentinel value, there are some clear differences: NaN is a floating-point concept backed by the IEEE 754 standard (while as far as I know "NaT" is quite numpy-specific? E.g. Arrow doesn't support it). NaNs also evaluate as non-equal (following the standard). For datetime64 with NaT that's also the case in numpy, but if you view the data as int64 it's not (and e.g. for dlpack those values will be regarded as int64? And the actual Buffer object might be agnostic to it).
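
The equality difference can be shown with the standard library alone. This is a rough sketch rather than anything from the pandas implementation: the `array` objects stand in for raw buffers, and the NaT value is an assumption based on numpy storing NaT as the smallest int64.

```python
from array import array

# Assumption: NaT is stored as the smallest int64 value, as numpy does.
NAT_SENTINEL = -(2**63)

floats = array("d", [1.5, float("nan")])
# Raw int64 storage of a datetime column (nanoseconds since the epoch).
datetimes = array("q", [1_631_547_610_000_000_000, NAT_SENTINEL])

# NaN is detectable from float semantics alone: it is unequal to itself.
nan_mask = [x != x for x in floats]

# Viewed as int64, NaT equals itself, so the same trick finds nothing.
naive_mask = [x != x for x in datetimes]

# The consumer has to be told the sentinel value explicitly.
sentinel_mask = [x == NAT_SENTINEL for x in datetimes]
```

A consumer that only sees an int64 buffer therefore cannot find the missing values unless the sentinel is communicated, which is what option 2 does.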


rgommers · 2021-09-22T14:52:51Z

Yes I agree, NaT is a custom thing so is best treated as a sentinel value. Thanks for pointing that out! I'll open a PR to change that.

rgommers · 2022-09-13T11:43:54Z

This was fixed in gh-74. I just noticed that that change is not yet reflected in the deployed docs at https://data-apis.org/dataframe-protocol/latest/API.html, so I'll fix that now. The content in this repo is fine though, so I'll close this issue.

jorisvandenbossche · 2022-09-13T12:34:31Z

Do we have a specific test for this in https://github.com/data-apis/dataframe-interchange-tests? That might be useful to ensure that all implementations follow this (although of course not every implementation necessarily uses NaT for missing values in datetime data, so this might be hard to test library-independently. But a minimum test could be to ensure it does not indicate to use NaN)

honno · 2022-09-13T12:47:31Z

> Do we have a specific test for this in https://github.com/data-apis/dataframe-interchange-tests? That might be useful to ensure that all implementations follow this (although of course not every implementation necessarily uses NaT for missing values in datetime data, so this might be hard to test library-independently. But a minimum test could be to ensure it does not indicate to use NaN)

~~Currently there are no assertions on values when sentinel (kind=2), but I like your idea of ensuring it's not None-y. Though I suppose None/float("nan")/etc. are valid sentinel values?~~ Oh if you mean test datetime columns aren't using kind=1, yeah that's not tested but makes sense, so will explore.

jorisvandenbossche · 2022-09-13T13:00:42Z

> Though I suppose None/float("nan")/etc. are valid sentinel values?

Theoretically yes (I don't think we said anywhere that it is not allowed), but since we have a specific flag for NaN, I think we should strongly recommend using that flag (1) instead of a sentinel (2) with value NaN (or maybe even consider disallowing this).

(note that None can never be a valid sentinel, since we only support primitive arrays right now (no object dtype), so those can never hold None)
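
To make the two points above concrete, here is a minimal sketch of a check on a `(kind, value)` pair. The function name and messages are hypothetical, and the NaN case is only a recommendation, since nothing discussed here forbids it outright.

```python
import math
from typing import Any, List, Tuple

USE_NAN = 1       # flag 1: NaN
USE_SENTINEL = 2  # flag 2: sentinel value


def sentinel_problems(describe_null: Tuple[int, Any]) -> List[str]:
    """Return complaints about a (kind, value) pair; empty if none apply."""
    kind, value = describe_null
    if kind != USE_SENTINEL:
        return []
    if value is None:
        # Primitive arrays cannot hold None, so it cannot be a sentinel.
        return ["None cannot be a sentinel for a primitive array"]
    if isinstance(value, float) and math.isnan(value):
        # Allowed in theory, but the dedicated NaN flag is preferable.
        return ["prefer kind=1 (NaN) over a sentinel whose value is NaN"]
    return []
```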

The test I am thinking about would involve a dtype specific check. In pseudo-code (for that test you linked to):

if col.dtype == "datetime":
    assert col.describe_null[0] != 1

with the idea that a datetime dtype can never have NaNs, since it is backed by an int array (although it is not really clear whether the spec requires that last part ...)
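
A slightly more concrete, self-contained version of that pseudo-code might look as follows. It is a sketch under several assumptions: `FakeColumn` is a hypothetical stand-in for a real protocol column, `dtype` is taken to be a `(kind, bit width, format string, endianness)` tuple, and the `DtypeKind` values and the `"tsn:"` format string reflect my reading of the spec rather than anything stated in this thread.

```python
from enum import IntEnum
from typing import Any, NamedTuple, Tuple


class DtypeKind(IntEnum):
    # Only the kinds needed here; values are assumed from the spec.
    FLOAT = 2
    DATETIME = 22


USE_NAN = 1  # describe_null kind for NaN


class FakeColumn(NamedTuple):
    """Hypothetical stand-in for a protocol Column object."""
    dtype: Tuple[int, int, str, str]  # (kind, bit width, format, endianness)
    describe_null: Tuple[int, Any]    # (null kind, value)


def datetime_null_kind_ok(col) -> bool:
    """Return False only for a datetime column that claims NaN as its null."""
    if col.dtype[0] == DtypeKind.DATETIME:
        return col.describe_null[0] != USE_NAN
    return True


good = FakeColumn((DtypeKind.DATETIME, 64, "tsn:", "="), (2, -(2**63)))
bad = FakeColumn((DtypeKind.DATETIME, 64, "tsn:", "="), (USE_NAN, None))
floats = FakeColumn((DtypeKind.FLOAT, 64, "g", "="), (USE_NAN, None))
```

In a test suite the check would run against each library's real column objects; the fake columns here only make the snippet runnable on its own.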

honno · 2022-09-13T13:02:30Z

(looks like I edited my comment just as you submitted yours heh)

> The test I am thinking about would involve a dtype specific check. In pseudo-code (for that test you linked to):
>
>     if col.dtype == "datetime":
>         assert col.describe_null[0] != 1

Ayup makes sense, will explore thanks.

rgommers added the interchange-protocol label Sep 22, 2021

rgommers closed this as completed Sep 13, 2022
