DataFrame interchange protocol: should NaT be like NaN or a sentinel? #64
Comments
Yes I agree,
This was fixed in gh-74. I just noticed that that change is not yet reflected in the deployed docs at https://data-apis.org/dataframe-protocol/latest/API.html, so I'll fix that now. The content in this repo is fine though, so I'll close this issue.
Do we have a specific test for this in https://github.com/data-apis/dataframe-interchange-tests? That might be useful to ensure that all implementations follow this. Of course, not every implementation necessarily uses NaT for missing values in datetime data, so this might be hard to test library-independently, but a minimal test could be to ensure it does not indicate the use of NaN.
Theoretically yes (I don't think we said anywhere that it is not allowed), but since we have a specific flag for NaN, I think we should strongly recommend using that flag (1) instead of a sentinel (2) with value NaN, or maybe even consider disallowing the latter. Note that None can never be a valid sentinel: we only support primitive arrays right now (no object dtype), so those can never hold None.
The test I am thinking about would involve a dtype-specific check. In pseudo-code (for that test you linked to):
with the idea that a datetime dtype can never have NaNs, since it is backed by an int array (although it is not really clear whether this last part is required by the spec).
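A minimal sketch of what such a dtype-specific check could look like; this is an illustration, not the pseudo-code from the comment above. The two enums mirror the values used by the interchange protocol, while the helper name check_describe_null and its return convention are hypothetical. It encodes the recommendation discussed here (no NaN flag for datetime data, no NaN as a sentinel value), which the spec may not strictly require.

```python
import enum
import math


class DtypeKind(enum.IntEnum):
    # Values as listed in the interchange protocol.
    INT = 0
    UINT = 1
    FLOAT = 2
    BOOL = 20
    STRING = 21
    DATETIME = 22
    CATEGORICAL = 23


class ColumnNullType(enum.IntEnum):
    # Values as listed in the interchange protocol.
    NON_NULLABLE = 0
    USE_NAN = 1
    USE_SENTINEL = 2
    USE_BITMASK = 3
    USE_BYTEMASK = 4


def check_describe_null(dtype_kind, describe_null):
    """Hypothetical helper: True if (null kind, value) looks sensible for the dtype."""
    null_kind, value = describe_null
    # Assumption from the discussion: datetime data is backed by an int
    # array, so it can never hold NaN and should not report USE_NAN.
    if dtype_kind == DtypeKind.DATETIME and null_kind == ColumnNullType.USE_NAN:
        return False
    # Recommendation from the discussion: use the dedicated NaN flag
    # rather than a sentinel whose value is NaN.
    if null_kind == ColumnNullType.USE_SENTINEL:
        if isinstance(value, float) and math.isnan(value):
            return False
    return True
```

In a test suite, the two arguments would come from a column's dtype (its first element is the kind) and its describe_null property, for example check_describe_null(DtypeKind.DATETIME, (ColumnNullType.USE_SENTINEL, -2**63)).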
(looks like I edited my comment just as you submitted yours heh)
Ayup makes sense, will explore thanks.
In describe_null, we currently list the following options:
While looking at the pandas implementation, I was wondering whether we should treat NaT differently from NaN and see it as a sentinel value (option 2 in the list above).
While NaN could also be seen as a kind of sentinel value, there are some clear differences. NaN is a floating-point concept backed by the IEEE 754 standard, while as far as I know NaT is quite numpy-specific (e.g. Arrow doesn't support it). NaNs also compare as non-equal, following the standard. That is also the case for datetime64 NaT in numpy, but not if you view the data as int64 (and e.g. for dlpack those values will be regarded as int64? And the actual Buffer object might be agnostic to it).
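The equality difference can be sketched with the standard library alone. This assumes NaT is stored as the minimum int64 value, which is what numpy does for datetime64; the variable names are made up for the illustration.

```python
import math
import struct

# Assumption: NaT is stored as the minimum int64 value, as in numpy's datetime64.
NAT_SENTINEL = -2**63

# NaN is an IEEE 754 floating-point value and never compares equal to itself.
nan_equals_itself = math.nan == math.nan

# A datetime buffer viewed as raw int64 holds plain integers, so the NaT
# slot compares equal to the sentinel like any other integer would.
raw = struct.pack("<3q", 0, NAT_SENTINEL, 86_400)
values = struct.unpack("<3q", raw)
null_mask = [v == NAT_SENTINEL for v in values]
```

Here nan_equals_itself is False, while the plain integer comparison marks only the middle element as missing, which is how a consumer would handle a sentinel (option 2) rather than a NaN flag.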