df.select(pl.col(pl.Datetime)) doesn't work if the DataFrame was created with pl.from_pandas #6422

Closed
2 tasks done
2-5 opened this issue Jan 24, 2023 · 3 comments
Labels
bug (Something isn't working), python (Related to Python Polars)

Comments

@2-5
Contributor

2-5 commented Jan 24, 2023

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

df.select(pl.col(pl.Datetime)) doesn't return the datetime column if the DataFrame was created with pl.from_pandas

Reproducible example

from datetime import datetime

import pandas as pd
import polars as pl

df = pd.DataFrame(dict(
    value=[1],
    time=[datetime(2022, 1, 1)],
))

df2 = pl.from_pandas(df)
print(df2)
print(df2.select(pl.col(pl.Datetime)))

Expected behavior

df2.select(pl.col(pl.Datetime)) should return the time column.

Installed versions

---Version info---
Polars: 0.15.16
Index type: UInt32
Platform: Windows-10-10.0.19045-SP0
Python: 3.10.8 (tags/v3.10.8:aaaf517, Oct 11 2022, 16:50:30) [MSC v.1933 64 bit (AMD64)]
---Optional dependencies---
pyarrow: 10.0.1
pandas: 1.5.2
numpy: 1.23.5
fsspec: 2022.11.0
connectorx: <not installed>
xlsx2csv: <not installed>
deltalake: <not installed>
matplotlib: 3.6.2
2-5 added the bug and python labels on Jan 24, 2023
@cmdlineluser
Contributor

cmdlineluser commented Jan 25, 2023

It looks like the cause is a time-unit mismatch: pl.from_pandas parses the datetime column as ns, whereas pl.Datetime defaults to a time_unit of us, and the two are treated as different dtypes.

>>> df2
shape: (1, 2)
┌───────┬─────────────────────┐
│ value ┆ time                │
│ ---   ┆ ---                 │
│ i64   ┆ datetime[ns]        │
╞═══════╪═════════════════════╡
│ 1     ┆ 2022-01-01 00:00:00 │
└───────┴─────────────────────┘
>>> df2.select(pl.col(pl.Datetime))
shape: (0, 0)
┌┐
╞╡
└┘
>>> df2.select(pl.col(pl.Datetime("ns")))
shape: (1, 1)
┌─────────────────────┐
│ time                │
│ ---                 │
│ datetime[ns]        │
╞═════════════════════╡
│ 2022-01-01 00:00:00 │
└─────────────────────┘

If it's of use - I debugged this with:

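# Try every name in pl.datatypes as a selector and report the ones that match a column.
for dt in dir(pl.datatypes):
    try:
        out = df2.select(pl.col(getattr(pl.datatypes, dt)))
    except Exception:
        # Most names are not valid dtype selectors; skip them.
        continue
    if out.height > 0:
        print(f"{dt=}")
        print(out)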

Which resulted in

dt='TEMPORAL_DTYPES'
shape: (1, 1)
┌─────────────────────┐
│ time                │
│ ---                 │
│ datetime[ns]        │
╞═════════════════════╡
│ 2022-01-01 00:00:00 │
└─────────────────────┘
>>> pl.datatypes.TEMPORAL_DTYPES
frozenset({Date,
           Datetime(tu='ms', tz=None),
           Datetime(tu='ns', tz=None),
           Datetime(tu='us', tz=None),
           Duration(tu='ms'),
           Duration(tu='ns'),
           Duration(tu='us'),
           Time})

@alexander-beedie
Collaborator

alexander-beedie commented Jan 25, 2023

Yup; we still need a slightly better way to match Datetime in the general case, as we do indeed default to μs and pandas defaults to ns. We have started to add "official" dtype groups, such as pl.datatypes.INTEGER_DTYPES that you can select with. A pl.datatypes.DATETIME_DTYPES might be a start, covering all of the different time-units 🤔

@ritchie46
Member

This is not a bug per se. pl.Datetime defaults to Datetime(us), which isn't Datetime(ns).

I will close this in favor of #5300 which is the same underlying reason.
