[Python][Parquet] Ds.dataset() should be able to return partition column as datetime (if it is in ISO date format) #37071

Closed
lmocsi opened this issue Aug 8, 2023 · 1 comment

lmocsi commented Aug 8, 2023

Describe the enhancement requested

Currently, ds.dataset() includes the partition key in the result when the partitioning='hive' option is used.
However, the partition key column is returned as a string, even though it was a date column in the original data.
It would be nice if it were returned as a datetime column when its values are in standard ISO format.

import pyarrow.dataset as ds

mytab = ds.dataset("my_table_name", partitioning='hive').to_table()
print(mytab)

CALENDAR_DATE is the partitioning column.

Results:
pyarrow.Table
USER_ID: double
TRX_CNT: double
DATE_OF_BIRTH: timestamp[ns]
CALENDAR_DATE: string
----
USER_ID: [[1000,1001,1002,1003],[1000,1001,1002,1005],[1000,1001,1003,1005,1008]]
TRX_CNT: [[434,11,3,555],[111,32,1,2],[434,21,44,111,222]]
DATE_OF_BIRTH: [[1998-12-01 23:00:00.000000000,2002-03-13 23:00:00.000000000,1975-08-31 23:00:00.000000000,1998-12-31 23:00:00.000000000],[1998-12-01 23:00:00.000000000,2002-03-13 23:00:00.000000000,1975-08-31 23:00:00.000000000,2004-06-06 22:00:00.000000000],[1998-12-01 23:00:00.000000000,2002-03-13 23:00:00.000000000,1998-12-31 23:00:00.000000000,2004-06-06 22:00:00.000000000,1988-02-27 23:00:00.000000000]]
CALENDAR_DATE: [["2023-08-01 00:00:00","2023-08-01 00:00:00","2023-08-01 00:00:00","2023-08-01 00:00:00"],["2023-08-02 00:00:00","2023-08-02 00:00:00","2023-08-02 00:00:00","2023-08-02 00:00:00"],["2023-08-03 00:00:00","2023-08-03 00:00:00","2023-08-03 00:00:00","2023-08-03 00:00:00","2023-08-03 00:00:00"]]

my_table_name.zip

Component(s)

Parquet, Python

mapleFU (Member) commented Aug 11, 2023

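>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> # Declare the partition key's type explicitly instead of relying on inference
>>> part = ds.partitioning(pa.schema([("CALENDAR_DATE", pa.timestamp('ms'))]), flavor='hive')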
>>> mytab = ds.dataset(".", partitioning=part).to_table()
>>> mytab
pyarrow.Table
USER_ID: double
TRX_CNT: double
DATE_OF_BIRTH: timestamp[ns]
CALENDAR_DATE: timestamp[ms]
----
USER_ID: [[1000,1001,1002,1003],[1000,1001,1002,1005],[1000,1001,1003,1005,1008]]
TRX_CNT: [[434,11,3,555],[111,32,1,2],[434,21,44,111,222]]
DATE_OF_BIRTH: [[1998-12-01 23:00:00.000000000,2002-03-13 23:00:00.000000000,1975-08-31 23:00:00.000000000,1998-12-31 23:00:00.000000000],[1998-12-01 23:00:00.000000000,2002-03-13 23:00:00.000000000,1975-08-31 23:00:00.000000000,2004-06-06 22:00:00.000000000],[1998-12-01 23:00:00.000000000,2002-03-13 23:00:00.000000000,1998-12-31 23:00:00.000000000,2004-06-06 22:00:00.000000000,1988-02-27 23:00:00.000000000]]
CALENDAR_DATE: [[2023-08-01 00:00:00.000,2023-08-01 00:00:00.000,2023-08-01 00:00:00.000,2023-08-01 00:00:00.000],[2023-08-02 00:00:00.000,2023-08-02 00:00:00.000,2023-08-02 00:00:00.000,2023-08-02 00:00:00.000],[2023-08-03 00:00:00.000,2023-08-03 00:00:00.000,2023-08-03 00:00:00.000,2023-08-03 00:00:00.000,2023-08-03 00:00:00.000]]

@lmocsi You can get this by passing an explicit partitioning schema, as above. See the docs: https://arrow.apache.org/docs/python/dataset.html#different-partitioning-schemes

@lmocsi closed this as completed on Feb 22, 2024
@kou changed the title from "Ds.dataset() should be able to return partition column as datetime (if it is in ISO date format)" to "[Python][Parquet] Ds.dataset() should be able to return partition column as datetime (if it is in ISO date format)" on Feb 22, 2024