Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python] pyarrow compute strptime not working with format '%Y%m%d %H%M%S%f' #41132

Closed
nikfio opened this issue Apr 10, 2024 · 8 comments
Closed

Comments

@nikfio
Copy link

nikfio commented Apr 10, 2024

Hi guys,

I am trying to convert a timestamp column from string to timestamp datatype.
I have this format: %Y%m%d %H%M%S%f
To read, for example, a date equal to 20090101 185956000.

Assuming that
table['timestamp'][0] = '20090101 185956000'

Running the script below
import pyarrow.compute as pc
pc.strptime(table['timestamp'][0], format='%Y%m%d %H%M%S%f', unit='ms')
throws the error:
*** pyarrow.lib.ArrowInvalid: Failed to parse string: '20090101 185956000' as a scalar of type timestamp[ms]

Meanwhile if I use strptime from datetime module:

test_date = str(table['timestamp'][0])
datetime.strptime(str(df['timestamp'][0]), '%Y%m%d %H%M%S%f')
It gives as output a correctly read datetime object:
datetime.datetime(2009, 1, 1, 18, 59, 56)

Should the pyarrow strptime work like the standard datetime strptime function?
What am I doing wrong?

Many thanks,
Nick

Component(s)

Python

@rok
Copy link
Member

rok commented Apr 10, 2024

Thanks for reporting this @nikfio !
This is due to pc.strptime using the C/C++ format semantics for parsing instead of the Python ones. Namely it seems it's the %f flag that is causing this issue as this works fine:

>>> pc.strptime('20090101 185956', format='%Y%m%d %H%M%S', unit='ms')
<pyarrow.TimestampScalar: '2009-01-01T18:59:56.000'>

To work around this you can try [this (cumbersome) approach] for now:

import pyarrow as pa
import pyarrow.compute as pc
ts = pa.array(["1970-01-01T00:00:59.123456789", "2000-02-29T23:23:23.999999999"], pa.string())
ts2 = pc.strptime(pc.utf8_slice_codeunits(ts, 0, 19), format="%Y-%m-%dT%H:%M:%S", unit="ns")
d = pc.utf8_slice_codeunits(ts, 20, 99).cast(pa.int64()).cast(pa.duration("ns"))
pc.add(ts2, d)

@rok
Copy link
Member

rok commented Apr 10, 2024

Tag #31324 for reference.

@nikfio
Copy link
Author

nikfio commented Apr 12, 2024

Hello @rok,

thank you for your help.

I tried your piece suggested. But executing the first operation still gives the same error :

ts2 = pc.strptime(pc.utf8_slice_codeunits('20090101 185956000', 0, 19), format='%Y%m%d %H%M%S%f', unit="ns")
*** pyarrow.lib.ArrowInvalid: Failed to parse string: '20090101 185956000' as a scalar of type timestamp[ns]

I know and I am sorry actually, this date format (%Y%m%d %H%M%S%f) is a pain in the ass.

I managed to work aroud the issue by reading the timestamp at first as pyarrow string and the passing through a datetime conversion using pandas to_datetime. Then finally convert the datetime array into pyarrow array with type timestamp('ms').

  1. convert timestamp with type pyarrow string - call it timestamp_str - to pandas datetime using the date format wanted from the start:
    import pandas as pd
    std_datetime = pd.to_datetime(timestamp_str.to_numpy(), format=%Y%m%d %H%M%S%f)
  2. convert back to pyarrow array;
    import pyarrow as pa
    timecol = pa.array(std_datetime, type=pa.timestamp('ms'))
  3. rebuild table as wanted
    target_schema = pa.schema([('timestamp', pa.timestamp('ms')), 'otehr columns types'])
    table = pa.Table.from arrays( [ timecol, 'other cols' ], schema=target_schema

Didn't wrote everything clear in point (3) to be more synthesized.
Hope it is understable from everyone, otherwise let me know.

Thanks,
Nick

@rok
Copy link
Member

rok commented Apr 12, 2024

Hey @nikfio, sorry I didn't use your raw data. This works for me on your example:

import pyarrow as pa
import pyarrow.compute as pc
ts = pa.array(["20090101 185956123"], pa.string())
ts2 = pc.strptime(pc.utf8_slice_codeunits(ts, 0, 15), format="%Y%m%d %H%M%S", unit="ms")
d = pc.utf8_slice_codeunits(ts, 15, 99).cast(pa.int64()).cast(pa.duration("ms"))
pc.add(ts2, d)
<pyarrow.lib.TimestampArray object at 0x73006a846680>
[
  2009-01-01 18:59:56.123
]

Your Pandas/numpy workaround looks good. I'm not sure which approach would be better for your usecase.

@nikfio
Copy link
Author

nikfio commented Apr 12, 2024

Great @rok I just tested it and it works fine as expected.

sorry also my fault I didn't understand at first the usage of pc.utf8_slice_codeunits.

I think I'll go with yours in order to have a all-pyarrow operation.

Thanks a lot,
Nick

@nikfio nikfio closed this as completed Apr 12, 2024
@nikfio
Copy link
Author

nikfio commented Apr 12, 2024

Close as a prime solution has been suggested by @rok and an alternative one has been proposed.

@rok
Copy link
Member

rok commented Apr 12, 2024

Thanks for confirming it worked and closing the isssue @nikfio.
We might want to implement %f and other flags in the future if there's enough interest. It would require a discussion first and some C++ work.

@amanlai
Copy link

amanlai commented Aug 19, 2024

Related Stack Overflow discussion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants