[Python] pyarrow compute strptime not working with format '%Y%m%d %H%M%S%f' #41132

nikfio · 2024-04-10T20:23:41Z

Hi guys,

I am trying to convert a timestamp column from string to timestamp datatype.
The strings use the format %Y%m%d %H%M%S%f, so a value looks like, for example, 20090101 185956000.

Assuming that
table['timestamp'][0] = '20090101 185956000'

Running the script below
import pyarrow.compute as pc
pc.strptime(table['timestamp'][0], format='%Y%m%d %H%M%S%f', unit='ms')
throws the error:
*** pyarrow.lib.ArrowInvalid: Failed to parse string: '20090101 185956000' as a scalar of type timestamp[ms]

Meanwhile if I use strptime from datetime module:

from datetime import datetime
test_date = str(table['timestamp'][0])
datetime.strptime(test_date, '%Y%m%d %H%M%S%f')
It gives as output a correctly read datetime object:
datetime.datetime(2009, 1, 1, 18, 59, 56)
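For reference, here is a minimal sketch of how Python's own %f behaves on this format, using only the standard library. The second input string is an assumption added for illustration; it is not from the original report. Python's %f accepts one to six digits and right-pads them to microseconds, which is why the trailing 000 is read as a zero fraction.

```python
from datetime import datetime

fmt = "%Y%m%d %H%M%S%f"

# Value from the report: the trailing "000" is consumed by %f as a fraction.
parsed_zero = datetime.strptime("20090101 185956000", fmt)

# Illustrative value (not from the report) with a non-zero fraction.
parsed_frac = datetime.strptime("20090101 185956123", fmt)

print(parsed_zero)
print(parsed_frac.microsecond)
```

The same three digits are therefore interpreted as milliseconds only because of the right-padding, not because %f means "milliseconds".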

Should the pyarrow strptime work like the standard datetime strptime function?
What am I doing wrong?

Many thanks,
Nick

Component(s)

Python


rok · 2024-04-10T23:03:55Z

Thanks for reporting this @nikfio !
This is because pc.strptime uses the C/C++ format semantics for parsing rather than the Python ones. Specifically, the %f flag seems to be what causes the issue, since this works fine:

>>> pc.strptime('20090101 185956', format='%Y%m%d %H%M%S', unit='ms')
<pyarrow.TimestampScalar: '2009-01-01T18:59:56.000'>

To work around this you can try [this (cumbersome) approach] for now:

import pyarrow as pa
import pyarrow.compute as pc
ts = pa.array(["1970-01-01T00:00:59.123456789", "2000-02-29T23:23:23.999999999"], pa.string())
ts2 = pc.strptime(pc.utf8_slice_codeunits(ts, 0, 19), format="%Y-%m-%dT%H:%M:%S", unit="ns")
d = pc.utf8_slice_codeunits(ts, 20, 99).cast(pa.int64()).cast(pa.duration("ns"))
pc.add(ts2, d)

rok · 2024-04-10T23:05:28Z

Tag #31324 for reference.

nikfio · 2024-04-12T20:57:24Z

Hello @rok,

thank you for your help.

I tried the snippet you suggested, but executing the first operation still gives the same error:

ts2 = pc.strptime(pc.utf8_slice_codeunits('20090101 185956000', 0, 19), format='%Y%m%d %H%M%S%f', unit="ns")
*** pyarrow.lib.ArrowInvalid: Failed to parse string: '20090101 185956000' as a scalar of type timestamp[ns]

I know, and I am sorry: this date format (%Y%m%d %H%M%S%f) is a pain in the ass.

I managed to work around the issue by first reading the timestamp as a pyarrow string and then passing it through a datetime conversion using pandas to_datetime. Finally, I convert the datetime array into a pyarrow array with type timestamp('ms').

1. Convert the timestamp column of pyarrow string type (call it timestamp_str) to a pandas datetime, using the date format wanted from the start:
import pandas as pd
std_datetime = pd.to_datetime(timestamp_str.to_numpy(), format='%Y%m%d %H%M%S%f')
2. Convert back to a pyarrow array:
import pyarrow as pa
timecol = pa.array(std_datetime, type=pa.timestamp('ms'))
3. Rebuild the table as wanted:
target_schema = pa.schema([('timestamp', pa.timestamp('ms')), 'other column types'])
table = pa.Table.from_arrays([timecol, 'other cols'], schema=target_schema)

I didn't spell out everything in point (3), to keep it short.
I hope it is understandable to everyone; otherwise let me know.

Thanks,
Nick

rok · 2024-04-12T21:28:49Z

Hey @nikfio, sorry I didn't use your raw data. This works for me on your example:

import pyarrow as pa
import pyarrow.compute as pc
ts = pa.array(["20090101 185956123"], pa.string())
ts2 = pc.strptime(pc.utf8_slice_codeunits(ts, 0, 15), format="%Y%m%d %H%M%S", unit="ms")
d = pc.utf8_slice_codeunits(ts, 15, 99).cast(pa.int64()).cast(pa.duration("ms"))
pc.add(ts2, d)

<pyarrow.lib.TimestampArray object at 0x73006a846680>
[
  2009-01-01 18:59:56.123
]

Your pandas/NumPy workaround looks good. I'm not sure which approach would be better for your use case.

nikfio · 2024-04-12T21:52:29Z

Great @rok, I just tested it and it works as expected.

Sorry, that was also my fault: at first I didn't understand the usage of pc.utf8_slice_codeunits.

I think I'll go with yours in order to have an all-pyarrow operation.

Thanks a lot,
Nick

nikfio · 2024-04-12T21:53:18Z

Closing, as a primary solution has been suggested by @rok and an alternative one has been proposed.

rok · 2024-04-12T22:23:23Z

Thanks for confirming it worked and closing the issue @nikfio.
We might want to implement %f and other flags in the future if there's enough interest. It would require a discussion first and some C++ work.

amanlai · 2024-08-19T16:29:00Z

Related Stack Overflow discussion.

nikfio added the Type: bug label Apr 10, 2024

github-actions bot added the Component: Python label Apr 10, 2024

nikfio closed this as completed Apr 12, 2024
