Attempting to cast timestamps to string raises an ArrowNotImplementedError #43

Open
isichei opened this issue Mar 16, 2021 · 0 comments

isichei commented Mar 16, 2021

Our current setup of reading into Arrow and then casting causes this issue whenever Arrow sees timestamp strings (in CSV or JSON) in the ISO standard format. A current workaround for CSVs is:

from io import BytesIO
import pyarrow as pa
from pyarrow import csv
from arrow_pd_parser.parse import pa_read_csv_to_pandas

csv_data = b"""
a,b
1,2020-01-01 00:00:00
2,2021-01-01 23:59:59
"""

# Note: you can provide a partial schema and let the package infer a's type by setting `expect_full_schema=False`
schema = pa.schema([("b", pa.string())])
test_file = BytesIO(csv_data)

# The following line will raise an ArrowNotImplementedError.
# This is because casting timestamps to str is not currently implemented.
df = pa_read_csv_to_pandas(test_file, schema=schema, expect_full_schema=False)

# By default Arrow reads str representations of timestamps as
# timestamps if they conform to the ISO standard format.
# You then get the error when you try to cast that timestamp to str. To
# get around this you can force pyarrow to read the data as a string
# when it parses the CSV (note that ConvertOptions is not currently
# available for the JSON reader).
co = csv.ConvertOptions(column_types=schema)
# Rewind the buffer, since the failed read above already consumed it
test_file.seek(0)
df = pa_read_csv_to_pandas(test_file, schema=schema, expect_full_schema=False, convert_options=co)

But this seems quite clunky. It also cannot be implemented for JSON, whose reader does not currently have a ConvertOptions equivalent. It is also worth remembering that we moved to the current framework (let Arrow read the data using its best guess at the types, then cast) because providing a schema to the JSON reader caused an issue (see #40). It may be worth updating to pyarrow 3.0 and seeing if that issue still persists; if not, perhaps we should provide the schema on read. Failing that, it might be worth casting the data via Pandas rather than Arrow.
