
Add support for parsing timestamps from CSV files #958

Closed
andygrove opened this issue Aug 30, 2021 · 9 comments
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@andygrove
Member

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I updated the nyc benchmark schema to use timestamps:

```rust
fn nyctaxi_schema() -> Schema {
    Schema::new(vec![
        Field::new("VendorID", DataType::Utf8, true),
        Field::new("pickup_datetime", DataType::Timestamp(TimeUnit::Microsecond, None), true),
        Field::new("dropoff_datetime", DataType::Timestamp(TimeUnit::Microsecond, None), true),
        ...
```

I tried running a query and got this error.

```
Error: ArrowError(ExternalError(ArrowError(ParseError("Error while parsing value 2020-01-01 00:35:39 for column 1 at line 2"))))
```

Describe the solution you'd like
I would like to be able to query CSV files containing timestamps.

Describe alternatives you've considered
None.

Additional context
None.
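For context, the missing piece is converting strings like `2020-01-01 00:35:39` into a count of microseconds since the Unix epoch, which is what a `Timestamp(TimeUnit::Microsecond, None)` column stores. Below is a minimal std-only sketch of that conversion; the function name is hypothetical and this is not the arrow-rs API (arrow's real implementation lives in `compute/kernels/cast_utils.rs`).

```rust
// Hypothetical sketch (std-only): parse "YYYY-MM-DD HH:MM:SS" into
// microseconds since the Unix epoch. Assumes UTC and ignores leap seconds.
fn parse_timestamp_micros(s: &str) -> Option<i64> {
    let (date, time) = s.split_once(' ')?;
    let mut d = date.split('-').map(|p| p.parse::<i64>().ok());
    let (y, m, day) = (d.next()??, d.next()??, d.next()??);
    let mut t = time.split(':').map(|p| p.parse::<i64>().ok());
    let (h, min, sec) = (t.next()??, t.next()??, t.next()??);

    // Days since 1970-01-01, via the "days from civil" algorithm.
    let y = if m <= 2 { y - 1 } else { y };
    let era = (if y >= 0 { y } else { y - 399 }) / 400;
    let yoe = y - era * 400; // year of era, in [0, 399]
    let doy = (153 * ((m + 9) % 12) + 2) / 5 + day - 1; // day of year
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy; // day of era
    let days = era * 146_097 + doe - 719_468;

    Some((days * 86_400 + h * 3_600 + min * 60 + sec) * 1_000_000)
}
```

With this, the failing value `2020-01-01 00:35:39` round-trips to an `i64` instead of aborting the read; unparseable cells come back as `None`, which a reader could map to a null.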

@andygrove andygrove added the enhancement New feature or request label Aug 30, 2021
@alamb alamb added the good first issue Good for newcomers label Oct 2, 2021
@alamb
Contributor

alamb commented Oct 2, 2021

Arrow already contains code to correctly parse a string into a timestamp here: https://github.com/apache/arrow-rs/blob/master/arrow/src/compute/kernels/cast_utils.rs#L69

This ticket is likely a matter of hooking that code up into the CSV parser: https://github.com/apache/arrow-rs/blob/master/arrow/src/csv/reader.rs

So most of the code for this change might best belong in arrow-rs rather than datafusion.
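To make the shape of that hookup concrete, here is a hedged, std-only sketch of the per-column dispatch the CSV reader would need. All type and function names here are illustrative stand-ins, not the real arrow-rs types (the real reader fills array builders rather than producing cell enums).

```rust
// Illustrative stand-ins for arrow's DataType and parsed values.
#[derive(Debug, PartialEq)]
enum Cell {
    Str(String),
    Int(i64),
    TimestampMicros(i64),
    Null,
}

enum ColType {
    Utf8,
    Int64,
    TimestampMicros,
}

// The missing arm this issue asks for: route timestamp columns through
// a string -> epoch-micros parser instead of failing with a ParseError.
fn parse_cell(raw: &str, ty: &ColType) -> Cell {
    match ty {
        ColType::Utf8 => Cell::Str(raw.to_owned()),
        ColType::Int64 => raw.parse().map(Cell::Int).unwrap_or(Cell::Null),
        ColType::TimestampMicros => to_epoch_micros(raw)
            .map(Cell::TimestampMicros)
            .unwrap_or(Cell::Null),
    }
}

// Compact "YYYY-MM-DD HH:MM:SS" -> microseconds since the Unix epoch
// (UTC, no leap seconds); stands in for arrow's cast_utils logic.
fn to_epoch_micros(s: &str) -> Option<i64> {
    // Cumulative days before each month in a non-leap year.
    const CUM: [i64; 12] = [0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334];
    let mut it = s
        .split(|c| c == '-' || c == ' ' || c == ':')
        .map(|p| p.parse::<i64>().ok());
    let (y, m, d) = (it.next()??, it.next()??, it.next()??);
    let (h, mi, se) = (it.next()??, it.next()??, it.next()??);
    if !(1..=12).contains(&m) {
        return None;
    }
    let leap = |y: i64| y / 4 - y / 100 + y / 400; // leap days up to year y
    let mut days = (y - 1970) * 365 + leap(y - 1) - leap(1969) + CUM[(m - 1) as usize] + d - 1;
    if m > 2 && (y % 4 == 0 && (y % 100 != 0 || y % 400 == 0)) {
        days += 1; // account for Feb 29 of the current year
    }
    Some(((days * 24 + h) * 3_600 + mi * 60 + se) * 1_000_000)
}
```

The design choice in this sketch maps unparseable cells to null; the real implementation would instead respect the field's nullability and surface a `ParseError` for non-nullable columns.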

@novemberkilo

I would like to pick this up. Please assign it to me as appropriate // @alamb

@novemberkilo

I have an example of this now on novemberkilo@d9f096a

To reproduce, follow the directions in benchmarks/README.md to get a ballista-scheduler and ballista-executor going locally, then do

```shell
cargo run --release --bin nyctaxi -- --iterations 3 --path benchmarks/data/nyctaxi_100.csv --format csv --batch-size 4096
```

@alamb
Contributor

alamb commented Oct 10, 2021

novemberkilo@d9f096a <-- looks very cool 👍

@novemberkilo

@alamb It looks like datafusion is pinned to version 5.3 of arrow-rs. Once apache/arrow-rs#832 is merged, getting it into datafusion will require upgrading to around 7.0.0 -- that seems like a not-small change? What would the process for this be? Thanks.

@houqp
Member

houqp commented Oct 19, 2021

@novemberkilo since apache/arrow-rs#832 doesn't break any public API, it will be released as part of arrow 6.x. @alamb already has a PR ready to merge for arrow-rs 6.x integration: #984. Process-wise, we need to get arrow-rs 6.0.0 released first. I will let @alamb decide whether your arrow-rs PR should be merged and released as part of the 6.0.0 release or the release after that.

@alamb
Contributor

alamb commented Oct 19, 2021

arrow 6.0.0 is released. When apache/arrow-rs#832 is merged I'll backport it (it will be included in 6.1.0, due to be released around Nov 1, 2021).

@alamb
Contributor

alamb commented Mar 24, 2022

🤔 I wonder if this issue is now done? Or does it need more work?

@novemberkilo

IIRC we just wanted to wait until we could confirm that the version of arrow-rs containing the fix is being used in datafusion. I don't think it needs more work.

@alamb alamb closed this as completed Mar 25, 2022