Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-12278: [Rust][DataFusion] Use Timestamp(Nanosecond, None) for SQL TIMESTAMP Type #9936

Closed

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Apr 7, 2021

Rationale

Running the query CREATE EXTERNAL TABLE .. (c TIMESTAMP) today in DataFusion will result in a data type pf "Date64" which means that anything more specific than the date will be ignored.

This leads to strange behavior such as

echo "Jorge,2018-12-13T12:12:10.011" >> /tmp/foo.csv
echo "Andrew,2018-11-13T17:11:10.011" > /tmp/foo.csv

cargo run -p datafusion --bin datafusion-cli
    Finished dev [unoptimized + debuginfo] target(s) in 0.23s
     Running `target/debug/datafusion-cli`
> CREATE EXTERNAL TABLE t(a varchar, b TIMESTAMP)
STORED AS CSV
LOCATION '/tmp/foo.csv';

0 rows in set. Query took 0 seconds.
> select * from t;
+--------+------------+
| a      | b          |
+--------+------------+
| Andrew | 2018-11-13 |
| Jorge  | 2018-12-13 |
+--------+------------+

(note that the Time part is chopped off)

Changes

This PR changes the default mapping from SQL type TIMESTAMP

@@ -2863,10 +2913,7 @@ mod tests {
ctx: &mut ExecutionContext,
sql: &str,
) -> Result<Vec<RecordBatch>> {
let logical_plan = ctx.create_logical_plan(sql)?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is needed so that CREATE EXTERNAL TABLE is handled correctly

@@ -298,7 +298,7 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
SQLDataType::Boolean => Ok(DataType::Boolean),
SQLDataType::Date => Ok(DataType::Date32),
SQLDataType::Time => Ok(DataType::Time64(TimeUnit::Millisecond)),
SQLDataType::Timestamp => Ok(DataType::Date64),
SQLDataType::Timestamp => Ok(DataType::Timestamp(TimeUnit::Nanosecond, None)),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the actual code change in this PR

@github-actions
Copy link

github-actions bot commented Apr 7, 2021

@alamb alamb force-pushed the ARROW-12278-timestamps-for-timestamps branch 2 times, most recently from fae1e02 to fd6e344 Compare April 7, 2021 21:57
@alamb
Copy link
Contributor Author

alamb commented Apr 8, 2021

FYI @Dandandan / @seddonm1 / @ovr

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Nice catch.

Random though: I think that we would benefit from not writing a test on Context to test the SQL planner. It could be easier to identify which tests are testing what by writing unit tests more specialized to the component that they affect, as they usually declare the API behavior and are easier to find.

file_path.to_str().expect("path is utf8")
);

println!("{}", sql);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a wild println has appeared :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh no! I will remove

Copy link
Contributor

@Dandandan Dandandan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb alamb force-pushed the ARROW-12278-timestamps-for-timestamps branch from fd6e344 to b6fbc1e Compare April 9, 2021 10:29
@alamb
Copy link
Contributor Author

alamb commented Apr 9, 2021

Random though: I think that we would benefit from not writing a test on Context to test the SQL planner. It could be easier to identify which tests are testing what by writing unit tests more specialized to the component that they affect, as they usually declare the API behavior and are easier to find.

@jorgecarleitao I think the core of the problem is that the tests in context.rs are largely end to end tests of DataFusion rather than tests for the ExecutionContext itself. I think they ended up in context.rs because that is typically the main entry point to run queries.

What would you think about refactoring the non ExecutionContext specific tests to some other file (perhaps the tests/sql.rs end to end test?)

@alamb alamb closed this in 53b462b Apr 9, 2021
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
…QL TIMESTAMP Type

# Rationale
Running the query `CREATE EXTERNAL TABLE .. (c TIMESTAMP)` today in DataFusion will result in a data type pf "Date64" which means that anything more specific than the date will be ignored.

This leads to strange behavior such as

```shell
echo "Jorge,2018-12-13T12:12:10.011" >> /tmp/foo.csv
echo "Andrew,2018-11-13T17:11:10.011" > /tmp/foo.csv

cargo run -p datafusion --bin datafusion-cli
    Finished dev [unoptimized + debuginfo] target(s) in 0.23s
     Running `target/debug/datafusion-cli`
> CREATE EXTERNAL TABLE t(a varchar, b TIMESTAMP)
STORED AS CSV
LOCATION '/tmp/foo.csv';

0 rows in set. Query took 0 seconds.
> select * from t;
+--------+------------+
| a      | b          |
+--------+------------+
| Andrew | 2018-11-13 |
| Jorge  | 2018-12-13 |
+--------+------------+
```

(note that the Time part is chopped off)

# Changes
This PR changes the default mapping from SQL type `TIMESTAMP`

Closes apache#9936 from alamb/ARROW-12278-timestamps-for-timestamps

Authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 10, 2021
…QL TIMESTAMP Type

# Rationale
Running the query `CREATE EXTERNAL TABLE .. (c TIMESTAMP)` today in DataFusion will result in a data type pf "Date64" which means that anything more specific than the date will be ignored.

This leads to strange behavior such as

```shell
echo "Jorge,2018-12-13T12:12:10.011" >> /tmp/foo.csv
echo "Andrew,2018-11-13T17:11:10.011" > /tmp/foo.csv

cargo run -p datafusion --bin datafusion-cli
    Finished dev [unoptimized + debuginfo] target(s) in 0.23s
     Running `target/debug/datafusion-cli`
> CREATE EXTERNAL TABLE t(a varchar, b TIMESTAMP)
STORED AS CSV
LOCATION '/tmp/foo.csv';

0 rows in set. Query took 0 seconds.
> select * from t;
+--------+------------+
| a      | b          |
+--------+------------+
| Andrew | 2018-11-13 |
| Jorge  | 2018-12-13 |
+--------+------------+
```

(note that the Time part is chopped off)

# Changes
This PR changes the default mapping from SQL type `TIMESTAMP`

Closes apache#9936 from alamb/ARROW-12278-timestamps-for-timestamps

Authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
…QL TIMESTAMP Type

# Rationale
Running the query `CREATE EXTERNAL TABLE .. (c TIMESTAMP)` today in DataFusion will result in a data type pf "Date64" which means that anything more specific than the date will be ignored.

This leads to strange behavior such as

```shell
echo "Jorge,2018-12-13T12:12:10.011" >> /tmp/foo.csv
echo "Andrew,2018-11-13T17:11:10.011" > /tmp/foo.csv

cargo run -p datafusion --bin datafusion-cli
    Finished dev [unoptimized + debuginfo] target(s) in 0.23s
     Running `target/debug/datafusion-cli`
> CREATE EXTERNAL TABLE t(a varchar, b TIMESTAMP)
STORED AS CSV
LOCATION '/tmp/foo.csv';

0 rows in set. Query took 0 seconds.
> select * from t;
+--------+------------+
| a      | b          |
+--------+------------+
| Andrew | 2018-11-13 |
| Jorge  | 2018-12-13 |
+--------+------------+
```

(note that the Time part is chopped off)

# Changes
This PR changes the default mapping from SQL type `TIMESTAMP`

Closes apache#9936 from alamb/ARROW-12278-timestamps-for-timestamps

Authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants