Support timestamp and interval arithmetic #5764

Merged
merged 41 commits into from
Mar 30, 2023

@berkaysynnada berkaysynnada commented Mar 28, 2023

Which issue does this PR close?

Closes #5704
Closes #194

Rationale for this change

We can handle such queries now:
SELECT val, ts1 - ts2 AS ts_diff FROM table_a
SELECT val, interval1 - interval2 AS interval_diff FROM table_a
SELECT val, ts1 - interval1 AS ts_interval_diff FROM table_a
SELECT val, interval1 + ts1 AS interval_ts_sub FROM table_a

What changes are included in this PR?

| Operation | `-` | `+` |
|---|---|---|
| timestamp op timestamp (same type) | OK: second and millisecond types give results in daytime (day + millisecond); microsecond and nanosecond types give results in monthdaynano (month + day + nano, but the month field is not used) | N/A |
| timestamp op timestamp (different types) | N/A | N/A |
| interval op interval (same type) | OK: operations are done field by field, gives the same type | OK: operations are done field by field, gives the same type |
| interval op interval (different types) | OK: gives the result in monthdaynano type | OK: gives the result in monthdaynano type |
| timestamp op interval | OK: gives the result in the type of the timestamp | OK: gives the result in the type of the timestamp |
| interval op timestamp | N/A | OK: the same as timestamp + interval |

Some match expressions in planner.rs, binary.rs, and datetime.rs are extended. Coerced types and allowable operations are shown in the table.
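
For readers who prefer code to a table, the coercion rules above can be sketched as a single match. This is only an illustration: the enums are simplified, hypothetical stand-ins for the Arrow data types, and `result_type` is not a function from this PR.

```rust
// Hypothetical, simplified stand-ins for the Arrow types named in the table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TimeUnit { Second, Millisecond, Microsecond, Nanosecond }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum IntervalUnit { YearMonth, DayTime, MonthDayNano }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TemporalType { Timestamp(TimeUnit), Interval(IntervalUnit) }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Op { Plus, Minus }

/// Returns the result type of `lhs op rhs`, or `None` where the table says N/A.
fn result_type(lhs: TemporalType, op: Op, rhs: TemporalType) -> Option<TemporalType> {
    use TemporalType::*;
    match (lhs, op, rhs) {
        // timestamp - timestamp (same unit): coarse units give DayTime,
        // fine units give MonthDayNano.
        (Timestamp(a), Op::Minus, Timestamp(b)) if a == b => Some(Interval(match a {
            TimeUnit::Second | TimeUnit::Millisecond => IntervalUnit::DayTime,
            TimeUnit::Microsecond | TimeUnit::Nanosecond => IntervalUnit::MonthDayNano,
        })),
        // timestamp + timestamp, or timestamps with different units: N/A.
        (Timestamp(_), _, Timestamp(_)) => None,
        // interval op interval: the same unit is kept, mixed units widen to MonthDayNano.
        (Interval(a), _, Interval(b)) => {
            Some(Interval(if a == b { a } else { IntervalUnit::MonthDayNano }))
        }
        // timestamp op interval keeps the timestamp type.
        (Timestamp(t), _, Interval(_)) => Some(Timestamp(t)),
        // interval + timestamp mirrors timestamp + interval; interval - timestamp is N/A.
        (Interval(_), Op::Plus, Timestamp(t)) => Some(Timestamp(t)),
        (Interval(_), Op::Minus, Timestamp(_)) => None,
    }
}
```

Returning `Option` keeps the N/A cells explicit, so a planner-style caller can turn `None` into a user-facing error instead of panicking.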

I tried to reuse the existing scalar value functions as much as possible to avoid duplication. However, the subtraction and addition functions in arrow.rs are for numeric types, so I had to add some functions that are called through the binary function.

In datetime.rs, the evaluate function was written to accept only "Array + Scalar" or "Scalar + Scalar" values. It is extended to accept "Array + Array", and the 4 variations of that case (Timestamp op Timestamp, Interval op Interval, Timestamp op Interval, Interval op Timestamp) are implemented. "Array + Scalar" evaluations are done with the unary function in arrow.rs, and I follow a similar pattern with the try_binary_op function. try_binary_op is a modified version of the binary function in arrow-rs. The only difference is that it returns a Result and creates the buffer with try_from_trusted_len_iter. Otherwise, we would have had to unwrap inside the op function passed to binary.
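
To illustrate the idea behind a fallible binary kernel, here is a minimal sketch over plain slices. It is not the PR's code: the real try_binary_op works on Arrow arrays and buffers, while this hypothetical analogue uses `Vec` and an example op named `checked_ts_sub` that I made up for the illustration.

```rust
/// Applies a fallible `op` to each pair of elements and stops at the first error.
/// Hypothetical slice-based analogue; the PR's function works on Arrow arrays.
fn try_binary_op<A, B, C, E, F>(left: &[A], right: &[B], op: F) -> Result<Vec<C>, E>
where
    A: Copy,
    B: Copy,
    F: Fn(A, B) -> Result<C, E>,
{
    // This sketch panics on a length mismatch instead of returning an error.
    assert_eq!(left.len(), right.len(), "inputs must have equal length");
    // Collecting an iterator of `Result`s short-circuits on the first `Err`.
    left.iter().zip(right.iter()).map(|(&l, &r)| op(l, r)).collect()
}

/// Example op: timestamp (in seconds) minus timestamp, reporting overflow as an error.
fn checked_ts_sub(lhs: i64, rhs: i64) -> Result<i64, String> {
    lhs.checked_sub(rhs)
        .ok_or_else(|| format!("overflow computing {lhs} - {rhs}"))
}
```

Because the op returns a `Result`, an overflow surfaces as an `Err` for the whole batch rather than forcing an unwrap inside the closure.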

Are these changes tested?

Yes, there are tests for each match in timestamp.slt. However, tables with interval columns cannot yet be created with literals such as INTERVAL '1 second', since some work is needed in arrow-rs for casting. The timestamp difference case with a timezone is also left in timestamp.rs for a similar reason.

Are there any user-facing changes?

@berkaysynnada
Contributor Author

If I add 1 month to 2023-03-01T00:00:00 +02, is the output 2023-04-01T00:00:00 +02 or 2023-03-29T00:00:00 +02?

I believe this PR implements the latter as it performs arithmetic with respect to the UTC epoch? Is that the desired behaviour?

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=11112707c7217ecca3ef64ceb984beb6 contains an example showing the difference
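
The difference can also be sketched without any date library. The code below is a minimal illustration under stated assumptions: a fixed UTC offset (no DST rules), standard civil-date arithmetic on the proleptic Gregorian calendar, and day-of-month clamping when the target month is shorter. All function names are hypothetical and none of this is taken from the PR or from the linked playground.

```rust
const SECS_PER_DAY: i64 = 86_400;

/// Days since 1970-01-01 for a proleptic Gregorian date (month in 1..=12).
fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    let y = if m <= 2 { y - 1 } else { y }; // count Jan/Feb as part of the previous year
    let (era, yoe) = (y.div_euclid(400), y.rem_euclid(400));
    let doy = (153 * ((m + 9) % 12) + 2) / 5 + d - 1; // day of year, counted from March 1
    era * 146_097 + yoe * 365 + yoe / 4 - yoe / 100 + doy - 719_468
}

/// Inverse of `days_from_civil`: (year, month, day) for a day count.
fn civil_from_days(z: i64) -> (i64, i64, i64) {
    let z = z + 719_468;
    let (era, doe) = (z.div_euclid(146_097), z.rem_euclid(146_097));
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365;
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100);
    let mp = (5 * doy + 2) / 153;
    let d = doy - (153 * mp + 2) / 5 + 1;
    let m = if mp < 10 { mp + 3 } else { mp - 9 };
    (yoe + era * 400 + i64::from(m <= 2), m, d)
}

/// Adds months to a naive (zone-less) second count, clamping the day to the
/// length of the target month (an assumption of this sketch).
fn add_months_naive(secs: i64, months: i64) -> i64 {
    let (days, secs_of_day) = (secs.div_euclid(SECS_PER_DAY), secs.rem_euclid(SECS_PER_DAY));
    let (y, m, d) = civil_from_days(days);
    let total = y * 12 + (m - 1) + months;
    let (ny, nm) = (total.div_euclid(12), total.rem_euclid(12) + 1);
    // The first day of the following month bounds the valid day range.
    let (fy, fm) = if nm == 12 { (ny + 1, 1) } else { (ny, nm + 1) };
    let month_len = days_from_civil(fy, fm, 1) - days_from_civil(ny, nm, 1);
    days_from_civil(ny, nm, d.min(month_len)) * SECS_PER_DAY + secs_of_day
}

/// Arithmetic on the UTC epoch value, ignoring the time zone.
fn add_months_utc(utc_secs: i64, months: i64) -> i64 {
    add_months_naive(utc_secs, months)
}

/// Arithmetic on the local wall-clock time for a fixed UTC offset.
fn add_months_local(utc_secs: i64, offset_secs: i64, months: i64) -> i64 {
    add_months_naive(utc_secs + offset_secs, months) - offset_secs
}

/// Local calendar date of an instant, for inspecting results.
fn local_date(utc_secs: i64, offset_secs: i64) -> (i64, i64, i64) {
    civil_from_days((utc_secs + offset_secs).div_euclid(SECS_PER_DAY))
}
```

For the instant 2023-03-01T00:00:00 +02 (epoch seconds `1_677_628_800 - 7_200`) and an offset of `7_200` seconds, `add_months_utc` lands on local date 2023-03-29, while `add_months_local` lands on 2023-04-01, which are the two candidate outputs in the question.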

#[tokio::test]
async fn interval_ts_add() -> Result<()> {
    let ctx = SessionContext::new();
    let table_a = {
        let schema = Arc::new(Schema::new(vec![
            Field::new("ts1", DataType::Timestamp(TimeUnit::Second, None), false),
            Field::new(
                "interval1",
                DataType::Interval(IntervalUnit::YearMonth),
                false,
            ),
        ]));
        let array1 = PrimitiveArray::<TimestampSecondType>::from_iter_values(vec![
            1_677_628_800i64, // 2023-03-01T00:00:00
        ]);
        let array2 =
            PrimitiveArray::<IntervalYearMonthType>::from_iter_values(vec![1i32]);
        let data = RecordBatch::try_new(
            schema.clone(),
            vec![Arc::new(array1), Arc::new(array2)],
        )?;
        let table = MemTable::try_new(schema, vec![vec![data]])?;
        Arc::new(table)
    };

    ctx.register_table("table_a", table_a)?;
    let sql = "SELECT ts1, ts1 + interval1 from table_a";
    let actual = execute_to_batches(&ctx, sql).await;
    let expected = vec![
        "+---------------------+---------------------------------+",
        "| ts1                 | table_a.ts1 + table_a.interval1 |",
        "+---------------------+---------------------------------+",
        "| 2023-03-01T00:00:00 | 2023-04-01T00:00:00             |",
        "+---------------------+---------------------------------+",
    ];
    assert_batches_eq!(expected, &actual);
    Ok(())
}

The former result is produced; you can reproduce it with this test. The existing do_date_time_math functionality is adopted.

@tustvold
Contributor

> The former result is produced

This example has a timezone of None, not +02:00 as is necessary to demonstrate the potential inconsistency?

@berkaysynnada
Contributor Author

Now I understand what you mean. PostgreSQL gives the former result. If there is no objection, I will fix it that way.

@alamb changed the title from "timestamp interval arithmetic query" to "Support timestamp and interval arithmetic" on Mar 29, 2023

@alamb alamb left a comment

First of all, thank you so much @berkaysynnada

I think this is a significant improvement to DataFusion -- while longer term I would prefer to see the interval arithmetic logic moved into arrow-rs, starting with an implementation in the DataFusion repo has worked well in the past and I think will work well here too.

Can you please respond to @tustvold's comments? I think they are good questions, but then I think we could merge this PR and file follow-on tickets:

  1. Move the arithmetic code into binary.rs (following the existing models, as a step towards getting them upstream in arrow).
  2. File a ticket about not handling timezones properly

cc @waitingkuo @avantgardnerio @andygrove @liukun4515

@@ -142,14 +138,30 @@ impl PhysicalExpr for DateTimeIntervalExpr {
return Err(DataFusionError::Internal(msg.to_string()));
}
};
// RHS is first checked. If it is a Scalar, there are 2 options:

Longer term I think it would be good to move the date_time arithmetic into https://github.com/apache/arrow-datafusion/tree/main/datafusion/physical-expr/src/expressions/binary as these really are binary operations

That would also set us up so when the kernels are added to arrow-rs (aka part of apache/arrow-rs#3958) it would be easier to migrate.

I like how this PR followed the existing pattern in DateTimeIntervalExpr even if that may not be our ideal end state

@berkaysynnada
Contributor Author

I am working on @tustvold 's comments, and when I finalize them I will commit. Thanks for the support of try_binary.

@berkaysynnada
Contributor Author

@tustvold I tried to fix the issues you mention, can you please take a quick look?

.ok();
parsed_tz
let parsed_tz: Tz = FromStr::from_str(tz).map_err(|_| {
DataFusionError::Execution("cannot parse given timezone".to_string())

It would be nice if the error contained the problematic timezone. Something like

        let parsed_tz: Tz = FromStr::from_str(tz).map_err(|e| {
            DataFusionError::Execution(format!("cannot parse '{tz}' as timezone: {e}"))
        })?;
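
As a self-contained illustration of the same idea (echoing the offending input in the error), here is a small sketch that does not depend on chrono-tz or DataFusion. `parse_fixed_offset` is a hypothetical helper that only understands fixed offsets such as "+02:00"; it stands in for the real `Tz` parser, and `String` stands in for `DataFusionError`.

```rust
/// Hypothetical stand-in for a time zone parser: accepts only fixed offsets
/// such as "+02:00" and returns the offset in seconds.
fn parse_fixed_offset(tz: &str) -> Result<i32, String> {
    // Every error message repeats the input so the user can see what was rejected.
    let err = |reason: &str| format!("cannot parse '{tz}' as timezone: {reason}");
    let (sign, rest) = match tz.as_bytes().first() {
        Some(b'+') => (1, &tz[1..]),
        Some(b'-') => (-1, &tz[1..]),
        _ => return Err(err("expected a leading '+' or '-'")),
    };
    let (hh, mm) = rest.split_once(':').ok_or_else(|| err("expected HH:MM"))?;
    let hours = hh.parse::<i32>().map_err(|e| err(&e.to_string()))?;
    let minutes = mm.parse::<i32>().map_err(|e| err(&e.to_string()))?;
    if !(0..=23).contains(&hours) || !(0..=59).contains(&minutes) {
        return Err(err("offset out of range"));
    }
    Ok(sign * (hours * 3_600 + minutes * 60))
}
```

Building the message through one closure keeps the wording consistent across all failure paths.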

@@ -348,63 +340,6 @@ pub fn evaluate_temporal_arrays(
Ok(ColumnarValue::Array(ret))
}

#[inline]

🎉

@alamb alamb mentioned this pull request Mar 30, 2023
@@ -261,6 +261,110 @@ SELECT INTERVAL '8' MONTH + '2000-01-01T00:00:00'::timestamp;
----
2000-09-01T00:00:00

# Interval columns are created with timestamp subtraction in subquery since they are not supported yet

I think with #5792 we can now write better tests here -- specifically we can create interval constants.

@@ -44,6 +44,7 @@ unicode_expressions = ["unicode-segmentation"]
[dependencies]
ahash = { version = "0.8", default-features = false, features = ["runtime-rng"] }
arrow = { workspace = true }
arrow-array = { version = "34.0.0", default-features = false, features = ["chrono-tz"] }

The rest of datafusion now uses arrow 36, but this uses arrow 34

Suggested change
arrow-array = { version = "34.0.0", default-features = false, features = ["chrono-tz"] }
arrow-array = { workspace = true }

In #5794


@alamb alamb left a comment

Thank you @berkaysynnada -- given we are working on intervals in general and this PR pushes things along substantially I am going to merge it and we can clean things up with follow on PRs.

Thanks again

@berkaysynnada
Contributor Author

berkaysynnada commented Mar 30, 2023

@alamb Thanks for the support. I have added these issues to my to-do list and will open PRs as I progress.

@alamb
Contributor

alamb commented Mar 30, 2023

> @alamb Thanks for the support. I have added these issues to my to-do list and will open PRs as I progress.

Thanks @berkaysynnada -- can you be specific about which items you have added to the todo list?

@berkaysynnada berkaysynnada deleted the feature/timestamp-interval-arith-query branch March 31, 2023 08:33
@berkaysynnada
Contributor Author

I meant #5803, which you have completed, and moving the arithmetic code to binary.rs, but I can also spare time for the issues that you see as relevant in #5753 and #3958.

@alamb
Contributor

alamb commented Mar 31, 2023

Thank you @berkaysynnada 🙇 . I think this issue:

> moving the arithmetic code to binary.rs

This is the most valuable part in my opinion, as it pays down tech debt and sets us up for an easier migration / porting of the code upstream to arrow-rs. Since you probably still have all the timestamp / kernel context in your head, you are likely to do it more quickly than someone who needs to get up to speed.

@berkaysynnada
Contributor Author

I am working on symmetric hash join with temporal type inputs, so I need to modify the evaluate_array function in datetime.rs, where Array vs. Scalar values are evaluated (the newly added match arms there also use some of these arithmetic functions). I plan to include that move in the symmetric hash join PR, if that is not a problem.

@alamb
Contributor

alamb commented Mar 31, 2023

> I plan to include that move in the symmetric hash join PR, if that is not a problem.

If possible, I would recommend a separate PR (that your sym hash join builds on) that moves the code -- this should speed up reviews as each will be smaller and more focused

Labels
core (Core DataFusion crate), logical-expr (Logical plan and expressions), optimizer (Optimizer rules), physical-expr (Physical Expressions), sqllogictest (SQL Logic Tests (.slt))
Development

Successfully merging this pull request may close these issues.

Timestamp and Interval arithmetics
Add support for subtracting timestamps --> intervals
7 participants