Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Subtracting Timestamp from Timestamp should produce a Duration (not Timestamp) #3964

Closed
Tracked by #3958
alamb opened this issue Mar 27, 2023 · 15 comments · Fixed by #4244
Closed
Tracked by #3958

Subtracting Timestamp from Timestamp should produce a Duration (not Timestamp) #3964

alamb opened this issue Mar 27, 2023 · 15 comments · Fixed by #4244
Labels
arrow Changes to the arrow crate bug good first issue Good for newcomers

Comments

@alamb
Copy link
Contributor

alamb commented Mar 27, 2023

Describe the bug
Subtracting two Timestamp columns results in another Timestamp which is not correct

To Reproduce

fn subtract_timestamps() {
    let arr1 = TimestampNanosecondArray::from(vec![
        1_000,
        1_000_000,
        1_000_000_000,
        1_000_000_000_000,
        1_000_000_000_000_000,
        1_000_000_000_000_000_000,
    ]);

    let arr2 = TimestampNanosecondArray::from(vec![
        2_000,
        2_000_000,
        2_000_000_000,
        2_000_000_000_000,
        2_000_000_000_000_000,
        2_000_000_000_000_000_000,
    ]);

    println!(
        "input:\n{}\n{}",
        pretty_format_columns("arr1", &[Arc::new(arr1.clone()) as _]).unwrap(),
        pretty_format_columns("arr2", &[Arc::new(arr2.clone()) as _]).unwrap(),
    );

    let interval = subtract_dyn(&arr2, &arr1).unwrap();

    println!(
        "output{}:\n{}",
        interval.data_type(),
        pretty_format_columns("interval", &[Arc::new(interval.clone()) as _]).unwrap(),
    );
}

Which produces

input:
+----------------------------+
| arr1                       |
+----------------------------+
| 1970-01-01T00:00:00.000001 |
| 1970-01-01T00:00:00.001    |
| 1970-01-01T00:00:01        |
| 1970-01-01T00:16:40        |
| 1970-01-12T13:46:40        |
| 2001-09-09T01:46:40        |
+----------------------------+
+----------------------------+
| arr2                       |
+----------------------------+
| 1970-01-01T00:00:00.000002 |
| 1970-01-01T00:00:00.002    |
| 1970-01-01T00:00:02        |
| 1970-01-01T00:33:20        |
| 1970-01-24T03:33:20        |
| 2033-05-18T03:33:20        |
+----------------------------+
output Timestamp(Nanosecond, None):
+----------------------------+
| interval                   |
+----------------------------+
| 1970-01-01T00:00:00.000001 |
| 1970-01-01T00:00:00.001    |
| 1970-01-01T00:00:01        |
| 1970-01-01T00:16:40        |
| 1970-01-12T13:46:40        |
| 2001-09-09T01:46:40        |
+----------------------------+

Expected behavior
I expect the output to be an interval of type Interval(MonthDayNano) (not Timestamp)

Updated (after discussion with @tustvold ) I expect the output to be a duration of type Duration(unit) where unit is the same as the source Timestamp(unit) (not Timestamp)

Additional context
Here is what postgres does:

postgres=# create table test as select now() - now();
SELECT 1
postgres=# \d test
                 Table "public.test"
  Column  |   Type   | Collation | Nullable | Default
----------+----------+-----------+----------+---------
 ?column? | interval |           |          |

postgres=#
@alamb alamb added the bug label Mar 27, 2023
@alamb alamb changed the title Subtracting timestamps should result in an interval Subtracting Timestamp from Timestamp should be Interval (not Timestamp) Mar 27, 2023
@tustvold
Copy link
Contributor

tustvold commented Mar 27, 2023

I expect the output to be an interval of type Interval(MonthDayNano) (not Timestamp)

Should it be an interval, or should it be a duration? The latter is significantly easier to reason about, although it would appear postgres lacks such a construct.

@alamb
Copy link
Contributor Author

alamb commented Mar 27, 2023

Should it be an interval, or should it be a duration?

I guess I was thinking MonthDayNano with month=0 and day=0 would be the same as a Duration 🤔 Not sure if that makes sense

@alamb
Copy link
Contributor Author

alamb commented Mar 27, 2023

I would also be fine with Duration -- changing ticket

@alamb alamb changed the title Subtracting Timestamp from Timestamp should be Interval (not Timestamp) Subtracting Timestamp from Timestamp should be Duration (not Timestamp) Mar 27, 2023
@alamb alamb changed the title Subtracting Timestamp from Timestamp should be Duration (not Timestamp) Subtracting Timestamp from Timestamp should produce a Duration (not Timestamp) Mar 27, 2023
@alamb alamb added the good first issue Good for newcomers label Mar 27, 2023
@alamb
Copy link
Contributor Author

alamb commented Mar 27, 2023

I think this is a good first issue because the semantics are well defined and there are some existing examples

@leetcode-1533
Copy link

hi, I would like to take this task

@leetcode-1533
Copy link

it is my first rust related PR, just want to highlight my plan for implement this:

Within function subtract_dyn(), we currently use typed_dict_math_op() and math_op().

I would like to impement antoher function for timestamp types, called timestamp_math_op(), which deals with timestamp type specific operations. And always returns Duration when calculating between two timestamps.

@alamb
Copy link
Contributor Author

alamb commented Mar 27, 2023

I would like to impement antoher function for timestamp types, called timestamp_math_op(), which deals with timestamp type specific operations. And always returns Duration when calculating between two timestamps.

Sounds good to me

@berkaysynnada
Copy link
Contributor

it is my first rust related PR, just want to highlight my plan for implement this:

Within function subtract_dyn(), we currently use typed_dict_math_op() and math_op().

I would like to impement antoher function for timestamp types, called timestamp_math_op(), which deals with timestamp type specific operations. And always returns Duration when calculating between two timestamps.

Thanks for working on this issue :)
After merging that PR in datafusion, timestamp - timestamp operations give results in Interval types, and it is planned to get arithmetics moved to arrow-rs. Can you check that PR to not have an inconsistent behavior?

@tustvold
Copy link
Contributor

tustvold commented Apr 1, 2023

After merging that PR in datafusion, timestamp - timestamp operations give results in Interval types,

Can we perhaps revisit this decision in DataFusion, interval types are probably not the correct type to be returning for such operations on timestamps. A duration faithfully represents the difference between two timestamps in absolute time, an interval does not, instead representing a logical quantity the meaning of which depends on the timestamp it is applied to. I would expect timestamp arithmetic to return Duration, with query engines able to insert type coercions if they really need an interval for some reason. Crucially a duration can be converted to an interval, but the reverse transformation is not generally possible

@alamb
Copy link
Contributor Author

alamb commented Apr 1, 2023

I think it would be fine to support converting to duration in arrow-rs (and we can convert to Interval in datafusion as needed)

The reason we are pushing ahead with interval (rather than interval and duration) in DataFusion is to get something working incrementally without having to sort out all the subtleties with Intervals, Durations, arithmetic and conversions.

Then I think over time we can and will add more sophistication (like making the distinction between Duration and Interval and coercing automatically between them) to DataFusion

Thus, I suggest we get the kernels correct in arrow-rs (and provide the appropriate casting operations) and then we can upgrade DataFusion to use them.

So in this case, let's have timestamp - timestamp produce duration in arrow.rs sounds good.

I will also file a ticket about casting to/from Duration and Interval

@alamb
Copy link
Contributor Author

alamb commented Apr 1, 2023

I filed #3998 to track casting durations to/from intervals

@ozankabak
Copy link

The latter is significantly easier to reason about, although it would appear postgres lacks such a construct.

Yes, the issue is most DB's (and SQL itself) simply use the timestamp - timestamp = interval pattern. However, from our perspective using durations is fine as long as there is casting/coercing mechanism to take care of the transformation cheaply.

@alamb
Copy link
Contributor Author

alamb commented Apr 3, 2023

Yes, the issue is most DB's (and SQL itself) simply use the timestamp - timestamp = interval pattern. However, from our perspective using durations is fine as long as there is casting/coercing mechanism to take care of the transformation cheaply.

I think the core problem is that the clear distinction between "duration" and "interval" that is made in Arrow and Rust's standard library is not found in SQL (e.g. there is no SQL duration type). The coercion / casting logic in DataFusion I think is the right place to reconcile the various Arrow types (Intervals with different time units, durations with different time units)

So in other words, even though I expect SQL / DataFusion users to mostly work with Intervals, having DataFusion sort out how to call the appropriate interval kernels in arrow-rs with coercion, etc would be the ideal approach

@ozankabak
Copy link

SGTM

@tustvold tustvold added the arrow Changes to the arrow crate label May 19, 2023
@tustvold
Copy link
Contributor

label_issue.py automatically added labels {'arrow'} from #4244

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate bug good first issue Good for newcomers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants