Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature]: Right-closed time_bucket #5806

Open
mvanaltvorst opened this issue Jun 21, 2023 · 15 comments
Open

[Feature]: Right-closed time_bucket #5806

mvanaltvorst opened this issue Jun 21, 2023 · 15 comments
Labels
feature-request Feature proposal

Comments

@mvanaltvorst
Copy link

mvanaltvorst commented Jun 21, 2023

What problem does the new feature solve?

In the current state of TimescaleDB, it is unclear from the docs whether time_bucket creates left- or right-closed intervals. Through experimentation I found that time_bucket provides left-closed, right-opened, left-labeled intervals.

What does the feature do?

It would be desirable to be able to choose on which sides your intervals are open and close. It would also be a nice feature to be able to select which side of the interval should be taken as a label for the time_bucket.

Implementation challenges

Subtle off-by-one errors for materialized views which might depend on the current behaviour of time_bucket.

@mvanaltvorst mvanaltvorst added the feature-request Feature proposal label Jun 21, 2023
@mvanaltvorst
Copy link
Author

In the meantime, is there a way to circumvent this issue? I tried adding intervals to my time bucket labels, however this does not work with MATERIALIZED VIEW's.

@jkittner
Copy link

you can imitate a right labelled behavior with:

SELECT
  time_bucket('5 minutes', time - '1 second'::INTERVAL) + '5 minutes' AS five_min,
  avg(cpu)
FROM metrics
GROUP BY five_min
ORDER BY five_min DESC LIMIT 10;

(the time you subtract needs to be smaller than the smallest interval your data has in the table metrics (e.g. if it's 1 minute, 1 second is fine)
but that unfortunately does not work with MATERIALIZED VIEW.

Maybe you could achieve the same using the offset argument in the time_bucket function instead of the subtraction

@mvanaltvorst
Copy link
Author

mvanaltvorst commented Jun 25, 2023 via email

@mkindahl
Copy link
Contributor

mkindahl commented Aug 14, 2023

An alternative is is to expose the function list of "approved" functions for continuous aggregates as a catalog table and allow any functions to be registered there and and store the information we have in the internal cache that marks functions as "ok". You still have to know what you do when you write something like this, but at least it would make it possible to register any "cagg-friendly" function to be used with a continuous aggregate.

@melicheradam
Copy link

Hi @mkindahl

Can you elaborate a bit? So simulating a right-closed time-bucket for continuous aggregates is not possible right now, is that right? Or is this some kind of workaround you posted?

@mvanaltvorst
Copy link
Author

For reference, pd.DataFrame.resample() has optional closed and label arguments for this. I think Pandas' API works quite well, is it fundamentally difficult to add these optional arguments to the time_bucket function?

@mkindahl
Copy link
Contributor

Hi @mkindahl

Can you elaborate a bit? So simulating a right-closed time-bucket for continuous aggregates is not possible right now, is that right? Or is this some kind of workaround you posted?

Hi @melicheradam !

It is possible to write your own version of a right-closed bucketing function (for example, as done in #5864 (comment)) that is straightforward to use from a normal query. However, since this function will not be marked as "approved", it cannot be used in a continuous aggregate.

There is a hard-coded list of approved functions that is used when deciding what can be allowed in continuous aggregates and the internal function cache is populated with it on startup. Instead of using a hard-coded list, code could be re-written to store this information in the catalog, which would allow new functions to be added there and "approved" for use in continuous aggregates as bucketing functions.

It does place some requirements on any authors of such functions, and it is probably necessary to add some checks to make sure that these functions do not do anything wildly out of the expected, but it would allow users to solve this problem.

@mvanaltvorst
Copy link
Author

I think a dynamic list of approved continuous aggregate functions would add a lot of complexity if the only goal is to create simple right-closed time buckets, are there other current Github issues such an implementation would solve?

@melicheradam
Copy link

@mvanaltvorst

There was one linked above that was closed because it is related to this one. I have found some issues that are about the offsets, which may be in effort to create right-closed right-labeled time_bucket.

In the issue linked above, there is already an implementation for such time_bucket ready, but it cannot be used in continuous aggregate. If someone could include it in the approved list of functions it would be great. I would prepare a PR myself but I am lacking the required experience for this.

#2265

#4594

#4116

#2970

@melicheradam
Copy link

Also for time_bucket_gapfill

@Sanprof
Copy link

Sanprof commented Sep 22, 2023

Are there any updates regarding this?

Sorry, but all our customers don't like to see aggregated points on the graph which represent the values in the future and we really look stupid because of that and always try to explain such behavior but no one cares about our explanations.

We found that the Timescale extension is pretty compatible for us in the aggregation and in the performance. It covers most of our needs.

In case we don't find in near time the solution to be able to use right-closed time buckets we must leave this extension and rebuild our project using something else. We don't like it because some of the resources were already put into implementing integration with the Timescale extension.

@melicheradam
Copy link

Hey @Sanprof ,

currently we are using a hack that does not degrade performance like this, which simulates right-closed timebuckets (aggregations) correctly

SELECT time_bucket('5min', timestamp, "offset" => '1ms') + INTERVAL '5min' - INTERVAL '1ms', avg(value) ....

But you cannot use this for continuous aggregates, those can only be left-labeled.

@Sanprof
Copy link

Sanprof commented Sep 22, 2023

Hey @melicheradam,

Thanks for your suggestion, but as you rightly mentioned we need the right-closed time bucket function for continuous aggregates.
For our regular selects, we are using our own time_window function which covers all our needs.

@melicheradam
Copy link

@Sanprof oh sorry, have not realized you were the one originally posted it :D , but you might wanna consider doing some performance tests of your custom function vs offset like i sent above, because we have seen some degradation in performance using a custom function very similar to yours.

@Sanprof
Copy link

Sanprof commented Sep 22, 2023

@melicheradam sure, we can make comparison tests for both approaches. I don't promise it will be near time but we will make it once we have time for that.

A bit more info about our time_window function)
Normally, we don't call time_window function but use its body directly in the select query (copy/paste stuff :D) when we are querying big data tables. For the small statistical data, we use time_window function directly in the select query because we don't care about milliseconds of execution.

I can't remember exactly the result of the performance tests we made but when we tried aggregate data for ~100,000,000 records and used call of time_window directly in the select query the performance was really low compared with using a body of this function inside the select query.

A long time ago we refused to use such small functions for querying big data tables. E.g. you can find such a function like iif (I guess it came from the excel) with the simple body (CASE WHEN ..) which some database developers like to use to simplify writing queries (and it actually good) but it stops working as expected when you query big data tables - timings for selects straightaway up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature-request Feature proposal
Projects
None yet
Development

No branches or pull requests

5 participants