-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
datafusion-query-cache
- caching intermediate results for faster repeated queries
#12779
Comments
I need to understand more how DataFusion works but based on your description makes a lot of sense. |
corrected, thank you.
You can configure currently
How you store cached results is up to you, but they're effectively a table of
You could also extract table name pretty easily, or we could make it available, so you could purge the cache however you like. |
That's really cool. |
Do you have any suggestions for pruning old data using a dynamic lower bound? I think that's the cheery on top. From what I understand, the pattern timestamp > now() - interval '1 day' is commonly used and seems difficult to implement effectively since "final" can union the new states but cannot extract old ones. |
@metegenez, yes I agree that's a complicated. I think there are two case:
Luckily the first case which is somewhat easier is more common I think. |
This is very cool, wanted to share a few ideas/concepts that worked well for https://github.com/spiceai/spiceai – recently added simple caching using DataFusion with a slightly different approach – without solving the delta problem between prefetched/cached and actual data. In our case, we control dataset updates in most cases and just perform cache invalidation. Responding with outdated information per configurable TTL for other cases is expected behavior.
Cache implementation: https://github.com/spiceai/spiceai/blob/trunk/crates/cache/src/lru_cache.rs |
Done -- https://github.com/datafusion-contrib/datafusion-query-cache |
The first method is the well-known pre-aggregation, which is widely used. It's possible to automate this by writing a logical plan analyzer for the operation, though it may not support all functions out of the box. Basic functions like count, min, max, sum, last, and first are straightforward, but more complex ones, like median, add complications. In the second case, I recall implementing a retract mechanism for certain aggregation functions to support streaming window operations. While it's not achievable through SQL syntax, adding a physical operator to retract unwanted data might be a viable solution, especially if the filtered column is ordered. To sum up, the first method offers more flexibility when adjusting time ranges, such as moving from the last hour to the last six hours, as long as the data has already been pre-aggregated. |
I've moved the code to https://github.com/datafusion-contrib/datafusion-query-cache, I'll try and resume work on it soon. |
Last week on a call with @alamb and some of the Pydantic team (@adriangb @dmontagu @davidhewitt) I introduced the idea of caching intermediate results in DataFusion timeseries queries to make repeat queries much faster.
Frustrated by the ambivalence with which my idea was received, I've spending the weekend implementing a first draft (😉):
Meet
datafusion-query-cache
.This is obviously a very early WIP, but I think it proves that the intermediate caching idea I proposed works.
Feedback and contributions very welcome, I'd love to donate the project to
datafusion-contrib
even DataFusion itself if there's willingness? My ulterior motive being that the more widely useddatafusion-query-cache
is, the less likely it is that changes in DataFusion will break it.From the README:
How it works (the very quick version)
Let's say you run the query:
Then 10 minutes later you run the same query — by default DataFusion will process every one of the millions of record in the
stock_prices
table again to calculate the result, even though only the last 10 minutes of data has changed.Obvious we could save a lot of time and compute if we could remember the result of the first query, then combining it with a query on the last 10 minutes of data to get a result.
That's what
datafusion-query-cache
does!The key is that often in timeseries data, new data is inserted with a
timestamp
column that is close tonow()
, so it's trivial to know what results we can cache and what results we must recompute.datafusion-query-cache
doesn't have opinions about where the cached data is stored, instead you need to implement theQueryCache
trait to store data. A very simpleMemoryQueryCache
is provided for testing, we should addObjectStoreQueryCache
too.How it works (the longer version)
Some people reading the above example will already being asking
The best bit is: DataFusion already has all the machinery to combine partial query results, so
datafusion-query-cache
doesn't need any special logic for different aggregations, indeed it doesn't even know what they are.Instead we just hook into the right place in the physical plan to provide the cached results, constrain the query on new data and store the new result.
Let's look at an example
The physical plan for
looks something like this (lots of details omitted):
Notice how the
input
for the top levelAggegateExec
is anotherAggegateExec
? That's DataFusion allowing parallel execution by splitting the data into chunks and aggregating them separately. The output of the innerAggegateExec
(notemode: Parital
) will look something like:avg(price)[count]
avg(price)[sum]
The top level
AggegateExec
with (mode: Final
), then combines these partial results to get the final answer.This "combine partial results" is exactly what
datafusion-query-cache
uses to combine the cached result with the new data.So
datafusion-query-cache
, would rewrite the above query to have the following physical plan:The beauty is, if we wrote a more complex query, say:
datafusion-query-cache
doesn't need to be any cleverer, DataFusion does the hard work of combining the partial results, even accounting for the different buckets and aggregations and combining them correctly.Prior art
Other database have similar concepts, e.g. continuous aggregates in TimeScaleDB, but they require explicit setup. In contrast,
datafusion-query-cache
analyses queries (including subqueries) and automatically applies the cache if it can.What's supported
GROUP BY
aggregation queries with a static lower bound (or no lower bound)GROUP BY
) with a static lower bound (or no lower bound)GROUP BY
aggregation queries with a dynamic lower bound (e.g .timestamp > now() - interval '1 day'
) - this requires aFilterExec
wrapping theUnionExec
and discarding older dataGROUP BY
) with a dynamic lower bound - this is harder, we probably have to rewrite the aggregation to include agroup_by
clause, then filter, then aggregate again???How to use
datafusion-query-cache
implementsQueryPlanner
,OptimizerRule
,UserDefinedLogicalNodeCore
andExecutionPlan
to customise query execution.Usage is as simple as calling
with_query_cache
on aSessionStateBuilder
, here's a complete (if minimal) example of creating aSessionContext
:See
examples/demo.rs
for a more complete working example.The text was updated successfully, but these errors were encountered: