Make recursive queries efficient #1122

d4nielfr4nco · 2019-07-14T02:56:36Z

It would be really nice if Presto supported recursive queries.

A potential use case would be to dynamically parse generic json strings to get all nested and unnested json keys.

electrum · 2019-07-14T04:35:25Z

Can you give some examples of the JSON parsing? It might be possible to solve that directly without generic recursive query support.

OmerJog · 2019-12-11T09:23:13Z

I'm not really sure why presto doesn't support it.
mentioned here:
https://stackoverflow.com/questions/55428322/do-presto-sql-support-recursive-query-using-cte-just-like-sql-server-e-g-emplo

According to :
https://stackoverflow.com/questions/32037048/sql-tree-structure
It's part of standard SQL and supported by all modern DBMS

Isn't presto must support SQL standard?

kgeis · 2020-02-10T22:17:50Z

Here is a recursive query that I use to create a date dimension table. I would love to see it work in Presto!

  WITH dates(calendar_date,
             day_index,
             working_day_index) AS
       (SELECT date '1950-01-01',
               0,
               0
         UNION ALL
        SELECT calendar_date + interval '1' day,
               day_index + 1,
               working_day_index + CASE DOW(calendar_date + interval '1' day) WHEN 6 THEN 0 WHEN 7 THEN 0 ELSE 1 END
          FROM dates
         WHERE calendar_date + interval '1' day < date '2051-01-01')
SELECT calendar_date,
       TO_ISO8601(calendar_date),
       day_index,
       working_day_index
  FROM dates;

tooptoop4 · 2020-03-01T14:26:58Z

https://docs.oracle.com/cd/B28359_01/server.111/b28286/queries003.htm#SQLRF52315 have some good examples too, classic example is grandfather id, ie get the manager id of each employee up the chain

ddrinka · 2020-04-14T23:08:19Z

I asked a question in the Slack channel that I think could be solved with recursive queries:
If some condition exists, output a row. Based on whether a row was output, decide when to output the next row. In my case I wanted to only output rows at most every 5 seconds based on a timestamp. But I've now run into another scenario where I once again needed the current row's value to decide how to output future rows.

If direction is 1 and total is negative, apply value as delta to the total.
If direction is -1 and total is positive, apply value as delta to the total.
Otherwise, delta is NULL and the total doesn't change.

direction	value	delta	total	notes
1	10	10	10
1	20	NULL	10	No addition here because total is positive
-1	30	-30	-20	We can go down
-1	40	NULL	-20	Now we're negative, can't keep going down
1	50	50	30	Back up

Poking around led me to this: https://stackoverflow.com/questions/52798009/how-to-use-result-of-previous-calculated-row-based-on-a-condition-in-sql

I think recursive CTEs are what I need...

I assume this is a very large feature?

Obviously not SQL standard, but if there was a way to use the current value of the sum inside a sum() OVER (), this would be by-far the cleanest way to write what I'm working on.

lhofhansl · 2020-05-01T03:19:43Z

I might (depending on what happens to #3535) have a related use cases:

I'm adding standard SQL definition_schema base tables to in the PR.
With recursive queries information_schema such as applicable_roles could be expressed as actual views. These need to be recursive because applicable roles are transitive (a user might have been granted a role X by being granted role Y which itself was granted X, a.s.o.)

martint · 2020-05-21T23:49:05Z

In order to implement recursive CTEs, we need support for:

temporary tables
multi-step query plans
(possibly), planning-execution loops, but we can work around this as explained further below.

Here's a short analysis of how recursive CTEs work. For a query such as:

WITH RECURSIVE t AS (
    SELECT 1 as n
    UNION ALL
    SELECT n + 1 FROM t WHERE n < 10
)
SELECT * from t

The query is semantically equivalent to doing the following steps, where f(x) is the result of applying the recursion step (i.e., the SELECT n + 1 FROM t WHERE n < 10 query) to the results of the previous iteration, and t0, … tN are the results of each iteration. SELECT 1 AS n is the base or "seed" step:

t0 = SELECT 1
t1 = f(t0)
t2 = f(t1)
...
tN = f(tN-1)

The result of the query is the UNION ALL of t1, t2, …, tN.

If we were to fully expand the query based on current Presto capabilities, the query would turn into:

SELECT 1 n
UNION ALL
SELECT n + 1 FROM (SELECT 1 n) WHERE n < 10
UNION ALL
SELECT n + 1 FROM (SELECT n + 1 FROM (SELECT 1 n) WHERE n < 10) WHERE n < 10
UNION ALL
SELECT n + 1 FROM (SELECT n + 1 FROM (SELECT n + 1 FROM (SELECT 1 n) WHERE n < 10) WHERE n < 10) WHERE n < 10
UNION ALL
SELECT n + 1 FROM (SELECT n + 1 FROM (SELECT n + 1 FROM (SELECT n + 1 FROM (SELECT 1 n) WHERE n < 10) WHERE n < 10) WHERE n < 10) WHERE n < 10
...

We end up with N executions of the base query (instead of, ideally, just 1) and O(N^2) “terms” in the expansion. To be able to do this in a reasonably efficient way, we'd need to:

Be able to store the results of each t_i in a temporary table
Plan and schedule the terms once the previous one completes executing, otherwise the temporary table referenced by each step won't exist during planning, there won't be a TableHandle to refer to it, etc.. We could work around this by setting a max recursion depth and pre-generating the temporary tables for all the steps. We’d still need strict ordering of execution for the terms to ensure that step k reads data from step k - 1 after its temporary table has been populated.

The query could then be executed according to this sequence:

<create temporary tables for t0...tN given a max recursion depth of N>

INSERT INTO t0 SELECT 1
INSERT INTO t1 SELECT n + 1 FROM t0 WHERE n < 10
INSERT INTO t2 SELECT n + 1 FROM t1 WHERE n < 10
...
INSERT INTO tN SELECT n + 1 FROM tN-1 WHERE n < 10

TABLE t0 UNION ALL TABLE t1 UNION ALL TABLE t2 UNION ALL ... tN

lhofhansl · 2020-08-13T21:19:52Z

Alternative implementations could use a single temp table with a <depth> column to identify the right recursion depth.

I.e, the i'th query is INSERT INTO <temp> SELECT n + 1, depth + 1 FROM <temp> WHERE n < 10 AND depth = <i>
In that case even the UNION ALL'ing is not needed, since the one temp table has all the data, and for UNION we'd do a DISTINCT scan over all columns but the <depth> column.
For temp tables we might want to have a temp-connector that configures the catalog/schema into which to place to temp tables. Then an install can decide whether to use the memory or other connector for temp tables - based on size and speed.

Or, one could have two temp tables, a working table and an intermediary table and empty the intermediary after each iteration.

tech-tendai · 2020-09-22T08:35:13Z

Even the expanded format @martint suggested above is not working for me.

Beyond the second UNION ALL, I get an error "Column 'n' cannot be resolved".

findepi · 2020-09-22T09:47:40Z

@tech-tendai you can reach out for help on the #troubleshooting channel on our slack

findepi added the enhancement New feature or request label Dec 11, 2019

kasiafi self-assigned this Jun 26, 2020

kasiafi mentioned this issue Jun 26, 2020

Support recursive CTEs #4250

Merged

martint mentioned this issue Aug 10, 2020

WITH RECURSIVE is inefficient #4771

Closed

hashhar changed the title ~~Feature Request - Recursive Queries Support~~ Make recursive queries efficient Mar 4, 2022

martint mentioned this issue Aug 4, 2022

add support for temp tables #5686

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make recursive queries efficient #1122

Make recursive queries efficient #1122

d4nielfr4nco commented Jul 14, 2019

electrum commented Jul 14, 2019 via email

OmerJog commented Dec 11, 2019

kgeis commented Feb 10, 2020

tooptoop4 commented Mar 1, 2020

ddrinka commented Apr 14, 2020

lhofhansl commented May 1, 2020

martint commented May 21, 2020 •

edited

Loading

lhofhansl commented Aug 13, 2020

tech-tendai commented Sep 22, 2020

findepi commented Sep 22, 2020

Make recursive queries efficient #1122

Make recursive queries efficient #1122

Comments

d4nielfr4nco commented Jul 14, 2019

electrum commented Jul 14, 2019 via email

OmerJog commented Dec 11, 2019

kgeis commented Feb 10, 2020

tooptoop4 commented Mar 1, 2020

ddrinka commented Apr 14, 2020

lhofhansl commented May 1, 2020

martint commented May 21, 2020 • edited Loading

lhofhansl commented Aug 13, 2020

tech-tendai commented Sep 22, 2020

findepi commented Sep 22, 2020

martint commented May 21, 2020 •

edited

Loading