Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make recursive queries efficient #1122

Open
d4nielfr4nco opened this issue Jul 14, 2019 · 10 comments
Open

Make recursive queries efficient #1122

d4nielfr4nco opened this issue Jul 14, 2019 · 10 comments
Assignees
Labels
enhancement New feature or request

Comments

@d4nielfr4nco
Copy link

It would be really nice if Presto supported recursive queries.

A potential use case would be to dynamically parse generic json strings to get all nested and unnested json keys.

@electrum
Copy link
Member

electrum commented Jul 14, 2019 via email

@OmerJog
Copy link

OmerJog commented Dec 11, 2019

I'm not really sure why presto doesn't support it.
mentioned here:
https://stackoverflow.com/questions/55428322/do-presto-sql-support-recursive-query-using-cte-just-like-sql-server-e-g-emplo

According to :
https://stackoverflow.com/questions/32037048/sql-tree-structure
It's part of standard SQL and supported by all modern DBMS

Isn't presto must support SQL standard?

@findepi findepi added the enhancement New feature or request label Dec 11, 2019
@kgeis
Copy link

kgeis commented Feb 10, 2020

Here is a recursive query that I use to create a date dimension table. I would love to see it work in Presto!

  WITH dates(calendar_date,
             day_index,
             working_day_index) AS
       (SELECT date '1950-01-01',
               0,
               0
         UNION ALL
        SELECT calendar_date + interval '1' day,
               day_index + 1,
               working_day_index + CASE DOW(calendar_date + interval '1' day) WHEN 6 THEN 0 WHEN 7 THEN 0 ELSE 1 END
          FROM dates
         WHERE calendar_date + interval '1' day < date '2051-01-01')
SELECT calendar_date,
       TO_ISO8601(calendar_date),
       day_index,
       working_day_index
  FROM dates;

@tooptoop4
Copy link
Contributor

https://docs.oracle.com/cd/B28359_01/server.111/b28286/queries003.htm#SQLRF52315 have some good examples too, classic example is grandfather id, ie get the manager id of each employee up the chain

@ddrinka
Copy link
Member

ddrinka commented Apr 14, 2020

I asked a question in the Slack channel that I think could be solved with recursive queries:
If some condition exists, output a row. Based on whether a row was output, decide when to output the next row. In my case I wanted to only output rows at most every 5 seconds based on a timestamp. But I've now run into another scenario where I once again needed the current row's value to decide how to output future rows.

If direction is 1 and total is negative, apply value as delta to the total.
If direction is -1 and total is positive, apply value as delta to the total.
Otherwise, delta is NULL and the total doesn't change.

direction value delta total notes
1 10 10 10
1 20 NULL 10 No addition here because total is positive
-1 30 -30 -20 We can go down
-1 40 NULL -20 Now we're negative, can't keep going down
1 50 50 30 Back up

Poking around led me to this: https://stackoverflow.com/questions/52798009/how-to-use-result-of-previous-calculated-row-based-on-a-condition-in-sql

I think recursive CTEs are what I need...

I assume this is a very large feature?

Obviously not SQL standard, but if there was a way to use the current value of the sum inside a sum() OVER (), this would be by-far the cleanest way to write what I'm working on.

@lhofhansl
Copy link
Member

I might (depending on what happens to #3535) have a related use cases:

I'm adding standard SQL definition_schema base tables to in the PR.
With recursive queries information_schema such as applicable_roles could be expressed as actual views. These need to be recursive because applicable roles are transitive (a user might have been granted a role X by being granted role Y which itself was granted X, a.s.o.)

@martint
Copy link
Member

martint commented May 21, 2020

In order to implement recursive CTEs, we need support for:

  1. temporary tables
  2. multi-step query plans
  3. (possibly), planning-execution loops, but we can work around this as explained further below.

Here's a short analysis of how recursive CTEs work. For a query such as:

WITH RECURSIVE t AS (
    SELECT 1 as n
    UNION ALL
    SELECT n + 1 FROM t WHERE n < 10
)
SELECT * from t

The query is semantically equivalent to doing the following steps, where f(x) is the result of applying the recursion step (i.e., the SELECT n + 1 FROM t WHERE n < 10 query) to the results of the previous iteration, and t0, … tN are the results of each iteration. SELECT 1 AS n is the base or "seed" step:

t0 = SELECT 1
t1 = f(t0)
t2 = f(t1)
...
tN = f(tN-1)

The result of the query is the UNION ALL of t1, t2, …, tN.

If we were to fully expand the query based on current Presto capabilities, the query would turn into:

SELECT 1 n
UNION ALL
SELECT n + 1 FROM (SELECT 1 n) WHERE n < 10
UNION ALL
SELECT n + 1 FROM (SELECT n + 1 FROM (SELECT 1 n) WHERE n < 10) WHERE n < 10
UNION ALL
SELECT n + 1 FROM (SELECT n + 1 FROM (SELECT n + 1 FROM (SELECT 1 n) WHERE n < 10) WHERE n < 10) WHERE n < 10
UNION ALL
SELECT n + 1 FROM (SELECT n + 1 FROM (SELECT n + 1 FROM (SELECT n + 1 FROM (SELECT 1 n) WHERE n < 10) WHERE n < 10) WHERE n < 10) WHERE n < 10
...

We end up with N executions of the base query (instead of, ideally, just 1) and O(N^2) “terms” in the expansion. To be able to do this in a reasonably efficient way, we'd need to:

  1. Be able to store the results of each t_i in a temporary table
  2. Plan and schedule the terms once the previous one completes executing, otherwise the temporary table referenced by each step won't exist during planning, there won't be a TableHandle to refer to it, etc.. We could work around this by setting a max recursion depth and pre-generating the temporary tables for all the steps. We’d still need strict ordering of execution for the terms to ensure that step k reads data from step k - 1 after its temporary table has been populated.

The query could then be executed according to this sequence:

<create temporary tables for t0...tN given a max recursion depth of N>

INSERT INTO t0 SELECT 1
INSERT INTO t1 SELECT n + 1 FROM t0 WHERE n < 10
INSERT INTO t2 SELECT n + 1 FROM t1 WHERE n < 10
...
INSERT INTO tN SELECT n + 1 FROM tN-1 WHERE n < 10

TABLE t0 UNION ALL TABLE t1 UNION ALL TABLE t2 UNION ALL ... tN

@lhofhansl
Copy link
Member

Alternative implementations could use a single temp table with a <depth> column to identify the right recursion depth.

I.e, the i'th query is INSERT INTO <temp> SELECT n + 1, depth + 1 FROM <temp> WHERE n < 10 AND depth = <i>
In that case even the UNION ALL'ing is not needed, since the one temp table has all the data, and for UNION we'd do a DISTINCT scan over all columns but the <depth> column.
For temp tables we might want to have a temp-connector that configures the catalog/schema into which to place to temp tables. Then an install can decide whether to use the memory or other connector for temp tables - based on size and speed.

Or, one could have two temp tables, a working table and an intermediary table and empty the intermediary after each iteration.

@tech-tendai
Copy link

Even the expanded format @martint suggested above is not working for me.

Beyond the second UNION ALL, I get an error "Column 'n' cannot be resolved".

@findepi
Copy link
Member

findepi commented Sep 22, 2020

@tech-tendai you can reach out for help on the #troubleshooting channel on our slack

@hashhar hashhar changed the title Feature Request - Recursive Queries Support Make recursive queries efficient Mar 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Development

No branches or pull requests