[FEATURE] - Support pagination for PPL and SQL query #656

penghuo · 2022-06-23T23:55:01Z

Is your feature request related to a problem?

The new engine fetches only a default number of rows from an OpenSearch index, as set by this setting; the default value is 200. This blocks the ML plugin from pulling more data using PPL.
Cursors are supported only in the V1 engine. We should support them in the V2 engine.

What solution would you like?

Add pagination support for the operator, which will unblock the ML plugin use case.
Add cursor support for the V2 engine.

seankao-az · 2022-07-07T18:41:03Z

[Draft] Design Doc

1 Overview

Support pagination requests for PPL and SQL queries.

2 Problem Statements

The V2 engine limits the response size of a query. This blocks the ML plugin from pulling more data using Piped Processing Language (PPL). While the V1 SQL engine supports pagination using cursor scrolling, PPL uses the V2 engine explicitly. Therefore, we need to migrate cursor support to the V2 engine.

The legacy engine took a stateless approach: the cursor encodes the necessary information for re-constructing the query: https://github.com/opensearch-project/sql/blob/main/docs/dev/Pagination.md
This worked well for SQL, but as PPL allows for extra calculations on the data rows returned from index scanning, we might not be able to rebuild context on the fly.
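To make the stateless idea concrete, here is a minimal sketch of a cursor that carries its own state. The field names (`index`, `fetch_size`, `scroll_id`, `columns`) are hypothetical and are not the legacy cursor format; the point is only that the server can rebuild what it needs from the cursor string alone.

```python
import base64
import json


def encode_cursor(state: dict) -> str:
    """Serialize query state to compact JSON and wrap it in URL-safe base64."""
    raw = json.dumps(state, separators=(",", ":"), sort_keys=True).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_cursor(cursor: str) -> dict:
    """Recover the query state from the cursor alone; no server-side lookup."""
    return json.loads(base64.urlsafe_b64decode(cursor.encode("ascii")))


# Hypothetical state: field names are illustrative, not the legacy format.
state = {
    "index": "accounts",
    "fetch_size": 5,
    "scroll_id": "example-scroll-id",
    "columns": [
        {"name": "firstname", "type": "text"},
        {"name": "lastname", "type": "text"},
    ],
}
cursor = encode_cursor(state)
restored = decode_cursor(cursor)
```

This only works when everything needed for the next page fits in the cursor. A PPL pipeline with intermediate computation state is harder to serialize this way, which is the concern raised above.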

3 Requirements

3.1 Use Cases

3.1.1 Scrolling through web interface

SQL cursor
Sample requests:

POST /_plugin/_sql
{
  "fetch_size": 5,
  "query": "SELECT firstname, lastname FROM accounts WHERE age > 20 ORDER BY state"
}

{
  "schema": [...],
  "cursor": "eyJhIjp7fSwicyI6IkRYRjFaWEo1UVc1a1JtVjBZMmdCQUFBQUFBQUFBQU1XZWpkdFRFRkZUMlpTZEZkeFdsWnJkRlZoYnpaeVVRPT0iLCJjIjpbeyJuYW1lIjoiZmlyc3RuYW1lIiwidHlwZSI6InRleHQifSx7Im5hbWUiOiJsYXN0bmFtZSIsInR5cGUiOiJ0ZXh0In1dLCJmIjo1LCJpIjoiYWNjb3VudHMiLCJsIjo5NTF9",
  "total": 956,
  "datarows": [...],
  "size": 5,
  "status": 200
}

POST /_plugin/_sql
{
  "cursor": "eyJhIjp7fSwicyI6IkRYRjFaWEo1UVc1a1JtVjBZMmdCQUFBQUFBQUFBQU1XZWpkdFRFRkZUMlpTZEZkeFdsWnJkRlZoYnpaeVVRPT0iLCJjIjpbeyJuYW1lIjoiZmlyc3RuYW1lIiwidHlwZSI6InRleHQifSx7Im5hbWUiOiJsYXN0bmFtZSIsInR5cGUiOiJ0ZXh0In1dLCJmIjo1LCJpIjoiYWNjb3VudHMiLCJsIjo5NTF9"
}

{
  "schema": [...],
  "cursor": <next cursor>,
  "total": 956,
  "datarows": [...],
  "size": 5,
  "status": 200
}

PPL cursor
Sample requests:

POST /_plugin/_ppl
{
  "fetch_size": 5,
  "query": "source=accounts | where age > 20 | sort state | fields firstname,lastname"
}

{
  "schema": [...],
  "cursor": <first cursor>,
  "total": 956,
  "datarows": [...],
  "size": 5,
  "status": 200
}

POST /_plugin/_ppl
{
  "cursor": <first cursor>
}

{
  "schema": [...],
  "cursor": <next cursor>,
  "total": 956,
  "datarows": [...],
  "size": 5,
  "status": 200
}
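The request/response pairs above imply a simple client loop: send `fetch_size` with the query once, then keep sending the returned cursor. The sketch below runs that loop against a hypothetical in-memory stand-in for the endpoint, so it needs no cluster. It assumes the last page comes back without a `cursor` field; `make_fake_endpoint` and `fetch_all` are illustrative names, not plugin APIs.

```python
from typing import Callable, Iterator


def make_fake_endpoint(rows: list) -> Callable[[dict], dict]:
    """Hypothetical in-memory stand-in for POST /_plugin/_sql or /_plugin/_ppl."""
    sessions: dict = {}

    def post(body: dict) -> dict:
        if "cursor" in body:
            offset, size = sessions.pop(body["cursor"])
        else:
            offset, size = 0, body["fetch_size"]
        page = rows[offset:offset + size]
        response = {"datarows": page, "size": len(page), "total": len(rows), "status": 200}
        next_offset = offset + size
        if next_offset < len(rows):
            # Assumption: a cursor is returned only while more rows remain.
            cursor = f"cursor-{next_offset}"
            sessions[cursor] = (next_offset, size)
            response["cursor"] = cursor
        return response

    return post


def fetch_all(post: Callable[[dict], dict], query: str, fetch_size: int) -> Iterator[list]:
    """Issue the initial paged query, then follow cursors until none is returned."""
    response = post({"fetch_size": fetch_size, "query": query})
    yield from response["datarows"]
    while "cursor" in response:
        response = post({"cursor": response["cursor"]})
        yield from response["datarows"]


rows = [[f"first{i}", f"last{i}"] for i in range(12)]
post = make_fake_endpoint(rows)
result = list(fetch_all(post, "SELECT firstname, lastname FROM accounts", fetch_size=5))
```

The same loop applies to SQL and PPL, since only the endpoint and the query text differ.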

3.1.2 Interacting with ML command in PPL to load full result set

Sample request:

POST /_plugin/_ppl
{
  "query": "source=iris_data | fields sepal_length_in_cm, sepal_width_in_cm, petal_length_in_cm, petal_width_in_cm | kmeans 3"
}

Users do not need to add the fetch_size parameter. The ml command implicitly invokes search and scroll to fetch the entire dataset.

3.1.3 Extending query size limit

See #703.
Extending the query size limit allows querying and processing large data sets. With it, the use case in 3.1.2 can be partly resolved.
Sample request:

POST /_plugin/_ppl
{
  "query": "source=iris_data | head 100000 | fields sepal_length_in_cm, sepal_width_in_cm, petal_length_in_cm, petal_width_in_cm | kmeans 3"
}

3.2 Functional Requirements

Scrolling support for V2 query backend engine
Scrolling support for SQL and PPL requests

4 Measure of Success

Be able to run sample queries in Section 3.1.1
Extend size_limit setting in query engine to support unlimited index query. #703

5 Design

5.1 Background

5.1.1 Current execution flow

Query request

POST /_plugin/_sql
{
  "query": "SELECT * FROM table"
}

execution path: (1a) → (2a) → (3)

Scroll query request

POST /_plugin/_sql
{
  "fetch_size": 5,
  "query": "SELECT * FROM table"
}

execution path: (1a) → (2b) → (3) → legacy OpenSearch request

Scroll cursor request

POST /_plugin/_sql
{
  "cursor": <cursor>
}

execution path: (1b) → legacy cursor executor

5.2 Design considerations

5.2.1 Physical plan generation

The physical plan for an index scan no longer depends solely on the logical plan for the query. The presence or absence of the fetch_size parameter decides whether to make a regular query request or a scroll request to OpenSearch. Therefore, the request handler should pass this information down to the point where the physical plan is generated.

Option 1

Let the Planner generate different physical plans for query and scroll requests.

plan(logical plan, is not scroll) => physical index scan with query request
plan(logical plan, is scroll) => physical index scan with scroll request

Option 2

Let the Analyzer generate different logical plans for query and scroll requests.

analyze(unresolved plan, is not scroll) => logical index scan for query request
analyze(unresolved plan, is scroll) => logical index scan for scroll request

5.2.2 Scrolling interaction with limit/offset

Limit

#703

Offset

OpenSearch doesn't allow scroll requests to have offset != 0.
To bypass this restriction, we can set the offset of a scroll request to be 0 and skip the first few results for the user before returning. However, it is suggested that search_after be used if we need to page through more than index.max_result_window hits.
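The skip-on-our-side workaround can be sketched as follows. `scroll_batches` is a hypothetical stand-in for successive scroll responses, which always start from offset 0; `skip_offset` then drops the first `offset` hits before anything reaches the user.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def scroll_batches(hits: List[int], batch_size: int) -> Iterator[List[int]]:
    """Hypothetical stand-in for successive scroll responses, always from offset 0."""
    for start in range(0, len(hits), batch_size):
        yield hits[start:start + batch_size]


def skip_offset(batches: Iterable[List[int]], offset: int) -> Iterator[int]:
    """Flatten scroll batches and drop the first `offset` hits on the plugin side."""
    flat = (hit for batch in batches for hit in batch)
    return islice(flat, offset, None)


hits = list(range(10))
visible = list(skip_offset(scroll_batches(hits, batch_size=4), offset=6))
```

Note that the skipped hits are still fetched from OpenSearch, so the cost of this approach grows with the offset.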

5.2.3 Physical plan retention

Some queries require extra calculations, such as the parse, eval, and dedup commands in PPL.
When handling pagination for source=index | other commands, the scroll request to OpenSearch handles only the first part, source=index, but not the rest, | other commands. We should create a context and keep it alive across scroll requests.

Context lookup

Where does such context live? In local memory? Shared storage?
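As one possible answer, here is a minimal sketch of the local-memory option: a store keyed by a generated cursor ID that keeps each context alive for a TTL and renews it on access. `LocalContextStore` and the shape of the stored context are hypothetical; this is not a decision made in the design.

```python
import time
import uuid
from typing import Any, Callable, Dict, Optional, Tuple


class LocalContextStore:
    """Hypothetical in-memory store that keeps per-cursor context alive for a TTL."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def put(self, context: Any) -> str:
        """Store a context and return the opaque ID that identifies it."""
        cursor_id = uuid.uuid4().hex
        self._entries[cursor_id] = (self._clock() + self._ttl, context)
        return cursor_id

    def get(self, cursor_id: str) -> Optional[Any]:
        """Return the context if still alive, renewing its lease; else None."""
        entry = self._entries.get(cursor_id)
        if entry is None:
            return None
        expires_at, context = entry
        if self._clock() >= expires_at:
            del self._entries[cursor_id]
            return None
        # Each access renews the lease, similar to a scroll keep-alive.
        self._entries[cursor_id] = (self._clock() + self._ttl, context)
        return context


# A controllable clock makes the expiry behavior easy to follow.
now = [0.0]
store = LocalContextStore(ttl_seconds=60, clock=lambda: now[0])
cursor_id = store.put({"scroll_id": "example-scroll-id", "pending": ["eval", "dedup"]})
now[0] = 30.0
alive = store.get(cursor_id)
now[0] = 200.0
expired = store.get(cursor_id)
```

A store like this is only visible to the node that created it, so a follow-up request routed to another node would not find the context. That is the trade-off against shared storage.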

5.2.4 Backward compatibility with legacy cursor format

https://github.com/opensearch-project/sql/blob/main/legacy/src/main/java/org/opensearch/sql/legacy/cursor/DefaultCursor.java
This cursor was designed for stateless requests.

5.3 Design Overview

5.3.1 Journey of Request

5.3.2 Backend: OpenSearch Request

Interface to the OpenSearch engine

OpenSearchQueryRequest: for regular query request
OpenSearchScrollRequest: for scroll request
- Stateful to maintain scroll ID between calls to client search method
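A rough sketch of the stateful behavior described for the scroll request: the object remembers the scroll ID returned by one search call and passes it to the next. The `SearchFn` signature and the fake client are hypothetical stand-ins, not the real OpenSearch client API.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical client signature: takes the previous scroll ID (None on the first
# call) and returns (next scroll ID, hits). Not the real OpenSearch client API.
SearchFn = Callable[[Optional[str]], Tuple[Optional[str], List[dict]]]


class ScrollRequest:
    """Sketch of a request object that remembers the scroll ID between calls."""

    def __init__(self) -> None:
        self.scroll_id: Optional[str] = None
        self.exhausted = False

    def search(self, client: SearchFn) -> List[dict]:
        if self.exhausted:
            return []
        self.scroll_id, hits = client(self.scroll_id)
        if not hits:
            self.exhausted = True
        return hits


def make_fake_client(docs: List[dict], batch_size: int) -> SearchFn:
    """Fake client whose scroll ID is simply the next start position."""
    def client(scroll_id: Optional[str]) -> Tuple[Optional[str], List[dict]]:
        start = 0 if scroll_id is None else int(scroll_id)
        batch = docs[start:start + batch_size]
        return str(start + len(batch)), batch
    return client


docs = [{"id": i} for i in range(5)]
client = make_fake_client(docs, batch_size=2)
request = ScrollRequest()
batches = []
while True:
    hits = request.search(client)
    if not hits:
        break
    batches.append(hits)
```

Because the scroll ID lives inside the request object, callers never handle it directly; they just call `search` again.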

5.3.3 Backend: Physical Plan Generation

We add a new request builder to the OpenSearchIndexScan plan.
Currently, the physical plan depends solely on the (optimized) logical plan. However, the same logical plan can lead to different physical plans because:

- A regular query request and a scroll query request generate the exact same logical plan, but they need different physical plans to invoke different OpenSearch requests.
- The presence or absence of ML commands can determine how we want to fetch data.
- Other commands could also require scrolling depending on the parameters passed in. See "Extend size_limit setting in query engine to support unlimited index query" #703.

We add a PlanContext component to solve the problem. The PlanContext can be set during request handling and query analyzing. When building the physical plan, the planner uses it to decide which index scan mode to use.
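The decision the planner makes from the PlanContext might look roughly like this. The fields `fetch_size` and `has_ml_command` and the `choose_scan_mode` function are hypothetical; the text above only says that the context is set during request handling and query analysis and read by the planner.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ScanMode(Enum):
    QUERY = "query"
    SCROLL = "scroll"


@dataclass
class PlanContext:
    """Hypothetical fields; the source only says the context is set during
    request handling and query analysis."""
    fetch_size: Optional[int] = None   # set by the request handler
    has_ml_command: bool = False       # set while analyzing the query


def choose_scan_mode(context: PlanContext) -> ScanMode:
    """Pick the index scan mode for an otherwise identical logical plan."""
    # Assumption: a missing or zero fetch_size means a regular, non-paged query.
    if context.fetch_size is not None and context.fetch_size > 0:
        return ScanMode.SCROLL
    if context.has_ml_command:
        return ScanMode.SCROLL
    return ScanMode.QUERY


regular = choose_scan_mode(PlanContext())
paged = choose_scan_mode(PlanContext(fetch_size=5))
ml = choose_scan_mode(PlanContext(has_ml_command=True))
```

Keeping the decision in one function means the logical plan stays identical for both request types, and only the index scan differs.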

5.3.4 Frontend: Request Handling

If the request is a scroll query request, set PlanContext so that this information is passed to the planner later. This same logic will apply to both SQL and PPL request handling.

5.3.5 Frontend: Response Formatting

5.3.6 Cursor generation and mapping

vmmusings · 2022-07-15T20:26:34Z

In the second cursor request, can there be a scenario with some extra calculations along with push down? How do you retain the entire physical plan?

seankao-az · 2022-07-25T20:42:29Z

In the second cursor request, can there be a scenario with some extra calculations along with push down? How do you retain the entire physical plan?

Thanks for pointing this out. I've come up with a design idea that takes this into consideration. It also avoids many of the problems faced in the PoC PR #693.

The design in poc: scroll query request #693 uses the scroll ID returned from OpenSearchScrollRequest as the cursor response for the SQL/PPL request. The new design hides the scroll ID and exposes a new cursor that we generate, so we no longer need to break the PhysicalPlan interface to get the scroll ID.
The new design also retains the physical plan for extra calculations.

penghuo · 2022-11-22T16:29:01Z

Related to #947

rrlamichhane · 2023-03-03T04:03:07Z

What's the status of this?

Yury-Fridlyand · 2023-03-03T19:03:21Z

WIP, you can track it in Bit-Quill#226

Yury-Fridlyand · 2023-03-10T02:34:49Z

OpenSearch.SQL.pagination.phase.1.demo.mp4

Yury-Fridlyand · 2023-05-18T22:50:18Z

Pagination.with.WHERE.clause.mp4

Yury-Fridlyand · 2023-06-17T01:01:34Z

- Pagination foundation: support SELECT * FROM <table> - Support for pagination in v2 engine of SELECT * FROM <table> queries #1666
- Support WHERE clause - Pagination Phase 2: Support WHERE clause, column list in SELECT clause and for functions and expressions in the query. #1500
- Support ORDER BY clause - Pagination Phase 2: Support ORDER BY clauses and queries without FROM. #1599
- Support LIMIT clause - WIP
- Support system queries - tracked by [FEATURE] Paginate system queries in v2 #1712
- Support OFFSET clause
- Support aggregation and GROUP BY clause
- Support NESTED function - tracked by [FEATURE] Pagination with NESTED function in v2 #1718
- Support pagination in PPL

Yury-Fridlyand · 2023-06-20T21:55:22Z

Superseded by #1759

penghuo added enhancement New feature or request untriaged labels Jun 23, 2022

seankao-az mentioned this issue Jul 15, 2022

poc: scroll query request #693

Closed

6 tasks

seankao-az mentioned this issue Jul 15, 2022

PoC fetch all #697

Closed

6 tasks

anirudha removed the untriaged label Jul 25, 2022

This was referenced Jul 28, 2022

Adds plan context for Scroll physical plan building #713

Closed

Extend size_limit setting in query engine to support unlimited index query. #703

Open

penghuo mentioned this issue Nov 22, 2022

[BUG] Unknown index when querying data with fetch_size #947

Open

Yury-Fridlyand mentioned this issue Mar 10, 2023

Support pagination in V2 engine, phase 1 Bit-Quill/opensearch-project-sql#226

Merged

18 tasks

Yury-Fridlyand mentioned this issue Mar 28, 2023

Support pagination in V2 engine, phase 1 (#226) #1483

Closed

6 tasks

Yury-Fridlyand mentioned this issue Apr 5, 2023

Support pagination in V2 engine, phase 1 #1497

Merged

6 tasks

Yury-Fridlyand added the pagination Pagination feature, ref #656 label May 1, 2023

acarbonetto mentioned this issue Jun 19, 2023

[BUG] Non-paginated results returned silently if query not supported by cursor #78

Open

Yury-Fridlyand mentioned this issue Jun 20, 2023

[FEATURE] Implement pagination in V2 #1759

Open

18 tasks

Yury-Fridlyand closed this as completed Jun 20, 2023
