Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] Top-K enhancement when having plan like SortOperator + LimitOperator #2857

Open
qianheng-aws opened this issue Jul 24, 2024 · 0 comments
Assignees
Labels
enhancement New feature or request

Comments

@qianheng-aws
Copy link
Contributor

qianheng-aws commented Jul 24, 2024

Is your feature request related to a problem?
Currently, when PPL having a sort and a followed head in its query, pattern like:

source | sort $field | head k

if using SQL, it should be pattern like:

select $field from source order by $field limit k;

It will using SortOperator and LimitOperator to calculate the final result if they are not being pushed down into the OpenSearch DSL. As shown in this issue #2802

POST _plugins/_ppl/_explain
{
  "query": """
    source=opensearch_dashboards_sample_data_flights
    | eval FlightMin = FlightTimeMin 
    | sort -FlightMin | head 5 | fields FlightMin
  """
}

# explain results
{
  "root": {
    "name": "ProjectOperator",
    "description": {
      "fields": "[FlightMin]"
    },
    "children": [
      {
        "name": "LimitOperator",
        "description": {
          "limit": 5,
          "offset": 0
        },
        "children": [
          {
            "name": "SortOperator",
            "description": {
              "sortList": {
                "FlightMin": {
                  "sortOrder": "DESC",
                  "nullOrder": "NULL_LAST"
                }
              }
            },
            "children": [
              {
                "name": "EvalOperator",
                "description": {
                  "expressions": {
                    "FlightMin": "FlightTimeMin"
                  }
                },
                "children": [
                  {
                    "name": "OpenSearchIndexScan",
                    "description": {
                      "request": """OpenSearchScrollRequest(initialSearchRequest=SearchRequest{searchType=QUERY_THEN_FETCH, indices=[opensearch_dashboards_sample_data_flights], indicesOptions=IndicesOptions[ignore_unavailable=false, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, expand_wildcards_hidden=false, allow_aliases_to_multiple_indices=true, forbid_closed_indices=true, ignore_aliases=false, ignore_throttled=true], routing='null', preference='null', requestCache=null, scroll=Scroll{keepAlive=1m}, maxConcurrentShardRequests=0, batchedReduceSize=512, preFilterShardSize=null, allowPartialSearchResults=null, localClusterAlias=null, getOrCreateAbsoluteStartMillis=-1, ccsMinimizeRoundtrips=true, source={"from":0,"size":10000,"timeout":"1m"}, cancelAfterTimeInterval=null, pipeline=null, phaseTook=null}, scrollTimeout=1m, indexName=opensearch_dashboards_sample_data_flights, scrollId=, needClean=true)"""
                    },
                    "children": []
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}

In some cases like above where sort and limit operator cannot be pushed down, it may need to scan all documents whose size could be very large, or user just wants to set plugins.query.size_limit to be very large, it will be a challenge for the current implementation.

What solution would you like?
SortOperator + LimitOperator is not the most efficient way to calculate the top-k results, and it may have risk of OOM since SortOperator doesn't support spilling in current implementation(just use java.util.PriorityQueue which is all in memory)

PriorityQueue<ExprValue> sorted = new PriorityQueue<>(1, sorter::compare);
).

As we only care about the top-k elements in output of SortOperator, the better way is to kick off the unnecessary elements and only maintain k elements in this operator. It can both reduce the time complexity to O(nlogk) and memory capacity to O(k), and ensure correct results as well.

What alternatives have you considered?

Do you have any additional context?

@qianheng-aws qianheng-aws added enhancement New feature or request untriaged labels Jul 24, 2024
@qianheng-aws qianheng-aws changed the title [FEATURE] Top-K enhancement when having PPL pattern like source | sort $field | head k [FEATURE] Top-K enhancement when having plan like SortOperator + LimitOperator Jul 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants