
Fixes a bug when running queries with a limit clause #16643

Merged
merged 15 commits, Jul 9, 2024

Conversation

@adarshsanjeev (Contributor) commented Jun 24, 2024

When creating the queryDefinition and calculating the shuffleSpecs, we defaulted to shuffling everything into a single partition whenever a limit was required, regardless of the resultShuffleSpecFactory passed to the queryKit. This was followed by the limit stage having no shuffling of its own (as it would always create a single output partition).

This logic is correct as long as we do not require any specific partitioning for the final output, since the limit would be the final MSQ stage (this is the case, for instance, when writing SELECT query results into a query report).

However, for query types which expect partitions of a specific size, this causes a bug. For example, consider an async query which applies a LIMIT of 1000 with the context parameter rowsPerPage: 100. We would expect the limit processor to create 10 segments of 100 rows each. However, the limit processor would still create a single segment as output. This causes issues for queries which expect multiple partitions.
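
As a quick illustration (a sketch with hard-coded values, not Druid code), the expected page count in that example is just ceiling division:

// Hypothetical illustration of the expected paging; not Druid code.
public class PagingExample
{
  public static void main(String[] args)
  {
    final long limit = 1000;       // LIMIT clause
    final long rowsPerPage = 100;  // context parameter
    // The limit stage should emit ceil(limit / rowsPerPage) pages.
    final long expectedPages = (limit + rowsPerPage - 1) / rowsPerPage;
    System.out.println(expectedPages);  // 10; the buggy code produced 1 partition instead
  }
}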

This issue is present for all SELECT queries except those which write to the query report. REPLACE or INSERT queries which use ALL partitioning are also affected (since we do not allow LIMIT together with any other partitioning).

Even worse, we still sometimes expect these partitions, and this causes an exception when the subsequent stage tries to read them.

This PR aims to correct this by adding a shuffle based on the resultShuffleSpecFactory after the limit processor, depending on the query destination. LimitFrameProcessors currently do not update the partition boosting column, so we also add the boost column to the previous stage, if one is required.
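
To make the intent concrete, here is a minimal Java sketch of that decision. The helper name shuffleSpecAfterLimit is hypothetical and the types are simplified; the actual logic lives in the MSQ query kits.

// Illustrative sketch only; not the actual Druid MSQ code.
// If the destination expects outputs of a specific size (segments or pages),
// the limit stage itself must shuffle its output using the result factory.
static ShuffleSpec shuffleSpecAfterLimit(
    final boolean needsFinalShuffling,           // destination expects sized partitions
    final ShuffleSpecFactory resultShuffleSpecFactory,
    final ClusterBy resultClusterBy
)
{
  if (needsFinalShuffling) {
    // e.g. target size = rowsPerSegment for INSERT/REPLACE, rowsPerPage for async SELECT
    return resultShuffleSpecFactory.build(resultClusterBy, true);
  } else {
    // Writing to the task report: a single output partition is fine, so no further shuffle.
    return null;
  }
}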

Behaviour changes

INSERT + LIMIT with PARTITIONED BY ALL

Before: Only the scan stage would have a shuffle spec (mix). Would create a single segment. rowsPerSegment would not be respected.

After: The scan stage has a mix shuffle spec, and the limit stage has a sort shuffle spec. The rowsPerSegment parameter of 1000 is respected. If we have a clustering, the first scan stage has a sort shuffle spec instead.

INSERT + LIMIT with PARTITIONED BY DAY (if we supported it)

Before: Only the scan stage would have a shuffle spec (mix, this would ignore day partitioning). Would create a single segment. rowsPerSegment would not be respected.
After: Scan stage has mix shuffle, limit stage has sort shuffle spec. rowsPerSegment parameter of 1000 is respected. (probably)
SELECT * FROM foo LIMIT 100 -> taskReport
Before: Works as expected.
After: Works as expected.

SELECT * FROM foo LIMIT 100 -> durableStorage (async queries)
Before: Only the scan stage would have a shuffle spec (mix). Would create a single segment. rowsPerPage would not be respected.
After: The scan stage has a mix shuffle spec. The limit stage has a sort-with-target-size shuffle, determined by rowsPerPage. We have an additional selectResults stage.

SELECT * FROM foo LIMIT 100 ORDER BY dim1 -> durableStorage (async queries)
Before: Only the scan stage would have a shuffle spec (mix). Would create a single segment. rowsPerPage would not be respected.
After: The scan stage has a sort shuffle spec with target count of worker count. The limit stage has a sort with target size shuffle, determined by rowsPerPage. We have an additional selectResults stage.
SELECT * FROM foo LIMIT 100 -> export
Before: Only the scan stage would have a shuffle spec (mix). Would create a single segment. rowsPerPage would not be respected.
After: The scan stage has a mix shuffle spec. The limit stage has a sort with target size shuffle, determined by rowsPerPage.
SELECT * FROM (SELECT * FROM foo ORDER BY dim1 LIMIT 10000) ORDER BY dim2 LIMIT 5000 with 1 worker
Before: Works as expected.
After: Works as expected.

SELECT * FROM (SELECT * FROM foo ORDER BY dim1 LIMIT 10000) ORDER BY dim2 LIMIT 5000 with 2 workers
Before: The inner limit produces a single partition on one of the workers, while the other workers produce nothing for the limit stage. The next stage throws an exception when a worker tries to read one of these missing partitions.

2024-07-05T14:02:38,587 INFO [controller-status-checker-0] org.apache.druid.msq.exec.WorkerImpl - Controller task[query-3eacaa7e-d5a0-434e-934a-d319751fd6a4] for worker task[query-3eacaa7e-d5a0-434e-934a-d319751fd6a4-worker1_0] failed. Canceling.
2024-07-05T14:02:38,589 INFO [task-runner-0-priority-0] org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "query-3eacaa7e-d5a0-434e-934a-d319751fd6a4-worker1_0",
  "status" : "FAILED",
  "duration" : 1319,
  "errorMsg" : "WorkerRpcFailed: RPC call to task failed unrecoverably: [query-3eacaa7e-d5a0-434e-934a-d319751fd6a4-worker0_0]",
  "location" : {
    "host" : null,
    "port" : -1,
    "tlsPort" : -1
  }
}

After: Works as expected. The scan stages have shuffles with a target partition count of 1 (to handle the sorting), the inner limit stage has a sort with a target partition count of 3 (for the worker count), and the final limit has a sort with targetSize = 1000 (rowsPerPage). We have a selectResults stage since we are doing a sort in the previous stage.

INSERT + Select dim1, count(*) GROUP BY 1 LIMIT 5000 PARTITIONED BY ALL with rowsPerSegment:1000
Before: The group by stages have globalSort and null shuffle specs. The limit stage creates a single partition.
After: The limit stage has a sort shuffle spec into targetSize. The rowsPerSegment parameter of 1000 is respected.

GroupBy -> taskReport
Before: Works as expected.
After: Works as expected.

GroupBy -> durableStorage (asyncQuery)
Before: The group by stages have globalSort and null shuffle specs. The limit stage creates a single partition. Would create a single page. rowsPerPage would not be respected.
After: The limit stage has a sort shuffle spec into targetSize. We always add an additional selectResults stage. The rowsPerPage parameter of 50 is respected.

GroupBy -> export
Before: The group by stages have globalSort and null shuffle specs. The limit stage creates a single partition. Would create a single page. rowsPerPage would not be respected.
After: The limit stage has a sort shuffle spec into targetSize. rowsPerPage parameter of 50 is respected.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@github-actions bot added the Area - Batch Ingestion and Area - MSQ (for multi stage queries - https://github.com/apache/druid/issues/12262) labels on Jun 24, 2024
@LakshSingla (Contributor) left a comment

Thanks for the patch. The finalShuffleSpec must also be set in the case of ingest queries. Otherwise, MSQ will disregard the rowsPerSegment parameter if we have a LIMIT clause.

@adarshsanjeev (Contributor, Author) commented Jun 24, 2024

The finalShuffleSpec must also be set in the case of ingest queries.

Ingest queries do not allow limit, unless the partitioning is ALL, so this case shouldn't happen.

I see, the case you mean is ALL partitioning with LIMIT, which should create more than one partition. I think this is an existing bug. I have included this case as well. Thanks for the catch!

@adarshsanjeev changed the title from "Fixes an bug when using limit with durable storage queries" to "Fixes an bug when running queries with a limit clause" on Jun 26, 2024
@@ -132,12 +133,12 @@ public QueryDefinition makeQueryDefinition(
     shuffleSpecFactoryPreAggregation = intermediateClusterBy.isEmpty()
         ? ShuffleSpecFactories.singlePartition()
         : ShuffleSpecFactories.globalSortWithMaxPartitionCount(maxWorkerCount);
-    shuffleSpecFactoryPostAggregation = doLimitOrOffset
+    shuffleSpecFactoryPostAggregation = doLimitOrOffset && !needsFinalShuffling
@LakshSingla (Contributor) commented Jun 26, 2024

Same reason as on the other line - Consider a query like the following where LIMIT is not the final stage (example query, assuming the inner limit is not in the final stage):

SELECT col1 FROM (SELECT col1 FROM foo GROUP BY col1 LIMIT 1) LIMIT 20

If the group by returns 10 partitions, the limit would work on them individually, which is not desirable. I think this change isn't required as we are always setting up limit at the end.

Can you also add a case with such a query where the limit isn't a final stage?

Contributor (PR author):

Since it is a subquery, the needsFinalShuffling should always be false here, so I believe that case should still work.

Contributor:

My example was incorrect, but I am still doubtful. The only reason this change seems incorrect is what happens if doLimitOrOffset=true but the factory doesn't generate a single partition.
For example: EXPORT ... SELECT col1 FROM foo GROUP BY col1 LIMIT 1
In that case, the LIMIT wouldn't be correct.

The limit processor must always receive a single partition, and having ShuffleSpecFactories.singlePartition ensures that. The earlier code ensured that. The new code doesn't ensure it, and at best it has the same result as the old code, or it is incorrect (as is my guess). Alternatively, in which case would this change be useful over the original code?

Comment on lines 189 to 191
final ShuffleSpec finalShuffleSpec = needsFinalShuffling ?
shuffleSpecFactoryPreAggregation.build(resultClusterBy, true) :
null;
Contributor:

I wonder if we can relax this constraint even when we don't have a final shuffle. In intermediate queries, having more partitions would improve parallelization. Why do we not partition further in the case of the LIMIT processor?

@@ -119,7 +120,8 @@ public QueryDefinition makeQueryDefinition(
     // We ignore the resultShuffleSpecFactory in case:
     // 1. There is no cluster by
     // 2. There is an offset which means everything gets funneled into a single partition hence we use MaxCountShuffleSpec
-    if (queryToRun.getOrderBys().isEmpty() && hasLimitOrOffset) {
+    // 3. The destination does not require shuffling after the limit stage to get outputs of a specific size.
+    if (queryToRun.getOrderBys().isEmpty() && hasLimitOrOffset && !needsFinalShuffling) {
Contributor:

Why is this change required?

Contributor (PR author):

If this is not checked, it would short-circuit to using the MixShuffleSpec. We need to add the boost column, create the clusterBy, and so on for the scan stage if we intend to have a final shuffle, so this check is required.
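
A rough sketch of the scan-stage branch this check controls follows. The class and constant names here are best guesses at the relevant MSQ types and are simplified, so treat this as an assumption rather than the exact code.

// Simplified sketch of the scan-stage decision described above; not the exact Druid code.
if (queryToRun.getOrderBys().isEmpty() && hasLimitOrOffset && !needsFinalShuffling) {
  // No ordering, a limit is present, and the destination does not care about output sizes:
  // everything funnels into a single partition, so a mix (no-sort) shuffle is enough.
  scanShuffleSpec = MixShuffleSpec.instance();
} else {
  // A final shuffle will follow the limit stage, so the scan stage must sort by the
  // cluster-by columns and populate the boost column now, because the limit processor
  // does not generate boost values itself.
  clusterByColumns.add(new KeyColumn(QueryKitUtils.PARTITION_BOOST_COLUMN, KeyOrder.ASCENDING));
  scanShuffleSpec = resultShuffleSpecFactory.build(new ClusterBy(clusterByColumns, 0), false);
}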

@adarshsanjeev changed the title from "Fixes an bug when running queries with a limit clause" to "Fixes a bug when running queries with a limit clause" on Jun 26, 2024
@adarshsanjeev (Contributor, Author) commented:
I have refactored the code in the QueryKits and added additional tests to address the comments.

@@ -185,7 +178,7 @@ public QueryDefinition makeQueryDefinition(
     .inputs(new StageInputSpec(firstStageNumber))
     .signature(signatureToUse)
     .maxWorkerCount(1)
-    .shuffleSpec(null) // no shuffling should be required after a limit processor.
+    .shuffleSpec(finalShuffleSpec) // Apply the final shuffling after limit spec.
Contributor:

This seems like a lot of shuffling. Is there any way we can avoid reshuffling the data by the same cluster by, and just repartition? Perhaps not without any supersorter changes, but I wanted to confirm.

@LakshSingla (Contributor) commented Jul 2, 2024

Well maybe we can preserve the optimisation that we had earlier - if there's no orderBy and if doLimitOrOffset == true, we don't need to partition boost the intermediate shuffleSpec.

Actually, we don't need to partitionBoost the intermediate shuffle spec in any case (i.e. the shuffleSpec for the scan stage if there's a limit present) - Since it's all going into a single partition anyway, the partitionBoost won't have any use.
i.e. If there's a limit present, only the final stage should have partition boosting.

Contributor (PR author):

The initial ScanFP is the one that increments the boost column; if we do not apply the boost column at that stage, the limit processor output would have boost values of all 0, which can't be split.
I guess additional changes would be needed to allow boosting to work with LimitProcessors before this optimization can be made.
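
For readers unfamiliar with partition boosting, a toy sketch of the idea follows. The __boost column name and the row representation are assumptions for illustration only; this is not the actual ScanFP or frame-processor code.

import java.util.List;
import java.util.Map;

// Toy illustration of partition boosting; not actual Druid frame-processor code.
// The scan stage appends a monotonically increasing value to each row so that rows with
// identical cluster-by keys can still be split across output partitions downstream.
final class BoostExample
{
  static void addBoostColumn(final List<Map<String, Object>> scannedRows)
  {
    long boostCounter = 0;
    for (final Map<String, Object> row : scannedRows) {
      row.put("__boost", boostCounter++);  // the scan stage increments this per row
    }
    // A limit processor only truncates its input; it does not assign new boost values.
    // If the scan stage never populated the boost column, every row would carry 0 here
    // and the downstream shuffle could not split an oversized partition.
  }
}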

@LakshSingla (Contributor) left a comment

Final comments after the last change. The patch looks good now.

@LakshSingla (Contributor) left a comment

LGTM, thanks for the patch!

@cryptoe (Contributor) left a comment

@adarshsanjeev Please update the description with the before and after changes to the number of stages and what kind of local testing was done.

@asdf2014 (Member) left a comment

🚀

scanShuffleSpec = finalShuffleSpec;
} else {
// If there is a limit spec, check if there are any non-boost columns to sort in.
boolean requiresSort = clusterByColumns.stream()
@LakshSingla (Contributor) commented Jul 8, 2024

Do we need to see if we are sorting on the non-boost column? Isn't that automatically added by the ScanQueryKit et al.? Maybe we can simplify the condition by checking if the orderBy is non-empty (before adding any boosting)?

Contributor (PR author):

We have an if-branch for window functions above this bit of code, which adds its own cluster-by columns that are not dependent on the orderBy.

Comment on lines +175 to +176
// Note that we still need the boost column to be present in the row signature, since the limit stage would
// need it to be populated to do its own shuffling later.
Contributor:

I was under the impression that the limit factory couldn't partition boost. Am I mistaken?

@LakshSingla (Contributor) commented Jul 8, 2024

OTOH, if there isn't any partition boosting, then even the original code would have run into a similar problem if there wasn't a cluster key, right? Partitions that are too large.

Contributor (PR author):

The limit factory does not increment the partition boosting column at regular intervals. The values would all be 0 if the column was not added to the row signature at the scan stage.

even the original code would have run into similar problem

The original code would have used a mix shuffle spec. It would always make a single partition, so yes, it would have been too large.

@LakshSingla LakshSingla merged commit af5399c into apache:master Jul 9, 2024
54 checks passed
sreemanamala pushed a commit to sreemanamala/druid that referenced this pull request Aug 6, 2024
Add a shuffling based on the resultShuffleSpecFactory after a limit processor depending on the query destination. LimitFrameProcessors currently do not update the partition boosting column, so we also add the boost column to the previous stage, if one is required.
@kfaraz kfaraz added this to the 31.0.0 milestone Oct 4, 2024