
Loki: Fix multitenant querying #7708

Merged
merged 9 commits into grafana:main Nov 24, 2022

Conversation

DylanGuedes
Contributor

@DylanGuedes DylanGuedes commented Nov 16, 2022

What this PR does / why we need it:
We recently broke multitenant querying with the recent changes to how timeouts work across Loki. This PR fixes that by:

  • Adapting the timeout wrapper to work with multitenant queries, taking the shortest timeout across all given tenants (see the sketch after this list)
  • Adapting the query engine's timeout assignment to work with multitenant queries, again taking the shortest timeout across the tenants
  • Adapting query sharding to use the smallest max query parallelism across the given tenants
  • Adding a functional test to ensure multitenant querying behaves as expected
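
To illustrate the "most restrictive wins" idea for timeouts, here is a minimal sketch. The `Limits` interface and the `shortestQueryTimeout` helper are hypothetical names for this example (not Loki's actual API); it assumes the dskit `tenant` package for extracting tenant IDs from the request context:

```go
package timeouts

import (
	"context"
	"time"

	"github.com/grafana/dskit/tenant"
)

// Limits is a hypothetical per-tenant limits interface used only for this
// sketch; the real Loki limits interface looks different.
type Limits interface {
	QueryTimeout(tenantID string) time.Duration
}

// shortestQueryTimeout returns the smallest configured query timeout across
// all tenants of a (possibly multi-tenant) request, so the query never runs
// longer than its most restrictive tenant allows.
func shortestQueryTimeout(ctx context.Context, limits Limits) (time.Duration, error) {
	// TenantIDs returns all tenant IDs of the request (one or more) and
	// errors out if the context carries none, so the slice is never empty.
	tenantIDs, err := tenant.TenantIDs(ctx)
	if err != nil {
		return 0, err
	}

	timeout := limits.QueryTimeout(tenantIDs[0])
	for _, id := range tenantIDs[1:] {
		if t := limits.QueryTimeout(id); t < timeout {
			timeout = t
		}
	}
	return timeout, nil
}
```

The timeout wrapper and the engine's deadline assignment would then apply the returned duration as the context deadline for the whole multi-tenant query.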

Signed-off-by: DylanGuedes [email protected]
Signed-off-by: Mehmet Burak Devecí [email protected]

Which issue(s) this PR fixes:
#7696

Special notes for your reviewer:
The regression was probably introduced by #7555

@grafanabot
Collaborator

./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0.2%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@chaudum
Contributor

chaudum commented Nov 16, 2022

@DylanGuedes can you check with #7703 so you don't do things twice? :)

@DylanGuedes
Contributor Author

@DylanGuedes can you check with #7703 so you don't do things twice? :)

Good call. I left a comment on that PR recommending that we proceed with this one instead, since I'm adding a test that verifies the fix actually works, and I noticed we also have to adapt things in other places.

@DylanGuedes DylanGuedes marked this pull request as ready for review November 21, 2022 15:10
@DylanGuedes DylanGuedes requested a review from a team as a code owner November 21, 2022 15:10
@DylanGuedes DylanGuedes added the backport release-2.7.x label (add to a PR to backport it into release 2.7.x) Nov 21, 2022
@DylanGuedes DylanGuedes changed the title Loki: Use longest timeout across all tenants Loki: Fix multitenant querying Nov 21, 2022
@grafanabot
Collaborator

./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0.2%
+ querier/queryrange	0.1%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@pull-request-size pull-request-size bot added size/L and removed size/M labels Nov 21, 2022
@grafanabot
Collaborator

./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0.2%
+ querier/queryrange	0.1%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@jeschkies jeschkies self-requested a review November 21, 2022 15:48
@JordanRushing
Contributor

JordanRushing commented Nov 21, 2022

Does it make sense to opt for the lower of multiple limits rather than the higher? The former doesn't introduce unanticipated increases in resource consumption, whereas the latter does, if I understand correctly.

@DylanGuedes
Contributor Author

Does it make sense to opt for the lower of multiple limits rather than the higher? The former doesn't introduce unanticipated increases in resource consumption, whereas the latter does, if I understand correctly.

Good question. I feel like we can find arguments for both, but personally I think going for the higher values is better: it avoids the scenario where a smaller tenant with sufficiently small limits makes queries inoperable for bigger tenants.

Contributor

@jeschkies jeschkies left a comment


This is a really great find. Thanks for these awesome tests! However, we should always use the most restrictive limit. See #5626 for the original introduction.

pkg/logql/engine.go (outdated review thread, resolved)
@@ -99,17 +99,25 @@ func (ast *astMapperware) Do(ctx context.Context, r queryrangebase.Request) (que
return ast.next.Do(ctx, r)
}

userID, err := tenant.TenantID(ctx)
// use biggest max query parallelism across all given tenants
Contributor


Same here. We should take the most restrictive.
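
For completeness, the "most restrictive wins" selection for max query parallelism could look like the sketch below; the helper name and the `maxParallelism` getter are illustrative only, not the exact Loki code:

```go
// Package sharding is only a placeholder name for this sketch.
package sharding

// smallestMaxQueryParallelism returns the smallest (most restrictive)
// max_query_parallelism across all tenants of a multi-tenant query, so
// sharding never fans out wider than any single tenant permits.
// tenantIDs is assumed non-empty (e.g. as returned by tenant.TenantIDs(ctx)).
func smallestMaxQueryParallelism(tenantIDs []string, maxParallelism func(tenantID string) int) int {
	smallest := maxParallelism(tenantIDs[0])
	for _, id := range tenantIDs[1:] {
		if p := maxParallelism(id); p < smallest {
			smallest = p
		}
	}
	return smallest
}
```

In the astMapperware shown above, such a helper could replace the single-tenant lookup: it would be fed a per-tenant max-query-parallelism getter from the limits, and the result would cap how many shards are queried in parallel.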

pkg/querier/http.go (outdated review thread, resolved)
pkg/querier/http.go (outdated review thread, resolved)
"github.com/grafana/loki/integration/cluster"
)

func TestMultiTenantQuery(t *testing.T) {
Contributor


Nice test!
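
As background on what the functional test exercises: a multi-tenant read is expressed by joining tenant IDs with `|` in the `X-Scope-OrgID` header. The real test uses the `integration/cluster` helpers imported above; the following is only a hedged, standalone sketch against an already running Loki instance with multi-tenant queries enabled (the address, tenant IDs, and query are placeholders):

```go
package integration

import (
	"net/http"
	"net/url"
	"testing"

	"github.com/stretchr/testify/require"
)

func TestMultiTenantQuerySketch(t *testing.T) {
	// Placeholder address of a Loki instance running with auth enabled.
	const lokiAddr = "http://localhost:3100"

	// Build an instant query against the standard query endpoint.
	params := url.Values{}
	params.Set("query", `count_over_time({job="test"}[1m])`)
	req, err := http.NewRequest(http.MethodGet, lokiAddr+"/loki/api/v1/query?"+params.Encode(), nil)
	require.NoError(t, err)

	// Query two tenants at once by joining their IDs with "|".
	req.Header.Set("X-Scope-OrgID", "tenant-a|tenant-b")

	resp, err := http.DefaultClient.Do(req)
	require.NoError(t, err)
	defer resp.Body.Close()

	// The regression fixed by this PR surfaced as failing multi-tenant
	// queries while single-tenant queries kept working, so a 200 response
	// is the key assertion here.
	require.Equal(t, http.StatusOK, resp.StatusCode)
}
```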

Contributor

@JordanRushing JordanRushing left a comment


Approved pending additional dialogue about picking the higher vs. lower limit for a multi-tenant query.

@DylanGuedes
Contributor Author

This is a really great find. Thanks for these awesome tests! However, we should always use the most restrictive limit. See #5626 for the original introduction.

Thanks, I hope they help! Regarding the limits: I'm not strongly opinionated about proceeding with the bigger values, but personally I think they work better here. That said, I'll wait a bit longer for others to share opinions before reverting the change.

@grafanabot
Collaborator

./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0.1%
+        distributor	0%
+            querier	0.1%
+ querier/queryrange	0.1%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@DylanGuedes
Contributor Author

@jeschkies @JordanRushing I pushed a commit using the smallest limits instead, WDYT?

@grafanabot
Collaborator

./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0.1%
+ querier/queryrange	0.1%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@grafanabot
Collaborator

./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

-           ingester	-0.1%
+        distributor	0%
+            querier	0.1%
+ querier/queryrange	0.1%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

Contributor

@jeschkies jeschkies left a comment


See my comment. I am happy to take over the PR if you give me write access to your repository.

@grafanabot
Collaborator

./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@DylanGuedes
Contributor Author

See my comment. I am happy to take over the PR if you give me write access to your repository.

I pushed a commit addressing it, WDYT?

Contributor

@jeschkies jeschkies left a comment


🎉

@DylanGuedes DylanGuedes merged commit ad2260a into grafana:main Nov 24, 2022
@grafanabot
Collaborator

Hello @DylanGuedes!
Backport pull requests need to be either:

  • Pull requests which address bugs,
  • Urgent fixes which need product approval, in order to get merged,
  • Docs changes.

If the current pull request addresses a bug fix, please label it with the type/bug label.
If it already has product approval, please add the product-approved label. For docs changes, please add the type/docs label.
If none of the above applies, please consider removing the backport label and targeting the next major/minor release.
Thanks!

Abuelodelanada pushed a commit to canonical/loki that referenced this pull request Dec 1, 2022
@chaudum chaudum added the type/bug (Something is not working as expected) and backport release-2.7.x (add to a PR to backport it into release 2.7.x) labels, and removed the backport release-2.7.x label, Jan 25, 2023
@grafanabot
Collaborator

The backport to release-2.7.x failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new branch
git switch --create backport-7708-to-release-2.7.x origin/release-2.7.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x ad2260aec2d4d587c50a3dfb68d2c89ed2bf3157
# Push it to GitHub
git push --set-upstream origin backport-7708-to-release-2.7.x
git switch main
# Remove the local backport branch
git branch -D backport-7708-to-release-2.7.x

Then, create a pull request where the base branch is release-2.7.x and the compare/head branch is backport-7708-to-release-2.7.x.

chaudum pushed a commit that referenced this pull request Jan 25, 2023
chaudum added a commit that referenced this pull request Jan 25, 2023
Co-authored-by: Dylan Guedes <[email protected]>
Co-authored-by: Owen Diehl <[email protected]>
Labels
backport release-2.7.x (add to a PR to backport it into release 2.7.x), backport-failed, size/L, type/bug (Something is not working as expected)