
querier: Add option for different gRPC client Loadbalancing options for StoreAPI #1083

Open
bwplotka opened this issue Apr 25, 2019 · 24 comments

Comments

@bwplotka
Member

Currently, if someone passes a domain to --store or via file SD without DNS lookups, the target is passed to gRPC DialContext, which load-balances if the A lookup returns more than one address. However, the default policy seems to be pick_first, not round_robin, which might be confusing in some cases, e.g. when trying to smooth traffic across multiple replicas of store gateways.

AC:

  • Add another --store and fileSD option to configure gRPC LB types (a rough sketch of the grpc-go side is below)
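For context, grpc-go already exposes the load-balancing policy at dial time via the default service config, so such an option would mostly need to plumb a policy name through. A minimal sketch of the client side, not the Thanos code; the function name and target handling are made up for illustration:

package main

import (
	"context"

	"google.golang.org/grpc"
)

// dialStore shows how the client-side LB policy can be selected via the default
// service config. With a dns:/// target the resolver returns all A records and
// the chosen policy ("round_robin" here, instead of the default "pick_first")
// spreads connections across them.
func dialStore(ctx context.Context, target string) (*grpc.ClientConn, error) {
	return grpc.DialContext(ctx, "dns:///"+target,
		grpc.WithInsecure(), // plaintext for brevity in this sketch
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
}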
@bwplotka
Member Author

cc @devnev @mjd95

@povilasv
Member

povilasv commented May 3, 2019

IMO we shouldn't really do this. I had a chat with the gRPC folks about the 30-minute DNS refresh interval hardcoded into the grpc-go package, and they said PRs changing that aren't welcome. It also doesn't make sense for them to implement all of the LB strategies in every language. At the moment it's easier to run Envoy, NGINX, etc. to do this. In the future they are working on a gRPC LB server.

@povilasv
Member

povilasv commented May 3, 2019

This is the snippet from our chat:

Povilas Versockas @povilasv Mar 12 12:56
Hey guys, I've been looking at gRPC load balancing on Kubernetes and I would like to go with the grpc-go client looking up DNS records and doing round-robin load balancing. The only thing that worries me is that right now the default DNS refresh interval is 30 mins (https://github.com/grpc/grpc-go/blob/master/resolver/dns/dns_resolver.go#L47) and there is no way to change it. Would you guys be open to a PR which makes this configurable? I was thinking of adding an option WithRefreshInterval(time.Duration) DialOption. Or do you guys generally recommend proxy solutions (like Envoy) doing load balancing in K8s?

Eric Anderson @ejona86 Mar 12 17:20
@povilasv, please no. Even the current interval would like to be removed. Instead, use KeepaliveParams with MaxConnectionAge. When the client receives disconnects, it will refresh DNS.

Povilas Versockas @povilasv Mar 12 19:20
@ejona86 but what about the case where I scale up my replicas? In this case clients will not know about the new servers for up to 30 mins?
IMO in an ideal scenario the gRPC LB pool should be refreshed based on the DNS record's TTL, no?

Eric Anderson @ejona86 Mar 12 20:05
@povilasv, you set the MaxConnectionAge as the longest you are willing to wait for the clients to rebalance. It works for scaling up.
TTL causes trouble because 1) we can't get it many times and 2) it causes stampeding herds when caching DNS servers are in play.
Note that most existing HTTP-based systems do not base anything off TTL. Instead, they just create new connections frequently.

Edgar Romero @alfasapy Mar 12 20:10
@ejona86 what are the drawbacks of a very short maxConnectionAge? Assuming that the maxConnectionAgeGrace is high enough to prevent in-flight requests from being cancelled.

Eric Anderson @ejona86 Mar 12 20:12
It just creates more connections. At O(seconds) and definitely O(minutes) the connection cost is already amortized. There will be "hiccups" of more latency when creating new connections.

Povilas Versockas @povilasv Mar 12 21:23
ok makes sense, thank you :)
I feel like this should be documented better, is there a place I could do that?

Eric Anderson @ejona86 Mar 12 22:17
It's not clear to me where that place would be.

Povilas Versockas @povilasv Mar 12 22:22
ok, thanks for the help, I really appreciate it

@povilasv
Member

povilasv commented May 3, 2019

IMO pick_first is a pretty good LB default: Grafana -> Thanos Query -> Thanos Store becomes sticky per user, meaning that if Grafana refreshes and re-sends the query, it will already be in the cache.

@bwplotka
Member Author

bwplotka commented May 9, 2019

I don't think it is good, as it does not balance the load evenly. At the least, we might want to add an option for changing this as a first iteration.

I wonder if a more customized LB makes sense then, one that is aware of the client context?

@povilasv
Member

The only proper way to balance the load evenly and avoid scale-up issues is to run sidecar proxies.

The only other option is to add MaxConnectionAge to the Thanos Store, Sidecar and Rule gRPC servers, which is super hacky and I wouldn't like that.

AFAIK the gRPC lookaside load balancer, where the gRPC client connects to a lookaside server, gets its routes from there and keeps refreshing them, is still in development:

https://grpc.io/blog/loadbalancing/

So we don't really have any good options currently, apart from asking users to use sidecar proxies.

@povilasv
Member

Hmm, maybe it's not that bad. We could do something like this on all of our gRPC servers:

	grpcServer := grpc.NewServer(
		grpc.UnaryInterceptor(grpc_middleware.ChainUnaryServer(
			grpc_prometheus.UnaryServerInterceptor,
			grpc_recovery.UnaryServerInterceptor(),
			grpc_logrus.UnaryServerInterceptor(log.NewEntry(log.StandardLogger())),
		)),
		grpc.KeepaliveParams(keepalive.ServerParameters{
			// Close connections that have been idle for a while.
			MaxConnectionIdle: 10 * time.Minute,
			// Send GOAWAY after 5 minutes so clients reconnect and re-resolve DNS.
			MaxConnectionAge: 5 * time.Minute,
			// Give in-flight RPCs up to a minute to finish after the GOAWAY.
			MaxConnectionAgeGrace: 1 * time.Minute,
			// Ping clients after 2 minutes of inactivity to detect dead connections.
			Time: 2 * time.Minute,
		}),
	)
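The idea being what @ejona86 describes above: MaxConnectionAge makes the server periodically close client connections, and on reconnect the client re-resolves DNS, so replicas added on scale-up get picked up within roughly one MaxConnectionAge, while MaxConnectionAgeGrace keeps in-flight requests from being cut off. Clients would still need a DNS target and a non-pick_first policy to actually spread the load.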

@bwplotka
Member Author

So I don't really understand the problem you are mainly trying to explain here. I don't think any DNS refresh is needed. As the answer you got says, it is all about the client detecting a broken connection, doing a DNS lookup again and choosing another host. Why is that problematic?

However, the actual issue here is not about that. It's about the load-balancing method we use. In fact we already use one (you have to use something), which is pick_first. I vote we should have round robin available, and maybe some sticky round robin as well that is aware of the client. I am pretty sure there are existing Golang implementations of this in the wild (:

@bwplotka
Member Author

OK, so I talked with @povilasv offline and the issue with RR is:

on upscale, if you do round robin the gRPC client won't see the new node

(https://kubernetes.io/blog/2018/11/07/grpc-load-balancing-on-kubernetes-without-tears/)

Maybe that's right. I would then suggest using something like this:
https://github.com/improbable-eng/kedge/blob/9a6745fb10e47ef4da101bb8e1cafd329ca397a7/pkg/kedge/http/lbtransport/policy.go#L60
It's battle-tested in our prod already.
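For reference, if we go the custom-policy route, grpc-go lets you register a client-side balancer, so a kedge-style policy could be plugged in without a proxy. A rough sketch of a plain round-robin picker using grpc-go's balancer/base helpers; the package and policy name are made up for illustration, and this is not the kedge implementation (which additionally excludes failing targets):

package lbpolicy

import (
	"sync/atomic"

	"google.golang.org/grpc/balancer"
	"google.golang.org/grpc/balancer/base"
)

// rrPickerBuilder builds a picker over the currently READY sub-connections.
type rrPickerBuilder struct{}

func (rrPickerBuilder) Build(info base.PickerBuildInfo) balancer.Picker {
	if len(info.ReadySCs) == 0 {
		return base.NewErrPicker(balancer.ErrNoSubConnAvailable)
	}
	scs := make([]balancer.SubConn, 0, len(info.ReadySCs))
	for sc := range info.ReadySCs {
		scs = append(scs, sc)
	}
	return &rrPicker{subConns: scs}
}

// rrPicker cycles through the ready sub-connections for every RPC.
type rrPicker struct {
	subConns []balancer.SubConn
	next     uint32
}

func (p *rrPicker) Pick(balancer.PickInfo) (balancer.PickResult, error) {
	i := atomic.AddUint32(&p.next, 1)
	return balancer.PickResult{SubConn: p.subConns[int(i)%len(p.subConns)]}, nil
}

func init() {
	// "store_round_robin" is a made-up policy name for this sketch; once registered,
	// it could be selected per --store via the service config, like the built-in policies.
	balancer.Register(base.NewBalancerBuilder("store_round_robin", rrPickerBuilder{}, base.Config{HealthCheck: true}))
}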

@povilasv
Member

@bwplotka Maybe you guys could separate it out into a different Go package?

As importing kedge doesn't seem reasonable 🗡️
Apart from that 👍

@stale

stale bot commented Jan 11, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 11, 2020
@stale stale bot closed this as completed Jan 18, 2020
@bwplotka
Member Author

We can indeed. Let's get back to this. (:

@bwplotka bwplotka reopened this Jul 22, 2020
@stale stale bot removed the stale label Jul 22, 2020
@stale

stale bot commented Aug 21, 2020

Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Aug 21, 2020
@stale

stale bot commented Aug 28, 2020

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Aug 28, 2020
@kakkoyun kakkoyun removed the stale label Aug 31, 2020
@kakkoyun kakkoyun reopened this Aug 31, 2020
@stale

stale bot commented Oct 30, 2020

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Oct 30, 2020
@kakkoyun kakkoyun removed the stale label Oct 31, 2020
@bwplotka
Member Author

bwplotka commented Nov 9, 2020

Very needed, potentially blocking #3373

Also again mentioned here https://cloud-native.slack.com/archives/CK5RSSC10/p1604923594031700

@stale

stale bot commented Jan 8, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Jan 8, 2021
@kakkoyun
Member

Still valid.

@stale stale bot removed the stale label Jan 12, 2021
@stale

stale bot commented Mar 16, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Mar 16, 2021
@onprem onprem removed the stale label Apr 1, 2021
@stale

stale bot commented Jun 2, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Jun 2, 2021
@stale

stale bot commented Jun 16, 2021

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Jun 16, 2021
@aymericDD
Contributor

I think it is still interesting.

@stale

stale bot commented Nov 13, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@yeya24
Contributor

yeya24 commented Mar 29, 2024

Is this still relevant after we have endpoint group?
