Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug Report: viper dynamic config does not work #14452

Closed
Tracked by #13990
deepthi opened this issue Nov 3, 2023 · 2 comments
Closed
Tracked by #13990

Bug Report: viper dynamic config does not work #14452

deepthi opened this issue Nov 3, 2023 · 2 comments
Labels
Milestone

Comments

@deepthi
Copy link
Member

deepthi commented Nov 3, 2023

Overview of the Issue

Some of vtgate's healthcheck flags have been defined as dynamic, because they are expected to be settable at runtime from /debug/env. However, the implementation of dynamic config is broken. It does not respect flags from the command line, and it does not use the defaults specified in code. It always falls back to the default (0) value for the type of the config.

The symptom is that vtgate's healthcheck ends up with no healthy REPLICA tablets in its list, because minNumTablets is set to 0. Users end up getting errors from @replica queries like this

1105: target: commerce.-.replica: no healthy tablet available for 'keyspace:"commerce" shard:"-" tablet_type:REPLICA' 

Credit to @aquarapid for figuring out that the problem was with viper/flag handling.

Reproduction Steps

This is actually non-trivial to reproduce. Local testing did not run into the same issue. That is because locally there is no load and replication lag is always 0. And the code always returns 1 replica if there's only 1.
So it is necessary to run with at least 2 replicas, preferably more and with some significant replica query load.
On a system with load, deploying vitess 17.0.0+ will throw replica query errors.

I added logging in replicationlag.go which exposed the problem.

func FilterStatsByReplicationLag(tabletHealthList []*TabletHealth) []*TabletHealth {
	log.Infof("REPLAG: min=%v, low=%v, high=%v", minNumTablets.Get(), lowReplicationLag.Get(), highReplicationLagMinServing.Get())
...

Binary Version

17.0.0+

Operating System and Environment details

Any

Log Fragments

I1102 19:39:27.649699       1 vtgate.go:662] Execute: target: commerce.-.replica: no healthy tablet available for 'keyspace:"commerce" shard:"-" tablet_type:REPLICA', request: map[]

I1103 00:23:18.515254       1 replicationlag.go:153] REPLAG: min=0, low=0s, high=0s
@deepthi deepthi added Type: Bug Needs Triage This issue needs to be correctly labelled and triaged labels Nov 3, 2023
@deepthi deepthi added this to the v18.0.0 milestone Nov 3, 2023
@deepthi deepthi added Component: General Changes throughout the code base Component: Query Serving and removed Needs Triage This issue needs to be correctly labelled and triaged labels Nov 3, 2023
@deepthi
Copy link
Member Author

deepthi commented Nov 3, 2023

There's a workaround for the specific vtgate/healthcheck issue, which is to set --legacy_replication_lag_algorithm=false. That is a static flag (default true) and mitigates the issue to some extent. Tablets with lag of even 1s will however still not be used.

@deepthi
Copy link
Member Author

deepthi commented Nov 3, 2023

Backports: #14454 and #14455. Once they are both merged, we can close this issue.

@deepthi deepthi mentioned this issue Nov 3, 2023
45 tasks
@deepthi deepthi closed this as completed Nov 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant