Backport of fix: use Envoy's default for validate_clusters to fix breaking routes when some backend clusters don't exist into release/1.19.x #21621
Backport
This PR is auto-generated from #21587 to be assessed for backporting due to the inclusion of the label backport/1.19.
The below text is copied from the body of the original PR.
Description
The validate_clusters option in Envoy's route configuration says:
"An optional boolean that specifies whether the clusters that the route table refers to will be validated by the cluster manager. If set to true and a route refers to a non-existent cluster, the route table will not load. If set to false and a route refers to a non-existent cluster, the route table will load and the router filter will return a 404 if the route is selected at runtime. This setting defaults to true if the route table is statically defined via the route_config option. This setting default to false if the route table is loaded dynamically via the rds option. Users may wish to override the default behavior in certain cases (for example when using CDS with a static route table)."
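The defaulting rule in the quoted passage can be summarized in a few lines. This is only an illustrative model of the documented semantics, not Envoy or Consul code; the function name and parameters are hypothetical.

```python
from typing import Optional


def effective_validate_clusters(explicit: Optional[bool], loaded_via_rds: bool) -> bool:
    """Model of the documented default for validate_clusters.

    explicit: the value set in the route configuration, or None if unset.
    loaded_via_rds: True if the route table is delivered dynamically via RDS,
        False if it is statically defined via route_config.
    (Both parameter names are assumptions made for this sketch.)
    """
    if explicit is not None:
        # An explicit setting always overrides the default.
        return explicit
    # Documented defaults: true for static route tables, false for RDS.
    return not loaded_via_rds
```

Under this model, a route table loaded via RDS with the option left unset is not validated, while explicitly setting it to true (as the code did before this change) forces validation even for RDS.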
We load the route table dynamically via RDS, but we override the default and explicitly set validate_clusters to true. As a result, when a cluster that a route points to doesn't exist, the route can fail to reach any of its backends. This can be triggered with a router -> resolver chain where the resolver has backends on different peers or WAN-federated backends, and you add a route to a backend that doesn't exist: the non-existent backend causes the existing backends to fail. I was not able to trigger this in a single-cluster setup, but it can be triggered with a peered backend.
With Envoy's default (validation disabled), traffic to the missing cluster doesn't just blackhole but instead returns a 503. That seems to be the desired behavior, rather than having all other routing paths within that route fail because of one missing cluster. This is similar to the conclusion that was reached within the Jira ticket.
This PR removes the code that overrides the default value of this validate_clusters option.
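To make the difference between the two settings concrete, here is a toy model of the behavior described above. It is a sketch under stated assumptions, not Envoy's implementation: routes are matched by exact path, the cluster names and paths are made up, and the 503 for a missing cluster follows this PR's description (the quoted Envoy documentation mentions a 404).

```python
from typing import Dict, Set


class RouteTableRejected(Exception):
    """Raised when validation is on and a route names an unknown cluster."""


def load_route_table(
    routes: Dict[str, str], known_clusters: Set[str], validate_clusters: bool
) -> Dict[str, str]:
    """Return the route table, or reject it entirely if validation fails."""
    missing = sorted(set(routes.values()) - known_clusters)
    if validate_clusters and missing:
        # One missing cluster takes down every route in the table.
        raise RouteTableRejected(f"unknown clusters: {missing}")
    return dict(routes)


def route_request(
    table: Dict[str, str],
    known_clusters: Set[str],
    path: str,
    missing_cluster_status: int = 503,  # assumption: value taken from the PR text
) -> int:
    """Return the HTTP status a request to `path` would get in this model."""
    cluster = table.get(path)
    if cluster is None:
        return 404  # no route matched at all
    if cluster not in known_clusters:
        return missing_cluster_status  # only this route is affected
    return 200


# Hypothetical data: one healthy backend and one that does not exist.
ROUTES = {"/healthy": "backend-a", "/broken": "backend-missing"}
KNOWN = {"backend-a"}
```

With `validate_clusters=True`, `load_route_table(ROUTES, KNOWN, True)` raises and no route is served. With `validate_clusters=False`, the table loads, `/healthy` keeps working, and only `/broken` returns the error status.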
Testing & Reproduction steps
Links
PR Checklist
Overview of commits