Backport of Fix xDS missing endpoint race condition. into release/1.17.x #19874
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Backport
This PR is auto-generated from #19866 to be assessed for backporting due to the inclusion of the label backport/1.17.
The below text is copied from the body of the original PR.
The following PR is mostly a clone of work done by @ksmiley with some minor tweaks. I would like to thank him for tracking down and describing this complicated situation in such great detail. His work is greatly appreciated.
See the following issues for more context:
#17640
#17641
This fixes the following race condition:
Prior to this fix, it would have resulted in the endpoints NOT existing in Envoy. This occurred because the cluster update implicitly clears the endpoints in Envoy, but we would never re-send the endpoint data to compensate for the loss, because we would incorrectly ACK the invalid old endpoint hash. Since the endpoint's hash did not actually change, they would not be resent.
The fix for this is to effectively clear out the invalid pending ACKs for child resources whenever the parent changes. This ensures that we do not store the child's hash as accepted when the race occurs.
An escape-hatch environment variable
XDS_PROTOCOL_LEGACY_CHILD_RESEND
was added so that users can revert back to the old legacy behavior in the event that this produces unknown side-effects. Visit the following thread for some extra context on why certainty around these race conditions is difficult: envoyproxy/envoy#13009Overview of commits