503s and "No route to host" errors due to routing to non-existent Endpoints
#4685
Comments
hey @coro, we haven't seen such issues yet, can you also run ...? To avoid this, you can set up retries and passive health checks; here's an example:
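As an illustration of that suggestion only (not the original example): the sketch below applies a BackendTrafficPolicy with retries and passive health checks via kubectl. The policy name, namespace, and target route are placeholders, and the field names follow my reading of the Envoy Gateway BackendTrafficPolicy API, so verify them against the API reference for your EG version.

import subprocess

# Hypothetical BackendTrafficPolicy enabling retries on 503s/connect failures
# and passive health checks (outlier detection). Field names are assumptions
# based on the Envoy Gateway v1alpha1 API and may differ between EG versions.
POLICY = """\
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: retry-and-passive-hc    # placeholder name
  namespace: default
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: FOOBAR_ROUTE        # placeholder HTTPRoute
  retry:
    numRetries: 3
    retryOn:
      triggers: ["connect-failure", "retriable-status-codes"]
      httpStatusCodes: [503]
  healthCheck:
    passive:
      consecutive5XxErrors: 3
      interval: 2s
      baseEjectionTime: 30s
      maxEjectionPercent: 100
"""

# Apply the manifest; equivalent to `kubectl apply -f policy.yaml`.
subprocess.run(["kubectl", "apply", "-f", "-"], input=POLICY, text=True, check=True)

With passive health checks, endpoints that keep failing are ejected from load balancing, which softens the impact of stale endpoints but does not remove the underlying staleness.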
Thanks for the suggestion @arkodg! On the EndpointSlice front, anecdotally I can tell you that we did not see the IPs of the non-existent Pods in the EndpointSlices, but we will check again with that and the ... It's worth mentioning that this was on a KServe ...

cc @dprotaso

@arkodg Just a quick update, we can confirm that this does not happen with EG v1.1.1 / Envoy v1.31.0. Rolling back seems to have resolved this issue. We haven't had a moment to safely reproduce this and get your additional debug info just yet, so we will let you know again when we have that.

We can confirm that pods can be terminated and ... Yet we still get logs after that point in time from Envoy saying ... There seems to be some routing still within Envoy that is trying to route to a dead/non-existent pod. Perhaps this is cached, or perhaps there is a long-lived connection still open or trying to be used? This is continuing from the same environment as @coro (we work on the same team).

Hi! Looks like we were able to identify the issue. We've started comparing proxy endpoints. After the upgrade, the controller does not update its proxy endpoints anymore, only during startup, which as you can imagine in a very dynamic k8s world causes a lot of issues. The Kubernetes EndpointSlice is being updated, but the endpoints on the proxy have been static since the startup of the gateway. We were comparing different tags and I think this is what caused the regression, although I'm not 100% sure; just giving you a hint. (As those changes were cherry-picked from 1.2.0 into 1.1.3, where we also noticed the same issue.) #4336
Unsure how the status updater can affect EndpointSlice reconciliation.

We're also facing the same issue, and downgrading EG from v1.2.1 to v1.1.2 resolved it.

We're getting the same error on our side too; rolling back to v1.1.2 seemed to fix the issue for us as well.

@ligol @ovaldi @evilr00t @sam-burrell @coro Is there any easy way to reproduce this? I tried to modify the replicas of the deploy to increase/decrease pods, but can't reproduce this issue. Also, are there any errors/warnings in the EG/Envoy logs while this happened? Can you also try ...? Looks like the only significant change within v1.1.3 is the upgrade of Envoy to v1.31.3.

I will try to test the different combinations tomorrow, when I'll have some traffic on the preprod environment; otherwise I don't really have any other way to reproduce.

These are all the combinations we have tested:
✔️ Working
✔️ Working
✔️ Working
🛑 Not Working
🛑 Not Working
🛑 Not Working
All on EKS cluster v1.29.8
The following shell output is how we are defining Not Working:

EndpointSlice FOOBAR_ENDPOINT endpoints: ['10.0.100.160', '10.0.100.32', '10.0.102.251', '10.0.103.11', '10.0.104.40', '10.0.106.44', '10.0.110.230', '10.0.111.196', '10.0.96.13', '10.0.98.131']
Selected endpoints for FOOBAR_ROUTE on FOOBAR_GATEWAY_POD: ['10.0.100.160', '10.0.100.32', '10.0.102.251', '10.0.103.11', '10.0.104.40', '10.0.106.44', '10.0.110.230', '10.0.111.196', '10.0.96.13', '10.0.98.131']
Selected endpoints for FOOBAR_ROUTE on FOOBAR_GATEWAY_POD: ['10.0.102.251', '10.0.106.41', '10.0.106.44', '10.0.110.230', '10.0.111.196', '10.0.96.116', '10.0.96.13', '10.0.97.194', '10.0.98.131']
Selected endpoints for FOOBAR_ROUTE on FOOBAR_GATEWAY_POD: ['10.0.102.251', '10.0.106.41', '10.0.106.44', '10.0.110.230', '10.0.111.196', '10.0.96.116', '10.0.96.13', '10.0.97.194', '10.0.98.131']

Some gateway pods have IPs that don't exist in the EndpointSlice.

The following is the script we are using to test this:

import subprocess
import json
from datetime import datetime
from pprint import pprint
HTTP_ROUTE = "HTTP_ROUTE"
ENDPOINT_SLICE = "ENDPOINT_SLICE"
GATEWAY_POD_PREFIX = "GATEWAY_POD_PREFIX"
# Define the command to get the EndpointSlice
command = ["kubectl", "get", "endpointslice", "-o", "json", ENDPOINT_SLICE]
result = subprocess.run(command, capture_output=True, text=True)
endpointslice_output = result.stdout
endpointslice_data = json.loads(endpointslice_output)
# Print the EndpointSlice information
ENDPOINT_SLICE_endpoints = [
    endpoint["addresses"][0]
    for endpoint in endpointslice_data["endpoints"]
    if endpoint["conditions"]["ready"] and endpoint["conditions"]["serving"]
]
ENDPOINT_SLICE_endpoints = sorted(ENDPOINT_SLICE_endpoints)
print(f"EndpointSlice {ENDPOINT_SLICE} endpoints: \t\t\t\t\t\t\t\t\t\t{ENDPOINT_SLICE_endpoints}")
command = ["kubectl", "get", "pods", "-n", "kube-system", "-o", "json"]
result = subprocess.run(command, capture_output=True, text=True)
pods_output = result.stdout
# Parse the JSON output to get the pod names
pods_data = json.loads(pods_output)
envoy_pods = [
    pod["metadata"]["name"]
    for pod in pods_data["items"]
    if pod["metadata"]["name"].startswith(GATEWAY_POD_PREFIX)
]
all_selected_endpoints = []
for envoy_pod in envoy_pods:
    # Define the command to run
    command = [
        "egctl", "config", "envoy-proxy", "endpoint", "-n", "kube-system",
        envoy_pod
    ]
    # Run the command and capture the output
    result = subprocess.run(command, capture_output=True, text=True)
    output = result.stdout
    # Parse the JSON output
    data = json.loads(output)
    # Extract the required information
    endpoints = data["kube-system"][envoy_pod]["dynamicEndpointConfigs"]
    selected_endpoints = [
        lb_endpoint["endpoint"]["address"]["socketAddress"]["address"]
        for ep in endpoints
        if ep["endpointConfig"]["clusterName"] == HTTP_ROUTE
        for lb_endpoint in ep["endpointConfig"]["endpoints"][0]["lbEndpoints"]
    ]
    selected_endpoints = sorted(selected_endpoints)
    print(f"Selected endpoints for {HTTP_ROUTE} on {envoy_pod}: \t{selected_endpoints}")
We have turned ...

Is there anything we can do to assist with this?
I tried the below setup with both 1.2.1 and 1.1.2:

Constantly create and delete pods with: ...

And access the Gateway with: ...

And I got some 503 errors with both 1.2.1 and 1.1.2. After I stopped creating and deleting pods, the 503 errors ceased for both versions. There doesn't seem to be any noticeable difference in behavior between versions 1.2.1 and 1.1.2. Probably there's something missing in my configuration, so I can't reproduce this? @sam-burrell @ligol can you help me to reproduce this?
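For anyone else trying to reproduce this, here is a rough sketch of that kind of churn-plus-probe loop (the exact commands used above were not included; the Deployment name, namespace, and Gateway URL are placeholders):

import subprocess
import time
import urllib.error
import urllib.request

DEPLOYMENT = "FOOBAR_DEPLOYMENT"                 # placeholder backend Deployment
NAMESPACE = "default"
GATEWAY_URL = "http://FOOBAR_GATEWAY_HOSTNAME/"  # placeholder Gateway address

non_200 = 0
for i in range(60):
    # Alternate between 2 and 10 replicas to force continuous Pod churn.
    replicas = "10" if i % 2 == 0 else "2"
    subprocess.run(
        ["kubectl", "scale", "deployment", DEPLOYMENT,
         "-n", NAMESPACE, "--replicas", replicas],
        check=True,
    )
    # Probe the Gateway while the churn is in progress and count failures.
    for _ in range(50):
        try:
            with urllib.request.urlopen(GATEWAY_URL, timeout=5) as resp:
                status = resp.status
        except urllib.error.HTTPError as e:
            status = e.code
        except OSError:
            status = 0
        if status != 200:
            non_200 += 1
        time.sleep(0.1)

print(f"non-200 responses observed: {non_200}")

If stale endpoints are the cause, the non-200 count should keep climbing well after the churn stops, rather than only spiking while pods are being replaced.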
I think it's acceptable for the endpoints in the Envoy xDS to temporarily differ from the k8s EndpointSlice. This delay occurs because Envoy Gateway (EG) needs time to propagate the changes to Envoy. However, if an endpoint still exists in Envoy for a while after being deleted from the EndpointSlice, it indicates a problem.
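One way to separate normal propagation delay from the stale-endpoint problem is to treat an endpoint as stale only if it is still missing from the EndpointSlice on a second check after a grace period. A small sketch building on the comparison script above (the two getter functions are assumptions, e.g. wrappers around the kubectl and egctl calls from that script):

import time

def stale_endpoints(get_envoy_ips, get_slice_ips, grace_seconds=60):
    """Return Envoy endpoint IPs still absent from the EndpointSlice after the grace period."""
    # First pass: IPs Envoy is routing to that the EndpointSlice no longer lists.
    missing = set(get_envoy_ips()) - set(get_slice_ips())
    if not missing:
        return set()
    # Give EG time to push the xDS update, then re-check.
    time.sleep(grace_seconds)
    still_missing = set(get_envoy_ips()) - set(get_slice_ips())
    # Only IPs missing on both passes are treated as stale rather than in-flight updates.
    return missing & still_missing

Anything returned after the grace period points at the propagation bug discussed here rather than ordinary xDS delay.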
@zhaohuabing we are going to try whether the revert fixes the issue.

@dromadaire54 Please also try #4767 if you have a stable reproducible env. Thanks!

@zhaohuabing can you provide a custom image from your Docker Hub repo (with the commit reverted) that others can try?

I prefer #4767 because the revert in #4755 would cause a regression of another issue.

Update: please use zhaohuabing/gateway-dev:e406088d7 for #4767. The new image contains the fix for e2e tests.

@evilr00t Is it temporary, or will the deleted pod IPs stay in the endpoints for a long time?

On the provided screenshot, when I scaled up the ReplicaSet, the new Pod IPs were added to only 1 Envoy proxy. Deleted Pods are removed from that 1 working proxy. The 2 remaining proxies are not updated at all and show the IPs of Pods that were available during the startup of those proxies.
Thanks for looking into this @zhaohuabing. This is a great project! I am seeing similar issues and finally took some time to try to put together reproducible steps using some of the scripts in the gateway repo; here is what I've come up with:

The behavior doesn't appear with the ...

@tskinner-oppfi Thanks for testing the fixes! v1.2.2 will be released this week with the fix #4754.

@zhaohuabing thanks for the fix; no errors for several days in our env.

Hello @dromadaire54 @tskinner-oppfi

I've tested the image ... The end result is that I see a mixture of 200 and 0 status codes.
Hi,
Recently, after upgrading to v1.2.0 and v1.2.1, we encountered the 503s issue. And after upgrading to v1.2.2, we encountered the DPANIC issue. So for v1.2.3, we decided to wait and see... 😂
@dromadaire54 can you share the access logs output? I'm not sure how the fix that went into v1.2.3 can cause it (we only wait during init). The issue @tskinner-oppfi is hitting is tied to a TCP listener without an attached TCPRoute, which looks like another issue (maybe a configuration issue) and needs to be investigated separately.

@arkodg I shared the logs with you by DM in Slack.
Description:
We have been seeing many 503 errors when connecting to a Service with a lot of Pod churn.
We also saw in the logs that in these cases, the upstream_host that Envoy was attempting to connect to was for Pods that no longer existed in the cluster. These Pods could have been terminated over 50 mins earlier.

Repro steps:
Simple setup of Gateway (AWS NLB) -> HTTPRoute -> Service pointing to a Deployment with a lot of Pod churn.
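As a rough sketch of that wiring (all resource names are placeholders; the AWS NLB-specific Service settings are omitted, and the backend Service is assumed to point at the high-churn Deployment):

import subprocess

# Placeholder Gateway + HTTPRoute manifests illustrating the setup described above.
MANIFESTS = """\
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: eg                      # placeholder Gateway
  namespace: default
spec:
  gatewayClassName: eg          # placeholder GatewayClass
  listeners:
    - name: http
      protocol: HTTP
      port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: FOOBAR_ROUTE            # placeholder HTTPRoute
  namespace: default
spec:
  parentRefs:
    - name: eg
  rules:
    - backendRefs:
        - name: FOOBAR_SERVICE  # Service backed by the high-churn Deployment
          port: 80
"""

subprocess.run(["kubectl", "apply", "-f", "-"], input=MANIFESTS, text=True, check=True)

Scaling the backing Deployment up and down then provides the Pod churn under which the stale upstream_host errors appear.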
Environment:
Gateway: v1.2.1 (also seen on v1.1.3) (not seen on v1.1.1)
Envoy: v1.32.1 (also seen on v1.31.1) (not seen on v1.31.0)
EKS cluster v1.29
Logs:
The generated cluster config for this route was: