
Temporarily retain endpoints which are in active use #777

Closed · wants to merge 31 commits

Conversation

@XAMPPRocky (Collaborator)

I don't think this should be the actual solution, but I needed to test it, and I figure we can also use the PR to discuss how to solve the problem in a better way.

Consider the following infrastructure:

proxy -> management server
proxy -> relay <- agent

If the connection to the relay, management server, or agent is disrupted for some reason and those services need to be restarted, there's a chance that the proxy or relay will receive a response before the agent or management server has received information from k8s. That would lead to the proxy clearing all of its endpoints, which would drop players' traffic for that period of time.

This PR changes that behaviour by adding a shared atomic counter to endpoints. When the proxy merges new configuration, it retains any old endpoints that are currently in use. Metadata is still replaced when we receive the same endpoint again, but in-use endpoints are kept for the duration of their sessions.
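As a rough sketch of the idea in plain Rust (Endpoint, sessions, and merge here are illustrative stand-ins, not the PR's exact types): each endpoint carries an Arc'd session counter that the proxy's session map would increment and decrement, and the merge only drops endpoints that are both absent from the update and not in use.

use std::collections::HashMap;
use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};

// Illustrative endpoint: its metadata (e.g. tokens) plus a counter of
// sessions currently routed to it, shared with the session map via Arc.
struct Endpoint {
    metadata: String,
    sessions: Arc<AtomicUsize>,
}

impl Endpoint {
    fn in_use(&self) -> bool {
        self.sessions.load(Ordering::Relaxed) > 0
    }
}

// Merge a new configuration into the current endpoint set. Endpoints in
// `update` replace the old entries (so fresh metadata such as new tokens
// is respected); endpoints missing from the update are retained only
// while a session still uses them.
fn merge(current: &mut HashMap<String, Endpoint>, update: HashMap<String, Endpoint>) {
    current.retain(|addr, ep| update.contains_key(addr) || ep.in_use());
    for (addr, mut new_ep) in update {
        if let Some(old) = current.get(&addr) {
            // Keep the existing counter so in-flight sessions keep
            // pinning this endpoint, but take the new metadata.
            new_ep.sessions = Arc::clone(&old.sessions);
        }
        current.insert(addr, new_ep);
    }
}

fn main() {
    let mut endpoints = HashMap::new();
    endpoints.insert("10.0.0.1:7777".into(), Endpoint {
        metadata: "token-a".into(),
        sessions: Arc::new(AtomicUsize::new(1)), // one active session
    });
    endpoints.insert("10.0.0.2:7777".into(), Endpoint {
        metadata: "token-b".into(),
        sessions: Arc::new(AtomicUsize::new(0)), // idle
    });

    // A disrupted control plane pushes an empty config: the idle
    // endpoint is dropped, the in-use one survives its session.
    merge(&mut endpoints, HashMap::new());
    assert!(endpoints.contains_key("10.0.0.1:7777"));
    assert!(!endpoints.contains_key("10.0.0.2:7777"));
    println!("retained: {}", endpoints["10.0.0.1:7777"].metadata);
}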

@markmandel (Member)

Hmnn. That's super interesting.

I actually wonder if we could solve this the same way we solve proxies not being available behind a load balancer when they have no endpoints.

quilkin/src/admin.rs, lines 98 to 106 at 1fc3954:

fn check_proxy_readiness(config: &Config) -> Response<Body> {
    if config.clusters.read().endpoints().count() > 0 {
        return Response::new("ok".into());
    }

    let mut response = Response::new(Body::empty());
    *response.status_mut() = StatusCode::INTERNAL_SERVER_ERROR;
    response
}

Which is to say a relay, agent, or management server can't return OK from /ready unless it has some endpoints. So if the proxy does restart for some reason, it (in theory) won't get added back into the LB until it has some sort of configuration.

The downside is - if there are legitimately no Endpoints, then that can't be communicated to the proxies 🤔 so maybe that's not the best approach.

Unless there is some way on the relay / agent / etc layer to not return true from /ready until they can say "hey, I've gone and checked at least once from the config I need to check from" and therefore won't send data back up to the proxy?

Did that make sense? Basically I think we should manage this in the /ready handler, since that's what determines whether an endpoint (and its configuration) is available to the outside world.

    // This will mean that updated metadata such as new tokens
    // will be respected, but we will still retain older
    // endpoints that are currently actively used in a session.
    Entry::Occupied(entry) => {
@markmandel (Member) commented on the diff above:

Is this dangerous? Basically, if an endpoint is getting malicious traffic and the solution the end user wants to implement is to drop the endpoint entirely to protect internal resources, doesn't this mean that won't work while the malicious traffic continues?

@XAMPPRocky (Collaborator, Author) replied:

Basically, if an endpoint is getting malicious traffic and the solution the end user wants to implement is to drop the endpoint entirely to protect internal resources, doesn't this mean that won't work while the malicious traffic continues?

Well, yeah, that is why we shouldn't accept this version. I'll write up my response to your other questions/proposed solution in a separate comment, but I think what we need are different verbs here. We need to be able to distinguish between "services are disrupted or a bug has caused an endpoint to be lost in the chain", where we want the load balancer to retain that endpoint because we don't want players to be dropped mid-game, and "stop sending traffic to this endpoint because we want to stop traffic to it".

@XAMPPRocky force-pushed the ep/retain-endpoints branch 2 times, most recently from a0d6458 to 8c2fc7d on August 25, 2023 12:16
@XAMPPRocky (Collaborator, Author) commented Aug 25, 2023

Did that make sense? Basically I think we should manage this in the /ready handler, since that's what determines whether an endpoint (and its configuration) is available to the outside world.

So I don't like that approach, for a couple of reasons.

  • The gRPC connection won't initiate until /ready has completed. This means additional latency or spurious, unresolvable connection failures between services. When games are allocated, we want that information distributed as quickly as possible.
  • It works for management servers, but not for the relay. Even though the relay won't initiate until it has servers, in a multi-cluster setup what we're more likely to run into is one cluster going down; the load balancer won't be aware of that, because it gets its information from the relay rather than directly from the management server, so it will just see an incomplete config.

I think what we need are new verbs for communicating changes to load balancers, so that it's actually clear when we intentionally want to delete or change endpoints versus when there's potentially a service disruption, misconfiguration, or a bug, and to have the load balancers handle those cases differently. Absent an intentional signal (removing a token, using a firewall, etc.), the load balancer should strive to let players keep playing to the end of their session.

This could be having "Delta xDS" replace our current xDS protocol, so that we're communicating changes, and as an added benefit we're reducing the amount of traffic we send as we scale.
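For context, Delta xDS (the incremental variant of the xDS protocol) makes removal an explicit verb: a DeltaDiscoveryResponse carries upserted resources plus a separate removed_resources list, so "the server didn't mention this endpoint" and "the server deleted this endpoint" are no longer the same signal. A rough sketch of what proxy-side handling could look like, with DeltaUpdate standing in for the generated protobuf type:

use std::collections::HashMap;

// Stand-in for the fields of a Delta xDS DeltaDiscoveryResponse that
// matter here: upserted resources plus an explicit list of removed names.
struct DeltaUpdate {
    upserted: HashMap<String, String>, // resource name -> endpoint config
    removed: Vec<String>,              // names the server intentionally deleted
}

fn apply_delta(current: &mut HashMap<String, String>, delta: DeltaUpdate) {
    // Upserts carry new or changed configuration.
    current.extend(delta.upserted);
    // Removal is an explicit verb rather than an absence: a relay that
    // never learned about an endpoint (because it restarted mid-disruption)
    // simply doesn't mention it, and can no longer wipe it from the proxy.
    for name in delta.removed {
        current.remove(&name);
    }
}

fn main() {
    let mut state = HashMap::from([("gameserver-a".to_string(), "cfg".to_string())]);
    apply_delta(&mut state, DeltaUpdate { upserted: HashMap::new(), removed: vec![] });
    // Nothing mentioned and nothing removed: gameserver-a is retained.
    assert!(state.contains_key("gameserver-a"));
}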

@markmandel (Member)

This could be having "Delta xDS" replace our current xDS protocol, so that we're communicating changes, and as an added benefit we're reducing the amount of traffic we send as we scale.

Ultimately that's probably the better path, I think you are right.

If a proxy hasn't received the request to remove something, then it will just keep it. If the relay server crashes, and an agent behind it also happened to crash at the same time, the relay can't tell any of the proxies to delete something it wasn't aware of in the first place.

From the management server / agent perspective, I would say it still makes sense to have a /ready that only returns true once it has checked and parsed the initial state of whatever world it's looking at. That way it at least can't send out an empty config before it actually has information, which solves issues at that layer.
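A minimal sketch of that gating, assuming an atomic flag that the watcher flips after its first successful full read (names are illustrative, not quilkin's actual code). Note this also sidesteps the "legitimately no Endpoints" downside above: an empty config becomes valid after the first sync, just never before it.

use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

// Set once the agent / management server has completed its first full
// read of the world it watches (e.g. GameServers in k8s).
#[derive(Clone, Default)]
struct ReadyState {
    initial_sync_done: Arc<AtomicBool>,
}

impl ReadyState {
    // Called by the watcher after the first successful List completes.
    fn mark_synced(&self) {
        self.initial_sync_done.store(true, Ordering::Release);
    }

    // /ready handler: an empty config is legitimate *after* the first
    // sync (there may really be no endpoints), but never before it, so
    // we can't push out an empty world we haven't actually observed.
    fn is_ready(&self) -> bool {
        self.initial_sync_done.load(Ordering::Acquire)
    }
}

fn main() {
    let state = ReadyState::default();
    assert!(!state.is_ready()); // before the first sync: not ready
    state.mark_synced();
    assert!(state.is_ready()); // after it: ready, even with zero endpoints
}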

The gRPC connection won't initiate until /ready has completed. This means additional latency or spurious, unresolvable connection failures between services. When games are allocated, we want that information distributed as quickly as possible.

At the agent / management layer, I don't think this is a concern, because fast connections to systems that have bad data don't help anyone. We could block on the gRPC request until the initial check completes, but at that point we're introducing the same delay anyway? Or maybe a check at both layers is good, just to be extra sure. I'd lean more towards optimising for correct information than speed of delivery in this type of scenario.

@XAMPPRocky (Collaborator, Author)

We could block on the gRPC request until the initial check completes, but at that point we're introducing the same delay anyway?

Well, actually, I do like that better, because that delay is only on initialisation, right? It wouldn't apply whenever there are no endpoints. I think you're right that from the agent / management server perspective it mostly doesn't matter, and what matters is that it's returning data. From the load balancer side, it's good to see messages like "warning cannot connect to " whenever it has zero endpoints, but having the check wait until the first update from k8s does prevent that from happening outside of startup.
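A sketch of that blocking behaviour, assuming a tokio watch channel shared between the watcher and the gRPC stream handler (illustrative wiring, not quilkin's actual code): only the first response is gated, so the delay is paid once at startup.

use tokio::sync::watch;

// The watcher flips this flag to true after its first successful sync.
async fn serve_stream(mut synced: watch::Receiver<bool>) {
    // Block only the stream's first response on initialisation: once the
    // flag is true this loop exits immediately for every later stream,
    // so the delay never recurs when the endpoint count hits zero.
    while !*synced.borrow() {
        synced.changed().await.expect("watcher task dropped");
    }
    // ... now send the first DiscoveryResponse over the stream ...
}

#[tokio::main]
async fn main() {
    let (tx, rx) = watch::channel(false);
    let stream = tokio::spawn(serve_stream(rx));
    // Simulate the watcher finishing its first read from k8s.
    tx.send(true).expect("no receivers left");
    stream.await.unwrap();
}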

@XAMPPRocky force-pushed the ep/retain-endpoints branch 2 times, most recently from a1d7b28 to 3dce130 on September 21, 2023 08:50
@XAMPPRocky enabled auto-merge (squash) October 18, 2023 08:41
@XAMPPRocky force-pushed the ep/retain-endpoints branch 7 times, most recently from 7730eeb to 5692a55 on October 22, 2023 11:24
@github-actions

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@quilkin-bot (Collaborator)

Build Failed 😭

Build Id: f72cf2aa-6398-4db3-b8c2-e48fa8f082fa

Status: FAILURE

To get permission to view the Cloud Build view, join the quilkin-discuss Google Group.

Filter with the Git Commit 072fd80 within us-docker.pkg.dev/quilkin/ci//quilkin to see if a development image is available from this build, and can be used for debugging purposes.

Development images are retained for at least 30 days.

@XAMPPRocky (Collaborator, Author)

Closing as superseded by #850 and #836

@XAMPPRocky closed this Nov 15, 2023
@markmandel deleted the ep/retain-endpoints branch March 11, 2024 05:59