Custom keep-alive strategy for outgoing OIDC connections #75515
Comments
Pinging @elastic/es-security (Team:Security) |
We faced the same problems with the HTTP client that Watcher uses, see #72736 for the relevant fixes:
Additionally you only get a timeout here if the |
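For illustration only (not necessarily what #72736 does): one general way to stop a pooled Apache HTTP client from reusing long-idle connections is to evict idle/expired connections and cap each connection's time-to-live. A minimal sketch, assuming the blocking HttpClient 4.5.x API and made-up durations:

import java.util.concurrent.TimeUnit;

import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;

public class PooledClientSketch {
    public static CloseableHttpClient build() {
        return HttpClientBuilder.create()
            // Never reuse a connection older than 3 minutes, even if it looks healthy.
            .setConnectionTimeToLive(3, TimeUnit.MINUTES)
            // A background thread closes connections idle in the pool for more than 30 seconds.
            .evictIdleConnections(30, TimeUnit.SECONDS)
            // Also close connections whose server-advertised keep-alive has expired.
            .evictExpiredConnections()
            .build();
    }
}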
I did some investigating last night and I have the following outcomes/conclusions:
Given the above, we shouldn't be reusing TCP connections so the problem above should not be manifesting? |
Maybe it is my reading, but I don't think the doc says connection reuse is only possible when there is a non-null |
Yes, the RST is only sent when re-using a lost connection. This sounds like the same fundamental problem, but with a slightly better-behaved backend which actively rejects unknown traffic rather than just dropping it.
I think it's better to adjust the config of the connection pool as proposed above. Making it optional (a) doesn't fix this for busier environments that need the pool and (b) requires you to decide whether your environment is busy enough to use a pool, whereas a connection pool with a better config (a) works for all environment sizes and (b) doesn't require users to make any choices to get it to work. |
Agreed. Having the pool do the right thing in all cases is the preferred option. I only brought up the deactivation option because it felt like an easier/faster bugfix to ship, but now that I have checked your other PR #72736, doing the right thing looks straightforward. |
Could this common HTTP client bug cause the same symptoms?
My understanding is TCP socket pools are decoupled from HTTP connections. In other words, TCP keep-alive can be enabled even if HTTP sessions and HTTP pools are both disabled. It sounds like the Elasticsearch OIDC realm has TCP keep-alive on, but no HTTP sessions and no HTTP pool. Also, a TLS session can be attached to a TCP socket in the keep-alive cache. That allows for very fast reuse of a TCP socket for unrelated HTTP requests/responses. The only catch is the HTTP client must complete the full HTTP request/response cycle. Skipping getErrorStream() leaves unread bytes in the TCP recv buffer, and the underlying TCP pool has no idea how to recover from higher-layer errors in HTTP (or other protocols). How the TCP keep-alive pool discards the socket probably depends on the implementation and configuration. Also, there may be interference from firewalls, proxies, and gateways in between the client and server. I am wondering if TCP packets might be getting dropped during that error handling handshake. |
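As a side note on the getErrorStream() point: with the JDK's java.net.HttpURLConnection (not the client Elasticsearch uses, just an illustration of the general rule), the socket only goes back into the JDK's keep-alive cache if the response body, including error bodies, is read to completion and closed. A sketch of that drain-on-error pattern:

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class DrainSketch {
    static int fetchStatus(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        int status = conn.getResponseCode();
        // Read the normal body or the error body to completion so the underlying
        // TCP connection is left in a reusable state for the keep-alive cache.
        InputStream body = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
        if (body != null) {
            try (InputStream in = body) {
                while (in.read() != -1) {
                    // discard
                }
            }
        }
        return status;
    }
}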
@DaveCTurner The other issue you mentioned is for HTTP connection pools? Does that mean the TCP keep-alive part is relevant, but not the HTTP pool part? Related question, I don't see anything in the comments about addressing HTTP layer connection drops. For proactive pool cleanup, I like the example of Apache DBCP2. It does SQL pooling, but it reuses Apache POOL2 under the covers, and the config concepts are common to HTTP pooling too:
|
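For reference, the DBCP2 eviction settings being alluded to look roughly like this (a hypothetical configuration, values made up; the analogous ideas for an HTTP pool would be idle-connection eviction and validation before reuse):

import org.apache.commons.dbcp2.BasicDataSource;

public class Dbcp2EvictionSketch {
    static BasicDataSource configure() {
        BasicDataSource ds = new BasicDataSource();
        ds.setUrl("jdbc:postgresql://db.example.com/app"); // hypothetical database URL
        // Background evictor: wake every 30 seconds, close connections idle for over 3 minutes.
        ds.setTimeBetweenEvictionRunsMillis(30_000);
        ds.setMinEvictableIdleTimeMillis(180_000);
        // Validate pooled connections while idle and before handing them out.
        ds.setValidationQuery("SELECT 1");
        ds.setTestWhileIdle(true);
        ds.setTestOnBorrow(true);
        return ds;
    }
}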
If we tried to re-use a connection without having consumed the body of the previous response then we wouldn't see a timeout. It'd just fail immediately since the next bytes consumed would almost certainly not be a legal start to an HTTP response. This is certainly a possible way to misuse the HTTP state machine in a very general sense, but AFAICT in the client we use, connections are returned to the pool by the act of consuming the response body; there's no other explicit "close" method that could leave things in an invalid state.
I understood it to be the opposite: there's an HTTP pool but no TCP keepalives. If there's no pool then every request would use a fresh connection, which would mean you'd only get this kind of timeout if the other end genuinely took over 30s to respond.
I'm not sure what you mean here and there aren't many options. If TCP keepalives go unacknowledged then the connection is closed and further attempts to use it will yield immediate exceptions. There is no "TCP keep-alive pool", keepalives apply to each individual TCP connection.
I'm not following this either. What is an "error handling handshake"? The firewall/proxy/whatever that causes the problem is certainly dropping TCP packets but it's nothing to do with any handshaking - all the handshakes took place long before the problem occurred and we know they succeeded because otherwise we wouldn't be considering the connection to be suitable for use.
I don't follow here either. Do you see a mechanism for trying to use an HTTP connection after leaving it idle for many minutes that isn't pooling?
The HTTP layer doesn't have a concept of a connection, but in principle it's possible that there's some state shared between consecutive requests on the same TCP connection and it's this higher-level session state which goes awry. However if the TCP layer were fine and something went wrong at a higher layer then we'd not get the timeout we're discussing, we'd get a fast failure. |
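For what it's worth, TCP keep-alive is indeed a per-socket option rather than a pool feature. In Java it is enabled per connection roughly as below (a sketch only; the extended tuning options require JDK 11+ and OS support):

import java.io.IOException;
import java.net.Socket;
import jdk.net.ExtendedSocketOptions;

public class KeepAliveSketch {
    static void enableKeepAlive(Socket socket) throws IOException {
        // SO_KEEPALIVE applies to this one TCP connection only.
        socket.setKeepAlive(true);
        // Optional tuning: idle seconds before the first probe, and the probe interval.
        socket.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, 60);
        socket.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, 10);
    }
}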
What do we need to progress on this issue? I am pretty sure the approach proposed by @DaveCTurner within his first comment should help to address most of the issues described in this thread (#72736). |
Thank you for your comment @stephan-erb-by. We have identified that this should be addressed and as you point out, we have candidate options for resolution. It is on our roadmap and we plan on addressing this soon. |
Just an idea for anyone else running into this frequently: a cron job that signs into Kibana every few minutes works wonders. It's a crude hack but gets the job done - at least for us for now :) |
@stephan-erb-by can you share more insights about your hack? You're using an OAuth2 client-credentials flow and you're POST'ing the creds somewhere on a Kibana endpoint for authentication? |
@xens yes that is pretty much it. As an illustration, here is how this could be done in Python. The snippet is not fully runnable, but it should help you to get the idea.

import logging
from urllib.parse import urljoin

import requests

logger = logging.getLogger(__name__)

with requests.Session() as session:
    session.headers.update({"User-Agent": "Mozilla/5.0 Gecko/20100101 Firefox/96.0"})

    # Log into Kibana
    login_url = urljoin(kibana_endpoint, "/internal/security/login")
    login_payload = {
        "currentURL": urljoin(kibana_endpoint, "/login?next=%2F"),
        "providerName": "oidc1",
        "providerType": "oidc",
    }
    headers = {"kbn-xsrf": "true"}
    response = session.post(
        login_url, allow_redirects=True, json=login_payload, headers=headers
    )
    response.raise_for_status()

    # We get redirected to our SSO / IDP
    login_response_location = response.json()["location"]
    my_sso_url = get_my_sso_url(session, login_response_location)

    # Next, log into our SSO with our credentials. This is specific to whatever is in use
    response = session.get(f"{my_sso_url}&login_hint={user}")
    response.raise_for_status()
    log_into_my_sso_with_creds(session, response, user, pw)

    # Now we get redirected back to Kibana which should finally let us in.
    response = session.get(f"{my_sso_url}&login_hint={user}", allow_redirects=True)
    response.raise_for_status()
    logger.info("Successful Kibana/Elastic login!")
@stephan-erb-by thanks, that's really helpful. Can you also share the run frequency that you've set up? Ideally I'd like to find the sweet spot between having a working workaround and not consuming too many tokens against our IDP quota. |
That depends a bit on how fast your connections time out. I think we are currently doing 10 logins in short succession every 5 minutes. It also depends a bit on how many Elastic & Kibana nodes you have, as each has its own independent connection pool (if I understood that correctly). |
@stephan-erb-by thanks, I was able to implement something! My initial idea of using a different OAuth2 flow to ease the token retrieval process was completely wrong: the Elastic backend expects an authorization code to call the IDP and obtain the access_token, so we need to stick with the OAuth2 authorization-code flow. It was very tricky; I had to use Selenium to follow my IDP login flow as I couldn't bypass the login SPA easily with this OAuth2 flow. Unfortunately the workaround will badly impact our IDP access_token quota (>80'000 tokens / month :( ), so I hope that Elastic will come up with a solution soon.
Has anyone addressed these questions?
|
In some environments, the back-channel connection can be dropped without sending a TCP RST to ES. When that happens, reusing the same connection results in a timeout error. This PR adds a new http.connection_pool_ttl setting to control how long a connection in the OIDC back-channel pool can be idle before it is closed. This allows ES to more actively close idle connections to avoid the timeout issue. The new setting has a 3-minute default, which means idle connections are closed after 3 minutes if the server response does not specify a shorter keep-alive. Resolves: #75515
Facing the same issue in Elastic 8.3.0; it is intermittent.
|
@vikasmishra17 This is fixed in 8.4.0 and 8.3.3. You might want to upgrade. |
We have been seeing regular connection timeouts in certain cases when our stack is being used as an OIDC RP, with specific OP OIDC implementations/vendors. The problem manifests as authentication timeouts for end users.
The summary of what we have figured out happens is as follows:

- Our HTTP client (CloseableAsyncClient) doesn't set an HTTP keep-alive duration. This is presumably because we are using the org.apache.http.impl.client.DefaultConnectionKeepAliveStrategy, which doesn't have a default value, and the OP does not return a Keep-Alive HTTP response header in its responses to our requests to the OIDC endpoints.
- At some point the connection is dropped without us being notified (with a packet carrying the FIN or RST flag, for instance).
- When we next send a TCP packet with the SYN flag, the packet is dropped somewhere between us and the target server as malicious/wrong.
- We get a connection timeout at the HTTP layer. Presumably, if we set the HTTP connection timeout to a very large value, we will hit the underlying TCP connection timeout.

We could expose a setting for the OIDC realm that sets the default minimum keep-alive (by implementing a ConnectionKeepAliveStrategy) even/only if the server doesn't set a preference in the HTTP response.
This wouldn't fix the cause of the described issue but it could allow administrators to fine-tune the configuration to alleviate its effects.
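A minimal sketch of what such a keep-alive strategy could look like with the Apache client (illustrative only, not the actual Elasticsearch implementation; the 3-minute fallback is an assumed value):

import java.util.concurrent.TimeUnit;

import org.apache.http.HttpResponse;
import org.apache.http.conn.ConnectionKeepAliveStrategy;
import org.apache.http.impl.client.DefaultConnectionKeepAliveStrategy;
import org.apache.http.protocol.HttpContext;

public class FallbackKeepAliveStrategy implements ConnectionKeepAliveStrategy {
    private static final long FALLBACK_MILLIS = TimeUnit.MINUTES.toMillis(3); // assumed default

    @Override
    public long getKeepAliveDuration(HttpResponse response, HttpContext context) {
        // Honour the server's Keep-Alive header when it is present...
        long serverPreference =
            DefaultConnectionKeepAliveStrategy.INSTANCE.getKeepAliveDuration(response, context);
        // ...otherwise (or if it is unbounded) cap connection reuse at the fallback duration.
        return serverPreference > 0 ? Math.min(serverPreference, FALLBACK_MILLIS) : FALLBACK_MILLIS;
    }
}

Such a strategy would then be registered on the HTTP client builder (via its setKeepAliveStrategy method) when the back-channel client is created.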