Custom keep-alive strategy for outgoing OIDC connections #75515

Closed
jkakavas opened this issue Jul 20, 2021 · 21 comments · Fixed by #87773

@jkakavas
Member

We have been seeing regular connection timeouts when our stack is used as an OIDC RP with specific OP OIDC implementations/vendors. The problem manifests as authentication timeouts for end users.

A summary of what we have figured out is as follows:

  • Our implementation (Apache HttpClient CloseableAsyncClient) doesn't set an HTTP keep-alive duration. This is presumably because we use org.apache.http.impl.client.DefaultConnectionKeepAliveStrategy, which has no default value, and the OP does not return a Keep-Alive HTTP response header in its responses to our requests to the OIDC endpoints.
  • Something (a host, an LB, the target server itself) between us and the vendor/implementation drops the connection at some point but doesn't notify us (by sending a packet with the FIN or RST flag set, for instance).
  • Our implementation is not aware that the connection is dropped and, since it assumes the keep-alive duration is indefinite, it attempts to reuse the same socket.
  • Since the new connection attempt is not preceded by a TCP packet with the SYN flag set, the packet is dropped somewhere between us and the target server as malicious/invalid.
  • Our implementation doesn't get a response and times out at the HTTP layer. Presumably, if we set the HTTP connection timeout to a very large value, we would instead hit the underlying TCP connection timeout.

We could expose a setting for the OIDC realm that sets a default minimum keep-alive (by implementing a ConnectionKeepAliveStrategy) even/only if the server doesn't set a preference in the HTTP response.
This wouldn't fix the cause of the described issue, but it could allow administrators to fine-tune the configuration to alleviate its effects.
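
As a rough illustration, here is a minimal sketch of such a strategy for Apache HttpClient 4.x; the class name and fallback duration are hypothetical, and in the actual proposal the duration would come from a realm setting:

    import java.util.concurrent.TimeUnit;

    import org.apache.http.HttpResponse;
    import org.apache.http.conn.ConnectionKeepAliveStrategy;
    import org.apache.http.impl.client.DefaultConnectionKeepAliveStrategy;
    import org.apache.http.protocol.HttpContext;

    /**
     * Honours a server-provided Keep-Alive header, but falls back to a fixed
     * duration when the header is absent instead of the "keep alive forever"
     * (-1) that DefaultConnectionKeepAliveStrategy returns in that case.
     */
    public class BoundedKeepAliveStrategy implements ConnectionKeepAliveStrategy {

        // Hypothetical fallback; the real value would be read from the realm configuration.
        private final long defaultKeepAliveMillis = TimeUnit.MINUTES.toMillis(3);

        @Override
        public long getKeepAliveDuration(HttpResponse response, HttpContext context) {
            long serverPreference =
                DefaultConnectionKeepAliveStrategy.INSTANCE.getKeepAliveDuration(response, context);
            // A positive value means the server sent "Keep-Alive: timeout=..."; only
            // apply our fallback when the server expressed no preference.
            return serverPreference > 0 ? serverPreference : defaultKeepAliveMillis;
        }
    }

Such a strategy would presumably be wired into the realm's async client via HttpAsyncClientBuilder#setKeepAliveStrategy.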

@jkakavas jkakavas added the :Security/Authentication Logging in, Usernames/passwords, Realms (Native/LDAP/AD/SAML/PKI/etc) label Jul 20, 2021
@elasticmachine elasticmachine added the Team:Security Meta label for security team label Jul 20, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-security (Team:Security)

@DaveCTurner
Contributor

We faced the same problems with the HTTP client that Watcher uses, see #72736 for the relevant fixes:

  • Call HttpClientBuilder#setConnectionTimeToLive (using a TTL value obtained from a setting)
  • Enable TCP keepalives on the underlying socket by default (also user-configurable)

Additionally, you only get a timeout here if the net.ipv4.tcp_retries2 sysctl is not configured as we recommend.
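
For illustration, a minimal sketch of those two fixes on the synchronous HttpClientBuilder named above (the TTL value here is arbitrary; in #72736 both knobs are user-configurable settings):

    import java.util.concurrent.TimeUnit;

    import org.apache.http.config.SocketConfig;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClientBuilder;

    public class PooledClientWithTtl {

        public static CloseableHttpClient build() {
            return HttpClientBuilder.create()
                // Close pooled connections after this lifetime, even if they look healthy.
                .setConnectionTimeToLive(5, TimeUnit.MINUTES)
                // Enable TCP keepalives on the underlying sockets so dead peers get detected.
                .setDefaultSocketConfig(SocketConfig.custom().setSoKeepAlive(true).build())
                .build();
        }
    }

The OIDC realm builds an async client rather than this synchronous one, so the exact wiring differs, but the idea is the same: bound the connection lifetime and turn on TCP keepalives.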

@jkakavas
Member Author

I did some investigating last night and I have the following outcomes/conclusions:

  • It looks like the pooled connection manager in use will only try to reuse connections if TLS client authentication is enabled (or NTLM, which is not relevant here). See http://people.apache.org/~olegk/tutorial-httpclient-4.4/html/advanced.html for more details. We are not manually trying to share context between request executions, so our client should not even try to reuse connections (the HttpClient UserToken would be null).
  • I did some testing before and after introducing a fix similar to the one in 72736, and it verified that we seem to be using different TCP connections. A pcap of the testOutgoingConnectionsReuse test is available in oidc-ttl.zip
  • We don't support TLS client authentication for our OIDC realm (only basic, post and JWT).

Given the above, we shouldn't be reusing TCP connections, so the problem described above should not be manifesting?

@ywangd
Member

ywangd commented Aug 24, 2021

  • It looks like the pooled connection manager in use will only try to reuse connections if TLS client authentication is enabled (or NTLM, which is not relevant here)

Maybe it is just my reading, but I don't think the doc says connection reuse is only possible when there is a non-null UserToken. It says a non-null token is used to avoid mis-reuse of a connection, since a stateful connection can contain sensitive data. So a stateful connection is either not reused at all or only reused if it is for the same principal. But in the case of a null UserToken, connection reuse should be happening? This seems to be the relevant source code, and it looks like all null-state connections are shared.

@stephan-erb-by

I am seeing a slightly different but related error scenario: in our case we do receive the RST packet when the connection is dropped. However, the connection pool only learns about it when re-using the connection:

[elasticsearch.server][WARN] Authentication to realm oidc1 failed - Failed to authenticate user with OpenID Connect (Caused by ElasticsearchSecurityException[Failed to exchange code for Id Token using the Token Endpoint.]; nested: SocketException[Connection reset];)

This generates an error that bubbles up directly to the user (a screenshot of the resulting Kibana error page was attached here).

If you have 4 Kibana instances running, a user will have to re-try their login up to 4 times before it is successful.

Have you considered making the entire connection pool optional? For environments with just a few users this would be more than fine.

@DaveCTurner
Contributor

I am seeing a slightly different but related error scenario: in our case we do receive the RST packet when the connection is dropped. However, the connection pool only learns about it when re-using the connection:

Yes, the RST is only sent when re-using a lost connection. This sounds like the same fundamental problem, but with a slightly better-behaved backend which actively rejects unknown traffic rather than just dropping it.

Have you considered making the entire connection pool optional? For environments with just a few users this would be more than fine.

I think it's better to adjust the config of the connection pool as proposed above. Making it optional (a) doesn't fix this for busier environments that need the pool and (b) requires you to decide whether your environment is busy enough to use a pool, whereas a connection pool with a better config (a) works for all environment sizes and (b) doesn't require users to make any choices to get it to work.

@stephan-erb-by

stephan-erb-by commented Sep 20, 2021

Agreed. Having the pool do the right thing in all cases is the preferred option.

I only brought up the deactivation option because it felt like an easier/faster bugfix to ship, but now that I have checked your other PR #72736, doing the right thing looks straightforward.

@justincr-elastic
Contributor

Could this common HTTP client bug cause the same symptoms?

  1. The HTTP client sends a request to the server, and the HTTP server starts streaming an error back.
  2. The HTTP client reads the response header, sees a 4xx/5xx error, and closes the HTTP connection without consuming getErrorStream().

My understanding is that TCP socket pools are decoupled from HTTP connections. In other words, TCP keep-alive can be enabled even if HTTP sessions and HTTP pools are both disabled. It sounds like the Elasticsearch OIDC realm has TCP keep-alive on, but no HTTP sessions and no HTTP pool.

Also, a TLS session can be attached to a TCP socket in the keep-alive cache. That allows very fast reuse of a TCP socket for unrelated HTTP requests/responses. The only catch is that the HTTP client must complete the full HTTP request/response cycle. Skipping getErrorStream() leaves unread bytes in the TCP receive buffer, and the underlying TCP pool has no idea how to recover from higher-layer errors in HTTP (or other protocols).

How the TCP keep-alive pool discards the socket probably depends on the implementation and configuration. Also, there may be interference from firewalls, proxies, and gateways between the client and server. I am wondering if TCP packets might be getting dropped during that error handling handshake.
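
For illustration only, here is a minimal sketch of the bug pattern described above, using java.net.HttpURLConnection (where getErrorStream() lives) rather than the Apache client Elasticsearch actually uses; the endpoint URL is hypothetical:

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class DrainErrorStream {

        public static void main(String[] args) throws IOException {
            HttpURLConnection conn =
                (HttpURLConnection) new URL("https://op.example.com/token").openConnection();

            int status = conn.getResponseCode();
            // The buggy pattern is to bail out here on a 4xx/5xx status without reading the
            // body, leaving unread bytes on the socket and spoiling keep-alive reuse.
            InputStream body = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
            if (body != null) {
                try (InputStream in = body) {
                    while (in.read() != -1) {
                        // Drain and discard so the connection can go back to the keep-alive cache.
                    }
                }
            }
        }
    }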

@justincr-elastic
Contributor

We faced the same problems with the HTTP client that Watcher uses, see #72736 for the relevant fixes:

  • Call HttpClientBuilder#setConnectionTimeToLive (using a TTL value obtained from a setting)
  • Enable TCP keepalives on the underlying socket by default (also user-configurable)

Additionally you only get a timeout here if the net.ipv4.tcp_retries2 sysctl is not configured as we recommend.

@DaveCTurner The other issue you mentioned is for HTTP connection pools? Does that mean the TCP keep-alive part is relevant, but not the HTTP pool part?

A related question: I don't see anything in the comments about addressing HTTP-layer connection drops. For proactive pool cleanup, I like the example of Apache DBCP2. It does SQL pooling, but it reuses Apache POOL2 under the covers, and the config concepts are common to HTTP pooling too (see the sketch after this list):

  • validationQuery: Tests the connection at the application layer. That implicitly verifies and fixes TCP disconnects too.
  • testOnBorrow: When true, executes validationQuery during borrow, so the caller gets either a validated existing connection or a new connection (i.e. implicitly validated).
  • testWhileIdle: When true, executes validationQuery while connections are idle, so testOnBorrow has a higher probability of getting a validated existing connection.
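
As a concrete (if SQL-flavoured) sketch, this is roughly how those knobs look on a commons-dbcp2 BasicDataSource; the JDBC URL and values are placeholders, and the analogy to HTTP pooling is conceptual only:

    import org.apache.commons.dbcp2.BasicDataSource;

    public class Dbcp2ValidationExample {

        public static BasicDataSource build() {
            BasicDataSource ds = new BasicDataSource();
            ds.setUrl("jdbc:postgresql://db.example.com/app"); // hypothetical database
            ds.setValidationQuery("SELECT 1");   // cheap application-layer health check
            ds.setTestOnBorrow(true);            // validate a pooled connection before handing it out
            ds.setTestWhileIdle(true);           // validate idle connections from the background evictor
            ds.setTimeBetweenEvictionRunsMillis(30_000); // the evictor must run for testWhileIdle to matter
            return ds;
        }
    }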

@DaveCTurner
Contributor

Could this common HTTP client bug cause the same symptoms?

If we tried to re-use a connection without having consumed the body of the previous response then we wouldn't see a timeout. It'd just fail immediately, since the next bytes consumed would almost certainly not be a legal start to an HTTP response. This is certainly a possible way to misuse the HTTP state machine in a very general sense, but AFAICT in the client we use, connections are returned to the pool by the act of consuming the response body; there's no other explicit "close" method that could leave things in an invalid state.

It sounds like Elasticsearch OIDC realm has TCP keep-alive on, but no HTTP sessions and no HTTP pool.

I understood it to be the opposite: there's an HTTP pool but no TCP keepalives. If there's no pool then every request would use a fresh connection, which would mean you'd only get this kind of timeout if the other end genuinely took over 30s to respond.

How the TCP keep-alive pool discards the socket probably depends on the implementation and configuration.

I'm not sure what you mean here and there aren't many options. If TCP keepalives go unacknowledged then the connection is closed and further attempts to use it will yield immediate exceptions. There is no "TCP keep-alive pool", keepalives apply to each individual TCP connection.

I am wondering if TCP packets might be getting dropped during that error handling handshake.

I'm not following this either. What is an "error handling handshake"? The firewall/proxy/whatever that causes the problem is certainly dropping TCP packets but it's nothing to do with any handshaking - all the handshakes took place long before the problem occurred and we know they succeeded because otherwise we wouldn't be considering the connection to be suitable for use.

Does that mean the TCP keep-alive part is relevant, but not the HTTP pool part?

I don't follow here either. Do you see a mechanism for trying to use a HTTP connection after leaving it idle for many minutes that isn't pooling?

Related question, I don't see anything in the comments about addressing HTTP layer connection drops.

The HTTP layer doesn't have a concept of a connection, but in principle it's possible that there's some state shared between consecutive requests on the same TCP connection, and it's this higher-level session state which goes awry. However, if the TCP layer were fine and something went wrong at a higher layer then we'd not get the timeout we're discussing, we'd get a fast failure.

@stephan-erb-by

What do we need to progress on this issue?

I am pretty sure the approach proposed by @DaveCTurner in his first comment should help to address most of the issues described in this thread (#72736).

@jkakavas
Member Author

jkakavas commented Dec 2, 2021

Thank you for your comment @stephan-erb-by. We have identified that this should be addressed and, as you point out, we have candidate options for a resolution. It is on our roadmap and we plan to address it soon.

@stephan-erb-by

stephan-erb-by commented Feb 24, 2022

Just an idea for anyone else running into this frequently: a cron job that signs into Kibana every few minutes works wonders. It's a crude hack but it gets the job done - at least for us, for now :)

@xens

xens commented Mar 7, 2022

@stephan-erb-by can you share more insights about your hack? Are you using an OAuth2 client-credentials flow and POSTing the creds to some Kibana endpoint for authentication?

@stephan-erb-by

@xens yes that is pretty much it.

As an illustration, here is how this could be done in Python. The snippet is not fully runnable, but it should help you get the idea.

    import logging
    from urllib.parse import urljoin

    import requests

    logger = logging.getLogger(__name__)

    # kibana_endpoint, user, pw, get_my_sso_url and log_into_my_sso_with_creds are
    # placeholders that depend on your environment and IDP.

    with requests.Session() as session:
        session.headers.update({"User-Agent": "Mozilla/5.0 Gecko/20100101 Firefox/96.0"})

        # Log into Kibana
        login_url = urljoin(kibana_endpoint, "/internal/security/login")
        login_payload = {
            "currentURL": urljoin(kibana_endpoint, "/login?next=%2F"),
            "providerName": "oidc1",
            "providerType": "oidc",
        }
        headers = {"kbn-xsrf": "true"}
        response = session.post(
            login_url, allow_redirects=True, json=login_payload, headers=headers
        )
        response.raise_for_status()

        # We get redirected to our SSO / IDP
        login_response_location = response.json()["location"]
        my_sso_url = get_my_sso_url(session, login_response_location)

        # Next, log into our SSO with our credentials. This is specific to whatever is in use.
        response = session.get(f"{my_sso_url}&login_hint={user}")
        response.raise_for_status()
        log_into_my_sso_with_creds(session, response, user, pw)

        # Now we get redirected back to Kibana, which should finally let us in.
        response = session.get(f"{my_sso_url}&login_hint={user}", allow_redirects=True)
        response.raise_for_status()

        logger.info("Successful Kibana/Elastic login!")

@xens

xens commented Mar 10, 2022

@stephan-erb-by thanks, that's really helpful. Can you also share the run frequency that you've set up? Ideally I'd like to find the sweet spot between having a working workaround and not consuming too many tokens against our IDP quota.

@stephan-erb-by

That depends a bit on how fast your connections time out. I think we are currently doing 10 logins in short succession every 5 minutes. It also depends on how many Elastic & Kibana nodes you have, as each has its own independent connection pool (if I understood that correctly).

@xens

xens commented Mar 11, 2022

@stephan-erb-by thanks I was able to implement something!

My initial idea of using a different OAuth2 flow to ease the token retrieval process was completely wrong: the Elastic backend expects an authorization code in order to call the IDP and obtain the access_token, so we need to stick with the OAuth2 authorization-code flow.

It was very tricky; I had to use Selenium to follow my IDP's login flow, as I couldn't easily bypass the login SPA with this OAuth2 flow.

Unfortunately the workaround will badly impact our IDP access_token quota (>80'000 tokens / month :( ), so I hope that Elastic will come up with a solution soon.

@justincr-elastic
Contributor

Has anyone addressed these questions?

  1. Should we be using TCP keep-alive or HTTP pooling at all?
  2. If yes, should we provide an option to disable both so customers have a workaround?

@ywangd ywangd self-assigned this Jun 16, 2022
ywangd added a commit to ywangd/elasticsearch that referenced this issue Jun 17, 2022
In some environments, the back-channel connection can be dropped
without sending a TCP RST to ES. When that happens, reusing the same
connection results in a timeout error.

This PR adds a new http.connection_pool_ttl setting to control how long
a connection in the OIDC back-channel pool can be idle before it is
closed. This allows ES to more actively close idle connections to avoid
the timeout issue.

Resolves: elastic#75515
ywangd added a commit that referenced this issue Jun 20, 2022
In some environments, the back-channel connection can be dropped
without sending a TCP RST to ES. When that happens, reusing the same
connection results in a timeout error.

This PR adds a new http.connection_pool_ttl setting to control how long
a connection in the OIDC back-channel pool can be idle before it is
closed. This allows ES to more actively close idle connections to avoid
the timeout issue.

The new setting has a 3-minute default, which means idle connections are
closed after 3 minutes unless the server response specifies a shorter keep-alive.

Resolves: #75515
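
Assuming the usual OIDC realm settings namespace, the new setting would be configured along these lines in elasticsearch.yml (the realm name oidc1 is illustrative; check the release docs for your version for the authoritative key and default):

    xpack.security.authc.realms.oidc.oidc1.http.connection_pool_ttl: 3m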
@vikasmishra17

Facing the same related issue in Elastic 8.3.0; it is also intermittent.

[2022-11-17T13:51:46,543][WARN ][o.e.x.s.a.RealmsAuthenticator] [warm-data-4] Authentication to realm oidc1 failed - Failed to authenticate user with OpenID Connect (Caused by org.elasticsearch.ElasticsearchSecurityException: Failed to exchange code for Id Token using the Token Endpoint.)

Caused by: java.net.SocketTimeoutException: 5,000 milliseconds timeout on connection http-outgoing-3 [ACTIVE]
        ... 11 more

@ywangd
Member

ywangd commented Nov 17, 2022

@vikasmishra17 This is fixed in 8.4.0 and 8.3.3. You might want to upgrade.
