Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Oidc doesn't recover when auth-server was not avaliable at startup #16725

Closed
joggeli34 opened this issue Apr 22, 2021 · 20 comments · Fixed by #16773
Closed

Oidc doesn't recover when auth-server was not avaliable at startup #16725

joggeli34 opened this issue Apr 22, 2021 · 20 comments · Fixed by #16773
Assignees
Labels
area/oidc kind/bug Something isn't working
Milestone

Comments

@joggeli34
Copy link
Contributor

Describe the bug

When runing quarkus with oidc enabled, the application starts even when the oidc-server is currently not avaliable. Each request to the api results in a error:

io.quarkus.oidc.OIDCException: Tenant configuration has not been resolved at io.quarkus.oidc.runtime.OidcAuthenticationMechanism.resolve(OidcAuthenticationMechanism.java:61) at io.quarkus.oidc.runtime.OidcAuthenticationMechanism.authenticate(OidcAuthenticationMechanism.java:40) at io.quarkus.oidc.runtime.OidcAuthenticationMechanism_Subclass.authenticate$$superaccessor2(OidcAuthenticationMechanism_Subclass.zig:512) at io.quarkus.oidc.runtime.OidcAuthenticationMechanism_Subclass$$function$$2.apply(OidcAuthenticationMechanism_Subclass$$function$$2.zig:41) at io.quarkus.arc.impl.AroundInvokeInvocationContext.proceed(AroundInvokeInvocationContext.java:54) at io.quarkus.arc.runtime.devconsole.InvocationInterceptor.proceed(InvocationInterceptor.java:63) at io.quarkus.arc.runtime.devconsole.InvocationInterceptor.monitor(InvocationInterceptor.java:51) at io.quarkus.arc.runtime.devconsole.InvocationInterceptor_Bean.intercept(InvocationInterceptor_Bean.zig:521) at io.quarkus.arc.impl.InterceptorInvocation.invoke(InterceptorInvocation.java:41) at io.quarkus.arc.impl.AroundInvokeInvocationContext.perform(AroundInvokeInvocationContext.java:41) at io.quarkus.arc.impl.InvocationContexts.performAroundInvoke(InvocationContexts.java:32) ...

Even when the oidc-server gets avaliable after some time, the requests aren't recovering from this error. Not even after the time defined in quarkus.oidc.token.forced-jwk-refresh-interval.

Expected behavior

The requests should either recover after the oidc-server is back or the application should not start if the oidc-server is not avaliable.

Configuration

# Authentication
quarkus.oidc.auth-server-url=https://the-server/auth/realms/local-dev
quarkus.oidc.client-id=the-client

Environment:

Quarkus version or git rev

1.13.2.Final

@joggeli34 joggeli34 added the kind/bug Something isn't working label Apr 22, 2021
@quarkus-bot
Copy link

quarkus-bot bot commented Apr 22, 2021

/cc @pedroigor, @sberyozkin

@sberyozkin
Copy link
Member

sberyozkin commented Apr 22, 2021

@joggeli34 Is it running in the dev mode ? A test is available confirming it recovers in the dev mode after the address is entered.

@sberyozkin
Copy link
Member

sberyozkin commented Apr 22, 2021

Keeping a timer in a production mode is expensive - especially in a multi-tenant case which would require a timer per each tenant. If it is not a dev mode then you should use a quarkus.oidc.connection-delay property

@joggeli34
Copy link
Contributor Author

It happens in dev mode and in production.

I don't understand the quarkus.oidc.connection-delay property exactly. Would this help to recover or just delaying the requests as long as defined there when no auth-server is avaliable? So when setting the value to 20s, but the auth-server is avaliable after 1min it doesn't recover?

@sberyozkin
Copy link
Member

OK, I see - no, as I said no timer is allocated to keep pinging the OIDC server until it may come up. I expect that in production OIDC server is already running - so we are really talking about a test/dev scenario where it may not have started just yet - so yes - if you set this property to 20 sec then it will miss if it will start after 1 min so set to 2 mins.

@joggeli34
Copy link
Contributor Author

Ok, now I understand. Thanks.
We will handle it now with livenessProbes and the quarkus.oidc.connection-delay, so that the application gets restarted when the liveness is not reached after the same amount of time as the connection-delay.

Normally our OIDC-server (keycloak) is already running, but we had a major issue which triggered the whole cluster to reboot...

Would it be an idea that quarkus termines if it can't reach the OIDC server after the connection-delay? Because now the applications stays in a unusable and unrecoverable state forever, when not restarted from outside (liveness-probe).

@sberyozkin
Copy link
Member

sberyozkin commented Apr 22, 2021

@joggeli34 I think I can update the code to give it a single retry when the statically configured OIDC connection is required but it is marked as not initialized. I'll give a try.
Note you can also delay it easily with TenantConfigResolver :-)

@joggeli34
Copy link
Contributor Author

I now tested with different quarkus-versions and noticed that this behavior was introduced in 1.13.1.Final.
With 1.13.0.Final the application just throws and error and prevents to start (ERROR: Failed to start application (with profile dev))

Is that correct?

@sberyozkin
Copy link
Member

@joggeli34 Yes - it was affecting the user experience in the dev mode.

@joggeli34
Copy link
Contributor Author

I now tested with the parameter settings:

# Authentication
quarkus.oidc.auth-server-url=https://id-develop.xxx.cloud/auth/realms/local-dev
quarkus.oidc.client-id=xxx-api
quarkus.oidc.connection-delay=2m

In the dev-mode it doesn't recover even if the oidc-server is up after 1m. It just keeps responding all requests with OICDException.

> Task :quarkusDev
Listening for transport dt_socket at address: 5005
__  ____  __  _____   ___  __ ____  ______ 
 --/ __ \/ / / / _ | / _ \/ //_/ / / / __/ 
 -/ /_/ / /_/ / __ |/ , _/ ,< / /_/ /\ \   
--\___\_\____/_/ |_/_/|_/_/|_|\____/___/   
2021-04-22 17:46:51,401 INFO  [io.qua.oid.run.OidcRecorder] (Quarkus Main Thread) Connecting to IDP for up to 60 times every 2 seconds
2021-04-22 17:46:52,455 INFO  [io.quarkus] (Quarkus Main Thread) metrics 1.0-SNAPSHOT on JVM (powered by Quarkus 1.13.1.Final) started in 3.816s. Listening on: http://localhost:8080
2021-04-22 17:46:52,455 INFO  [io.quarkus] (Quarkus Main Thread) Profile dev activated. Live Coding activated.
2021-04-22 17:46:52,455 INFO  [io.quarkus] (Quarkus Main Thread) Installed features: [cdi, micrometer, mutiny, oidc, resteasy, security, smallrye-context-propagation]
2021-04-22 17:46:56,244 ERROR [io.qua.ver.htt.run.QuarkusErrorHandler] (vert.x-eventloop-thread-9) HTTP Request to /hello/bla failed, error id: d4a53af8-a840-413f-9044-51ca96a09206-1: io.quarkus.oidc.OIDCException: Tenant configuration has not been resolved
	at io.quarkus.oidc.runtime.OidcAuthenticationMechanism.resolve(OidcAuthenticationMechanism.java:61)
	at io.quarkus.oidc.runtime.OidcAuthenticationMechanism.authenticate(OidcAuthenticationMechanism.java:40)
	at io.quarkus.oidc.runtime.OidcAuthenticationMechanism_Subclass.authenticate$$superaccessor3(OidcAuthenticationMechanism_Subclass.zig:625)
	at io.quarkus.oidc.runtime.OidcAuthenticationMechanism_Subclass$$function$$3.apply(OidcAuthenticationMechanism_Subclass$$function$$3.zig:41)
	at io.quarkus.arc.impl.AroundInvokeInvocationContext.proceed(AroundInvokeInvocationContext.java:54)
	at io.quarkus.arc.runtime.devconsole.InvocationInterceptor.proceed(InvocationInterceptor.java:63)
	at io.quarkus.arc.runtime.devconsole.InvocationInterceptor.monitor(InvocationInterceptor.java:51)
	at io.quarkus.arc.runtime.devconsole.InvocationInterceptor_Bean.intercept(InvocationInterceptor_Bean.zig:521)
	at io.quarkus.arc.impl.InterceptorInvocation.invoke(InterceptorInvocation.java:41)
	at io.quarkus.arc.impl.AroundInvokeInvocationContext.perform(AroundInvokeInvocationContext.java:41)
	at io.quarkus.arc.impl.InvocationContexts.performAroundInvoke(InvocationContexts.java:32)
	at io.quarkus.oidc.runtime.OidcAuthenticationMechanism_Subclass.authenticate(OidcAuthenticationMechanism_Subclass.zig:580)
	at io.quarkus.oidc.runtime.OidcAuthenticationMechanism_ClientProxy.authenticate(OidcAuthenticationMechanism_ClientProxy.zig:189)
	at io.quarkus.vertx.http.runtime.security.HttpAuthenticator.attemptAuthentication(HttpAuthenticator.java:100)
	at io.quarkus.vertx.http.runtime.security.HttpAuthenticator_Subclass.attemptAuthentication$$superaccessor1(HttpAuthenticator_Subclass.zig:283)
	at io.quarkus.vertx.http.runtime.security.HttpAuthenticator_Subclass$$function$$1.apply(HttpAuthenticator_Subclass$$function$$1.zig:33)
	at io.quarkus.arc.impl.AroundInvokeInvocationContext.proceed(AroundInvokeInvocationContext.java:54)
	at io.quarkus.arc.runtime.devconsole.InvocationInterceptor.proceed(InvocationInterceptor.java:63)
	at io.quarkus.arc.runtime.devconsole.InvocationInterceptor.monitor(InvocationInterceptor.java:51)
	at io.quarkus.arc.runtime.devconsole.InvocationInterceptor_Bean.intercept(InvocationInterceptor_Bean.zig:521)
	at io.quarkus.arc.impl.InterceptorInvocation.invoke(InterceptorInvocation.java:41)
	at io.quarkus.arc.impl.AroundInvokeInvocationContext.perform(AroundInvokeInvocationContext.java:41)
	at io.quarkus.arc.impl.InvocationContexts.performAroundInvoke(InvocationContexts.java:32)
	at io.quarkus.vertx.http.runtime.security.HttpAuthenticator_Subclass.attemptAuthentication(HttpAuthenticator_Subclass.zig:240)
	at io.quarkus.vertx.http.runtime.security.HttpAuthenticator_ClientProxy.attemptAuthentication(HttpAuthenticator_ClientProxy.zig:157)
	at io.quarkus.vertx.http.runtime.security.HttpSecurityRecorder$2.handle(HttpSecurityRecorder.java:101)
	at io.quarkus.vertx.http.runtime.security.HttpSecurityRecorder$2.handle(HttpSecurityRecorder.java:51)
	at io.vertx.ext.web.impl.RouteState.handleContext(RouteState.java:1038)
	at io.vertx.ext.web.impl.RoutingContextImplBase.iterateNext(RoutingContextImplBase.java:137)
	at io.vertx.ext.web.impl.RoutingContextImpl.next(RoutingContextImpl.java:132)
	at io.quarkus.micrometer.runtime.binder.vertx.VertxMeterFilter.handle(VertxMeterFilter.java:28)
	at io.quarkus.micrometer.runtime.binder.vertx.VertxMeterFilter.handle(VertxMeterFilter.java:14)
	at io.vertx.ext.web.impl.RouteState.handleContext(RouteState.java:1038)
	at io.vertx.ext.web.impl.RoutingContextImplBase.iterateNext(RoutingContextImplBase.java:137)
	at io.vertx.ext.web.impl.RoutingContextImpl.next(RoutingContextImpl.java:132)
	at io.quarkus.vertx.http.runtime.devmode.VertxHttpHotReplacementSetup$3.handle(VertxHttpHotReplacementSetup.java:86)
	at io.quarkus.vertx.http.runtime.devmode.VertxHttpHotReplacementSetup$3.handle(VertxHttpHotReplacementSetup.java:75)
	at io.vertx.core.impl.ContextImpl.lambda$null$0(ContextImpl.java:327)
	at io.vertx.core.impl.ContextImpl.executeTask(ContextImpl.java:366)
	at io.vertx.core.impl.EventLoopContext.lambda$executeAsync$0(EventLoopContext.java:38)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:834)

2021-04-22 17:47:37,244 ERROR [io.qua.ver.htt.run.QuarkusErrorHandler] (vert.x-eventloop-thread-9) HTTP Request to /hello/bla failed, error id: d4a53af8-a840-413f-9044-51ca96a09206-2: io.quarkus.oidc.OIDCException: Tenant configuration has not been resolved
	at io.quarkus.oidc.runtime.OidcAuthenticationMechanism.resolve(OidcAuthenticationMechanism.java:61)
	at io.quarkus.oidc.runtime.OidcAuthenticationMechanism.authenticate(OidcAuthenticationMechanism.java:40)
	at io.quarkus.oidc.runtime.OidcAuthenticationMechanism_Subclass.authenticate$$superaccessor3(OidcAuthenticationMechanism_Subclass.zig:625)
	at io.quarkus.oidc.runtime.OidcAuthenticationMechanism_Subclass$$function$$3.apply(OidcAuthenticationMechanism_Subclass$$function$$3.zig:41)
	at io.quarkus.arc.impl.AroundInvokeInvocationContext.proceed(AroundInvokeInvocationContext.java:54)
	at io.quarkus.arc.runtime.devconsole.InvocationInterceptor.proceed(InvocationInterceptor.java:63)
...

@sberyozkin
Copy link
Member

sberyozkin commented Apr 22, 2021

@joggeli34 As I said there is no timer there pinging the unavailable server all the time. In the dev mode one would start with no address at all or some template address - the server would start - you can verify the request fails - next you set a correct address and it works - but here it is slightly different - you just have OIDC server coming up at some point but unless you refresh the dev endpoint somehow, perhaps via a browser, etc, then it won't try to re-initialize.
So as I said I'll add one more retry at the actual request time

@joggeli34
Copy link
Contributor Author

I'm sorry and I feel a bit stupid but I still don't get it right.

As I understood, the parameter quarkus.oidc.connection-delay is like the timer that will try to connect to the server. But as mention in the previous comment, this doesn't work. Or does that only work in production?

For me it's more a problem in production then in dev. Because when it doesn't work in dev, I just restart.
In production I like to recover as soon as possible. But with the current implementation I have to wait until the startupProbeTime of kubernetes fails until it restart instead of instantaneous as soon as quarkus knows that it can't recover on its own as it was before 1.13.1

Further there is no log why the oicd is not reachable. I don't even have a log that it is not correctly set up, just some log that the requests to my api are not working correctly because the tenant is not configured. Before 1.13.1 I got a "unknown host" exception if I had a typo in the oicd address.

Does the change you propose leads to a shutdown of the application?

I don't want to be tedious, probably I just don't understand it correctly, sorry ;-)

@sberyozkin
Copy link
Member

@joggeli34 Np at all, I'll look into it, we'll try to make it right

@sberyozkin sberyozkin self-assigned this Apr 23, 2021
@sberyozkin
Copy link
Member

sberyozkin commented Apr 23, 2021

Hi @joggeli34

I'll restore the earlier functionality where the application was failing to start when it failed to connect to the OIDC server - but only if auth-server-url is an invalid URL - if it is a valid URL then a debug message will be displayed that the connection is not possible at the moment but an attempt will be made to establish it during the initial request.

It feels it will be the right balance - as going forward we can indeed consider activating a timer, possibly a single one for all the tenants, trying to establish the connection or also doing the liveness pings and trying to restore the connection if it fails after it has been established.

The role of connection-delay would only be about trying to make sure the connection is available before the requests are starting to arrive.

Note this creating a connection during the first request is already supported with the dynamic TenantConfigResolver approach - so the above proposal would make it more consistent really where we have a valid URL but the connection is not ready yet - so internally it would actually be a special TenantConfigResolver which would take care of resolving it during the initial request - thus making a real recovery try...

@joggeli34
Copy link
Contributor Author

Thanks, yes that sounds very reasonable this way.

@sberyozkin
Copy link
Member

@joggeli34 Please see #16725.
Thanks

@quarkus-bot quarkus-bot bot added this to the 2.0 - main milestone Apr 28, 2021
@gsmet gsmet modified the milestones: 2.0.0.Alpha2, 1.13.4.Final May 10, 2021
@oole
Copy link

oole commented May 20, 2021

Is there any chance this fix will make it to the 1.13.x release line? I stumbled over this issue and was excited when I saw the PR, but just now saw that is was removed from the 1.13.4.final milestone.

@sberyozkin
Copy link
Member

Hi @joggeli34, and @oole

I've just noticed only now it was not backported - I messed it up because I removed the test resource which is no longer needed in 2.0.0.x and forgot it was still needed in 1.13.x - but looks like 1.13.6.Final was the last one. Sorry about it.
@gsmet unless 1.13.7.Final is still on the map ? If yes then I'll create a PR to backport that fix but with the test resource still availabe...

@sberyozkin
Copy link
Member

@joggeli34 @oole Backport PR is here #17662

@oole
Copy link

oole commented Jun 3, 2021

Great news, thanks @sberyozkin.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/oidc kind/bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants