Unblock SmallRye Health exposed routes #37352

Merged
merged 1 commit into from
Jan 23, 2024

Conversation

xstefank
Member

Fixes #35099

Issue #36977 caused by the original fix is addressed with the help of @computerlove (big thanks Marvin!).

For quick reference: the issue was caused by the checks still being initialized on a blocking thread (through the scheduler), which bound them to run on the wrong thread. This PR changes extensions/smallrye-health/runtime/src/main/java/io/quarkus/smallrye/health/runtime/QuarkusAsyncHealthCheckFactory.java to mitigate this scenario and make sure that:

  • blocking HealthCheck always runs on blocking executor
  • nonblocking AsyncHealthCheck always runs on eventloop
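
As a rough illustration of the rule in the two bullets above, here is a JDK-only model. It is not the code in QuarkusAsyncHealthCheckFactory: the class, executor, and method names are all hypothetical, and plain executors stand in for the Vert.x event loop and the Quarkus worker pool.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

// Simplified model of the dispatch rule; every name here is hypothetical.
final class CheckDispatchSketch {

    // Stand-in for the event loop: a single thread that must never block.
    static final ExecutorService EVENT_LOOP =
            Executors.newSingleThreadExecutor(r -> daemon(r, "sketch-eventloop"));

    // Stand-in for the worker pool, where blocking code is allowed.
    static final ExecutorService WORKER =
            Executors.newFixedThreadPool(2, r -> daemon(r, "sketch-worker"));

    static Thread daemon(Runnable r, String name) {
        Thread t = new Thread(r, name);
        t.setDaemon(true); // do not keep the JVM alive for the sketch
        return t;
    }

    // A blocking check is always handed to the worker pool,
    // regardless of which thread registered or requested it.
    static CompletableFuture<String> runBlockingCheck(Supplier<String> check) {
        return CompletableFuture.supplyAsync(check, WORKER);
    }

    // A non-blocking check is always started on the event loop
    // and its asynchronous result is flattened into one future.
    static CompletableFuture<String> runAsyncCheck(Supplier<CompletableFuture<String>> check) {
        return CompletableFuture.supplyAsync(check, EVENT_LOOP)
                .thenCompose(f -> f);
    }
}
```

The point of the model is that the executor is chosen from the type of the check, not from the thread that happened to create it.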

@ahus1
Contributor

ahus1 commented Nov 28, 2023

I've tested this change for the KC project and it works as we would expect: when there are no checks or only asynchronous checks, the execution path doesn't touch the Quarkus thread pool and no requests are queued. This way, we won't see a timeout in an overload situation where blocking requests are queued, and no error when load shedding on executor queue length is active.

Thank you very much!

@xstefank
Member Author

xstefank commented Jan 5, 2024

I'll force push this since it seems that CI is stuck

@xstefank xstefank force-pushed the i35099-non-blocking-routes-2 branch from 7af7de5 to 72711fb Compare January 5, 2024 07:07

@xstefank
Member Author

xstefank commented Jan 5, 2024

@geoand, @cescoffier can you take a look please?

@geoand
Contributor

geoand commented Jan 5, 2024

Is there any way we can add a test similar to #36977?

@xstefank xstefank force-pushed the i35099-non-blocking-routes-2 branch 2 times, most recently from f8194e0 to 4c55998 Compare January 9, 2024 13:42
@xstefank
Member Author

xstefank commented Jan 9, 2024

@geoand, added test with the user scenario.

@xstefank xstefank force-pushed the i35099-non-blocking-routes-2 branch from 4c55998 to 85fdcb8 Compare January 9, 2024 13:57

@xstefank xstefank force-pushed the i35099-non-blocking-routes-2 branch from 85fdcb8 to ba1a3aa Compare January 9, 2024 16:11
@xstefank
Member Author

xstefank commented Jan 9, 2024

Refactored the new test into a new additional-tests module because of a cyclic dependency. I don't like creating a Maven module just for testing, but I'm not able to reproduce the issue in a different way.

@xstefank xstefank force-pushed the i35099-non-blocking-routes-2 branch 3 times, most recently from fa055db to 52153f2 Compare January 10, 2024 07:42
@xstefank
Member Author

Couldn't sleep over the new test module, so I found a way to avoid it :D. The new BlockingNonBlockingTest test fails when you revert the changes to QuarkusAsyncHealthCheckFactory.java, which is both the cause of and the fix for #36977.

@geoand
Contributor

geoand commented Jan 10, 2024

This looks good to me, but we'll need a +1 from @cescoffier as well

public void testRegisterHealthOnBlockingThreadStep1() {
// wait for the initial startup health call to finish
try {
Thread.sleep(5000);
Member

Any idea how we could use Awaitility instead of this hard-coded sleep?
This is likely going to fail on slow systems and introduce delay on fast ones.
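
A sketch of what a polling wait could look like in place of the fixed sleep. Awaitility offers this through calls such as await().atMost(...).until(...); the JDK-only helper below is a hypothetical stand-in that shows the same idea. The condition to poll (for example, "the startup health call has finished") is an assumption about what the test is able to observe.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

// JDK-only stand-in for a polling wait; class and method names are hypothetical.
final class PollingWaitSketch {

    // Polls the condition until it holds or the timeout elapses.
    // Returns true as soon as the condition is met, false on timeout or interrupt.
    static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // gave up: the condition never became true in time
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
        return true;
    }
}
```

A fast machine leaves the loop as soon as the condition holds, while a slow one gets the full timeout, which addresses both concerns raised in the comment.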

try {
Thread.sleep(5000);
} catch (InterruptedException e) {
throw new RuntimeException(e);
Member

I would just throw the InterruptedException

Member Author

We can't, since it's an override.

Member

Overridden from where?

if (!inMemoryLogHandler.getRecords().isEmpty()) {
LogRecord logRecord = inMemoryLogHandler.getRecords().get(0);
assertEquals(Level.WARNING, logRecord.getLevel());
assertFalse(logRecord.getMessage().contains("has been blocked for"),
Member

shouldn't we iterate over the whole set of messages? It might not be the last one.
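
One way to check every captured record instead of only the first, sketched with plain java.util.logging types. The helper class and method names are hypothetical; the message fragment is the one used in the assertion above.

```java
import java.util.List;
import java.util.logging.LogRecord;

// Hypothetical helper: scans all captured records rather than a single index.
final class LogRecordScanSketch {

    // True if any record's message contains the given fragment.
    static boolean anyMessageContains(List<LogRecord> records, String fragment) {
        return records.stream()
                .map(LogRecord::getMessage)
                .anyMatch(m -> m != null && m.contains(fragment)); // messages may be null
    }
}
```

In the test this would presumably be used as assertFalse(LogRecordScanSketch.anyMessageContains(inMemoryLogHandler.getRecords(), "has been blocked for")), which also makes the isEmpty() guard unnecessary.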

static final class BlockingHealthCheck implements HealthCheck {
@Override
public HealthCheckResponse call() {
// block for 3s which is more than allowed default blocking duration of eventloop (2s)
Member

Instead, I would use an illegal operation such as:

Uni.createFrom().item(42).onItem().delayIt().by(Duration.ofMillis(10)).await().indefinitely();

If you are on the event loop, this operation will be rejected as you are not allowed to use await().

@xstefank xstefank force-pushed the i35099-non-blocking-routes-2 branch from 52153f2 to 188c343 Compare January 18, 2024 16:26
@quarkus-bot quarkus-bot bot added the kind/enhancement New feature or request label Jan 23, 2024
@gsmet gsmet added triage/backport-3.8 and removed kind/enhancement New feature or request labels Jan 23, 2024
@vmuzikar

@xstefank Thanks for the PR!

Keycloak would be interested in getting this into 3.7/3.8. :)

@ivivanov-bg

ivivanov-bg commented Mar 11, 2024

Hello,

After this change, I get an exception in the log when I try to send a POST request to the /q/health endpoint. (I just updated from 3.6.2 to 3.8.2, but the last working version is 3.6.9. It also works if I downgrade only the quarkus-smallrye-health package.)

The request also doesn't return (Postman keeps spinning).

The stack trace is the following:

2024-03-11 21:51:38,983 ERROR [io.qua.mut.run.MutinyInfrastructure               ] (vert.x-worker-thread-1) Mutiny had to drop the following exception: java.lang.NullPointerException: Cannot invoke "io.vertx.core.Context.runOnContext(io.vertx.core.Handler)" because "context" is null
	at io.smallrye.mutiny.vertx.MutinyHelper.lambda$executor$3(MutinyHelper.java:32)
	at io.smallrye.mutiny.operators.uni.UniEmitOn$UniEmitOnProcessor.onItem(UniEmitOn.java:34)
	at io.smallrye.mutiny.operators.uni.UniAndCombination$AndSupervisor.computeAndFireTheOutcome(UniAndCombination.java:151)
	at io.smallrye.mutiny.operators.uni.UniAndCombination$AndSupervisor.check(UniAndCombination.java:130)
	at io.smallrye.mutiny.operators.uni.UniAndCombination$UniHandler.onItem(UniAndCombination.java:220)
	at io.smallrye.mutiny.operators.uni.UniOperatorProcessor.onItem(UniOperatorProcessor.java:47)
	at io.smallrye.mutiny.operators.uni.UniOnItemTransform$UniOnItemTransformProcessor.onItem(UniOnItemTransform.java:43)
	at io.smallrye.mutiny.operators.uni.UniOperatorProcessor.onItem(UniOperatorProcessor.java:47)
	at io.smallrye.mutiny.operators.uni.builders.UniCreateFromItemSupplier.subscribe(UniCreateFromItemSupplier.java:29)
	at io.smallrye.mutiny.operators.AbstractUni.subscribe(AbstractUni.java:36)
	at io.smallrye.mutiny.operators.uni.UniOnFailureFlatMap.subscribe(UniOnFailureFlatMap.java:31)
	at io.smallrye.mutiny.operators.AbstractUni.subscribe(AbstractUni.java:36)
	at io.smallrye.mutiny.operators.uni.UniOnItemTransform.subscribe(UniOnItemTransform.java:22)
	at io.smallrye.mutiny.operators.AbstractUni.subscribe(AbstractUni.java:36)
	at io.smallrye.mutiny.operators.uni.UniRunSubscribeOn.lambda$subscribe$0(UniRunSubscribeOn.java:27)
	at io.smallrye.mutiny.vertx.MutinyHelper.lambda$blockingExecutor$6(MutinyHelper.java:62)
	at io.vertx.core.impl.ContextImpl.lambda$executeBlocking$1(ContextImpl.java:190)
	at io.vertx.core.impl.ContextInternal.dispatch(ContextInternal.java:276)
	at io.vertx.core.impl.ContextImpl.lambda$internalExecuteBlocking$2(ContextImpl.java:209)
	at org.jboss.threads.ContextHandler$1.runWith(ContextHandler.java:18)
	at org.jboss.threads.EnhancedQueueExecutor$Task.run(EnhancedQueueExecutor.java:2513)
	at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1538)
	at org.jboss.threads.DelegatingRunnable.run(DelegatingRunnable.java:29)
	at org.jboss.threads.ThreadLocalResettingRunnable.run(ThreadLocalResettingRunnable.java:29)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:1583)

Any hints as to what could be wrong?

Edit:
The only health check I use is from Agroal 3.8.2, connecting to an Oracle database.

@geoand
Contributor

geoand commented Mar 12, 2024

@ivivanov-bg is there any chance you can attach a sample application that exhibits the behavior you describe?

Thanks

@ivivanov-bg

I will try (not sure how much time it will take though).
I was hoping more for some hints, since there was a similar issue with MongoDB (#36977), but I haven't had a chance to go over how it was actually fixed.

If anyone has hints, please share. If not, I will post back when I have the sample application (for now, my workaround is to downgrade quarkus-smallrye-health to 3.6.9).

Thanks

@geoand
Contributor

geoand commented Mar 12, 2024

If anyone has hints

In the case of the previous bug, there was nothing the application could do, it was a subtle bug in the way Quarkus handles health checks.

@ivivanov-bg

sample-app.zip

Attached is the sample app:
It works okay when run as:

mvn clean test -Dquarkus-smallrye-health.version=3.6.9

But it fails when run as:

mvn clean test -Dquarkus-smallrye-health.version=3.7.1

@geoand
Contributor

geoand commented Mar 12, 2024

Thanks

@geoand
Contributor

geoand commented Mar 12, 2024

@xstefank seems like you have your work cut out for you 😉 ^

@ivivanov-bg

Is this because of the custom identity provider?
(In a similar project without one, I don't face the issue.)

@geoand
Contributor

geoand commented Mar 12, 2024

Could very well be

@ivivanov-bg

Actually, I just tested it.
It's because of the CustomAuthenticationMechanism.
It might be related to the Mutiny library update.

I tried creating the Uni<SecurityIdentity> in different ways in case securityIdentity is null but every time it fails the same way

@xstefank
Member Author

I'm not sure why, but in your sample app Vertx.currentContext() returns null in SmallRyeHealthHandlerBase#doHandle(), which should now always run on a Vert.x thread; in your case it doesn't.

@xstefank
Member Author

xstefank commented Mar 12, 2024

@ivivanov-bg A quick fix on your side:

Context orCreateContext = Vertx.vertx().getOrCreateContext();
return Uni.createFrom().item(securityIdentity).emitOn(command -> orCreateContext.runOnContext(v -> command.run()));

This works but I need to verify with Vert.x guys if this is something we should be handling in sr-health extension.

@cescoffier
Member

cescoffier commented Mar 12, 2024

No, do NOT do that, you are creating a new instance of Vert.x, unmanaged and unclosed.

You need:

  1. check if you are on a duplicated context
  2. if so, use that duplicated context, in a last runOnContext
  3. if you are not on a duplicated context - well, no problem, you should be able to block

@xstefank
Member Author

I totally missed that you're creating a single thread executor. So better would be:

@Alternative
@Priority(1)
@ApplicationScoped
@Slf4j
@RequiredArgsConstructor(onConstructor_ = @Inject)
public class CustomAuthenticationMechanism implements HttpAuthenticationMechanism {

    private final BasicAuthenticationMechanism basicAuth;

    @Inject
    Vertx vertx;

    @Override
    public Uni<SecurityIdentity> authenticate(RoutingContext context, IdentityProviderManager identityProviderManager) {
        return basicAuth.authenticate(context, identityProviderManager)
                        .emitOn(command -> vertx.getOrCreateContext().runOnContext(v -> command.run()))
                        .onItem()
                        .transformToUni(securityIdentity -> {
            if (securityIdentity != null && !securityIdentity.getRoles().contains("admin")) {
                final String username = securityIdentity.getPrincipal().getName();

                if (username == null || username.equals("test1")) {
                    return Uni.createFrom()
                       .failure(new ForbiddenException(String.format("User %s not allowed to access %s", username, context.normalizedPath())));
                }
            }

            return Uni.createFrom().item(securityIdentity);
        });
    }

    @Override
    public Uni<ChallengeData> getChallenge(RoutingContext context) {
        return basicAuth.getChallenge(context);
    }

    @Override
    public Set<Class<? extends AuthenticationRequest>> getCredentialTypes() {
        return Set.copyOf(basicAuth.getCredentialTypes());
    }
}

So this will either get existing or create new context (but not the whole vertx).

What I still don't understand is why this HttpAuthenticationMechanism overrides the reactive routes on which we run health checks. Maybe @sberyozkin?

@cescoffier
Member

The CustomAuthenticationMechanism is slightly broken:

  • it creates a new thread pool of one every time. (that's the .emitOn(Executors.newSingleThreadExecutor()))
  • it does not capture and restore the duplicated context

The second point breaks the health checks, as they expect to be called on a duplicated context (I just discussed a fallback with @xstefank). Note that it can break not only health checks, but tracing too.

@sberyozkin, could we imagine a safety guard before calling the authentication mechanism that will capture the duplicated context and restore it if the user code does not do it? (Can you point me to the code calling this?)

@ivivanov-bg

it creates a new thread pool of one every time. (that's the .emitOn(Executors.newSingleThreadExecutor()))

There is actually one part in the actual code missing in the sample app. I need to perform a DB operation to fetch some data based on the called URL.

When using the suggested approach .emitOn(vertx.getOrCreateContext()::runOnContext)
(or even by completely removing the .emitOn part to use the same runner) I get:
io.quarkus.runtime.BlockingOperationNotAllowedException: Cannot start a JTA transaction from the IO thread.

So I ended up with:

  1. Run the basic auth
  2. Emit the Uni to a background executor (added field in the class to reuse, thanks @cescoffier for the hint)
  3. Do the DB operation
  4. Emit the result on the Vertx context (.emitOn(vertx.getOrCreateContext()::runOnContext))
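
The four steps above follow a "hop to a worker, then hop back" shape. The sketch below shows that shape with JDK CompletableFuture rather than Mutiny, so it is an analogy only: the class and method names are hypothetical, `worker` stands in for the reusable background executor, and `callerContext` stands in for an executor that runs tasks on the original Vert.x context.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Function;

// JDK-only analogy of the steps above; all names are hypothetical.
final class HopAndReturnSketch {

    // Runs blockingStep on the worker executor, then passes the result
    // through one more stage that is scheduled on callerContext.
    static <T, R> CompletableFuture<R> onWorkerThenBack(
            CompletableFuture<T> source,   // step 1: result of the initial authentication
            Function<T, R> blockingStep,   // step 3: the blocking work, e.g. a DB lookup
            Executor worker,               // step 2: where the blocking work runs
            Executor callerContext) {      // step 4: where the result is handed back
        return source
                .thenApplyAsync(blockingStep, worker)
                .thenApplyAsync(r -> r, callerContext);
    }
}
```

The worker executor is passed in rather than created per call, which matches the hint about reusing a single executor field.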

Thank you all for the help

@cescoffier
Member

@ivivanov-bg You need to capture the context and then restore it:

@Inject Vertx vertx;

    @Override
    public Uni<SecurityIdentity> authenticate(RoutingContext context, IdentityProviderManager identityProviderManager) {
        Executor contextExecutor = MutinyHelper.executor(Vertx.currentContext()); // Gets an executor restoring the current context.
        return basicAuth.authenticate(context, identityProviderManager)
                        .emitOn(Infrastructure.getDefaultExecutor()) // Switch to a worker thread
                        .onItem()
                        .transformToUni(securityIdentity -> {
            if (securityIdentity != null && !securityIdentity.getRoles().contains("admin")) {
                final String username = securityIdentity.getPrincipal().getName();

                if (username == null || username.equals("test1")) {
                    return Uni.createFrom()
                       .failure(new ForbiddenException(String.format("User %s not allowed to access %s", username, context.normalizedPath())));
                }
            }

            return Uni.createFrom().item(securityIdentity);
        })
        .emitOn(contextExecutor); // Switch back to the context
    }

@michalvavrik
Member

michalvavrik commented Mar 16, 2024

@sberyozkin, could we imagine a safety guard before calling the authentication mechanism that will capture the duplicated context and restore it if the user code does not do it? (Can you point me to the code calling this?)

I don't think it's a good idea to accept that this can happen. If a user is really going to do that, they presumably know what they are doing, because they are an expert.

For the majority of users, I suggest not dealing with threads (that will seem too low-level to many users), but working with the API instead.

I think the example should look like this:

@Inject BlockingSecurityExecutor blockingExecutor;

    @Override
    public Uni<SecurityIdentity> authenticate(RoutingContext context, IdentityProviderManager identityProviderManager) {
        return blockingExecutor
                .executeBlocking(() -> basicAuth.authenticate(context, identityProviderManager).await().indefinitely())
                .onItem()
                .transformToUni(securityIdentity -> {
                    if (securityIdentity != null && !securityIdentity.getRoles().contains("admin")) {
                        final String username = securityIdentity.getPrincipal().getName();
                        if (username == null || username.equals("test1")) {
                            return Uni.createFrom()
                                    .failure(new ForbiddenException(String.format("User %s not allowed to access %s", username, context.normalizedPath())));
                        }
                    }

                    return Uni.createFrom().item(securityIdentity);
                });
    }

We should probably document this if there was an agreement on the solution, because giving advice inside some PR has almost no impact on others in same situation.

Also, if emitting a Uni on a worker thread is a common case, we can provide a new method, but personally I'd expect it is not common, because then the Uni is unnecessary.

@geoand
Contributor

geoand commented Mar 19, 2024

We should probably document this if there was an agreement on the solution, because giving advice inside some PR has almost no impact on others in same situation.

+100


Successfully merging this pull request may close these issues.

Use a non-blocking handler for SmallRye Health Status
9 participants