
Vaadin 23.4.0. UI's response performance degrades after some time #19429

Open
moryakovdv opened this issue May 22, 2024 · 24 comments

@moryakovdv

moryakovdv commented May 22, 2024

Description of the bug

Hi.
In our production environment, the UI's responsiveness degrades significantly after some time.
WildFly 23 + NGINX
Vaadin 23.4.0 + Spring Boot
Automatic PUSH via WEBSOCKET
The default session duration of 30 min is set on WildFly.

The application performs rather fast (~400 ms) async UI updates (all of them executed within ui.access()).
After the initial start everything works just fine.

After some time (hours or even days), every UI request (opening a menu, pressing a button, etc.) starts to perform badly.
The Vaadin loading bar starts to blink and eventually gets stuck.
Refreshing the UI or opening the page in another browser does not help, so even NEW UIs behave this way.

Several thread dumps show blocking behavior on the same lock in several threads inside the Atmosphere engine.
See screenshots.
There is probably a deadlock somewhere.
Maybe this behavior occurs after session expiration or after switching from websocket to long polling.

I don't know how to investigate this further.
In our staging and development environments everything works as expected:
timed-out sessions die and new sessions work correctly.

Any response will be appreciated!

Expected behavior

.

Minimal reproducible example

Hard to reproduce; see screenshots.
[Screenshots: Selection_1264, Selection_1265, Selection_1266]

Versions

  • Vaadin / Flow version: Vaadin 23.4.0
  • Java version: OpenJDK-11
  • OS version: Ubuntu
  • Browser version (if applicable): Any
  • Application Server (if applicable): Wildfly 23.0.1-Final + NGINX in front
  • IDE (if applicable):
@mcollovati
Collaborator

Hi, thanks for creating the issue.
Could you provide a full thread dump with the traces of all application threads?

@moryakovdv
Author

@mcollovati, Marco, thanks for the answer.
Thread dump attached:
aedump-1.zip

@mcollovati
Collaborator

It looks like Atmosphere's UUIDBroadcasterCache gets stuck while processing a terminal operation (anyMatch()) on a parallel stream.

    private boolean hasMessage(String clientId, String messageId) {
        ConcurrentLinkedQueue<CacheMessage> clientQueue = messages.get(clientId);
        return clientQueue != null && clientQueue.parallelStream().anyMatch(m -> Objects.equals(m.getId(), messageId));
    }
"default task-183" - Thread t@3646
   java.lang.Thread.State: RUNNABLE
	at [email protected]/java.util.concurrent.ConcurrentLinkedQueue$CLQSpliterator.trySplit(ConcurrentLinkedQueue.java:880)
	at [email protected]/java.util.stream.AbstractShortCircuitTask.compute(AbstractShortCircuitTask.java:114)
	at [email protected]/java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:746)
	at [email protected]/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
	at [email protected]/java.util.concurrent.ForkJoinPool$WorkQueue.helpCC(ForkJoinPool.java:1115)
	at [email protected]/java.util.concurrent.ForkJoinPool.externalHelpComplete(ForkJoinPool.java:1957)
	at [email protected]/java.util.concurrent.ForkJoinTask.tryExternalHelp(ForkJoinTask.java:378)
	at [email protected]/java.util.concurrent.ForkJoinTask.externalAwaitDone(ForkJoinTask.java:323)
	at [email protected]/java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:412)
	at [email protected]/java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:736)
	at [email protected]/java.util.stream.MatchOps$MatchOp.evaluateParallel(MatchOps.java:242)
	at [email protected]/java.util.stream.MatchOps$MatchOp.evaluateParallel(MatchOps.java:196)
	at [email protected]/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
	at [email protected]/java.util.stream.ReferencePipeline.anyMatch(ReferencePipeline.java:528)
	at deployment.ROOT.war//org.atmosphere.cache.UUIDBroadcasterCache.hasMessage(UUIDBroadcasterCache.java:259)
	at deployment.ROOT.war//org.atmosphere.cache.UUIDBroadcasterCache.addMessageIfNotExists(UUIDBroadcasterCache.java:207)
	at deployment.ROOT.war//org.atmosphere.cache.UUIDBroadcasterCache.addToCache(UUIDBroadcasterCache.java:146)
	at deployment.ROOT.war//com.vaadin.flow.server.communication.LongPollingCacheFilter.filter(LongPollingCacheFilter.java:102)

The call happens during the execution of a BroadcasterFilter (LongPollingCacheFilter); every filter is executed in a synchronized() block that locks on the filter instance.
So all other requests are blocked by the broadcaster lock, which is not released because of the pending stream processing operation.

That said, I can't say why the stream is stuck. It looks like the clientQueue is continuously getting new elements 🤔
Does your application perhaps perform a very high rate of push invocations?

@moryakovdv
Author

Thanks for the investigation.
Yes, the app has a high rate of async updates.
I'm confused about long polling: the @Push annotation is set to use WEBSOCKET transport.
Does this mean that for some clients Atmosphere switches to long polling due to, say, network latency?

@mcollovati
Collaborator

LongPollingCacheFilter is always executed, but it performs actions only if the transport is long polling.
So, as you said, it seems that the transport is switched to long polling.

@mcollovati
Collaborator

This issue seems similar to Atmosphere/atmosphere#2262.

@moryakovdv
Author

Yes, I saw that topic, but I have no idea how to get rid of it.
It would be hard to reduce the number of push invocations.
Does it make sense to switch to PushMode.MANUAL and delegate ui.push() to some queue?

@mcollovati
Collaborator

You could maybe try to copy/paste the UUIDBroadcasterCache and rewrite the hasMessage method to perform the anyMatch on a copy of the list.
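For illustration, a rough sketch of that idea as a drop-in replacement inside a copied UUIDBroadcasterCache (messages and CacheMessage come from that class; this is only a sketch, not a tested fix, and needs the java.util.ArrayList/List imports):

    private boolean hasMessage(String clientId, String messageId) {
        ConcurrentLinkedQueue<CacheMessage> clientQueue = messages.get(clientId);
        if (clientQueue == null) {
            return false;
        }
        // Snapshot the queue first: a sequential stream over the copy terminates
        // even if the original queue keeps receiving new elements concurrently.
        List<CacheMessage> snapshot = new ArrayList<>(clientQueue);
        return snapshot.stream().anyMatch(m -> Objects.equals(m.getId(), messageId));
    }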

@moryakovdv moryakovdv changed the title Vaadin 23.4.0. UI's response performace degrades after some time Vaadin 23.4.0. UI's response performance degrades after some time May 29, 2024
@moryakovdv
Author

moryakovdv commented May 29, 2024

Well... Is it sufficient to use
@BroadcasterCacheService public class MyBroadcasterCache implements BroadcasterCache {...}
from the docs to make Atmosphere pick up my implementation?

@mcollovati
Collaborator

Does it make sense to switch to PushMode.MANUAL and delegate ui.push() to some queue?

I don't have an answer for this, sorry. It could help, but it could also just move the problem to a different layer.
Anyway, it might be worth a try: queueing messages and performing fewer push calls could prevent the lock.
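Very roughly, such a queueing approach could look like this (only a sketch, assuming PushMode.MANUAL and some periodic scheduler that calls flush(); the PushBatcher class is made up for illustration):

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    import com.vaadin.flow.component.UI;

    public class PushBatcher {
        private final UI ui;
        private final Queue<Runnable> pending = new ConcurrentLinkedQueue<>();

        public PushBatcher(UI ui) {
            this.ui = ui;
        }

        // Background threads call this instead of doing ui.access() plus an
        // automatic push for every single update.
        public void enqueue(Runnable uiUpdate) {
            pending.add(uiUpdate);
        }

        // Called periodically (e.g. every few hundred ms) by a scheduler: applies
        // all queued updates inside one ui.access() and issues a single manual push.
        public void flush() {
            if (pending.isEmpty()) {
                return;
            }
            ui.access(() -> {
                Runnable update;
                while ((update = pending.poll()) != null) {
                    update.run();
                }
                ui.push(); // one push for the whole batch (PushMode.MANUAL)
            });
        }
    }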

@moryakovdv
Author

Does it make sense to switch to PushMode.MANUAL and delegate ui.push() to some queue?

I don't have an answer for this, sorry. It could help, but it could also just move the problem to a different layer. Anyway, it might be worth a try: queueing messages and performing fewer push calls could prevent the lock.

I just thought this approach could turn parallel updates into serial ones.

@mcollovati
Collaborator

Well... Is it sufficient to use
@BroadcasterCacheService public class MyBroadcasterCache implements BroadcasterCache {...}
from the docs to make Atmosphere pick up my implementation?

IIRC you have to set it with the servlet init parameter, otherwise Flow will force UUIDBroadcasterCache.

@moryakovdv
Author

moryakovdv commented May 30, 2024

IIRC you have to set it with the servlet init parameter, otherwise Flow will force UUIDBroadcasterCache.

You are right; the following code in the PushRequestHandler class forces UUIDBroadcasterCache:

static AtmosphereFramework initAtmosphere(final ServletConfig vaadinServletConfig) {
        AtmosphereFramework atmosphere = new AtmosphereFramework(false, false) {
            @Override
            protected void analytics() {
                // Overridden to disable version number check
            }

            @Override
            public AtmosphereFramework addInitParameter(String name,
                    String value) {
                if (vaadinServletConfig.getInitParameter(name) == null) {
                    super.addInitParameter(name, value);
                }
                return this;
            }
        };

        atmosphere.addAtmosphereHandler("/*", new PushAtmosphereHandler());
        atmosphere.addInitParameter(ApplicationConfig.BROADCASTER_CACHE,
                UUIDBroadcasterCache.class.getName());
    ...

But I cannot find the proper way to make it use my CustomBroadcasterCache.
I tried an init-param and a context-param in web.xml, the @BroadcasterCacheService annotation, and the following code:

@ManagedBean
public class AtmosphereInitializer implements ServletContextInitializer {
	@Override
	public void onStartup(ServletContext servletContext) {
		servletContext.setInitParameter("org.atmosphere.cpr.AtmosphereConfig.getInitParameter", "true");

		servletContext.setInitParameter("org.atmosphere.cpr.broadcaster.shareableThreadPool", "true");
		servletContext.setInitParameter("org.atmosphere.cpr.broadcaster.maxProcessingThreads", "8");
		servletContext.setInitParameter("org.atmosphere.cpr.broadcasterCacheClass",
				"com.kmp.market.CustomBroadcasterCache");
	}
}

My class com.kmp.market.CustomBroadcasterCache is completely ignored by PushRequestHandler.
BTW, the other settings in the above method, e.g. servletContext.setInitParameter("org.atmosphere.cpr.broadcaster.maxProcessingThreads", "8"), work correctly.

Could you please show me the right way?
Perhaps I am making a mess of the Vaadin + Spring Boot configuration.
Thanks in advance.

@mcollovati
Collaborator

I think you need to set the parameter on the Vaadin servlet.
Take a look at this comment for a similar use case: #16664 (comment)

Anyway, I would also investigate why the push connection is downgraded to long polling.

@moryakovdv
Author

Finally, I got it.
With the code below, the CustomBroadcasterCache is picked up by the framework:

@Bean
BeanPostProcessor patchAtmosphereBroadcaster() {
    return new BeanPostProcessor() {
        @Override
        public Object postProcessBeforeInitialization(Object bean, String beanName) throws BeansException {
            if (bean instanceof ServletRegistrationBean<?>) {
                ServletRegistrationBean<?> reg = (ServletRegistrationBean<?>) bean;
                if (reg.getServlet() instanceof SpringServlet) {
                    // Add Atmosphere init parameters to the Vaadin servlet registration
                    // so PushRequestHandler picks them up instead of its defaults.
                    reg.addInitParameter("org.atmosphere.cpr.AtmosphereConfig.getInitParameter", "true");
                    reg.addInitParameter("org.atmosphere.cpr.maxSchedulerThread", String.valueOf(maxSchedulerThread));

                    reg.addInitParameter("org.atmosphere.cpr.broadcaster.shareableThreadPool", "true");
                    reg.addInitParameter("org.atmosphere.cpr.broadcaster.maxProcessingThreads", String.valueOf(maxProcessingThreads));
                    reg.addInitParameter("org.atmosphere.cpr.broadcaster.maxAsyncWriteThreads", String.valueOf(maxAsyncWriteThreads));

                    reg.addInitParameter("org.atmosphere.cpr.broadcasterCacheClass", "com.kmp.market.CustomBroadcasterCache");
                }
            }
            return bean;
        }
    };
}

@moryakovdv
Author

moryakovdv commented May 31, 2024

Now some investigations. Without any further modifications I added a queue-size printout to hasMessage:

private boolean hasMessage(String clientId, String messageId) {
    ...
    int size = clientQueue.size();
    if (size > 0)
        System.out.println(size);
    ...
}

  1. The browser requests the application with NGINX started; the WEBSOCKET transport is instantiated and works as expected.
  2. The occasional output shows 1-2 messages in the queue.
  3. Stop NGINX; the browser shows "Connection lost".
  4. Start NGINX again; the browser reconnects.
  5. The output counter goes mad and shows a constantly increasing number of messages in the queue.

So we probably have long polling after the emulated NGINX restart.

@moryakovdv
Author

moryakovdv commented May 31, 2024

And more:
Simply putting a breakpoint into hasMessage makes any UI get stuck with the Vaadin loader and no response, even when that UI was connected without the proxy.

@mcollovati
Collaborator

And more: simply putting a breakpoint into hasMessage makes any UI get stuck with the Vaadin loader and no response, even when that UI was connected without the proxy.

I think this is somewhat expected. When pushing changes to the client, the VaadinSession lock is held, so if you block execution in hasMessage, any other access to the VaadinSession will wait for the lock to be released.

@mcollovati
Collaborator

The output counter goes mad and shows a constantly increasing number of messages in the queue.

This is probably because, while Atmosphere is trying to push cached messages to the client, the application keeps adding new messages, so the queue is never empty.

So we probably have long polling after the emulated NGINX restart.

You can check it in the browser network tab: if the websocket channel is closed, you might see HTTP push requests happening continuously.

@mcollovati
Collaborator

Probably one of the questions here is: would it be possible to reconnect with websocket transport after a network failure, instead of falling back to long polling?

[screenshot]

@mcollovati
Collaborator

If you are confident that websockets will ALWAYS work for the application's clients, you can set websocket as the fallback transport as well.

    @Bean
    VaadinServiceInitListener configureWebsocketFallbackForPush() {
        return serviceEvent ->  serviceEvent.getSource().addUIInitListener(uiEvent -> {
            uiEvent.getUI().getPushConfiguration().setFallbackTransport(Transport.WEBSOCKET);
        });
    }

[screenshot]

@moryakovdv
Author

moryakovdv commented May 31, 2024

If you are confident that websockets will ALWAYS work for the application's clients, you can set websocket as the fallback transport as well.

Thanks for the suggestion.
In what cases could falling back to websocket instead of long polling hurt the application?
Connection issues? Old browsers? Some security settings?

@mcollovati
Collaborator

I would say in cases where the client may not be able to establish a websocket connection at all.
An old browser could be one case. Corporate firewalls and proxies may also block some requests in certain setups, or network connections may time out.
Without the long polling fallback, push will basically not work, and changes will be propagated to the client only as a result of a server interaction (e.g. a click on a button).

@moryakovdv
Author

Hello. We did some further investigation.
We got rid of the complete application hangs with some modifications to the BroadcasterCache, but we still have performance issues with the BroadcasterConfig.
Please have a look at the points in the picture.
[screenshot: Selection_1432]

  1. The connection is stable; the server's CPU consumption is calm.
  2. The client's network interface is switched off.
    The network tab in the client's Firefox shows an error on the websocket connection, and the loading bar starts to blink.
  3. The client's network interface is switched on again. Firefox shows reconnection messages and starts long-polling requests.
    The server log shows the "famous" message 'UnsupportedOperationException: Unexpected message id from the client. Expected sync id: 137, got 138'.
    The server's CPU starts burning (probably because there are a lot of messages to broadcast to the client that were lost at step 2).
  4. The user reloads the page (pressing F5) in Firefox.
    The session is closed and CPU consumption calms down.

While investigating thread dumps, I found that the default BroadcasterConfig is stuck in filter() inside a synchronized block.
[screenshot]

Probably the old LongPollingCacheFilter object from the previous session was not removed or renewed after the client's reconnect.

I would appreciate your ideas on how to get out of this situation.

PS: We have NGINX as a front on both staging and production.
The above bug occurs in both environments, no matter what timings (proxy_send, proxy_connect, etc.) are set on the NGINX instances.
