Failure logging for async fetch causes Network threads to be BLOCKED #57501

Closed
Bukhtawar opened this issue Jun 2, 2020 · 5 comments
Labels: >bug, needs:triage (Requires assignment of a team area label)

Bukhtawar (Contributor) commented Jun 2, 2020

Below is a thread dump showing transport_worker threads in the BLOCKED state; there were multiple threads in this state. Although this happened on 6.8, I believe it would still happen on later versions too.

"elasticsearch[LARLAca][transport_worker][T#22]" #79 daemon prio=5 os_prio=0 tid=0x00007f2d90003800 nid=0x3006 waiting for monitor entry [0x00007f2dbdefa000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:399)
	- waiting to lock <0x00007f2f5e3c4d30> (a java.lang.Object)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:264)
	at org.apache.logging.log4j.util.LoaderUtil.loadClass(LoaderUtil.java:168)
	at org.apache.logging.log4j.core.impl.ThrowableProxy.loadClass(ThrowableProxy.java:622)
	at org.apache.logging.log4j.core.impl.ThrowableProxy.toExtendedStackTrace(ThrowableProxy.java:738)
	at org.apache.logging.log4j.core.impl.ThrowableProxy.<init>(ThrowableProxy.java:138)
	at org.apache.logging.log4j.core.impl.ThrowableProxy.<init>(ThrowableProxy.java:122)
	at org.apache.logging.log4j.core.impl.Log4jLogEvent.getThrownProxy(Log4jLogEvent.java:605)
	at org.apache.logging.log4j.core.pattern.ExtendedThrowablePatternConverter.format(ExtendedThrowablePatternConverter.java:64)
	at org.apache.logging.log4j.core.pattern.PatternFormatter.format(PatternFormatter.java:38)
	at org.apache.logging.log4j.core.layout.PatternLayout$PatternSerializer.toSerializable(PatternLayout.java:334)
	at org.apache.logging.log4j.core.layout.PatternLayout.toText(PatternLayout.java:233)
	at org.apache.logging.log4j.core.layout.PatternLayout.encode(PatternLayout.java:218)
	at org.apache.logging.log4j.core.layout.PatternLayout.encode(PatternLayout.java:58)
	at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.directEncodeEvent(AbstractOutputStreamAppender.java:177)
	at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.tryAppend(AbstractOutputStreamAppender.java:170)
	at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.append(AbstractOutputStreamAppender.java:161)
	at org.apache.logging.log4j.core.appender.RollingFileAppender.append(RollingFileAppender.java:309)
	at org.apache.logging.log4j.core.config.AppenderControl.tryCallAppender(AppenderControl.java:156)
	at org.apache.logging.log4j.core.config.AppenderControl.callAppender0(AppenderControl.java:129)
	at org.apache.logging.log4j.core.config.AppenderControl.callAppenderPreventRecursion(AppenderControl.java:120)
	at org.apache.logging.log4j.core.config.AppenderControl.callAppender(AppenderControl.java:84)
	at org.apache.logging.log4j.core.config.LoggerConfig.callAppenders(LoggerConfig.java:464)
	at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:448)
	at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:431)
	at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:406)
	at org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletionReliabilityStrategy.java:63)
	at org.apache.logging.log4j.core.Logger.logMessage(Logger.java:146)
	at org.apache.logging.log4j.spi.AbstractLogger.tryLogMessage(AbstractLogger.java:2170)
	at org.apache.logging.log4j.spi.AbstractLogger.logMessageTrackRecursion(AbstractLogger.java:2125)
	at org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2108)
	at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:1991)
	at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1835)
	at org.apache.logging.log4j.spi.AbstractLogger.warn(AbstractLogger.java:2684)
	at org.elasticsearch.gateway.AsyncShardFetch.processAsyncFetch(AsyncShardFetch.java:219)
	- locked <0x00007f3442bbe760> (a org.elasticsearch.gateway.GatewayAllocator$InternalAsyncFetch)
	at org.elasticsearch.gateway.AsyncShardFetch$1.onResponse(AsyncShardFetch.java:286)
	at org.elasticsearch.gateway.AsyncShardFetch$1.onResponse(AsyncShardFetch.java:283)
	at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:85)
	at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:81)
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.finishHim(TransportNodesAction.java:251)
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onOperation(TransportNodesAction.java:228)
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$100(TransportNodesAction.java:151)
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleResponse(TransportNodesAction.java:205)
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleResponse(TransportNodesAction.java:195)
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1104)
	at org.elasticsearch.transport.TcpTransport$1.doRun(TcpTransport.java:985)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:203)
	at org.elasticsearch.transport.TcpTransport.handleResponse(TcpTransport.java:977)
	at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:952)
	at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:763)
	at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:53)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323)
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:426)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:278)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:656)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:556)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:510)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
	at java.lang.Thread.run(Thread.java:748)

Bukhtawar added the >bug and needs:triage (Requires assignment of a team area label) labels on Jun 2, 2020
Bukhtawar (Contributor, Author) commented Jun 2, 2020

AsyncShardFetch#processAsyncFetch is still handled on the network thread in version 7.7, which means that if one node is flaky, the network threads on the masters will stay blocked longer because of the failure-logging overhead, and that might impact pings to other nodes and new node connections.
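As an illustration of the concern, here is a minimal sketch of the alternative being suggested: hand the fetch response off to a dedicated executor so any expensive failure logging runs off the transport_worker thread. The names (fetchExecutor, processAsyncFetch, AsyncFetchOffloadSketch) are placeholders; this is not the actual Elasticsearch code.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    class AsyncFetchOffloadSketch {
        // Single-threaded executor standing in for a non-network thread pool.
        private final ExecutorService fetchExecutor = Executors.newSingleThreadExecutor();

        // Called from the transport_worker thread when a node response arrives.
        void onResponse(Object response) {
            // Dispatch instead of processing inline, so that slow failure logging
            // (ThrowableProxy class loading, jar I/O) cannot block network I/O.
            fetchExecutor.execute(() -> processAsyncFetch(response));
        }

        private void processAsyncFetch(Object response) {
            // ... aggregate per-node responses and log any failures here ...
        }
    }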

DaveCTurner (Contributor) commented

There's nothing actionable here so I am closing this. Please fill out the bug report template in full when reporting apparent bugs. We need significantly more information than you have provided here before we can even start to analyse a bug report. A stack trace from a single thread is rarely of any use and this one doesn't indicate anything particularly surprising happening.

Bukhtawar (Contributor, Author) commented Jun 2, 2020

> There's nothing actionable here so I am closing this

I thought the bug is that it's not right to have logging overhead on network threads; please correct me if I am wrong @DaveCTurner. The stack trace shows this, and I have the full thread dump: https://gist.github.com/Bukhtawar/1b9e1ad07958d684a8d9ebce3fcb4889

I'll share the logs if I'm unable to repro.

> We need significantly more information than you have provided here before we can even start to analyse a bug report.

I am working on that; please bear with me.

Bukhtawar (Contributor, Author) commented Jun 2, 2020

In this particular case, since ClassLoader#loadClass requires acquiring a lock, only one transport_worker thread could hold it at a time, causing the other network threads that wanted to log the failure to become BLOCKED on the monitor (based on the thread dump).
The lock 0x00007f2f5e3c4d30 is the one the other transport_workers are waiting on, and only one thread gets to acquire it at a time.

Do you think this is the expected and desired behaviour, @DaveCTurner?
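To make the contention concrete, here is a minimal, self-contained sketch (the class name javax.swing.JTable is just an arbitrary not-yet-loaded example, nothing Elasticsearch-specific): several threads race to resolve the same class, and all but one block on the class-loading lock, which is the same BLOCKED pattern the transport_workers show while log4j's ThrowableProxy resolves classes for the extended stack trace.

    public class ClassLoadContentionSketch {
        public static void main(String[] args) {
            Runnable load = () -> {
                try {
                    // Class.forName delegates to ClassLoader.loadClass, which
                    // synchronizes on a per-class-name lock; concurrent callers
                    // for the same not-yet-loaded class serialize here.
                    Class.forName("javax.swing.JTable");
                } catch (ClassNotFoundException e) {
                    e.printStackTrace();
                }
            };
            for (int i = 0; i < 8; i++) {
                new Thread(load, "worker-" + i).start();
            }
        }
    }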

Bukhtawar (Contributor, Author) commented

If you look at the transport_worker thread that currently holds the lock, it is actually doing further I/O; see JarLoader#getResource:

"elasticsearch[LARLAca][transport_worker][T#7]" #55 daemon prio=5 os_prio=0 tid=0x00007f2de800b800 nid=0x2fee runnable [0x00007f2e048ee000]
   java.lang.Thread.State: RUNNABLE
	at java.util.zip.ZipFile.getEntry(Native Method)
	at java.util.zip.ZipFile.getEntry(ZipFile.java:316)
	- locked <0x00007f2f5c2e1310> (a java.util.jar.JarFile)
	at java.util.jar.JarFile.getEntry(JarFile.java:240)
	at java.util.jar.JarFile.getJarEntry(JarFile.java:223)
	at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1054)
	at sun.misc.URLClassPath.getResource(URLClassPath.java:249)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
	- locked <0x00007f2f5e3c4d30> (a java.lang.Object)
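For reference, a small illustrative sketch (nothing Elasticsearch-specific; the jar path and resource name are assumptions for illustration) of why this lookup is I/O-bound: resolving a class or resource that lives in a jar goes through URLClassPath$JarLoader and ends in ZipFile.getEntry, i.e. zip-file I/O performed on the calling thread while the class-loading lock is held.

    import java.net.URL;
    import java.net.URLClassLoader;

    public class JarResourceLookupSketch {
        public static void main(String[] args) throws Exception {
            // The jar path is an assumption; point it at any jar on disk.
            URL jar = new URL("file:/tmp/example.jar");
            try (URLClassLoader loader = new URLClassLoader(new URL[] { jar })) {
                // findResource walks URLClassPath$JarLoader -> JarFile.getEntry
                // -> ZipFile.getEntry (native zip I/O) on this thread.
                System.out.println(loader.findResource("some/Resource.class"));
            }
        }
    }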

Bukhtawar changed the title from "Failure logging for async fetch happens on Network threads" to "Failure logging for async fetch causes Network threads to be BLOCKED" on Jun 2, 2020