[BUG] Serialization issue with RequestCacheStats, negative longs unsupported, use writeLong or writeZLong [-7386] #13343
Comments
The stack trace points to the total memory size stat in RequestCacheStats becoming negative. The error occurs when this negative value is serialized over transport: the node handling the REST request successfully returns its own stats to the caller (negative value included), while the other nodes fail to serialize and return theirs. This is what leads to the symptom of only seeing stats from the node the REST request is directed toward. We need to further investigate what changed in the request cache stats accounting.
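For context, a minimal sketch (not from the issue) of why the transport write fails: StreamOutput.writeVLong() rejects negative values, while writeZLong() zig-zag encodes them, which is exactly what the exception message suggests. The class and method names are taken from the 2.13 stack trace; the BytesStreamOutput package is an assumption and may differ between 2.x versions.

```java
// Sketch only: demonstrates the serialization failure mode seen in the stack trace.
import org.opensearch.common.io.stream.BytesStreamOutput; // package location assumed for 2.x

public class NegativeVLongSketch {
    public static void main(String[] args) throws Exception {
        try (BytesStreamOutput out = new BytesStreamOutput()) {
            long memorySize = -7386L;    // a stats counter that has drifted negative
            out.writeZLong(memorySize);  // zig-zag encoding: negative values are accepted
            out.writeVLong(memorySize);  // throws IllegalStateException:
                                         // "Negative longs unsupported, use writeLong or writeZLong for negative numbers [-7386]"
        }
    }
}
```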
@dbwiddis Let me try taking a look at this. We did change things around the RequestCache with the tiered caching feature in 2.12/2.13.
@weily2 I need more info/help on this while I try to reproduce it on my end. It seems you were trying to upgrade your domain from 2.11 to 2.13 and saw these errors. Did you see these errors during the domain upgrade or after the upgrade was complete? Also, were you performing a rolling upgrade?
I am also seeing this error. The errors continue until the affected node is restarted, after which they resume anywhere from 4 hours to 3 days later. The problem compounds as more nodes in the cluster begin presenting this error and more of the data becomes unavailable. Since the call stack includes a reference to org.opensearch.indexmanagement.rollup.interceptor.RollupInterceptor, I deleted all rollups; the errors resumed about 4.5 hours later. I then disabled rollups in persistent cluster settings, after which the errors ceased without node restarts. I will continue monitoring for the next few days to see if they occur again with rollups disabled.
Disabling rollups didn't stop the 'Negative longs unsupported' errors.
@lyradc Thanks for confirming.
@lyradc It would also help if you could share the steps to reproduce, so that I can try them on my end.
Is there an API to clear the RequestCache on a node?
@artemus717 There is an API for that, but it will not work in this scenario where the stat is going negative.
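For reference, a minimal sketch of calling the indices clear-cache endpoint (POST /&lt;index&gt;/_cache/clear?request=true) from plain Java; the host, port, and index name are placeholders, and authentication/TLS would be needed with the security plugin enabled. As noted above, clearing the cache does not fix the broken accounting, so it does not help once the counter has gone negative.

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class ClearRequestCache {
    public static void main(String[] args) throws Exception {
        // Placeholders: adjust host, port, and index name for your cluster.
        URL url = new URL("http://localhost:9200/my-index/_cache/clear?request=true");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");                          // clear-cache is a POST with no body
        System.out.println("HTTP " + conn.getResponseCode());   // expect 200 on success
        conn.disconnect();
    }
}
```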
I have raised the PR for this fix. It turns out the issue occurs when an IndexShard is deleted and then reallocated on the same node: when stale entries from the older shard are cleaned up, they are deducted from the stats of the new shard, which has the same shardId. I was able to reproduce it via an IT by recreating the above scenario.
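A simplified, hypothetical illustration of the accounting problem described above (not the actual OpenSearch code): if per-shard stats are looked up only by shardId, and the old shard's stale cache entries are evicted after a new shard with the same shardId has registered a fresh (zeroed) stats object, the eviction decrements the new object and drives its memory size negative.

```java
import java.util.HashMap;
import java.util.Map;

public class ShardCacheStatsSketch {
    // Hypothetical per-shard accounting, keyed only by shardId (the crux of the bug).
    static class Stats { long memorySize; }
    static final Map<Integer, Stats> statsByShardId = new HashMap<>();

    static void onCache(int shardId, long bytes)   { statsByShardId.get(shardId).memorySize += bytes; }
    static void onRemoval(int shardId, long bytes) { statsByShardId.get(shardId).memorySize -= bytes; }

    public static void main(String[] args) {
        statsByShardId.put(0, new Stats());
        onCache(0, 7386);                  // old shard [0] caches 7386 bytes

        // Shard [0] is deleted and reallocated on the same node: a fresh, zeroed
        // stats object replaces the old one under the same shardId.
        statsByShardId.put(0, new Stats());

        onRemoval(0, 7386);                // stale entries of the OLD shard are evicted,
                                           // but deducted from the NEW shard's stats
        System.out.println(statsByShardId.get(0).memorySize); // -7386 -> later fails writeVLong
    }
}
```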
Note:
@sgup432 Sorry for the late reply. I did perform a rolling upgrade. During the upgrade the logs were normal, but these errors started occurring a few hours after the upgrade was complete.
@weily2 @lyradc @artemus717 The fix for this will be available in the 2.14 release.
Describe the bug
Hello, after a cluster upgrade from 2.11.1 to 2.13.0, I see these messages in opensearch.log:
[2024-04-23T03:18:03,647][WARN ][o.o.t.OutboundHandler    ] [node-hot-06] send message failed [channel: Netty4TcpChannel{localAddress=/A.B.C.40:9300, remoteAddress=/A.B.C.69:60146}]
java.lang.IllegalStateException: Negative longs unsupported, use writeLong or writeZLong for negative numbers [-7386]
    at org.opensearch.core.common.io.stream.StreamOutput.writeVLong(StreamOutput.java:310) ~[opensearch-core-2.13.0.jar:2.13.0]
    at org.opensearch.index.cache.request.RequestCacheStats.writeTo(RequestCacheStats.java:103) ~[opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.core.common.io.stream.StreamOutput.writeOptionalWriteable(StreamOutput.java:974) ~[opensearch-core-2.13.0.jar:2.13.0]
    at org.opensearch.action.admin.indices.stats.CommonStats.writeTo(CommonStats.java:285) ~[opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.action.admin.indices.stats.ShardStats.writeTo(ShardStats.java:146) ~[opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.core.common.io.stream.StreamOutput.writeOptionalWriteable(StreamOutput.java:974) ~[opensearch-core-2.13.0.jar:2.13.0]
    at org.opensearch.action.support.broadcast.node.TransportBroadcastByNodeAction$NodeResponse.writeTo(TransportBroadcastByNodeAction.java:651) ~[opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.transport.OutboundMessage.writeMessage(OutboundMessage.java:104) ~[opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.transport.OutboundMessage.serialize(OutboundMessage.java:81) ~[opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.transport.OutboundHandler$MessageSerializer.get(OutboundHandler.java:235) ~[opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.transport.OutboundHandler$MessageSerializer.get(OutboundHandler.java:221) ~[opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.transport.OutboundHandler$SendContext.get(OutboundHandler.java:275) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.transport.OutboundHandler.internalSend(OutboundHandler.java:197) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.transport.OutboundHandler.sendMessage(OutboundHandler.java:192) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.transport.OutboundHandler.sendResponse(OutboundHandler.java:159) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:96) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:72) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:482) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:456) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.security.ssl.transport.SecuritySSLRequestHandler.messageReceivedDecorate(SecuritySSLRequestHandler.java:206) [opensearch-security-2.13.0.0.jar:2.13.0.0]
    at org.opensearch.security.transport.SecurityRequestHandler.messageReceivedDecorate(SecurityRequestHandler.java:317) [opensearch-security-2.13.0.0.jar:2.13.0.0]
    at org.opensearch.security.ssl.transport.SecuritySSLRequestHandler.messageReceived(SecuritySSLRequestHandler.java:154) [opensearch-security-2.13.0.0.jar:2.13.0.0]
    at org.opensearch.security.OpenSearchSecurityPlugin$6$1.messageReceived(OpenSearchSecurityPlugin.java:828) [opensearch-security-2.13.0.0.jar:2.13.0.0]
    at org.opensearch.indexmanagement.rollup.interceptor.RollupInterceptor$interceptHandler$1.messageReceived(RollupInterceptor.kt:114) [opensearch-index-management-2.13.0.0.jar:2.13.0.0]
    at org.opensearch.performanceanalyzer.transport.PerformanceAnalyzerTransportRequestHandler.messageReceived(PerformanceAnalyzerTransportRequestHandler.java:43) [opensearch-performance-analyzer-2.13.0.0.jar:2.13.0.0]
    at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:480) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:913) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.13.0.jar:2.13.0]
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
    at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
Does anyone have an idea what is wrong?
Related component
Cluster Manager
To Reproduce
The docs and store fields for A.B.C.34 and A.B.C.40 are None, and I can find the errors above in the logs on 34 and 40.
The docs and store fields for A.B.C.35 and A.B.C.40 are None, and I can find the errors above in the logs on 35 and 40.
Expected behavior
Both should not be None.
Additional Details
No response