-
Notifications
You must be signed in to change notification settings - Fork 780
No events received from ESH #5031
Comments
That's interesting, I have observed the same behavior twice now on my system, running ESH on a "naked" concierge + felix event admin stack. Events are flowing nicely if you are looking at the event admin (e.g. via the webconsole), but just not into ESH's event handler. And indeed, I also had no "blacklisting" message in the logs. One thing I noticed is that in the linked issues the "blacklisting" log entries were coming from the karaf wrapper of the event admin, which is available on karaf only. But I didn't understand so far why the event admin logs don't make it into the logs or if this mechanism is the root-cause at all... As this is to incredibly hard to reproduce, I currently have no idea yet how to further analyze the issue, apart from waiting... Just out of curiosity I upgraded my system to felix event admin 1.8.10 and it is running stable since 9 days now. But that doesn't mean anything and is way too early to declare this a success, especially as their code history doesn't indicate anything which would have fixed such a bug. Are you on a packaging sample or custom stack as well? |
PS: One more thing - I have a
configuration set in my system (the same "workaround" that was applied in the openHAB distro). And this happened regardless in my system. It's really weird. |
No custom stack here. Running with karaf. |
... and same felix configuration here. |
That makes it even more interesting now! I captured a stacktrace from when my system was broken the last time. I attach it here for the sake of completeness, but to me it looked completely unsuspicious in that regard: treaddump.txt |
Can you test a version that has this changes applied? |
@maggu2810 Afaik, this is already included in @SJKA's setup - so it does not seem to change anything wrt this issue. |
@SJKA Is this the case? Are you running a version with the above changes included? |
At that point I was running a build with the timestamp of 201801251005. PR #4700 was merged already on 20th of December 2017, therefore: yes, this change was included. |
The threaded event handler waits to read events from the queue (that is filled by ourselves). So, first try, set the log level of for this class / package to debug and you should see "Hey, you have really very few events." After that we can have a look at the one that puts events to that queue. |
@maggu2810 We are currently using the 0.9.0 release version in our project, so the changes from #4700 are not included for us yet. Testing with a newer version does not seems to make real sense now as @SJKA reproduced the issue also with the #4700 changes included. We currently try to find a way to easily reproduce the problem (so far this mostly happened in field test installations not under our direct control) and then have a closer look at the event flow on the bus. One further observation here is that sometimes (after hours) there seems to be some kind of "magic healing" and events are flowing again. Then also the events from the meantime seem to be processed. |
@aluedt Hm, if it make no sense really depends (IMHO). #4700 tries to fix the blacklisting because the event handling needs to much time. After #4700 the event handling method called by the framework puts the event to a queue and return immediately after that. So, it should not be blocked anymore because it takes to much time. As @SJKA still realized the same symptoms I see different possible causes. There could still be another problem in the event handling. So, the thread that consumed the events of the queue is blocked by another thing. Or we get blocked again -- then we need to check why we has been blocked. Or ... IMHO if you can reproduce it on a system (after hours is much better then after weeks), then it would be better to include #4700 (as it should at least improve the situation already). |
Let me summarize what we have learned so far
So in summary I reckon this looks really weird, especially in the realm of Java. Therefore I'm not really sure that we are searching in the right direction... (we should anyway, just to be on the safe side). Let me in addition to @maggu2810 ask another question: Which version of the felix event admin are you using? Or, as you are running on karaf - which karaf version are you on? The reason why I'm asking is because I discovered apache/felix@ac4a795#diff-7c5c723eccb30ca8376dc6571cf9bfa0 Previously, the event admin tried to account only the time against an event handler which it really consumed itself (as opposed to using only the elapsed milliseconds, which is what it is doing now - and which btw could explain the blacklisting observations in openHAB after the karaf update). It did this via a JMX bean, which for sure will do some native magic. It is not very likely and I actually don't have a clue about this low-level stuff, but what if there is a bug in there (i.e. the JVM or in combination with certain OS library versions) which somehow occasionally screws up the thread handling so it doesn't wake up anymore? As it seems you are pretty "lucky" in reproducing the issue, I actually think it would we worth updating the event admin to see if it helps. |
Small correction/addition: Karaf 4.1.3 was already updated to 1.4.10 as well, so this version is e.g. used in openHAB 2.2, while openHAB 2.1 was still using Karaf 4.0.8 and therefore EventAdmin 1.4.8. |
We're running on karaf, still in version 4.0.8 here. We're on this version for something like 11 months now and the event issue is pretty new, so this does not seem to be related. |
We've just been very lucky and had access with a debugger to a running ESH 0.9.0 in the state where no events are received through ESH. We found that events are actually flowing on the bus and received by OSGiEventManager.handleEvent. We further found that within OSGiEventManager.dispatchESHEvent the new EventSubscriber proxies introduced with the 0.9.0 release version are built just as expected and receive is called for these proxies then. This was the case for some central EventSubscriber from our application as well as ESH's EventLogger (amongst others). These two event subscribers immediately log when the respective receive methods are invoked but this just does not happen in that state. We also set a breakpoint in the receive method of our event subscriber and receive was not invoked in that state. So this seems to be exactly the spot where things go wrong. The issue seems very likely to be related to the changes introduced with #4655. This also fits the observation that for us the problem first occurred with the 0.9.0 release version but not with 0.9.0.b6. |
Can you create a thread-dump in that state? It would be interesting to see if and how the SafeCaller threads keep themselves busy with |
Also, some TRACE level logging of (Or maybe you can debug into there and see if they are actually scheduled or incorrectly held back for whatever reason - |
Here's a thread dump we just created on that machine (still in the "no events" state). Regarding the TRACE level logs, this may take some time till we manage to reproduce the problem again. |
The stacktrace doesn't contain any "safeCall" thread - so they are not all blocked at least. We can cross that off the list. |
That's very interesting. @aluedt great that you can reproduce the problem. |
Last Friday we prepared a custom ESH build from the 0.9.0 sources with the changes from #4655 rolled back manually and distributed a version of Coqon based on this custom ESH build to our fied testers. We have multiple field testers where the problem occurred frequently, at least daily. Today we received quite positive feedback for this "fix". None of our testers could reproduce the problem over the weekend after the update. So the problem seems solved with the changes rolled back. |
Thanks for the update, @aluedt, good to hear that you were able to mitigate that critical issue for the moment. So this leaves us with the insight that there must be some nasty bug in the SafeCaller then. |
I just ran into this bug again and could gather some additional logs: It's full of
This clearly indicated that there is a bug in the asynchronous job handling of the SafeCaller. I will of course have a look at it! @aluedt this proves that removing #4655 from your product release has been the right choice! PS: Really funny: My system ran stable with felix event admin 1.4.10 for almost 3 weeks now. After I downgraded to 1.4.8 again, it took less than 48h to run into this issue. Maybe it influences the timing a little, just enough to influence the likelihood of a race condition? Really funny... |
Congrats on catching it in flagranti with trace level 👍 |
The log above clearly states that the "identifiers" (which are the I stared at the code for a couple of days now - but I'm still not seeing how a scheduled job can be executed and not remove itself afterward from this map. I grabbed a heapdump from my system. And of course, it is visible that the @kaikreuzer do you have any ideas on that? |
@SJKA I will |
Looks like indeed the I will try to prove it and also create a fix for it. |
Ah, sorry, I didn't refresh my browser tab before posting this. Did you start already? |
Not yet, so feel free to create your fix. 😉 |
...as before it could end up with queued tasks and no monitoring thread if tasks were added right at the moment when the monitoring thread decided to terminate after there were no new tasks queued withing two seconds. fixes eclipse-archived#5031 Signed-off-by: Simon Kaufmann <[email protected]>
* fix synchronization in QueueingThreadPoolExecutor ...as before it could end up with queued tasks and no monitoring thread if tasks were added right at the moment when the monitoring thread decided to terminate after there were no new tasks queued withing two seconds. fixes #5031 Signed-off-by: Simon Kaufmann <[email protected]>
Thank you all so much! Glad to hear this is solved. |
All kudos go to @SJKA 👍 |
Big thanks! |
We are currently experiencing problems with receiving all kinds of events from ESH. It looks like for certain times we do not receive any events at all from ESH. If this happens, only events from ESH are affected, while at the same time we still receive events when the OSGI EventAdmin is used directly (we have some custom events in our own bundles, where the EventAdmin is used directly).
We have seen the issues from last December (#4591 and other issues referenced there) and think this could be related, but we do not see any log messages about the EventAdmin blacklisting certain services here. So we cannot say if there is a connection, but it seems likely.
We first observed this behaviour with the 0.9.0 release version.
The text was updated successfully, but these errors were encountered: