Skip to content
This repository has been archived by the owner on May 7, 2020. It is now read-only.

No events received from ESH #5031

Closed
aluedt opened this issue Feb 5, 2018 · 33 comments · Fixed by #5113
Closed

No events received from ESH #5031

aluedt opened this issue Feb 5, 2018 · 33 comments · Fixed by #5113

Comments

@aluedt
Copy link

aluedt commented Feb 5, 2018

We are currently experiencing problems with receiving all kinds of events from ESH. It looks like for certain times we do not receive any events at all from ESH. If this happens, only events from ESH are affected, while at the same time we still receive events when the OSGI EventAdmin is used directly (we have some custom events in our own bundles, where the EventAdmin is used directly).

We have seen the issues from last December (#4591 and other issues referenced there) and think this could be related, but we do not see any log messages about the EventAdmin blacklisting certain services here. So we cannot say if there is a connection, but it seems likely.

We first observed this behaviour with the 0.9.0 release version.

@sjsf
Copy link
Contributor

sjsf commented Feb 5, 2018

That's interesting, I have observed the same behavior twice now on my system, running ESH on a "naked" concierge + felix event admin stack.

Events are flowing nicely if you are looking at the event admin (e.g. via the webconsole), but just not into ESH's event handler. And indeed, I also had no "blacklisting" message in the logs.

One thing I noticed is that in the linked issues the "blacklisting" log entries were coming from the karaf wrapper of the event admin, which is available on karaf only. But I didn't understand so far why the event admin logs don't make it into the logs or if this mechanism is the root-cause at all...

As this is to incredibly hard to reproduce, I currently have no idea yet how to further analyze the issue, apart from waiting... Just out of curiosity I upgraded my system to felix event admin 1.8.10 and it is running stable since 9 days now. But that doesn't mean anything and is way too early to declare this a success, especially as their code history doesn't indicate anything which would have fixed such a bug.

Are you on a packaging sample or custom stack as well?

@sjsf
Copy link
Contributor

sjsf commented Feb 5, 2018

PS: One more thing - I have a

org.apache.felix.eventadmin.Timeout=0

configuration set in my system (the same "workaround" that was applied in the openHAB distro). And this happened regardless in my system. It's really weird.

@aluedt
Copy link
Author

aluedt commented Feb 5, 2018

No custom stack here. Running with karaf.

@aluedt
Copy link
Author

aluedt commented Feb 5, 2018

... and same felix configuration here.

@sjsf
Copy link
Contributor

sjsf commented Feb 5, 2018

That makes it even more interesting now!

I captured a stacktrace from when my system was broken the last time. I attach it here for the sake of completeness, but to me it looked completely unsuspicious in that regard: treaddump.txt

@maggu2810
Copy link
Contributor

Can you test a version that has this changes applied?
https://github.com/eclipse/smarthome/pull/4700/files

@kaikreuzer
Copy link
Contributor

@maggu2810 Afaik, this is already included in @SJKA's setup - so it does not seem to change anything wrt this issue.

@aluedt
Copy link
Author

aluedt commented Feb 5, 2018

@SJKA Is this the case? Are you running a version with the above changes included?

@sjsf
Copy link
Contributor

sjsf commented Feb 5, 2018

At that point I was running a build with the timestamp of 201801251005. PR #4700 was merged already on 20th of December 2017, therefore: yes, this change was included.

@maggu2810
Copy link
Contributor

"ESH-OSGiEventManager" #37 prio=5 os_prio=0 tid=0x00007fe5b5293000 nid=0x33 waiting on condition [0x00007fe563dfe000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000000c263d9e8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
	at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
	at org.eclipse.smarthome.core.internal.events.ThreadedEventHandler.lambda$0(ThreadedEventHandler.java:65)
	at org.eclipse.smarthome.core.internal.events.ThreadedEventHandler$$Lambda$11/1827303532.run(Unknown Source)
	at java.lang.Thread.run(Thread.java:748)

The threaded event handler waits to read events from the queue (that is filled by ourselves).
The poll should be return as soon as there is an element added to the linked blocking queue.

So, first try, set the log level of for this class / package to debug and you should see "Hey, you have really very few events."
https://github.com/eclipse/smarthome/blob/7717a6de4b4bd68eb0b6d02f7312ed2eb6ee57f0/bundles/core/org.eclipse.smarthome.core/src/main/java/org/eclipse/smarthome/core/internal/events/ThreadedEventHandler.java#L65

After that we can have a look at the one that puts events to that queue.
This method should be called by the OSGi event handler consumer itself:
https://github.com/eclipse/smarthome/blob/7717a6de4b4bd68eb0b6d02f7312ed2eb6ee57f0/bundles/core/org.eclipse.smarthome.core/src/main/java/org/eclipse/smarthome/core/internal/events/OSGiEventManager.java#L157
Can you have a look at "all" the event messages that are currently on the bus? Are there events for the "smarthome" topic?
https://github.com/eclipse/smarthome/blob/7717a6de4b4bd68eb0b6d02f7312ed2eb6ee57f0/bundles/core/org.eclipse.smarthome.core/src/main/java/org/eclipse/smarthome/core/internal/events/OSGiEventManager.java#L55

@aluedt
Copy link
Author

aluedt commented Feb 6, 2018

@maggu2810 We are currently using the 0.9.0 release version in our project, so the changes from #4700 are not included for us yet. Testing with a newer version does not seems to make real sense now as @SJKA reproduced the issue also with the #4700 changes included. We currently try to find a way to easily reproduce the problem (so far this mostly happened in field test installations not under our direct control) and then have a closer look at the event flow on the bus.

One further observation here is that sometimes (after hours) there seems to be some kind of "magic healing" and events are flowing again. Then also the events from the meantime seem to be processed.

@maggu2810
Copy link
Contributor

@aluedt Hm, if it make no sense really depends (IMHO). #4700 tries to fix the blacklisting because the event handling needs to much time. After #4700 the event handling method called by the framework puts the event to a queue and return immediately after that. So, it should not be blocked anymore because it takes to much time. As @SJKA still realized the same symptoms I see different possible causes. There could still be another problem in the event handling. So, the thread that consumed the events of the queue is blocked by another thing. Or we get blocked again -- then we need to check why we has been blocked. Or ...

IMHO if you can reproduce it on a system (after hours is much better then after weeks), then it would be better to include #4700 (as it should at least improve the situation already).

@sjsf
Copy link
Contributor

sjsf commented Feb 6, 2018

Let me summarize what we have learned so far

  • The events were still "on the bus" (at least I could see smarthome events in the felix webconsole)
  • There was no blacklisting log entry
  • The timeout was set to 0, so blacklisting should not have happened anyway
  • The thread dump indicates that the OSGiEventManager thread is still alive and dutifully waiting to poll stuf from the queue
  • It sometimes magically heals itself (I checked my influxdb/grafana and it turned out this actually happened to me 5 times in January, so it must have healed itself an my site too)
  • The adding of events to the queue in the OSGiEventManager is so trivial that I have no idea how it should fail

So in summary I reckon this looks really weird, especially in the realm of Java. Therefore I'm not really sure that we are searching in the right direction... (we should anyway, just to be on the safe side).

Let me in addition to @maggu2810 ask another question: Which version of the felix event admin are you using? Or, as you are running on karaf - which karaf version are you on?

The reason why I'm asking is because I discovered apache/felix@ac4a795#diff-7c5c723eccb30ca8376dc6571cf9bfa0
This change was introduced in felix event admin 1.4.10 or karaf 4.2.0. It is pretty much the only change, and it might actually be more interesting than I first thought.

Previously, the event admin tried to account only the time against an event handler which it really consumed itself (as opposed to using only the elapsed milliseconds, which is what it is doing now - and which btw could explain the blacklisting observations in openHAB after the karaf update). It did this via a JMX bean, which for sure will do some native magic. It is not very likely and I actually don't have a clue about this low-level stuff, but what if there is a bug in there (i.e. the JVM or in combination with certain OS library versions) which somehow occasionally screws up the thread handling so it doesn't wake up anymore? As it seems you are pretty "lucky" in reproducing the issue, I actually think it would we worth updating the event admin to see if it helps.

@kaikreuzer
Copy link
Contributor

This change was introduced in felix event admin 1.4.10 or karaf 4.2.0.

Small correction/addition: Karaf 4.1.3 was already updated to 1.4.10 as well, so this version is e.g. used in openHAB 2.2, while openHAB 2.1 was still using Karaf 4.0.8 and therefore EventAdmin 1.4.8.

@aluedt
Copy link
Author

aluedt commented Feb 7, 2018

We're running on karaf, still in version 4.0.8 here. We're on this version for something like 11 months now and the event issue is pretty new, so this does not seem to be related.

@aluedt
Copy link
Author

aluedt commented Feb 8, 2018

We've just been very lucky and had access with a debugger to a running ESH 0.9.0 in the state where no events are received through ESH. We found that events are actually flowing on the bus and received by OSGiEventManager.handleEvent. We further found that within OSGiEventManager.dispatchESHEvent the new EventSubscriber proxies introduced with the 0.9.0 release version are built just as expected and receive is called for these proxies then. This was the case for some central EventSubscriber from our application as well as ESH's EventLogger (amongst others).

These two event subscribers immediately log when the respective receive methods are invoked but this just does not happen in that state. We also set a breakpoint in the receive method of our event subscriber and receive was not invoked in that state.

So this seems to be exactly the spot where things go wrong. The issue seems very likely to be related to the changes introduced with #4655. This also fits the observation that for us the problem first occurred with the 0.9.0 release version but not with 0.9.0.b6.

@sjsf
Copy link
Contributor

sjsf commented Feb 8, 2018

Can you create a thread-dump in that state?
(kill -3 <PID> prints it to sysout, or alternatively with any jvm tool at hand)

It would be interesting to see if and how the SafeCaller threads keep themselves busy with

@sjsf
Copy link
Contributor

sjsf commented Feb 8, 2018

Also, some TRACE level logging of org.eclipse.smarthome.core.internal.common.SafeCallManagerImpl would be interesting to see. Maybe that helps to find out why they are not executed.

(Or maybe you can debug into there and see if they are actually scheduled or incorrectly held back for whatever reason - InvocationHandlerAsync.invoke(...) would be a good place to set a breakpoint)

@aluedt
Copy link
Author

aluedt commented Feb 8, 2018

Here's a thread dump we just created on that machine (still in the "no events" state).
thread-dump.txt

Regarding the TRACE level logs, this may take some time till we manage to reproduce the problem again.

@sjsf
Copy link
Contributor

sjsf commented Feb 8, 2018

The stacktrace doesn't contain any "safeCall" thread - so they are not all blocked at least. We can cross that off the list.
Unfortunately, that still doesn't give us the decisive clue...

@maggu2810
Copy link
Contributor

That's very interesting. @aluedt great that you can reproduce the problem.
Good luck.

@aluedt
Copy link
Author

aluedt commented Feb 12, 2018

Last Friday we prepared a custom ESH build from the 0.9.0 sources with the changes from #4655 rolled back manually and distributed a version of Coqon based on this custom ESH build to our fied testers. We have multiple field testers where the problem occurred frequently, at least daily. Today we received quite positive feedback for this "fix". None of our testers could reproduce the problem over the weekend after the update. So the problem seems solved with the changes rolled back.

@kaikreuzer kaikreuzer added the bug label Feb 12, 2018
@kaikreuzer
Copy link
Contributor

Thanks for the update, @aluedt, good to hear that you were able to mitigate that critical issue for the moment.

So this leaves us with the insight that there must be some nasty bug in the SafeCaller then.
FTR, another issue was discovered and fixed last week in the SafeCaller (#5063) - I cannot say whether this might be a related issue or not, though.

@sjsf
Copy link
Contributor

sjsf commented Feb 18, 2018

I just ran into this bug again and could gather some additional logs: It's full of

2018-02-18 10:13:20.210 [TRACE] [.internal.common.SafeCallManagerImpl] - Identifier 'org.eclipse.smarthome.io.rest.sse.internal.listeners.SseEventSubscriber@689fb6da' is already scheduled for asynchronous execution
2018-02-18 10:13:20.211 [TRACE] [.internal.common.SafeCallManagerImpl] - Triggering submissions for 'org.eclipse.smarthome.core.thing.internal.CommunicationManager@19d577c2'
2018-02-18 10:13:20.211 [TRACE] [.internal.common.SafeCallManagerImpl] - Identifier 'org.eclipse.smarthome.core.thing.internal.CommunicationManager@19d577c2' is already scheduled for asynchronous execution
2018-02-18 10:13:21.152 [TRACE] [.internal.common.SafeCallManagerImpl] - Triggering submissions for 'org.eclipse.smarthome.io.monitor.internal.EventLogger@2d9d35ad'
2018-02-18 10:13:21.154 [TRACE] [.internal.common.SafeCallManagerImpl] - Identifier 'org.eclipse.smarthome.io.monitor.internal.EventLogger@2d9d35ad' is already scheduled for asynchronous execution
2018-02-18 10:13:21.155 [TRACE] [.internal.common.SafeCallManagerImpl] - Triggering submissions for 'org.eclipse.smarthome.model.rule.runtime.internal.engine.RuleEngineImpl@7c02fde1'
2018-02-18 10:13:21.155 [TRACE] [.internal.common.SafeCallManagerImpl] - Identifier 'org.eclipse.smarthome.model.rule.runtime.internal.engine.RuleEngineImpl@7c02fde1' is already scheduled for asynchronous execution
2018-02-18 10:13:21.155 [TRACE] [.internal.common.SafeCallManagerImpl] - Triggering submissions for 'org.eclipse.smarthome.core.autoupdate.internal.AutoUpdateBinding@29a90d7d'
2018-02-18 10:13:21.155 [TRACE] [.internal.common.SafeCallManagerImpl] - Identifier 'org.eclipse.smarthome.core.autoupdate.internal.AutoUpdateBinding@29a90d7d' is already scheduled for asynchronous execution
2018-02-18 10:13:21.155 [TRACE] [.internal.common.SafeCallManagerImpl] - Triggering submissions for 'org.eclipse.smarthome.core.internal.items.ItemUpdater@69c19ae4'
2018-02-18 10:13:21.155 [TRACE] [.internal.common.SafeCallManagerImpl] - Identifier 'org.eclipse.smarthome.core.internal.items.ItemUpdater@69c19ae4' is already scheduled for asynchronous execution
2018-02-18 10:13:21.155 [TRACE] [.internal.common.SafeCallManagerImpl] - Triggering submissions for 'org.eclipse.smarthome.io.rest.sse.internal.listeners.SseEventSubscriber@689fb6da'
2018-02-18 10:13:21.155 [TRACE] [.internal.common.SafeCallManagerImpl] - Identifier 'org.eclipse.smarthome.io.rest.sse.internal.listeners.SseEventSubscriber@689fb6da' is already scheduled for asynchronous execution
2018-02-18 10:13:21.155 [TRACE] [.internal.common.SafeCallManagerImpl] - Triggering submissions for 'org.eclipse.smarthome.core.thing.internal.CommunicationManager@19d577c2'
2018-02-18 10:13:21.155 [TRACE] [.internal.common.SafeCallManagerImpl] - Identifier 'org.eclipse.smarthome.core.thing.internal.CommunicationManager@19d577c2' is already scheduled for asynchronous execution
2018-02-18 10:13:21.455 [TRACE] [.internal.common.SafeCallManagerImpl] - Triggering submissions for 'org.eclipse.smarthome.io.monitor.internal.EventLogger@2d9d35ad'
2018-02-18 10:13:21.457 [TRACE] [.internal.common.SafeCallManagerImpl] - Identifier 'org.eclipse.smarthome.io.monitor.internal.EventLogger@2d9d35ad' is already scheduled for asynchronous execution
2018-02-18 10:13:21.457 [TRACE] [.internal.common.SafeCallManagerImpl] - Triggering submissions for 'org.eclipse.smarthome.model.rule.runtime.internal.engine.RuleEngineImpl@7c02fde1'
2018-02-18 10:13:21.457 [TRACE] [.internal.common.SafeCallManagerImpl] - Identifier 'org.eclipse.smarthome.model.rule.runtime.internal.engine.RuleEngineImpl@7c02fde1' is already scheduled for asynchronous execution
2018-02-18 10:13:21.458 [TRACE] [.internal.common.SafeCallManagerImpl] - Triggering submissions for 'org.eclipse.smarthome.core.autoupdate.internal.AutoUpdateBinding@29a90d7d'
2018-02-18 10:13:21.458 [TRACE] [.internal.common.SafeCallManagerImpl] - Identifier 'org.eclipse.smarthome.core.autoupdate.internal.AutoUpdateBinding@29a90d7d' is already scheduled for asynchronous execution
2018-02-18 10:13:21.458 [TRACE] [.internal.common.SafeCallManagerImpl] - Triggering submissions for 'org.eclipse.smarthome.core.internal.items.ItemUpdater@69c19ae4'
2018-02-18 10:13:21.459 [TRACE] [.internal.common.SafeCallManagerImpl] - Identifier 'org.eclipse.smarthome.core.internal.items.ItemUpdater@69c19ae4' is already scheduled for asynchronous execution
2018-02-18 10:13:21.459 [TRACE] [.internal.common.SafeCallManagerImpl] - Triggering submissions for 'org.eclipse.smarthome.io.rest.sse.internal.listeners.SseEventSubscriber@689fb6da'
2018-02-18 10:13:21.459 [TRACE] [.internal.common.SafeCallManagerImpl] - Identifier 'org.eclipse.smarthome.io.rest.sse.internal.listeners.SseEventSubscriber@689fb6da' is already scheduled for asynchronous execution
2018-02-18 10:13:21.459 [TRACE] [.internal.common.SafeCallManagerImpl] - Triggering submissions for 'org.eclipse.smarthome.core.thing.internal.CommunicationManager@19d577c2'
2018-02-18 10:13:21.459 [TRACE] [.internal.common.SafeCallManagerImpl] - Identifier 'org.eclipse.smarthome.core.thing.internal.CommunicationManager@19d577c2' is already scheduled for asynchronous execution
2018-02-18 10:13:22.384 [TRACE] [.internal.common.SafeCallManagerImpl] - Triggering submissions for 'org.eclipse.smarthome.io.monitor.internal.EventLogger@2d9d35ad'
2018-02-18 10:13:22.385 [TRACE] [.internal.common.SafeCallManagerImpl] - Identifier 'org.eclipse.smarthome.io.monitor.internal.EventLogger@2d9d35ad' is already scheduled for asynchronous execution
2018-02-18 10:13:22.385 [TRACE] [.internal.common.SafeCallManagerImpl] - Triggering submissions for 'org.eclipse.smarthome.model.rule.runtime.internal.engine.RuleEngineImpl@7c02fde1'
2018-02-18 10:13:22.385 [TRACE] [.internal.common.SafeCallManagerImpl] - Identifier 'org.eclipse.smarthome.model.rule.runtime.internal.engine.RuleEngineImpl@7c02fde1' is already scheduled for asynchronous execution
2018-02-18 10:13:22.386 [TRACE] [.internal.common.SafeCallManagerImpl] - Triggering submissions for 'org.eclipse.smarthome.core.autoupdate.internal.AutoUpdateBinding@29a90d7d'
2018-02-18 10:13:22.386 [TRACE] [.internal.common.SafeCallManagerImpl] - Identifier 'org.eclipse.smarthome.core.autoupdate.internal.AutoUpdateBinding@29a90d7d' is already scheduled for asynchronous execution
2018-02-18 10:13:22.387 [TRACE] [.internal.common.SafeCallManagerImpl] - Triggering submissions for 'org.eclipse.smarthome.core.internal.items.ItemUpdater@69c19ae4'
2018-02-18 10:13:22.388 [TRACE] [.internal.common.SafeCallManagerImpl] - Identifier 'org.eclipse.smarthome.core.internal.items.ItemUpdater@69c19ae4' is already scheduled for asynchronous execution
2018-02-18 10:13:22.389 [TRACE] [.internal.common.SafeCallManagerImpl] - Triggering submissions for 'org.eclipse.smarthome.io.rest.sse.internal.listeners.SseEventSubscriber@689fb6da'
2018-02-18 10:13:22.389 [TRACE] [.internal.common.SafeCallManagerImpl] - Identifier 'org.eclipse.smarthome.io.rest.sse.internal.listeners.SseEventSubscriber@689fb6da' is already scheduled for asynchronous execution
2018-02-18 10:13:22.390 [TRACE] [.internal.common.SafeCallManagerImpl] - Triggering submissions for 'org.eclipse.smarthome.core.thing.internal.CommunicationManager@19d577c2'
2018-02-18 10:13:22.391 [TRACE] [.internal.common.SafeCallManagerImpl] - Identifier 'org.eclipse.smarthome.core.thing.internal.CommunicationManager@19d577c2' is already scheduled for asynchronous execution

This clearly indicated that there is a bug in the asynchronous job handling of the SafeCaller. I will of course have a look at it!

@aluedt this proves that removing #4655 from your product release has been the right choice!

PS: Really funny: My system ran stable with felix event admin 1.4.10 for almost 3 weeks now. After I downgraded to 1.4.8 again, it took less than 48h to run into this issue. Maybe it influences the timing a little, just enough to influence the likelihood of a race condition? Really funny...

@kaikreuzer
Copy link
Contributor

Congrats on catching it in flagranti with trace level 👍

@sjsf
Copy link
Contributor

sjsf commented Feb 18, 2018

The log above clearly states that the "identifiers" (which are the EventSubscriber implementation instances like e.g. CommunicationManager) are already scheduled, i.e. they are contained in the activeAsyncInvocations map which the SafeCallManagerImpl maintains.

I stared at the code for a couple of days now - but I'm still not seeing how a scheduled job can be executed and not remove itself afterward from this map.

I grabbed a heapdump from my system. And of course, it is visible that the activeAsyncInvocations map indeed contains all of the event subscriber "identifiers". An interesting fact though is that the SafeCallerManagerImpl's scheduler (i.e. QueuingThreadPoolExecutor) has a taskQueue which contains a FutureTask instance for each of the identifiers in state NEW - as if the executor simply stopped executing tasks. Its set of workers are empty as is its workQueue. The queueThread field is filled with a Thread instance named "safeCall-queue", but it doesn't show up in the thread-dump. I therefore assume it is not running anymore.

@kaikreuzer do you have any ideas on that?

@maggu2810
Copy link
Contributor

maggu2810 commented Feb 18, 2018

@SJKA I will try to have a detailed look at the sources the next days, too.

@sjsf
Copy link
Contributor

sjsf commented Feb 19, 2018

Looks like indeed the QueuingThreadPoolExecutor has a weak spot. There is a gap when its thread decides to commit suicide until it really is terminated. During that period, a new entry will not be picked up anymore by the existing thread but also won't spawn a new thread.

I will try to prove it and also create a fix for it.

@sjsf
Copy link
Contributor

sjsf commented Feb 19, 2018

Ah, sorry, I didn't refresh my browser tab before posting this. Did you start already?

@maggu2810
Copy link
Contributor

Not yet, so feel free to create your fix. 😉

sjsf pushed a commit to sjsf/smarthome that referenced this issue Feb 19, 2018
...as before it could end up with queued tasks and no monitoring thread
if tasks were added right at the moment when the monitoring thread
decided to  terminate after there were no new tasks queued withing
two seconds.

fixes eclipse-archived#5031
Signed-off-by: Simon Kaufmann <[email protected]>
kaikreuzer pushed a commit that referenced this issue Feb 19, 2018
* fix synchronization in QueueingThreadPoolExecutor

...as before it could end up with queued tasks and no monitoring thread
if tasks were added right at the moment when the monitoring thread
decided to  terminate after there were no new tasks queued withing
two seconds.

fixes #5031
Signed-off-by: Simon Kaufmann <[email protected]>
@aluedt
Copy link
Author

aluedt commented Feb 19, 2018

Thank you all so much! Glad to hear this is solved.

@kaikreuzer
Copy link
Contributor

All kudos go to @SJKA 👍

@maggu2810
Copy link
Contributor

Big thanks!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants