Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

calling join() on terminated thread hangs #71

Open
LnnrtS opened this issue Feb 18, 2021 · 17 comments
Open

calling join() on terminated thread hangs #71

LnnrtS opened this issue Feb 18, 2021 · 17 comments

Comments

@LnnrtS
Copy link
Contributor

LnnrtS commented Feb 18, 2021

I believe this condition should also check for state::terminated

while (state_ != state::destroyed)

@ilg-ul
Copy link
Contributor

ilg-ul commented Feb 18, 2021

What would be the updated condition?

And what exactly is the rationale behind this change? Is there any POSIX requirement for this?

@LnnrtS
Copy link
Contributor Author

LnnrtS commented Feb 18, 2021

while (state_ != state::destroyed && state_ != state::terminated) fixed the issue for me.
Otherwise join() would not return forever if called on a thread that has already terminated (the thread function returned).

The comment above the function indicates that join() should return when called on a terminated thread.
I am not familiar with the details of different thread states. If any thread should ultimately reach destroyed then the current condition is correct.

@LnnrtS
Copy link
Contributor Author

LnnrtS commented Feb 26, 2021

I took a second look at the code. The problem is that the thread sets itself to terminated on exit but only gets destroyed by the idle thread. If that one never runs (because of high load), join() blocks forever if it waits for destroyed.

To avoid this kind of deadlock I would either suggest the fix above (if there are no side effects) or somehow raise the priority of the idle thread in this situation. (Something like that is already done for mutexes, right?)

@ilg-ul
Copy link
Contributor

ilg-ul commented Feb 26, 2021

fix above ... raise the priority of the idle thread

I can't tell if the fix is ok, we need to analyse it carefully.

The priority can be temporarily raised, but to what level, to be sure that it does not have negative side effects, considering that the idle thread is intended to run only when there is nothing else to be done?

@LnnrtS
Copy link
Contributor Author

LnnrtS commented Feb 26, 2021

The priority can be temporarily raised, but to what level

To the maximum level of all threads that are waiting on a join()?

the idle thread is intended to run only when there is nothing else to be done

Maybe then the thread destruction should not be done by the idle thread

analyse it carefully.

sure. I will go with my quick fix for now and we can come back to this once testing infrastructure is in place.

@ilg-ul
Copy link
Contributor

ilg-ul commented Feb 26, 2021

To the maximum level of all threads that are waiting on a join()?

I need to check the implementation, but is there a list of threads waiting on a join()?

thread destruction should not be done by the idle thread

Please feel free to suggest an alternate solution.

From my point of view, a well behaved application generally should not rely on the preemptive scheduler, and most of the times should not use a full scheduler cuanta, but enter the idle thread at the end of most scheduler ticks.

Unless you have a background computation task, if your application does not reach the idle thread, I would search for busy waits or similar issues.

@LnnrtS
Copy link
Contributor Author

LnnrtS commented Feb 26, 2021

I don't have a overview of all the implementation details but this is what comes to mind:

  • Why can't the thread destroy itself?
  • the idle thread could inherit the priority of a thread that terminates if that is greater than its current priority (and reduce the priority after garbage collection ran)
  • maybe it makes sense to split idle actions and garbage collection

Yes, the problem occurs only on high load. Actually I noticed it due to a busy wait as described in #72.

@ilg-ul
Copy link
Contributor

ilg-ul commented Feb 26, 2021

Why can't the thread destroy itself?

Because it is quite difficult for a thread to deallocate the stack it is currently using. Or even the thread object itself.

the idle thread could inherit the priority of a thread that terminates

It can, but are you sure that this will help the thread waiting to join?

maybe it makes sense to split idle actions and garbage collection

How? Adding a new GC thread is not exactly attractive.

@LnnrtS
Copy link
Contributor Author

LnnrtS commented Mar 11, 2021

the idle thread could inherit the priority of a thread that terminates

It can, but are you sure that this will help the thread waiting to join?

The fact that my hack works shows that the thread reaches state::terminated, so the priorities are set in a way that the thread being waited on manages to make progress. If the idle threads inherits the priority, the rest of the termination process will continue in the same priority as before (which we know ensures to make progress and will eventually finish). Currently, the priority drops for that last part which can cause a hang.

@ilg-ul
Copy link
Contributor

ilg-ul commented Mar 11, 2021

I may be terribly wrong since I'm not sure I recall correctly the details, but the CMSIS OS validation tests might somehow try to reuse threads; in other words, the thread storage is allocated once and reinitialised multiple times. For this to work, it is mandatory that the thread is no longer in use.

As far as I know, only threads that were joined are guaranteed to no longer be in use.

If join() returns when the thread is terminated and the actual destruction is done much later, it is possible for some other thread to try to reuse it.

Your hack works since you probably don't reuse threads.

When I asked if you are sure that raising priorities helps, I was thinking of the case when a CPU intensive thread may very well have a higher priority, so a slight priority bump might not help.

@LnnrtS
Copy link
Contributor Author

LnnrtS commented Mar 11, 2021

I am not suggesting my hack as general solution to the problem. Next to what you describe, there is also the problem that if the memory for the stack does not get deallocated (because the idle thread doesn't run) it might be missing for other tasks. So you could occasionally run out of memory for no apparent reason.

Inheriting the priority just makes sure that the transition terminated -> destroyed is run at the same priority as the thread function did. So you you set up priorities in a way that the thread can finish the thread will also get destroyed.

@LnnrtS
Copy link
Contributor Author

LnnrtS commented Mar 11, 2021

Also if the idle thread's priority was temporarily raised to destroy a terminated thread, the idle thread should skip other idle actions. That should reduce potential sideffects

@ilg-ul
Copy link
Contributor

ilg-ul commented Mar 11, 2021

run out of memory...

Right. For such cases one might prefer to use static stacks.

Inheriting the priority...

We'll consider this mechanism, in some cases it might help, but the real problems here are:

  • the presence of a higher priority CPU intensive thread
  • the destruction done on the idle thread

I'm thinking of an alternate solution. How about, in addition to performing the destruction on the idle thread, as of now, it is also performed on the thread waiting to join? (if the thread is terminated, of course)

@LnnrtS
Copy link
Contributor Author

LnnrtS commented Mar 11, 2021

the presence of a higher priority CPU intensive thread

higher priority than the idle thread, right?

alternative solution

I was also thinking about that. But what if the idle thread already started with destroying the thread and then you call join()? It might still hang, right?

@ilg-ul
Copy link
Contributor

ilg-ul commented Mar 11, 2021

higher priority than the idle thread, right?

higher priority than the bumped priority, whatever it is.

what if the idle thread already started with destroying the thread and then you call join()? It might still hang, right?

right, it has to wait for the destruction to complete anyway, and if this involves freeing memory, this duration is non-deterministic for the first-fit-top allocator, since it must also coalesce blocks.

but if the waiting thread bumps the idle priority to its own priority, it should take about the same as if it does the destruction itself.

just that we need an exclusion mechanism to prevent them both starting, probably a new state like while_destroying.

@LnnrtS
Copy link
Contributor Author

LnnrtS commented Mar 11, 2021

higher priority than the bumped priority, whatever it is.

But in that case the thread would have not even reached terminated. So that example is irrelevant I would say. We only care about the case where terminated is reached but destroyed is not.

the waiting thread bumps the idle priority to its own priority

Yes, that would work. But what's the advantage over taking the priority of the terminating thread? It certainly does make things more complicated.

@ilg-ul
Copy link
Contributor

ilg-ul commented Mar 11, 2021

We only care about the case where terminated is reached but destroyed is not.

That's correct. Otherwise the current functionality applies.

We can make things even faster if the terminating thread also awakes the thread waiting to join, and in this case the destruction is done sooner, without having to wait for the idle thread. The idle thread will do the cleanup only for threads that are of no interest for other threads

But what's the advantage over taking the priority of the terminating thread?

I don't know, but I can ask the same, what is the advantage of the idle thread taking the priority of the terminating thread?

I suggest we use a mechanism similar to the one used by mutexes, if for no other reason that it is known to work.

But if we decide to perform the destruction on the context of the waiting thread, we might not even need to bump the idle priority.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants