Exit may fail to exit with TFORM #14
Comments
Hi Takahiro,
Probably we have to set a lock before making the exit.
Cheers
Jos
… On 3 Jun 2014, at 20:26, Takahiro Ueda wrote:
|
Note for future development: I feel the underlying problem would be that |
Sounds like a good idea.
The ‘own cleanup’ stems from times that things were not that standard (at least in my mind).
Jos
… On 8 Feb 2017, at 11:33, Takahiro Ueda wrote:
Note for future development: I feel the underlying problem is that Terminate() is doing too many things. Actually it should be empty (except for some debugging stuff). When a program exits, the OS releases all memory, closes all streams and deletes all temporary files (mkstemp + unlink in POSIX). Then all a worker has to do is print an error message if needed (taking care to avoid a possible deadlock), maybe send a message to the master, and then end the thread.
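The "mkstemp + unlink" idiom mentioned above can be sketched as follows. This is only an illustration of the POSIX pattern, not code from FORM; the helper name open_anonymous_tempfile and the path template used in the usage note are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Create a temporary file that cleans itself up.
 * mkstemp() creates and opens a unique file from the template (which must
 * end in "XXXXXX" and be writable); unlink() then removes its name at once.
 * The data stays reachable through the descriptor, and the kernel frees it
 * when the descriptor is closed, which also happens when the process exits
 * or crashes. Returns the open descriptor, or -1 on failure.
 */
int open_anonymous_tempfile(char *path_template)
{
    assert(path_template != NULL);
    int fd = mkstemp(path_template);
    if (fd < 0)
        return -1;
    if (unlink(path_template) != 0) {   /* drop the name; fd stays usable */
        close(fd);
        return -1;
    }
    return fd;
}
```

A caller would pass a writable buffer such as char tmpl[] = "/tmp/sketchXXXXXX"; and use the returned descriptor with read() and write() as usual. With this pattern no explicit cleanup of the file is needed in a termination routine.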
|
I'm not 100% sure whether this is a correct way, but commenting out Line 1740 in d15ec75
Maybe it would be better to check whether or not the caller is a worker thread. |
If a tentative fix is available, would it be possible to propagate it into the main repository? The problems that this bug causes are very real and annoying when you run multiple jobs on a cluster. Suppose that some of the jobs (amplitude calculations) crash because of problems with your FORM code. However, since |
Maybe Takahiro has some ideas.
The problem is the way the (internal) Terminate function interacts with the killing of the
workers. It seems that it can cause a deadlock somewhere, but only occasionally.
That makes it hard to find.
Jos
… On 5 Jun 2020, at 12:00, Vladyslav Shtabovenko wrote:
If a tentative fix is available, would it be possible to propagate it into the main repository?
I would be happy to test it in my environment as well.
The problems that this bug causes are very real and annoying when you run multiple jobs on a cluster. Suppose that some of the jobs (amplitude calculations) crash because of problems with your FORM code.
However, since tform doesn't exit but simply hangs with something like
Program terminating in thread 3 at FORM-test-script.frm Line 5
(this is for Takahiro's example program when run with tform -w4),
neither the grid control system nor the user can immediately recognize such failed jobs.
Your only safe bet is to constantly monitor the logs and grep for "terminating" or so. Then
you have to manually cancel the remaining jobs. This is not very satisfactory, since
normally the grid control system could do it for you automatically. But for that tform needs
to crash properly, not just hang.
|
I don't know how to use Pthreads and I'm not sure what is done in threads.c, but is it OK for a worker to call TerminateAllThreads() (inside the Terminate() function)? See sources/threads.c, lines 824 to 844 in 4057c65.
It looks like calling pthread_join() for all the workers, including itself. Maybe we can start by avoiding the TerminateAllThreads() call for workers: see sources/startup.c, lines 1753 to 1755 in 4057c65.
|
As I've already written, I'm very much willing to test possible fixes (perhaps from a dedicated branch?). I employ tform both on my quad-core laptop and on our TTP cluster, so that should provide different test environments. |
This is indeed where probably the trouble is.
The result is that the worker that calls it is killing itself twice, if I remember correctly.
Maybe the fix is to kill all workers individually and skip the one that called it?
Or somehow skip the first (auto)kill? I think that at the time I did not see a quick
solution. This may need some thinking, but at the moment I cannot think straight
(headache).
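The idea of joining all workers individually and skipping the caller could look roughly like the sketch below. It is not FORM's TerminateAllThreads(); the names join_other_workers and idle_worker are hypothetical, and the sketch assumes the thread handles are kept in a plain array.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Trivial worker body, used only to have something to join in a demo. */
void *idle_worker(void *arg)
{
    return arg;
}

/*
 * Join every thread in `workers` except the calling thread.
 * pthread_equal() is the portable way to compare thread handles.
 * Returns the number of threads that were joined successfully.
 */
int join_other_workers(const pthread_t *workers, size_t n)
{
    assert(workers != NULL || n == 0);
    pthread_t self = pthread_self();
    int joined = 0;
    for (size_t i = 0; i < n; i++) {
        if (pthread_equal(workers[i], self))
            continue;                    /* never join yourself */
        if (pthread_join(workers[i], NULL) == 0)
            joined++;
    }
    return joined;
}
```

Skipping the caller only removes the self-join. If two workers reach this code at the same time, each will try to join the other and both can wait forever, so this alone would probably not be a complete fix.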
… On 5 Jun 2020, at 13:05, Takahiro Ueda wrote:
I don't know how to use Pthreads and I'm not sure what is done in threads.c, but is it OK for a worker to call TerminateAllThreads() (inside the Terminate() function)?
https://github.com/vermaseren/form/blob/4057c6564e7a2356e92d1ea37047c86d58042350/sources/threads.c#L824-L844
It looks like calling pthread_join() for all the workers, including itself. Maybe we can start by avoiding the TerminateAllThreads() call for workers:
https://github.com/vermaseren/form/blob/4057c6564e7a2356e92d1ea37047c86d58042350/sources/startup.c#L1753-L1755
|
I think, in practice, there is no need to wait for all threads for an abnormal exit. A tentative fix would be:
--- a/sources/startup.c
+++ b/sources/startup.c
@@ -1751,7 +1751,9 @@ VOID Terminate(int errorcode)
/*:[08may2006 mt]*/
#endif
#ifdef WITHPTHREADS
- TerminateAllThreads();
+ if ( !WhoAmI() && !errorcode ) {
+ TerminateAllThreads();
+ }
#endif
if ( AC.FinalStats ) {
 		if ( AM.PrintTotalSize ) {
but it may exit ( |
Many thanks! This definitely fixes your first example in this issue. I'm now switching to a custom tform binary with your fix and will report if |
I've been using Takahiro's patch for some time and didn't observe any side effects. It also appears that it really fixes the original behavior of tform not ending the process. So perhaps it could be merged into master? |
Experimentally, this resolved hanging TFORM jobs using the exit statement, without any side effects.
OK. I have merged the patch. Then hopefully more people try it and we will see whether or not someone complains of possible side effects. |
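The condition in the merged patch can be written out as a small predicate. This is a sketch only: should_join_workers is a hypothetical name, and it assumes (as the `!WhoAmI()` test in the diff suggests) that thread id 0 denotes the master and positive ids denote workers.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Mirror of the guard `!WhoAmI() && !errorcode` from the patch.
 * thread_id: assumed to be 0 for the master and > 0 for a worker.
 * errorcode: 0 for a normal exit, non-zero for an abnormal one.
 * Only the master on a normal exit waits for the workers; in every other
 * case the process goes on to exit without joining anybody.
 */
bool should_join_workers(int thread_id, int errorcode)
{
    assert(thread_id >= 0);
    return thread_id == 0 && errorcode == 0;
}
```

With this guard a worker never calls pthread_join() on itself, and an abnormal exit never waits for workers that may be blocked.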
Sometimes the Exit statement causes strange things with TFORM. The following program should quit immediately,
but
for i in {1..100}; do tform -w4 test.frm; done
(in bash) easily gets stuck. I also got "double free or corruption (fasttop)" errors from glibc. Because
L F = a+(x1+...+x9)
, i.e., only one term with "a", doesn't lead to this problem, I guess it comes from two or more threads trying to exit.
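One generic way to keep two or more threads from running the same exit path is to let only the first caller through. The sketch below uses a C11 atomic flag for that; it is not what FORM does, and the names claim_exit, racing_worker and count_winners are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_DEMO_THREADS 16

static atomic_flag exit_claimed = ATOMIC_FLAG_INIT;
static atomic_int winners;              /* static storage: starts at 0 */

/* Returns true for exactly one caller (the first), false for all others. */
bool claim_exit(void)
{
    return !atomic_flag_test_and_set(&exit_claimed);
}

/* Each worker tries to claim the exit; real code would clean up and call
 * exit() in the winning branch, while the losers simply end their thread
 * without touching shared cleanup state. */
static void *racing_worker(void *arg)
{
    (void)arg;
    if (claim_exit())
        atomic_fetch_add(&winners, 1);
    return NULL;
}

/* Start nthreads racing workers and report how many claimed the exit. */
int count_winners(int nthreads)
{
    assert(nthreads > 0 && nthreads <= MAX_DEMO_THREADS);
    pthread_t t[MAX_DEMO_THREADS];
    int started = 0;
    for (int i = 0; i < nthreads; i++)
        if (pthread_create(&t[started], NULL, racing_worker, NULL) == 0)
            started++;
    for (int i = 0; i < started; i++)
        pthread_join(t[i], NULL);
    return atomic_load(&winners);
}
```

Since cleanup then runs at most once, a double free caused by two threads releasing the same buffers should not occur on this path. Whether that is the actual cause of the glibc error reported here is not established.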