Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exit may fail to exit with TFORM #14

Closed
tueda opened this issue Jun 3, 2014 · 13 comments
Closed

Exit may fail to exit with TFORM #14

tueda opened this issue Jun 3, 2014 · 13 comments
Labels
bug Something isn't working

Comments

@tueda
Copy link
Collaborator

tueda commented Jun 3, 2014

Sometimes the Exit statement causes strange things with TFORM. The following program should quit immediately,

S a,x1,...,x9;
L F = a*(x1+...+x9);
*L F = a+(x1+...+x9);
.sort
if (count(a,1)) exit;
.end

but for i in {1..100}; do tform -w4 test.frm; done (in bash) easily gets stuck. I also got "double free or corruption (fasttop)" errors from glibc.

Because L F = a+(x1+...+x9), i.e., only one term with "a", doesn't lead this problem, I guess it comes from that two or more threads try to exit.

@vermaseren
Copy link
Owner

Hi Takahiro,

Probably we have to set a lock before making the exit.
As you notice, two may exit at the same time. I don’t think that is protected.
With the lock it should be safe and the lock doesn’t have to be unlocked I think.

Cheers

Jos

On 3 jun. 2014, at 20:26, Takahiro Ueda [email protected] wrote:

Sometimes the Exit statement causes strange things with TFORM. The following program should quit immediately,

S a,x1,...,x9;
L F = a*(x1+...+x9);
*L F = a+(x1+...+x9);
.sort
if (count(a,1)) exit;
.end
but for i in {1..100}; do tform -w4 test.frm; done (in bash) easily gets stuck. I also got "double free or corruption (fasttop)" errors from glibc.

Because L F = a+(x1+...+x9), i.e., only one term with "a", doesn't lead this problem, I guess it comes from that two or more threads try to exit.


Reply to this email directly or view it on GitHub.

@tueda
Copy link
Collaborator Author

tueda commented Feb 8, 2017

Note for future development: I feel the underlying problem would be that Terminate() is doing many things. Actually it should be empty (except some debugging stuff). The same for the signal handlers. (Our handlers deviate from the POSIX standard; e.g., we should not call free.) When a program exits, the OS must release all memories, close all streams and delete all temporary files (mkstemp + unlink in the POSIX). Then all a worker have to do is printing an error message if needed (caring possible deadlock), maybe sending a message to the master and then ending the thread/process.

@vermaseren
Copy link
Owner

vermaseren commented Feb 8, 2017 via email

@tueda
Copy link
Collaborator Author

tueda commented Oct 6, 2017

I'm not 100% sure whether this is a correct way, but commenting out TerminateAllThreads() in Terminate() helps in my environment.

TerminateAllThreads();

Maybe it would be better to check whether or not the caller is a worker thread.

@vsht
Copy link

vsht commented Jun 5, 2020

If a tentative fix is available, would it be possible to propagate it into the main repository?
I would be happy to test it in my environment as well.

The problems that this bug causes are very real and annoying when you run multiple jobs on a cluster. Suppose that some of the jobs (amplitude calculations) crash because of problems with your FORM code.

However, since tform doesn't exit but simply hangs with something like
Program terminating in thread 3 at FORM-test-script.frm Line 5
(this is for Takahiro's example program when run with tform -w4),
neither the grid control system nor the user can immediately recognize such failed jobs.
Your only safe bet it to constantly monitor the logs and grep for terminating or so. Then
you have to manually cancel the remaining jobs. This is not very satisfactory, since
normally the grid control system could do it for you automatically. But for that tform needs
to crash properly, not just hang.

@vermaseren
Copy link
Owner

vermaseren commented Jun 5, 2020 via email

@tueda
Copy link
Collaborator Author

tueda commented Jun 5, 2020

I don't know how to use Pthreads and I'm not sure what is done in threads.c, but is it OK for a worker to call TerminateAllThreads() (inside the Terminate() function)?

form/sources/threads.c

Lines 824 to 844 in 4057c65

VOID TerminateAllThreads()
{
int i;
for ( i = 1; i <= numberofworkers; i++ ) {
GetThread(i);
WakeupThread(i,TERMINATETHREAD);
}
#ifdef WITHSORTBOTS
for ( i = numberofworkers+1; i <= numberofworkers+numberofsortbots; i++ ) {
WakeupThread(i,TERMINATETHREAD);
}
#endif
for ( i = 1; i <= numberofworkers; i++ ) {
pthread_join(threadpointers[i],NULL);
}
#ifdef WITHSORTBOTS
for ( i = numberofworkers+1; i <= numberofworkers+numberofsortbots; i++ ) {
pthread_join(threadpointers[i],NULL);
}
#endif
}
It looks like calling pthread_join() for all the workers, including itself. Maybe we can start by avoiding the TerminateAllThreads() call for workers:

form/sources/startup.c

Lines 1753 to 1755 in 4057c65

#ifdef WITHPTHREADS
TerminateAllThreads();
#endif

@vsht
Copy link

vsht commented Jun 5, 2020

As I've already wrote, I'm very much willing to test possible fixes (perhaps from a dedicated branch?). I employ tform both on my quadcore laptop and on our TTP cluster, so that should provide for different test environments.

@vermaseren
Copy link
Owner

vermaseren commented Jun 5, 2020 via email

@tueda
Copy link
Collaborator Author

tueda commented Jun 5, 2020

I think, in practice, there is no need to wait for all threads for an abnormal exit.

A tentative fix would be:

--- a/sources/startup.c
+++ b/sources/startup.c
@@ -1751,7 +1751,9 @@ VOID Terminate(int errorcode)
        /*:[08may2006 mt]*/
 #endif
 #ifdef WITHPTHREADS
-       TerminateAllThreads();
+       if ( !WhoAmI() && !errorcode ) {
+               TerminateAllThreads();
+       }
 #endif
        if ( AC.FinalStats ) {
                if ( AM.PrintTotalSize ) {

but it may exit (P_term() is called in Terminate()) before deleting *.str files.

@vsht
Copy link

vsht commented Jun 6, 2020

Many thanks! This definitely fixes your first example in this issue.

I'm now switching to a custom tform binary with your fix and will report if
I encounter any related issues.

@vsht
Copy link

vsht commented Jul 17, 2020

I've been using Takahiro's patch for some time and didn't observe any side effects.

It also appears that it really fixes the original behavior of tform not ending the process
after the evaluation crashed.

So perhaps it could be merged into master?

tueda added a commit that referenced this issue Jul 17, 2020
Experimentally, this resolved hanging TFORM jobs using the exit
statement, without any side effects.
@tueda
Copy link
Collaborator Author

tueda commented Jul 17, 2020

OK. I have merged the patch. Then hopefully more people try it and we will see whether or not someone complains of possible side effects.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants