This repository has been archived by the owner on Aug 23, 2024. It is now read-only.
Dear Keiran,

The containers available on Dockstore crash frequently in our environment. I tried versions 1.0.8, 1.1.2, 1.1.3, and 1.1.4. The crashes occur at random steps of the workflow, even for the same dataset, which led me to believe that this is a technical issue unrelated to the data. With few exceptions, I could not find any error messages in the log files. The `*.wrapper.log` files and the files inside the `timings` folder contained an exit code of 255, but none of the other log files gave any hint about the source of the error.
After extensive debugging I managed to track down the crashes to two issues:
To launch a job, a command is written to a shell script file, for example `WGS_tumor_vs_control/caveman/tmpCaveman/logs/Sanger_CGP_Caveman_Implement_caveman_estep.94.sh`. This script is then made executable and called immediately afterwards. Apparently, some versions/storage drivers of Docker have an issue with this. When there is no delay between making the script executable and running it, the change in permissions has occasionally not yet taken effect when the script is run, resulting in the error `Text file busy` and the termination of the workflow. Others have reported this issue, too: "Running chmod on file results in 'text file busy' when running straight after" (moby/moby#9547). Supposedly, it helps to insert a `sync` or `sleep 1` between making the script executable and running it. I cannot confirm that this helps: switching to Singularity fixed the issue for me, so I did not find out which scripts would need to be modified and did not try it. Even though this is a bug in Docker rather than in the workflow itself, you might want to consider inserting a `sync`, because other users might run into the same error.
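The workaround suggested in moby/moby#9547 can be sketched in shell as below. This is a minimal illustration, not code from the workflow: the function name `run_job_script` and its two parameters are hypothetical, and whether `sync` actually prevents `Text file busy` on an affected Docker storage driver is untested here.

```shell
#!/bin/sh
# Hypothetical helper (not part of the workflow): write a job script,
# make it executable, flush pending writes, and only then run it.
run_job_script() {
    script_path="$1"   # where to write the job script; must contain a slash
    job_command="$2"   # command line the job should run

    # Write the script; the redirection closes the file before chmod runs.
    printf '#!/bin/sh\n%s\n' "$job_command" > "$script_path"
    chmod +x "$script_path"

    # Suggested workaround: flush filesystem buffers between chmod and exec.
    # "sleep 1" is the alternative mentioned in the upstream report.
    sync

    # Execute the script and pass its exit status back to the caller.
    "$script_path"
}
```

A call such as `run_job_script /tmp/job.sh "echo hello"` writes the script, runs it, and returns the script's exit status.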
After solving the above issue, only about half of the runs crashed (rather than 9 out of 10). The remaining crashes were caused by the `need_backoff` function in `/opt/wtsi-cgp/lib/perl5/PCAP/Threaded.pm`. The following line occasionally threw the error `Use of uninitialized value $one_min`: `$ret = 1 if($one_min > $self->{'system_cpus'});`
I was unable to find out why `$one_min` is sometimes undefined. I tried writing the value of `$uptime` to STDERR to check whether the regex fails to match, but for reasons I do not understand, the values were not written to the log files of the workflow. I tried replacing the `uptime` tool with something that is guaranteed to produce an output string matching the regex, but the error still occurred. At this point, I suspect that the call to the external command `uptime` from within Perl fails from time to time. I eventually gave up, since it takes days to reproduce the issue and I was able to avoid the crashes altogether by simply wrapping the offending line in this:
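The Perl snippet referred to above is not shown here, and the following is not a reconstruction of it. As a separate, rough illustration of the general idea (only compare the load when it could actually be parsed), here is a shell sketch. The function names, the regex for uptime-style text, and the choice to treat an unknown load as "no backoff" are all assumptions made for this example.

```shell
#!/bin/sh
# Hypothetical helper: print the one-minute load average found in an
# uptime-style string such as "... load average: 0.15, 0.10, 0.05".
# Prints nothing when the pattern does not match.
one_min_load() {
    printf '%s\n' "$1" | sed -n 's/.*load average: *\([0-9][0-9.]*\).*/\1/p'
}

# Hypothetical guard: succeed (exit status 0) only when the one-minute
# load is known AND exceeds the CPU count. An unparsable or empty
# uptime string is treated as "no backoff" instead of an error.
need_backoff() {
    uptime_text="$1"
    system_cpus="$2"
    one_min=$(one_min_load "$uptime_text")

    # Guard against the undefined case before comparing.
    if [ -z "$one_min" ]; then
        return 1
    fi

    # awk handles the floating-point comparison that plain sh cannot.
    awk -v one_min="$one_min" -v cpus="$system_cpus" \
        'BEGIN { exit !((one_min + 0) > (cpus + 0)) }'
}
```

For example, `need_backoff "$(uptime)" 4` succeeds only if a load value was found and is above 4. If the text contains no recognizable load average, the function simply reports that no backoff is needed.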
I assume you do not run into these issues as often as I do, because you would certainly have noticed an error that affects a major fraction of the runs. I have no explanation as to why these two errors happen so frequently in our environment. Still, I was able to reproduce the issues on various systems (openSUSE/CentOS) with various kernel/Docker versions and various storage drivers, so other users might be affected, too. I therefore thought it reasonable to suggest precautions that circumvent the errors, and wanted to give you this feedback.
Regards,
Sebastian
But it may be that we need to move that sync to after the chmod. We are currently reviewing many of our tools, so this should get picked up relatively soon.
FYI, we do seem to be finding that users on CentOS have more problems, not sure why.