Change the destination of where failed script is written to #1530

gaow · 2023-12-23T01:30:13Z

This is typically what we see when a job fails:

ERROR: [susie_twas_1]: [0]:
Failed to execute Rscript /home/aw3600/.sos/97de343d7da3f0ce/susie_twas_1_0_ff982a12.R
exitcode=1, workdir=/home/mambauser/data
---------------------------------------------------------------------------
[susie_twas]: Exits with 1 pending step (susie_twas_2)

The line Rscript /home/aw3600/.sos/97de343d7da3f0ce/susie_twas_1_0_ff982a12.R is how we track down and try to reproduce the error for debugging. However, we are working with cloud computing, where /home/aw3600/.sos/ is a path inside a VM that gets destroyed after the command ends. Although it is possible to copy the entire .sos folder to a permanent AWS S3 bucket before the VM dies, syncing the entire folder is non-trivial, and all we care about is this one file, susie_twas_1_0_ff982a12.R.

I think this was brought up once before, but I don't remember whether we have an option for it yet -- can we specify something on the sos run interface to have these temporary scripts saved to a given folder? I like the behavior of:

R: ... , stdout =, stderr =

which writes the stderr and stdout to where I want them. I wonder if we can add something like:

R: ..., stdout=, stderr=, debug_script="/path/to/debug/folder"

and only keep the scripts in /path/to/debug/folder when there is an issue -- and change the Failed to execute message to point to this script?
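
A rough sketch of the requested behavior, written in Python because SoS is a Python package. The helper name `run_and_keep_on_failure` and its `debug_dir` parameter are hypothetical and not part of SoS. The sketch only illustrates copying the script into a chosen folder when the interpreter exits with a non-zero code, and reporting that copy instead of the temporary path.

```python
import shutil
import subprocess
from pathlib import Path


def run_and_keep_on_failure(cmd, script_path, debug_dir=None):
    """Run `cmd + [script_path]`; on failure, copy the script to `debug_dir`.

    Returns (exit_code, path_to_report). The reported path is the copy in
    `debug_dir` if one was made, otherwise the original script path.
    Hypothetical helper for illustration only, not SoS code.
    """
    script_path = Path(script_path)
    result = subprocess.run([*cmd, str(script_path)], capture_output=True, text=True)
    report_path = script_path
    if result.returncode != 0 and debug_dir is not None:
        debug_dir = Path(debug_dir)
        # Create the debug folder only when there is something to keep.
        debug_dir.mkdir(parents=True, exist_ok=True)
        report_path = Path(shutil.copy2(script_path, debug_dir))
    return result.returncode, report_path
```

With something like `run_and_keep_on_failure(["Rscript"], script, "/path/to/debug/folder")`, a successful run leaves the debug folder untouched, so only scripts of failed steps would end up in the synced location.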

BoPeng · 2023-12-23T07:22:37Z

Can't you use -v outside_sos:/home/user/.sos to save the content outside of the VM?

gaow · 2023-12-23T14:04:41Z

The suggestion we got from the vendor team is to avoid any direct communication between the VM and S3, especially for frequent file exchanges. That is why their setup writes data only to specific pre-mounted folders, which later get copied to S3 automatically. I guess we can try to mount it as you suggest and see if it is feasible for large embarrassingly parallel workflows involving many signature checks. I will report back here if it seems too slow.

BoPeng · 2023-12-24T04:29:24Z

Let me know if it works for you. It is pretty easy to add this setting from the command line, but the problem is that ~/.sos/config.yml is the default configuration path, and moving ~/.sos away will force users to use this option all the time. It is possible to keep ~/.sos/ but move the command log to somewhere else though.

gaow · 2023-12-24T16:01:34Z

@BoPeng thank you! Currently we cannot test it properly because our vendor set it up so that the S3 bucket is mounted read-only. The automatic process runs the sync separately from our SoS run, so we don't really write there in real time. I am asking them to reconfigure it and am waiting for the response.

It is possible to keep ~/.sos/ but move the command log to somewhere else though.

I think all we would need is to move the failed script ("command log"?) elsewhere. That's what we are interested in. If that's written to a folder that we sync between the VM and S3 that'd be the best. We should be able to leverage that and test it out.

gaow · 2024-02-04T02:59:25Z

It is possible to keep ~/.sos/ but move the command log to somewhere else though.

@BoPeng I'm sorry, it turns out we do need this feature -- to keep ~/.sos where it is but have the failed script written somewhere else, perhaps the output folder? We did try to mount the S3 bucket to ~/.sos as the cache. However, I/O is an issue and we frequently encounter SQLite failures for large jobs, to the extent that we cannot get our analysis done this way. We will need to keep ~/.sos local.

Can we change the default for SoS anyway, to write the command log to the same folder as where people set stderr to be, and still write to ~/.sos if they did not set it? Thank you in advance for helping with this emergency fix!

BoPeng · 2024-02-04T05:53:46Z

The problem is that .sos/... has the complete script that can be executed by Rscript ... etc., but stderr is a file with other content and cannot be executed directly. Would it not be enough if we also write the content of the script to stderr?

BoPeng · 2024-02-04T06:10:32Z

Let me see if #1533 works.

gaow · 2024-02-04T13:14:04Z

Thank you @BoPeng, it's a very helpful pointer to where to modify it. I have changed it to:

https://github.com/vatlab/sos/pull/1533/files

It seems good. For example for this script:

[1]
R: 
  print(hello)

it says:

Failed to execute Rscript /home/gw/.sos/aee75cfb40461b96/1_0_a0afcf75.R
exitcode=1, workdir=/home/gw/tmp/04-Feb-2024

But when I set the stderr file explicitly:

[1]
R: stderr="file.txt"
  print(hello)

the temporary script gets into the current folder properly:

Failed to execute Rscript /home/gw/tmp/04-Feb-2024/1_0_9b3f78e8.R
exitcode=1, workdir=/home/gw/tmp/04-Feb-2024, stderr=file.txt

Do you see any obvious issues with this patch? Please improve as you see fit. I wonder if we can also release a new version to conda for us to pull the changes and apply it to our jobs on AWS. Thanks!
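
The two outputs above suggest a simple rule for where the failed script lands. The sketch below is only an illustration of that rule, not the code in the PR: the function name `failed_script_dir` and its `default_dir` parameter are hypothetical, and it assumes the target is the folder containing the stderr file (the patch may instead use the working directory, which gives the same result in the example above).

```python
import os
from pathlib import Path


def failed_script_dir(stderr=None, default_dir="~/.sos"):
    """Pick the folder where the script of a failed step would be kept.

    Assumption for illustration: when `stderr` is a file name, use the folder
    containing that file (the current directory for a bare name); otherwise
    fall back to `default_dir`.
    """
    if isinstance(stderr, (str, os.PathLike)):
        # A relative name such as "file.txt" resolves against the current directory.
        return Path(stderr).expanduser().resolve().parent
    # No explicit stderr file (e.g. the default sys.stderr stream).
    return Path(default_dir).expanduser()
```

Calling `failed_script_dir("file.txt")` returns the current working directory, while `failed_script_dir()` falls back to the expanded `~/.sos` path.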

BoPeng · 2024-02-04T13:59:33Z

How about a combination of both patches? I think having the script directly in stderr can be useful especially when stderr is the default (sys.stderr).
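
A minimal sketch of what "having the script directly in stderr" could look like. The function `report_failed_script` and the marker lines are made up for illustration and do not come from either patch. It writes the failing command followed by the script body to a text stream, defaulting to sys.stderr.

```python
import sys


def report_failed_script(cmd, script_text, stream=None):
    """Write the failing command followed by the script body to `stream`.

    Hypothetical helper; `stream` is any text stream and defaults to sys.stderr.
    """
    stream = sys.stderr if stream is None else stream
    stream.write(f"Failed to execute {cmd}\n")
    stream.write("---- script ----\n")
    # Make sure the script body ends with a newline before the closing marker.
    stream.write(script_text if script_text.endswith("\n") else script_text + "\n")
    stream.write("---- end of script ----\n")
```

Because the script body is written in full, long scripts make the stderr output long as well, which is a trade-off to weigh against only reporting the script path.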

BoPeng · 2024-02-04T14:09:26Z

If we do that, for consistency and convenience, we should also write the script to

sos/src/sos/task_executor.py

Line 311 in 23bd5e9

with open(

This is because sometimes it is not entirely clear what went wrong with a task when it fails due to a variable replacement problem.

gaow · 2024-02-04T15:03:59Z

@BoPeng I brought the other patch back via: ffeac9c instead of writing the entire script, because the script can be very long in many applications.

I think by placing the error message into stderr, it should also be reflected in the task status, so we don't have to modify task_executor.py?

The patch does not work as is, however. The error message is

TypeError: a bytes-like object is required, not 'str'

I think this is because you opened the stderr file with the b option, so we need to se.write a bytes object, not a str? I'm not sure how to do that. I tried wrapping it with encode_msg, which bypasses the problem, but the output has non-ASCII characters in it ... perhaps you know a quick fix for it :)
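
For reference, this TypeError is what Python raises when a str is passed to write() on a file opened in binary mode. A common fix is to encode the string with str.encode before writing. The sketch below shows that pattern in isolation; `append_message` is a hypothetical helper, and whether the updated patch does it this way is not stated in this thread.

```python
def append_message(path, message):
    """Append `message` to a file opened in binary mode.

    Accepts either str or bytes; str is encoded as UTF-8 first, because a
    file opened with "b" only accepts bytes-like objects.
    """
    with open(path, "ab") as se:
        if isinstance(message, str):
            message = message.encode("utf-8")
        se.write(message)
```

Reading the file back in text mode with UTF-8 decoding then gives the original message.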

BoPeng · 2024-02-04T16:43:01Z

OK, the patch is updated. It should work with or without the stderr option and with or without task:. Please let me know if it works as expected.

gaow · 2024-02-04T16:47:13Z

Thanks @BoPeng, the patch works well in a couple of tests I tried. But for the conda release -- I see that there are some failed tests on #1533. Should they be ignored?

BoPeng · 2024-02-04T16:51:11Z

I will clean up the code (pylint) and make a release.

BoPeng · 2024-02-04T22:54:53Z

sos 0.24.5 is released.

gaow · 2024-02-05T00:44:38Z

Thank you @BoPeng. It's not here yet: https://anaconda.org/conda-forge/sos but I guess it will show up soon?

BoPeng · 2024-02-05T00:45:35Z

Yes, it should be there a few hours after the PyPI release.

gaow · 2024-02-05T01:37:06Z

I am not sure if that will be the case ... according to the release history, conda should have version 0.24.4: https://pypi.org/project/sos/#history

However, it is still 0.24.3 https://anaconda.org/conda-forge/sos

Perhaps there are some failed checks that prevent it from getting onto conda-forge?

gaow · 2024-02-05T02:17:40Z

It was posted after I merged the PR.

BoPeng mentioned this issue Feb 4, 2024

Write offending script to stderr (#i530) #1533

Merged

gaow added a commit that referenced this issue Feb 4, 2024

Further address comment on #1530

ffeac9c

gaow closed this as completed Feb 4, 2024
