crab submit --dryrun executes cmsRun over and over again #7493

Closed
mapellidario opened this issue Nov 30, 2022 · 8 comments

@mapellidario
Member

problem

I submitted a task with --dryrun (adapted from https://github.com/dmwm/CRABServer/blob/master/test/statusTrackingTasks/HC-1kj.py ) to prod with cmssw 10 from lxplus7.

The crab submit --dryrun command gets stuck at [1]

Created temporary directory for dry run sandbox in /tmp/dmapelli/tmpuB5sJK
Executing test, please wait...

and from another shell I notice that it keeps executing the same cmsRun command all over again [2].

I could reproduce the same behavior with CMSSW 12 on lxplus8.

ideal behavior

cmsRun should be run only once.


[1]

> export CRABCONFIGINSTANCE=prod && crab submit --dryrun
Will use CRAB configuration file crabConfig.py
Importing CMSSW configuration pset.py
Finished importing CMSSW configuration pset.py
Sending the request to the server at cmsweb.cern.ch
Success: Your task has been delivered to the prod CRAB3 server.
Task name: 221130_143901:dmapelli_crab_20221130_153856
Project dir: crab_projects/crab_20221130_153856
Waiting for task to be processed
Checking task status
Task status: NEW
Please wait...
Task status: UPLOADED

Created temporary directory for dry run sandbox in /tmp/dmapelli/tmpuB5sJK
Executing test, please wait...
^CKeyboard Interrupted
Log file is /afs/cern.ch/user/d/dmapelli/crab/submit/1-analysis/crab_projects/crab_20221130_153856/crab.log

[2]

> grep -i "invoking" /tmp/dmapelli/tmpuB5sJK/wmagentJob.log
2022-11-30 15:39:41,961:INFO:Scram:    Invoking command: export X509_USER_PROXY=/tmp/x509up_u99307; export HOME=${HOME:-$PWD}; export SITECONFIG_PATH=/cvmfs/cms.cern.ch/SITECONF/local; cmsRun -p PSet.py -j FrameworkJobReport.xml
2022-11-30 15:40:10,776:INFO:Scram:    Invoking command: export X509_USER_PROXY=/tmp/x509up_u99307; export HOME=${HOME:-$PWD}; export SITECONFIG_PATH=/cvmfs/cms.cern.ch/SITECONF/local; cmsRun -p PSet.py -j FrameworkJobReport.xml
[...]

@mapellidario
Member Author

After a private chat with Stefano: this issue is related to #6544.

@belforte
Member

I can't reproduce it with CMSSW_10_6_12 [1]. One possibility is that in your case a
different configuration makes cmsRun fail in a way that triggers a restart in the
script. We'll need to modify the client so that more info is printed
https://github.com/dmwm/CRABClient/blob/aabf7727858d637ce3f42e574b420285a0fefc72/src/python/CRABClient/Commands/submit.py#L372
and so that the tmp directory is at least not removed
https://github.com/dmwm/CRABClient/blob/aabf7727858d637ce3f42e574b420285a0fefc72/src/python/CRABClient/Commands/submit.py#L448

ATM my best guess is that the problem is triggered by this line
https://github.com/dmwm/CRABClient/blob/aabf7727858d637ce3f42e574b420285a0fefc72/src/python/CRABClient/Commands/submit.py#L402
i.e. your cmsRun runs too fast no matter how many events are asked for; maybe the input file has only 1 event or so?

Indeed it is not documented anywhere that in dryRun cmsRun has to run for at least 25 seconds.

[1]

belforte@lxplus7101/TC3> crab submit --dryrun 
Will use CRAB configuration file crabConfig.py
Importing CMSSW configuration minimalPS.py
Finished importing CMSSW configuration minimalPS.py
Sending the request to the server at cmsweb.cern.ch
Success: Your task has been delivered to the prod CRAB3 server.
Task name: 230119_215751:belforte_crab_20230119_225732
Project dir: ./crab_20230119_225732
Waiting for task to be processed
Checking task status
Task status: NEW
Please wait...
Task status: NEW
Please wait...
Task status: UPLOADED

Created temporary directory for dry run sandbox in /tmp/belforte/tmp7rOemt
Executing test, please wait...

Using LumiBased splitting
Task consists of 2 jobs to process 2 lumis
The longest job will process 1 lumis, with an estimated processing time of 0 minutes
The average job will process 1 lumis, with an estimated processing time of 0 minutes
The shortest job will process 1 lumis, with an estimated processing time of 0 minutes
The estimated memory requirement is 556 MB

Timing quantities given below are ESTIMATES. Keep in mind that external factors
such as transient file-access delays can reduce estimate reliability.

An update to your splitting parameters is recommended.

For ~0 minute jobs, use:
Data.unitsPerJob = 0
You will need to submit a new task

Dry run requested: task paused
To continue processing, use 'crab proceed'

Log file is /afs/cern.ch/work/b/belforte/CRAB3/TC3/crab_20230119_225732/crab.log
belforte@lxplus7101/TC3> echo $CMSSW_BASE
/afs/cern.ch/work/b/belforte/CMSSW/SL7/CMSSW_10_6_12
belforte@lxplus7101/TC3> 

@belforte
Member

If my theory is correct, the only fix is to detect that the running time is not increasing in the loop and exit with a message. OTOH it is quite unrealistic that someone bothers to use CRAB for something which never runs longer than 25 seconds per file.
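
A stall check along these lines could look roughly like the sketch below. It is not the CRABClient code: `run_trial` is a hypothetical stand-in for one cmsRun invocation, and both the doubling of `events` and the `max_stalled` limit are assumptions made for illustration. Only the 25 second threshold comes from this thread.

```python
def estimate_with_stall_guard(run_trial, max_seconds=25.0, first_events=10, max_stalled=3):
    """Repeat trial runs with more events until one lasts at least max_seconds.

    run_trial(events) -> seconds is a hypothetical callable standing in for
    one cmsRun invocation. Raises RuntimeError if the runtime stops growing.
    """
    events = first_events
    total_seconds = 0.0
    best_seconds = 0.0   # longest trial seen so far
    stalled = 0          # consecutive trials that did not beat best_seconds

    while total_seconds < max_seconds:
        total_seconds = run_trial(events)
        if total_seconds > best_seconds:
            best_seconds = total_seconds
            stalled = 0
        else:
            stalled += 1
            if stalled >= max_stalled:
                raise RuntimeError(
                    "Running time is not increasing (last trial: %.1f secs with %d events); "
                    "the input may not contain enough events for an estimate."
                    % (total_seconds, int(events))
                )
        if total_seconds < max_seconds:
            events *= 2  # assumed scaling rule, the real client may differ

    return events, total_seconds
```

With trial times like the ones reported in this issue (around 14 to 15 seconds regardless of the requested events), the loop would give up after three trials that fail to beat the longest one instead of spinning forever.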

@mapellidario
Member Author

UH! Thanks Stefano! You found the culprit. I am sorry, I did not look at the submit dryrun code before opening this issue, my bad!

As I mentioned, in order to speed up the turnaround time when developing the jobwrapper, I am using skipEvents=cms.untracked.uint32(293) to process only 7 events in each job (apparently one lumisection of the usual HC datasets has 300 events). I know I should not even be able to do this, I know.

Today I tried to replicate this. The first time I tried, running a single dryrun job took 40s and everything worked smoothly. In the following attempts, the job took 15s and it got stuck. The time report showed a much shorter time spent on the job "init". If I have to guess, from the second time onward opening the connection to stream the input file is quicker.

it is quite unrealistic that someone bothers to use CRAB for something which never runs longer than 25 seconds per file.

I agree, we do not need to change any logic here.

We'll need to modify client so that more info is printed

I also agree with this, maybe we can simply add something like

            while totalJobSeconds < maxSeconds:
+               if totalJobSeconds != 0:
+                   self.logger.info("Last trial took only %s seconds. We are trying now with %s events", totalJobSeconds, events)
                optsList = getCMSRunAnalysisOpts('Job.submit', 'RunJobs.dag', job=1, events=events)

which gives

[...]
Created temporary directory for dry run sandbox in /tmp/dmapelli/tmptmfh7xu_
Executing test, please wait...
Last trial took only 15.3117 seconds. We are trying now with 10203.22273007123 events
Last trial took only 13.6517 seconds. We are trying now with 17630.833238060368 events
Last trial took only 14.7838 seconds. We are trying now with 29045.941361964517 events
Last trial took only 13.7008 seconds. We are trying now with 39820.45661652321 events
Last trial took only 14.1551 seconds. We are trying now with 50363.28580582182 events
...

It does not identify the "issue" we are discussing here, but it is pretty simple and can definitely help a bit. If you want I can open the PR. Otherwise we can close this issue straight away.

@belforte
Member

Sure, go ahead with the PR, but please cut those numbers to 1 decimal for the seconds and an integer for the events.
Maybe something like:

last trial took 10.2 secs, not enough for an estimate. Try again with 50539 events
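
For reference, that rounding can be done with plain printf-style formatting. A minimal sketch, where `format_retry_message` is a hypothetical helper and not a function in CRABClient:

```python
def format_retry_message(seconds, events):
    """Build the retry message with 1 decimal for seconds and whole events."""
    return (
        "last trial took %.1f secs, not enough for an estimate. "
        "Try again with %d events" % (seconds, int(events))
    )


# Values taken from the first line of the printout earlier in this thread.
print(format_retry_message(15.3117, 10203.22273007123))
```

The same `%.1f` and `%d` placeholders also work when passed directly to a `logging` call such as `logger.info(fmt, seconds, int(events))`, which keeps the formatting lazy.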

@belforte
Member

Of course nobody knows better than you that skipEvents should never be used with CRAB!

@belforte
Member

The most annoying part of dryrun imho is indeed that it prints nothing on stdout and keeps you waiting for quite some time. I'd prefer that it prints the cmsRun stdout and tells you when it issues cmsRun, so that one can see that the time is spent in CMSSW initialization, not in CRAB things. But I still think that we should rather put our time into unifying with preparelocal.
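
Echoing a child process's output while it runs can be sketched with the standard library alone. This is a generic illustration assuming Python 3.7 or later, not what the client does: `run_and_echo` is a hypothetical helper, and the example command is a trivial Python one-liner rather than cmsRun.

```python
import subprocess
import sys


def run_and_echo(cmd, echo=print):
    """Run cmd and forward each line of its output to echo as it is produced."""
    echo("Invoking: %s" % " ".join(cmd))
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same stream
        text=True,
    )
    for line in proc.stdout:       # yields lines as the child writes them
        echo(line.rstrip("\n"))
    return proc.wait()             # exit code of the child process


if __name__ == "__main__":
    run_and_echo([sys.executable, "-c", "print('initializing'); print('done')"])
```

Passing a logger method as `echo` would send the same lines to the log file instead of the terminal.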

@belforte
Member

The printout was fixed in dmwm/CRABClient#5184. Can close.
