
change how we handle user sandbox #7461

Closed
belforte opened this issue Nov 17, 2022 · 6 comments

@belforte
Member

belforte commented Nov 17, 2022

Action item from the postmortem "CRAB TaskWorker went down after too many task submissions".

Maybe there was a special reason in the past, but currently we could simply download sandbox.tar.gz from S3 in the scheduler, e.g. as part of https://github.com/dmwm/CRABServer/blob/master/scripts/AdjustSites.py
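
For illustration, a minimal sketch of what such a download could look like in the scheduler; the presigned URL, the helper name, and the local paths are assumptions for this sketch, not the actual CRABServer API:

```python
import tarfile
import requests  # assumes requests is available on the schedd

def downloadSandbox(presignedUrl, dest='sandbox.tar.gz'):
    """Fetch the user sandbox from S3 and unpack it into the spool directory."""
    with requests.get(presignedUrl, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(dest, 'wb') as fh:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                fh.write(chunk)
    with tarfile.open(dest) as tar:
        tar.extractall()
```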

Problem:

  • user sandboxes use a lot of disk on the TW and may quickly accumulate in case of repeated submissions.
  • architecturally, the TW has no need for the sandbox content, so why should it touch it at all?
  • spooling sandboxes during DAG submission adds load to HTCondor and makes operations take more time.

Things which we can do:

  • remove the sandbox from the TW disk once the submission process is over (even if it failed, a new crab submit will create a new tmp directory anyhow). A partial solution; see the sketch after this list.
  • go all the way and download only in the scheduler. The scheduler code will also have to take care of the debug subdirectory.
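
A minimal sketch of the partial solution; the submission step and the tmp-directory layout are assumptions for illustration:

```python
import shutil

def submitAndCleanup(submitFunc, task, tmpDir):
    """Run the submission step, then always drop the per-task tmp directory."""
    try:
        submitFunc(task)
    finally:
        # even if the submission failed, a new 'crab submit' creates a
        # fresh tmp directory, so removing this one is always safe
        shutil.rmtree(tmpDir, ignore_errors=True)
```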

Caveats:

@mapellidario
Member

Wa and I had a quick chat about this. We had a simple concern: how should we deal with HTTP errors when downloading sandboxes from S3 directly into the schedd? With this option we would have a task that is properly submitted, with a running dagman_bootstrap, but that fails to retrieve the sandbox and cannot submit jobs to the vanilla universe. How long do we keep trying to download the sandbox? How many attempts do we make? Should we put the dagman on hold if it fails for more than 1 day? Or should we just kill/remove it?

I am sorry, I have more questions than good proposals at the moment, but in general I like the idea: we should try to avoid keeping files where they are not 100% needed.
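
For concreteness, one possible retry policy, purely illustrative: a few attempts with exponential backoff, after which we give up and let the bootstrap abort the task:

```python
import time
import requests

def fetchWithRetries(url, dest, attempts=5, firstDelay=30):
    """Download 'url' to 'dest', retrying transient HTTP errors a few times."""
    delay = firstDelay
    for attempt in range(1, attempts + 1):
        try:
            resp = requests.get(url, timeout=60)
            resp.raise_for_status()
            with open(dest, 'wb') as fh:
                fh.write(resp.content)
            return
        except requests.RequestException:
            if attempt == attempts:
                raise  # let the dagman bootstrap abort the task
            time.sleep(delay)
            delay *= 2  # exponential backoff between attempts
```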

@belforte
Member Author

belforte commented Nov 17, 2022

There is no clear-cut answer to those (valid) concerns. But keep in mind that the dagman bootstrap process already makes a few calls to the CRAB REST interface, so nothing is new on that side; talking with S3 will be a new dependency. I do not know why the original developers decided to send the sandbox via condor file spooling, but it is possible that in the original implementation there was no communication from the scheduler to the CRAB REST interface. We have a lot of such situations: original decisions stuck around even after the original motivations stopped being valid, different people took different decisions for similar things, something was done to mitigate the then-problem-of-the-day which eventually got solved otherwise, etc. Very few of those decisions have been documented.

So we are free to take the decision which we think is best, and will have to live with the consequences.

Note on failures in bootstrap: currently, if something goes wrong in the dagman bootstrap & co. (e.g. it can't talk with the REST), everything is aborted, crab status will report "task failed to bootstrap", and the user submits again. If things go horribly wrong, the bootstrap does not manage to abort, the task simply gets stuck in "waiting to bootstrap", and the crab status command will print "If this persists report it to ..computingTools…"

@novicecpp
Contributor

@belforte
Member Author

belforte commented Jul 5, 2024

Let's start by keeping the tmp directory for a shorter time: #8542
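
In the spirit of that issue, a hypothetical age-based sweep; the directory layout and retention period are assumptions, not what #8542 actually implements:

```python
import os
import shutil
import time

def sweepOldTmpDirs(baseDir, maxAgeHours=24):
    """Delete task tmp directories under baseDir older than maxAgeHours."""
    cutoff = time.time() - maxAgeHours * 3600
    for name in os.listdir(baseDir):
        path = os.path.join(baseDir, name)
        if os.path.isdir(path) and os.path.getmtime(path) < cutoff:
            shutil.rmtree(path, ignore_errors=True)
```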

@novicecpp
Contributor

I am looking at this issue today.

  • crab submit --dryrun breaks because InputFiles.tar.gz does not contain sandbox.tar.gz anymore. But we can ignore it, because we want to remove it anyway.
  • Need to check crab preparelocal.

@belforte
Member Author

Once #6544 is done, the user sandbox will not be downloaded to the TW tmp disk any more.
The original problem will be gone.

No further action is needed.
