Improve exception handling in endStream #43831
Comments
assign core |
New categories assigned: core @Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks |
cms-bot internal usage |
A new Issue was created by @makortel Matti Kortelainen. @smuzaffar, @makortel, @Dr15Jones, @sextonkennedy, @antoniovilela, @rappoccio can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
I'm working to implement the first part of the request in the initial comment above; I should have a PR soon.

Regarding the second part, I think it is not needed. beginJob and beginStream are not like beginRun or beginLumi. When there is an exception in beginJob or beginStream, the process just shuts down immediately. We currently do not try to execute the endJob or endStream transitions at all; endJob and endStream are not run for any module nor any stream. This makes sense because there is no data we can even imagine recovering. The situation is different for runs and lumis, because we might have already processed many runs and lumis successfully and want to exit cleanly to save those results, either permanently or at least for debugging (even in that case I've always wondered if anyone actually uses this ability...).

There is only one beginJob transition and one beginStream transition. If they fail, nothing useful has happened yet that would be worth saving; there is no output file. The best thing is just to exit as soon as possible and try to get the exception message printed to help debug the problem. Any action after that risks a crash (seg fault) instead of a clean exit with a non-zero exit value, and there's no benefit. I think we are doing the right thing in this case already.

Maybe I am missing something. Is there a purpose to running endStream and endJob after an exception in beginJob or beginStream? |
The only thing I could think of is if the logic of the destructor assumes that the 'end' was called whenever the 'begin' was called, in which case not calling it would lead to a crash. I don't really think such behavior is reasonable nor very likely. |
(should have recorded this earlier; IIRC the context for this issue was #43814 and in particular #38260) We have cmssw/HeterogeneousCore/SonicTriton/interface/TritonEDProducer.h Lines 29 to 34 in 85c8117
(the this->client_ there is of type TritonClient, and is constructed in beginStream()).
Looking at TritonService, when it starts the fallback server, it does it in preBeginJob and shuts it down in postEndJob. If an exception is thrown during beginJob/beginStream, do we call the postEndJob signal? I have a feeling this "communicating to an external thing" use case might need a bit more thought (and possibly changes on the Sonic side). |
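To illustrate the hazard described in the comment above, here is a minimal sketch (the class and member names are hypothetical, not the actual TritonEDProducer code): when a member is only constructed in beginStream(), running endStream() after a failed beginStream() dereferences an unset pointer.

```cpp
// Hypothetical illustration only; names are made up and do not come from the
// Sonic code.
#include <memory>

struct FakeClient {
  void dispatch() { /* e.g. notify an external inference server */ }
};

class ExampleStreamProducer {
public:
  void beginStream() {
    // If this (or anything before it) throws, client_ stays null.
    client_ = std::make_unique<FakeClient>();
  }
  void endStream() {
    // Null dereference if beginStream() never completed successfully.
    client_->dispatch();
    client_.reset();
  }

private:
  std::unique_ptr<FakeClient> client_;
};
```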
No. Not in the existing version of the code. If there is a reason, I suppose we could change that. |
In cmsRun.cpp, I think proc.on is telling the sentry to call endJob if we are exiting with an exception. It is only turned on after beginJob. |
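A rough sketch of the sentry pattern as described above (illustrative names only, not the real cmsRun.cpp types): the destructor calls endJob() only when unwinding from an exception and only if on() was invoked, and on() is invoked only after beginJob() has returned successfully.

```cpp
// Illustrative-only sketch of "call endJob on exceptional exit, but only after
// beginJob succeeded"; this is not the actual cmsRun.cpp code.
#include <exception>

class EventProcessorLike {
public:
  void beginJob() { /* setup; may throw */ }
  void endJob() { /* flush and close resources */ }
};

class Sentry {
public:
  explicit Sentry(EventProcessorLike& ep) : ep_(ep) {}
  ~Sentry() {
    // Run endJob() only when unwinding due to an exception AND only if the
    // sentry was turned on (i.e. beginJob() had already succeeded).
    if (callEndJob_ && std::uncaught_exceptions() > 0) {
      try {
        ep_.endJob();
      } catch (...) {
        // already unwinding; suppress secondary exceptions
      }
    }
  }
  void on() { callEndJob_ = true; }

private:
  EventProcessorLike& ep_;
  bool callEndJob_ = false;
};

int main() {
  EventProcessorLike ep;
  Sentry proc(ep);
  ep.beginJob();  // if this throws, the sentry does NOT call endJob()
  proc.on();      // from here on, an exceptional exit will still run endJob()
  // ... run the job; on a normal exit endJob() is called explicitly ...
  ep.endJob();
}
```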
Not that it matters, but I looked in the history. It's been that way since April 25 2006. Bill added that. |
Notes from discussion
|
Adding to notes from our discussion: We agreed that we would add more WorkerManagers so that we could support the Worker containing a single bool that records whether the begin transition succeeded. This will require having 3 WorkerManagers in StreamSchedule instead of 1. For global transitions, there is already 1 WorkerManager per concurrent run plus 1 WorkerManager per concurrent lumi. We currently reuse the 0th WorkerManager in the vector that holds the run and lumi global WorkerManagers for beginJob/endJob and beginProcessBlock/endProcessBlock, so we would need to add 2 more WorkerManagers to handle those global cases.

The alternative would be to add 4 bools to each Worker, e.g. bool beginJobOrStreamSucceeded_;. I think it will work either way. The second way would save a little memory and there would be fewer WorkerManagers; maybe this is not significant. The downside of the second approach is that some Workers would actually use only 1 of the 4 bools and some would use 3 of the 4, and there is a little additional complication dealing with that. I mention it now because there will be some nontrivial rework if we implement it one way and then change to the other way because we change our minds. I wanted to document that decision. I'm starting down the path of implementation with additional WorkerManagers now. |
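For reference, a minimal sketch of what the second (bool-per-Worker) alternative could look like; the names below are placeholders, not the actual FWCore Worker interface.

```cpp
// Placeholder sketch of the bool-per-Worker alternative; not the real Worker class.
class WorkerSketch {
public:
  void beginStream() {
    doBeginStream();               // the module's beginStream(); may throw
    beginStreamSucceeded_ = true;  // reached only if no exception was thrown
  }
  void endStreamIfBegun() {
    // Run the end transition only if the matching begin transition succeeded.
    if (beginStreamSucceeded_) {
      doEndStream();
      beginStreamSucceeded_ = false;
    }
  }

private:
  void doBeginStream() { /* call the underlying module */ }
  void doEndStream() { /* call the underlying module */ }
  bool beginStreamSucceeded_ = false;  // one of the flags mentioned above
};
```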
Currently in endStream() for a given stream, if one Worker::endStream() throws an exception, the endStream() of the remaining workers gets ignored in this loop:

cmssw/FWCore/Framework/src/WorkerManager.cc Lines 129 to 133 in 695ed78

We should change this loop to be similar to WorkerManager::endJob()

cmssw/FWCore/Framework/src/WorkerManager.cc Lines 83 to 91 in 695ed78

so that each worker's endStream() is in its own try-catch block.

In addition, we should add logic (if it isn't there yet) so that in case a worker's beginStream() throws an exception, endStream() is called only for those workers whose beginStream() succeeded.

(this issue spurred from #43814 and in particular #38260)
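For clarity, here is a hedged sketch of the proposed loop shape (function and type names are made up; the real code lives in WorkerManager): each worker's endStream() runs in its own try/catch, the first exception is remembered, and the remaining workers still get their endStream() call before the exception is rethrown.

```cpp
// Sketch only; illustrates the per-worker try/catch pattern described above,
// not the actual WorkerManager implementation.
#include <exception>
#include <vector>

struct WorkerSketch {
  void endStream() { /* module-specific cleanup; may throw */ }
};

void endStreamAll(std::vector<WorkerSketch*> const& workers) {
  std::exception_ptr firstException;
  for (WorkerSketch* worker : workers) {
    try {
      worker->endStream();
    } catch (...) {
      if (!firstException) {
        firstException = std::current_exception();  // remember the first failure
      }
    }
  }
  if (firstException) {
    std::rethrow_exception(firstException);  // report after every worker has run
  }
}
```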