Improve exception handling in endStream #43831

makortel · 2024-01-31T20:04:28Z

Currently in endStream() for a given stream, if one Worker::endStream() throws an exception, the endStream() of remaining workers gets ignored in this loop

cmssw/FWCore/Framework/src/WorkerManager.cc

Lines 129 to 133 in 695ed78

    
           void WorkerManager::endStream(StreamID iID, StreamContext& streamContext) { 
        
             for (auto& worker : allWorkers_) { 
        
               worker->endStream(iID, streamContext); 
        
             } 
        
           }

We should change this loop to similar as WorkerManager::endJob()

cmssw/FWCore/Framework/src/WorkerManager.cc

Lines 83 to 91 in 695ed78

    
           void WorkerManager::endJob(ExceptionCollector& collector) { 
        
             for (auto& worker : allWorkers_) { 
        
               try { 
        
                 convertException::wrap([&]() { worker->endJob(); }); 
        
               } catch (cms::Exception const& ex) { 
        
                 collector.addException(ex); 
        
               } 
        
             } 
        
           }

that each worker's endStream() is in its own try-catch block.

In addition, we should add logic (if it isn't there yet) that in case worker's beginStream() throws an exception, the endStream() should be called only for those workers whose beginStream() succeeded.

(this issue spurred from #43814 and in particular #38260)

The text was updated successfully, but these errors were encountered:

makortel · 2024-01-31T20:04:33Z

assign core

cmsbuild · 2024-01-31T20:04:46Z

New categories assigned: core

@Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild · 2024-01-31T20:04:48Z

cms-bot internal usage

cmsbuild · 2024-01-31T20:04:48Z

A new Issue was created by @makortel Matti Kortelainen.

@smuzaffar, @makortel, @Dr15Jones, @sextonkennedy, @antoniovilela, @rappoccio can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

wddgit · 2024-03-01T17:37:16Z

I'm working to implement the first part of the request in the initial comment above. I should have a PR soon.

Regarding the second part, I think this is not needed. beginJob and beginStream are not like beginRun or beginLumi. When there is an exception in beginJob or beginStream the process just shuts down immediately. We currently do not try to execute the endJob or endStream transitions at all. endJob and endStream are not run for any module nor any stream. This makes sense because there is not any data we can even imagine we might recover. The situation is different for runs and lumis, because we might imagine we've already processed many runs and lumis successfully and we want to exit cleanly to save those results either permanently or at least for debugging (even in that case I've always wondered if anyone actually uses this ability...).

There is only one beginJob transition and beginStream transition. If they fail, then nothing useful has happened yet that would be worth saving. There is no output file. The best thing is just to exit as soon as possible and try to get the exception message printed to help debug the problem. Any action after risks a crash (seg fault) instead of a clean exit with a non-zero exit value. There's no benefit. I think we are doing the right thing in this case already.

Maybe I am missing something. Is there a purpose to run endStream and endJob after an exception in beginJob or beginStream?

Dr15Jones · 2024-03-01T19:36:11Z

Maybe I am missing something. Is there a purpose to run endStream and endJob after an exception in beginJob or beginStream?

The only thing I could think of if in the logic of the destructor, it assumes that the 'end' was called if the 'begin' was called and if it isn't, it would lead to a crash. I don't really think such behavior is reasonable nor very likely.

makortel · 2024-03-01T21:19:32Z

(should have recorded this earlier, IIRC the context for this issue was #43814 and in particular #38260)

We have

cmssw/HeterogeneousCore/SonicTriton/interface/TritonEDProducer.h

Lines 29 to 34 in 85c8117

    
           //destroy client before destructor called to unregister any shared memory before TritonService shuts down fallback server 
        
           virtual void tritonEndStream() {} 
        
           void endStream() final { 
        
             tritonEndStream(); 
        
             this->client_.reset(); 
        
           }

(the this->client_ is of type TritonClient, and is constructed in beginStream()).

Looking at TritonService, when it starts the fallback server, it does it in preBeginJob, and shuts it down in postEndJob.

If an exception is thrown during beginJob/beginStream, do we call the ActivityRegistry pre/postEndJob signals?

I have a feeling this "communicating to external thing" use case might need a bit more thought (and possibly changes in the Sonic side).

wddgit · 2024-03-01T22:03:51Z

If an exception is thrown during beginJob/beginStream, do we call the ActivityRegistry pre/postEndJob signals?

No. Not in the existing version of the code. If there is a reason, I suppose we could change that.

wddgit · 2024-03-01T22:09:48Z

In cmsRun.cpp, I think proc.on is telling the sentry to call endJob if we are exiting with an exception. It is only turned on after beginJob.

wddgit · 2024-03-01T22:25:41Z

Not that it matters, but I looked in the history. It's been that way since April 25 2006. Bill added that.

makortel · 2024-03-04T20:46:52Z

Notes from discussion

When looping over the callbacks of a signal, if one callback throws an exception, do we process the rest of the callbacks?
- Order of callbacks is unspecified, would be good to process all callbacks of a signal
If preX signal is run, the postX signal should be run too (this should already be the current behavior)
For data processing transitions, we want the endTransition to run for a module only if a beginTransition was successfully run for that module (really about Unexpected behavior when exception thrown at begin Run #42501)
- If a module's beginTransition throws an exception, the corresponding endTransition should not be called
Regarding summary producing, like lumi summary. The ideal behavior would be that if a streamBeginLumi throws an exception, follow the semantics of a stream skipping an empty lumi and process the lumiSummary by skipping the stream that threw the exception; if a streamEndLumi throws an exception, skip the summary production.
- We don't necessarily have to implement it this way though

wddgit · 2024-03-07T16:42:31Z

Adding to notes from our discussion:

We agreed that we would add more WorkerManagers so that we could support the Worker containing a single bool that records whether the begin transition succeeded. This will require having 3 WorkerManagers in StreamSchedule instead of 1
WorkerManager, one for runs, one for lumi, and one for beginStream/endStream. There is 1 StreamSchedule per stream. The number of WorkerManagers devoted to streams triples.

For global transitions, there is already 1 WorkerManager per concurrent run plus 1 WorkerManager per concurrent lumi. We currently reuse the 0th WorkerManager in the vector that holds the run and lumi global WorkerManagers for beginJob/endJob and beginProcessBlock/endProcessBlock. So we would need to add 2 more WorkerManagers to handle those global cases.

The alternative would be to add 4 bool's to each Worker.

bool beginJobOrStreamSucceeded_;
bool beginProcessBlockSucceeded_;
bool beginRunSucceeded_;
bool beginLumiSucceeded_;

I think it will work either way. The second way would save a little memory and there would be less WorkerManagers... Maybe this is not significant. The downside of the second approach is that some Workers would actually use only 1 of the 4 bools and some would use 3 of the 4 bools and there is a little additional complication dealing with that.

I mention it now because there will be some nontrivial rework to implement it one way and then change to other way because we change our minds. I wanted to document that decision. I'm starting down the path of implementation with additional WorkerManagers now.

cmsbuild added core-pending pending-signatures labels Jan 31, 2024

makortel mentioned this issue Jan 31, 2024

Improve exception handling in {begin,end}{Job,Stream,ProcessBlock} cms-sw/framework-team#811

Closed

wddgit mentioned this issue Apr 4, 2024

Improve behavior after exception in begin/end stream lumi #44624

Merged

wddgit mentioned this issue Apr 24, 2024

Improve behavior after exception in begin/end global lumi #44840

Merged

wddgit mentioned this issue May 22, 2024

Improve behavior after exception in begin/end run transitions #45017

Merged

wddgit mentioned this issue Jul 11, 2024

Improve Framework behavior after exceptions in begin/end transitions (Job, Stream, ProcessBlock) #45434

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve exception handling in endStream #43831

Improve exception handling in endStream #43831

makortel commented Jan 31, 2024 •

edited

Loading

makortel commented Jan 31, 2024

cmsbuild commented Jan 31, 2024

cmsbuild commented Jan 31, 2024 •

edited

Loading

cmsbuild commented Jan 31, 2024

wddgit commented Mar 1, 2024

Dr15Jones commented Mar 1, 2024

makortel commented Mar 1, 2024

wddgit commented Mar 1, 2024

wddgit commented Mar 1, 2024

wddgit commented Mar 1, 2024

makortel commented Mar 4, 2024

wddgit commented Mar 7, 2024

Improve exception handling in endStream #43831

Improve exception handling in endStream #43831

Comments

makortel commented Jan 31, 2024 • edited Loading

makortel commented Jan 31, 2024

cmsbuild commented Jan 31, 2024

cmsbuild commented Jan 31, 2024 • edited Loading

cmsbuild commented Jan 31, 2024

wddgit commented Mar 1, 2024

Dr15Jones commented Mar 1, 2024

makortel commented Mar 1, 2024

wddgit commented Mar 1, 2024

wddgit commented Mar 1, 2024

wddgit commented Mar 1, 2024

makortel commented Mar 4, 2024

wddgit commented Mar 7, 2024

makortel commented Jan 31, 2024 •

edited

Loading

cmsbuild commented Jan 31, 2024 •

edited

Loading