Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve exception handling in endStream #43831

Open
makortel opened this issue Jan 31, 2024 · 12 comments
Open

Improve exception handling in endStream #43831

makortel opened this issue Jan 31, 2024 · 12 comments

Comments

@makortel
Copy link
Contributor

makortel commented Jan 31, 2024

Currently in endStream() for a given stream, if one Worker::endStream() throws an exception, the endStream() of remaining workers gets ignored in this loop

void WorkerManager::endStream(StreamID iID, StreamContext& streamContext) {
for (auto& worker : allWorkers_) {
worker->endStream(iID, streamContext);
}
}

We should change this loop to similar as WorkerManager::endJob()

void WorkerManager::endJob(ExceptionCollector& collector) {
for (auto& worker : allWorkers_) {
try {
convertException::wrap([&]() { worker->endJob(); });
} catch (cms::Exception const& ex) {
collector.addException(ex);
}
}
}

that each worker's endStream() is in its own try-catch block.

In addition, we should add logic (if it isn't there yet) that in case worker's beginStream() throws an exception, the endStream() should be called only for those workers whose beginStream() succeeded.

(this issue spurred from #43814 and in particular #38260)

@makortel
Copy link
Contributor Author

assign core

@cmsbuild
Copy link
Contributor

New categories assigned: core

@Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Copy link
Contributor

cmsbuild commented Jan 31, 2024

cms-bot internal usage

@cmsbuild
Copy link
Contributor

A new Issue was created by @makortel Matti Kortelainen.

@smuzaffar, @makortel, @Dr15Jones, @sextonkennedy, @antoniovilela, @rappoccio can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@wddgit
Copy link
Contributor

wddgit commented Mar 1, 2024

I'm working to implement the first part of the request in the initial comment above. I should have a PR soon.

Regarding the second part, I think this is not needed. beginJob and beginStream are not like beginRun or beginLumi. When there is an exception in beginJob or beginStream the process just shuts down immediately. We currently do not try to execute the endJob or endStream transitions at all. endJob and endStream are not run for any module nor any stream. This makes sense because there is not any data we can even imagine we might recover. The situation is different for runs and lumis, because we might imagine we've already processed many runs and lumis successfully and we want to exit cleanly to save those results either permanently or at least for debugging (even in that case I've always wondered if anyone actually uses this ability...).

There is only one beginJob transition and beginStream transition. If they fail, then nothing useful has happened yet that would be worth saving. There is no output file. The best thing is just to exit as soon as possible and try to get the exception message printed to help debug the problem. Any action after risks a crash (seg fault) instead of a clean exit with a non-zero exit value. There's no benefit. I think we are doing the right thing in this case already.

Maybe I am missing something. Is there a purpose to run endStream and endJob after an exception in beginJob or beginStream?

@Dr15Jones
Copy link
Contributor

Maybe I am missing something. Is there a purpose to run endStream and endJob after an exception in beginJob or beginStream?

The only thing I could think of if in the logic of the destructor, it assumes that the 'end' was called if the 'begin' was called and if it isn't, it would lead to a crash. I don't really think such behavior is reasonable nor very likely.

@makortel
Copy link
Contributor Author

makortel commented Mar 1, 2024

(should have recorded this earlier, IIRC the context for this issue was #43814 and in particular #38260)

We have

//destroy client before destructor called to unregister any shared memory before TritonService shuts down fallback server
virtual void tritonEndStream() {}
void endStream() final {
tritonEndStream();
this->client_.reset();
}

(the this->client_ is of type TritonClient, and is constructed in beginStream()).

Looking at TritonService, when it starts the fallback server, it does it in preBeginJob, and shuts it down in postEndJob.

If an exception is thrown during beginJob/beginStream, do we call the ActivityRegistry pre/postEndJob signals?

I have a feeling this "communicating to external thing" use case might need a bit more thought (and possibly changes in the Sonic side).

@wddgit
Copy link
Contributor

wddgit commented Mar 1, 2024

If an exception is thrown during beginJob/beginStream, do we call the ActivityRegistry pre/postEndJob signals?

No. Not in the existing version of the code. If there is a reason, I suppose we could change that.

@wddgit
Copy link
Contributor

wddgit commented Mar 1, 2024

In cmsRun.cpp, I think proc.on is telling the sentry to call endJob if we are exiting with an exception. It is only turned on after beginJob.

@wddgit
Copy link
Contributor

wddgit commented Mar 1, 2024

Not that it matters, but I looked in the history. It's been that way since April 25 2006. Bill added that.

@makortel
Copy link
Contributor Author

makortel commented Mar 4, 2024

Notes from discussion

  • When looping over the callbacks of a signal, if one callback throws an exception, do we process the rest of the callbacks?
    • Order of callbacks is unspecified, would be good to process all callbacks of a signal
  • If preX signal is run, the postX signal should be run too (this should already be the current behavior)
  • For data processing transitions, we want the endTransition to run for a module only if a beginTransition was successfully run for that module (really about Unexpected behavior when exception thrown at begin Run #42501)
    • If a module's beginTransition throws an exception, the corresponding endTransition should not be called
  • Regarding summary producing, like lumi summary. The ideal behavior would be that if a streamBeginLumi throws an exception, follow the semantics of a stream skipping an empty lumi and process the lumiSummary by skipping the stream that threw the exception; if a streamEndLumi throws an exception, skip the summary production.
    • We don't necessarily have to implement it this way though

@wddgit
Copy link
Contributor

wddgit commented Mar 7, 2024

Adding to notes from our discussion:

We agreed that we would add more WorkerManagers so that we could support the Worker containing a single bool that records whether the begin transition succeeded. This will require having 3 WorkerManagers in StreamSchedule instead of 1
WorkerManager, one for runs, one for lumi, and one for beginStream/endStream. There is 1 StreamSchedule per stream. The number of WorkerManagers devoted to streams triples.

For global transitions, there is already 1 WorkerManager per concurrent run plus 1 WorkerManager per concurrent lumi. We currently reuse the 0th WorkerManager in the vector that holds the run and lumi global WorkerManagers for beginJob/endJob and beginProcessBlock/endProcessBlock. So we would need to add 2 more WorkerManagers to handle those global cases.

The alternative would be to add 4 bool's to each Worker.

bool beginJobOrStreamSucceeded_;
bool beginProcessBlockSucceeded_;
bool beginRunSucceeded_;
bool beginLumiSucceeded_;

I think it will work either way. The second way would save a little memory and there would be less WorkerManagers... Maybe this is not significant. The downside of the second approach is that some Workers would actually use only 1 of the 4 bools and some would use 3 of the 4 bools and there is a little additional complication dealing with that.

I mention it now because there will be some nontrivial rework to implement it one way and then change to other way because we change our minds. I wanted to document that decision. I'm starting down the path of implementation with additional WorkerManagers now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants