Snapshot on signal #2253

Merged
merged 1 commit into from
Aug 22, 2015

jyegerlehner
Contributor

This is an implementation of the feature discussed in Issue 2012.

When you hit Ctrl-C to kill caffe while it is training, it now saves a snapshot before exiting. More precisely, the Solver just stops training early, and a snapshot is saved only if snapshot_after_train is true.

This is the default behavior, which is configurable via the sigint_effect and sighup_effect command-line options. Also by default, the SIGHUP signal causes caffe to save a snapshot and continue training, so you can make caffe save a snapshot by sending it SIGHUP, e.g.:

kill -SIGHUP PID

where PID is the process id of caffe, which you can find by doing ps -ef | grep caffe.
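
As a rough illustration of how option values like these could be turned into actions, here is a minimal C++ sketch. The accepted strings ("stop", "snapshot", "none") and the names SignalEffect and ParseSignalEffect are assumptions made for this example; they are not quoted from the patch.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the solver's action type.
enum class SignalEffect { NONE, STOP, SNAPSHOT };

// Maps an option value, such as one given to sigint_effect or
// sighup_effect, onto an action. The accepted strings are assumed.
SignalEffect ParseSignalEffect(const std::string& value) {
  if (value == "stop") return SignalEffect::STOP;
  if (value == "snapshot") return SignalEffect::SNAPSHOT;
  if (value == "none") return SignalEffect::NONE;
  throw std::invalid_argument("unknown signal effect: " + value);
}
```

Under the defaults described above, the SIGINT option would map to STOP and the SIGHUP option to SNAPSHOT.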

The design has two pieces to it:

1. Solver is modified slightly so that after each iteration it checks to see if its client wants it to either snapshot or exit. It does this via a callback function that a client can set on the Solver instance. If the callback hasn't been set, it just carries on as usual. So there is no breaking change to the Solver interface and the behavior of existing code shouldn't change.
2. The caffe executable provides the callback function to the Solver, and the callback is implemented on a SignalHandler that intercepts SIGINT and SIGHUP.
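
The first piece of this design can be sketched roughly as below. This is a toy model, not the patch itself: ToySolver, Action, SetActionFunction and Solve are hypothetical names, one "iteration" is just a counter increment, and "snapshotting" only bumps a counter. The snapshot_after_train behavior is deliberately left out.

```cpp
#include <functional>

// Hypothetical stand-in for the action the client can request.
enum class Action { NONE, STOP, SNAPSHOT };

class ToySolver {
 public:
  typedef std::function<Action()> ActionCallback;

  // Optional: if never called, Solve() behaves as it did before.
  void SetActionFunction(ActionCallback cb) { callback_ = cb; }

  // Runs up to max_iter iterations; returns how many were completed.
  int Solve(int max_iter) {
    int iter = 0;
    while (iter < max_iter) {
      ++iter;  // stand-in for one training iteration
      // Ask the client, if there is one, what it wants after this step.
      const Action request = callback_ ? callback_() : Action::NONE;
      if (request == Action::SNAPSHOT) {
        ++snapshots_;  // stand-in for writing a snapshot, then continue
      } else if (request == Action::STOP) {
        break;  // leave the loop early
      }
    }
    return iter;
  }

  int snapshots() const { return snapshots_; }

 private:
  ActionCallback callback_;
  int snapshots_ = 0;
};
```

Because the callback is optional, code that never installs one sees no change in behavior, which is the compatibility point made in item 1.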

@jyegerlehner jyegerlehner changed the title Add signal handler and early exit/snapshot to Solver. Snapshot on signal Apr 4, 2015
@jyegerlehner jyegerlehner mentioned this pull request Apr 6, 2015
@flx42
Contributor

flx42 commented Apr 18, 2015

Looks cool, but I don't think you can safely lock a mutex in a signal handler, nor try to perform I/O.

@jyegerlehner
Contributor Author

Thanks @flx42, I should have researched signals a bit more before I implemented this.

@jyegerlehner
Contributor Author

@flx42 do you see any problem with this new implementation?

@flx42
Contributor

flx42 commented Apr 18, 2015

The type should be "sig_atomic_t volatile".
Also, since the variables are no longer of type "bool", it might be better to use "1" and "0" instead of "true" and "false".

After that, I think it will be fine :)
👍
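
For readers following along, the pattern under discussion (a handler that only sets flags, with the real work done later by polling code) might look like the sketch below. It is an illustration under assumptions, not the code in this PR: every identifier is hypothetical, and it uses POSIX sigaction.

```cpp
#include <signal.h>
#include <cstring>

namespace {
// Written by the handler, read by ordinary code.
volatile sig_atomic_t sketch_got_sigint = 0;
volatile sig_atomic_t sketch_got_sighup = 0;

// Only stores to sig_atomic_t flags: no locks, no I/O, no allocation.
void SketchHandler(int sig) {
  if (sig == SIGINT) {
    sketch_got_sigint = 1;
  } else if (sig == SIGHUP) {
    sketch_got_sighup = 1;
  }
}

// Reports whether the flag was set, and clears it.
bool Consume(volatile sig_atomic_t* flag) {
  if (!*flag) return false;
  *flag = 0;
  return true;
}
}  // namespace

// Installs SketchHandler for sig; returns true on success.
bool HookSketchSignal(int sig) {
  struct sigaction sa;
  std::memset(&sa, 0, sizeof(sa));
  sa.sa_handler = &SketchHandler;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = SA_RESTART;  // restart interrupted system calls if possible
  return sigaction(sig, &sa, NULL) == 0;
}

// Polled from non-handler code, e.g. once per solver iteration.
bool ConsumeSigint() { return Consume(&sketch_got_sigint); }
bool ConsumeSighup() { return Consume(&sketch_got_sighup); }
```

Anything that needs a mutex or file I/O, such as writing a snapshot, then happens in the code that polls ConsumeSigint() or ConsumeSighup(), outside the handler. A signal that arrives between the check and the clear in Consume() can be folded into the previous one, which is harmless for a "please snapshot" request.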

@jyegerlehner
Contributor Author

Thanks for the reply.

The type should be "sig_atomic_t volatile".

Hrmm.. all the examples I see, the qualifier precedes the type name, same as with "const" or "mutable". Please let me know if I'm missing something.

since the variables are no longer of type "bool", it might be better to use "1" and "0" instead of "true" and "false".

true and false are promoted to 1 and 0 per the language spec I'm pretty sure. Since they better convey the meaning, I'm inclined to leave the true and false in there.

@flx42
Contributor

flx42 commented Apr 18, 2015

The type should be "sig_atomic_t volatile".

Hrmm.. all the examples I see, the qualifier precedes the type name, same as with "const" or "mutable". Please let me know if I'm missing something.

No, you're right. That's what I meant.

since the variables are no longer of type "bool", it might be better to use "1" and "0" instead of "true" and "false".

true and false are promoted to 1 and 0 per the language spec I'm pretty sure. Since they better convey the meaning, I'm inclined to leave the true and false in there.

It was really a nitpicking comment about style. But both ways will be fine.

@flx42
Contributor

flx42 commented Apr 20, 2015

Another nitpick: you should avoid unnecessary empty-line changes in your patch.
Like this: jyegerlehner@c0df909#diff-ed792b89816c46f109918495c88421bcL3

@lukeyeager
Contributor

Any update on this? I want it.

@jyegerlehner
Contributor Author

Hi Luke, I'm not aware of anything deficient about it.

@shelhamer shelhamer added the JL label May 13, 2015
@jyegerlehner
Contributor Author

Well, there is one thing about this that is debatable. Solver doesn't check whether it should exit while it is testing, only while it is training. This means that if you send it SIGINT and it happens to be in the middle of a test (not training), caffe keeps going until the test is finished, and only then does it exit (or snapshot). So you might have to wait a bit if you ask it to stop while it's testing. I thought that was a good thing, since it allows the test to run to completion and produce a valid test result. However, one might prefer that it be more responsive and stop right away regardless of whether it is testing, and throw away the test that is in progress. We could make it behave that way with a bit of extra complexity.

@lukeyeager
Contributor

one might prefer that it be more responsive and stop right away regardless of whether it is testing, and throw away the test that is in-progress

That sounds better to me. I would expect Ctrl+C to kill the process within a second or two. If you wait for testing to finish, you might have to wait several minutes.

@jyegerlehner
Contributor Author

I would expect Ctrl+C to kill the process within a second or two.

If anyone objects to Luke's preferred behaviour please speak up. Otherwise I'll plan to make that change.

@shelhamer
Member

I can see an argument for either finishing the testing or quitting
immediately, but to better fit expectations it should likely quit right
away no matter the action.

Thanks for the signal handling!

@flx42
Contributor

flx42 commented May 18, 2015

It would be nice to have both, if that's not too much complexity for you.
It's not uncommon to have different shutdown behaviors depending on which signal was received. For instance, nginx will do a fast shutdown on SIGTERM/SIGINT, and a graceful shutdown on SIGQUIT (wait for all current connections to finish). Apache has a similar behavior.

@jyegerlehner
Contributor Author

Behaviour modified to respond to signals during testing or training, whereas before it just responded during training. And rebased off master.

I tested these latest changes manually on a couple of scenarios. But some of this code is so new that it would give more confidence if someone else could test it too.

@jyegerlehner
Contributor Author

to better fit expectations it should likely quit right away

Makes sense to me.

Thanks for the signal handling!

Anything for the cause, comrade!

@jyegerlehner
Contributor Author

It would be nice to have both, if that's not too much complexity for you.

If we find we need or want that extra sophistication, my preference is to add it in a separate PR so that we can proceed more incrementally. The changes to Solver started out very simple and got more complex with the latest change, which feels to me like it is pushing the limits of added risk in one PR.

That said, if everyone really really wants it right now, I can do it.

@flx42
Contributor

flx42 commented May 18, 2015

No, I agree, this should be done in a separate patch.

@jyegerlehner
Contributor Author

I think this could use a "Ready for Review" label, if those make any difference.

@jyegerlehner jyegerlehner force-pushed the snapshot_on_signal branch 2 times, most recently from 17baa33 to c946a00 Compare August 14, 2015 22:03
@@ -126,6 +133,20 @@ void CopyLayers(caffe::Solver<float>* solver, const std::string& model_list) {
}
}

caffe::SolverParameter_Action GetRequestedAction(
Contributor Author

Line break not needed for 80 cols? Remove if not needed.

@ronghanghu
Member

LGTM. I will review this PR within the next week.

@jyegerlehner
Contributor Author

OK thanks. I will squash the commits once the review is done and we know there aren't any more changes required.

@ronghanghu
Member

@jyegerlehner I just did a quick review today and tested on my machine. Looks good to me! I'll finish reviewing this PR by tomorrow.

(As far as I know, this PR won't apply to Windows. But since we are not officially supporting Windows at the moment, I'm not too worried about that.)

Another potential enhancement related to this PR is that we may also support learning rate adjustment on signaling in the future (in a separate PR), so that one may adjust it during training based on e.g. the learning curve from the log, similar to some other deep learning tools. @jeffdonahue @longjon what do you think?

SolverParameter_Action SIGHUP_action);
ActionCallback GetActionFunction();
private:
SignalHandler(); // Not implemented.
Member

I'm sort of confused here... Why do we need a private constructor declaration in SignalHandler class?

Contributor Author

If the default constructor is not declared private, the compiler will provide it implicitly. So by preventing default construction we force the use of the constructor taking explicitly-specified signal actions. We could provide a default constructor that, say, initializes the actions to sensible defaults. But it would be unused as it is and thus unnecessary.

Member

If the default constructor is not declared private, the compiler will provide it implicitly.

C++ does generate a default constructor, but only if you don't declare any constructor of your own. Since you've already provided one, SignalHandler(SolverParameter_Action SIGINT_action, SolverParameter_Action SIGHUP_action);, the compiler won't implicitly generate a default constructor. So a private constructor declaration here is unnecessary.

You may verify with this simple program:

class C {
 public:
  C(int) {}
};
int main() { C c; } // compilation fails because there's no default constructor in class C.

Anyway, this is only a small issue. The PR seems great to me :)

Contributor Author

OK thanks for the correction and explanation. I will remove the private ctor dec.

@bhack
Contributor

bhack commented Aug 20, 2015

@willyd How will this impact your Windows plans?

@willyd
Contributor

willyd commented Aug 20, 2015

@bhack On Windows we would need to call SetConsoleCtrlHandler to handle SIGINT, but I don't think there is an equivalent to SIGHUP.

A cross-platform implementation is available in Boost.Asio.

@bhack
Contributor

bhack commented Aug 20, 2015

If there is still interest in #2537, I will avoid the asio solution.

if (sigaction(SIGINT, &sa, NULL) == -1) {
LOG(FATAL) << "Cannot install SIGINT handler.";
}
}
Member

Maybe "unhook" these signals in SignalHandler::~SignalHandler()?

Contributor Author

OK will do.
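
One way to "unhook" in a destructor is to keep the disposition that sigaction hands back when the handler is installed, and restore it later. The sketch below shows that idea only; ScopedSignalHook is a hypothetical helper, not the SignalHandler class from this PR.

```cpp
#include <signal.h>
#include <cstring>

// Hypothetical RAII helper: installs a handler on construction and puts
// the previous disposition back on destruction.
class ScopedSignalHook {
 public:
  ScopedSignalHook(int sig, void (*handler)(int))
      : sig_(sig), hooked_(false) {
    struct sigaction sa;
    std::memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    // The third argument receives whatever was installed before.
    hooked_ = (sigaction(sig_, &sa, &previous_) == 0);
  }

  ~ScopedSignalHook() {
    // Restore only if installation succeeded and previous_ is valid.
    if (hooked_) sigaction(sig_, &previous_, NULL);
  }

  bool hooked() const { return hooked_; }

 private:
  // Non-copyable: two copies would both try to restore the old handler.
  ScopedSignalHook(const ScopedSignalHook&);
  ScopedSignalHook& operator=(const ScopedSignalHook&);

  int sig_;
  bool hooked_;
  struct sigaction previous_;
};
```

Restoring the saved disposition, rather than resetting to SIG_DFL, leaves any handler that was installed before this object was created in place.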

@ronghanghu
Member

Completed a thorough pass today. This PR seems to be in good shape to me: it handles signals via POSIX sigaction and acts on the requested actions from the solver's train and test loops in a callback fashion. A side effect: this PR is platform-specific and may impact community Windows ports.

@jyegerlehner
Contributor Author

@ronghanghu The latest commits are intended to resolve the review issues you raised. Please let us know if they are not satisfactory.

As for the Travis build failing: it looks like it happened due to an error installing a package. Does anyone know whether I should make a dummy commit to provoke it to try building again? Or am I missing something that's an actual problem I need to fix?

@ronghanghu
Member

I restarted Travis CI and now all tests pass. I'll try to take a look today.

@ronghanghu
Member

Seems ready to me :) Please squash into one commit.

@ronghanghu
Member

@jeffdonahue @longjon I would like to merge this PR if you don't have other concerns.

Community Windows ports can perhaps simply strip this feature with #ifdef, e.g. by removing the anonymous namespace section in caffe/util/signal_handler.cpp and always returning SolverAction::NONE in SignalHandler::CheckForSignals() (Ctrl+C would still terminate the program on Windows).
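
A rough sketch of that kind of #ifdef split is shown below. It is not the contents of signal_handler.cpp: PortAction, InstallPortHandlers and CheckForPortSignals are hypothetical names, and the use of _WIN32 as the guard is an assumption about how a port might detect the platform.

```cpp
#include <signal.h>
#include <cstring>

// Hypothetical stand-in for the solver's action type.
enum class PortAction { NONE, STOP, SNAPSHOT };

#if defined(_WIN32)
// No POSIX signal handling here: nothing is hooked, nothing is requested.
bool InstallPortHandlers() { return true; }
PortAction CheckForPortSignals() { return PortAction::NONE; }
#else
namespace {
volatile sig_atomic_t port_sigint = 0;
volatile sig_atomic_t port_sighup = 0;

// The handler only records which signal arrived.
void PortHandler(int sig) {
  if (sig == SIGINT) port_sigint = 1;
  if (sig == SIGHUP) port_sighup = 1;
}
}  // namespace

// Hooks SIGINT and SIGHUP; returns true if both succeeded.
bool InstallPortHandlers() {
  struct sigaction sa;
  std::memset(&sa, 0, sizeof(sa));
  sa.sa_handler = &PortHandler;
  sigemptyset(&sa.sa_mask);
  return sigaction(SIGINT, &sa, NULL) == 0 &&
         sigaction(SIGHUP, &sa, NULL) == 0;
}

// Translates any pending signal into an action and clears its flag.
PortAction CheckForPortSignals() {
  if (port_sigint) {
    port_sigint = 0;
    return PortAction::STOP;
  }
  if (port_sighup) {
    port_sighup = 0;
    return PortAction::SNAPSHOT;
  }
  return PortAction::NONE;
}
#endif
```

Callers see the same two functions on both platforms, so the solver-side polling code would not need its own conditional compilation.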

@jyegerlehner
Contributor Author

OK, do you prefer commits to be squashed?

@ronghanghu
Member

Yes, please squash into one commit, so that I can merge it this weekend.

Add signal handler and early exit/snapshot to Solver.

Add signal handler and early exit/snapshot to Solver.

Also check for exit and snapshot when testing.

Skip running test after early exit.

Fix more lint.

Rebase on master.

Finish rebase on master.

Fixups per review comments.

Redress review comments.

Lint.

Correct error message wording.
ronghanghu added a commit that referenced this pull request Aug 22, 2015
@ronghanghu ronghanghu merged commit 12e1432 into BVLC:master Aug 22, 2015
@erogol
Contributor

erogol commented Aug 31, 2015

this is great. thanks for the effort @jyegerlehner

@jyegerlehner
Contributor Author

Sure thing @erogol. Glad to hear it's helpful.

@Coderx7
Contributor

Coderx7 commented May 23, 2016

Can you add an option to the solver so that users can take snapshots at any given time by pressing a key combination, such as Ctrl-S?

@ajtulloch
Contributor

@Coderx7 I think that'd add too much complexity to the feature. If you want to snapshot at an arbitrary time without stopping training, just send SIGHUP (kill -SIGHUP $(pidof caffe)).

@Coderx7
Contributor

Coderx7 commented May 24, 2016

@ajtulloch Yes, but at the same time it would be a very convenient feature to have. On Linux that's a very good idea, but how does one do the same thing on Windows? As far as I know, Windows doesn't have signals in the Linux sense.
