Snapshot on signal #2253

jyegerlehner · 2015-04-03T21:30:54Z

This is an implementation of the feature discussed in issue #2012.

When you hit Ctrl-C to kill caffe (while training), it will now save a snapshot before exiting. Actually, the Solver just stops training, and a snapshot is only saved if snapshot_after_train is true.

This is the default behavior, which is configurable via the sigint_effect and sighup_effect command-line options. Also by default, a SIGHUP signal causes caffe to save a snapshot and continue training. So you can make caffe save a snapshot by sending it a SIGHUP signal, e.g.:

kill -SIGHUP PID

where PID is the process id of caffe, which you can find by doing ps -ef | grep caffe.

The design has two pieces:

1. Solver is modified slightly so that after each iteration it checks to see if its client wants it to either snapshot or exit. It does this via a callback function that a client can set on the Solver instance. If the callback hasn't been set, it just carries on as usual. So there is no breaking change to the Solver interface and the behavior of existing code shouldn't change.
2. The caffe executable provides the callback function to the Solver, and the callback is implemented on a SignalHandler that intercepts SIGINT and SIGHUP.
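As a rough illustration of the first piece, the sketch below shows a solver loop that polls an optional callback after each iteration. MiniSolver, Action, and SetActionCallback are hypothetical names chosen for this example, not the identifiers used in the PR, and the training step and the snapshot are reduced to counters. It also leaves out the snapshot_after_train handling on exit.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical names throughout; this is not Caffe's actual API.
enum class Action { kNone, kSnapshot, kStop };

class MiniSolver {
 public:
  using ActionCallback = std::function<Action()>;

  // Optional: a client may register a callback that is polled every iteration.
  void SetActionCallback(ActionCallback cb) { callback_ = std::move(cb); }

  // Runs up to max_iter iterations and returns how many were completed.
  int Solve(int max_iter) {
    assert(max_iter >= 0);
    int iter = 0;
    while (iter < max_iter) {
      ++iter;  // Stand-in for one training step.
      // With no callback set, the request is always kNone.
      const Action request = callback_ ? callback_() : Action::kNone;
      if (request == Action::kSnapshot) {
        ++snapshots_;  // Stand-in for writing a snapshot to disk.
      } else if (request == Action::kStop) {
        break;  // Stop training early.
      }
    }
    return iter;
  }

  int snapshots() const { return snapshots_; }

 private:
  ActionCallback callback_;  // Empty by default, so existing callers see no change.
  int snapshots_ = 0;
};
```

A solver that never has a callback set simply runs all of its iterations, which mirrors the point that existing code should behave as before.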

flx42 · 2015-04-18T00:27:05Z

Looks cool, but I don't think you can safely lock a mutex in a signal handler, nor try to perform I/O.

jyegerlehner · 2015-04-18T01:27:40Z

Thanks @flx42, I should have researched signals a bit more before I implemented this.

jyegerlehner · 2015-04-18T17:43:27Z

@flx42 do you see any problem with this new implementation?

flx42 · 2015-04-18T19:02:59Z

The type should be "sig_atomic_t volatile".
Also, since the variables are no longer of type "bool", it might be better to use "1" and "0" instead of "true" and "false".

After that, I think it will be fine :)
👍
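A minimal sketch of the pattern under discussion, assuming a POSIX system: the handler does nothing except set a flag of type volatile sig_atomic_t, and ordinary code polls and clears that flag later. The names got_sigint, got_sighup, HandleSignal, Hook, and ConsumeSighup are made up for this example and are not taken from the PR.

```cpp
#include <signal.h>  // POSIX: sigaction, SIGINT, SIGHUP
#include <cassert>
#include <cstring>

// Flags shared between the handler and normal code. volatile sig_atomic_t is
// a type the C and C++ standards guarantee a signal handler can write safely.
volatile sig_atomic_t got_sigint = 0;
volatile sig_atomic_t got_sighup = 0;

// The handler only records which signal arrived: no locks, no I/O.
void HandleSignal(int signum) {
  if (signum == SIGINT) got_sigint = 1;
  if (signum == SIGHUP) got_sighup = 1;
}

// Installs HandleSignal for signum. Returns false if sigaction fails.
bool Hook(int signum) {
  assert(signum > 0);
  struct sigaction sa;
  std::memset(&sa, 0, sizeof(sa));
  sa.sa_handler = &HandleSignal;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = SA_RESTART;  // Restart interrupted system calls where possible.
  return sigaction(signum, &sa, nullptr) == 0;
}

// Called from ordinary code, for example once per solver iteration.
// Returns true if a SIGHUP arrived since the last call, and clears the flag.
bool ConsumeSighup() {
  if (!got_sighup) return false;
  got_sighup = 0;
  return true;
}
```

Several SIGHUPs that arrive between two polls collapse into a single request in this sketch, which seems acceptable for a snapshot trigger.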

jyegerlehner · 2015-04-18T19:47:47Z

Thanks for the reply.

> The type should be "sig_atomic_t volatile".

Hrmm.. all the examples I see, the qualifier precedes the type name, same as with "const" or "mutable". Please let me know if I'm missing something.

> since the variables are no longer of type "bool", it might be better to use "1" and "0" instead of "true" and "false".

true and false are promoted to 1 and 0 per the language spec I'm pretty sure. Since they better convey the meaning, I'm inclined to leave the true and false in there.

flx42 · 2015-04-18T20:09:16Z

On Sat, Apr 18, 2015 at 12:48 PM, jyegerlehner wrote:

> Thanks for the reply.
>
> > The type should be "sig_atomic_t volatile".
>
> Hrmm.. all the examples I see, the qualifier precedes the type name, same as with "const" or "mutable". Please let me know if I'm missing something.

No, you're right. That's what I meant.

> > since the variables are no longer of type "bool", it might be better to use "1" and "0" instead of "true" and "false".
>
> true and false are promoted to 1 and 0 per the language spec I'm pretty sure. Since they better convey the meaning, I'm inclined to leave the true and false in there.

It was really a nitpicking comment about style. But both ways will be fine.

flx42 · 2015-04-20T20:24:54Z

Another nitpick: you should avoid unnecessary empty-line changes in your patch.
Like this: jyegerlehner@c0df909#diff-ed792b89816c46f109918495c88421bcL3

lukeyeager · 2015-05-13T17:22:36Z

Any update on this? I want it.

jyegerlehner · 2015-05-13T17:43:29Z

Hi Luke, I'm not aware of anything deficient about it.

jyegerlehner · 2015-05-14T01:43:27Z

Well there is one thing about this that is debatable. Solver doesn't check to see if it should exit when it is doing test, only when it is training. This means if you send it SIGINT and it happens to be in the middle of test (not train), caffe keeps going until the test is finished, and only then does it exit (or snapshot). So you might have to wait a bit if you try to request it to stop when it's testing. I thought that was a good thing since it allows the test to run to completion and produce a valid test result. However, one might prefer that it be more responsive and stop right away regardless of whether it is testing, and throw away the test that is in-progress. We could make it behave that way with a bit of extra complexity.

lukeyeager · 2015-05-14T23:24:17Z

> one might prefer that it be more responsive and stop right away regardless of whether it is testing, and throw away the test that is in-progress

That sounds better to me. I would expect Ctrl+C to kill the process within a second or two. If you wait for testing to finish, you might have to wait several minutes.

jyegerlehner · 2015-05-17T15:38:54Z

> I would expect Ctrl+C to kill the process within a second or two.

If anyone objects to Luke's preferred behaviour please speak up. Otherwise I'll plan to make that change.

shelhamer · 2015-05-18T02:44:30Z

I can see an argument for either finishing the testing or quitting
immediately, but to better fit expectations it should likely quit right
away no matter the action.

Thanks for the signal handling!

On Sun, May 17, 2015 at 08:38 jyegerlehner wrote:

> > I would expect Ctrl+C to kill the process within a second or two.
>
> If anyone objects to Luke's preferred behaviour please speak up. Otherwise I'll plan to make that change.


flx42 · 2015-05-18T04:08:57Z

It would be nice to have both, if that's not too much complexity for you.
It's not uncommon to have different shutdown behaviors depending on which signal was received. For instance, nginx will do a fast shutdown on SIGTERM/SIGINT, and a graceful shutdown on SIGQUIT (wait for all current connections to finish). Apache has a similar behavior.

jyegerlehner · 2015-05-18T06:52:35Z

Behaviour modified to respond to signals during testing or training, whereas before it just responded during training. And rebased off master.

I tested these latest changes manually on a couple of scenarios. But some of this code is new enough that if anyone else can test it, that would give more confidence.
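To make the changed behaviour concrete, here is a small sketch of a test pass that polls for a stop request before each batch and throws away the partial result when one arrives. RunTestPass, Action, and the two callbacks are hypothetical names for this illustration. The sketch handles only the stop request, and it is not the code from this PR.

```cpp
#include <cassert>
#include <functional>

// Hypothetical names; not Caffe's actual API.
enum class Action { kNone, kSnapshot, kStop };

// Averages a per-batch score over num_batches test batches.
// Returns true and writes *mean_score if every batch ran. Returns false and
// leaves *mean_score untouched if a stop was requested part-way through.
bool RunTestPass(int num_batches,
                 const std::function<double(int)>& score_batch,
                 const std::function<Action()>& poll,
                 double* mean_score) {
  assert(mean_score != nullptr);
  if (num_batches <= 0) return false;
  double total = 0.0;
  for (int i = 0; i < num_batches; ++i) {
    // Check for a request before each batch so Ctrl-C takes effect quickly.
    if (poll && poll() == Action::kStop) {
      return false;  // Discard the in-progress test result.
    }
    total += score_batch(i);
  }
  *mean_score = total / num_batches;
  return true;
}
```

The caller can use the return value to skip reporting a test result that was cut short.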

jyegerlehner · 2015-05-18T18:21:45Z

> to better fit expectations it should likely quit right away

Makes sense to me.

> Thanks for the signal handling!

Anything for the cause, comrade!

jyegerlehner · 2015-05-18T18:30:06Z

> It would be nice to have both, if that's not too much complexity for you.

If we find we need or want that extra sophistication, I think my preference is to add it in a separate PR so that we can proceed more incrementally. The changes to Solver started out very simple and got more complex with the latest change, which feels to me like pushing the limits of added risk in one PR.

That said, if everyone really really wants it right now, I can do it.

flx42 · 2015-05-18T18:32:48Z

No, I agree, this should be done in a separate patch.

jyegerlehner · 2015-06-11T19:16:21Z

I think this could use a "Ready for Review" label, if those make any difference.

jyegerlehner · 2015-08-15T00:34:46Z

tools/caffe.cpp

@@ -126,6 +133,20 @@ void CopyLayers(caffe::Solver<float>* solver, const std::string& model_list) {
  }
 }

+caffe::SolverParameter_Action GetRequestedAction(


Line break not needed for 80 cols? Remove if not needed.

ronghanghu · 2015-08-15T18:43:35Z

LGTM. I shall review this PR within the next week.

jyegerlehner · 2015-08-16T17:01:53Z

OK thanks. I will squash the commits once the review is done and we know there aren't any more changes required.

ronghanghu · 2015-08-20T06:04:06Z

@jyegerlehner I just did a quick review today and tested on my machine. Looks good to me! I'll finish reviewing this PR by tomorrow.

(As far as I know, this PR won't apply to windows. But since we are not officially supporting windows at the moment, I'm not too worried about that.)

Another potential enhancement related to this PR is that we may also support learning rate adjustment on signaling in the future (in a separate PR), so that one may adjust it during training based on, e.g., the learning curve from the log, similar to some other deep learning tools. @jeffdonahue @longjon what do you think?

ronghanghu · 2015-08-20T06:08:00Z

include/caffe/util/signal_handler.h

+                SolverParameter_Action SIGHUP_action);
+  ActionCallback GetActionFunction();
+ private:
+  SignalHandler();  // Not implemented.


I'm sort of confused here... Why do we need a private constructor declaration in SignalHandler class?

jyegerlehner · 2015-08-20

If the default constructor is not declared private, the compiler will provide it implicitly. So by preventing default construction we force the use of the constructor taking explicitly-specified signal actions. We could provide a default constructor that, say, initializes the actions to sensible defaults. But it would be unused as it is and thus unnecessary.

ronghanghu · 2015-08-20

> If the default constructor is not declared private, the compiler will provide it implicitly.

C++ does generate a default constructor, but only if you don't declare any constructor of your own. Since you've already provided one, SignalHandler(SolverParameter_Action SIGINT_action, SolverParameter_Action SIGHUP_action);, a default constructor won't be generated implicitly by the compiler. So a private constructor declaration here is unnecessary.

You may verify with this simple program:

class C {
 public:
  C(int) {}
};

int main() {
  C c;  // compilation fails because there's no default constructor in class C.
}
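The same point can be checked without relying on a failed compile by asking the type system directly. This is only a sketch: WithArgCtor is a hypothetical stand-in for a class like SignalHandler, with plain int parameters instead of the real action type, and NoCtors is an illustrative class with no constructors declared.

```cpp
#include <type_traits>

// Hypothetical stand-in for a class that declares only a constructor taking
// arguments, as SignalHandler does in the diff above.
class WithArgCtor {
 public:
  WithArgCtor(int sigint_action, int sighup_action)
      : sigint_action_(sigint_action), sighup_action_(sighup_action) {}
  int sigint_action() const { return sigint_action_; }
  int sighup_action() const { return sighup_action_; }

 private:
  int sigint_action_;
  int sighup_action_;
};

// A class that declares no constructors at all.
class NoCtors {};

// Declaring any constructor suppresses the implicit default constructor, so
// no private declaration is needed to prevent default construction.
static_assert(!std::is_default_constructible<WithArgCtor>::value,
              "WithArgCtor should not be default-constructible");
static_assert(std::is_default_constructible<NoCtors>::value,
              "NoCtors gets an implicit default constructor");
```

Both assertions are evaluated at compile time, so the file compiling at all is the check.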

Anyway, this is only a small issue. The PR seems great to me :)

jyegerlehner · 2015-08-20

OK thanks for the correction and explanation. I will remove the private ctor declaration.

bhack · 2015-08-20T10:25:49Z

@willyd How will this impact your Windows plans?

willyd · 2015-08-20T12:52:31Z

@bhack On Windows we would need to call SetConsoleCtrlHandler to handle SIGINT, but I don't think there is an equivalent to SIGHUP.

A cross-platform implementation is available in boost.asio.

bhack · 2015-08-20T13:01:31Z

If there is still interest in #2537 I will avoid the asio solution.

ronghanghu · 2015-08-20T18:54:59Z

src/caffe/util/signal_handler.cpp

+    if (sigaction(SIGINT, &sa, NULL) == -1) {
+      LOG(FATAL) << "Cannot install SIGINT handler.";
+    }
+  }


Maybe "unhook" these signals in SignalHandler::~SignalHandler()?

jyegerlehner · 2015-08-20

OK will do.
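One way to "unhook" in a destructor is to remember the previous disposition when installing the handler and put it back on destruction. The sketch below assumes a POSIX system. ScopedSignalHook and IgnoreSignal are hypothetical names, not the PR's SignalHandler class. The thread does not say whether the merged code restores the previous handler or the default one, so restoring the previous disposition is this sketch's own choice.

```cpp
#include <signal.h>  // POSIX: sigaction
#include <cassert>
#include <cstring>

// Hypothetical RAII helper: installs a handler on construction and restores
// the previously installed disposition on destruction.
class ScopedSignalHook {
 public:
  ScopedSignalHook(int signum, void (*handler)(int))
      : signum_(signum), installed_(false) {
    assert(handler != nullptr);
    std::memset(&previous_, 0, sizeof(previous_));
    struct sigaction sa;
    std::memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    // sigaction stores the old disposition in previous_ on success.
    installed_ = (sigaction(signum_, &sa, &previous_) == 0);
  }

  ~ScopedSignalHook() {
    // "Unhook": put back whatever was in place before this object existed.
    if (installed_) {
      sigaction(signum_, &previous_, nullptr);
    }
  }

  ScopedSignalHook(const ScopedSignalHook&) = delete;
  ScopedSignalHook& operator=(const ScopedSignalHook&) = delete;

  bool installed() const { return installed_; }

 private:
  int signum_;
  bool installed_;
  struct sigaction previous_;
};

// Example handler that deliberately does nothing.
void IgnoreSignal(int) {}
```

Tying the hook to an object's lifetime means the handler cannot outlive the code that polls its flags.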

ronghanghu · 2015-08-20T19:04:17Z

Completed a thorough pass today. This PR seems in good shape to me: it handles signals via POSIX sigaction and acts on the requested actions from the solver's train and test loops through a callback. A side effect: this PR is platform-specific and may impact community Windows ports.

jyegerlehner · 2015-08-21T17:29:03Z

@ronghanghu The latest commits are intended to resolve the review issues you raised. Please let us know if they are not satisfactory.

As for the Travis build failing: this looks like it happened due to an error installing a package. Does anyone know whether I should make a dummy commit to provoke it to try building again? Or am I missing an actual problem that I need to fix?

ronghanghu · 2015-08-21T18:39:10Z

I restarted Travis CI and now all tests pass. I'll try to take a look today.

ronghanghu · 2015-08-22T04:29:25Z

Seems ready to me :) Please squash into one commit.

ronghanghu · 2015-08-22T04:44:03Z

@jeffdonahue @longjon I would like to merge this PR if you don't have other concerns.

Community windows ports can perhaps simply strip this feature with #ifdef e.g. via removing the anonymous namespace section in caffe/util/signal_handler.cpp and always returning SolverAction::NONE in SignalHandler::CheckForSignals() (and Ctrl+C still terminates the program on windows).

jyegerlehner · 2015-08-22T17:04:08Z

OK, do you prefer commits to be squashed?

ronghanghu · 2015-08-22T17:19:20Z

Yes, please squash into one commit, so that I can merge it this weekend.

Add signal handler and early exit/snapshot to Solver.
Also check for exit and snapshot when testing.
Skip running test after early exit.
Fix more lint.
Rebase on master.
Finish rebase on master.
Fixups per review comments.
Redress review comments.
Lint.
Correct error message wording.

Snapshot on signal

erogol · 2015-08-31T09:37:11Z

this is great. thanks for the effort @jyegerlehner

jyegerlehner · 2015-08-31T20:12:56Z

Sure thing @erogol. Glad to hear it's helpful.

Coderx7 · 2016-05-23T17:21:13Z

Can you add an option to the solver so that users can take snap-shots at any given time by pressing a key combination like Ctrl-S for example?

ajtulloch · 2016-05-23T23:37:08Z

@Coderx7 I think that'd add too much complexity to the feature. If you want to snapshot at an arbitrary time without stopping training, just send SIGHUP (kill -SIGHUP $(pidof caffe))

Coderx7 · 2016-05-24T04:28:08Z

@ajtulloch Yes, but at the same time it would be a very convenient feature to have.
On Linux that's a very good idea, but how does one do the same thing on Windows? As far as I know, Windows doesn't have signals in the Linux sense.

jyegerlehner changed the title ~~Add signal handler and early exit/snapshot to Solver.~~ Snapshot on signal Apr 4, 2015

jyegerlehner mentioned this pull request Apr 6, 2015

Snapshot on signal #2012

Closed

shelhamer added the enhancement label Apr 9, 2015

jyegerlehner force-pushed the snapshot_on_signal branch from 00fabeb to c0df909 Compare April 18, 2015 20:59

lukeyeager mentioned this pull request May 13, 2015

Upgrades to scheduler module NVIDIA/DIGITS#104

Closed

9 tasks

shelhamer added the JL label May 13, 2015

jyegerlehner force-pushed the snapshot_on_signal branch from 32c4a7b to 17baa33 Compare May 18, 2015 17:56

shelhamer added focus RH labels Aug 5, 2015

jyegerlehner force-pushed the snapshot_on_signal branch 2 times, most recently from 17baa33 to c946a00 Compare August 14, 2015 22:03

jyegerlehner reviewed Aug 15, 2015

ronghanghu reviewed Aug 20, 2015

ronghanghu added the ready for review label Aug 22, 2015

jyegerlehner force-pushed the snapshot_on_signal branch from 11f501a to ff19d5f Compare August 22, 2015 18:07

ronghanghu added a commit that referenced this pull request Aug 22, 2015

Merge pull request #2253 from jyegerlehner/snapshot_on_signal

12e1432

Snapshot on signal

ronghanghu merged commit 12e1432 into BVLC:master Aug 22, 2015

jyegerlehner deleted the snapshot_on_signal branch August 31, 2015 20:12

willyd mentioned this pull request Sep 16, 2015

Build the newest Caffe with Multi-GPU support willyd/caffe-builder#9

Closed

eelstork mentioned this pull request Dec 31, 2015

Handle signals with boost Asio #3500

Open
