Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JVM] Support overriding RPCWatchdog termination behavior on Android and other platforms #6216

Merged
merged 2 commits into from
Aug 11, 2020
Merged

[JVM] Support overriding RPCWatchdog termination behavior on Android and other platforms #6216

merged 2 commits into from
Aug 11, 2020

Conversation

csullivan
Copy link
Contributor

Calling System.exit(0) from the watchdog (RPCActivity) emits an interrupt request that goes uncaught in some Android environments resulting in a system crash. I believe the correct non-destructive behavior should be to finish the RPCActivity, thereby popping the activity stack and returning control to the MainActivity, from which the RPCActivity can be restarted using the normal auto-reboot sequence.

Copy link
Contributor

@tmoreau89 tmoreau89 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tqchen
Copy link
Member

tqchen commented Aug 5, 2020

the intended behavior of the watchdog is actually to kill the process when a timeout Happens. The other running process. We will need to configure the app to auto restart, which will restart an RPC session. So the original exit behavior is intended.

@tqchen
Copy link
Member

tqchen commented Aug 5, 2020

This is used to get around the fact that android does not support fork of a process

@csullivan
Copy link
Contributor Author

I'm not an expert on JVM in Android, from what I can tell it is recommended against using System.exit in almost cases. It sounds like you are expecting System.exit to kill the TVMRPC app (the way a force stop would) but this does not seem to be the functionality of System.exit, as exit will resume any paused activities left on the activity stack after exiting [ref].

Overall, I am not claiming that the use of finish() instead is necessarily the correct implementation, but empirically the system is much more stable.

@tqchen
Copy link
Member

tqchen commented Aug 5, 2020

The main question though is whether finish will actually kill the running job on the other thread -- thus achieving the functionality of timeout, or if the stablity comes from the fact that we simply disables the timeout. That would boils down to the semantics of the finish itself, and would be great to confirm

@tqchen
Copy link
Member

tqchen commented Aug 6, 2020

@csullivan can you run some quick experiments to confirm whether the finish does kill the other running thread ?

@csullivan
Copy link
Contributor Author

csullivan commented Aug 6, 2020

Your concern is valid. What I observed that motivated this PR: After the timeout is reached** the RPCActivity view (with the StopRPC button) will close and return to the MainActivity view where the connection fields are listed. After a short time the RPCActivity will automatically restart and the pending measurements will again decrease. This seems to be the desired behavior.

I can run a few explicit experiments to be sure the thread is terminating, however, unfortunately while this helped stability, after about an hour or two I did see a system reset of similar characteristics as before. The resets are much less frequent than before, but persist. I noted two more uses of System.exit() in the RPCTracker APK which may or may not be causing this behavior:

https://github.com/apache/incubator-tvm/blob/master/apps/android_rpc/app/src/main/java/org/apache/tvm/tvmrpc/RPCActivity.java#L44

https://github.com/apache/incubator-tvm/blob/master/apps/android_rpc/app/src/main/java/org/apache/tvm/tvmrpc/RPCProcessor.java#L54

** Approximating here as I counted from when I saw the pending measurements stall

@tqchen
Copy link
Member

tqchen commented Aug 6, 2020

One quick way to confirm the behavior it is to write the main loop as an infinite loop(that sleeps periodically) and increases the counter, while the watchdog thread calls the finish function.

@csullivan
Copy link
Contributor Author

@tqchen I added a counter to the RPCProcessor loop:

 class RPCProcessor extends Thread {
   private String host;
   private int port;
   private String key;
   private boolean running = false;
   private long startTime;
   private ConnectTrackerServerProcessor currProcessor;
   private boolean first = true;
+  private int counter = 0;

   @Override public void run() {
     RPCWatchdog watchdog = new RPCWatchdog();
     watchdog.start();
     while (true) {
       synchronized (this) {
         currProcessor = null;
         while (!running) {
           try {
             this.wait();
           } catch (InterruptedException e) {
           }
         }
         try {
           currProcessor = new ConnectTrackerServerProcessor(host, port, key, watchdog);
         } catch (Throwable e) {
           e.printStackTrace();
           // kill if creating a new processor failed
+          System.err.println("Creating a new processor failed, exiting");
           System.exit(0);
         }
       }
+      System.err.println("RPCProcessor infinite loop: " + counter);
+      counter += 1;
       if (currProcessor != null)
         currProcessor.run();
       watchdog.finishTimeout();
     }
   }

In the logs I see that after the watchdog wakes up and calls finish, the counter is reset. Seems to give evidence that the thread is exiting as expected.

...
08-06 14:42:31.157 24737 24761 W System.err: Connection from /192.168.1.10:60626
08-06 14:42:31.158 24737 24762 W System.err: waiting for timeout: 10000
08-06 14:42:31.160 24737 24761 W System.err: starting server loop...
08-06 14:42:31.192 24737 24761 W System.err: Load module from /data/user/0/org.apache.tvm.tvmrpc/cache/tvm4j_rpc_7111901762916576085/tmp_func_19a48fccba43c2c0.so
08-06 14:42:31.809 24737 24761 W System.err: done server loop...
08-06 14:42:31.809 24737 24761 W System.err: Finish serving /192.168.1.10:60626
08-06 14:42:31.812 24737 24762 W System.err: watchdog woken up, ok...
08-06 14:42:31.812 24737 24761 W System.err: using port: 5001
08-06 14:42:31.812 24737 24761 W System.err: RPCProcessor infinite loop: 7 <---
08-06 14:42:31.812 24737 24761 W System.err: currProcessor.run()
08-06 14:42:31.922 24737 24761 W System.err: registered with tracker...
08-06 14:42:31.922 24737 24761 W System.err: waiting for requests...
08-06 14:42:31.923 24737 24761 W System.err: matchKey:android:0.16312182579570966
08-06 14:42:31.923 24737 24761 W System.err: key: client:android:0.16312182579570966 -timeout=10
08-06 14:42:31.924 24737 24761 W System.err: alloted timeout: 10
08-06 14:42:31.924 24737 24761 W System.err: Connection from /192.168.1.10:60628
08-06 14:42:31.924 24737 24762 W System.err: waiting for timeout: 10000
08-06 14:42:31.925 24737 24761 W System.err: starting server loop...
08-06 14:42:32.001 24737 24761 W System.err: Load module from /data/user/0/org.apache.tvm.tvmrpc/cache/tvm4j_rpc_1859845163597930021/tmp_func_d08fd3eb029d7c7.so
08-06 14:42:41.925 24737 24762 W System.err: watchdog woke up! <---
08-06 14:42:41.925 24737 24762 W System.err: terminating... <--- calls finish()
08-06 14:42:41.926 24737 24813 W System.err: Deleting /data/user/0/org.apache.tvm.tvmrpc/cache/tvm4j6272654684694666821
08-06 14:42:41.983 24511 24511 W System.err: MainActivity onResume...
08-06 14:42:46.987 24511 24511 W System.err: relaunching RPC activity...
08-06 14:42:46.987 24511 24511 W System.err: updating preferences...
08-06 14:42:47.048 24818 24818 W System.err: rpc activity onCreate...
08-06 14:42:47.049 24818 24842 W System.err: using port: 5001
08-06 14:42:47.049 24818 24842 W System.err: RPCProcessor infinite loop: 0 <---
...

@csullivan
Copy link
Contributor Author

Again though, this doesn't appear to have entirely fixed the issue. Less frequent instabilities, but they still exist with finish() and still seem to have something to do with the timeout behavior. Unfortunately, any instabilities that cause resets or crashes to the lock screen are enough to make long unmonitored autotuning runs untenable.

For this reason I’ve moved to cross compiling the C++ RPC app (#6229) and running it from the android shell which is much more stable. After battle testing it I'd like to update our android docs to recommend using it.

@tqchen
Copy link
Member

tqchen commented Aug 6, 2020

Great, Thanks @csullivan , seems thatw we can go ahead and merge it then. please help to fix the CI error

@csullivan csullivan marked this pull request as draft August 7, 2020 23:39
@csullivan csullivan changed the title [JVM] Remove destructive call to System.exit(0) upon timeout [JVM] Support overriding RPCWatchdog termination behavior on Android and other platforms Aug 10, 2020
the activity stack, finish the RPCActivity and return cleanly to the
MainActivity where the RPCActivity can be restarted automatically.
@csullivan csullivan marked this pull request as ready for review August 11, 2020 16:21
@csullivan
Copy link
Contributor Author

@tmoreau89, @tqchen, if you have time please review the updates. Thanks!

@csullivan csullivan requested a review from tmoreau89 August 11, 2020 19:59
@tqchen tqchen merged commit db6e0c1 into apache:master Aug 11, 2020
wjliu1998 pushed a commit to wjliu1998/incubator-tvm that referenced this pull request Aug 13, 2020
…and other platforms (apache#6216)

* Instead of performing a system exit and leaving unhandled items on
the activity stack, finish the RPCActivity and return cleanly to the
MainActivity where the RPCActivity can be restarted automatically.

* Update doc. string for checkstyle.
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Aug 26, 2020
…and other platforms (apache#6216)

* Instead of performing a system exit and leaving unhandled items on
the activity stack, finish the RPCActivity and return cleanly to the
MainActivity where the RPCActivity can be restarted automatically.

* Update doc. string for checkstyle.
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Aug 26, 2020
…and other platforms (apache#6216)

* Instead of performing a system exit and leaving unhandled items on
the activity stack, finish the RPCActivity and return cleanly to the
MainActivity where the RPCActivity can be restarted automatically.

* Update doc. string for checkstyle.
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Aug 26, 2020
…and other platforms (apache#6216)

* Instead of performing a system exit and leaving unhandled items on
the activity stack, finish the RPCActivity and return cleanly to the
MainActivity where the RPCActivity can be restarted automatically.

* Update doc. string for checkstyle.
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Sep 2, 2020
…and other platforms (apache#6216)

* Instead of performing a system exit and leaving unhandled items on
the activity stack, finish the RPCActivity and return cleanly to the
MainActivity where the RPCActivity can be restarted automatically.

* Update doc. string for checkstyle.
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Sep 3, 2020
…and other platforms (apache#6216)

* Instead of performing a system exit and leaving unhandled items on
the activity stack, finish the RPCActivity and return cleanly to the
MainActivity where the RPCActivity can be restarted automatically.

* Update doc. string for checkstyle.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants