Difficulty with stop command #29
Comments
Looks like the job is hanging around in the REMOVED state. This occasionally happens when the schedd is under heavy load, which ours often are. Do you happen to still have the

I've seen that

As far as code changes go, we can probably issue a JobAction.RemoveX (https://htcondor.readthedocs.io/en/latest/apis/python-bindings/api/htcondor.html#htcondor.JobAction.RemoveX) if the job stays in the REMOVED state for too long.
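To make the RemoveX idea concrete, here is a rough sketch of what "escalate if the job stays REMOVED for too long" could look like. It is not Dask-CHTC code: the function name `escalate_if_stuck` and its `timeout`/`poll` parameters are made up for illustration. It assumes `schedd` behaves like `htcondor.Schedd` (`query(constraint, attrs)` and `act(action, job_spec)`), and that `JobStatus == 3` means REMOVED.

```python
import time

REMOVED = 3  # HTCondor JobStatus code for the REMOVED ("X") state


def escalate_if_stuck(schedd, cluster_id, force_action, timeout=60.0, poll=5.0):
    """Hypothetical helper: force-remove a job that lingers in REMOVED.

    `schedd` is assumed to behave like htcondor.Schedd, and `force_action`
    would be htcondor.JobAction.RemoveX in real use. Returns True if the
    forced action was issued, False if the job left the REMOVED state
    (or the queue) on its own before the timeout.
    """
    constraint = f"ClusterId == {cluster_id}"
    deadline = time.monotonic() + timeout
    while True:
        ads = schedd.query(constraint, ["JobStatus"])
        stuck = [ad for ad in ads if ad.get("JobStatus") == REMOVED]
        if not stuck:
            return False  # nothing is stuck, no escalation needed
        if time.monotonic() >= deadline:
            schedd.act(force_action, constraint)
            return True
        time.sleep(poll)


# Intended use (requires the htcondor Python bindings, not imported here):
#   import htcondor
#   escalate_if_stuck(htcondor.Schedd(), 8138911, htcondor.JobAction.RemoveX)
```

Passing the action in as an argument keeps the sketch importable without the bindings. Note that a forced removal only clears the job from the queue and does nothing about the process it was running.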
I have some files at
(base) [stsievert@submit3 jupyter-logs]$ cat current.events
000 (8138911.000.000) 2020-07-19 13:57:07 Job submitted from host: <128.104.100.44:9618?addrs=128.104.100.44-9618+[2607-f388-107c-501-92e2-baff-fe2c-2724]-9618&alias=submit3.chtc.wisc.edu&noUDP&sock=schedd_4216_675f>
...
001 (8138911.000.000) 2020-07-19 13:57:08 Job executing on host: <128.104.100.44:9618?addrs=128.104.100.44-9618+[2607-f388-107c-501-92e2-baff-fe2c-2724]-9618&alias=submit3.chtc.wisc.edu&noUDP&sock=starter_3257435_6388_266760>
...
current.err
(base) [stsievert@submit3 jupyter-logs]$ cat current.err
[I 13:57:10.489 LabApp] Writing notebook server cookie secret to /home/stsievert/.local/share/jupyter/runtime/notebook_cookie_secret
[I 13:57:10.775 LabApp] JupyterLab extension loaded from /home/stsievert/miniconda3/lib/python3.7/site-packages/jupyterlab
[I 13:57:10.775 LabApp] JupyterLab application directory is /home/stsievert/miniconda3/share/jupyter/lab
[I 13:57:10.778 LabApp] Serving notebooks from local directory: /home/stsievert
[I 13:57:10.778 LabApp] The Jupyter Notebook is running at:
[I 13:57:10.779 LabApp] http://localhost:8888/?token=d1717bce73ebc0f54ebeb16eeeef70811ead8eaae23e213c
[I 13:57:10.779 LabApp] or http://127.0.0.1:8888/?token=d1717bce73ebc0f54ebeb16eeeef70811ead8eaae23e213c
[I 13:57:10.779 LabApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 13:57:10.784 LabApp]
To access the notebook, open this file in a browser:
file:///home/stsievert/.local/share/jupyter/runtime/nbserver-1133953-open.html
Or copy and paste one of these URLs:
http://localhost:8888/?token=d1717bce73ebc0f54ebeb16eeeef70811ead8eaae23e213c
or http://127.0.0.1:8888/?token=d1717bce73ebc0f54ebeb16eeeef70811ead8eaae23e213c
[I 13:57:48.832 LabApp] 302 GET / (127.0.0.1) 1.99ms
[I 13:57:48.892 LabApp] 302 GET /lab? (127.0.0.1) 3.51ms
[I 13:59:04.645 LabApp] 302 GET /lab/workspaces/auto-R?clone (127.0.0.1) 2.55ms
[I 13:59:10.113 LabApp] 302 GET /?token=d1717bce73ebc0f54ebeb16eeeef70811ead8eaae23e213c (127.0.0.1) 1.86ms
[W 13:59:13.906 LabApp] Could not determine jupyterlab build status without nodejs
[I 13:59:18.255 LabApp] Creating new notebook in /
[I 13:59:18.268 LabApp] Writing notebook-signing key to /home/stsievert/.local/share/jupyter/notebook_secret
[I 13:59:18.892 LabApp] Kernel started: 2d281613-0dc9-463f-b608-85b37b9d2712
[I 13:59:19.808 LabApp] Starting buffering for 2d281613-0dc9-463f-b608-85b37b9d2712:b90b0af5-ef49-4f4b-b9fb-cdd0511ffcb2
[I 14:00:02.696 LabApp] Saving file at /Untitled.ipynb
[I 14:00:13.388 LabApp] Saving file at /Untitled.ipynb
[I 14:00:17.022 LabApp] Saving file at /Untitled.ipynb
[I 14:00:18.753 LabApp] Saving file at /Untitled.ipynb
[I 14:00:20.109 LabApp] Saving file at /Untitled.ipynb
[I 14:00:24.034 LabApp] Saving file at /Untitled.ipynb
[I 14:00:27.692 LabApp] Saving file at /Untitled.ipynb
[I 14:00:32.130 LabApp] Saving file at /Untitled.ipynb
[I 14:00:38.323 LabApp] Saving file at /Untitled.ipynb
[I 14:00:39.603 LabApp] Saving file at /Untitled.ipynb
[I 14:00:44.105 LabApp] Saving file at /Untitled.ipynb
[I 14:00:53.219 LabApp] Saving file at /Untitled.ipynb
[I 14:01:06.243 LabApp] Saving file at /Untitled.ipynb
[I 14:01:10.000 LabApp] Saving file at /Untitled.ipynb
[I 14:01:17.589 LabApp] Saving file at /Untitled.ipynb
[I 14:01:26.650 LabApp] Saving file at /Untitled.ipynb
[I 14:01:36.569 LabApp] Saving file at /Untitled.ipynb
[I 14:01:38.832 LabApp] Saving file at /Untitled.ipynb
[I 14:01:44.863 LabApp] Saving file at /Untitled.ipynb
[I 14:02:00.285 LabApp] Saving file at /Untitled.ipynb
[I 14:02:10.162 LabApp] Saving file at /Untitled.ipynb
[I 14:02:19.008 LabApp] Saving file at /Untitled.ipynb
[I 14:02:24.847 LabApp] Saving file at /Untitled.ipynb
[I 14:05:58.584 LabApp] Saving file at /Untitled.ipynb
[I 14:06:18.515 LabApp] Saving file at /Untitled.ipynb
[I 14:07:12.166 LabApp] Saving file at /Untitled.ipynb
[I 14:07:22.798 LabApp] Saving file at /Untitled.ipynb
[I 14:07:33.246 LabApp] Saving file at /Untitled.ipynb
[I 14:07:39.316 LabApp] Starting buffering for 2d281613-0dc9-463f-b608-85b37b9d2712:2a52aeac-46d6-4386-8dc4-07432931f77b
[I 14:07:58.748 LabApp] Restoring connection for 2d281613-0dc9-463f-b608-85b37b9d2712:2a52aeac-46d6-4386-8dc4-07432931f77b
[I 14:07:59.417 LabApp] Starting buffering for 2d281613-0dc9-463f-b608-85b37b9d2712:2a52aeac-46d6-4386-8dc4-07432931f77b
[I 14:07:59.534 LabApp] Restoring connection for 2d281613-0dc9-463f-b608-85b37b9d2712:2a52aeac-46d6-4386-8dc4-07432931f77b
[I 14:07:59.636 LabApp] Starting buffering for 2d281613-0dc9-463f-b608-85b37b9d2712:2a52aeac-46d6-4386-8dc4-07432931f77b
[W 14:08:05.578 LabApp] Could not determine jupyterlab build status without nodejs
[I 14:08:53.990 LabApp] Saving file at /Untitled.ipynb
[I 14:14:14.428 LabApp] Kernel interrupted: 2d281613-0dc9-463f-b608-85b37b9d2712
[I 14:14:20.322 LabApp] Kernel restarted: 2d281613-0dc9-463f-b608-85b37b9d2712
[I 14:14:21.151 LabApp] Saving file at /Untitled.ipynb
[I 14:14:23.810 LabApp] Kernel restarted: 2d281613-0dc9-463f-b608-85b37b9d2712
[I 14:14:37.652 LabApp] Saving file at /Untitled.ipynb
[I 14:16:38.695 LabApp] Saving file at /Untitled.ipynb
[I 14:17:43.209 LabApp] Kernel interrupted: 2d281613-0dc9-463f-b608-85b37b9d2712
[I 14:17:43.944 LabApp] Kernel interrupted: 2d281613-0dc9-463f-b608-85b37b9d2712
[I 14:17:44.107 LabApp] Kernel interrupted: 2d281613-0dc9-463f-b608-85b37b9d2712
[I 14:17:50.511 LabApp] Kernel restarted: 2d281613-0dc9-463f-b608-85b37b9d2712
[I 14:17:50.873 LabApp] Saving file at /Untitled.ipynb
[I 14:17:54.929 LabApp] Saving file at /Untitled.ipynb
[I 14:17:57.405 LabApp] Saving file at /Untitled.ipynb
[I 14:18:14.052 LabApp] Saving file at /Untitled.ipynb
[I 14:18:21.129 LabApp] Saving file at /Untitled.ipynb
[I 14:18:24.933 LabApp] Saving file at /Untitled.ipynb
[I 14:18:49.786 LabApp] Saving file at /Untitled.ipynb
[I 14:27:03.404 LabApp] Kernel restarted: 2d281613-0dc9-463f-b608-85b37b9d2712
[I 14:27:12.398 LabApp] Saving file at /Untitled.ipynb
[I 14:27:28.733 LabApp] Saving file at /Untitled.ipynb
[I 14:27:34.333 LabApp] Saving file at /Untitled.ipynb
[I 14:27:39.981 LabApp] Saving file at /Untitled.ipynb
[I 14:27:46.739 LabApp] Kernel interrupted: 2d281613-0dc9-463f-b608-85b37b9d2712
[I 14:27:53.154 LabApp] Kernel restarted: 2d281613-0dc9-463f-b608-85b37b9d2712
[I 14:27:53.199 LabApp] Starting buffering for 2d281613-0dc9-463f-b608-85b37b9d2712:15d16818-0e08-4922-9b89-54ec020e4d6f
[W 14:27:53.201 LabApp] Got events for closed stream None
[W 14:27:53.202 LabApp] Got events for closed stream None
[W 14:27:53.202 LabApp] Got events for closed stream None
[W 14:27:53.205 LabApp] Got events for closed stream None
[C 16:34:02.201 LabApp] received signal 15, stopping
[I 16:34:02.202 LabApp] Shutting down 1 kernel
[I 16:34:02.604 LabApp] Kernel shutdown: 2d281613-0dc9-463f-b608-85b37b9d2712
@johnkn and I took a look at this: The job was stuck in the X (i.e., REMOVED) state because the Jupyter notebook server was refusing to exit after shutting down its kernels. Here's
We

Force removal is not a good solution to this, because it would not finish cleaning up the notebook server process. I don't think we can safely expose that to users.

The real problem is that HTCondor itself should have eventually killed the process after giving it time to kill itself. It didn't because local universe jobs don't honor max vacate time, even if you set it explicitly (https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=7746).

We could still implement hacky force removal by using process management tools to find the notebook process and kill it. Probably through

Our real hope here is that this problem is actually rare, and that you got incredibly unlucky. @stsievert, would you mind starting up and shutting down your notebook server through Dask-CHTC a few times to see if you hit the same issue again?
A force stop looks for the notebook server process tree and sends it a kill signal. This is a yucky, low-level hack that we will probably need to revisit later.
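For anyone curious what "find the process tree and send it a kill signal" amounts to, here is a standard-library-only sketch. It is not the actual Dask-CHTC implementation. The helper names (`read_process_table`, `process_tree`, `kill_tree`) are invented for illustration, and it assumes a POSIX `ps` that accepts `-e -o pid= -o ppid=`.

```python
import os
import signal
import subprocess


def read_process_table():
    """Return {pid: parent_pid} for every process that `ps` reports."""
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "ppid="],
        capture_output=True, text=True, check=True,
    ).stdout
    table = {}
    for line in out.splitlines():
        pid, ppid = line.split()
        table[int(pid)] = int(ppid)
    return table


def process_tree(root_pid, table):
    """Return root_pid followed by all of its descendants found in `table`."""
    children = {}
    for pid, ppid in table.items():
        children.setdefault(ppid, []).append(pid)
    tree, seen, stack = [], set(), [root_pid]
    while stack:
        pid = stack.pop()
        if pid in seen:
            continue  # guard against self-parented entries
        seen.add(pid)
        tree.append(pid)
        stack.extend(children.get(pid, []))
    return tree


def kill_tree(root_pid, sig=signal.SIGKILL):
    """Signal every process in the tree, descendants before ancestors."""
    for pid in reversed(process_tree(root_pid, read_process_table())):
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass  # process already exited between the listing and the kill
```

Every parent appears in the list before its children, so iterating in reverse signals children first. That makes it less likely that a surviving parent respawns something mid-kill, though the snapshot from `ps` can still go stale between listing and signalling.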
I'm also having difficulty with `dask-chtc jupyter stop`. I started a Jupyter server, then after playing a while, logged out and killed the tmux session running the server (I think? It's also possible it wasn't in tmux). The Jupyter session is still running in the background. I tried following the documentation on methods to stop it:
I created and activated an environment in the meantime that installs Jupyter and Dask-CHTC, which might be the issue.