Exception writing to websocket #516

Closed
amangarg96 opened this issue Dec 5, 2018 · 7 comments

@amangarg96

I am using Jupyter Enterprise Gateway in YARN Cluster mode, with a slight modification.

An NGINX proxy is used in front of multiple Jupyter Enterprise Gateway servers and routes users to different gateway servers by hashing the client machine's hostname. (This was done as a quick fix for #86.)

When the connection to the Gateway server is lost, the Notebook server tries to reconnect to the previous kernel. The Notebook server logs are as follows:

[I 19:11:55.965 LabApp] Request kernel at: /api/kernels/fd5ac87c-378a-4531-a25d-8b0e858077a3
[I 19:11:56.058 LabApp] Kernel retrieved: {u'connections': 1, u'last_activity': u'2018-12-04T13:26:03.683070Z', u'execution_state': u'idle', u'id': u'fd5ac87c-378a-4531-a25d-8b0e858077a3', u'name': u'mlp.mlc.linux_debian_8_11.py27.mlc-base-xgboost'}
[I 19:11:58.964 LabApp] Request list kernel specs at: /api/kernelspecs
[I 19:17:41.815 LabApp] Request kernel at: /api/kernels/fd5ac87c-378a-4531-a25d-8b0e858077a3
[I 19:17:41.898 LabApp] Kernel retrieved: {u'connections': 0, u'last_activity': u'2018-12-04T13:26:03.683070Z', u'execution_state': u'idle', u'id': u'fd5ac87c-378a-4531-a25d-8b0e858077a3', u'name': u'mlp.mlc.linux_debian_8_11.py27.mlc-base-xgboost'}
[I 19:18:11.819 LabApp] Request kernel at: /api/kernels/fd5ac87c-378a-4531-a25d-8b0e858077a3
[I 19:18:11.901 LabApp] Kernel retrieved: {u'connections': 0, u'last_activity': u'2018-12-04T13:26:03.683070Z', u'execution_state': u'idle', u'id': u'fd5ac87c-378a-4531-a25d-8b0e858077a3', u'name': u'mlp.mlc.linux_debian_8_11.py27.mlc-base-xgboost'}
[I 19:18:22.711 LabApp] Request kernel at: /api/kernels/fd5ac87c-378a-4531-a25d-8b0e858077a3
[I 19:18:22.798 LabApp] Kernel retrieved: {u'connections': 0, u'last_activity': u'2018-12-04T13:26:03.683070Z', u'execution_state': u'idle', u'id': u'fd5ac87c-378a-4531-a25d-8b0e858077a3', u'name': u'mlp.mlc.linux_debian_8_11.py27.mlc-base-xgboost'}
[E 19:19:16.312 LabApp] Exception writing message to websocket:
[E 19:19:17.704 LabApp] Exception writing message to websocket:
[E 19:19:18.872 LabApp] Exception writing message to websocket:
[I 19:19:28.890 LabApp] Saving file at .ipynb
[W 19:19:28.891 LabApp] Notebook .ipynb is not trusted
[I 19:19:36.813 LabApp] Request list kernel specs at: /api/kernelspecs
[I 19:20:37.810 LabApp] Request list kernel specs at: /api/kernelspecs
[E 19:20:39.373 LabApp] Exception writing message to websocket:

The Notebook server seems to be able to contact the kernel through REST calls, but it is not able to connect to the websocket.

The notebook shows the kernel as active (in the idle state), but it doesn't execute cells.

The following step is what is missing from the reconnection attempt, right?

Connecting to ws://10.33.11.68:9090/api/kernels/fd5ac87c-378a-4531-a25d-8b0e858077a3/channels

Does it have something to do with the NGINX proxy?

Also, which messages are sent through websockets? Is it possible to switch the Notebook server and Enterprise Gateway to plain HTTP (since the REST calls are working just fine)? If so, what notebook functionality would be affected?
Pointers to the documentation would also help.

@kevin-bates
Member

@amangarg96 - thanks for the issue - another interesting issue from you.

Use of a reverse proxy is something we recommend, so this shouldn't be a problem. In addition, your use of the client machine as the affinity key seems fine as well.

I'm curious what your EG log indicates during this period. Perhaps there's something there, or, for that matter, in the kernel-specific logs maintained in YARN. Please check those for any clues. Also, have you tried a forced reconnect operation from the notebook?

Regarding the separation of duties between HTTP and WS: the HTTP requests essentially invoke the various manager classes to get, start, interrupt, etc. a specific kernel instance (or all instances). These requests do not (necessarily) go directly to the kernel process itself. (Of course, things like interrupt or restart will implicitly trigger interaction with the kernel process.)
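
As a small illustration of that HTTP side (a sketch using the requests package, not EG code; the gateway address and kernel id are placeholders taken from your logs, and any auth token is omitted), polling a kernel's model looks roughly like this:

# Sketch: poll a kernel's model through the Gateway's REST API.
# Address and kernel id are placeholders from the logs above; add an
# Authorization header if your deployment requires a token.
import requests

base_url = "http://10.33.11.68:9090"
kernel_id = "fd5ac87c-378a-4531-a25d-8b0e858077a3"

resp = requests.get("{}/api/kernels/{}".format(base_url, kernel_id))
resp.raise_for_status()
model = resp.json()
print(model["execution_state"], model["connections"])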

The WS request handler communicates directly with the kernel. The channel embedded in the JSON body of the message indicates to which ZMQ port the request should be posted. Taking a look at our gateway_client.py file might help shed some light here. Unfortunately, I'm not enough of a web developer to answer your question regarding a switch, although I suspect there are reasons it wasn't done that way in the first place. A quick search on their differences gives good reason for why WS is used - in particular, it's bi-directional, full-duplex, and has far less overhead.
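
To illustrate the message shape (a rough sketch, not the project's gateway_client.py; it assumes the websocket-client package and that no auth token is required), an execute request sent over the /channels websocket looks roughly like this - the channel field is what selects the ZMQ socket:

# Sketch: send an execute_request to a kernel over the /channels websocket.
# URL and kernel id are placeholders; the "channel" field routes the message
# to the corresponding ZMQ socket (shell, control, stdin).
import json
import uuid
from websocket import create_connection  # pip install websocket-client

kernel_id = "fd5ac87c-378a-4531-a25d-8b0e858077a3"
ws = create_connection("ws://10.33.11.68:9090/api/kernels/{}/channels".format(kernel_id))

execute_request = {
    "header": {
        "msg_id": uuid.uuid4().hex,
        "msg_type": "execute_request",
        "username": "",
        "session": uuid.uuid4().hex,
        "version": "5.0",
    },
    "parent_header": {},
    "metadata": {},
    "content": {"code": "print('hello')", "silent": False},
    "channel": "shell",  # posted to the kernel's shell ZMQ port
}
ws.send(json.dumps(execute_request))

# Replies (status, execute_input, stream, execute_reply) come back as JSON
# frames tagged with the channel they originated from.
reply = json.loads(ws.recv())
print(reply.get("channel"), reply["header"]["msg_type"])
ws.close()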

One area that will make things a little difficult, should you find you need to make changes, is that the EG doesn't define any handlers - all are inherited from the Kernel Gateway and Notebook projects, so this may open a can of worms for you. That said, it would be fine to define a subclass in EG that derives from the class you need to change, assuming that a change of that magnitude is warranted and can be done in a relatively clean way.
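
As a rough sketch of that subclassing approach (the class and module names below, e.g. ZMQChannelsHandler and _kernel_id_regex, come from the Notebook package and should be verified against the versions you have installed), an override might look like:

# Sketch: subclass the Notebook's websocket channels handler to add extra
# logging around failed websocket writes, and map it to the same URL pattern.
from notebook.services.kernels.handlers import ZMQChannelsHandler, _kernel_id_regex


class PatchedChannelsHandler(ZMQChannelsHandler):
    """Example override that logs details when a websocket write fails."""

    def write_message(self, message, binary=False):
        try:
            return super(PatchedChannelsHandler, self).write_message(message, binary=binary)
        except Exception:
            self.log.exception("Failed writing message to websocket for kernel %s",
                               getattr(self, "kernel_id", "<unknown>"))
            raise


# The subclass would then be registered in place of the inherited handler.
default_handlers = [
    (r"/api/kernels/%s/channels" % _kernel_id_regex, PatchedChannelsHandler),
]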

@amangarg96
Author

I reproduced the above error and checked the kernel-specific logs in YARN. The stdout and stderr look fine to me, and I'm putting them here for your reference:

stdout:

Using connection file '/tmp/kernel-0bc9ede3-229d-48b2-8025-5aaac8607002_IPOgwT.json' instead of '/root/.local/share/jupyter/runtime/kernel-0bc9ede3-229d-48b2-8025-5aaac8607002.json'
Signal socket bound to host: 0.0.0.0, port: 41173
JSON Payload '{"stdin_port": 16842, "pgid": "2541", "ip": "0.0.0.0", "pid": "2593", "control_port": 9208, "hb_port": 46206, "signature_scheme": "hmac-sha256", "key": "11002c92-6daa-47f8-908f-d26d611eb300", "comm_port": 41173, "kernel_name": "", "shell_port": 62515, "transport": "tcp", "iopub_port": 13844}
Encrypted Payload '1jWOPQH9wzE/6bjr7pJAv9bAC9MXhkvEMx0iDcZlniOyHkrRWl4e7avhMfD15yectln5b6uwhi3HGzukyqe+85BjboBlkzKNpVGRY56Y7Qf6i6k253NV2aWOi3V/9ry//bHXXR7pg5XIqxVzyQgzFl5xH+Edam8n9irNS6a1tnjtYcBQ/eH52LYiH2gtWe60JCcj2xAFNIteypVgZCrVgJYufow2RYLnlsCQAK1WLNaLPf02DehBmjtw/PfDEi0zHl4RDRPbLaG2lCzTnx3VHyADezyK3zXAhmblt55QA9tvmfMsCiB3HUaRWnfOlMTSfkSSdbXFdn0oQE0o9jLDCLw2ppAD9Cw+BQ8KNrXoH0DHIPwSFOPCj2TcQY3nZ1VV++Z9vUV92MADyTkKuBXxWA==
/grid/1/yarn/local/usercache/fk-mlp-user/appcache/application_1543928474139_24402/container_e238_1543928474139_24402_01_000001/mlp.mlc.Linux_debian_8_11.py27.mlc-base-xgboost.tar.gz/lib/python2.7/site-packages/IPython/paths.py:69: UserWarning: IPython parent '/home' is not a writable location, using a temp directory.
" using a temp directory.".format(parent))
NOTE: When using the ipython kernel entry point, Ctrl-C will not work.

To exit, you will have to explicitly quit this process, by either sending
"quit" from a client, or using Ctrl-\ in UNIX-like environments.

To read more about this, see ipython/ipython#2049

To connect another client to this kernel, use:
--existing /tmp/kernel-0bc9ede3-229d-48b2-8025-5aaac8607002_IPOgwT.json

stderr:

YARN executor launch context:
env:
CLASSPATH -> {{PWD}}{{PWD}}/spark_conf{{PWD}}/spark_libs/$HADOOP_CONF_DIR/usr/hdp/current/hadoop-client//usr/hdp/current/hadoop-client/lib//usr/hdp/current/hadoop-hdfs-client//usr/hdp/current/hadoop-hdfs-client/lib//usr/hdp/current/hadoop-yarn-client//usr/hdp/current/hadoop-yarn-client/lib/$PWD/mr-framework/hadoop/share/hadoop/mapreduce/:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/:$PWD/mr-framework/hadoop/share/hadoop/common/:$PWD/mr-framework/hadoop/share/hadoop/common/lib/:$PWD/mr-framework/hadoop/share/hadoop/yarn/:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/:$PWD/mr-framework/hadoop/share/hadoop/hdfs/:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:/usr/hdp/current/hadoop-client/lib/hadoop-lzo-0.6.0.2.4.0.0-169.jar:/etc/hadoop/conf/secure
SPARK_YARN_STAGING_DIR -> *********(redacted)
SPARK_USER -> *********(redacted)
SPARK_YARN_MODE -> true
PYTHONPATH -> {{PWD}}/pyspark.zip{{PWD}}/py4j-0.10.4-src.zip{{PWD}}/mlsdk.zip

command:
{{JAVA_HOME}}/bin/java \
-server \
-Xmx10240m \
-Djava.io.tmpdir={{PWD}}/tmp \
-Dspark.yarn.app.container.log.dir=<LOG_DIR> \
-XX:OnOutOfMemoryError='kill %p' \
org.apache.spark.executor.CoarseGrainedExecutorBackend \
--driver-url \
spark://[email protected]:11004 \
--executor-id \
\
--hostname \
\
--cores \
1 \
--app-id \
application_1543928474139_24402 \
--user-class-path \
file:$PWD/app.jar \
1><LOG_DIR>/stdout \
2><LOG_DIR>/stderr

resources:
py4j-0.10.4-src.zip -> resource { scheme: "hdfs" host: "krios" port: -1 file: "/user/fk-mlp-user/.sparkStaging/application_1543928474139_24402/py4j-0.10.4-src.zip" } size: 74096 timestamp: 1544082647339 type: FILE visibility: PRIVATE
mlp.mlc.Linux_debian_8_11.py27.mlc-base-xgboost.tar.gz -> resource { scheme: "hdfs" host: "krios" port: -1 file: "/user/fk-mlp-user/.sparkStaging/application_1543928474139_24402/mlp.mlc.Linux_debian_8_11.py27.mlc-base-xgboost.tar.gz" } size: 1683630994 timestamp: 1544082645796 type: ARCHIVE visibility: PRIVATE
spark_conf -> resource { scheme: "hdfs" host: "krios" port: -1 file: "/user/fk-mlp-user/.sparkStaging/application_1543928474139_24402/spark_conf.zip" } size: 102062 timestamp: 1544082647415 type: ARCHIVE visibility: PRIVATE
pyspark.zip -> resource { scheme: "hdfs" host: "krios" port: -1 file: "/user/fk-mlp-user/.sparkStaging/application_1543928474139_24402/pyspark.zip" } size: 482687 timestamp: 1544082645943 type: FILE visibility: PRIVATE
spark_libs -> resource { scheme: "hdfs" host: "krios" port: -1 file: "/user/fk-mlp-user/.sparkStaging/application_1543928474139_24402/__spark_libs__8138412549744268625.zip" } size: 203821116 timestamp: 1544082638970 type: ARCHIVE visibility: PRIVATE
mlsdk.zip -> resource { scheme: "hdfs" host: "krios" port: -1 file: "/user/fk-mlp-user/notebooks/mlsdk/mlsdk.zip" } size: 113787 timestamp: 1543586447744 type: FILE visibility: PUBLIC

===============================================================================
18/12/06 13:23:35 INFO yarn.YarnRMClient: Registering the ApplicationMaster
18/12/06 13:23:35 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
18/12/06 13:23:35 INFO yarn.ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals

@amangarg96
Author

amangarg96 commented Dec 6, 2018

I hadn't tried the reconnect option because I couldn't find it in JupyterLab, so I launched the same notebook (kernel) in the classic Notebook interface (through JupyterLab).
Then I tried reconnect and it worked!

I got the "connecting to websocket" log in Notebook server too

[I 14:43:05.903 LabApp] Connecting to ws://10.34.162.133:9090/api/kernels/0bc9ede3-229d-48b2-8025-5aaac8607002/channels

It's surprising that the reconnect option has been left out of JupyterLab.

This seems to have solved my use case. Is there anything else we should troubleshoot?

Update: the 'Reconnect to kernel' option is available in JupyterLab's command palette.

@kevin-bates
Member

Thanks for that update. I was just about to post a question on the jupyterlab gitter forum. They sure don't make that easy to find!

Are you satisfied with this behavior? We'll likely revisit this area when we go to implement a robust HA solution.

@amangarg96
Author

When the Notebook server polls the state of the kernel (through REST calls), it should ideally attempt to reconnect to the kernel (over websockets) too.
A reconnect attempt is (I think) a harmless activity that does not interfere with the state of the kernel, so it should be invoked.

@kevin-bates
Member

Sounds like a good suggestion/contribution to the Notebook server. 😃

Since this is sounding more like a client-side issue, I'm inclined to close this issue for now. Should any activity occur in Notebook/Lab, we can post a reference here.

Are you okay with closure?

@amangarg96
Author

Yes! I'm happy with the resolution. Thanks for the help :)
