Skip to content
This repository has been archived by the owner on Oct 26, 2019. It is now read-only.

Browser loses websocket connection, can't reconnect in streaming-meetup demo #156

Closed
parente opened this issue Mar 24, 2016 · 9 comments · May be fixed by rs2/dashboards_server#4
Closed
Assignees
Labels
Milestone

Comments

@parente
Copy link
Member

parente commented Mar 24, 2016

I deployed streaming-meetup demo from https://github.com/jupyter-incubator/dashboards/blob/master/etc/notebooks/stream_demo/meetup-streaming.ipynb to the dashboard server as a bundle. Good news: it started off working just fine (sans a little styling nit on the paper card for some reason):

screen shot 2016-03-24 at 1 40 00 pm

Bad news: after a few seconds, it stopped working. Checking the console I saw an endless cadence of:

screen shot 2016-03-24 at 1 43 09 pm

I left the page and noticed that the kernel did indeed get cleaned up after our idle timeout. But when I come back to the dashboard page anew, the same error appears endlessly no matter how many times I refresh. It's like the Websocket part of the dashboard server just disappeared.

I checked the dashboard server logs, no errors that I can see. Also no problems loading the HTTP resources on refresh. Just the websocket proxy that has gone kaput.

@parente parente added the bug label Mar 24, 2016
@parente parente added this to the 0.5.0 milestone Mar 24, 2016
@parente
Copy link
Member Author

parente commented Mar 24, 2016

Tried again and hit it again. Here's the combined log up to the approximate point of failure.

https://gist.github.com/parente/02b2fe67897a4dacedd1

The Spark logging is noisy, but searching through, I don't see any message from the dashboard server that things have gone awry.

@parente
Copy link
Member Author

parente commented Mar 24, 2016

So even crazier, I left it running after the websocket failure. After 200 or so console log errors about the websocket failure (see above), it reconnected and has been running ever since without a problem (30+ minutes!)

@jhpedemonte
Copy link
Collaborator

Testing using make dev-logging (so not running dashboard server in a container). Have yet to be able to reproduce the issue. Will keep testing.

@parente
Copy link
Member Author

parente commented Apr 4, 2016

I seem to have hit this probably again this AM, but this time on the POST /api/kernels call, not the websocket. The call would hang for some time before the dashboard server logged:

POST /api/kernels?1459777914810 - - ms - -

This looks like proxy upstream timeout.

@parente
Copy link
Member Author

parente commented Apr 12, 2016

Back looking at this today and reproduced it pretty quickly. This time the websocket from browser to dashboard server dropped after 2 pushes from the backend on the meetup demo, jupyter-js-services started trying to reconnect, succeeded after 12 tries, worked for a while longer, then died again later and retried endlessly.

I'm using the docker-compose setup from https://gist.github.com/527cea0481afe9fabbcd. Last time, I was using a docker-machine virtualbox instance. This time I'm using the new docker beta on OSX Hyve. So it doesn't seem to be anything at the docker / VM networking level.

Will try enabling debug logging in the dashboard server to see if anything jumps out. No errors appear in the dashboard server or kernel gateway logs at the default log level.

@parente
Copy link
Member Author

parente commented Apr 12, 2016

Caught it with logging enabled:

Things are working fine to start. The last successful WS message the dashboard server logs is

dashboards_1 | Tue, 12 Apr 2016 12:35:41 GMT dashboard-proxy:server PROXY: received message from client WS: {"header":{"username":"","version":"5.0","session":"25f89eeb7ed9164399c6d649134fe33e","msg_id":"e685a1a4ce2c46ba3a90fc8ff57fd883","msg_type":"comm_msg"},"parent_header":{},"channel":"shell","content":{"comm_id":"b40a2a3b3683ff700903b4e862ad6317","data":{"method":"custom","content":{"event":"change","data":{"name":"venue_data","old_val":[[-38,144,0.03],[-34,115,0.02],[-34,151,0.01],[-27,28,0.02],[-22,-49,0.02],[0,0,0.01],[1,103,0.02],[8,-80,0.01],[12,77,0.01],[24,54,0.02],[25,-81,0.01],[25,46,0.01],[25,55,0.03],[27,-83,0.04],[29,-96,0.04],[29,-83,0.02],[30,-98,0.04],[32,-111,0.03],[32,-97,0.01],[32,-81,0.02],[32,34,0.03],[33,-113,0.02],[33,-98,0.060000000000000005],[33,-81,0.01],[34,-119,0.02],[34,-83,0.01],[34,-79,0.01],[35,-83,0.03],[35,-81,0.01],[35,-79,0.04],[36,-83,0.02],[36,-79,0.02],[37,-123,0.01],[37,-91,0.01],[37,-79,0.04],[38,-78,0.03],[39,-106,0.01],[39,-87,0.01],[39,-77,0.01],[40,-83,0.01],[40,-76,0.01],[40,-75,0.03],[40,-74,0.04],[40,-4,0.02],[41,-89,0.01],[41,-88,0.02],[41,-82,0.01],[41,2,0.01],[41,14,0.01],[42,-88,0.01],[42,-84,0.01],[43,-80,0.060000000000000005],[43,-79,0.03],[43,-78,0.01],[43,11,0.01],[44,-94,0.02],[44,-72,0.01],[44,20,0.01],[45,-123,0.01],[45,5,0.02],[45,9,0.01],[46,6,0.01],[48,2,0.10999999999999999],[49,8,0.01],[50,3,0.01],[50,14,0.01],[51,-115,0.02],[51,-3,0.01],[51,-1,0.25000000000000006],[51,0,0.02],[52,1,0.01],[52,13,0.03],[53,-7,0.02],[53,6,0.02],[55,12,0.02],[56,-4,0.01],[59,30,0.01]],"new_val":[[-38,144,0.03],[-34,115,0.02],[-34,151,0.02],[-27,28,0.02],[-22,-49,0.02],[-10,-36,0.01],[0,0,0.01],[1,103,0.02],[8,-80,0.01],[12,77,0.01],[24,54,0.02],[25,-81,0.01],[25,46,0.01],[25,55,0.03],[26,-81,0.01],[27,-83,0.04],[29,-96,0.04],[29,-83,0.02],[30,-98,0.04],[32,-111,0.03],[32,-97,0.01],[32,-81,0.02],[32,34,0.03],[33,-113,0.02],[33,-98,0.07],[33,-81,0.02],[34,-119,0.02],[34,-83,0.01],[34,-79,0.01],[35,-83,0.03],[35,-81,0.01],[35,-79,0.04],[36,-83,0.02],[36,-79,0.02],[37,-123,0.01],[37,-91,0.01],[37,-79,0.04],[38,-78,0.03],[39,-106,0.01],[39,-87,0.01],[39,-77,0.01],[40,-83,0.01],[40,-76,0.01],[40,-75,0.03],[40,-74,0.04],[40,-4,0.02],[41,-89,0.01],[41,-88,0.02],[41,-82,0.01],[41,2,0.01],[41,14,0.01],[42,-88,0.02],[42,-84,0.01],[43,-80,0.060000000000000005],[43,-79,0.03],[43,-78,0.01],[43,11,0.01],[44,-94,0.02],[44,-72,0.02],[44,20,0.01],[45,-123,0.01],[45,5,0.02],[45,9,0.01],[46,6,0.01],[48,2,0.10999999999999999],[49,8,0.01],[50,3,0.01],[50,14,0.01],[51,-115,0.02],[51,-3,0.01],[51,-1,0.26000000000000006],[51,0,0.02],[52,1,0.01],[52,13,0.03],[53,-7,0.02],[53,6,0.02],[55,12,0.02],[56,-4,0.01],[59,18,0.02],[59,30,0.01],[60,24,0.01]],"channel":"venues"}}}},"metadata":{},"buffers":[]}

https://gist.github.com/parente/ced8fa9a6f3808bc6ee85788d978ed0a#file-docker-compose-log-L273

The next two messages from the dashboard server, which have no timestamps unfortunately, appear to come soon after:

dashboards_1 | HTTP 18: server socket close
dashboards_1 | HTTP 18: server socket close

https://gist.github.com/parente/ced8fa9a6f3808bc6ee85788d978ed0a#file-docker-compose-log-L284

The kernel gateway logs no error and its SparkContext continues streaming merrily. A few moments later (which could be milliseconds or 4 seconds later, hard to tell because of the lack of timestamp on the server socket close messages), the dashboard server logs:

dashboards_1 | Tue, 12 Apr 2016 12:35:45 GMT dashboard-proxy:server:error insufficient data for frame
dashboards_1 | Tue, 12 Apr 2016 12:35:45 GMT dashboard-proxy:server:error insufficient data for frame
dashboards_1 | Tue, 12 Apr 2016 12:35:45 GMT dashboard-proxy:server:error an error occurred during parsing of WS data Control frames must not be fragmented.
dashboards_1 | Tue, 12 Apr 2016 12:35:45 GMT dashboard-proxy:server:error insufficient data for frame
dashboards_1 | Tue, 12 Apr 2016 12:35:45 GMT dashboard-proxy:server:error an error occurred during parsing of WS data Control frames must not be fragmented.

https://gist.github.com/parente/ced8fa9a6f3808bc6ee85788d978ed0a#file-docker-compose-log-L500

These could be interpreted in one of two ways, depending on how the various callbacks are ordering the log messages:

  1. The unexpected server socket close causes the next websocket operation to fail with partial data.
  2. The websocket frame parsing problems ultimately trigger the socket close but the close logs first.

Thereafter, the dashboard server enters its normal shutdown flow:

dashboards_1 | Tue, 12 Apr 2016 12:35:45 GMT dashboard-proxy:server PROXY: WS will kill kernel 3032acc6-9b4a-412e-a0d8-79f06c7f694f session 25f89eeb7ed9164399c6d649134fe33e soon
dashboards_1 | Tue, 12 Apr 2016 12:35:45 GMT dashboard-proxy:server PROXY: WS client disconnected

https://gist.github.com/parente/ced8fa9a6f3808bc6ee85788d978ed0a#file-docker-compose-log-L539

Given that these messages say the WS client disconnected, I suspect that the two server socket close messages above represent both the connection to the kernel gateway (websocket #1) and the connection to the browser (websocket #2).

Epilogue: Before I could kill the server, the jupyter-js-services code starts its reconnect logic. At the very end of the log, there is a new WS upgrade request:

https://gist.github.com/parente/ced8fa9a6f3808bc6ee85788d978ed0a#file-docker-compose-log-L972

It is immediately followed by a bunch more messages about insufficient data for frame, server socket close, and client sockets being removed. This appears to be the pattern that repeats over and over as the browser tries to reestablish a connection with the node server, succeeds momentarily, then immediately fails and tries again.

@parente
Copy link
Member Author

parente commented Apr 12, 2016

One thing interesting here is that the very first message logged by the dashboard server in the comment above has the prefix:

PROXY: received message from client WS

and comes from the substituteCodeCell function. But this is almost certainly a message coming from the kernel not the client. It looks like the if(eventName === 'data') check in the upgrade handler for WS is not enough to distinguish client-to-kernel messages from kernel-to-client ones. This could certainly wreak havoc if we assume we're only touching client-originated messages.

@parente
Copy link
Member Author

parente commented Apr 12, 2016

Caught it again, and this time, I only got:

dashboards_1 | Tue, 12 Apr 2016 13:41:40 GMT dashboard-proxy:server:error insufficient data for frame
dashboards_1 | Tue, 12 Apr 2016 13:41:40 GMT dashboard-proxy:server:error insufficient data for frame
dashboards_1 | Tue, 12 Apr 2016 13:41:40 GMT dashboard-proxy:server:error an error occurred during parsing of WS data Control frames must not be fragmented.
dashboards_1 | Tue, 12 Apr 2016 13:41:40 GMT dashboard-proxy:server:error an error occurred during parsing of WS data Illegal control frame longer than 125 bytes.
dashboards_1 | Tue, 12 Apr 2016 13:41:40 GMT dashboard-proxy:server PROXY: WS will kill kernel 225a8fc4-f6b9-49af-8cfa-01bee59f7210 session 3cd1d5eb37bb7148904eb7e11406dc67 soon

Without the HTTP 18: server socket close bit.

And a little bit later on retry:

dashboards_1      | Tue, 12 Apr 2016 13:41:48 GMT dashboard-proxy:server:error insufficient data for frame
dashboards_1      | Tue, 12 Apr 2016 13:41:48 GMT dashboard-proxy:server:error insufficient data for frame
dashboards_1      | Tue, 12 Apr 2016 13:41:48 GMT dashboard-proxy:server PROXY: received message from client WS: ���������E������ ���E�ƿK������  �޿R�ԿK������  �޿T������T���P�Х^������V�ԫ�ҪE�������޿U�ר
dashboards_1      | Tue, 12 Apr 2016 13:41:48 GMT dashboard-proxy:server:error insufficient data for frame
dashboards_1      | Tue, 12 Apr 2016 13:41:48 GMT dashboard-proxy:server PROXY: received message from client WS: 06.
                                                                                                                    +KJ}:97
                                                                                                                           :\$BD1
    1                                                                                                                        DeK:DsK
        +KT"EL
              :>DeJ}
                    9
                     }S5;"#t��j�ȏ���
                                    ���H������
dashboards_1      | Tue, 12 Apr 2016 13:41:48 GMT dashboard-proxy:server PROXY: received message from client WS:WIGPYQ
                                                                                                                      _P

Yea. Framing.

@parente
Copy link
Member Author

parente commented Apr 15, 2016

Noted in #181 that it fixes this problem. Closing this as a dupe of #97. Will tested #181 again using the case reported here when that PR is ready to merge.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants