
rosbridge_websocket: WebSocket opening handshake timed out #403

Closed
Alabate opened this issue Apr 25, 2019 · 11 comments

Alabate commented Apr 25, 2019

Expected Behavior

rosbridge_websocket should never stop accepting new connections.

Actual Behavior

We use rosbridge_websocket with our web app (via roslibjs). Every time we start the bridge and our app, everything works as expected. However, after a random amount of time, rosbridge_websocket stops accepting new connections and we get the following error (in the Chrome web console):

WebSocket connection to 'ws://localhost:9090/' failed: WebSocket opening handshake timed out

Once the rosbridge is in this state, we've tried to connect from another browser and even from a bare websocket client, and we get no response from the bridge.

If we check the logs of the bridge, we don't see the problem itself, but there is some interesting data:

[rospy.internal][INFO] 2019-04-25 15:39:44,522: topic impl's ref count is zero, deleting topic /xxxx...
[rospy.internal][INFO] 2019-04-25 15:39:44,522: topic[/xxxx] removing connection to /yyyy

But we never get the usual

[rosout][INFO] 2019-04-25 15:39:01,735: Client disconnected. 0 clients total.

line afterwards.

Please note that our web app automatically reconnects when the connection is lost, which may lead to very fast disconnection/reconnection cycles.

More data that could help with troubleshooting:

The bridge seems to think that there is still a client connected:

rostopic echo /connected_clients          
clients: 
  - 
    ip_address: "127.0.0.1"
    connection_time: 
      secs: 1556199542
      nsecs: 117264986
---
rostopic echo /client_count
data: 1
---

The bridge is still registered as listening in the Linux network stack:

ss -lp | grep 9090
tcp   LISTEN   9    128   0.0.0.0:9090   0.0.0.0:*   users:(("python",pid=14397,fd=13))
tcp   LISTEN   10   128   [::]:9090      [::]:*      users:(("python",pid=14397,fd=14))

Specifications

  • ROS Version: melodic
  • OS Version: bionic
  • Rosbridge Version: 0.11.0 (we've just updated, but the issue is not new)
  • Tornado Version: 4.5.3

Alabate commented May 1, 2019

I've got more information: I managed to reproduce the bug on our project. As a workaround, I've created a node that can kill the rosbridge so that roslaunch restarts it.

Interesting facts:

  • rosnode ping still works
  • I tried to kill the node with rosnode.kill_nodes, which just unregisters the node from the master but doesn't kill it.
  • So I tried to kill it with signals: INT didn't work, but QUIT did.

If someone has the same problem, here is the node I wrote: https://gist.github.com/Alabate/9b2467018503e8d96c366e3448d34d2c
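For context, the approach looks roughly like this (a simplified sketch, not the exact content of the gist; it assumes rosbridge listens on localhost:9090, that pgrep -f rosbridge_websocket finds the right process, and that roslaunch respawn is enabled so the bridge comes back after being killed):

#!/usr/bin/env python
# Hypothetical watchdog sketch: probe the WebSocket handshake and SIGQUIT
# the bridge if it stops answering (SIGINT was not enough, see above).
import os
import signal
import socket
import subprocess
import time

HOST, PORT = 'localhost', 9090
HANDSHAKE = (
    b'GET / HTTP/1.1\r\n'
    b'Host: localhost:9090\r\n'
    b'Upgrade: websocket\r\n'
    b'Connection: Upgrade\r\n'
    b'Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==\r\n'
    b'Sec-WebSocket-Version: 13\r\n\r\n'
)

def handshake_answers(timeout=5.0):
    """Return True if the bridge answers a WebSocket upgrade before the timeout."""
    try:
        sock = socket.create_connection((HOST, PORT), timeout=timeout)
        sock.settimeout(timeout)
        sock.sendall(HANDSHAKE)
        return len(sock.recv(1)) > 0
    except (socket.error, socket.timeout):
        return False

def kill_rosbridge():
    """Send SIGQUIT to every process whose command line matches rosbridge_websocket."""
    pids = subprocess.check_output(
        ['pgrep', '-f', 'rosbridge_websocket']).decode().split()
    for pid in pids:
        os.kill(int(pid), signal.SIGQUIT)

if __name__ == '__main__':
    while True:
        if not handshake_answers():
            kill_rosbridge()
        time.sleep(10)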

I know it's not an easy bug to troubleshoot because I don't know how to trigger it, but I'd like some help gathering more information.


Alabate commented May 1, 2019

I managed to trigger it manually! I've made a Python script that connects to the websocket, subscribes to /client_count and waits for an answer. If I run it sequentially, or with only two processes or fewer in the pool, rosbridge never fails.

With 3 processes or more, it fails almost instantly. The "test" script stops printing, and if you kill it, you can see that /client_count is stuck at 4 clients and you cannot connect to the websocket anymore.

Here is the script with 8 processes in the pool:

#!/usr/bin/env python3

import asyncio

import websockets

from multiprocessing import Pool


class RosbridgeTester():

    def __init__(self):
        # Hammer the bridge with 100 short-lived connections spread over
        # 8 concurrent worker processes.
        with Pool(8) as p:
            p.map(self.hello_process, range(100))

    def hello_process(self, i):
        asyncio.get_event_loop().run_until_complete(self.hello(i))

    async def hello(self, i=0):
        # Connect, subscribe to /client_count, wait for one message, then let
        # the context manager close the connection.
        async with websockets.connect('ws://localhost:9090') as websocket:
            print(i)
            req = '{ "op": "subscribe", "topic": "/client_count", "type": "std_msgs/Int32" }'

            await websocket.send(req)
            print(f"{i} > {req}")

            res = await websocket.recv()
            print(f"{i} < {res}")


if __name__ == "__main__":
    rosbridge_tester = RosbridgeTester()

Here is the full log:

registered capabilities (classes):
 - rosbridge_library.capabilities.call_service.CallService
 - rosbridge_library.capabilities.advertise.Advertise
 - rosbridge_library.capabilities.publish.Publish
 - rosbridge_library.capabilities.subscribe.Subscribe
 - <class 'rosbridge_library.capabilities.defragmentation.Defragment'>
 - rosbridge_library.capabilities.advertise_service.AdvertiseService
 - rosbridge_library.capabilities.service_response.ServiceResponse
 - rosbridge_library.capabilities.unadvertise_service.UnadvertiseService
[INFO] [1556743633.869426] /rosbridge_websocket: Rosbridge WebSocket server started on port 9090
[INFO] [1556743643.466776] /rosbridge_websocket: Client connected.  1 clients total.
[INFO] [1556743643.470309] /rosbridge_websocket: Client connected.  2 clients total.
[INFO] [1556743643.475697] /rosbridge_websocket: Client connected.  3 clients total.
[INFO] [1556743643.477882] /rosbridge_websocket: Client connected.  4 clients total.
[INFO] [1556743643.480826] /rosbridge_websocket: Client connected.  5 clients total.
[INFO] [1556743643.484413] /rosbridge_websocket: Client connected.  6 clients total.
[INFO] [1556743643.487320] /rosbridge_websocket: Client connected.  7 clients total.
[INFO] [1556743643.489502] /rosbridge_websocket: Client connected.  8 clients total.
[INFO] [1556743643.870120] /rosbridge_websocket: [Client 0] Subscribed to /client_count
[INFO] [1556743643.871343] /rosbridge_websocket: [Client 1] Subscribed to /client_count
[INFO] [1556743643.872329] /rosbridge_websocket: [Client 2] Subscribed to /client_count
[INFO] [1556743643.873154] /rosbridge_websocket: [Client 3] Subscribed to /client_count
[INFO] [1556743643.874017] /rosbridge_websocket: [Client 4] Subscribed to /client_count
[INFO] [1556743643.874924] /rosbridge_websocket: [Client 5] Subscribed to /client_count
[INFO] [1556743643.876352] /rosbridge_websocket: [Client 6] Subscribed to /client_count
[INFO] [1556743643.877756] /rosbridge_websocket: [Client 7] Subscribed to /client_count
[INFO] [1556743643.898046] /rosbridge_websocket: Client disconnected. 7 clients total.
[INFO] [1556743643.900261] /rosbridge_websocket: Client disconnected. 6 clients total.
[INFO] [1556743643.901702] /rosbridge_websocket: Client disconnected. 5 clients total.
[INFO] [1556743643.903092] /rosbridge_websocket: Client disconnected. 4 clients total.
[INFO] [1556743643.904278] /rosbridge_websocket: Client disconnected. 3 clients total.
[INFO] [1556743643.910097] /rosbridge_websocket: Client connected.  4 clients total.
[INFO] [1556743643.913085] /rosbridge_websocket: Client connected.  5 clients total.
[INFO] [1556743643.915693] /rosbridge_websocket: Client connected.  6 clients total.
[INFO] [1556743643.918273] /rosbridge_websocket: Client connected.  7 clients total.
[INFO] [1556743643.922729] /rosbridge_websocket: Client connected.  8 clients total.
[INFO] [1556743643.925311] /rosbridge_websocket: Client connected.  9 clients total.
[INFO] [1556743643.929019] /rosbridge_websocket: Client connected.  10 clients total.
[INFO] [1556743643.931380] /rosbridge_websocket: Client connected.  11 clients total.
[WARN] [1556743643.933411] /rosbridge_websocket: WebSocketClosedError: Tried to write to a closed websocket
[WARN] [1556743643.935495] /rosbridge_websocket: WebSocketClosedError: Tried to write to a closed websocket
[WARN] [1556743643.936958] /rosbridge_websocket: WebSocketClosedError: Tried to write to a closed websocket
[WARN] [1556743643.938208] /rosbridge_websocket: WebSocketClosedError: Tried to write to a closed websocket
[WARN] [1556743643.939801] /rosbridge_websocket: WebSocketClosedError: Tried to write to a closed websocket
[WARN] [1556743643.942148] /rosbridge_websocket: WebSocketClosedError: Tried to write to a closed websocket
[WARN] [1556743643.943610] /rosbridge_websocket: WebSocketClosedError: Tried to write to a closed websocket
[WARN] [1556743643.945090] /rosbridge_websocket: WebSocketClosedError: Tried to write to a closed websocket
[WARN] [1556743643.946527] /rosbridge_websocket: WebSocketClosedError: Tried to write to a closed websocket
[WARN] [1556743643.947879] /rosbridge_websocket: WebSocketClosedError: Tried to write to a closed websocket
[INFO] [1556743643.949612] /rosbridge_websocket: Client disconnected. 10 clients total.

Alabate added a commit to Alabate/rosbridge_suite that referenced this issue Jun 8, 2019

Alabate commented Jun 8, 2019

I've done a little more digging, and the hanging function is finish() from subscribe.py. This method hangs because on_msg takes the handler_lock and never releases it, since its content hangs. If we follow the hanging call, we come back to self.outgoing(serialized) in protocol.py, which in our case is set to send_message from websocket_handler.py.

Now that we are in websocket_handler.py, we can see that the problem seems to come from self._write_lock, which stays locked forever in some cases:

# From src/rosbridge_suite/rosbridge_server/src/rosbridge_server/websocket_handler.py
            with self._write_lock:
                yield self.write_message(message, binary)

It seems that the lock stays locked from the moment the yield is reached until the next call into the generator. I wasn't sure about the behavior of a yield inside a with lock, so I made a little script to test it, and it confirmed my thoughts. So I made a fix in PR #408.

import threading
lock = threading.Lock()

def check_locked():
    if lock.acquire(False):
        print("Not locked")
        lock.release()
    else:
        print("LOCKED")

def gen_func():
    yield 0
    with lock:
        yield 1
    yield 2
gen = gen_func()

# At first, the lock shouldn't be locked
check_locked()

# It shouldn't be locked after the first yield either
print(next(gen))
check_locked()

# What happens at the second yield, inside the `with`?
print(next(gen))
check_locked()

# The last one shouldn't be locked either
print(next(gen))
check_locked()

And the output:

Not locked
0
Not locked
1
LOCKED
2
Not locked

Can be closed if #408 is accepted.


mvollrath commented Jun 10, 2019

There is another condition where the with block exits while yielding: when the generator is garbage collected. This is probably why it works most of the time but not always. Tornado is probably holding onto a reference, or GC is just asleep. Either way, the logic I wrote did not account for this.

#!/usr/bin/env python3

import gc
import threading
lock = threading.Lock()

def check_locked():
    if lock.acquire(False):
        print("Not locked")
        lock.release()
    else:
        print("LOCKED")

def gen_func():
    with lock:
        yield 0

gen = gen_func()

# At first, the lock shouldn't be locked
check_locked()

# We expect it is locked here
print(next(gen))
check_locked()

# But not after garbage collection
gen = None
gc.collect()
check_locked()

And the output:

Not locked
0
LOCKED
Not locked

mvollrath commented:

The return value of self.write_message is a Future, according to the docs; it seems like you could do something to wait for that Future inside the with block before yielding it on.

mvollrath commented:

For example, if you lock, exhaust the inner generator with a for loop, then unlock and yield all of those values again, that should do the trick. We expect that the inner generator will yield somewhere between 0 and 1 items. Maybe only the first item matters?

Don't you wish Tornado had features to solve this problem?


Alabate commented Jun 11, 2019

exhaust the inner generator with a for loop then unlock

I don't see any 'inner generator'; do you mean self.write_message? It only returns a single Future, so it's not really a generator, right?

Anyway, the idea is the same: we need to avoid relying on the garbage collector and ensure release() is always called after the Future is done. We can't really block the coroutine by waiting inside it, because we are not supposed to have blocking functions in coroutines (as the Tornado docs say). I've found two possible ways to do it.

I managed to do a POC with add_done_callback which seems to work. Do you see any flaws in this solution?

Interesting fact: write_message sometimes returns None. I guess it's when the socket is closed.

    @coroutine
    def prewrite_message(self, message, binary):
        self._write_lock.acquire()
        # Use a try block because the log decorator doesn't cooperate with @coroutine.
        try:
            future_handle = self.write_message(message, binary)
        except WebSocketClosedError:
            rospy.logwarn('WebSocketClosedError: Tried to write to a closed websocket')
            raise
        except BadYieldError:
            # Tornado <4.5.0 doesn't like its own yield and raises BadYieldError.
            # This does not affect functionality, so pass silently only in this case.
            if tornado_version_info < (4, 5, 0, 0):
                pass
            else:
                _log_exception()
                raise
        except:
            _log_exception()
            raise
        finally:
            self._write_lock.release()
    
        # Ensure lock is unlocked at the end of write_message
        if future_handle is None:
            rospy.logwarn('write_message was canceled')
            self._write_lock.release()
        else:
            future_handle.add_done_callback(lambda f: self._write_lock.release())
        yield future_handle


Alabate commented Jun 11, 2019

Also, I don't understand the need for locking here:

    def send_message(self, message):
        if type(message) == bson.BSON:
            binary = True
        elif type(message) == bytearray:
            binary = True
            message = bytes(message)
        else:
            binary = False

        with self._write_lock:
            IOLoop.instance().add_callback(partial(self.prewrite_message, message, binary))

We are just adding the function as a callback; it's not executed right now (and we will take the lock anyway when we actually write to Tornado). And, as said in the documentation, add_callback is thread-safe.
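As a standalone illustration (a toy example, not rosbridge code), scheduling work on the IOLoop from another thread with add_callback needs no extra locking:

import threading

from tornado.ioloop import IOLoop

loop = IOLoop.current()

def worker():
    # Runs on a plain thread; add_callback is the documented thread-safe way
    # to hand work over to the IOLoop thread, so no lock is needed here.
    loop.add_callback(lambda: print("executed on the IOLoop thread"))

threading.Thread(target=worker).start()
loop.call_later(0.5, loop.stop)  # give the callback time to run, then exit
loop.start()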

mvollrath commented:

Having a blocking function in a coroutine is part of this exercise. It is waiting on an RLock to prevent internal Tornado deadlock. This way Tornado can still process outgoing messages in parallel, but rosbridge can't write more messages until Tornado has finished processing all of its outbound.

The best case is:

Outgoing 0 on /foo gets the lock and hands off to IOLoop
IOLoop gets the lock and starts 0
Outgoing 1 on /bar waits for the lock
Outgoing 2 on /baz waits for the lock
Outgoing 3 on /foo waits for the lock
IOLoop finishes 0 and releases the lock
Outgoing 1 on /bar gets the lock and hands off to IOLoop
Outgoing 2 on /baz gets the lock and hands off to IOLoop
Outgoing 3 on /foo gets the lock and hands off to IOLoop
IOLoop gets the lock and starts 1 2 and 3 in parallel
More outgoing are queued up
IOLoop finishes 1 2 and 3 and releases the lock

The order in which waiting threads acquire a lock is undefined and may vary across implementations, so we can't expect true FIFO, but the result should always be that the infinite Tornado queue is reduced to a few messages on deck for each topic. Controlling and optimizing this behavior would be an interesting project. At some point Tornado is probably just getting in the way.

It sounds like using run_in_executor would be perfect to wait for the Future to finish in the prewrite_message with block.
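For reference, a toy example of that API (run_in_executor is Tornado 5.0+, so as noted in the next comment it doesn't apply to the 4.5.x setup here); it lets a coroutine yield on a blocking call that runs in a worker thread:

import time

from tornado import gen
from tornado.ioloop import IOLoop

def blocking_work():
    time.sleep(0.1)  # stands in for "wait until the write has really finished"
    return "done"

@gen.coroutine
def main():
    # The blocking call runs on a thread pool; the coroutine resumes with its
    # result without ever blocking the IOLoop itself.
    result = yield IOLoop.current().run_in_executor(None, blocking_work)
    print(result)

IOLoop.current().run_sync(main)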


Alabate commented Jun 11, 2019

Unfortunately, run_in_executor is a feature of Tornado 5.0. There is the run_on_executor decorator, but it requires a named executor that I don't have and don't want to spawn.
And finally, I cannot find a way to wait for a future except by doing a dirty loop that checks and sleeps.

However, playing around with the code, I found another way to fix the issue:

    def prewrite_message(self, message, binary):
        # Use a try block because the log decorator doesn't cooperate with @coroutine.
        try:
            with self._write_lock:
                future_handle = self.write_message(message, binary)

                # When closing, self.write_message() returns None, even though that's an
                # undocumented output. Treat it as a WebSocketClosedError.
                if future_handle is None:
                    raise WebSocketClosedError
                    
                yield future_handle
        except WebSocketClosedError:
            rospy.logwarn('WebSocketClosedError: Tried to write to a closed websocket')
            raise

As we've seen before, future_handle is sometimes None. And actually, it seems that the deadlock only appears when it is None, which makes sense.

As described in the Tornado documentation:

The decorator receives a Future from the generator, waits (without blocking) for that Future to complete, then “unwraps” the Future and sends the result back into the generator as the result of the yield expression.

  • If the coroutine yields a Future instance, the coroutine is called back when the future is done, which releases the lock.
  • If future_handle is None, Tornado doesn't understand that we want to be called back again, and why would it?

The question now is why write_message returns None; it's not supposed to. So I guess it's actually a Tornado bug, but it's not that hard to mitigate on our side.
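As a standalone toy of the first point (not rosbridge code): the coroutine only resumes past the yield when the yielded Future is resolved, which is what lets the with block exit and release the lock:

from tornado import gen
from tornado.concurrent import Future
from tornado.ioloop import IOLoop

pending = Future()

@gen.coroutine
def waiter():
    print("suspended until the future is done")
    yield pending  # the coroutine runner resumes us only when this resolves
    print("resumed; a with block around the yield would only exit here")

loop = IOLoop.current()
loop.spawn_callback(waiter)
loop.call_later(0.5, lambda: pending.set_result(None))
loop.call_later(1.0, loop.stop)
loop.start()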

mvollrath commented:

That looks great.
