Race condition in connection disconnect #732
Here's part of a backtrace showing basically the same problem, but with a SentinelManagedConnection:

```
File "/usr/local/lib/python3.5/site-packages/redis/client.py", line 880, in get
  return self.execute_command('GET', name)
File "/usr/local/lib/python3.5/site-packages/redis/client.py", line 572, in execute_command
  connection.send_command(*args)
File "/usr/local/lib/python3.5/site-packages/redis/connection.py", line 563, in send_command
  self.send_packed_command(self.pack_command(*args))
File "/usr/local/lib/python3.5/site-packages/redis/connection.py", line 538, in send_packed_command
  self.connect()
File "/usr/local/lib/python3.5/site-packages/redis/sentinel.py", line 44, in connect
  self.connect_to(self.connection_pool.get_master_address())
File "/usr/local/lib/python3.5/site-packages/redis/sentinel.py", line 106, in get_master_address
  self.disconnect()
File "/usr/local/lib/python3.5/site-packages/redis/connection.py", line 921, in disconnect
  connection.disconnect()
File "/usr/local/lib/python3.5/site-packages/redis/connection.py", line 530, in disconnect
  self._sock.close()
AttributeError: 'NoneType' object has no attribute 'close'
```
I've pretty much convinced myself ConnectionPool doesn't have the locking it needs. The disconnect() method is ripping out connections that are in use, and I suspect the call in Connection.disconnect() to self._sock.shutdown() ends up being a system call, so you lose the GIL and any protection you were counting on from that. I did a quick experiment to catch and ignore the AttributeError on self._sock.close() and ended up with a RuntimeError in ConnectionPool.disconnect(). See the stack trace below:

```
File "/mnt/hgfs/w/firefly/inara/python/vendor/redis-py/redis/client.py", line 925, in get
  return self.execute_command('GET', name)
File "/mnt/hgfs/w/firefly/inara/python/vendor/redis-py/redis/client.py", line 623, in execute_command
  connection.send_command(*args)
File "/mnt/hgfs/w/firefly/inara/python/vendor/redis-py/redis/connection.py", line 571, in send_command
  self.send_packed_command(self.pack_command(*args))
File "/mnt/hgfs/w/firefly/inara/python/vendor/redis-py/redis/connection.py", line 546, in send_packed_command
  self.connect()
File "/mnt/hgfs/w/firefly/inara/python/vendor/redis-py/redis/sentinel.py", line 45, in connect
  self.connect_to(self.connection_pool.get_master_address())
File "/mnt/hgfs/w/firefly/inara/python/vendor/redis-py/redis/sentinel.py", line 107, in get_master_address
  self.disconnect()
File "/mnt/hgfs/w/firefly/inara/python/vendor/redis-py/redis/connection.py", line 961, in disconnect
  for connection in all_conns:
RuntimeError: Set changed size during iteration
```
This error shouldn't happen, but let's see why it is blowing up in our faces. I would need the version of redis-py used and a piece of code that reproduces this issue. Cheers,
This was running redis-py current with master up to and including this commit: 20fc04e. I did a minor patch to temporarily work around the problem in `Connection.disconnect()`:

```python
def disconnect(self):
    "Disconnects from the Redis server"
    self._parser.on_disconnect()
    if self._sock is None:
        return
    try:
        self._sock.shutdown(socket.SHUT_RDWR)
        self._sock.close()
    # iio: Probably not the full fix, but this works around an
    # "AttributeError: 'NoneType' object has no attribute 'close'" error
    except AttributeError:
        pass
    except socket.error:
        pass
    self._sock = None
```

Where you get multiple threads accessing the same connection is in `SentinelConnectionPool.get_master_address()`:

```python
def get_master_address(self):
    master_address = self.sentinel_manager.discover_master(
        self.service_name)
    if self.is_master:
        if self.master_address is None:
            self.master_address = master_address
        elif master_address != self.master_address:
            # Master address changed, disconnect all clients in this pool
            self.disconnect()
    return master_address
```

In the case where the master address changes, it ends up disconnecting every connection in the pool, with no locking or care taken to avoid issues with the current owner of the connection. The pool's `disconnect()` method:

```python
def disconnect(self):
    "Disconnects all connections in the pool"
    all_conns = chain(self._available_connections,
                      self._in_use_connections)
    for connection in all_conns:
        connection.disconnect()
```

Hopefully that makes it pretty clear how multiple threads are touching the same connection. If not and you really need something to reproduce it, it'll take me a bit to do that. It's happening in a web app under apache2 + mod_wsgi + flask on Python 3.5.1 when Sentinel sees a change in the master's IP address, so it's complicated to try to pull out for a test case.
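For what it's worth, the `RuntimeError` shown earlier is just CPython refusing to iterate a set that is mutated mid-iteration; a standalone snippet, independent of redis-py, reproduces the same failure that the `chain()` over the pool's sets hits when another thread checks a connection in or out:

```python
s = {1, 2, 3}
for item in s:        # same pattern as iterating all_conns above
    s.add(item + 10)  # another thread adding/removing a connection does this concurrently
# -> RuntimeError: Set changed size during iteration
```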
I'll be out for two days. I will check it then. In any case, thanks for your detailed comment.
Hi @rolette, you got the point: there is an issue with threads and the connection pool. I'm afraid this is only the tip of the iceberg; the race condition can appear in any access to the pool's shared state. Then what? Either wrap all the classes to become thread safe, or change the disconnect method. The first will take time and it's not clear right now how to do that without breaking backward compatibility; the second option is tricky but easy to implement. How? /cc @andymccurdy The idea is to use the shutdown syscall and call it from the connection pool [1], outside of the current disconnect method that belongs to the `Connection` class.

[1] https://github.com/andymccurdy/redis-py/blob/master/redis/connection.py#L951
This PR tries to solve the issues raised in redis#732 regarding the danger of disconnecting clients from the `ConnectionPool.disconnect` method when it is executed by a thread different from the ones in charge of the connections. Instead of calling the `Connection.disconnect` method, it uses the `shutdown` syscall to leave the socket unusable. Once the connection tries to use the socket, even when it is already blocked as in the `PubSub` pattern, it gets a `socket.error` exception that will be caught by the `Connection` class, which then raises a `ConnectionError` and disconnects the socket in a clean and safe way. The `Client.execute_command` function catches the `ConnectionError` exception, tries to connect again, and reruns the command that raised the error. Worth mentioning that in the case of a `Sentinel` environment, if something changed in the Redis pool of servers - perhaps the master went down and a slave was promoted - the next command will be executed against another server.
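Concretely, in terms of the pool `disconnect()` quoted above, the change amounts to something like the following sketch (the idea described in the PR, not necessarily its exact code):

```python
def disconnect(self):
    "Signal every connection in the pool instead of closing their sockets here"
    all_conns = chain(self._available_connections,
                      self._in_use_connections)
    for connection in all_conns:
        # shutdown() only marks the socket unusable; the thread that owns the
        # connection gets a socket.error on its next read/write, turns it into
        # a ConnectionError, and runs its own disconnect()/connect() cycle.
        connection.shutdown_socket()
```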
@rolette do you wanna try this fix? It should remove the problem that you experienced.
@pfreixes definitely an improvement, but still issues. Taking your patch as-is, here's what I see now:

```
File "/mnt/hgfs/w/redis-py/redis/client.py", line 925, in get
  return self.execute_command('GET', name)
File "/mnt/hgfs/w/redis-py/redis/client.py", line 617, in execute_command
  connection.send_command(*args)
File "/mnt/hgfs/w/redis-py/redis/connection.py", line 579, in send_command
  self.send_packed_command(self.pack_command(*args))
File "/mnt/hgfs/w/redis-py/redis/connection.py", line 554, in send_packed_command
  self.connect()
File "/mnt/hgfs/w/redis-py/redis/sentinel.py", line 45, in connect
  self.connect_to(self.connection_pool.get_master_address())
File "/mnt/hgfs/w/redis-py/redis/sentinel.py", line 107, in get_master_address
  self.disconnect()
File "/mnt/hgfs/w/redis-py/redis/connection.py", line 970, in disconnect
  connection.shutdown_socket()
File "/mnt/hgfs/w/redis-py/redis/connection.py", line 545, in shutdown_socket
  self._sock.shutdown(socket.SHUT_RDWR)
OSError: [Errno 107] Transport endpoint is not connected
```

Whether you get the `OSError` seems to depend on timing. If I work around it in `shutdown_socket()`:

```python
def shutdown_socket(self):
    """
    Shutdown the socket held by the current connection, called from
    the connection pool class or other manager to signal it that it has to be
    disconnected in a thread-safe way. Later the connection instance
    will get an error and will call `disconnect` by itself.
    """
    try:
        self._sock.shutdown(socket.SHUT_RDWR)
    except AttributeError:
        # either _sock attribute does not exist or
        # connection thread removed it.
        pass
    except OSError as e:
        if e.errno == 107:
            # Transport endpoint is not connected
            pass
        else:
            raise
```

then I'm back to the same `RuntimeError` in `ConnectionPool.disconnect()`:

```
Traceback (most recent call last):
  File "/mnt/hgfs/w/redis-py/redis/connection.py", line 565, in send_packed_command
    self._sock.sendall(item)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/hgfs/w/redis-py/redis/client.py", line 617, in execute_command
    connection.send_command(*args)
  File "/mnt/hgfs/w/redis-py/redis/connection.py", line 585, in send_command
    self.send_packed_command(self.pack_command(*args))
  File "/mnt/hgfs/w/redis-py/redis/connection.py", line 578, in send_packed_command
    (errno, errmsg))
redis.exceptions.ConnectionError: Error 32 while writing to socket. Broken pipe.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  # removed app-specific parts of stack
  File "/mnt/hgfs/w/redis-py/redis/client.py", line 925, in get
    return self.execute_command('GET', name)
  File "/mnt/hgfs/w/redis-py/redis/client.py", line 623, in execute_command
    connection.send_command(*args)
  File "/mnt/hgfs/w/redis-py/redis/connection.py", line 585, in send_command
    self.send_packed_command(self.pack_command(*args))
  File "/mnt/hgfs/w/redis-py/redis/connection.py", line 560, in send_packed_command
    self.connect()
  File "/mnt/hgfs/w/redis-py/redis/sentinel.py", line 45, in connect
    self.connect_to(self.connection_pool.get_master_address())
  File "/mnt/hgfs/w/redis-py/redis/sentinel.py", line 107, in get_master_address
    self.disconnect()
  File "/mnt/hgfs/w/redis-py/redis/connection.py", line 975, in disconnect
    for connection in all_conns:
RuntimeError: Set changed size during iteration
```

I'm not seeing a way around adding proper synchronization instead of relying on the GIL, but maybe my python-foo is lacking and there's a different way to get there.
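For reference, a minimal sketch of what that synchronization could look like (a hypothetical subclass, not code from the PR). Note that it only cures the "Set changed size during iteration" symptom: the pool still closes sockets that another thread is actively using, which is the deeper problem being discussed.

```python
import threading
from itertools import chain

from redis.connection import ConnectionPool


class LockedConnectionPool(ConnectionPool):
    """Sketch: serialize the pool's bookkeeping with a mutex."""

    def __init__(self, *args, **kwargs):
        super(LockedConnectionPool, self).__init__(*args, **kwargs)
        # Reentrant so a nested disconnect() (e.g. from _checkpid) can't deadlock.
        self._lock = threading.RLock()

    def get_connection(self, command_name, *keys, **options):
        with self._lock:
            return super(LockedConnectionPool, self).get_connection(
                command_name, *keys, **options)

    def release(self, connection):
        with self._lock:
            super(LockedConnectionPool, self).release(connection)

    def disconnect(self):
        # Snapshot under the lock so the iteration can't be invalidated by a
        # concurrent get_connection()/release().
        with self._lock:
            all_conns = list(chain(self._available_connections,
                                   self._in_use_connections))
        for connection in all_conns:
            connection.disconnect()
```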
@rolette thanks for the fast feedback. I updated the PR with the two issues you commented on. I hope that this will definitely fix your issue.
@pfreixes progress... it gets past the earlier errors, but now I hit this:

```
Traceback (most recent call last):
  File "/mnt/hgfs/w/redis-py/redis/client.py", line 618, in execute_command
    return self.parse_response(connection, command_name, **options)
  File "/mnt/hgfs/w/redis-py/redis/client.py", line 630, in parse_response
    response = connection.read_response()
  File "/mnt/hgfs/w/redis-py/redis/sentinel.py", line 56, in read_response
    return super(SentinelManagedConnection, self).read_response()
  File "/mnt/hgfs/w/redis-py/redis/connection.py", line 587, in read_response
    response = self._parser.read_response()
  File "/mnt/hgfs/w/redis-py/redis/connection.py", line 333, in read_response
    raise ConnectionError(SERVER_CLOSED_CONNECTION_ERROR)
redis.exceptions.ConnectionError: Connection closed by server.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  # removed app-specific parts of stack
  File "/mnt/hgfs/w/redis-py/redis/client.py", line 1902, in hget
    return self.execute_command('HGET', name, key)
  File "/mnt/hgfs/w/redis-py/redis/client.py", line 624, in execute_command
    return self.parse_response(connection, command_name, **options)
  File "/mnt/hgfs/w/redis-py/redis/client.py", line 630, in parse_response
    response = connection.read_response()
  File "/mnt/hgfs/w/redis-py/redis/sentinel.py", line 56, in read_response
    return super(SentinelManagedConnection, self).read_response()
  File "/mnt/hgfs/w/redis-py/redis/connection.py", line 587, in read_response
    response = self._parser.read_response()
  File "/mnt/hgfs/w/redis-py/redis/connection.py", line 361, in read_response
    self._reader.feed(self._buffer, 0, bufflen)
AttributeError: 'NoneType' object has no attribute 'feed'
```
Mmm, weird. I've been running the following snippet for a while and it didn't raise any issue:

```python
from threading import Thread
from redis.sentinel import Sentinel
from time import sleep

def ping_forever(client):
    while True:
        client.ping()
        sleep(0.1)

s = Sentinel([('localhost', 26379)], socket_timeout=0.1)
m = s.master_for('mymaster', socket_timeout=0.1)
t = Thread(target=ping_forever, args=(m,))
t.start()
sleep(1)
while True:
    print "Killing connection"
    m.connection_pool.disconnect()
    print "Connection finished"
    sleep(0.1)
t.join()
```

I shouldn't need all of your environment to reproduce the issue; a snippet with the dependencies should be enough.
It's happening in a web app under apache2 + mod_wsgi + flask on Python 3.5.1, so nothing easy for me to just pull a snippet out to repro, unfortunately. More context on how I'm making this happen:

```python
sentinel = Sentinel([('127.0.0.1', 26379)], socket_timeout=0.5)
return SentinelConnectionPool('mymaster', sentinel, socket_timeout=3.0, retry_on_timeout=True)
```

Looking at your test code, the main question I'd have is whether you have hiredis installed or not. The path that blew up in my most recent traceback was down in the hiredis parser:

```
$ pip3 list | grep hiredis
hiredis (0.1.5)
```
Yes, I used hiredis. It is going to take me a while; let's see if I'm able to find out what is going on here with your feedback.
@rolette are you sharing the SentinelConnectionPool between threads via a Flask Proxy?
@pfreixes Not sure specifically what you mean by "Flask Proxy", but the SentinelConnectionPool is being shared across threads, yes.

```
$ sudo apachectl -V
Server version: Apache/2.4.7 (Ubuntu)
Server built:   Mar 10 2015 13:05:59
Server's Module Magic Number: 20120211:27
Server loaded:  APR 1.5.1-dev, APR-UTIL 1.5.3
Compiled using: APR 1.5.1-dev, APR-UTIL 1.5.3
Architecture:   64-bit
Server MPM:     event
  threaded:     yes (fixed thread count)
  forked:       yes (variable process count)
Server compiled with....
 -D APR_HAS_SENDFILE
 -D APR_HAS_MMAP
 -D APR_HAVE_IPV6 (IPv4-mapped addresses enabled)
 -D APR_USE_SYSVSEM_SERIALIZE
 -D APR_USE_PTHREAD_SERIALIZE
 -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
 -D APR_HAS_OTHER_CHILD
 -D AP_HAVE_RELIABLE_PIPED_LOGS
 -D DYNAMIC_MODULE_LIMIT=256
 -D HTTPD_ROOT="/etc/apache2"
 -D SUEXEC_BIN="/usr/lib/apache2/suexec"
 -D DEFAULT_PIDLOG="/var/run/apache2.pid"
 -D DEFAULT_SCOREBOARD="logs/apache_runtime_status"
 -D DEFAULT_ERRORLOG="logs/error_log"
 -D AP_TYPES_CONFIG_FILE="mime.types"
 -D SERVER_CONFIG_FILE="apache2.conf"
```
@rolette Sadly, the pool classes are not thread safe for this scenario. In order to avoid introducing too many changes in the core of redis-py, and to leave the current rationale untouched:

1 - My idea to address this issue is to move all the necessary code to the specific classes published by Sentinel [1].

[1] https://github.com/andymccurdy/redis-py/blob/master/redis/sentinel.py#L69
@pfreixes Do you mean that the implementation of the connection pool isn't intended to be thread-safe? The docs in the code for `BlockingConnectionPool` say:

```python
class BlockingConnectionPool(ConnectionPool):
    """
    Thread-safe blocking connection pool::

        >>> from redis.client import Redis
        >>> client = Redis(connection_pool=BlockingConnectionPool())

    It performs the same function as the default
    ``:py:class: ~redis.connection.ConnectionPool`` implementation, in that,
    it maintains a pool of reusable connections that can be shared by
    multiple redis clients (safely across threads if required).

    The difference is that, in the event that a client tries to get a
    connection from the pool when all of connections are in use, rather than
    raising a ``:py:class: ~redis.exceptions.ConnectionError`` (as the default
    ``:py:class: ~redis.connection.ConnectionPool`` implementation does), it
    makes the client wait ("blocks") for a specified number of seconds until
    a connection becomes available.
    """
```

I can't find any assertion one way or the other for the plain `ConnectionPool`.
@rolette the issue is with the Sentinel-specific pool code. As I said, the rationale is right, but IMHO the problem is due to both Sentinel functions [1].

[1] https://github.com/andymccurdy/redis-py/blob/master/redis/sentinel.py#L99
@rolette I've made many attempts to trigger the race condition and I didn't manage it... but I've prepared a POC patch [1] that might solve your problem. Right now I don't feel very confident about this issue and I'm a bit frustrated...

[1] pfreixes@6a95994
Thanks for digging into this. I just had a chance to come up to speed on the conversation. Looking at the code, I agree that there's a race condition in `ConnectionPool.disconnect`. In addition to the race condition, it seems like there's another bug as well: the only time the Sentinels are queried about the state of the cluster is when a connection is being established. I think a better solution is to regularly query the state of the sentinel cluster and store the master address and slave addresses on the connection pool instance. This doesn't account for connection instances that get cached on pipeline or pubsub objects, so we'll need to do a little more to help those out too. I'm currently traveling but should have some free time around the end of this week. I'm happy to take a stab at this at that time. If you guys see any glaring holes in this idea, please let me know :)
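A rough sketch of that direction (a hypothetical helper, not the eventual implementation): poll the sentinels on a timer and reset the pool only when the discovered master actually changes. It assumes the pool exposes `sentinel_manager`, `service_name`, `master_address` and `disconnect()`, as in the snippets quoted earlier in the thread.

```python
import time
import threading


def watch_master(pool, interval=5.0):
    """Keep a SentinelConnectionPool's cached master address fresh (sketch)."""
    def poll():
        while True:
            try:
                address = pool.sentinel_manager.discover_master(pool.service_name)
                if pool.master_address is None:
                    pool.master_address = address
                elif address != pool.master_address:
                    # Remember the new master before dropping old connections,
                    # so the pool is only reset once per failover.
                    pool.master_address = address
                    pool.disconnect()
            except Exception:
                # Sentinel temporarily unreachable; try again on the next tick.
                pass
            time.sleep(interval)

    t = threading.Thread(target=poll)
    t.daemon = True
    t.start()
    return t
```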
So, can we sum it up as: the problem shows up when Sentinel [1] promotes a new master and the pool tears down connections that other threads still own?

/cc @rolette any news about the patch?

[1] http://redis.io/topics/sentinel
@pfreixes Unfortunately, adding a threading lock just for the disconnect path isn't enough. While we could play whack-a-mole and squelch the individual exceptions, connections would still be getting ripped out from under the threads that own them. One way to approach it: have the pool's `disconnect()` close only the idle connections directly and merely signal the in-use ones; from there, you can either just let threads with in-use connections deal with the errors from their action (timeouts, read-only errors, etc.), which they have to do anyway, or you can implement a mechanism similar to what you would do in a signal handler... set a flag on each connection and then let the thread that owns the connection deal with the flag when it is safe to do so. Having the in-use connections deal with the timeouts, etc. isn't nearly as gross as it seems for clients if they are using the `retry_on_timeout` option.
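A sketch of the flag-based variant (hypothetical class and attribute names, and it only covers the checkout path; a thread holding a connection across many commands would also need to check the flag itself):

```python
import threading

from redis.connection import ConnectionPool


class SignalingConnectionPool(ConnectionPool):
    """Sketch: close idle connections, only flag the ones other threads own."""

    def __init__(self, *args, **kwargs):
        super(SignalingConnectionPool, self).__init__(*args, **kwargs)
        self._lock = threading.RLock()

    def disconnect(self):
        with self._lock:
            idle = list(self._available_connections)
            busy = list(self._in_use_connections)
        for connection in idle:
            connection.disconnect()               # nobody else is touching these
        for connection in busy:
            connection._needs_reconnect = True    # owner handles it when safe

    def get_connection(self, command_name, *keys, **options):
        with self._lock:
            connection = super(SignalingConnectionPool, self).get_connection(
                command_name, *keys, **options)
        if getattr(connection, '_needs_reconnect', False):
            connection._needs_reconnect = False
            connection.disconnect()               # safe: this thread owns it now
        return connection
```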
The signal here tries to minimize the refactoring, having in mind commands that keep the connection open until there is something to do, such as `PubSub`. The use of the `shutdown` syscall wakes up even a socket that is blocked waiting for data, so the owning connection gets a `socket.error` and can reconnect cleanly.
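Why `shutdown` helps for those blocking cases can be seen with a plain socket pair, independent of redis-py: a `recv()` blocked in one thread returns immediately once another thread shuts the socket down, instead of hanging until a timeout.

```python
import socket
import threading

a, b = socket.socketpair()

def reader():
    # Blocks here until data arrives or the socket is shut down.
    data = a.recv(1024)
    print('recv returned %r; the owner can now reconnect cleanly' % (data,))

t = threading.Thread(target=reader)
t.start()

# Another thread (e.g. the connection pool) signals the blocked reader
# without closing the file descriptor out from under it.
a.shutdown(socket.SHUT_RDWR)
t.join()
a.close()
b.close()
```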
Can we please merge @rolette's patches? I looked over them and they seem very reasonable. I also tested them for a while and found no problems, so +1 from me.
We had these changes in production for a while, and after a Redis master failover there was a flood of new connections opening, eventually making the process unusable since it ran out of client TCP ports. That might be specific to our environment, but I think the code is not safe as is.
@lalinsky: That sounds more like an application resource problem than a problem with the pool. There's no rate limit on the connection pool for opening connections. When the master changes, the old connections have to be closed and replaced with new ones (currently on an as-needed basis, not all connections concurrently). When you close a connection, the network stack doesn't make the port available immediately. Even if the connection goes through a FIN, FIN-ACK, ACK sequence, it gets put into the TIME_WAIT state for some amount of time. If you are pushing the limit on the number of concurrent connections, it's easy to start getting errors when you try to open a new connection because no client ports are available. In Python, that shows up as an OSError 99 - "Cannot assign requested address." You might be able to fix that scenario by using the BlockingConnectionPool and setting appropriate limits. Just make sure you get the parallel fixes to that referenced earlier in the thread. If you want to have some real fun and "interesting" backtraces, test your app when the master fails over without this changeset.
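For example (illustrative numbers only, and shown with the plain pool; the Sentinel-aware setup additionally needs the parallel fixes mentioned above), a capped blocking pool looks roughly like this:

```python
import redis

# At most 50 sockets per process; callers that can't get a connection within
# 10 seconds get a ConnectionError instead of the process slowly exhausting
# client TCP ports during a failover storm.
pool = redis.BlockingConnectionPool(host='127.0.0.1', port=6379,
                                    max_connections=50, timeout=10)
client = redis.StrictRedis(connection_pool=pool)
client.ping()
```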
@rolette I don't think the rate of new connections is the core problem; it's that the connections never make it properly back to the pool and are never reused. I have not done any serious testing to confirm that, but that's my intuition. If it were just the rate of new connections, we would have the same problem on app startup, but that's not happening there.
@rolette FYI, I have figured out the problem with your branch. When the master address changes, it's not remembered, and therefore every new connection attempt causes the connection pool to close all existing connections. That's easy to fix at https://github.com/andymccurdy/redis-py/blob/master/redis/sentinel.py#L106, but I'm going to open a slightly different PR with the sentinel pool redesigned to not close the connections immediately -- #909
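In terms of the `get_master_address()` snippet quoted earlier, the fix being described is essentially one extra assignment (a sketch of the idea, not the exact patch):

```python
def get_master_address(self):
    master_address = self.sentinel_manager.discover_master(self.service_name)
    if self.is_master:
        if self.master_address is None:
            self.master_address = master_address
        elif master_address != self.master_address:
            # Remember the new master *before* disconnecting; otherwise every
            # subsequent call still sees a mismatch and drops the pool again.
            self.master_address = master_address
            self.disconnect()
    return master_address
```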
@lalinsky Thanks for following up. That's an issue in master (nothing specific to my branch). We recently merged PR #847 into our code base to fix the OSError 99 problems. I keep hoping @andymccurdy will merge the various outstanding fixes.
I'm seeing a few more errors, all coming from the same place, and wondering if they have the same cause. In addition to the `AttributeError` discussed above, I'm also seeing `ProtocolError`s. I'm guessing they're just other things that can go wrong when the connection disappears without proper locking.
I have also been running into @hobzcalvin's errors periodically and inconsistently. We use redis as a Celery backend, and when running multiple tasks asynchronously, a stream of these AttributeErrors and ProtocolErrors pops up after task timeouts or after a task failure. I understand that this problem might happen due to leftover bytes in the connection after the socket closes, which throw the ProtocolError when the connection is re-opened. I am hoping there is a graceful way to handle this and flush the bytes.
3.2.0 has just been released and fixes these issues.
It seems like there is a race condition in the code for `Connection.disconnect()`. I run redis as a celery broker, so I seem to be running into a situation where I get a `TypeError: NoneType has no attribute _close()`. It would appear that in one process `disconnect()` is being called and setting `self._sock` to None while another process has already executed the if statement that checks if `self._sock` is None.