Error: This socket has been ended by the other party #192
@llambda -- does it crash? Or is that the error returned by the promise?
Actually, if I read the stacktrace properly, your query is rejected. I guess this can happen if something like this happens:
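For illustration, a minimal sketch of that kind of scenario, assuming a single non-pooled connection and placeholder host/table names (not the original snippet):

```js
// Sketch: if the connection is torn down while a query is still in flight,
// the query's promise is rejected instead of the process crashing.
var r = require('rethinkdbdash')({pool: false});

r.connect({host: 'localhost', port: 28015}).then(function(conn) {
  r.table('test').run(conn)
    .catch(function(err) {
      // Depending on timing this may be "The connection was closed before
      // the query could be completed" or the socket error from this issue.
      console.error('query rejected:', err.message);
    });

  // Something closes the connection while the query is still running.
  conn.close();
});
```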
@llambda -- how often does it happen? Do you have an unreliable connection between your server and the database?
Did not crash. It was a local Node web server hitting a local RethinkDB on my Linux desktop. Once it happened, it seemed like no queries resolved until I restarted the web app. It happened I think twice in the last few days, and it seemed to be after returning to the computer after being away for a while. I don't have any power management on except blanking the screen, so I'm not sure whether that is related or not.
I have another stack trace with the latest version of rethinkdbdash. The process uptime was 75405 seconds on a production box, so no sleep or power management. It seems to be occurring in a promise rejection, as it is handled and logged for me by a Bluebird unhandled-rejection handler, and therefore the process did not crash. I am not sure if it automatically recovers or not, but I am pretty sure that it gets into a state where no DB queries return results until the web server process is restarted. I'm not sure because it was on one server out of three, and I was upgrading Node on them, so I ended up restarting the web server processes and never noticed the issue had occurred until I was reviewing the logs today. (The servers are load balanced, so unfortunately I don't normally notice a problem on just one of them unless, I guess, they all go down.) I'll keep an eye on it and try to troubleshoot further.

I don't know if it would be helpful here to log the host of the RethinkDB connection. I have 3 computers in the cluster; each runs a RethinkDB instance and a Node web server. So the RethinkDB connection here could potentially be to a different host than the one the web server is running on.
Hum, I still can't figure out how this could happen, but I can add one more check in …
I just pushed 2.2.10, which should catch this error earlier, but I believe it doesn't address the underlying issue. My guess is that we attempt to write on a TCP socket after it's closed and before it has emitted the corresponding event. @llambda -- when it happens again, can you provide the stacktrace? Thanks!
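For reference, a standalone Node sketch (generic net module code, not the rethinkdbdash internals) showing how writing to a TCP socket after the remote side has ended it raises this exact error:

```js
var net = require('net');

var server = net.createServer(function(socket) {
  socket.end(); // the server immediately closes its side (sends FIN)
});

server.listen(0, function() {
  var client = net.connect(server.address().port);
  client.resume(); // consume the (empty) readable side so 'end' fires

  client.on('end', function() {
    // The remote side has ended the connection; any further write fails.
    client.write('query');
  });

  client.on('error', function(err) {
    console.log(err.message); // "This socket has been ended by the other party"
    server.close();
  });
});
```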
Hey @neumino, I got 5 of the following logged this afternoon, all on just one server.
@llambda -- My bad, just pushed a fix in 2.2.11 for that.
So, I haven't seen any unhandled promise rejections since moving to 2.2.11. I do see the following, repeated twice, at the end of the stderr of the web server that had the original rejections. However, since it appears to just have been logged to stderr, there is no date information, so I am not sure when it happened. The app hasn't crashed or stopped working, so I am guessing this is just normal logging for something handled by the connection pool now?
Hum, this seems to mean that the connection was closed. @llambda -- do you use the connection pool to run queries? Or do you open connections yourself?
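For context, the two modes look roughly like this in rethinkdbdash (a sketch based on its documented API; host, port and table are placeholders):

```js
var rethinkdbdash = require('rethinkdbdash');

// Pool mode (default): no connection is passed to run(); the driver
// draws one from its pool for each query.
var r = rethinkdbdash({host: 'localhost', port: 28015});
r.table('test').run().then(function(rows) {
  console.log(rows.length);
});

// Manual mode: the pool is disabled and connections are managed by hand.
var rNoPool = rethinkdbdash({pool: false});
rNoPool.connect({host: 'localhost', port: 28015}).then(function(conn) {
  return rNoPool.table('test').run(conn)
    .then(function(rows) {
      conn.close();
      return rows;
    });
});
```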
I use the connection pool. Come to think of it, those 2 connection messages could have been at process shutdown, though, since right before there was a "Draining the pool connected to host:28015".
Ok thanks, I have a rough idea of what's happening. I probably won't have time in the next few days since I'm moving, but I should be able to take a look at that after.
@neumino this is happening for me as well. It's a pretty big issue, as there is (or at least was) an expectation that rdbdash will silently handle reconnection (read: we don't have to worry; as long as the remote server is up, queries will "Just Work™"). I can reproduce this just by leaving my local REST API running, closing my laptop, opening it back up and attempting to use the API. I think the issue should be reopened?
I still can't reproduce it, but I have a theory that may explain why.
I actually haven't had the issue anymore myself. I run on a Fedora 23 desktop using Node 5. Production is CentOS 7, with 3 computers each running the same Node.js script and a RethinkDB instance. I run clustered with socket.io and am using changefeeds. I wonder if taking something like https://github.com/llambda/tcpslow and adding random fault-injecting socket.destroy calls might help trigger it; I might try to do that.
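A rough sketch of that fault-injection idea (hypothetical, not actual tcpslow code): a pass-through TCP proxy in front of RethinkDB that randomly destroys sockets, so the driver sees abrupt disconnects:

```js
var net = require('net');

net.createServer(function(client) {
  var upstream = net.connect(28015, 'localhost');
  client.pipe(upstream).pipe(client);
  client.on('error', function() {});
  upstream.on('error', function() {});

  // Randomly kill some connections after a short delay to simulate faults.
  if (Math.random() < 0.2) {
    setTimeout(function() {
      client.destroy();
      upstream.destroy();
    }, Math.random() * 5000);
  }
}).listen(28016); // point the driver at port 28016 instead of 28015
```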
I added fault injection and did manage to get the error.
I had to sit in the browser and press Ctrl + R a lot to get that. But nothing else, just expected errors like that and "unhandledPromiseRejection: ReqlServerError: The connection was closed before the query could be completed for ...".
@mbroadst when it happens what do you see (if anything) for error message, and does your process exit? Which version are you on? (I am on 2.2.15)
So this stacktrace is basically you calling … When that happens, the TCP connection emits … I'll dig a bit more, but so far I've been unable to reproduce it and find out how this could happen.
I just pushed 2.2.16. In case a connection is being used by a query and the connection receives a FIN packet, maybe we put the connection back in the pool later? Though that seems impossible, as it would mean that the query actually succeeded. @llambda, @mbroadst -- let me know if you still see this error with 2.2.16.
One more question: are you using …?
Hey guys, sorry, busy weekend over here. Let's see:
I haven't had a chance to test again, but I'll try to run some test scenarios later tonight and see if I can recreate it. Thanks
@snackycracky -- Your issue is not related. The connection on which you open a changefeed was closed by something. This is probably an issue in your setup (Docker? firewall? Linux TCP settings?).
Yeah, I am using Docker + ELB on AWS. Do you have an idea what this "something" could be? The idle timeout on the load balancers maybe?
@snackycracky -- I have never used ELB, but looking at the docs, there's a 60-second timeout for idle connections: https://aws.amazon.com/blogs/aws/elb-idle-timeout-control/
Yeah, there are multiple options possible with ELB, but connecting directly to RethinkDB seems to solve the problem with rethinkdbdash as well. The idle timeout is probably not the only reason; I probably also need sticky sessions and HAProxy, and to enable ProxyProtocolPolicyType. Sorry for hijacking this issue. My problem is more related to rethinkdb/rethinkdb#2956
Same error here, all queries stopped working (2.2.17).
One other thing which may be related is that the computer changed wifi networks, and the log has quite a few of these messages:
In this case RethinkDB ran with …
If you switched networks, it's expected that all the connections will be dropped.
Sure, I just mentioned it here because it may be related. Whether it ever recovers, I don't know... it certainly didn't within the 5 or so minutes I was looking at it. What may be the original problem reappeared today on a test server (no changing networks there) -- all queries stopped working. Looking at the open file descriptors, … The process had been running for 5 days up to that point, so I guess the open connections accumulate and never get properly reused or cleaned up? Sorry I can't be of more help atm.
Well, since I now know how to count open connections, I did a quick test:
So it looks like rethinkdbdash keeps opening new connections and never even reuses or closes the old ones! |
This is working as expected. If you set a buffer of 50 connections, the driver will try to keep 50 connections available. What happens in your example is:
If you wait a bit, the connections will be cleaned up. By default it's 2 hours, if I'm not mistaken. Basically it's still working as intended.
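For reference, the knobs involved look roughly like this (option names from the rethinkdbdash README; the values are illustrative and defaults may differ between versions):

```js
var r = require('rethinkdbdash')({
  host: 'localhost',
  port: 28015,
  buffer: 50,                // connections the pool tries to keep available
  max: 1000,                 // hard cap on open connections
  timeoutGb: 60 * 60 * 1000  // ms an idle connection may sit in the pool before being closed
});
```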
Ah, cool! Thanks for the explanation, it makes sense now. The source of my too-many-connections problem turned out to be not closing the changefeed stream returned by … rethinkdbdash defines its own …
I'm also running into the problem @llambda ran into:

```js
var express = require('express');
var server = require('http').Server(express());
var config = rootRequire('config').rethinkdb;
var r = require('rethinkdbdash')({servers: [config]});
var winston = rootRequire('log');
var metric = rootRequire('metric');
var stackTrace = require('stack-trace');

module.exports = function(io) {
  io.on('connection', function(socket) {
    var now = Date.now();
    socket.on('initNotification', function(identifier) {
      winston.info('a user connected', {user: identifier, endpoint: 'socketio'});
      return r.db('notification').table('events')
        .filter(function(item) {
          return item('time').gt(now).and(
            item('user2').eq(identifier))
        })
        .changes()
        .run()
        .then(function(cursor) {
          cursor.each(function(err, item) {
            if (item && item.new_val && !item.old_val) {
              socket.emit('notification', item.new_val);
            }
          })
          socket.on('disconnect', function() {
            winston.info('a user disconnected correctly', {endpoint: 'socketio'});
            cursor.close();
          })
        })
        .then(function() {
          metric.checkDaily(function() {
            var values = {users: identifier};
            metric.updateTable('dailyMetrics', values);
          });
        })
        .catch(function(err) {
          winston.error(JSON.stringify(err),
            {user: identifier, endpoint: 'socketio', trace: stackTrace.parse(err)});
        });
    });
  });
}
```

When I leave my server running with these connections open for a long period of time I'll get a similar output as listed above:
Is there something I'm not doing in the catch statement, or something else I'm missing? I'm running on 2.2.18 with RethinkDB 2.2.5.
Do you run that locally? What's the TCP timeout on your machine?
I'm running it on a DigitalOcean droplet with RethinkDB on the same box. It's my dev environment. I'm not sure what the TCP timeout is on the box. How can I check my TCP timeout?
So basically, here's what happens:
The driver does specify that the connection should be kept alive. As far as I know, RethinkDB doesn't close idle connections, so it's probably Linux that decides to kill the connection. There's some doc here about TCP keepalive. Maybe you have bad defaults? Or try increasing the keepalive time?
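As an illustration of what "kept alive" means at the socket level, a generic Node sketch (not the actual rethinkdbdash source):

```js
var net = require('net');

var socket = net.connect({host: 'localhost', port: 28015}, function() {
  // Ask the OS to send TCP keepalive probes after 60s of inactivity.
  // Probe interval and retry count come from kernel settings such as
  // net.ipv4.tcp_keepalive_intvl and tcp_keepalive_probes on Linux.
  socket.setKeepAlive(true, 60 * 1000);
});
```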
We're experiencing similar issues. We have a changes feed on a table, but if the connection to the db breaks for whatever reason (db restart, network issue), we will not receive the change events anymore.
@danielmewes -- can we add something in the protocol to just ping a connection? It looks like keepalive is not always sufficient for some people.
@neumino Could you just run something like a trivial query for that? That being said, the idea has come up before. I think some routers don't work too well when it comes to TCP keepalive, or the timeouts are simply too long. Sounds worth considering, but probably not at the top of the priority list for the next few months.
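A sketch of that workaround (a hypothetical application-level ping; the interval and the query are arbitrary choices):

```js
var r = require('rethinkdbdash')(); // pool enabled by default

// Run a trivial query periodically so idle connections keep seeing traffic.
setInterval(function() {
  r.expr(1).run().catch(function(err) {
    console.error('ping failed:', err.message);
  });
}, 30 * 1000);
```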
Ok, I just released 2.3.3 with the … option. Note that we use …
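A sketch of what configuring such an option could look like, assuming it is the pingInterval setting described in the rethinkdbdash README (an assumption; host and port are placeholders):

```js
// Assumed option: pings each connection every <pingInterval> seconds when > 0.
var r = require('rethinkdbdash')({
  host: 'localhost',
  port: 28015,
  pingInterval: 60
});
```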
cool Sent from my Tricorder
Saw this a couple of times today and yesterday; I am not sure why, though. 2.2.4.