
WebSocket server stops responding after some hours. #328

Closed
pradeeptakhatoi opened this issue Jun 11, 2015 · 33 comments
Comments

@pradeeptakhatoi

Hello. I am using supervisor to check and restart my chat server if it stops.

After some time my chat stops working. When I check the console, I don't find any WebSocket connection-failure errors.

Then I use the command below to check whether my server is running, and I can see that the server is still running:

ps aux | grep php

So I am confused about where the issue or bug is that makes my chat application stop working.

@pradeeptakhatoi
Author

When I kill the chat server process manually using the "kill" command, supervisor restarts the chat server immediately and my chat works fine again.

@cboden
Member

cboden commented Jun 15, 2015

Put the following code in your script and restart it:

error_reporting(E_ALL);
ini_set('display_errors', 1);

Please paste the output of the following command here:

composer show -i

Once the error happens again don't kill the script. Check and paste the latest output of the supervisor stdout and error logs in this issue.

Once the error happens, SSH directly into the server, open a telnet session to the WebSocket port (e.g. telnet 127.0.0.1 8080, substituting your own port), enter the following, and paste the output here:

GET / HTTP/1.1
Host: localhost

@rohitkhatri

I'm running into the same problem.

Here is the output of the composer show -i command:

cboden/ratchet v0.3.3 PHP WebSocket library
evenement/evenement v2.0.0 Événement is a very simple event dispatching library for PHP
guzzle/common v3.9.2 Common libraries used by Guzzle
guzzle/http v3.9.2 HTTP libraries used by Guzzle
guzzle/parser v3.9.2 Interchangeable parsers used by Guzzle
guzzle/stream v3.9.2 Guzzle stream wrapper component
react/event-loop v0.4.1 Event loop abstraction layer that libraries can use for evented I/O.
react/socket v0.4.2 Library for building an evented socket server.
react/stream v0.4.2 Basic readable and writable stream interfaces that support piping.
react/zmq v0.3.0 ZeroMQ bindings for React.
symfony/event-dispatcher v2.7.1 Symfony EventDispatcher Component
symfony/http-foundation v2.7.1 Symfony HttpFoundation Component
symfony/routing v2.7.1 Symfony Routing Component

I placed these statements at the start of my server file, but didn't see any errors:

error_reporting(E_ALL);
ini_set('display_errors', 1);

Please help me.

@cboden
Member

cboden commented Jul 1, 2015

@rohitkhatri What happens when you do the telnet thing after the server has stopped responding?

@cboden
Member

cboden commented Aug 20, 2015

Closing due to extended period of inactivity. I'll re-open if reporting continues.

@cboden cboden closed this as completed Aug 20, 2015
@Chaosvex

Slightly late to the party, but I've been looking into Ratchet again and I wanted to check whether some of the issues I encountered the last time I tried to use it (two years ago) have been resolved.

This particular issue was a showstopper when it came to using Ratchet for a service that ran continuously, and I'm seeing a fair number of reports here with similar symptoms. Ratchet would work fine for a while before becoming unresponsive to new connections, maxing out the CPU and needing a restart. Initially it would run happily for several weeks, but that eventually degraded to needing manual intervention every couple of days.

The problem turned out to be file descriptor leakage: as soon as the process hit the maximum (1024 by default, I believe), it had to be restarted. As the site's traffic grew, it would hit the limit sooner. I didn't have the time or interest to fix the root problem for a hobby project, so I can't offer up a patch, but perhaps being aware that there was/is a leak can help get the ball rolling.

Edit: I can see you've suggested raising the limit in another issue (#349 (comment)), so I guess you're already aware of the problem. I don't consider increasing the limit to be a robust solution, so is there something that's preventing this from being fixed or a way to ensure the user doesn't run into this problem?

@pinumanu

pinumanu commented Nov 6, 2015

@pradeepta20, did you ever find a solution for this?

@pradeeptakhatoi
Author

Initially I wrote the database code using PDO. Later I changed the code to mysqli, and it has been working fine for me.
I don't know the exact reason, but it was due to some database coding error.
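
A plausible mechanism, for anyone hitting the same thing: an uncaught PDOException thrown inside a Ratchet callback can bubble up and wedge or kill the event loop, which looks exactly like "the server stopped responding". Below is a minimal sketch of the usual defence, trapping database errors per message so one bad query can't take down the loop (MyChat, the $db handle, and the SQL are illustrative placeholders, not the original code):

<?php
// Hedged sketch: catch DB failures inside Ratchet callbacks so one bad
// query can't take down the whole event loop.
require __DIR__ . '/vendor/autoload.php';

use Ratchet\ConnectionInterface;
use Ratchet\MessageComponentInterface;

class MyChat implements MessageComponentInterface {
    private $db;

    public function __construct(\PDO $db) {
        $this->db = $db;
        // Throw exceptions instead of failing silently
        $this->db->setAttribute(\PDO::ATTR_ERRMODE, \PDO::ERRMODE_EXCEPTION);
    }

    public function onMessage(ConnectionInterface $from, $msg) {
        try {
            $stmt = $this->db->prepare('INSERT INTO messages (body) VALUES (?)');
            $stmt->execute([$msg]);
        } catch (\PDOException $e) {
            error_log('DB error: ' . $e->getMessage()); // log, don't crash the loop
        }
    }

    public function onOpen(ConnectionInterface $conn) {}
    public function onClose(ConnectionInterface $conn) {}
    public function onError(ConnectionInterface $conn, \Exception $e) {
        $conn->close();
    }
}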

@pradeeptakhatoi
Author

Use the technique below to reconnect automatically when the WebSocket server is restarted.

var wsUri;
var websocket;

$(function() {
    // assuming a hidden input with id "ws_uri" holds the ws:// URL;
    // the original $("ws_uri") selector matches no element
    wsUri = $("#ws_uri").val();
    init();
});

function init() {
    websocket = new WebSocket(wsUri);
    websocket.onopen = function(ev) {
        // connection is open
    };
    websocket.onmessage = function(ev) {
    };
    websocket.onerror = function(ev) {
    };
    websocket.onclose = function(ev) {
        // reconnect after a short delay rather than immediately,
        // so a down server isn't hammered in a tight loop
        setTimeout(init, 1000);
    };
}

@MarcGodard

I am having a similar issue. The web browser doesn't connect (but it doesn't fail either), and when I check the WebSocket logs everything looks fine, but the new connection is never started. My sockets are basic, with no database.

@cboden
Member

cboden commented May 1, 2016

@Chaosvex The file descriptor limit is an operating-system-level feature; you will run into this problem with any language/environment (not just Ratchet/PHP). If you're hosting a WebSocket server, 1024 is WAY too low if you expect any traffic at all.

@MarcGodard Please follow the directions from this comment above and report the output so we can help troubleshoot.

@Chaosvex

Chaosvex commented May 1, 2016

@cboden Yes, I know, but that has nothing to do with the leaks within Ratchet.

@MarcGodard

@cboden I figured out the issue minutes after posting that. It was my fault. Thanks for the quick response.

@andig

andig commented May 1, 2016

@MarcGodard could you share your findings, if they're of general interest?

@MarcGodard

I had an error in my code, that's all.

@Vendin

Vendin commented Aug 24, 2016

I have the same problem as @pradeepta20: after several hours the sockets stop responding. I haven't found a solution in this discussion. Is there any solution at all, other than constantly restarting the socket server?

@grappetite-tahir

I'm using an AWS EC2 server for my application. My chat has stopped twice so far. When it happens, I go to the terminal and use the pgrep command to find the WebSocket process, and it shows me the processes, meaning the socket server is still running, but the chat isn't working. I then kill both processes manually, my cron job (which runs every minute to start the socket server) starts it again, and my chat comes back to life. I don't know what is going on in the background. I log everything myself (user attached, user sent a message to this user, and so on), but there is no error in my logs.

Furthermore, if a mobile user just turns off their wifi or mobile data, my chat does not fire the onClose method. So how do I mark a user offline in this scenario? What is the way to mark a user offline if onClose is never fired? (See the heartbeat sketch below.)

Thanks!
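
A server-side heartbeat is the usual answer to the onClose question above: the server pings each client periodically and force-closes any peer that stops answering, so onClose fires even when a phone silently drops off wifi. A minimal sketch, assuming Ratchet >= 0.4 (which added WsServer::enableKeepAlive()); MyChat is a placeholder for your own MessageComponentInterface:

<?php
// Hedged sketch: WebSocket keep-alive so dead peers still trigger onClose().
require __DIR__ . '/vendor/autoload.php';

use Ratchet\Http\HttpServer;
use Ratchet\Server\IoServer;
use Ratchet\WebSocket\WsServer;

$ws     = new WsServer(new MyChat());
$server = IoServer::factory(new HttpServer($ws), 8080);

// Ping every 30s; clients that never answer with a pong are closed,
// firing onClose(), which is where you would mark the user offline.
$ws->enableKeepAlive($server->loop, 30);

$server->run();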

@Chaosvex

Chaosvex commented Mar 8, 2019

Considering that neither developer has responded to a showstopper of a bug in several years, you'd be best off dropping this project and looking for an alternative.

@WyriHaximus

We fixed several memory leaks in ReactPHP over the last few years since your last comment. We implemented new event loops and did a ton of performance upgrades. The thing is, you need to account for the following when deploying:

  • Raise the ulimit for your user/application so it can handle more than 1024 open FDs
  • Use an event loop extension instead of the default stream_select loop, which is hard-capped at 1024 FDs
  • Monitor your application's connections, memory usage, CPU usage, and requests/messages per second (see the sketch below)

We simply cannot debug your problem very deeply without information like that. What also matters is what you do once a connection has been established, what kind of resources you use, and how you connect to databases/storage, etc.

And lastly, I really want to stress that we cannot change the 1024 limit for you on your deployment. You have to do that yourself. It's hard-coded into stream_select, so you need to switch the event loop and raise the ulimit if you want to resolve it. Whether or not you consider raising the limit a good solution, that is how the OS works; it's entirely out of our hands and in yours to resolve on your systems.
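
A minimal sketch of the monitoring point above (Linux-only; the interval and the /proc approach are assumptions, not ReactPHP APIs): log the process's open-descriptor count periodically so a leak toward the stream_select ceiling is visible long before the server wedges.

<?php
// Hedged sketch: periodically log this process's open file descriptor
// count by reading /proc/<pid>/fd (Linux-only). Attach the timer to the
// same loop your Ratchet server already runs on.
require __DIR__ . '/vendor/autoload.php';

$loop = React\EventLoop\Factory::create();

$loop->addPeriodicTimer(60, function () {
    $fds = count(scandir('/proc/' . getmypid() . '/fd')) - 2; // minus "." and ".."
    error_log(sprintf('open fds: %d, memory: %d bytes', $fds, memory_get_usage(true)));
});

$loop->run();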

@Chaosvex

Chaosvex commented Mar 8, 2019

This was never about memory leaks or even a concurrent handle/descriptor limit; it was about descriptors leaking, so a popular Ratchet-based site had to be frequently killed off and restarted to prevent it from live-locking and potentially bringing the entire system to a crawl. The concurrent handle count was irrelevant.

@WyriHaximus

File descriptors don't leak, and according to your own description, hitting the 1024 limit was/is the root cause of the issue you were having. They can leak over from a parent process into a child process, but that's a totally different issue. (And we added tooling against that in react/child-process v0.6.0.)

The thing is, we have several users running 50K+ connections in a single process with Ratchet without a hitch. And I don't want to suggest the problem is within your code, but implementing a long-running service in PHP is a delicate dance between us, the maintainers of Ratchet/ReactPHP, and the implementers of our libraries. This is also lower-level PHP than your normal request-response cycle. I'll gladly sit down for a hangouts call with anyone to resolve their problems, but it's going to get technical, diving into tracing, logging, and monitoring really fast. And to be really honest, I can't do much with "it's locking up, please fix". I want to dive into that and preferably find a way to reproduce it so we can fix it upstream.

Personally, I monitor memory usage (amongst other things) fanatically to stay ahead of issues like this.

[image: memory-usage monitoring graph]

@Chaosvex

Chaosvex commented Mar 8, 2019

Yes, hitting 1024 total over the lifetime of the process, not 1024 concurrent. When those connections are no longer active but their handles are never released, that's when it becomes fair to refer to it as a leak.

@cboden
Member

cboden commented Mar 8, 2019

Just tested this with StreamSelectLoop, no problems. I suspect you're holding a reference somewhere, holding open 1024 connections, causing your problems.

composer require cboden/ratchet

<?php
require __DIR__ . '/vendor/autoload.php';

$app = new Ratchet\App('localhost', 8080);
$app->route('/echo', new Ratchet\Server\EchoServer, array('*'));
$app->run();

// In a browser console:
setInterval(() => {
  let c = new WebSocket('ws://localhost:8080/echo');
  setTimeout(() => c.close(), 250);
}, 500);

[screenshot: repeated 101 responses in the network log]


^ Server still responding with 101 HTTP status code after 1024 connections.

@inri13666

The given example covers the case where connections are closed. What about connections that are not closed? I fail after 125 connections with:

var x = []; for (var i = 0; i < 1024; i++) { x[i] = new WebSocket('ws://MYHOSTADDRESS/ws'); }

@cboden
Member

cboden commented Apr 5, 2019

@inri13666 What error are you getting? The reason I closed connections was that the browser I was testing with didn't allow that many concurrent connections to a single server, as a security measure.

@inri13666

I don't have any errors in the log; the socket just stops responding, and only restarting the socket server helps :(

@inri13666

inri13666 commented Apr 17, 2019

I found the cause of my issue:
Web browsers allow huge numbers of open WebSockets.
The infamous 6-connections-per-host limit does not apply to WebSockets. Instead, a far bigger limit holds (255 in Chrome and 200 in Firefox).

Chrome Sources
Source Article

@inri13666

inri13666 commented Apr 17, 2019

Finally, I found the solution:
Never use React\EventLoop\StreamSelectLoop; it is the cause of all these issues, because on Linux it fails after 1023 connections and on Windows it fails after 512 (source).

To solve the issue described here, install one of the following PHP extensions (see the verification sketch below):

  1. Event extension (my choice, but it has some issues on Windows)
  2. libuv wrapper (this loop is known to work on PHP 7+)
  3. interface to libev library (this loop is known to work with PHP 5.4 through PHP 7+)
  4. Libevent - event notification (not recommended for PHP 7+, but there is an unofficial update)
  5. PHP libev Extension

P.S.
Sample application for checking connections:
Server: https://github.com/inri13666/web-socket-echoserver
Client: https://github.com/inri13666/web-socket-test
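
To verify which loop you actually ended up with after installing one of these extensions, a quick check (assumes react/event-loop is installed via Composer):

<?php
// Prints the loop implementation Factory::create() picks, so you can
// confirm the PECL extension is actually loaded before deploying.
require __DIR__ . '/vendor/autoload.php';

echo get_class(React\EventLoop\Factory::create()) . PHP_EOL;
// e.g. "React\EventLoop\ExtEventLoop" when ext-event is loaded;
// "React\EventLoop\StreamSelectLoop" means you are still on the
// 1024-descriptor select loop.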

@machomath

Hi, sorry for the late comment. I was having a similar issue where the server seemingly stopped responding after some time. I tried many solutions, but in the end it turned out that MySQL was timing out after 8 hours of inactivity and disconnecting. So I put in a root cron job:

0 0,8,16 * * * systemctl restart abc.service

It worked for me.
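
For reference, the common alternative to restarting the whole service is to keep the MySQL connection alive from inside the event loop. A hedged sketch ($loop is the server's ReactPHP loop; connectPdo() is a hypothetical helper that builds your PDO handle with PDO::ERRMODE_EXCEPTION set):

<?php
// Hedged sketch: ping MySQL well inside its 8-hour wait_timeout and
// rebuild the connection once it has gone away. connectPdo() is a
// hypothetical factory for your own PDO connection.
$pdo = connectPdo();

$loop->addPeriodicTimer(60, function () use (&$pdo) {
    try {
        $pdo->query('SELECT 1'); // cheap keep-alive query
    } catch (\PDOException $e) {
        error_log('MySQL went away, reconnecting: ' . $e->getMessage());
        $pdo = connectPdo();
    }
});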

@nivshah

nivshah commented Mar 22, 2021

I know this issue is quite old and has been closed for a while but it best covers what I have been running into.

Every once in a while, my websocket message interface just stops working/responding. I restart the server and things immediately work again, with clients immediately reconnecting. Sometimes it breaks again within 10 or so minutes and sometimes it works for hours or days without any issue. I have a cronjob that restarts this server every 3 hours, so my question is:

Is there determinism on which PECL extension (ev, event) is used whenever the websocket loop is defined? Is there a chance that sometimes I'm getting a loop that uses one interface and sometimes I get the other if I have both PECL extensions installed?

I do not see any memory or CPU issues on the server, but whenever I see a series of broken pipe issues:

An error has occurred: Unable to write to stream: fwrite(): SSL: Broken pipe

I restart and either don't see these errors again or do, in which case I restart again until they subside.

My theory is that sometimes the code that creates the loop uses the ev library and other times it uses libevent, and one of the two is not working well and the other one is fine. I am not quite sure how to test this issue, since the only indicator I have is the message interface having a series of onError callbacks all at once, with the Broken Pipe error I pasted above.
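
For what it's worth, loop selection is deterministic: in react/event-loop 1.x, Factory::create() probes extensions in a fixed order (roughly ext-uv, then ext-libev, then ext-ev, then ext-event, with StreamSelectLoop as the fallback), so with both ev and event installed you should always get the same implementation. To rule this out entirely, a minimal sketch (MyChat is a placeholder) that pins one loop explicitly and logs it at startup:

<?php
// Hedged sketch: pin a single loop implementation instead of letting
// Factory::create() choose, and log it so "which loop am I on?" is
// answered in the startup log. Assumes react/event-loop ~1.x with
// ext-event loaded; ExtEventLoop's constructor throws if it is not.
require __DIR__ . '/vendor/autoload.php';

use Ratchet\Http\HttpServer;
use Ratchet\Server\IoServer;
use Ratchet\WebSocket\WsServer;
use React\EventLoop\ExtEventLoop;
use React\Socket\Server as SocketServer;

$loop = new ExtEventLoop();
error_log('event loop: ' . get_class($loop));

$socket = new SocketServer('0.0.0.0:8080', $loop);
$server = new IoServer(new HttpServer(new WsServer(new MyChat())), $socket, $loop);
$loop->run();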

@abbaasi69

I had an experience which may help somebody.
My server would stop responding after about an hour, when the number of concurrent socket connections reached about 700. After trying all the usual solutions, I realized that I had a ProxyPass in Apache redirecting port 443 (SSL) to 8080 (my socket port). Finally, I increased the ServerLimit in my Apache prefork configuration from 700 to 1700, and the problem was solved, at least temporarily.
This shows that if you use Apache's ProxyPass (or another web server's equivalent), the web server itself can become the bottleneck, since it sits between the client and the WebSocket server.
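
For reference, a sketch of the Apache side of that fix (standard mpm_prefork directives; the values are the ones from the comment above and should be tuned to your own traffic). Each proxied WebSocket holds one worker for its whole lifetime, so the pool must be sized for concurrent connections, not requests per second:

<IfModule mpm_prefork_module>
    # one worker is held for the lifetime of each proxied WebSocket
    ServerLimit          1700
    MaxRequestWorkers    1700
</IfModule>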

@roguitar88

roguitar88 commented Jan 12, 2022

Well, folks, I don't know whether my issue is related to this one specifically, but here is what I've experienced for about a month: I have a WebSocket service running via supervisor on port 8080, and it works perfectly with my implementation (PHP/JavaScript/CodeIgniter/Ratchet). Yes, it works well, but only for 10 or 12 hours (generally until the following day). When I check the WebSocket service itself, it's still there, running on port 8080. The issue seems to be on the client side (at least apparently; I also suspect there may be some hindrance on the server side). Checking my console (JavaScript logs), I see that:

  1. It really connects (onopen() is fired normally)
  2. onmessage() works
  3. But when the code tries to do anything with websocket.send(), the connection is closed immediately (note: onerror() is not fired, but onclose() is, and weirdly there is no error). This happens on the client side (browser)
  4. I have ev (installed via pecl) installed and running; I did it yesterday, but the issue seems to persist
  5. I'm using OriginCheck, but I believe it has no relation at all to the issue
  6. In Nginx, I'm using a server block similar to this:
  map $http_upgrade $connection_upgrade {
      default upgrade;
      ''      close;
  }

  upstream ws-backend {             
      # server websocket.your-site.com:8080;       
      # server localhost:8080;            
      # server 135.659.25.124:8080;
      server 127.0.0.1:8080;
  }

  #If you want to force HTTPS, use the server block below:
  server {
      listen 80;
      listen [::]:80;

      server_name websocket.your-site.com;

      return 301 https://$host$request_uri;
  }

  server {
      listen 443 ssl http2;
      listen [::]:443 ssl http2;

      server_name websocket.your-site.com;

      #Start the SSL configurations
      #ssl on;
      ssl_protocols TLSv1.2;

      # Fix The Logjam Attack.
      ssl_ciphers EECDH+AESGCM:EDH+AE...; #only for example purposes
      ssl_prefer_server_ciphers on;
      ssl_dhparam /etc/ssl/dh2048_param.pem;
      
      ssl_certificate /etc/pathto/websocket.your-site.com/fullchain.pem; # managed by Certbot
      ssl_certificate_key /etc/pathto/websocket.your-site.com/privkey.pem; # managed by Certbot
      
      location / {
          proxy_set_header       X-Forwarded-For $remote_addr;
          proxy_set_header       Host $http_host;
          proxy_pass             http://ws-backend;
          proxy_http_version     1.1;
          proxy_set_header       Upgrade $http_upgrade;
          proxy_set_header       Connection $connection_upgrade;
      }
  }
  7. As you can see above, I created a subdomain just for the WebSocket service, and I'm always using Connection $connection_upgrade. If I use Connection "", I simply can't connect to the websocket any longer; it outputs: WebSocket connection to 'wss://websocket.your-site.com/' failed:. But of course, this doesn't have anything to do with the issue either
  8. In nginx.conf, I set worker_processes auto;, worker_rlimit_nofile 500000;, and worker_connections 1024;, at least so far. But I think those numbers are more than enough
  9. I checked the file descriptor limits and decided to increase some values (absurdly): fs.file-max = 2097152 in /etc/sysctl.conf; root hard nofile 500000 and root soft nofile 500000 in /etc/security/limits.conf; session required pam_limits.so in /etc/pam.d/common-session
  10. I'm using PDO, and the database connection settings are placed inside the constructor of the WebSocket app's main class
  11. Port 8080 (which I use to connect the WebSocket) is okay and enabled by sudo ufw allow 8080/tcp
  12. These are my server specs: 16GB RAM, 400GB SSD, 6 vCPU cores, 32TB bandwidth (unlimited incoming traffic), PHP 8.0.14, Ubuntu 20.04 LTS, Nginx 1.18.0
  13. In sum, I just don't know what on Earth is causing this. In truth, I can't take it any longer, lol. (See the proxy-timeout note below.)
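
One thing worth checking in a setup like the one above: nginx's proxy_read_timeout defaults to 60s, and a proxied WebSocket that stays idle past it is closed by nginx without any server-side error, which could explain a connection that dies quietly and fires onclose but not onerror. A hedged sketch of the same location block with raised timeouts (the 3600s values are illustrative; sending periodic pings is the other common fix):

  location / {
      proxy_set_header       X-Forwarded-For $remote_addr;
      proxy_set_header       Host $http_host;
      proxy_pass             http://ws-backend;
      proxy_http_version     1.1;
      proxy_set_header       Upgrade $http_upgrade;
      proxy_set_header       Connection $connection_upgrade;
      # keep idle WebSockets open for up to an hour
      proxy_read_timeout     3600s;
      proxy_send_timeout     3600s;
  }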

@amirlegendesk

In 2023 I am also facing the same issue. Nothing in the logs, no errors, but after a few hours messages are no longer sent.
