number of file descriptors increases beyond limit #68

Open
fanchyna opened this issue May 30, 2016 · 6 comments

@fanchyna
Contributor

This happens on one of the web servers (03). I first got a "server down" message from the monitor. I then looked at the localhost log file and saw many errors like "too many open files"; the system could not even open a sitemap file and respond to the client. I then checked "ulimit -Hn" and the limit was 8192. "lsof" showed more than 10,000 rows, most of them connections like "web server -> repo server [CLOSE_WAIT]". These CLOSE_WAIT sockets are very annoying and have been directly causing web server outages. However, as far as I know, there is no way to clear them except to restart the service or reboot. In this case I had to restart Tomcat, but I wonder whether this is a software or a hardware issue. The number of file descriptors is below the limit (also 8192) on the other web servers.
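
For context, a socket stuck in CLOSE_WAIT means the peer (here the repo server) has already closed its side of the connection, but the local process has never called close() on its end, so the file descriptor stays allocated until the application closes it or exits. Below is a minimal sketch of the usual leak pattern and its fix, assuming the web server fetches documents from the repo server over plain java.net.HttpURLConnection; the class name, URL, and method names are illustrative, not taken from the actual code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class RepoFetch {

    // Leaky pattern: the response stream is never closed, so once the repo
    // server closes its side the socket lingers in CLOSE_WAIT and the file
    // descriptor is only released when Tomcat is restarted.
    static byte[] fetchLeaky(String docUrl) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(docUrl).openConnection();
        InputStream in = conn.getInputStream();
        return in.readAllBytes();              // readAllBytes() needs Java 9+; stream never closed
    }

    // Safer pattern: try-with-resources closes the stream even on errors,
    // and disconnect() releases the underlying connection.
    static byte[] fetchSafe(String docUrl) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(docUrl).openConnection();
        try (InputStream in = conn.getInputStream()) {
            return in.readAllBytes();
        } finally {
            conn.disconnect();
        }
    }
}
```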

@kylemarkwilliams
Contributor

kylemarkwilliams commented May 30, 2016

Does this only happen on one of the web servers or all of them from time to time?

@dwj300
Contributor

dwj300 commented May 30, 2016

I think we can add a timeout so it automatically closes the socket:
http://stackoverflow.com/questions/3000214/java-http-client-request-with-defined-timeout
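
A sketch of what that could look like with the standard java.net.HttpURLConnection; whether the web app uses this class or a library such as Apache HttpClient is an assumption, and the timeout values and repo URL below are placeholders.

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.SocketTimeoutException;
import java.net.URL;

public class TimeoutFetch {
    public static void main(String[] args) throws Exception {
        // Hypothetical repo-server URL used only for illustration.
        URL url = new URL("http://csxrepository01:8080/some/document");

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(5_000);   // fail if the TCP connect takes more than 5 s
        conn.setReadTimeout(10_000);     // fail if no data arrives for 10 s

        try (InputStream in = conn.getInputStream()) {
            System.out.println("HTTP status: " + conn.getResponseCode());
            while (in.read() != -1) { /* drain the body so the connection can be reused */ }
        } catch (SocketTimeoutException e) {
            System.err.println("Request timed out: " + e.getMessage());
        } finally {
            conn.disconnect();           // release the underlying socket
        }
    }
}
```

Note that a timeout only guards against requests that hang; sockets already sitting in CLOSE_WAIT are released only when the application closes them, so this complements, rather than replaces, closing response streams promptly.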

@fanchyna
Contributor Author

This time it happened only on csxweb03. That is why I wonder whether this could be related to a hardware issue, since the sockets are virtualized.

@kylemarkwilliams
Contributor

@dwj300 I think the timeout could be useful, but it's important to try and figure out why this is happening on only one server and not the others. @fanchyna Do they maybe have some different settings?

@fanchyna
Contributor Author

The ulimit settings are identical (8192). What other settings do I need to look at?

@fanchyna
Contributor Author

The same problem happened today on web01: "lsof" reports 9075 open descriptors while "ulimit -Hn" gives 8192. I have doubled the limit on all web servers, but I believe there is an underlying problem, either in the software or in the hardware (network?). I do not rule out a hardware issue because most of the CLOSE_WAIT connections involve csxrepository01.
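
One cheap safeguard, offered here only as a suggestion (the 80% threshold is arbitrary and this is not something the code currently does), is to have the web app log its own descriptor usage periodically so we get a warning well before the hard limit is hit. On HotSpot JVMs on Linux the counts are exposed through com.sun.management.UnixOperatingSystemMXBean:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

import com.sun.management.UnixOperatingSystemMXBean;

public class FdMonitor {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();

        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unixOs = (UnixOperatingSystemMXBean) os;
            long open = unixOs.getOpenFileDescriptorCount();
            long max  = unixOs.getMaxFileDescriptorCount();
            System.out.printf("open file descriptors: %d of %d%n", open, max);

            // Warn early instead of waiting for "too many open files".
            if (open > max * 0.8) {
                System.err.println("WARNING: descriptor usage above 80% of the limit");
            }
        } else {
            System.out.println("Descriptor counts not available on this JVM/platform.");
        }
    }
}
```

Run from a scheduled task inside the web app, this would flag the leak on whichever server it occurs on, without waiting for the monitor's "server down" alert.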
