Server appears to be down #481
Hmm, I got no notice of it being down, nor does it seem down right now. The logs show a slowly increasing number of "Client Request Interrupted" (with no clear reason) at 4pm, followed by a number of "Backend connection timeout", and finally "Process running mem=572M, Error R14 (Memory quota exceeded)" at 4:10pm. It restarted at 5:34pm. I'm tempted to file this under #459.
That happened to me yesterday, and the badges are showing broken images right now.
I am seeing the main site itself (http://shields.io) intermittently timing out for the past week or so.
@ms-ati The website itself is hosted by GitHub: it is https://badges.github.io/shields/. If it is timing out, that would be a GitHub problem.
The shields are not rendered on my GitHub page.
Every morning there is a problem with rendering the shields. Right now, as we speak, they will not render.
Everything appears to be loading just fine from where I am.
I'll switch servers soon.
+1
+1 web runs, but images aren't served
That's strange. (There was a small 20-minute disruption an hour ago, so maybe it was simply a timeout.) In case it is not, could you point me to a badge in particular?
https://img.shields.io/pypi/v/cot.svg
Both of these fail with an "Application Error" at this time.
I'm seeing them correctly; it was probably a temporary fluke. The really interesting thing will be when we switch servers: #459
The main page appears to be down again. This GitHub page loads, but the images won't load.
Seems it's down again. Should we finally admit that shields.io is a nice idea, but in practice it just doesn't work? Most serious services have SVG badges nowadays...
I agree. Great idea, but poorly executed.
It seems up?
Look at the server logs. If you see 500 errors, it means it was down.
It looks like it is now working.
Well, you're breaking my heart. It is true that we've had a few 5-minute hiccups, as we do a few times a week when requests go past 400 req/s. Heroku. However, we haven't switched to new servers (#459) because I'm trying to be extra careful, precisely to avoid you saying things like that!
I'm not personally attacking you. I am more disappointed than anything, because this was a great service. However, I can't keep using it: when your service is broken and a page attempts to load the image, the page hangs, and a hanging page reflects poorly on the page owner. This problem has been going on since June, so I feel like I've waited long enough for a solution. If you want my advice on how to scale properly, I can certainly help you there. Here are some of the first ideas that come to mind:
"AWS" and "cheap" in the same sentence? What? |
The guy is doing 400 requests/second, I would hardly call AWS overkill. |
For the price of a single AWS instance and the monthly data transfer required for all the images (which I imagine is at least several GB), you can probably get several VPSes at another provider and load-balance them, which will result in an overall much more scalable configuration. BuyVM offer anycast IPs for free when you have servers in multiple regions, and I'm sure other providers do too.
Let's do some back-of-the-napkin math here. Say each SVG is 700 bytes and he serves 400 SVGs per second for an entire month: 700 bytes × 400 per second × 60 seconds × 60 minutes × 24 hours × 30 days ≈ 725 GB per month. For the first 10 TB, AWS charges $0.09 per GB of outbound transfer, which would cost about $65. You think that's a lot of money?
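To make the arithmetic easy to check, here is the same estimate as a small script (the 700-byte size, 400 req/s rate, and $0.09/GB price are the figures quoted above, not measurements):

```js
// Back-of-the-napkin bandwidth cost estimate, using the figures from this thread.
const bytesPerBadge = 700;       // assumed average SVG size
const requestsPerSecond = 400;   // peak rate mentioned above
const secondsPerMonth = 60 * 60 * 24 * 30;

const gbPerMonth = (bytesPerBadge * requestsPerSecond * secondsPerMonth) / 1e9; // ≈ 725.76 GB
const awsEgressPerGb = 0.09;     // first-10TB outbound tier at the time
console.log(`${gbPerMonth.toFixed(2)} GB/month ≈ $${(gbPerMonth * awsEgressPerGb).toFixed(2)}`);
```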
Sorry for heating up the spirits here. I'm pretty sure any solution costs money and time. What I wanted to say is that the service is hard to sustain, assuming it doesn't make any money.
Plus $19.04 for an actual server (for comparison, I selected a t2.small in US East). That brings us to roughly US$84 for 725 GB of transfer plus one single server. For that money you could pretty easily find a high-end server with unmetered bandwidth (or, alternatively, several servers behind a load balancer). The issue with EC2 is that if someone DDoSes your site, you pay for it (literally, through the increased bandwidth usage).
For sure. Someone will have to pay the hosting costs at some point, whether out of someone's pocket or through sponsorship :) Maybe some sort of sponsorship deal would be good, like how CDN providers sponsor jsDelivr and cdnjs.
Seems it is down again :(
Looks OK to me.
The images seem to have gone out again.
The current status is that we have switched servers. Things are faster now, but potentially brittle, as I improve the tools I use to manage the server.
Ah, OK. Seemed to be the thing to do. FWIW, it appears to be back up now.
All images have been dead all day today for me. @espadrine Anything we could do to help? Donations? Known code hot-spots?
I can confirm, they're all down currently. If I fund a second server (on DigitalOcean), would you be willing to add it to Cloudflare's DNS? That way requests would get automatically load-balanced between the two servers. We could then use https://runbook.io to perform automatic failover, automatically removing IP addresses from the DNS when they fail.
Failover by modifying DNS has a significant performance penalty, as you need to wait for cached DNS entries to expire before traffic moves off the failed server.
@ForbesLindesay I'm digging into what might be the cause. Apart from the crashing issue, the server's CPU seems happy. /var/log/kern.log excerpt:
phantomjs seems to be oom-killer'ed often. Is it the cause? It seems to get killed quite often within normal operation. I don't understand the whole picture yet. (Side note: the current Gratipay donations could cover the expenses of an additional server.)
If you are consistently running into OOM issues, then it's possible that the OOM killer misfires and murders a critical system process, thus causing the crash. I'm curious, though: how much RAM does the server have, and how much does Node use? Unless there's something I'm missing, this shouldn't require particularly many resources. What's PhantomJS used for, for example?
Phantomjs is used for converting an SVG to a PNG (https://github.com/badges/shields/blob/5ffc6328ce1d54034e97fec6d815e08234db2739/svg-to-img.js). Incidentally, since most people should probably be using SVG images anyway, how about moving the SVG-to-PNG converter into a separate server, which we could POST some SVG to and receive a PNG back? That way the SVG-to-PNG conversion could fail independently of the rest of the application.
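A minimal sketch of that split, assuming Express and a hypothetical svgToPng helper wrapping the existing PhantomJS script (names and the route are illustrative, not the project's actual API):

```js
// Hypothetical standalone SVG-to-PNG service (a sketch, not the real shields code).
const express = require('express');
const svgToPng = require('./svg-to-png'); // hypothetical async wrapper around the PhantomJS script

const app = express();

// Accept raw SVG bodies up to 100 kB.
app.use(express.text({ type: 'image/svg+xml', limit: '100kb' }));

app.post('/convert', async (req, res) => {
  try {
    const png = await svgToPng(req.body); // assumed to resolve to a PNG Buffer
    res.type('image/png').send(png);
  } catch (err) {
    // A converter crash now only takes down this service, not the badge server.
    res.status(500).send('conversion failed');
  }
});

app.listen(3001);
```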
PhantomJS is used for those users that want PNG images. The SVG gets converted to PNG through a PhantomJS script: https://github.com/badges/shields/blob/master/phantomjs-svg2png.js The server has 2 GB of RAM (and a 10 GB SSD which goes largely unused for now). I use an in-RAM LRU cache which might take more space than necessary: https://github.com/badges/shields/blob/master/server.js#L160
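For reference, the general shape of such an in-RAM cache, sketched with the npm lru-cache package's classic API (illustrative only, not the project's actual implementation):

```js
// Sketch of a bounded in-RAM badge cache using npm's lru-cache (classic API).
const LRU = require('lru-cache');

// Keep at most 1000 rendered badges; least recently used entries are evicted first.
const badgeCache = new LRU({ max: 1000 });

function getBadge(key, render) {
  const hit = badgeCache.get(key);
  if (hit) return hit;         // hit: reuse the cached buffer
  const badge = render(key);   // miss: render and remember it
  badgeCache.set(key, badge);
  return badge;
}
```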
Looking through the code, it doesn't look like the generated PNGs etc. are cached? I would expect that the most common badges are requested extremely frequently; could we cache the PNGs of those?
Are they not? There seems to be a 1000-sized LRU cache here (line 10 in 5ffc632).
I somehow missed that; disregard my previous comment :)
One thought: given that we are buffering up the entire image anyway, when we have a cache hit it might be more efficient to just send the buffer to the response rather than creating a stream from the buffer.
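In Node terms, the cache-hit path could look like this (a sketch; badgeCache and the handler shape are assumed, not the project's actual code):

```js
// Sketch: on a cache hit, write the buffered image directly instead of wrapping it in a stream.
function serveBadge(req, res, key) {
  const cached = badgeCache.get(key); // a Buffer holding the rendered image, if present
  if (cached) {
    res.setHeader('Content-Type', 'image/svg+xml');
    res.end(cached); // single write; no Readable stream allocated
    return;
  }
  // ...on a miss: render, cache, and respond as before...
}
```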
I'd suggest using something like supervisord to monitor processes and restart the service if it is killed (e.g. due to OOM). If you're OOMing constantly, then you probably want more RAM (or another server) :D For the PNGs, rather than caching images in RAM yourself, consider writing the PNG files to the file system and letting the operating system's cache handle them. Better to use built-in caching mechanisms rather than rolling your own, especially when they're built for this specific purpose (caching files in RAM) and properly handle evicting items when RAM is low :). In that case, are you using something like Nginx in front of Node? Using the
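The write-PNGs-to-disk idea in sketch form, letting the OS page cache keep hot files in RAM (the directory, helper names, and render hook are assumptions, not the project's actual code):

```js
// Sketch: persist rendered PNGs to disk and let the OS page cache keep hot ones in RAM.
const fs = require('fs');
const path = require('path');
const crypto = require('crypto');

const CACHE_DIR = '/var/cache/shields'; // assumed location; must exist and be writable

function pngPathFor(badgeKey) {
  // Hash the badge parameters into a stable, filesystem-safe name.
  const hash = crypto.createHash('sha1').update(badgeKey).digest('hex');
  return path.join(CACHE_DIR, hash + '.png');
}

function servePng(res, badgeKey, render) {
  const file = pngPathFor(badgeKey);
  fs.readFile(file, (err, buf) => {
    if (!err) {
      res.setHeader('Content-Type', 'image/png');
      return res.end(buf); // disk hit; the kernel caches hot files in RAM for free
    }
    const png = render(badgeKey);      // miss: render the badge (assumed to return a Buffer)
    fs.writeFile(file, png, () => {}); // persist for next time (fire and forget)
    res.setHeader('Content-Type', 'image/png');
    res.end(png);
  });
}
```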
It doesn't feel like phantomjs is taking much memory at all; maybe it simply is the straw that breaks the camel's back. It is a whole browser, after all. I am using forever in front of node to restart it when it dies. The OOM isn't really constant, just regular: something like once every 5h. I wonder if I am leaking, though. Maybe the assumptions of my cache are wrong. Maybe I should try changing the LRU cache to be outrageously small, just to see if it changes anything.
Why not just use Inkscape or ImageMagick for rendering the SVGs? That would seem a lot lighter on resources than loading an entire browser context. I concur with @Daniel15 on the suggestion to just let the OS handle the caching, or at most use something like Redis. There's certainly no reason why you'd need 2 GB of RAM for a service like this.
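For what it's worth, shelling out to ImageMagick from Node could look like this (a sketch; it assumes the ImageMagick convert binary is installed, and the flags would need tuning against the badge templates):

```js
// Sketch: rasterize an SVG to PNG by shelling out to ImageMagick instead of PhantomJS.
const { execFile } = require('child_process');

function svgFileToPng(svgPath, pngPath, callback) {
  // -density controls rasterization resolution; -background none keeps transparency.
  execFile('convert', ['-density', '90', '-background', 'none', svgPath, pngPath],
    (err) => callback(err));
}
```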
The use of phantomjs guarantees a correct result. I had issues in the past with changes to the SVG template sneakily producing incorrect PNG results because of Inkscape-specific or (worse) ImageMagick-specific quirks. But as I said, I doubt phantomjs is taking all that much memory, especially considering it dies after each use. The fact that a process spawns at all, and becomes the one that hits the limit, is probably why the OOM killer wants to put the blame on it. I have tried restarting the server with a much smaller cache. I'll see if it changes anything.
Since I killed the HTTP server, things are much smoother. Now, all HTTP traffic is redirected to the HTTPS server.
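In plain Node terms, such an HTTP-to-HTTPS redirect can be as small as this (a sketch, not the project's actual code):

```js
// Sketch: a minimal listener that bounces all plain-HTTP traffic to the HTTPS server.
const http = require('http');

http.createServer((req, res) => {
  res.writeHead(301, { Location: 'https://img.shields.io' + req.url });
  res.end();
}).listen(80); // port 80 needs elevated privileges or a port forward
```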
https://img.shields.io/ seems to be down.