
Server appears to be down #481

Closed
ForbesLindesay opened this issue Jun 22, 2015 · 49 comments

@ForbesLindesay
Contributor

https://img.shields.io/ seems to be down.

[screenshot attached]

@espadrine
Member

Hmm, I got no notice of it being down, nor does it seem down right now. The logs show a slowly increasing number of "Client Request Interrupted" (with no clear reason) at 4pm, followed by a number of "Backend connection timeout", and finally "Process running mem=572M, Error R14 (Memory quota exceeded)" at 4:10pm. It restarted at 5:34pm.

I'm tempted to file this under #459.

@chriscannon

That happened to me yesterday and the badges are showing broken images right now.

@ms-ati

ms-ati commented Jun 24, 2015

I have been seeing the main site itself (http://shields.io) intermittently timing out for the past week or so.

@espadrine
Member

@ms-ati The website itself is hosted by GitHub; it is https://badges.github.io/shields/. If it times out, that would be a GitHub problem.

@slavafomin

The shields are not rendering on my GitHub page.

@chriscannon

Every morning there is a problem with rendering the shields. Right now, as we speak, they will not render.

@Starefossen

Everything appears to be loading just fine from where I am.

@espadrine
Member

I'll switch servers soon.

@Dervisevic

Website is working for me but images are broken.
[screenshot taken 2015-07-02 at 09:05:57]

@dirkmoors

+1

@iki

iki commented Jul 24, 2015

+1. The website runs, but the images aren't served.

@espadrine
Member

That's strange. (There was a small 20min disruption an hour ago, so maybe it was simply a timeout.) In case it is not, could you point me to a badge in particular?

@glennmatthews

https://img.shields.io/pypi/v/cot.svg
https://img.shields.io/badge/docs-latest-brightgreen.svg?style=flat

Both of these fail with an "Application Error" at this time.

Application Error

An error occurred in the application and your page could not be served. Please try again in a few moments.

If you are the application owner, check your logs for details.

@espadrine
Member

I'm seeing them correctly; it was probably a temporary fluke.

The really interesting thing will be when we switch servers: #459

@jakirkham

The main page appears to be down again. This GitHub page loads, but the images won't load.

@ionelmc

ionelmc commented Aug 19, 2015

Seems it's down again.

Should we finally admit that shields.io is a nice idea but in practice it just doesn't work? Most serious services have SVG badges nowadays...

@chriscannon

I agree. Great idea, but poorly executed.

@espadrine
Member

It seems up?

@chriscannon

Look at the server logs. If you see 500 errors it means that it was down.

@jakirkham

It looks like it is now working.

@espadrine
Member

Well, you're breaking my heart. It is true that we've had a few 5-minute hiccups, as we do a few times a week when requests go past 400 req/s. Heroku.

However, we haven't switched to new servers (#459) because I'm trying to be extra careful, precisely to avoid you saying things like that!

@chriscannon

I'm not personally attacking you. I am more disappointed than anything, because this was a great service. However, I can't keep using it: when your service is broken and a page attempts to load the image, the page hangs, and a hanging page reflects poorly on the page owner. This problem has been going on since June, so I feel I've waited long enough for a solution.

If you want my advice on how to scale properly, I can certainly help you there. Here are some of the first ideas that come to mind:

  1. Get off Heroku. I have never heard anything good about them in terms of scaling, and they charge way too much money. Switch to AWS because it's cheap and gives you the control you need.
  2. If you're not using nginx, stop what you're doing and switch to it now. Its performance and configurability cannot be beaten.
  3. Rely heavily on caching. If I had to guess, I bet you're processing each request every time it comes in. I don't really care if my badge is out of date for an hour, but I do care if the badge doesn't load. You could even use a service like Akamai for edge caching if you want your load times to be really fast. (See the sketch below.)
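A minimal sketch of what point 3 could look like in a Node badge server (the placeholder renderer and the one-hour TTL are illustrative, not the shields code):

// Hedged sketch: serve badges with a Cache-Control header so browsers and
// CDNs can reuse them instead of hitting the origin on every page load.
const http = require('http');

// Placeholder renderer; the real project builds the SVG from the request path.
function renderBadge(path) {
  return '<svg xmlns="http://www.w3.org/2000/svg" width="90" height="20">' +
         '<text x="5" y="14">' + path + '</text></svg>';
}

http.createServer((req, res) => {
  res.writeHead(200, {
    'Content-Type': 'image/svg+xml;charset=utf-8',
    'Cache-Control': 'public, max-age=3600', // let shared caches hold it for an hour
  });
  res.end(renderBadge(req.url));
}).listen(8080);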

@Daniel15
Member

Switch to AWS because it's cheap

"AWS" and "cheap" in the same sentence? What?
EC2 is great if you use multiple AWS services, but it's absolutely overkill for a service like this. This site would be fine on a regular VPS service (depending on how much monthly transfer it uses) for a lot cheaper than EC2.

@chriscannon

The guy is doing 400 requests per second; I would hardly call AWS overkill.

@Daniel15
Member

For the price of a single AWS instance and the monthly data transfer required for all the images (which I imagine is at least several GB), you can probably get several VPSes at another provider and load-balance them, which will result in an overall much more scalable configuration. BuyVM offers anycast IPs for free when you have servers in multiple regions, and I'm sure other providers do too.

@chriscannon

Let's do some back of the napkin math here. So let's say each SVG is 700 bytes and for an entire month he is serving 400 SVGs per second:

700 bytes × 400 requests per second × 60 seconds × 60 minutes × 24 hours × 30 days ≈ 725 GB per month

For the first 10 TB, AWS charges $0.09 per GB of outbound transfer, which would come to about $65. You think that's a lot of money?
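The same estimate as a quick script, using the assumptions above (700-byte badges, 400 requests per second, $0.09/GB outbound):

// Back-of-the-napkin bandwidth and cost estimate from this comment.
const bytesPerBadge = 700;              // assumed average SVG size
const requestsPerSecond = 400;          // assumed sustained load
const secondsPerMonth = 60 * 60 * 24 * 30;
const pricePerGB = 0.09;                // AWS outbound rate for the first 10 TB

const gbPerMonth = bytesPerBadge * requestsPerSecond * secondsPerMonth / 1e9;
console.log(Math.round(gbPerMonth) + ' GB/month, ~$' + Math.round(gbPerMonth * pricePerGB));
// → roughly 726 GB/month, ~$65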

@ionelmc

ionelmc commented Aug 19, 2015

Sorry for heating up the discussion here. I'm pretty sure any solution costs money and time. What I wanted to say is that it's hard to sustain, assuming this service doesn't make any money.

@Daniel15
Member

For the first 10 TB, AWS charges $0.09 per GB of outbound transfer, which would come to about $65. You think that's a lot of money?

Plus $19.04 for an actual server (for comparison, I selected a t2.small in US East). That brings us to roughly US$84 for 725 GB of transfer plus one single server. For $84 per month you could pretty easily find a high-end server with unmetered bandwidth (or alternatively you could have several servers and load-balance them).

The issue with EC2 is if someone DDoSes your site, you need to pay for it (literally, due to the increased bandwidth usage).

Sorry for heating up the discussion here. I'm pretty sure any solution costs money and time. What I wanted to say is that it's hard to sustain, assuming this service doesn't make any money.

For sure. Someone will have to pay the hosting costs at some point, whether that be out of someone's pocket or through sponsorship :) Maybe some sort of sponsorship deal could be good, like how CDN providers sponsor jsdelivr and cdnjs.

@bogdanRada
Contributor

seems it is down again :(

@jakirkham

Looks ok to me.

@jakirkham

The images seem to have gone out again.

@espadrine
Member

The current status is that we have switched servers. Things are faster now, but potentially brittle, as I improve the tools I use to manage the server.

@jakirkham

Ah, OK. That seemed to be the thing to do. FWIW, it appears to be back up now.

@anko

anko commented Sep 18, 2015

All images dead all day today for me.

@espadrine Anything we could do to help? Donations? Known code hot-spots?

@ForbesLindesay
Contributor Author

I can confirm they're all down currently. If I fund a second server (on DigitalOcean), would you be willing to add it to Cloudflare's DNS? That way requests would get automatically load-balanced between the two servers. We could then use https://runbook.io to perform automatic failover by removing IP addresses from the DNS when they fail.

@Daniel15
Member

Failover by modifying DNS has a significant performance penalty, as you need to set the TTL of the DNS entries to something very low, which means that DNS servers won't be able to properly cache any of the lookups. It's better to use a proper load balancer (which could be an inexpensive VPS: load balancers don't use much CPU or memory, just bandwidth).
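To illustrate, a load balancer can be as small as a round-robin reverse proxy. This is only a sketch; the backend addresses are placeholders:

// Hedged sketch of a round-robin reverse proxy in plain Node.
const http = require('http');

const backends = [
  { host: '10.0.0.1', port: 8080 },   // placeholder badge server 1
  { host: '10.0.0.2', port: 8080 },   // placeholder badge server 2
];
let next = 0;

http.createServer((clientReq, clientRes) => {
  const backend = backends[next++ % backends.length];
  const proxyReq = http.request({
    host: backend.host,
    port: backend.port,
    method: clientReq.method,
    path: clientReq.url,
    headers: clientReq.headers,
  }, (proxyRes) => {
    clientRes.writeHead(proxyRes.statusCode, proxyRes.headers);
    proxyRes.pipe(clientRes);
  });
  proxyReq.on('error', () => {
    clientRes.writeHead(502);
    clientRes.end('Bad gateway');
  });
  clientReq.pipe(proxyReq);
}).listen(80);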


@espadrine
Member

@ForbesLindesay I'm digging into what might be the cause. Apart from the crashing issue, the server's CPU seems happy.

/var/log/kern.log excerpt:

Sep 18 07:07:53 vps197850 kernel: [254101.816478] Node 0 DMA32 free:44612kB min:44696kB low:55868kB high:67044kB active_anon:1893312kB inactive_anon:5140kB active_file:56kB inactive_file:8kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2031608kB managed:1993648kB mlocked:0kB dirty:0kB writeback:0kB mapped:2072kB shmem:5288kB slab_reclaimable:10428kB slab_unreclaimable:14172kB kernel_stack:1312kB pagetables:8544kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:116 all_unreclaimable? yes
Sep 18 07:07:53 vps197850 kernel: [254101.816482] lowmem_reserve[]: 0 0 0 0
Sep 18 07:07:53 vps197850 kernel: [254101.816483] Node 0 DMA: 14*4kB (UEM) 4*8kB (UE) 1*16kB (M) 5*32kB (UEM) 1*64kB (U) 3*128kB (UE) 1*256kB (E) 2*512kB (EM) 2*1024kB (UE) 2*2048kB (ER) 0*4096kB = 8136kB
Sep 18 07:07:53 vps197850 kernel: [254101.816494] Node 0 DMA32: 1811*4kB (UEM) 459*8kB (UEM) 118*16kB (UE) 82*32kB (UEM) 52*64kB (UEM) 54*128kB (UEM) 26*256kB (UE) 12*512kB (UM) 2*1024kB (U) 0*2048kB 1*4096kB (R) = 44612kB
Sep 18 07:07:53 vps197850 kernel: [254101.816507] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Sep 18 07:07:53 vps197850 kernel: [254101.816508] 1339 total pagecache pages
Sep 18 07:07:53 vps197850 kernel: [254101.816512] 0 pages in swap cache
Sep 18 07:07:53 vps197850 kernel: [254101.816513] Swap cache stats: add 0, delete 0, find 0/0
Sep 18 07:07:53 vps197850 kernel: [254101.816514] Free swap  = 0kB
Sep 18 07:07:53 vps197850 kernel: [254101.816515] Total swap = 0kB
Sep 18 07:07:53 vps197850 kernel: [254101.816516] 511900 pages RAM
Sep 18 07:07:53 vps197850 kernel: [254101.816516] 0 pages HighMem/MovableOnly
Sep 18 07:07:53 vps197850 kernel: [254101.816517] 9490 pages reserved
Sep 18 07:07:53 vps197850 kernel: [254101.816517] 0 pages hwpoisoned
Sep 18 07:07:53 vps197850 kernel: [254101.816518] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Sep 18 07:07:53 vps197850 kernel: [254101.816522] [  138]     0   138    10197      137      20        0         -1000 systemd-udevd
Sep 18 07:07:53 vps197850 kernel: [254101.816524] [  140]     0   140     7448      589      18        0             0 systemd-journal
Sep 18 07:07:53 vps197850 kernel: [254101.816525] [  282]     0   282     6347     1724      15        0             0 dhclient
Sep 18 07:07:53 vps197850 kernel: [254101.816527] [  306]     0   306     6857       64      19        0             0 cron
Sep 18 07:07:53 vps197850 kernel: [254101.816528] [  307]     0   307    13791      170      31        0         -1000 sshd
Sep 18 07:07:53 vps197850 kernel: [254101.816529] [  309]     0   309     7065       81      18        0             0 systemd-logind
Sep 18 07:07:53 vps197850 kernel: [254101.816531] [  310]   104   310    10529      104      27        0          -900 dbus-daemon
Sep 18 07:07:53 vps197850 kernel: [254101.816532] [  312]     0   312    64665      237      29        0             0 rsyslogd
Sep 18 07:07:53 vps197850 kernel: [254101.816534] [  325]     0   325     3209       39      12        0             0 agetty
Sep 18 07:07:53 vps197850 kernel: [254101.816535] [  326]     0   326     3548       44      12        0             0 agetty
Sep 18 07:07:53 vps197850 kernel: [254101.816536] [  336]     0   336     6769      106      16        0             0 systemd
Sep 18 07:07:53 vps197850 kernel: [254101.816537] [  337]     0   337    12446      368      25        0             0 (sd-pam)
Sep 18 07:07:53 vps197850 kernel: [254101.816539] [  359]     0   359   160321     5616      54        0             0 node
Sep 18 07:07:53 vps197850 kernel: [254101.816540] [  370]     0   370   159041     5616      52        0             0 node
Sep 18 07:07:53 vps197850 kernel: [254101.816542] [ 4670]     0  4670   384097   160740     624        0             0 node
Sep 18 07:07:53 vps197850 kernel: [254101.816543] [ 5532]     0  5532   482556   259305     982        0             0 node
Sep 18 07:07:53 vps197850 kernel: [254101.816544] [ 6608]     0  6608   326546    21330      89        0             0 phantomjs
Sep 18 07:07:53 vps197850 kernel: [254101.816545] [ 6609]     0  6609   326402    19252      82        0             0 phantomjs
Sep 18 07:07:53 vps197850 kernel: [254101.816547] Out of memory: Kill process 5532 (node) score 502 or sacrifice child
Sep 18 07:07:53 vps197850 kernel: [254101.817845] Killed process 5532 (node) total-vm:1930224kB, anon-rss:1037220kB, file-rss:0kB
Sep 18 17:32:38 vps197850 kernel: [    0.000000] Initializing cgroup subsys cpuset

phantomjs seems to get OOM-killed often. Is it the cause? That seems to happen quite often within normal operation. I don't understand the whole picture yet.

(Side note: the current gratipay donations could cover the expenses of an additional server)

@espadrine espadrine mentioned this issue Sep 18, 2015
@joepie91

If you are consistently running into OOM issues, then it's possible that the OOM killer misfires and murders a critical system process, thus causing the crash.

I'm curious, though - how much RAM does the server have, and how much does Node use? Unless there's something I'm missing, this shouldn't require particularly many resources. What's PhantomJS used for, for example?

@ForbesLindesay
Contributor Author

PhantomJS is used for converting an SVG to a PNG (https://github.com/badges/shields/blob/5ffc6328ce1d54034e97fec6d815e08234db2739/svg-to-img.js).

Incidentally, since most people should probably be using SVG images anyway, how about moving the SVG-to-PNG converter onto a separate server, which we would POST some SVG to and receive a PNG back from? That way the SVG-to-PNG function could fail independently of the rest of the application.
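A rough sketch of what that could look like; svg2png() is a hypothetical stand-in for whatever does the actual rasterising (e.g. the existing PhantomJS script):

// Hedged sketch of a standalone SVG-to-PNG service: POST an SVG body, get a PNG back.
const http = require('http');

function svg2png(svgBuffer, callback) {
  // Placeholder: a real implementation would rasterise the SVG here.
  callback(null, svgBuffer);
}

http.createServer((req, res) => {
  if (req.method !== 'POST') { res.writeHead(405); return res.end(); }
  const chunks = [];
  req.on('data', (chunk) => chunks.push(chunk));
  req.on('end', () => {
    svg2png(Buffer.concat(chunks), (err, png) => {
      if (err) { res.writeHead(500); return res.end(); }
      res.writeHead(200, { 'Content-Type': 'image/png' });
      res.end(png);
    });
  });
}).listen(9000);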

@espadrine
Member

PhantomJS is used for those users who want PNG images. The SVG gets converted to PNG through a PhantomJS script: https://github.com/badges/shields/blob/master/phantomjs-svg2png.js

The server has 2 GB of RAM (and a 10 GB SSD which goes largely unused for now). I use an in-RAM LRU cache, which might take more space than necessary: https://github.com/badges/shields/blob/master/server.js#L160

@ForbesLindesay
Contributor Author

Looking through the code, it doesn't look like the generated PNGs etc. are cached? I would expect that the most common badges are requested extremely frequently; could we cache the PNGs of those?

@espadrine
Member

Are they not? There seems to be a 1000-sized LRU cache here:

var imgCache = new LruCache(1000);

@ForbesLindesay
Contributor Author

I somehow missed that, disregard my previous comment :)

@ForbesLindesay
Contributor Author

One thought: given that we are buffering up the entire image anyway, it might be more efficient to just send the buffer to the response rather than creating a stream from the buffer when we have a cache hit.
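Roughly this, with hypothetical names rather than the actual server.js code:

// Hypothetical sketch: on a cache hit, write the cached buffer straight to the
// response instead of constructing a stream around it first.
function sendCachedBadge(res, cachedBuffer, contentType) {
  res.writeHead(200, {
    'Content-Type': contentType,
    'Content-Length': cachedBuffer.length,
  });
  res.end(cachedBuffer); // no intermediate Readable stream needed
}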

@Daniel15
Member

I'd suggest using something like supervisord to monitor processes and restart the service if it is killed (e.g. due to OOM). If you're OOMing constantly, then you probably want more RAM (or another server) :D

For the PNGs, rather than caching images in RAM yourself, consider writing the PNG files to the file system and letting the operating system's cache handle them. Better to use built-in caching mechanisms rather than rolling your own, especially when they're built for this specific purpose (caching of files in RAM) and properly handle evicting items from the cache when RAM is low :). In that case, are you using something like Nginx in front of Node? Using the X-Accel-Redirect header to serve a file from the file system will likely have some perf gains compared to writing the raw bytes yourself.
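As a sketch of the X-Accel-Redirect idea from the Node side (the paths and location name are made up, and nginx would need a matching internal location):

// Hypothetical sketch: instead of streaming the PNG from Node, hand the
// already-written file off to nginx. Assumes an nginx block like:
//   location /protected-badges/ { internal; alias /var/cache/badges/; }
function serveCachedPng(res, fileName) {
  res.writeHead(200, {
    'Content-Type': 'image/png',
    // nginx intercepts this header and serves the file itself.
    'X-Accel-Redirect': '/protected-badges/' + fileName,
  });
  res.end();
}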

@espadrine
Member

It doesn't feel like phantomjs is taking much memory at all; maybe it is simply the straw that breaks the camel's back. It is a whole browser engine, after all.

I am using forever in front of node to restart it when dead. The OOM isn't really constant, just regular — something like once every 5h.

I wonder if I am leaking, though. Maybe the assumptions of my cache are wrong. Maybe I should try changing the LRU cache to be outrageously small, just to see if it changes anything.

@joepie91

Why not just use Inkscape or ImageMagick for rendering the SVGs? That would seem a lot lighter on resources than loading an entire browser context.

I concur with @Daniel15 on the suggestion to just let the OS handle the caching, or at most use something like Redis. There's certainly no reason why you'd need 2 GB of RAM for a service like this.
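For illustration, shelling out to ImageMagick from Node could look something like this (just a sketch; it assumes the convert binary is installed):

// Hedged sketch: rasterise an SVG with ImageMagick's convert instead of
// spawning a full headless-browser process.
const execFile = require('child_process').execFile;

function svgToPng(svgPath, pngPath, callback) {
  execFile('convert', ['-background', 'none', svgPath, pngPath], function (err) {
    callback(err);
  });
}

// Usage:
// svgToPng('badge.svg', 'badge.png', function (err) { if (err) console.error(err); });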

@espadrine
Member

The use of phantomjs guaranteed a correct result. I had issues in the past with changes to the SVG template sneakily producing incorrect PNG output because of Inkscape-specific or (worse) ImageMagick-specific quirks.

But as I said, I doubt phantomjs is taking all that much memory, especially considering it dies after each use. The fact that a process spawns at all, and becomes the one that hits the limit, is probably why OOM wants to put the blame on it.

I have tried restarting the server with a cache that is a lot smaller. I'll see if it changes anything.

@espadrine
Member

Since I killed the HTTP server, things are much smoother. Now, all HTTP traffic is redirected to the HTTPS server.
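For reference, the redirect itself is tiny; a sketch (not necessarily how the production server does it):

// Hedged sketch of an HTTP-to-HTTPS redirect.
const http = require('http');

http.createServer((req, res) => {
  res.writeHead(301, { Location: 'https://img.shields.io' + req.url });
  res.end();
}).listen(80);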
