
Server appears to be down #481

Closed
ForbesLindesay opened this issue Jun 22, 2015 · 49 comments

@ForbesLindesay
Contributor

https://img.shields.io/ seems to be down.

[screenshot attached]

@espadrine
Member

Hmm, I got no notice of it being down, nor does it seem down right now. The logs show a slowly increasing number of "Client Request Interrupted" (with no clear reason) at 4pm, followed by a number of "Backend connection timeout", and finally "Process running mem=572M, Error R14 (Memory quota exceeded)" at 4:10pm. It restarted at 5:34pm.

I'm tempted to file this under #459.

@chriscannon

That happened to me yesterday and the badges are showing broken images right now.

@ms-ati

ms-ati commented Jun 24, 2015

I have been seeing the main site itself (http://shields.io) intermittently timing out for the past week or so.

@espadrine
Member

@ms-ati The website itself is hosted by GitHub; it is https://badges.github.io/shields/. If it times out, that would be a GitHub problem.

@slavafomin

The shields are not rendering on my GitHub page.

@chriscannon

Every morning there is a problem with rendering the shields. Right now, as we speak, they will not render.

@Starefossen

Everything appears to be loading just fine from where I am.

@espadrine
Member

I'll switch servers soon.

@Dervisevic

Website is working for me but images are broken.
[screenshot taken 2015-07-02 at 09:05:57]

@dirkmoors

+1

@iki

iki commented Jul 24, 2015

+1. The website runs, but the images aren't served.

@espadrine
Member

That's strange. (There was a small 20min disruption an hour ago, so maybe it was simply a timeout.) In case it is not, could you point me to a badge in particular?

@glennmatthews

https://img.shields.io/pypi/v/cot.svg
https://img.shields.io/badge/docs-latest-brightgreen.svg?style=flat

Both of these fail with an "Application Error" at this time.

Application Error

An error occurred in the application and your page could not be served. Please try again in a few moments.

If you are the application owner, check your logs for details.

@espadrine
Member

I'm seeing them correctly; it was probably a temporary fluke.

The really interesting thing will be when we switch servers: #459

@jakirkham

The main page appears to be down again. This GitHub page loads, but the images won't load.

@ionelmc

ionelmc commented Aug 19, 2015

Seems it's down again.

Should we finally admit that shields.io is a nice idea but in practice it just doesn't work? Most serious services have SVG badges nowadays...

@chriscannon

I agree. Great idea, but poorly executed.

@espadrine
Member

It seems up?

@chriscannon

Look at the server logs. If you see 500 errors it means that it was down.

@jakirkham

It looks like it is now working.

@espadrine
Member

Well, you're breaking my heart. It is true that we've had a few 5-minute hiccups, as we do a few times a week when requests go past 400 req/s. Heroku.

However, we haven't switched to new servers (#459) because I'm trying to be extra careful, precisely to avoid you saying things like that!

@chriscannon

I'm not personally attacking you. I am more disappointed than anything, because this was a great service. However, I can't keep using it: when your service is broken and a page attempts to load the image, the page hangs, and a hanging page reflects poorly on the page owner. This problem has been going on since June, so I feel I've waited long enough for a solution.

If you want my advice on how to scale properly, I can certainly help you there. Here are some of the first ideas that come to mind:

  1. Get off Heroku. I have never heard anything good about them in terms of scaling, and they charge way too much money. Switch to AWS because it's cheap and gives you the control you need.
  2. If you're not using nginx, stop what you're doing and switch to it now. Its performance and configurability cannot be beaten.
  3. Rely heavily on caching. If I had to guess, I bet you're processing each request every time it comes in. I don't really care if my badge is out of date for an hour, but I do care if the badge doesn't load. You could even use a service like Akamai for edge caching if you want your load times to be really fast. (See the sketch below.)
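A minimal sketch of what point 3 could look like in a Node badge server (the placeholder renderer and the one-hour TTL are illustrative, not the shields code):

// Hedged sketch: serve badges with a Cache-Control header so browsers and
// CDNs can reuse them instead of hitting the origin on every page load.
const http = require('http');

// Placeholder renderer; the real project builds the SVG from the request path.
function renderBadge(path) {
  return '<svg xmlns="http://www.w3.org/2000/svg" width="90" height="20">' +
         '<text x="5" y="14">' + path + '</text></svg>';
}

http.createServer((req, res) => {
  res.writeHead(200, {
    'Content-Type': 'image/svg+xml;charset=utf-8',
    'Cache-Control': 'public, max-age=3600', // let shared caches hold it for an hour
  });
  res.end(renderBadge(req.url));
}).listen(8080);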

@Daniel15
Member

Switch to AWS because it's cheap

"AWS" and "cheap" in the same sentence? What?
EC2 is great if you use multiple AWS services, but it's absolutely overkill for a service like this. This site would be fine on a regular VPS service (depending on how much monthly transfer it uses) for a lot cheaper than EC2.

@chriscannon

The guy is doing 400 requests per second; I would hardly call AWS overkill.

@Daniel15
Member

For the price of a single AWS instance and the monthly data transfer required for all the images (which I imagine is at least several GB), you can probably get several VPSes at another provider and load-balance them, which will result in an overall much more scalable configuration. BuyVM offers anycast IPs for free when you have servers in multiple regions, and I'm sure other providers do too.

@chriscannon

Let's do some back of the napkin math here. So let's say each SVG is 700 bytes and for an entire month he is serving 400 SVGs per second:

700 bytes × 400 requests per second × 60 seconds × 60 minutes × 24 hours × 30 days ≈ 725 GB per month

For the first 10 TB, AWS charges $0.09 per GB of outbound transfer, which would come to about $65. You think that's a lot of money?
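The same estimate as a quick script, using the assumptions above (700-byte badges, 400 requests per second, $0.09/GB outbound):

// Back-of-the-napkin bandwidth and cost estimate from this comment.
const bytesPerBadge = 700;              // assumed average SVG size
const requestsPerSecond = 400;          // assumed sustained load
const secondsPerMonth = 60 * 60 * 24 * 30;
const pricePerGB = 0.09;                // AWS outbound rate for the first 10 TB

const gbPerMonth = bytesPerBadge * requestsPerSecond * secondsPerMonth / 1e9;
console.log(Math.round(gbPerMonth) + ' GB/month, ~$' + Math.round(gbPerMonth * pricePerGB));
// → roughly 726 GB/month, ~$65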

@ionelmc

ionelmc commented Aug 19, 2015

Sorry for heating up the discussion here. I'm pretty sure any solution costs money and time. What I wanted to say is that it's hard to sustain, assuming this service doesn't make any money.

@Daniel15
Member

For the first 10 TB, AWS charges $0.09 per GB of outbound transfer, which would come to about $65. You think that's a lot of money?

Plus $19.04 for an actual server (for comparison, I selected a t2.small in US East). That brings us to roughly US$84 for 725 GB of transfer plus one single server. For $84 per month you could pretty easily find a high-end server with unmetered bandwidth (or alternatively you could have several servers and load-balance them).

The issue with EC2 is if someone DDoSes your site, you need to pay for it (literally, due to the increased bandwidth usage).

Sorry for heating up the discussion here. I'm pretty sure any solution costs money and time. What I wanted to say is that it's hard to sustain, assuming this service doesn't make any money.

For sure. Someone will have to pay the hosting costs at some point, whether that be out of someone's pocket or through sponsorship :) Maybe some sort of sponsorship deal could be good, like how CDN providers sponsor jsdelivr and cdnjs.

@bogdanRada
Contributor

seems it is down again :(

@jakirkham

Looks ok to me.

@jakirkham

The images seem to have gone out again.

@espadrine
Member

The current status is that we have switched servers. Things are faster now, but potentially brittle, as I improve the tools I use to manage the server.

@jakirkham

Ah, OK. That seemed to be the thing to do. FWIW, it appears to be back up now.

@anko

anko commented Sep 18, 2015

All images dead all day today for me.

@espadrine Anything we could do to help? Donations? Known code hot-spots?

@ForbesLindesay
Contributor Author

I can confirm they're all down currently. If I fund a second server (on DigitalOcean), would you be willing to add it to Cloudflare's DNS? That way requests would get automatically load-balanced between the two servers. We could then use https://runbook.io to perform automatic failover by removing IP addresses from the DNS when they fail.

@Daniel15
Member

Failover by modifying DNS has a significant performance penalty, as you need to set the TTL of the DNS entries to something very low, which means that DNS servers won't be able to properly cache any of the lookups. It's better to use a proper load balancer (which could be an inexpensive VPS: load balancers don't use much CPU or memory, just bandwidth).
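To illustrate, a load balancer can be as small as a round-robin reverse proxy. This is only a sketch; the backend addresses are placeholders:

// Hedged sketch of a round-robin reverse proxy in plain Node.
const http = require('http');

const backends = [
  { host: '10.0.0.1', port: 8080 },   // placeholder badge server 1
  { host: '10.0.0.2', port: 8080 },   // placeholder badge server 2
];
let next = 0;

http.createServer((clientReq, clientRes) => {
  const backend = backends[next++ % backends.length];
  const proxyReq = http.request({
    host: backend.host,
    port: backend.port,
    method: clientReq.method,
    path: clientReq.url,
    headers: clientReq.headers,
  }, (proxyRes) => {
    clientRes.writeHead(proxyRes.statusCode, proxyRes.headers);
    proxyRes.pipe(clientRes);
  });
  proxyReq.on('error', () => {
    clientRes.writeHead(502);
    clientRes.end('Bad gateway');
  });
  clientReq.pipe(proxyReq);
}).listen(80);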


@espadrine
Member

@ForbesLindesay I'm digging into what might be the cause. Apart from the crashing issue, the server's CPU seems happy.

/var/log/kern.log excerpt:

Sep 18 07:07:53 vps197850 kernel: [254101.816478] Node 0 DMA32 free:44612kB min:44696kB low:55868kB high:67044kB active_anon:1893312kB inactive_anon:5140kB active_file:56kB inactive_file:8kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2031608kB managed:1993648kB mlocked:0kB dirty:0kB writeback:0kB mapped:2072kB shmem:5288kB slab_reclaimable:10428kB slab_unreclaimable:14172kB kernel_stack:1312kB pagetables:8544kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:116 all_unreclaimable? yes
Sep 18 07:07:53 vps197850 kernel: [254101.816482] lowmem_reserve[]: 0 0 0 0
Sep 18 07:07:53 vps197850 kernel: [254101.816483] Node 0 DMA: 14*4kB (UEM) 4*8kB (UE) 1*16kB (M) 5*32kB (UEM) 1*64kB (U) 3*128kB (UE) 1*256kB (E) 2*512kB (EM) 2*1024kB (UE) 2*2048kB (ER) 0*4096kB = 8136kB
Sep 18 07:07:53 vps197850 kernel: [254101.816494] Node 0 DMA32: 1811*4kB (UEM) 459*8kB (UEM) 118*16kB (UE) 82*32kB (UEM) 52*64kB (UEM) 54*128kB (UEM) 26*256kB (UE) 12*512kB (UM) 2*1024kB (U) 0*2048kB 1*4096kB (R) = 44612kB
Sep 18 07:07:53 vps197850 kernel: [254101.816507] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Sep 18 07:07:53 vps197850 kernel: [254101.816508] 1339 total pagecache pages
Sep 18 07:07:53 vps197850 kernel: [254101.816512] 0 pages in swap cache
Sep 18 07:07:53 vps197850 kernel: [254101.816513] Swap cache stats: add 0, delete 0, find 0/0
Sep 18 07:07:53 vps197850 kernel: [254101.816514] Free swap  = 0kB
Sep 18 07:07:53 vps197850 kernel: [254101.816515] Total swap = 0kB
Sep 18 07:07:53 vps197850 kernel: [254101.816516] 511900 pages RAM
Sep 18 07:07:53 vps197850 kernel: [254101.816516] 0 pages HighMem/MovableOnly
Sep 18 07:07:53 vps197850 kernel: [254101.816517] 9490 pages reserved
Sep 18 07:07:53 vps197850 kernel: [254101.816517] 0 pages hwpoisoned
Sep 18 07:07:53 vps197850 kernel: [254101.816518] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Sep 18 07:07:53 vps197850 kernel: [254101.816522] [  138]     0   138    10197      137      20        0         -1000 systemd-udevd
Sep 18 07:07:53 vps197850 kernel: [254101.816524] [  140]     0   140     7448      589      18        0             0 systemd-journal
Sep 18 07:07:53 vps197850 kernel: [254101.816525] [  282]     0   282     6347     1724      15        0             0 dhclient
Sep 18 07:07:53 vps197850 kernel: [254101.816527] [  306]     0   306     6857       64      19        0             0 cron
Sep 18 07:07:53 vps197850 kernel: [254101.816528] [  307]     0   307    13791      170      31        0         -1000 sshd
Sep 18 07:07:53 vps197850 kernel: [254101.816529] [  309]     0   309     7065       81      18        0             0 systemd-logind
Sep 18 07:07:53 vps197850 kernel: [254101.816531] [  310]   104   310    10529      104      27        0          -900 dbus-daemon
Sep 18 07:07:53 vps197850 kernel: [254101.816532] [  312]     0   312    64665      237      29        0             0 rsyslogd
Sep 18 07:07:53 vps197850 kernel: [254101.816534] [  325]     0   325     3209       39      12        0             0 agetty
Sep 18 07:07:53 vps197850 kernel: [254101.816535] [  326]     0   326     3548       44      12        0             0 agetty
Sep 18 07:07:53 vps197850 kernel: [254101.816536] [  336]     0   336     6769      106      16        0             0 systemd
Sep 18 07:07:53 vps197850 kernel: [254101.816537] [  337]     0   337    12446      368      25        0             0 (sd-pam)
Sep 18 07:07:53 vps197850 kernel: [254101.816539] [  359]     0   359   160321     5616      54        0             0 node
Sep 18 07:07:53 vps197850 kernel: [254101.816540] [  370]     0   370   159041     5616      52        0             0 node
Sep 18 07:07:53 vps197850 kernel: [254101.816542] [ 4670]     0  4670   384097   160740     624        0             0 node
Sep 18 07:07:53 vps197850 kernel: [254101.816543] [ 5532]     0  5532   482556   259305     982        0             0 node
Sep 18 07:07:53 vps197850 kernel: [254101.816544] [ 6608]     0  6608   326546    21330      89        0             0 phantomjs
Sep 18 07:07:53 vps197850 kernel: [254101.816545] [ 6609]     0  6609   326402    19252      82        0             0 phantomjs
Sep 18 07:07:53 vps197850 kernel: [254101.816547] Out of memory: Kill process 5532 (node) score 502 or sacrifice child
Sep 18 07:07:53 vps197850 kernel: [254101.817845] Killed process 5532 (node) total-vm:1930224kB, anon-rss:1037220kB, file-rss:0kB
Sep 18 17:32:38 vps197850 kernel: [    0.000000] Initializing cgroup subsys cpuset

phantomjs seems to get OOM-killed often. Is it the cause? That seems to happen quite often within normal operation. I don't understand the whole picture yet.

(Side note: the current gratipay donations could cover the expenses of an additional server)

@espadrine espadrine mentioned this issue Sep 18, 2015
@joepie91

If you are consistently running into OOM issues, then it's possible that the OOM killer misfires and murders a critical system process, thus causing the crash.

I'm curious, though - how much RAM does the server have, and how much does Node use? Unless there's something I'm missing, this shouldn't require particularly many resources. What's PhantomJS used for, for example?

@ForbesLindesay
Contributor Author

PhantomJS is used for converting an SVG to a PNG (https://github.com/badges/shields/blob/5ffc6328ce1d54034e97fec6d815e08234db2739/svg-to-img.js).

Incidentally, since most people should probably be using SVG images anyway, how about moving the SVG-to-PNG converter onto a separate server, which we would POST some SVG to and receive a PNG back from? That way the SVG-to-PNG function could fail independently of the rest of the application.
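A rough sketch of what that could look like; svg2png() is a hypothetical stand-in for whatever does the actual rasterising (e.g. the existing PhantomJS script):

// Hedged sketch of a standalone SVG-to-PNG service: POST an SVG body, get a PNG back.
const http = require('http');

function svg2png(svgBuffer, callback) {
  // Placeholder: a real implementation would rasterise the SVG here.
  callback(null, svgBuffer);
}

http.createServer((req, res) => {
  if (req.method !== 'POST') { res.writeHead(405); return res.end(); }
  const chunks = [];
  req.on('data', (chunk) => chunks.push(chunk));
  req.on('end', () => {
    svg2png(Buffer.concat(chunks), (err, png) => {
      if (err) { res.writeHead(500); return res.end(); }
      res.writeHead(200, { 'Content-Type': 'image/png' });
      res.end(png);
    });
  });
}).listen(9000);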

@espadrine
Member

PhantomJS is used for those users who want PNG images. The SVG gets converted to PNG through a PhantomJS script: https://github.com/badges/shields/blob/master/phantomjs-svg2png.js

The server has 2 GB of RAM (and a 10 GB SSD which goes largely unused for now). I use an in-RAM LRU cache, which might take more space than necessary: https://github.com/badges/shields/blob/master/server.js#L160

@ForbesLindesay
Contributor Author

Looking through the code, it doesn't look like the generated PNGs etc. are cached? I would expect that the most common badges are requested extremely frequently; could we cache the PNGs of those?

@espadrine
Member

Are they not? There seems to be a 1000-sized LRU cache here:

var imgCache = new LruCache(1000);

@ForbesLindesay
Contributor Author

I somehow missed that, disregard my previous comment :)

@ForbesLindesay
Contributor Author

One thought: given that we are buffering up the entire image anyway, it might be more efficient to just send the buffer to the response rather than creating a stream from the buffer when we have a cache hit.
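Roughly this, with hypothetical names rather than the actual server.js code:

// Hypothetical sketch: on a cache hit, write the cached buffer straight to the
// response instead of constructing a stream around it first.
function sendCachedBadge(res, cachedBuffer, contentType) {
  res.writeHead(200, {
    'Content-Type': contentType,
    'Content-Length': cachedBuffer.length,
  });
  res.end(cachedBuffer); // no intermediate Readable stream needed
}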

@Daniel15
Member

I'd suggest using something like supervisord to monitor processes and restart the service if it is killed (e.g. due to OOM). If you're OOMing constantly, then you probably want more RAM (or another server) :D

For the PNGs, rather than caching images in RAM yourself, consider writing the PNG files to the file system and letting the operating system's cache handle them. Better to use built-in caching mechanisms rather than rolling your own, especially when they're built for this specific purpose (caching of files in RAM) and properly handle evicting items from the cache when RAM is low :). In that case, are you using something like Nginx in front of Node? Using the X-Accel-Redirect header to serve a file from the file system will likely have some perf gains compared to writing the raw bytes yourself.
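As a sketch of the X-Accel-Redirect idea from the Node side (the paths and location name are made up, and nginx would need a matching internal location):

// Hypothetical sketch: instead of streaming the PNG from Node, hand the
// already-written file off to nginx. Assumes an nginx block like:
//   location /protected-badges/ { internal; alias /var/cache/badges/; }
function serveCachedPng(res, fileName) {
  res.writeHead(200, {
    'Content-Type': 'image/png',
    // nginx intercepts this header and serves the file itself.
    'X-Accel-Redirect': '/protected-badges/' + fileName,
  });
  res.end();
}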

@espadrine
Member

It doesn't feel like phantomjs is taking much memory at all; maybe it is simply the straw that breaks the camel's back. It is a whole browser engine, after all.

I am using forever in front of node to restart it when dead. The OOM isn't really constant, just regular — something like once every 5h.

I wonder if I am leaking, though. Maybe the assumptions of my cache are wrong. Maybe I should try changing the LRU cache to be outrageously small, just to see if it changes anything.

@joepie91

Why not just use Inkscape or ImageMagick for rendering the SVGs? That would seem a lot lighter on resources than loading an entire browser context.

I concur with @Daniel15 on the suggestion to just let the OS handle the caching, or at most use something like Redis. There's certainly no reason why you'd need 2 GB of RAM for a service like this.
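For illustration, shelling out to ImageMagick from Node could look something like this (just a sketch; it assumes the convert binary is installed):

// Hedged sketch: rasterise an SVG with ImageMagick's convert instead of
// spawning a full headless-browser process.
const execFile = require('child_process').execFile;

function svgToPng(svgPath, pngPath, callback) {
  execFile('convert', ['-background', 'none', svgPath, pngPath], function (err) {
    callback(err);
  });
}

// Usage:
// svgToPng('badge.svg', 'badge.png', function (err) { if (err) console.error(err); });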

@espadrine
Member

The use of phantomjs guaranteed a correct result. I had issues in the past with changes to the SVG template sneakily producing incorrect PNG output because of Inkscape-specific or (worse) ImageMagick-specific quirks.

But as I said, I doubt phantomjs is taking all that much memory, especially considering it dies after each use. The fact that a process spawns at all, and becomes the one that hits the limit, is probably why OOM wants to put the blame on it.

I have tried restarting the server with a cache that is a lot smaller. I'll see if it changes anything.

@espadrine
Member

Since I killed the HTTP server, things are much smoother. Now, all HTTP traffic is redirected to the HTTPS server.
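For reference, the redirect itself is tiny; a sketch (not necessarily how the production server does it):

// Hedged sketch of an HTTP-to-HTTPS redirect.
const http = require('http');

http.createServer((req, res) => {
  res.writeHead(301, { Location: 'https://img.shields.io' + req.url });
  res.end();
}).listen(80);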
