Backup makes 100% CPU usage on server #1409

eumanito · 2021-09-24T12:05:00Z

When the backup starts, the AWS EC2 server goes to 100% CPU usage and then crashes.

I am using https://tenancyforlaravel.com/ with 10 tenants. A backup is run for each tenant.

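// Run a database-only backup for every tenant, one after another.
$tenants = Tenant::all();
foreach ($tenants as $tenant) {
    // Point the "central" connection at this tenant's database and reconnect.
    Config::set('database.connections.central.database', $tenant->id);
    DB::reconnect('central');
    // Give each tenant's backup its own name so the archives stay separate.
    Config::set('backup.backup.name', 'backup-' . $tenant->id);
    $this->call('backup:clean');
    $this->call('backup:run', ['--only-db']);
}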

Has anyone had this problem?

eumanito (Author) · 2021-09-24T19:19:19Z

Hi @freekmurze, why did you close this issue?

1 reply

freekmurze Sep 24, 2021
Maintainer

I simply moved it to a discussion, which is more appropriate for this question.

WaveHack · 2021-10-19T09:07:10Z

I've also been consistently running into this issue for a few months now, with the added problem of the whole website going down for a few minutes, supposedly due to 100% CPU usage. I believe this is related, so I'm posting it in this discussion thread. I can also provide some additional information about my particular issue.

tl;dr: I don't have a resolution for my problem (yet), but I wanted to share my findings since I found this open discussion. I hope this helps someone (including future me) in resolving their problem.

Update: We identified the cause for our problem, and it wasn't related to this package. See my comments down below for more context. I'm leaving this up in case it helps someone in identifying 100% CPU load issues on their end.

Scenario

At work we manage a Laravel (v6) application for one of our clients, which includes this package spatie/laravel-backup (v6.16.0). backup:clean is scheduled to be run daily at 01:00, with backup:run daily at 02:17.

Its backups are approx. 5.5 GB in size.

Problem

Almost every night we get a message in Slack from our UptimeRobot integration that the website briefly goes down. This consistently happens a few minutes into backup:run, and only lasts for 3 to 4 minutes on average.

Observations

Aside from the points above (it happens almost every night, always during backup:run, and the downtime lasts 3-4 minutes on average), I observed the following when running backups manually (php artisan backup:run):

First the database dump gets created, fully utilizing a single CPU core.

This runs for a few minutes. During this time the website works as normal. This aligns with UptimeRobot reporting the site as down not immediately when the backup starts, but a few minutes into it.

Afterwards during the "determining files to backup"-step, all CPU cores briefly spike up towards 80% usage, before going down during the "zipping X files and directories..."-step.

Manually firing backup:run during the day does not seem to reproduce this issue for my scenario, which makes it a hassle to investigate.

Things I've already looked into and managed to exclude as a cause:

Backup storage size

Increasing the MaximumStorageInMegabytes health check from 20GB to 40GB in config/backup.php, and changing 'delete_oldest_backups_when_using_more_megabytes_than' from 5GB to 20GB seemed to work for a bit.

This was changed a few months ago, and it seemed to help for about a month with no nightly downtime. Afterwards the nightly downtime returned, and it has persisted ever since.

Application error logs

No application errors in the Laravel logs related to the backups were found.

For full disclosure there ARE a LOT of other errors in the logs due to legacy and shoddy application code. But all of them are in controllers or deeper in the domain.

From what I can find, there are no unusual errors during the general application lifecycle, such as broken code in service providers.

Nginx error logs

While the server is down, there are a lot of "resource temporarily unavailable" errors:

2021/10/19 00:19:33 [error] 822#822: *79889 connect() to unix:/var/run/php/php7.4-fpm.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: 127.0.0.1, server: -snip-, request: "GET /-snip- HTTP/1.1", upstream: "fastcgi://unix:/var/run/php/php7.4-fpm.sock:", host: "-snip-"

What's curious is that there are a LOT (~700) of these errors in the span of approx. two seconds for a lot of different URLs of our application, and they all seem to originate from 127.0.0.1.

In addition, up to a minute afterwards there's a load of "upstream timed out" errors from different clients (nightly visitors like bots and scrapers):

2021/10/19 00:20:15 [error] 822#822: *78334 upstream timed out (110: Connection timed out) while reading response header from upstream, client: -snip-, server: -snip-, request: "GET /-snip- HTTP/1.1", upstream: "fastcgi://unix:/var/run/php/php7.4-fpm.sock", host: "-snip-"

This continues for approx. 5 minutes, until PHP-FPM manages to handle the requests coming in from Nginx again.

Note: The time seems off by two hours, but this is due to timezones (CEST / UTC+2).
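
To quantify bursts like the ~700 errors in about two seconds described above, the error log can be grouped per second and per client. The following is a minimal sketch in plain PHP, assuming the log lines have the same layout as the two samples quoted above. The function name countErrorsPerSecond is hypothetical and not part of Nginx, Laravel, or spatie/laravel-backup.

```php
<?php

/**
 * Counts Nginx [error] lines per second and per client address.
 * Hypothetical helper; assumes lines shaped like the samples above:
 * "YYYY/MM/DD HH:MM:SS [error] ... client: ADDRESS, ..."
 *
 * @param iterable<string> $lines
 * @return array<string, array<string, int>> timestamp => [client => count]
 */
function countErrorsPerSecond(iterable $lines): array
{
    $pattern = '~^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[error\].*?client: ([^,]+),~';
    $counts = [];

    foreach ($lines as $line) {
        if (preg_match($pattern, $line, $m) !== 1) {
            continue; // skip lines that are not error entries with a client
        }
        [, $second, $client] = $m;
        $counts[$second][$client] = ($counts[$second][$client] ?? 0) + 1;
    }

    ksort($counts); // chronological order, since the timestamp sorts lexically
    return $counts;
}
```

It could be fed with something like file('/var/log/nginx/error.log', FILE_IGNORE_NEW_LINES), where the path is an assumption about the server layout. A second with hundreds of entries attributed to 127.0.0.1 would suggest the requests are generated on the server itself (or arrive through a local proxy) rather than by external visitors.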

Other things we've tried

Increasing the number of PHP-FPM www pool process workers (pm.workers, pm.max_children)
Increasing the number of UNIX socket connections on the host
Upgrading the server from 2 CPUs and 4 GB RAM to 4 CPUs and 8 GB RAM

Conclusion

It looks like a probable main cause of the nightly downtime is a LOT of simultaneous requests going to PHP-FPM through Nginx while the backup is running (as indicated by the "Resource temporarily unavailable" errors in the Nginx logs).

I need to do further investigation as to where these come from, and if they're related to this backup package.

3 replies

erikn69 Oct 19, 2021

What is the filesystem of your backup disk? S3? What package do you use for filesystem uploads?

WaveHack Oct 19, 2021

We're using the local filesystem for our "backups". So our backups go in the storage directory. On the same server.

(I'm fully aware of the implications of such a setup. Despite my repeated counseling I have not been given the authority by the necessary parties to change this. At least not at this particular moment in time.)

I currently have little reason to believe this might cause downtime due to CPU-related problems, since the last step is primarily disk I/O. But with everything we've investigated so far I guess it can't hurt to try if our current list of "maybe this works"-tasks runs dry. If anything it might be the right incentive to push towards external backups. 😃

WaveHack Oct 22, 2021

I believe we managed to resolve our issue. It wasn't related to this package, after all.

Our 100% CPU load with nightly downtime issue was caused by other still-running console commands. In our case this was sitemap generation (using spatie/laravel-sitemap), which was taking 3 hours (much longer than anticipated) and hogging the CPU to 100% (or 400% with 4 cores).

While the sitemap was still being generated, the console kernel would also fire the command to start a new backup. This was too much load for the server, effectively bringing it down, albeit briefly.

We've since increased the time allocated for generating the sitemap before the backup starts (4 hours 30 minutes), which seems to solve our case. At least for now.

This is also confirmed when I look at the server stats. The only period which is causing 100% (400%) CPU load is the sitemap generation, not creating a backup.
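
The overlap described above can be checked with a little date arithmetic before deciding on schedule times. This is only a sketch in plain PHP: the function name jobsOverlap is hypothetical, and apart from the 02:17 backup start and the roughly 3-hour sitemap runtime mentioned in this thread, every value (the sitemap start times and the backup runtime) is an assumption made for illustration.

```php
<?php

/**
 * Returns true when two scheduled jobs would be running at the same time.
 * Each job is described by its start time and expected runtime in minutes.
 */
function jobsOverlap(
    DateTimeImmutable $startA,
    int $minutesA,
    DateTimeImmutable $startB,
    int $minutesB
): bool {
    $endA = $startA->modify("+{$minutesA} minutes");
    $endB = $startB->modify("+{$minutesB} minutes");

    // Two half-open intervals overlap when each starts before the other ends.
    return $startA < $endB && $startB < $endA;
}

// backup:run at 02:17 is taken from the thread; the runtime is assumed.
$backupStart = new DateTimeImmutable('2021-10-19 02:17');
$backupMins  = 15;
$sitemapMins = 180; // "3 hours" as reported in the thread

// Assumed: the sitemap job starts at midnight, so it is still running at 02:17.
$before = jobsOverlap(new DateTimeImmutable('2021-10-19 00:00'), $sitemapMins, $backupStart, $backupMins);

// Assumed: the sitemap job starts 4 h 30 min before the backup and finishes in time.
$after = jobsOverlap($backupStart->modify('-270 minutes'), $sitemapMins, $backupStart, $backupMins);

var_dump($before, $after); // bool(true), bool(false)
```

A fixed gap only holds as long as the first job stays within its expected runtime, which is why the fix above is described as working "at least for now".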
