Replies: 2 comments 4 replies
-
Hi @freekmurze, why did you close this issue?
-
I've also been running into this issue consistently for a few months now, with the added problem of the whole website going down for a few minutes, supposedly due to 100% CPU usage. I believe this is related, so I'm placing it in this discussion thread. I'm also able to provide some additional information for my particular issue.

tl;dr: I don't have a resolution for my problem (yet), but I wanted to share my findings since I found this open discussion. I hope this helps someone (including future me) in resolving their problem.

Update: We identified the cause of our problem, and it wasn't related to this package. See my comments down below for more context. I'm leaving this up in case it helps someone identify 100% CPU load issues on their end.

Scenario

At work we manage a Laravel (v6) application for one of our clients, which includes this package, spatie/laravel-backup (v6.16.0). Its backups are approx. 5.5 GB in size.

Problem

Almost every night we get a message in Slack from our UptimeRobot integration that the website briefly goes down. This consistently happens a few minutes into

Observations

Aside from the aforementioned things (it happens almost every night, it always happens during backup:run, downtime lasts 3-4 mins on average), I concluded the following:

When I run backups manually (

First the database dump gets created, fully utilizing a single CPU core:

This runs for a few minutes. During this time the website works as normal. This matches what UptimeRobot reports: the server doesn't go down immediately when the backup starts, but a few minutes into it.

Afterwards, during the "determining files to backup" step, all CPU cores briefly spike towards 80% usage, before going down during the "zipping X files and directories..." step.
Manually firing

Things I've already looked into and managed to exclude as a cause:

Backup storage size

Increasing the MaximumStorageInMegabytes health check from 20GB to 40GB in config/backup.php, and changing 'delete_oldest_backups_when_using_more_megabytes_than' from 5GB to 20GB, seemed to work for a bit. We made this change a few months ago, and it seemed to help for about a month with no nightly downtime. After that the nightly downtime returned, and it has stayed ever since.

Application error logs

No application errors related to the backups were found in the Laravel logs. For full disclosure, there ARE a LOT of other errors in the logs due to legacy and shoddy application code, but all of them are in controllers or deeper in the domain. From what I can find, there are no weird errors during the generic application lifecycle, like broken code in service providers.

Nginx error logs

During the time the server is down, there are a lot of "resource temporarily unavailable" errors:
What's curious is that there are a LOT (~700) of these errors in the span of approx. two seconds, for many different URLs of our application, and they all seem to originate from 127.0.0.1. In addition, up to a minute afterwards there is a load of "upstream timed out" errors from different clients (nightly visitors like bots and scrapers):

This continues for approx. 5 minutes, until the PHP-FPM process manages to handle the requests coming in from Nginx again. Note: The time seems off by two hours, but this is due to timezones (CEST / UTC+2).

Other things we've tried
Conclusion

It looks like a probable main cause of the nightly downtime is a LOT of simultaneous requests going to PHP-FPM through Nginx while the backup is running (as indicated by the "Resource temporarily unavailable" errors in the Nginx logs). I need to investigate further where these requests come from, and whether they're related to this backup package.
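For anyone who wants to do the same kind of digging in their own Nginx error log, here is a minimal sketch (not the exact tooling we used) that counts matching errors per second and per client address. It assumes the standard Nginx error-log line layout, i.e. a `YYYY/MM/DD HH:MM:SS [level]` prefix and a `client: <ip>` field. The sample lines, the socket path, the host name and the `summarize` helper are all made up for illustration.

```python
import re
import sys
from collections import Counter

# Assumed layout (standard Nginx error log): the line starts with
# "YYYY/MM/DD HH:MM:SS [level]" and contains a "client: <ip>" field later on.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[\w+\] "
    r".*?client: (?P<client>[^,\s]+)"
)

# Made-up sample lines for illustration only; socket path, host and
# timestamps are hypothetical and not taken from our real logs.
SAMPLE_LINES = [
    '2021/06/01 02:05:13 [error] 1234#1234: *1 connect() to unix:/run/php/php-fpm.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: 127.0.0.1, server: example.test, request: "GET /a HTTP/1.1"',
    '2021/06/01 02:05:13 [error] 1234#1234: *2 connect() to unix:/run/php/php-fpm.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: 127.0.0.1, server: example.test, request: "GET /b HTTP/1.1"',
    '2021/06/01 02:05:14 [error] 1234#1234: *3 connect() to unix:/run/php/php-fpm.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: 127.0.0.1, server: example.test, request: "GET /c HTTP/1.1"',
    '2021/06/01 02:06:01 [error] 1234#1234: *4 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 203.0.113.7, server: example.test, request: "GET /d HTTP/1.1"',
]


def summarize(lines, needle="Resource temporarily unavailable"):
    """Count lines containing `needle` (case-insensitive) per second and per client."""
    per_second = Counter()
    per_client = Counter()
    needle = needle.lower()
    for line in lines:
        if needle not in line.lower():
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that do not follow the assumed layout
        per_second[match.group("ts")] += 1
        per_client[match.group("client")] += 1
    return per_second, per_client


if __name__ == "__main__":
    # Pass a log path as the first argument (for example /var/log/nginx/error.log);
    # without an argument the made-up sample lines are used.
    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as handle:
            log_lines = list(handle)
    else:
        log_lines = SAMPLE_LINES
    per_second, per_client = summarize(log_lines)
    print("Busiest seconds:", per_second.most_common(5))
    print("Top clients:", per_client.most_common(5))
```

If nearly all hits fall within one or two seconds and come from 127.0.0.1, as in our case, that suggests the requests are generated on the server itself rather than by outside visitors. Calling `summarize(log_lines, needle="upstream timed out")` gives the same breakdown for the timeout errors.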
-
When the backup starts, the AWS EC2 server goes to 100% usage and then crashes.
I am using https://tenancyforlaravel.com/ with 10 tenants. A backup is run for each tenant.
Has anyone had this problem?