Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add flag to disable writing failure files in batch daemon #3832

Closed
wants to merge 1 commit into from

Conversation

flexgrip
Copy link

Hello all 👋

This is just a quick little patch that I think might help with GAE users. Let me explain.

The problem
The moment I added google/cloud to my Laravel + GAE Standard project, I noticed I was getting critical log entries and failed requests at regular intervals in production. The errors read like this:
Exceeded soft memory limit of 512 MB with 577 MB after servicing 89 requests total.

I hunted for memory leaks in my code, but could not reproduce any. Then one day I remembered that GAE standard has what is basically a tmpfs (ram disk) in the /tmp directory. I thought, "Maybe there are some files being written in there." Well... for once I was right! I made a little function that would display all of the files in the /tmp, along with their size (since you can't really access the file system of gae standard instances). To my surprise, there were actually some files from the batch daemon that were filling up the ram on my gae standard instances. They looked like this:

/tmp/batch-daemon-failure/failed-items.1 - 100MB
/tmp/batch-daemon-failure/failed-items.2 - 73MB
etc...

After an instance was started, these files would steadily grow until they occupied about 75% of my instance's memory. And then poof, the instance would run out of memory and die. This explains the random failures and 500's I was getting.

A possible fix
In this PR, I am basically doing the same thing I had to fork this package for. I add an env var that lets me just skip the fwrite operation that writes these failed-items files. So they never even get written. Tests still pass.

I know what you might be thinking. "Ray, there's an env var that let's you change the directory that these failed-items files are created." You are right! But on GAE standard, there is only one writable directory. If I tell BatchDaemon to write the files elsewhere, it throws exceptions. So this PR let's you just stop these files from being written.

Further discussion
The silly part to this whole story is that I still don't quite know what the batch-runner is doing for us in my app. We don't use it. The only reason it is installed is because another package we use had google/cloud as a dep in composer. I actually tried downloading one of the failed-items files and it was unreadable binary looking data. Maybe what I was actually fixing here was a symptom and perhaps a configuration change in my app would keep these failed-items from being written to begin with. But the important part for me is figuring out what data they hold. When I say these files filled up quick, I don't mean a few kilobytes per hour. I mean, I would hit refresh and see three of them grow by 2-3MB. So each instance would only last for 15-30 minutes before running out of ram.

I am very open to suggestion and being wrong about my assumptions in this case. Just let me know what you think. This PR is very similar. Looks related, just in a different trait.

#2336

Thanks and Happy Valentines Day!

@flexgrip flexgrip requested a review from a team as a code owner February 15, 2021 06:01
@google-cla google-cla bot added the cla: yes This human has signed the Contributor License Agreement. label Feb 15, 2021
@dwsupplee dwsupplee added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Mar 22, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Mar 22, 2021
@bshaffer
Copy link
Contributor

bshaffer commented Dec 30, 2021

@flexgrip hello and thank you for your contribution. Taking a look at this, I think what's more compelling / concerning to me is the fact that your failure files are filling up so fast. it's also strange that the contents are binary, since in the code it should be a serialized PHP object. I would recommend diagnosing what is causing all the failures first, and trying to fix the issue there.

It would be better to have a fail-safe to ensure these files don't fill up so fast. Or as you said, a configuration or flag to turn it off. Your solution of using an environment variable (GOOGLE_CLOUD_BATCH_DAEMON_DISABLE_FAILURE_FILES) might be the right way to do this, but I'd like to look into other methods as well.

Thanks and Happy New Year!

@bshaffer
Copy link
Contributor

bshaffer commented Jan 5, 2022

A better solution may be to accept false as a valid string for GOOGLE_CLOUD_BATCH_DAEMON_FAILURE_DIR, in which case the failures aren't logged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes This human has signed the Contributor License Agreement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants