Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should retry on ECONNRESET #2254

Closed
inlined opened this issue Apr 25, 2017 · 21 comments
Closed

Should retry on ECONNRESET #2254

inlined opened this issue Apr 25, 2017 · 21 comments
Assignees

Comments

@inlined
Copy link

inlined commented Apr 25, 2017

edit by @stephenplusplus

If you're just joining the conversation...

Here is where we stand: We have disabled the forever agent by default in Cloud Function environments. If you un- and re-install @google-cloud/storage, you will pick up the new behavior automatically.

Original Post

Environment details

  • OS: Google Cloud Functions (not an OS. I know)
  • Node.js version: LTS
  • npm version: LTS
  • google-cloud-node version: HEAD

Steps to reproduce

  1. git clone [email protected]:firebase/functions-samples.git
  2. cd generate-thumbnail
  3. firebase init
  4. firebase deploy
  5. play with app for a while and observe that it gets into a mode that always fails with ECONNRESET

Theory

I've noticed this in my own Cloud Functions and only in those that use google-cloud-node. I think the socket is dying between requests when the Cloud Functions environment is throttled. This should be OK, but I suspect the google-cloud-node module needs to add an extra error case here. I'll write a pull request to that effect.

inlined added a commit to inlined/google-cloud-node that referenced this issue Apr 25, 2017
Response to googleapis#2254

Should hopefully catch the issue customers are seeing in Google Cloud Functions where google APIs get into a permanent failed state.
@stephenplusplus
Copy link
Contributor

Continuing from #2255 (comment) (cc @s2young)

I'm not able to get the process mentioned in this issue to fail. Does anyone have a fool proof way I can get this script or another to break? I've been uploading 24 images at a time and the thumbnails are all generated successfully.

@s2young
Copy link

s2young commented May 4, 2017

My app is doing quite a bit more than the example, and I'm stripping some of this out to see if I can avoid the problem. But all is just standard Storage api stuff:

  1. Download the source file to tmp directory.
  2. Call getMetadata (so I can get the byte size)- this I'm stripping out.
  3. Resize the file (3 different sizes for different screens).
  4. Upload the resized images to Storage.
  5. Delete the source file from Cloud Storage.
  6. Update firebase real-time db with info about the resized images.

The application is a mobile app, with a content management piece that allows users to upload photos when building a layout. Images are resized for display on Tablets, Large/Small phones, etc.
@stephenplusplus

@s2young
Copy link

s2young commented May 5, 2017

One more caveat @stephenplusplus: If you follow the pattern I've given, make sure you ignore the upload of the copies. In my case, when I upload a resized copy I am adding some metadata to the image so that the imageListener function can ignore those storage events. I got myself into an infinite loop early on!

@stephenplusplus
Copy link
Contributor

Haha, thanks for the tip! I was able to reproduce the error (ECONNRESET and ESOCKETTIMEDOUT) but it was only when I was deleting the file before it was done being used. I'm not sure why that triggered those errors, but that's the only thing I can pin-point as the cause. I'll continue to play around.

Here's my script-- https://gist.github.com/stephenplusplus/1ac55732adadc1562737cf903307a98d. It takes an upload and creates 10 thumbnails. I'm testing with 25 at a time, and I'm consistently ending up with the correct total of images in the bucket, 250.

@inlined
Copy link
Author

inlined commented May 5, 2017

My theory theory is this happens between periods of inactivity with the Cloud Function instance is throttled. I think there's a cached socket that dies--possibly from being too CPU starved to acknowledge any keep-alive fast enough. Try the same script now (hopefully your instance is still alive) or a 20m break after that attempt. Once you see it you should see the error be pretty fatal until a redeploy (which forces a reallocation of your Cloud Function instance)

@s2young
Copy link

s2young commented May 5, 2017

Another update, in case it helps. I have another Firebase Function that is listening to the RealTime DB for changes and then deleting the related image from Google Cloud Storage. Basically, in my CMS when a user deletes an image he's deleting the data representation of it in Firebase DB. The listener then deletes the actual image from Storage.

I am experiencing the ECONNRESET error in this db listener function. The result is I have orphaned images that are attached to object ids that no longer exist in my db.

@stephenplusplus
Copy link
Contributor

@inlined I ran the script successfully, waited 20 minutes, and it ran successfully again without errors.

Would anyone be able to give me a repo I can clone, deploy, and then steps I can follow to reproduce the error?

@s2young
Copy link

s2young commented May 5, 2017

Hey all, I forked your gist and add a couple of things to hopefully help - though I didn't test this so be warned!

  1. Added a step to the thumbnail sequence that writes a node to the real-time db referencing the image's path.
  2. Added a firebase db listener that is triggered when you delete the node created in step 1. It then removes the file from Cloud Storage (or at least is supposed to!).

My recommendation is run the script, wait a few minutes, and then manually delete some of the nodes created in the real-time db and see if the ECONNRESET happens. I hope this helps @stephenplusplus !

https://gist.github.com/s2young/068f082fd30d17f0a1f2146d953106de

@s2young
Copy link

s2young commented May 5, 2017

Hey @stephenplusplus I've added you as a collaborator to my 'app-server' project. You should be able to deploy this to your firebase project, and then locally run 'jasmine' command to trigger the /spec/dev/mage.spec.js test. In this test's current state, I'm seeing tons of ECONNRESETs in my Firebase Functions logs. I hope it helps.

https://github.com/GobaLLC/app-server/invitations

@s2young
Copy link

s2young commented May 5, 2017

Also, if it helps, I noticed a big jump in ECONNRESETs when I used a singleton instance of a google-cloud/storage object rather than one instance per function.

@s2young
Copy link

s2young commented May 5, 2017

And yet another thing: I just updated my firebase-tools, and now the firebase.config().firebase call no longer works, so the code that names the bucket is broken. You'll need to tweak that.

Do you know how to programmatically grab the current-context projectId from either firebase or googe-cloud objects? It seems to me that if I'm authed on my local machine my sdk should be able to know the context it's running in. Or do I have to have a config file?? Thanks @stephenplusplus

@stephenplusplus
Copy link
Contributor

I did some digging and found a post on StackOverflow from someone who ran into this (cc @DrPaulBrewer). He opened an issue on the official issue tracker, "cloud storage frequently throws ECONNRESET, read or write, from cloud functions environment." It seems the community has found a couple solutions:

@inlined @DrPaulBrewer @hayanmind can you check your apps using the newest @google-cloud/storage official module? Theoretically, these issues should be fixed, since we released the logic that retries on ECONNRESET errors yesterday. Also, any other information that could help us fix this once and for all would be very welcome.

As a side note, you should be able to set forever: false without forking:

var gcs = require('@google-cloud/storage')(...)
gcs.interceptors.push({
  request: function(reqOpts) {
    reqOpts.forever = false
    return reqOpts
  }
})

@s2young thank you for that gist -- I set it up and ran it. I've been doing this for the last several hours, waiting anywhere from 3, 5, 10, 25, and 60 minutes between events and monitoring the logs. Still, everything is working consistently for me.

Do you know how to programmatically grab the current-context projectId from either firebase or googe-cloud objects?

The libraries that I help maintain are the @google-cloud/* ones, so in this case @google-cloud/storage. With that library, if we're not given a projectId argument explicitly, we will use the getProjectId() function from google-auto-auth internally:

var authClient = require('google-auto-auth')()
authClient.getProjectId(function(err, projectId) {})

That will pick it up from the GCLOUD_PROJECT environment variable, the gcloud SDK, a service account JSON key file, and Google Cloud environments (App Engine, Compute Engine, Cloud Functions).

@bogacg
Copy link

bogacg commented May 6, 2017

@s2young I'm doing exact same steps (except augmenting metadata) in my app, and it ate my brain for a week thinking why using async/await causes this problem. After changing back my code to promises I kept getting error again and then I've figured it wasn't my code but Cloud Functions'.

@stephenplusplus as I stated above, my code is doing exactly same things w. @s2young's and after seeing your reply, I've deleted node_modules, npm install, recompiled (typescript/webpack) and deployed my functions. Not getting errors so far.
I did get error again (after few tries within 5-7 minutes). So I'll try forever: false solution now.

However I didn't see any version number change in google-cloud-node installed from last time I was getting errors (yesterday), so is the change you made internal?

Update

After forever: false applied, I didn't get any errors, will update here if anything happens.

@stephenplusplus
Copy link
Contributor

However I didn't see any version number change in google-cloud-node installed from last time I was getting errors (yesterday), so is the change you made internal?

Yes, I added the retry solution in a dependency, retry-request, so new and re-installations of @google-cloud/storage will pick up the change. The all-in-one package, google-cloud will not have it until our next release.

Thanks for trying these solutions out. Please keep letting us know the results you get.

@stephenplusplus
Copy link
Contributor

If everyone can confirm:

  • forever: false is the best solution, and
  • retrying when we get a socket hangup error does not work sufficiently,

we can automatically set forever to false when we see that we're in the Cloud Functions environment.

@s2young
Copy link

s2young commented May 8, 2017

So far, that is true for me. I haven't experienced any further ECONNRESETs since making the forever:false change. Thanks so much!

@stephenplusplus
Copy link
Contributor

We have disabled the forever agent by default in Cloud Function environments. If you un- and re-install @google-cloud/storage, you will pick up the new behavior automatically. Thanks for all of the helpful debugging, everyone!

@DrPaulBrewer
Copy link

DrPaulBrewer commented May 9, 2017

@stephenplusplus Thanks for your efforts. Other tasks here took priority. I updated my SO posting to reflect the above changes but have not had time to test.

@bobocandys
Copy link

I am seeing ECONNRESET in my Google Cloud Functions now. I am using @google-cloud/bigquery: 1.3.0, @google-cloud/storage: 1.7.0

Unfortunately, even explicitly setting the reqOpts.forever = false does not solve the issue for me. Any new things changed this behavior?

@atlanteh
Copy link

Could this be due to the fact that network and CPU are throttled down to 0 between function invocations, so the connection is lost on the new invocation?

@timscottbell
Copy link

timscottbell commented Sep 10, 2022

I get this error 100% of the time when trying to upload files in parallel. Adding

storage.interceptors.push({
  request(reqOpts: DecorateRequestOptions) {
    reqOpts.forever = false;
    return reqOpts;
  },
});

did nothing to resolve the issue. Our use case is using the client library to upload files in parallel, a fairly straightforward use case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants