
Retry after HttpError code 400 #290

Closed · JackKelly opened this issue Sep 30, 2020 · 17 comments · Fixed by #380

JackKelly commented Sep 30, 2020

Google Cloud Storage occasionally throws an HTTP Error 400 (nominally a 'bad request'; see the Google Cloud docs on HTTP response 400). But this happens on requests that have worked before and that work again after retrying. I've seen these spurious HTTP Error 400s when calling gcs.du and when using dask to read data from Google Cloud.

The error message from GCP is: Error 400 (Bad Request)! That's an error. Your client has issued a malformed or illegal request. That's all we know.

Monkey-patching gcsfs.utils.is_retriable fixes the issue for me:

import gcsfs
import gcsfs.utils

# Override is_retriable.  Google Cloud sometimes throws an HttpError with
# code 400.  gcsfs considers this not retriable, but in practice it is!

def is_retriable(exception):
    """Returns True if this exception is retriable."""
    errs = list(range(500, 505)) + [
        # Jack's addition: Google Cloud occasionally throws Bad Requests
        # for no apparent reason.
        400,
        # Request Timeout
        408,
        # Too Many Requests
        429,
    ]
    errs += [str(e) for e in errs]
    if isinstance(exception, gcsfs.utils.HttpError):
        return exception.code in errs

    return isinstance(exception, gcsfs.utils.RETRIABLE_EXCEPTIONS)

# Monkey-patch the function gcsfs consults when deciding whether to retry a call.
gcsfs.utils.is_retriable = is_retriable

In a perfect world, I guess the best solution would be to ask Google Cloud to not throw spurious HTTP Error 400s. But perhaps a pragmatic approach is to modify gcsfs to retry after HTTP Error 400s :)

Environment:

  • Dask version: 2.28.0
  • Python version: 3.8.5
  • Operating System: Ubuntu 20.04 on a Google Cloud VM
  • Install method: conda
martindurant (Member) commented

Sorry for the late reply.

Indeed, if a 400 is thrown because of an intermittent condition, it would be OK to include it in the list of retriable exceptions. However, the code and the message indicate the opposite, so I'm not certain: should we not believe Google, and trust that (eventually) they get their error codes right?


tswast commented Oct 27, 2020

This is a tough one. In some situations it occurs when the multi-part upload session gets corrupted.

You can't retry the request as-is when this error is thrown, but you can restart the upload from the beginning.
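A minimal sketch of that "restart from the beginning" approach (illustrative only; the paths and bucket name are made up, and this is not gcsfs's own retry logic): rather than retrying the failed request, re-run the whole upload when a 400 surfaces.

import gcsfs
from gcsfs.utils import HttpError

def upload_with_restart(local_path, remote_path, attempts=3):
    fs = gcsfs.GCSFileSystem()
    for attempt in range(attempts):
        try:
            # Re-issue the entire upload, not just the failed request.
            fs.put(local_path, remote_path)
            return
        except HttpError as e:
            # Only restart on the spurious 400s; re-raise anything else,
            # and give up after the last attempt.
            if getattr(e, "code", None) != 400 or attempt == attempts - 1:
                raise

upload_with_restart("data.nc", "gs://my-bucket/data.nc")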


tswast commented Oct 27, 2020

It sounds like you get it from reads too, though?


tswast commented Oct 27, 2020

CC @frankyn

martindurant (Member) commented

> In some situations it occurs when the multi-part upload session gets corrupted.

Any idea why?

If it's not on the individual call, but on the whole process, then it wouldn't fall into the normal retry logic - you would need your own retry logic at a higher level (e.g., dask distributed has task-wise retries).
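For example, here is a rough sketch of such higher-level, task-wise retries with the distributed scheduler (illustrative only: the reader function and the gs:// path are made up):

from dask.distributed import Client

def flaky_read(path):
    # Stand-in for a task that reads from GCS and occasionally hits a spurious 400.
    import fsspec
    with fsspec.open(path, "rb") as f:
        return len(f.read())

client = Client()  # local cluster; point it at a real scheduler in practice

# retries=3 asks the scheduler to re-run the task if it raises, so one
# transient failure does not kill the whole computation.
future = client.submit(flaky_read, "gs://my-bucket/some-object", retries=3)
print(future.result())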


tswast commented Oct 27, 2020

I'm trying to find the thread where I read that. I see another reference to 400 errors here: googleapis/google-resumable-media-python#115, but that one appears to have a different root cause (probably a bad API rollout).


tswast commented Oct 27, 2020

Found the thread (internal Google issue 128935544). It's actually 410 (and 404) errors that need to start from the beginning. Documented here: https://cloud.google.com/bigquery/docs/loading-data-local#best-practices


frankyn commented Oct 27, 2020

Hi folks,

HTTP 400 is not considered retriable. I read through the background internal issue, but I'm setting it aside here because it concerns 410/404 errors (please correct me if it's still related).

I'm not convinced that the 400 is a service issue and would like to validate in the backend (if possible).

Do y'all have request logs with body payload for the transient 400 errors?

If not: if this occurred in the last two weeks, could you send me an email at [email protected] with the following:

  1. Project ID
  2. Bucket name
  3. Time frame of the 400 failure

Thank you for your patience.

JackKelly (Author) commented

Thanks loads for the replies! I don't have logs to hand right now, but if I come across this problem again I'll be sure to follow up here with more details (including logs and details of the project).

oliverwm1 (Contributor) commented

Possibly related: I recently had gcsfs return an HttpError in a long-running process that had previously completed successfully, which seems like a service issue to me. Here is the logging I have available; unfortunately I'm not sure of the specific HttpError code. @frankyn, I can email you those specifics if it'd be helpful. Thanks!

gcsfs version: v0.7.1
fsspec version: v0.8.4

Traceback (most recent call last):
  File "/usr/local/bin/append.py", line 225, in append_zarr_along_time
    zarr.consolidate_metadata(fsspec.get_mapper(target_path))
  File "/usr/lib/python3/dist-packages/zarr/convenience.py", line 1119, in consolidate_metadata
    'metadata': {
  File "/usr/lib/python3/dist-packages/zarr/convenience.py", line 1120, in <dictcomp>
    key: json_loads(store[key])
  File "/usr/local/lib/python3.8/dist-packages/fsspec/mapping.py", line 132, in __getitem__
    result = self.fs.cat(k)
  File "/usr/local/lib/python3.8/dist-packages/fsspec/asyn.py", line 230, in cat
    raise ex
  File "/usr/local/lib/python3.8/dist-packages/gcsfs/core.py", line 826, in _cat_file
    headers, out = await self._call("GET", u2)
  File "/usr/local/lib/python3.8/dist-packages/gcsfs/core.py", line 525, in _call
    raise e
  File "/usr/local/lib/python3.8/dist-packages/gcsfs/core.py", line 507, in _call
    self.validate_response(status, contents, json, path, headers)
  File "/usr/local/lib/python3.8/dist-packages/gcsfs/core.py", line 1230, in validate_response
    raise HttpError({"code": status})


frankyn commented Nov 2, 2020

Hi @oliverwm1, thanks for raising this.

Unfortunately, without the status code and a request log it will be difficult to pin down what went wrong. Do you have more context on the error, such as the use case?

oliverwm1 (Contributor) commented

@frankyn Thanks for the quick reply, I'll see if I can track down more specifics. The use case is a long-running workflow on GKE that:

  1. generates data
  2. appends it to a zarr-store on GCS (we use gsutil cp for this step and have not had any issues)
  3. calls zarr.consolidate_metadata on that zarr-store (uses gcsfs to read/write many small metadata files)
  4. then repeats.

The error happens rarely, maybe one in a thousand calls to consolidate_metadata, as a very rough estimate.

Based on where the error is raised, I can at least rule out error codes 403, 404, 502, and anything below 400.


frankyn commented Nov 2, 2020

Thanks for the additional context @oliverwm1.

I'm not familiar with the retry strategy in this library, but for someone who is: do you think this is more likely related to a timeout?

github-actions bot pushed a commit to ai2cm/fv3net that referenced this issue Nov 3, 2020
Consolidating metadata of zarr stores is slow and sometimes unreliable (see fsspec/gcsfs#290). This uses @nbren12's custom multi-threaded metadata consolidation script in the run-fv3gfs appending step, in hopes of faster and more reliable performance.

Significant internal changes:
- use custom multi-threaded zarr metadata consolidation script
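For context, the idea behind a multi-threaded consolidation step looks roughly like the sketch below (this is not the actual fv3net script; the function name and default thread count are made up). It reads the many small zarr metadata objects concurrently and then writes a ".zmetadata" key in zarr's consolidated format.

import json
from concurrent.futures import ThreadPoolExecutor

import fsspec

def consolidate_metadata_threaded(url, max_workers=16):
    mapper = fsspec.get_mapper(url)
    # Only the small JSON metadata objects need to be read.
    meta_keys = [
        key for key in mapper
        if key.rsplit("/", 1)[-1] in (".zarray", ".zgroup", ".zattrs")
    ]

    def fetch(key):
        return key, json.loads(mapper[key])

    # Fetch the metadata objects in parallel threads instead of one by one.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        metadata = dict(pool.map(fetch, meta_keys))

    mapper[".zmetadata"] = json.dumps(
        {"zarr_consolidated_format": 1, "metadata": metadata}
    ).encode()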

nbren12 commented Apr 28, 2021

I am also running into this issue, and am able to reliably reproduce it, but not in a very shareable way (see ai2cm/fv3net#1162). Some context:

  • The problem occurs in gcsfs v0.6.1-0.8.0.
  • It occurs when using .to_zarr on a dask-backed xarray object; this object has a fairly involved dask graph that involves loading data from GCS. A minimal sketch of the call pattern is below.
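A hypothetical, minimal form of that call pattern (not the actual fv3net workload; the bucket name is made up):

import dask.array as da
import fsspec
import xarray as xr

# A dask-backed dataset standing in for the real, more involved graph.
ds = xr.Dataset(
    {"air": (("time", "x"), da.random.random((1000, 1000), chunks=(100, 1000)))}
)
ds.to_zarr(fsspec.get_mapper("gs://my-bucket/example.zarr"), mode="w")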

martindurant (Member) commented

Do you mind trying with v2021.04.0?


nbren12 commented Apr 28, 2021

Sure. I unfortunately get the same error on v2021.04.0. I've been trying to build a more minimal example, but it's hard to do so.

nbren12 added a commit to ai2cm/gcsfs that referenced this issue Apr 28, 2021
The kwargs were being overwritten by ._get_args. If any retries are
needed, then on the second iteration datain will always be 'None'. This
will cause uploading steps to fail with error code 400.

This commit fixes this bug by refactoring the _call method to minimize
the number of variables in scope.

Resolves fsspec#290
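A hypothetical illustration of the failure mode described in that commit message (simplified names, not the actual gcsfs code): when the payload is popped out of the keyword arguments inside the retry loop, every attempt after the first sends an empty body, and the server answers 400.

import asyncio

async def fake_server(data):
    # Pretend the first attempt hits a transient error; later attempts see
    # data=None and get a 400, mirroring the spurious errors reported above.
    if data is None:
        return 400
    raise TimeoutError("transient failure, please retry")

async def call_with_retries(request, retries=3, **kwargs):
    for attempt in range(retries):
        data = kwargs.pop("data", None)  # BUG: destroys the payload for retries
        try:
            return await request(data)
        except TimeoutError:
            if attempt == retries - 1:
                raise

print(asyncio.run(call_with_retries(fake_server, data=b"payload")))  # prints 400

As the commit message describes, the fix refactors _call so that the caller's arguments are not overwritten between attempts, and each retry sends the same payload as the first try.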

nbren12 commented Apr 28, 2021

I think I fixed it! It was a tricky bug to find.

martindurant pushed a commit that referenced this issue Apr 28, 2021
* fix 400 errors on retries in _call

The kwargs were being overwritten by ._get_args. If any retries are
needed, then on the second iteration datain will always be 'None'. This
will cause uploading steps to fail with error code 400.

This commit fixes this bug by refactoring the _call method to minimize
the number of variables in scope.

Resolves #290

* rename to params for clarity