Retry after HttpError code 400 #290
Comments
Sorry for the late reply. Indeed, if a 400 is thrown due to a transient condition, it would be OK to include it in the list of retriable exceptions. However, the code and the message indicate the opposite, so I'm not certain: should we not believe Google, and trust that (eventually) they get their error codes right?
This is a tough one. In some situations it occurs when the multi-part upload session gets corrupted. You can't retry the request as-is when this error is thrown, but you can restart the upload from the beginning.
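As a rough illustration of that restart-from-the-beginning approach (not part of the library): the helper name and bucket path below are hypothetical, and the sketch assumes `gcsfs.utils.HttpError` exposes the HTTP status in a `code` attribute (the class lives in `gcsfs.retry` in newer versions).

```python
import gcsfs
from gcsfs.utils import HttpError  # location varies by gcsfs version

def upload_with_restart(fs, path, data, attempts=3):
    # Re-run the whole upload from scratch; retrying the failed request
    # alone won't help once the multi-part session is corrupted.
    for attempt in range(attempts):
        try:
            fs.pipe_file(path, data)  # a fresh upload session each time
            return
        except HttpError as exc:
            if getattr(exc, "code", None) != 400 or attempt == attempts - 1:
                raise

fs = gcsfs.GCSFileSystem()
# upload_with_restart(fs, "my-bucket/data.bin", b"payload")  # hypothetical path
```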
Sounds like you get it from reads too, though?
CC @frankyn
Any idea why? If it's not on the individual call but on the whole process, then it wouldn't fall into the normal retry logic; you would need your own retry logic at a higher level (e.g., dask distributed has task-wise retries).
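For illustration, dask's distributed scheduler can re-run a failed task via the `retries` argument of `Client.submit`; the `read_blob` helper and bucket path below are hypothetical.

```python
from dask.distributed import Client

import gcsfs

def read_blob(path):
    # May occasionally raise a transient HttpError from GCS.
    fs = gcsfs.GCSFileSystem()
    return fs.cat(path)

client = Client()  # assumes a local or already-running cluster
# retries=3 tells the scheduler to re-run the whole task if it raises,
# i.e. retry logic at the task level rather than per HTTP call.
future = client.submit(read_blob, "my-bucket/some/object.bin", retries=3)
# data = future.result()
```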
I'm trying to find the thread where I read that. I see another reference to 400 errors here: googleapis/google-resumable-media-python#115, but that one appears to have a different root cause (probably due to a bad API rollout).
Found the thread (internal Google issue 128935544). It's actually 410 (and 404) errors that need to start from the beginning. Documented here: https://cloud.google.com/bigquery/docs/loading-data-local#best-practices |
Hi folks, HTTP 400 is not considered retriable, but I read through the background internal issue (which I'm setting aside because it's about 410/404 errors -- please correct me if it's still related). I'm not convinced that the 400 is a service issue and would like to validate it in the backend (if possible). Do y'all have request logs with body payloads for the transient 400 errors? If not, and this occurred in the last two weeks, could you send me an email at [email protected] with the following:
Thank you for your patience.
Thanks loads for the replies! I don't have logs to hand right now, but if I come across this problem again I'll be sure to follow up here with more details (including logs and details of the project).
Possibly related: I recently had gcsfs return an HttpError on a long-running process that had previously completed successfully, which seems like a service issue to me. Here is the logging I have available. Unfortunately I'm not sure of the specific HttpError code. @frankyn, I can email you those specifics if it'd be helpful. Thanks! gcsfs version: v0.7.1
Hi @oliverwm1, thanks for raising this. Unfortunately, without the status code and a request log it will be difficult to pin down what went wrong. Do you have more context on the error, such as the use case?
@frankyn Thanks for the quick reply; I'll see if I can track down more specifics. The use case is a long-running workflow on GKE that:
The error happens rarely, maybe one in a thousand calls. Based on where the error is raised, we can at least rule out error codes 403, 404, 502, or anything below 400.
Thanks for the additional context, @oliverwm1. I'm not familiar with the retry strategy in this library, but for someone who is: do you think this is more related to a timeout?
Consolidating metadata of zarr stores is slow and sometimes unreliable (see fsspec/gcsfs#290). This uses @nbren12's custom multi-threaded metadata consolidation script in the run-fv3gfs appending step, in hopes of faster and more reliable performance. Significant internal changes:
- use custom multi-threaded zarr metadata consolidation script
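The referenced script isn't reproduced in this thread; as a generic sketch of the idea, a multi-threaded consolidation might fetch the many small metadata objects concurrently and write a single `.zmetadata` key. The `consolidate_threaded` helper, store path, and layout assumptions below are illustrative, not the actual script.

```python
import json
from concurrent.futures import ThreadPoolExecutor

import gcsfs

def consolidate_threaded(fs, root, workers=16):
    # Fetch every .zarray/.zgroup/.zattrs object concurrently instead of
    # one request at a time, then write a single consolidated .zmetadata key.
    keys = [k for k in fs.find(root)
            if k.rsplit("/", 1)[-1] in (".zarray", ".zgroup", ".zattrs")]
    with ThreadPoolExecutor(workers) as pool:
        blobs = list(pool.map(fs.cat, keys))
    meta = {k[len(root) + 1:]: json.loads(b) for k, b in zip(keys, blobs)}
    payload = {"metadata": meta, "zarr_consolidated_format": 1}
    fs.pipe_file(f"{root}/.zmetadata", json.dumps(payload).encode())

# fs = gcsfs.GCSFileSystem()
# consolidate_threaded(fs, "my-bucket/store.zarr")  # hypothetical store path
```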
I am also running into this issue, and am able to reliably reproduce it, but not in a very shareable way (see ai2cm/fv3net#1162). Some context:
Do you mind trying with v2021.04.0?
Sure. I unfortunately get the same error on v2021.04.0. I've been trying to build a more minimal example, but it's hard to do so.
The kwargs were being overwritten by `._get_args`. If any retries are needed, then on the second iteration `datain` will always be `None`. This causes uploading steps to fail with HTTP error 400. This commit fixes the bug by refactoring the `_call` method to minimize the number of variables in scope. Resolves fsspec#290
I think I fixed it! It was a tricky bug to find.
* fix 400 errors on retries in `_call`: the kwargs were being overwritten by `._get_args`. If any retries are needed, then on the second iteration `datain` will always be `None`, causing uploading steps to fail with HTTP error 400. This commit fixes the bug by refactoring the `_call` method to minimize the number of variables in scope. Resolves #290
* rename to `params` for clarity
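A schematic reconstruction of the failure mode described in that commit message (not the actual gcsfs code; the `prepare`, `call_buggy`, and `call_fixed` names are illustrative): argument preparation inside the retry loop consumes the payload, so every retry uploads `None`.

```python
def prepare(kwargs):
    # Stand-in for ._get_args: pops the upload body out of kwargs.
    return kwargs.pop("datain", None)

def call_buggy(send, retries=3, **kwargs):
    for attempt in range(retries):
        datain = prepare(kwargs)      # second iteration: datain is now None
        try:
            return send(datain)
        except IOError:
            continue                  # retry sends None -> server answers 400
    raise IOError("all retries failed")

def call_fixed(send, retries=3, **kwargs):
    datain = prepare(dict(kwargs))    # prepared once, before the retry loop
    for attempt in range(retries):
        try:
            return send(datain)       # every attempt sends the same body
        except IOError:
            continue
    raise IOError("all retries failed")
```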
Google Cloud Storage occasionally throws an HTTP Error 400 (which is nominally a 'bad request'; see the Google Cloud docs on HTTP response 400). But this happens on requests that have worked previously and work again after retrying. I've seen these spurious HTTP Error 400s when calling `gcs.du` and when using `dask` to read data from Google Cloud. The error message from GCP is:

Error 400 (Bad Request)! That's an error. Your client has issued a malformed or illegal request. That's all we know.
Monkey-patching `gcsfs.utils.is_retriable` fixes the issue for me.
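A minimal sketch of such a patch, assuming `gcsfs.utils.HttpError` carries the HTTP status in a `code` attribute (newer gcsfs versions keep these helpers in `gcsfs.retry` instead), might look like this:

```python
import gcsfs.core
import gcsfs.utils

_original_is_retriable = gcsfs.utils.is_retriable

def is_retriable_with_400(exception):
    # Treat spurious 400s as retriable, on top of gcsfs's defaults.
    if isinstance(exception, gcsfs.utils.HttpError) and getattr(exception, "code", None) == 400:
        return True
    return _original_is_retriable(exception)

# Patch the name both where it is defined and where gcsfs may have
# imported it directly (depends on the gcsfs version).
gcsfs.utils.is_retriable = is_retriable_with_400
if hasattr(gcsfs.core, "is_retriable"):
    gcsfs.core.is_retriable = is_retriable_with_400
```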
In a perfect world, I guess the best solution would be to ask Google Cloud not to throw spurious HTTP Error 400s. But perhaps a pragmatic approach is to modify `gcsfs` to retry after HTTP Error 400s :)

Environment: