Retry after HttpError code 400 #290
Comments
Sorry for the late reply. Indeed, if a 400 is thrown due to a transient condition, it would be OK to include it in the list of retriable exceptions. However, the code and the message indicate the opposite, so I'm not certain: should we not believe Google, and trust that (eventually) they get their error codes right?
This is a tough one. In some situations it occurs when the multi-part upload session gets corrupted. You can't retry the request as-is when this error is thrown, but you can restart the upload from the beginning.
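As a rough illustration of that restart-from-the-beginning approach (not part of the library): the helper name and bucket path below are hypothetical, and the sketch assumes `gcsfs.utils.HttpError` exposes the HTTP status in a `code` attribute (the class lives in `gcsfs.retry` in newer versions).

```python
import gcsfs
from gcsfs.utils import HttpError  # location varies by gcsfs version

def upload_with_restart(fs, path, data, attempts=3):
    # Re-run the whole upload from scratch; retrying the failed request
    # alone won't help once the multi-part session is corrupted.
    for attempt in range(attempts):
        try:
            fs.pipe_file(path, data)  # a fresh upload session each time
            return
        except HttpError as exc:
            if getattr(exc, "code", None) != 400 or attempt == attempts - 1:
                raise

fs = gcsfs.GCSFileSystem()
# upload_with_restart(fs, "my-bucket/data.bin", b"payload")  # hypothetical path
```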
Sounds like you get it from reads too, though?
CC @frankyn
Any idea why? If it's not on the individual call but on the whole process, then it wouldn't fall into the normal retry logic; you would need your own retry logic at a higher level (e.g., dask distributed has task-wise retries).
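For illustration, dask's distributed scheduler can re-run a failed task via the `retries` argument of `Client.submit`; the `read_blob` helper and bucket path below are hypothetical.

```python
from dask.distributed import Client

import gcsfs

def read_blob(path):
    # May occasionally raise a transient HttpError from GCS.
    fs = gcsfs.GCSFileSystem()
    return fs.cat(path)

client = Client()  # assumes a local or already-running cluster
# retries=3 tells the scheduler to re-run the whole task if it raises,
# i.e. retry logic at the task level rather than per HTTP call.
future = client.submit(read_blob, "my-bucket/some/object.bin", retries=3)
# data = future.result()
```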
I'm trying to find the thread where I read that. I see another reference to 400 errors here: googleapis/google-resumable-media-python#115, but that one appears to have a different root cause (probably due to a bad API rollout).
Found the thread (internal Google issue 128935544). It's actually 410 (and 404) errors that need to start from the beginning. Documented here: https://cloud.google.com/bigquery/docs/loading-data-local#best-practices |
Hi folks, HTTP 400 is not considered retriable, but I read through the background internal issue (which I'm setting aside because it's about 410/404 errors -- please correct me if it's still related). I'm not convinced that the 400 is a service issue and would like to validate it in the backend (if possible). Do y'all have request logs with body payloads for the transient 400 errors? If not, and this occurred in the last two weeks, could you send me an email at [email protected] with the following:
Thank you for your patience.
Thanks loads for the replies! I don't have logs to hand right now, but if I come across this problem again I'll be sure to follow up here with more details (including logs and details of the project).
Possibly related: I recently had gcsfs return an HttpError on a long-running process that had previously completed successfully, which seems like a service issue to me. Here is the logging I have available. Unfortunately I'm not sure of the specific HttpError code. @frankyn, I can email you those specifics if it'd be helpful. Thanks! gcsfs version: v0.7.1
Hi @oliverwm1, thanks for raising this. Unfortunately, without the status code and a request log it will be difficult to pin down what went wrong. Do you have more context on the error, such as the use case?
@frankyn Thanks for the quick reply; I'll see if I can track down more specifics. The use case is a long-running workflow on GKE that:
The error happens rarely, maybe one in a thousand calls. Based on where the error is raised, we can at least rule out error codes 403, 404, 502, or anything below 400.
Thanks for the additional context, @oliverwm1. I'm not familiar with the retry strategy in this library, but for someone who is: do you think this is more related to a timeout?
Consolidating metadata of zarr stores is slow and sometimes unreliable (see fsspec/gcsfs#290). This uses @nbren12's custom multi-threaded metadata consolidation script in the run-fv3gfs appending step, in hopes of faster and more reliable performance. Significant internal changes:
- use custom multi-threaded zarr metadata consolidation script
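The referenced script isn't reproduced in this thread; as a generic sketch of the idea, a multi-threaded consolidation might fetch the many small metadata objects concurrently and write a single `.zmetadata` key. The `consolidate_threaded` helper, store path, and layout assumptions below are illustrative, not the actual script.

```python
import json
from concurrent.futures import ThreadPoolExecutor

import gcsfs

def consolidate_threaded(fs, root, workers=16):
    # Fetch every .zarray/.zgroup/.zattrs object concurrently instead of
    # one request at a time, then write a single consolidated .zmetadata key.
    keys = [k for k in fs.find(root)
            if k.rsplit("/", 1)[-1] in (".zarray", ".zgroup", ".zattrs")]
    with ThreadPoolExecutor(workers) as pool:
        blobs = list(pool.map(fs.cat, keys))
    meta = {k[len(root) + 1:]: json.loads(b) for k, b in zip(keys, blobs)}
    payload = {"metadata": meta, "zarr_consolidated_format": 1}
    fs.pipe_file(f"{root}/.zmetadata", json.dumps(payload).encode())

# fs = gcsfs.GCSFileSystem()
# consolidate_threaded(fs, "my-bucket/store.zarr")  # hypothetical store path
```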
I am also running into this issue, and am able to reliably reproduce it, but not in a very shareable way (see ai2cm/fv3net#1162). Some context:
Do you mind trying with v2021.04.0?
Sure. I unfortunately get the same error on v2021.04.0. I've been trying to build a more minimal example, but it's hard to do so.
The kwargs were being overwritten by `._get_args`. If any retries are needed, then on the second iteration `datain` will always be `None`. This causes uploading steps to fail with HTTP error 400. This commit fixes the bug by refactoring the `_call` method to minimize the number of variables in scope. Resolves fsspec#290
I think I fixed it! It was a tricky bug to find.
* fix 400 errors on retries in `_call`: the kwargs were being overwritten by `._get_args`. If any retries are needed, then on the second iteration `datain` will always be `None`, causing uploading steps to fail with HTTP error 400. This commit fixes the bug by refactoring the `_call` method to minimize the number of variables in scope. Resolves #290
* rename to `params` for clarity
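A schematic reconstruction of the failure mode described in that commit message (not the actual gcsfs code; the `prepare`, `call_buggy`, and `call_fixed` names are illustrative): argument preparation inside the retry loop consumes the payload, so every retry uploads `None`.

```python
def prepare(kwargs):
    # Stand-in for ._get_args: pops the upload body out of kwargs.
    return kwargs.pop("datain", None)

def call_buggy(send, retries=3, **kwargs):
    for attempt in range(retries):
        datain = prepare(kwargs)      # second iteration: datain is now None
        try:
            return send(datain)
        except IOError:
            continue                  # retry sends None -> server answers 400
    raise IOError("all retries failed")

def call_fixed(send, retries=3, **kwargs):
    datain = prepare(dict(kwargs))    # prepared once, before the retry loop
    for attempt in range(retries):
        try:
            return send(datain)       # every attempt sends the same body
        except IOError:
            continue
    raise IOError("all retries failed")
```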
Google Cloud Storage occasionally throws an HTTP Error 400 (which is nominally a 'bad request'; see the Google Cloud docs on HTTP response 400). But this happens on requests that have worked previously and work again after retrying. I've seen these spurious HTTP Error 400s when calling `gcs.du` and when using `dask` to read data from Google Cloud. The error message from GCP is:

Error 400 (Bad Request)! That's an error. Your client has issued a malformed or illegal request. That's all we know.
Monkey-patching `gcsfs.utils.is_retriable` fixes the issue for me.
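A minimal sketch of such a patch, assuming `gcsfs.utils.HttpError` carries the HTTP status in a `code` attribute (newer gcsfs versions keep these helpers in `gcsfs.retry` instead), might look like this:

```python
import gcsfs.core
import gcsfs.utils

_original_is_retriable = gcsfs.utils.is_retriable

def is_retriable_with_400(exception):
    # Treat spurious 400s as retriable, on top of gcsfs's defaults.
    if isinstance(exception, gcsfs.utils.HttpError) and getattr(exception, "code", None) == 400:
        return True
    return _original_is_retriable(exception)

# Patch the name both where it is defined and where gcsfs may have
# imported it directly (depends on the gcsfs version).
gcsfs.utils.is_retriable = is_retriable_with_400
if hasattr(gcsfs.core, "is_retriable"):
    gcsfs.core.is_retriable = is_retriable_with_400
```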
In a perfect world, I guess the best solution would be to ask Google Cloud not to throw spurious HTTP Error 400s. But perhaps a pragmatic approach is to modify `gcsfs` to retry after HTTP Error 400s :)

Environment: