Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: attempt to only remember received content when the header is within success range by http spec #7970

Merged
merged 9 commits into from
Jan 10, 2025

Conversation

indreka
Copy link
Contributor

@indreka indreka commented Dec 31, 2024

Remembering "We encountered an internal error. Please try again." for http response 500, 503 etc will result in data corruption otherwise.

@indreka indreka requested review from a team as code owners December 31, 2024 07:33
Copy link

google-cla bot commented Dec 31, 2024

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@indreka indreka changed the title attempt to only remember received content when the header is within success range by http spec fix: attempt to only remember received content when the header is within success range by http spec Dec 31, 2024
@bshaffer
Copy link
Contributor

bshaffer commented Jan 6, 2025

Could @vishwarajanand, @JesseLovelace or someone on the Storage team confirm that this is what we want?

@bshaffer
Copy link
Contributor

bshaffer commented Jan 7, 2025

@indreka one question I have is - if the Request has a status code between 200 and 300, would it even throw an exception? I assume not, and so I think this logic may be too restrictive.

@indreka
Copy link
Contributor Author

indreka commented Jan 7, 2025

@bshaffer
You are incorrect about the exception part.
If connection error happens in middle of read, things get strange/interesting.

I did a quick local test where in apache I ran a script that did:

echo "test sleep\n";
flush();
sleep(60);
echo 'followup message after 60s';
flush();
exit;

Then I did a sample script I executed from command line:

  $client = new GuzzleHttp\Client();
  try {
    $res = $client->request('GET', 'http://localhost/sleepmessage.php', [
      \GuzzleHttp\RequestOptions::SYNCHRONOUS => true,
      \GuzzleHttp\RequestOptions::CONNECT_TIMEOUT => 1.0,
      \GuzzleHttp\RequestOptions::TIMEOUT => 70.0,
      \GuzzleHttp\RequestOptions::READ_TIMEOUT => 70.0,
    ]);
    echo "status={$res->getStatusCode()} body={$res->getBody()}\n";
  } catch (Exception $e) {
    echo "exception happened: {$e->getMessage()}\n";
    if ($e instanceof GuzzleHttp\Exception\RequestException && $e->hasResponse()) {
      $res = $e->getResponse();
      echo "status={$res->getStatusCode()} body={$res->getBody()}\n";
    } else {
      echo "no response in exception\n";
    }
  }
  exit;

Then I triggered the command line script and in the middle of sleep, I used kill -9 to "surprise-kill" apache instance and got this result.
exception happened: cURL error 18: transfer closed with outstanding read data remaining (see https://curl.haxx.se/libcurl/c/libcurl-errors.html) for http://localhost/sleepmessage.php status=200 body=test sleep

So response status is indeed 200 while body contains partial message and exception is thrown because curl detected network-level issue.

Using direct curl_init + curl_exec the output is different from guzzle, body is not returned from curl_exec, meaning Guzzle does something different (did not dig too deep there):
status=200 errNo=18 errMsg=transfer closed with outstanding read data remaining body=

I am not versatile with the conformance test system used in this project so maybe someone can hint how to do this sort of network interrupt test case (if it is possible at all).
As much as I understood from current conformance tests, that try to show "retry works well after error 503", they are bit simplistic, only check function calls happening with correct params but nothing checks the actual file content being returned being correct.

@bshaffer
Copy link
Contributor

bshaffer commented Jan 7, 2025

Wow @indreka thank you for your very thorough investigation! This does seem to indicate that the fix you've provided will stil allow for chunks to be added as expected for certain types of exceptions.

I'd still like to confirm with the Google Cloud Storage team (and/or compare with other Google Cloud Storage client libraries) that these are the status codes we want to select for before merging, but I agree that this seems like the correct solution. Thanks again!

@bshaffer
Copy link
Contributor

bshaffer commented Jan 7, 2025

@indreka as you've mentioned, it's hard to see what the current Conformance tests are actually testing. I've commented out that block of code entirely, and all the conformance tests still pass.

I am going to wait to hear from @frankyn or @saranshdhingra to see if they can give me more context as to what that block of code is supposed to be doing, and if it's being tested anywhere.

@indreka
Copy link
Contributor Author

indreka commented Jan 8, 2025

I can guess the intent of the retry based on the code: if 1GB download gets network error at 950MB then it is much more efficient to only ask the last 50MB than the full 1GB (n-th 1GB retry may encounter error again etc). Two-fold result: allow graceful resuming for intermittent errors and avoid excessive bandwidth usage.

The apache surprise-kill example proved that Guzzle throws exception with status 200 and partial body so for pretty cases the code works quite well.

I will check if I manage to add simple unittest (not conformance case) to cover this fix.

@indreka
Copy link
Contributor Author

indreka commented Jan 8, 2025

Added a unittest that shows how the partial resuming is expected to work and shows that when statuscode check blindly accepts error message as content, retry request tries to do partial request that skips bytes from beginning and ends up corrupting the data.

Co-authored-by: Brent Shaffer <[email protected]>
@indreka
Copy link
Contributor Author

indreka commented Jan 9, 2025

Any other feedback or comments/suggestions?
The testcase I provided shows quite clearly, what happens if server returns error 503 with body. Just comment out the change in Rest.php and tests start failing.
At least my mind this sort of silent data corruption is very bad thing and could/should be expedited?

@bshaffer
Copy link
Contributor

@indreka I agree the data corruption is a pretty bad issue with this behavior. We have a release later today, and I will work to get confirmation from the Storage team so that we can merge by then

Copy link
Contributor

@vishwarajanand vishwarajanand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I believe the fix is indeed correct and follows from other Storage libraries. For example in Python lib, (although a little different context of batches) there's a pattern of having the error code checked for being in the range 200 to 300.

@indreka
Copy link
Contributor Author

indreka commented Jan 10, 2025

Style updates merged. Is all set now from my side or should I squash all changes into single commit?

Storage/tests/Unit/Connection/RestTest.php Outdated Show resolved Hide resolved
@bshaffer bshaffer enabled auto-merge (squash) January 10, 2025 19:40
@bshaffer bshaffer merged commit 6216964 into googleapis:main Jan 10, 2025
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants