fix: gracefully handle read errors when retrying documents #137
Conversation
I think you need to make sure you don't read data that doesn't belong to the current request. If the previous request had a larger payload, you will get non-EOF errors towards the end of the request because the gzip reader will attempt to read the stale data at the end of copyBuf.
I had fixed it here: https://github.com/elastic/go-docappender/pull/129/files#diff-c7cec697c2474f331487a79c052f6150eca5c9a2162a74094df972d831e8223aL235 - it only requires making sure copyBuf's length doesn't go past the data from the current request payload.
I think the tests are failing because of this right now
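A minimal sketch of that idea, assuming a hypothetical helper (the names newPayloadReader and payloadLen are illustrative, not the library's actual identifiers): bound the gzip reader to the bytes written by the current request so it can never run into leftover data from a previous, larger payload.

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

// newPayloadReader wraps only the portion of copyBuf produced by the current
// request, so decompression cannot read stale bytes left over from a previous,
// larger payload.
func newPayloadReader(copyBuf []byte, payloadLen int) (*gzip.Reader, error) {
	gr, err := gzip.NewReader(bytes.NewReader(copyBuf[:payloadLen]))
	if err != nil {
		return nil, fmt.Errorf("failed to create gzip reader: %w", err)
	}
	return gr, nil
}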
@@ -348,7 +351,10 @@ func (b *bulkIndexer) Flush(ctx context.Context) (BulkIndexerResponseStat, error
 	// loop until we've seen the start newline
 	for seen+newlines < startln {
 		seen += newlines
-		n, _ := gr.Read(buf[:cap(buf)])
+		n, err := gr.Read(buf[:cap(buf)])
+		if err != nil && err != io.EOF {
If you encounter an unexpected EOF, this will be an infinite loop. Defensive code should bail out and return an error in that case, even though it shouldn't be possible in normal circumstances.
Could you clarify this? If we get an error we just return early; I don't see why this is an infinite loop.
You only return early if err != nil and err != EOF.
If err == EOF and we haven't yet seen the start newline, this may never exit. This shouldn't happen if the contents of copyBuf are what we expect, so it depends how defensive you want to be.
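For illustration, a sketch of the defensive variant being discussed. The loop shape and the names seen, newlines, startln, gr, and buf mirror the diff above; how newlines is recomputed each iteration, and the enclosing function, are assumptions made here to keep the example self-contained.

import (
	"bytes"
	"fmt"
	"io"
)

// skipToStart reads from gr until startln newlines have been seen, but bails
// out with an error if the reader is exhausted first, instead of spinning
// forever on zero-byte reads at EOF.
func skipToStart(gr io.Reader, buf []byte, startln int) error {
	var seen, newlines int
	for seen+newlines < startln {
		seen += newlines
		n, err := gr.Read(buf[:cap(buf)])
		if err != nil && err != io.EOF {
			return fmt.Errorf("failed to read from gzip reader: %w", err)
		}
		if n == 0 && err == io.EOF {
			// EOF before the start newline was reached: returning here avoids
			// the infinite loop described above.
			return fmt.Errorf("unexpected EOF before reaching the start newline")
		}
		newlines = bytes.Count(buf[:n], []byte{'\n'})
	}
	return nil
}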
@@ -364,7 +370,10 @@ func (b *bulkIndexer) Flush(ctx context.Context) (BulkIndexerResponseStat, error
 	// loop until we've seen the end newline
 	for seen+newlines < endln {
 		seen += newlines
-		n, _ := gr.Read(buf[:cap(buf)])
+		n, err := gr.Read(buf[:cap(buf)])
+		if err != nil && err != io.EOF {
(same here - this may be an infinite loop if we encounter an unexpected EOF while we still expect to find newlines)
I don't mean to block this PR behind the infinite loop concern, which should in theory only happen in memory corruption scenarios. This isn't a regression - approving now, since I need to step away, in case you want to release.
Otherwise feel free to amend as you see fit.
I think we need to revisit this approach. With it we would end up losing all metrics for the current flush request if any error occurs in the retry phase, and I don't think that's ideal.
Prior to this PR we were having spurious failed gzip reads due to a bug, but ignoring them. First and foremost, error handling helps with maintainability and debuggability: now that error handling has been added, the "bug" surfaced and is fixed with your latest commit. After this, if there are any errors it means the payload data is corrupted or there's a more serious service bug. In general there is no reason we shouldn't be able to properly decompress something we've just compressed. If memory is getting corrupted, metrics should be the last of our problems imho. This change is only for maintainability and to ensure we can catch very serious unexpected issues, not to catch something we expect to happen in steady state.
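To illustrate the round-trip point, a small self-contained sketch (not from the PR): compressing a payload and immediately decompressing it should never produce a non-EOF read error, so such an error during retry bookkeeping signals corruption or a serious bug rather than an expected steady-state condition.

import (
	"bytes"
	"compress/gzip"
	"io"
)

// gzipRoundTrip compresses payload and reads it back; any error from the
// reader indicates corrupted data or a serious bug rather than a normal case.
func gzipRoundTrip(payload []byte) error {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(payload); err != nil {
		return err
	}
	if err := zw.Close(); err != nil {
		return err
	}
	zr, err := gzip.NewReader(&buf)
	if err != nil {
		return err
	}
	defer zr.Close()
	_, err = io.ReadAll(zr)
	return err // expected to be nil for data we just compressed
}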
Do not ignore read errors.
Stop reading and return the error.