Retry the entire screenshotStitcher call #20770

stacey-gammon · 2018-07-13T14:03:37Z

"Fixes" #19563 by retrying. It was proving extremely difficult to debug because:

It only showed up on Jenkins CI (running locally in a loop, with a smaller maxDimensionSize, did not cause the bug to appear).
Adding debug logic right after Page.captureScreenshot to ensure the correct size of data caused the bug to disappear, so it's clearly a timing issue.

This is by no means an ideal solution, but in an effort to make forward progress on chromium, I think it suffices as a stopgap.

elasticmachine · 2018-07-13T15:36:26Z

💔 Build Failed

continuous-integration/kibana-ci/pull-request

stacey-gammon · 2018-07-13T15:49:39Z

Okay, the error was caught twice out of the 11 runs and the retry succeeded. The CI failed (I think) because 11 api reporting test runs caused it to time out.

elasticmachine · 2018-07-13T17:07:26Z

💚 Build Succeeded

continuous-integration/kibana-ci/pull-request

elasticmachine · 2018-07-16T13:16:18Z

💚 Build Succeeded

continuous-integration/kibana-ci/pull-request

stacey-gammon · 2018-07-17T19:37:02Z

Ping @kobelb / @chrisdavies - Not a huge rush, but if you two could take a look some time this week, I could merge when I come back from vacation on Monday (assuming no major objections/code changes).

kobelb · 2018-07-17T20:00:29Z

The retry leaves me feeling uneasy, and I'm hesitant to give it a LGTM. If it could fail once, it could potentially fail 3 times in a row.

stacey-gammon · 2018-07-17T20:48:24Z

Agree it's nowhere near ideal, but given the difficulties debugging it (only reproducible on jenkins, adding debug output causes the issue to disappear, suspect the issue is with chromium), the lack of bandwidth on the team, and the benefits this brings (turning on ci for chromium tests), it's difficult to justify allocating more resources towards this.

You can check out #20651 to see how strange this bug is. I added code immediately after await Page.captureScreenshot({ which converted the data stream to a png and verified the width and height. Then I returned the original data stream and kept all the other logic the same. This made the bug disappear.

I could have used that code too. I opted against it for efficiency reasons (converts to png, returns data, just to later once again convert to png), and it also doesn't "fix" the issue.
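For reference, a check of that kind can be sketched roughly as follows. This is not the code from #20651: as an assumption made for brevity, it reads the dimensions straight from the PNG header instead of doing a full conversion, and `readPngSize` and `verifyPngSize` are hypothetical names.

```typescript
// Hypothetical sketch, not the debug code from #20651.
// Page.captureScreenshot returns base64 data; decoding it to bytes
// (for example with Buffer.from(data, 'base64') in Node) is left to the caller.
interface PngSize {
  width: number;
  height: number;
}

const PNG_SIGNATURE = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];
const IHDR_TYPE = [0x49, 0x48, 0x44, 0x52]; // ASCII "IHDR"

function readPngSize(bytes: Uint8Array): PngSize {
  // 8-byte signature + 4-byte chunk length + 4-byte chunk type + 8 bytes of width/height
  if (bytes.length < 24) {
    throw new Error('Data is too short to be a PNG');
  }
  const matches = (expected: number[], offset: number): boolean =>
    expected.every((value, i) => bytes[offset + i] === value);
  if (!matches(PNG_SIGNATURE, 0) || !matches(IHDR_TYPE, 12)) {
    throw new Error('Missing PNG signature or IHDR chunk');
  }
  // Width and height are stored as big-endian 32-bit integers (the DataView default).
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  return { width: view.getUint32(16), height: view.getUint32(20) };
}

function verifyPngSize(bytes: Uint8Array, expected: PngSize): Uint8Array {
  const actual = readPngSize(bytes);
  if (actual.width !== expected.width || actual.height !== expected.height) {
    throw new Error(
      `Expected ${expected.width}x${expected.height} but got ${actual.width}x${actual.height}`
    );
  }
  // Hand the original data back untouched so the rest of the pipeline is unchanged.
  return bytes;
}
```

Reading only the header would catch a wrong reported size, but not corrupt pixel data, so it is a narrower check than a full conversion.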

A third option would be to move the data -> png conversion earlier in the process. Again, this wouldn't actually "fix" the bug, but it would avoid the efficiency issues. This would have created the need for some test restructuring (as we had been discussing). And since it also just hides the issue, it didn't seem worth the effort (or refactoring risk of introducing more bugs).

Fourth option is to continue trying to debug this. I don't think this is a good option because:

There is a slew of higher priority items that need my attention (blockers, flaky tests).
Chromium bugs can continue to sneak in since there are no tests on ci.

If it fails three times in a row, good! Hopefully that would mean we would have found a highly reproducible environment and can debug it for reals.

Even though it's a band aid fix, being able to turn chromium tests on in ci is really important and a huge win. It will allow us to catch any other issues that we aren't catching now because we can't turn the tests on until this is resolved.

Time might also be better spent creating a new chromium build automatically as part of the build process which is another thing we have to do. Maybe a new chromium build would fix it for us.

tl;dr; Agree there are cons to checking this in but benefits outweigh them IMO.

cc @epixa - would you mind weighing in here? I'm trying to make a judgement call based off an ad hoc cost/benefit analysis and think it's worth it to check this subpar solution in. Would be interested to hear what you think.

Happy to set up a meeting (for next week) to go over it in a zoom too, if that would help.

stacey-gammon · 2018-07-17T20:51:14Z

Could also argue that we already have precedent for this in reporting, since we have that "retry three times" in case of failed reporting jobs (though that only covers certain code flows so this falls outside of that).

epixa · 2018-07-17T21:06:40Z

Obviously I'd prefer if we fixed the underlying issue in a reliable way, but there's no harm in being pragmatic here. If we're wrong about the benefits of this change and the flakiness returns, we can always disable the tests again. If we're right, then we no longer have an entire feature not being tested.

kobelb · 2018-07-18T11:23:34Z

@epixa just to make sure we're on the same page, we aren't doing a "retry" in the context of the tests, we're actually retrying the screen capture process.

epixa · 2018-07-18T21:45:12Z

Thanks for clarifying, @kobelb. That does sour this approach in my mind, but I still think it's worth trying. In the worst case, this bug still happens, only less frequently than it does today. The consequence of retrying seems to be minimal, and indeed generating the same report again is likely the advice we'd give a user that was impacted by this bug.

@stacey-gammon I do think some changes would make this more useful though. For one, we should attempt to identify this specific error case/message and only retry when it occurs rather than whenever any error occurs. If a different error appears, reporting should fail as it would have otherwise. Also, we should render a more useful error message in the event that this specific error is encountered 3 times and make a github issue that lists that error message text along with a description of what is going on and what the user can do to work around it (i.e. create the report again). The issue should probably link back to this PR and the original issue, and it should outright ask anyone affected by it to comment on the issue with information about their setup so we can debug.
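A rough sketch of that shape of retry, not the code in this PR. `retryOn`, `shouldRetry`, and `describeFailure` are made-up names, and the matched error text and the issue link in the usage comment are placeholders.

```typescript
// Hypothetical sketch: retry only when a specific, known error occurs.
interface RetryOptions {
  maxAttempts: number;
  // Decides whether an error is the known flaky one.
  shouldRetry: (error: unknown) => boolean;
  // Builds the message shown once every attempt has hit the known error.
  describeFailure: (attempts: number, lastError: unknown) => string;
}

async function retryOn<T>(task: () => Promise<T>, options: RetryOptions): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= options.maxAttempts; attempt++) {
    try {
      return await task();
    } catch (error) {
      // Any other error fails immediately, as it would have without the retry.
      if (!options.shouldRetry(error)) {
        throw error;
      }
      lastError = error;
    }
  }
  throw new Error(options.describeFailure(options.maxAttempts, lastError));
}

// Usage (placeholders only; the real error text and issue link are not shown here):
//
// await retryOn(() => captureScreenshot(), {
//   maxAttempts: 3,
//   shouldRetry: (e) => e instanceof Error && e.message.includes('<known error text>'),
//   describeFailure: (n) =>
//     `Screenshot capture hit a known issue ${n} times in a row. ` +
//     'Try generating the report again, and please add details about your setup to <issue link>.',
// });
```

Keeping the predicate and the final message as parameters means the retry can be removed or narrowed later without touching the capture code itself.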

chrisdavies

LGTM. Agree w/ the other commenters that this is not ideal. We should probably make a task to rip out the retry when we upgrade Chromium and see if the issue disappears w/ the upgrade.

kobelb

LGTM

stacey-gammon · 2018-07-23T21:02:08Z

I think that sounds like a good call @epixa. I've adjusted so this only retries on the error in question and also spits out the git issue link where users can log more information.

@kobelb and @chrisdavies mind re-reviewing the latest changes?

elasticmachine · 2018-07-23T22:50:30Z

💔 Build Failed

continuous-integration/kibana-ci/pull-request

stacey-gammon · 2018-07-24T12:34:19Z

Sounds like that last build failure was something that got checked in by ML and should be fixed now. Rebasing and re-building.

elasticmachine · 2018-07-24T14:09:52Z

💚 Build Succeeded

continuous-integration/kibana-ci/pull-request

elasticmachine · 2018-07-26T16:48:15Z

💚 Build Succeeded

continuous-integration/kibana-ci/pull-request

elasticmachine · 2018-07-27T14:01:08Z

💚 Build Succeeded

continuous-integration/kibana-ci/pull-request

* Retry the entire screenshotStitcher call
* Go back to a single run
* Only retry for this specific error. Post more information including the git issue link

stacey-gammon added the :Sharing and (Deprecated) Feature:Reporting labels Jul 13, 2018

stacey-gammon requested a review from kobelb July 13, 2018 15:53

stacey-gammon force-pushed the 2018-07-13-retry-screenshot-stitcher branch from dff1126 to 423980d Compare July 16, 2018 12:03

stacey-gammon requested a review from chrisdavies July 16, 2018 15:46

chrisdavies approved these changes Jul 23, 2018

kobelb approved these changes Jul 23, 2018

stacey-gammon force-pushed the 2018-07-13-retry-screenshot-stitcher branch from 423980d to 9ffa281 Compare July 23, 2018 21:00

stacey-gammon added 3 commits July 24, 2018 08:33

Retry the entire screenshotStitcher call

37c9c71

Go back to a single run

6f3df5d

Only retry for this specific error. Post more information including the git issue link

a323769

stacey-gammon force-pushed the 2018-07-13-retry-screenshot-stitcher branch from 9ffa281 to a323769 Compare July 24, 2018 12:33

This was referenced Jul 24, 2018

highlight sample data section for new users #20953

Merged

Bring back chromium tests #20651

Closed

Turn chromium tests on. #20673

Closed

Merge branch 'master' of https://github.com/elastic/kibana into 2018-07-13-retry-screenshot-stitcher

6444dbc

Merge branch 'master' of https://github.com/elastic/kibana into 2018-07-13-retry-screenshot-stitcher

afb8b5f

stacey-gammon merged commit 0078e66 into elastic:master Jul 28, 2018

stacey-gammon mentioned this pull request Jul 28, 2018

[6.x] Retry the entire screenshotStitcher call (#20770) #21377

Merged

This was referenced Jul 30, 2018

Fix chromium png bitblt error and reinstate tests #19871

Closed

[6.4] Retry the entire screenshotStitcher call (#20770) #21461

Merged
