
Retry the entire screenshotStitcher call #20770

Conversation

stacey-gammon
Contributor

@stacey-gammon stacey-gammon commented Jul 13, 2018

"Fixes" #19563 by retrying. It was proving extremely difficult to debug because:

  • It only showed up on Jenkins CI (running locally in a loop, with a smaller maxDimensionSize, did not cause the bug to appear).
  • Adding debug logic right after Page.captureScreenshot to ensure the correct size of data caused the bug to disappear, so it's clearly a timing issue.

This is by no means an ideal solution, but in an effort to make forward progress on Chromium, I think it suffices as a stopgap.

@stacey-gammon stacey-gammon added the :Sharing and (Deprecated) Feature:Reporting labels Jul 13, 2018
@elasticmachine
Contributor

💔 Build Failed

@stacey-gammon
Contributor Author

Okay, the error was caught twice out of the 11 runs and the retry succeeded. The CI failed (I think) because 11 API reporting test runs caused it to time out.

@elasticmachine
Contributor

💚 Build Succeeded

@stacey-gammon stacey-gammon force-pushed the 2018-07-13-retry-screenshot-stitcher branch from dff1126 to 423980d Compare July 16, 2018 12:03
@elasticmachine
Contributor

💚 Build Succeeded

@stacey-gammon
Contributor Author

Ping @kobelb / @chrisdavies - Not a huge rush, but if you two could take a look some time this week, I could merge when I come back from vacation on Monday (assuming no major objections/code changes).

@kobelb
Contributor

kobelb commented Jul 17, 2018

The retry leaves me feeling uneasy, and I'm hesitant to give it a LGTM. If it could fail once, it could potentially fail 3 times in a row.

@stacey-gammon
Contributor Author

Agree it's nowhere near ideal, but given the difficulties debugging it (only reproducible on Jenkins, adding debug output causes the issue to disappear, suspect the issue is with Chromium), the lack of bandwidth on the team, and the benefits this brings (turning on CI for Chromium tests), it's difficult to justify allocating more resources towards this.

You can check out #20651 to see how strange this bug is. I added code immediately after await Page.captureScreenshot({ that converted the data stream to a PNG and verified the width and height. Then I returned the original data stream and kept all the other logic the same. This made the bug disappear.

I could have used that code too. I opted against it for efficiency reasons (converts to png, returns data, just to later once again convert to png), and it also doesn't "fix" the issue.
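
For readers who want a concrete picture of that kind of check, here is a minimal sketch. It is not the code from #20651: the helper names are hypothetical, it assumes the screenshot arrives as a base64-encoded PNG string and runs under Node.js (for Buffer), and instead of converting the whole stream to a PNG it only reads the width and height fields of the PNG header, which is a cheaper stand-in for the full conversion described above.

```typescript
// Hypothetical helpers for illustration only; not the code from #20651.
interface Dimensions {
  width: number;
  height: number;
}

// Every PNG file starts with this fixed 8-byte signature.
const PNG_SIGNATURE = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];

function readPngDimensions(base64Data: string): Dimensions {
  const bytes = Buffer.from(base64Data, 'base64');
  // Layout: 8-byte signature, 4-byte chunk length, 4-byte "IHDR" type,
  // then 4-byte width and 4-byte height (both big-endian).
  if (bytes.length < 24) {
    throw new Error('Data is too short to contain a PNG header');
  }
  for (let i = 0; i < PNG_SIGNATURE.length; i++) {
    if (bytes[i] !== PNG_SIGNATURE[i]) {
      throw new Error('Data does not start with the PNG signature');
    }
  }
  if (bytes.toString('ascii', 12, 16) !== 'IHDR') {
    throw new Error('First PNG chunk is not IHDR');
  }
  return { width: bytes.readUInt32BE(16), height: bytes.readUInt32BE(20) };
}

// Verifies the size, then hands the original data back untouched so the
// rest of the pipeline stays the same.
function assertExpectedSize(base64Data: string, expected: Dimensions): string {
  const actual = readPngDimensions(base64Data);
  if (actual.width !== expected.width || actual.height !== expected.height) {
    throw new Error(
      `Expected ${expected.width}x${expected.height} but got ${actual.width}x${actual.height}`
    );
  }
  return base64Data;
}
```

A caller would wrap the captured data, e.g. `const data = assertExpectedSize(capturedData, { width, height })`, where `capturedData`, `width`, and `height` are whatever the surrounding code already has on hand.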

A third option would be to move the data -> png conversion earlier in the process. Again, this wouldn't actually "fix" the bug, but it would avoid the efficiency issues. This would have created the need for some test restructuring (as we had been discussing). And since it also just hides the issue, it didn't seem worth the effort (or refactoring risk of introducing more bugs).

Fourth option is to continue trying to debug this. I don't think this is a good option because:

  • slew of higher priority items that need my attention (blockers, flaky tests)
  • Chromium bugs can continue to sneak in since there are no Chromium tests on CI.

If it fails three times in a row, good! Hopefully that would mean we would have found a highly reproducible environment and can debug it for reals.

Even though it's a band-aid fix, being able to turn Chromium tests on in CI is really important and a huge win. It will allow us to catch any other issues that we aren't catching now because we can't turn the tests on until this is resolved.

Time might also be better spent creating a new chromium build automatically as part of the build process which is another thing we have to do. Maybe a new chromium build would fix it for us.

tl;dr; Agree there are cons to checking this in but benefits outweigh them IMO.

cc @epixa - would you mind weighing in here? I'm trying to make a judgement call based off an ad hoc cost/benefit analysis and think it's worth it to check this subpar solution in. Would be interested to hear what you think.

Happy to set up a meeting (for next week) to go over it in a zoom too, if that would help.

@stacey-gammon
Contributor Author

Could also argue that we already have precedent for this in reporting, since we have that "retry three times" in case of failed reporting jobs (though that only covers certain code flows so this falls outside of that).

@epixa
Contributor

epixa commented Jul 17, 2018

Obviously I'd prefer if we fixed the underlying issue in a reliable way, but there's no harm in being pragmatic here. If we're wrong about the benefits of this change and the flakiness returns, we can always disable the tests again. If we're right, then we no longer have an entire feature not being tested.

@kobelb
Contributor

kobelb commented Jul 18, 2018

@epixa just to make sure we're on the same page, we aren't doing a "retry" in the context of the tests, we're actually retrying the screen capture process.

@epixa
Contributor

epixa commented Jul 18, 2018

Thanks for clarifying, @kobelb. That does sour this approach in my mind, but I still think it's worth trying. In the worst case, this bug still happens, only less frequently than it does today. The consequence of retrying seems to be minimal, and indeed generating the same report again is likely the advice we'd give a user that was impacted by this bug.

@stacey-gammon I do think some changes would make this more useful though. For one, we should attempt to identify this specific error case/message and only retry when it occurs rather than whenever any error occurs. If a different error appears, reporting should fail as it would have otherwise. Also, we should render a more useful error message in the event that this specific error is encountered 3 times, and make a GitHub issue that lists that error message text along with a description of what is going on and what the user can do to work around it (i.e. create the report again). The issue should probably link back to this PR and the original issue, and it should outright ask anyone affected by it to comment on the issue with information about their setup so we can debug.
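
As a rough illustration of that suggestion (retry only on the one known error, let every other error fail as before, and surface a more helpful message once the attempts are exhausted), here is a minimal sketch. All names are hypothetical: the error class, the helper, and the `issueReference` parameter are assumptions for illustration, since the actual error type, message text, and issue link used in this PR are not shown in this conversation.

```typescript
// Hypothetical error type standing in for "the error in question".
class ScreenshotSizeMismatchError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'ScreenshotSizeMismatchError';
    // Keeps instanceof working when compiled to older JavaScript targets.
    Object.setPrototypeOf(this, ScreenshotSizeMismatchError.prototype);
  }
}

// Runs `operation` up to `maxAttempts` times, retrying only when
// `isKnownError` recognizes the failure.
async function retryOnKnownError<T>(
  operation: () => Promise<T>,
  isKnownError: (error: unknown) => boolean,
  maxAttempts: number,
  issueReference: string
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      // Any other error fails immediately, exactly as it would without the retry.
      if (!isKnownError(error)) {
        throw error;
      }
      // Out of attempts: tell the user what happened and what they can do.
      if (attempt === maxAttempts) {
        throw new Error(
          `Screenshot capture hit a known intermittent error ${maxAttempts} times in a row. ` +
            `Try generating the report again, and please add details about your setup to ${issueReference}.`
        );
      }
    }
  }
  // Only reachable when maxAttempts is less than 1.
  throw new Error('maxAttempts must be at least 1');
}
```

The predicate is passed in rather than hard-coded so the decision about what counts as "this specific error" lives in one place; a caller might use `(e) => e instanceof ScreenshotSizeMismatchError` with `maxAttempts` set to 3.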

Contributor

@chrisdavies chrisdavies left a comment

LGTM. Agree w/ the other commenters that this is not ideal. We should probably make a task to rip out the retry when we upgrade Chromium and see if the issue disappears w/ the upgrade.

Contributor

@kobelb kobelb left a comment

LGTM

@stacey-gammon stacey-gammon force-pushed the 2018-07-13-retry-screenshot-stitcher branch from 423980d to 9ffa281 Compare July 23, 2018 21:00
@stacey-gammon
Contributor Author

I think that sounds like a good call, @epixa. I've adjusted this so it only retries on the error in question and also spits out the GitHub issue link where users can log more information.

screen shot 2018-07-23 at 4 57 39 pm

screen shot 2018-07-23 at 4 59 15 pm

screen shot 2018-07-23 at 4 59 37 pm

@kobelb and @chrisdavies mind re-reviewing the latest changes?

@elasticmachine
Contributor

💔 Build Failed

@stacey-gammon stacey-gammon force-pushed the 2018-07-13-retry-screenshot-stitcher branch from 9ffa281 to a323769 Compare July 24, 2018 12:33
@stacey-gammon
Contributor Author

Sounds like that last build failure was something that got checked in by ML and should be fixed now. Rebasing and re-building.

@elasticmachine
Contributor

💚 Build Succeeded

@elasticmachine
Contributor

💚 Build Succeeded

@elasticmachine
Contributor

💚 Build Succeeded

@stacey-gammon stacey-gammon merged commit 0078e66 into elastic:master Jul 28, 2018
stacey-gammon added a commit to stacey-gammon/kibana that referenced this pull request Jul 28, 2018
* Retry the entire screenshotStitcher call

* Go back to a single run

* Only retry for this specific error.  Post more information including the git issue link
stacey-gammon added a commit that referenced this pull request Jul 28, 2018
* Retry the entire screenshotStitcher call

* Go back to a single run

* Only retry for this specific error.  Post more information including the git issue link
stacey-gammon added a commit to stacey-gammon/kibana that referenced this pull request Jul 31, 2018
* Retry the entire screenshotStitcher call

* Go back to a single run

* Only retry for this specific error.  Post more information including the git issue link
stacey-gammon added a commit that referenced this pull request Jul 31, 2018
* Retry the entire screenshotStitcher call

* Go back to a single run

* Only retry for this specific error.  Post more information including the git issue link
Labels
(Deprecated) Feature:Reporting Use Reporting:Screenshot, Reporting:CSV, or Reporting:Framework instead
5 participants