-
Notifications
You must be signed in to change notification settings - Fork 8.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Retry the entire screenshotStitcher call #20770
Retry the entire screenshotStitcher call #20770
Conversation
💔 Build Failed |
Okay, error was caught twice out of the 11 runs and the retry succeeded. The ci failed (i think) because 11 api reporting test runs caused it to timeout. |
💚 Build Succeeded |
dff1126
to
423980d
Compare
💚 Build Succeeded |
Ping @kobelb / @chrisdavies - Not a huge rush, but if you two could take a look some time this week, I could merge when I come back from vacation on Monday (assuming no major objections/code changes). |
The retry leaves me feeling uneasy, and I'm hesitant to give it a LGTM. If it could fail once, it could potentially fail 3 times in a row. |
Agree it's no where near ideal but given the difficulties debugging it (only reproducible on jenkins, adding debug output causes the issue to disappear, suspect the issue is with chromium), the lack of bandwidth on the team, and the benefits this brings (turning on ci for chromium tests), it's difficult to justify allocating more resources towards this. You can check out #20651 this for how strange this bug is. I added code immediately after I could have used that code too. I opted against it for efficiency reasons (converts to png, returns data, just to later once again convert to png), and it also doesn't "fix" the issue. A third option would be to move the data -> png conversion earlier in the process. Again, this wouldn't actually "fix" the bug, but it would avoid the efficiency issues. This would have created the need for some test restructuring (as we had been discussing). And since it also just hides the issue, it didn't seem worth the effort (or refactoring risk of introducing more bugs). Fourth option is to continue trying to debug this. I don't think this is a good option because:
If it fails three times in a row, good! Hopefully that would mean we would have found a highly reproducible environment and can debug it for reals. Even though it's a band aid fix, being able to turn chromium tests on in ci is really important and a huge win. It will allow us to catch any other issues that we aren't catching now because we can't turn the tests on until this is resolved. Time might also be better spent creating a new chromium build automatically as part of the build process which is another thing we have to do. Maybe a new chromium build would fix it for us. tl;dr; Agree there are cons to checking this in but benefits outweigh them IMO. cc @epixa - would you mind weighing in here? I'm trying to make a judgement call based off an ad hoc cost/benefit analysis and think it's worth it to check this sub par solution in. Would be interested to hear what you think. Happy to set up a meeting (for next week) to go over it in a zoom too, if that would help. |
Could also argue that we already have precedent for this in reporting, since we have that "retry three times" in case of failed reporting jobs (though that only covers certain code flows so this falls outside of that). |
Obviously I'd prefer if we fixed the underlying issue in reliable way, but there's no harm in being pragmatic here. If we're wrong about the benefits of this change and the flakiness returns, we can always disable the tests again. If we're right, then we no longer have an entire feature not being tested. |
@epixa just to make sure we're on the same page, we aren't doing a "retry" in the context of the tests, we're actually retrying the screen capture process. |
Thanks for clarifying, @kobelb. That does sour this approach in my mind, but I still think it's worth trying. In the worst case, this bug still happens only less frequently than it does today. The consequence of retrying seems to be minimal, and indeed generating the same report again is likely the advice we'd give a user that was impacted by this bug. @stacey-gammon I do think some changes would make this more useful though. For one, we should attempt to identify this specific error case/message and only retry when it occurs rather than whenever any error occurs. If a different error appears, reporting should fail as it would have otherwise. Also, we should render a more useful error message in the event that this specific error is encountered 3 times and make a github issue that lists that error message text along with a description of what is going on and what the user can do to work around it (i.e. create the report again). The issue should probably link back to this PR and the original issue, and it should outright ask anyone affected by it to comment on the issue with information about their setup so we can debug. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Agree w/ the other commenters that this is not ideal. We should probably make a task to rip out the retry when we upgrade Chromium and see if the issue disappears w/ the upgrade.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
423980d
to
9ffa281
Compare
I think that sounds like a good call @epixa. I've adjusted so this only retries on the error in question and also spits out the git issue link where users can log more information. @kobelb and @chrisdavies mind re-reviewing the latest changes? |
💔 Build Failed |
9ffa281
to
a323769
Compare
Sounds like that last build failure was something that got checked in by ML and should be fixed now. Rebasing and re-building. |
💚 Build Succeeded |
…07-13-retry-screenshot-stitcher
💚 Build Succeeded |
…07-13-retry-screenshot-stitcher
💚 Build Succeeded |
* Retry the entire screenshotStitcher call * Go back to a single run * Only retry for this specific error. Post more information including the git issue link
* Retry the entire screenshotStitcher call * Go back to a single run * Only retry for this specific error. Post more information including the git issue link
"Fixes" #19563 by retrying. It was proving extremely difficult to debug due to the fact that:
Page.captureScreenshot
to ensure the correct size of data cause the bug to disappear, so it's clearly a timing issue.This is by no means an ideal solution but in an effort to make forward progress on chromium, I think it suffices as a stop gap.