Initial discussion around CI failure handling #1

joyeecheung · 2018-06-06T12:56:46Z

Specifically

The part in README regarding where to open issues and how to label them, etc.
The schema of the flaky test database (see https://github.com/nodejs/reliability/blob/master/flakes.draft.json)

After we settle down on those I can start implement the actions in ncu-ci to automate the CI status reporting and flake tracking process

joyeecheung · 2018-06-06T13:20:17Z

BTW on why I chose to include the full URL in the flakes.draft.json: the Microsoft acquisition thing (it's unlikely that we will want to move, but anyway)

sam-github · 2019-06-07T17:20:25Z

@joyeecheung @nodejs/tsc Should this/could this be a strategic initiative?

Also, what is current state? Flakes seem to be tracked as issues in nodejs/node, so what is current state? I'd never seen this repo until moments ago!

FYI, I've contact some folks who have offered to look at windows platform issues in the past, and got some promises that some of them will look at the windows flakiness.

https://ci.nodejs.org/view/Node.js%20Daily/job/node-daily-master/ is discouraging, and helps explain why I've had so much trouble getting trivial PRs green in CI recently.

I've started adding issues and PRing flaky status to tests. We "should" fix them, but while they are not fixed, its not helpful to have unrelated test failures on PR builds.

P.S. RSS of Daily Master failures

refack · 2019-06-07T17:44:27Z

ci.nodejs.org/view/Node.js%20Daily/job/node-daily-master is discouraging, and helps explain why I've had so much trouble getting trivial PRs green in CI recently.

Currently @Trott and myself get emails when "Daily Master" fails. I wanted to setup a mailing list or an RSS/ATOM feed for that so anyone can easily track it's status.

I've started adding issues and PRing flaky status to tests. We "should" fix them, but while they are not fixed, its not helpful to have unrelated test failures on PR builds.

IMHO that's currently our best way to structure the known-flaky DB as test configuration. But without a champion those can turn from temporary measure into permanent "solutions", as we do have some tests that have been flaky for a long time, e.g. nodejs/node#23207, nodejs/node#26401, nodejs/node#20750.

sam-github · 2019-06-07T17:56:50Z

If its a strategic initiative, the current count of ignored tests will get reported at every TSC meeting. No guarantee they will get fixed, of course, but at least they will be visible.

sam-github · 2019-06-07T20:19:33Z

I'd like to find a place to point contributors at the flakes issues as good first contributions, but didn't. I was hoping to find somewhere in our docs that pointed to the good first issue label, but the only reference i found was in doc/onboarding-extras.md, which seems a bit obscure. Is there no "so, you want to help out with Node.js, great, look here" section of our docs?

Perhaps I should simply label all the ci flakes as good first issue?

I think they are good first contributions because they often require no knowledge of node internals. If the tests are written entirely in terms of the public node.js API, then someone with sufficient knowledge of the API alone could take a crack at any of them. For example, some expert user of workers might not have the C++ ability to contribute to the worker implementation, but be perfectly capable of reasoning carefully about the worker tests, and making them more robust. Fixing a flake also adds appreciable value to the project, so make great drive-by contributions for those so inclined.

Trott · 2019-06-07T21:40:18Z

I'd use help wanted and maybe even mentor available for the flaky tests, but definitely not good first issue. They often require very specific knowledge and/or access to a particular platform, often at a particular version. There's also a lot of what you might call "institutional knowledge" about how to reliably reproduce the flakiness, since a straightforward run of node-stress-single-test often doesn't do it.

gireeshpunathil · 2019-06-08T02:13:31Z

addressing CI failures require an organized and continued team work. Individual efforts are not sustainable, and people get worn out easily.

I agree with @Trott , these are definitely not good-first-issues.

Often, the problem determination of these failures takes 2 phases:

recreating and reducing the test to its fundamental form, that involves only the failing expression
analyzing the faulty path, connecting the dots and coming up with a theory, derive or collect patches to prove or disprove the theory

second part is real hard.

gireeshpunathil · 2019-06-08T04:08:21Z

I am +1 on running this through tsc.

joyeecheung · 2019-06-09T15:09:39Z

The idea proposed by the OP has never been used in practice. Sometimes I use ncu-ci walk pr to gather suspicious failures and their patterns and open a thread here, but I have not been able to do so for some time now and ncu-ci now sometimes have trouble understanding windows failures (somehow neither does Jenkins CI understand those). The source of truth about the flakes is still the nodejs/node issue tracker.

I think it's worth being a strategic initiative, but we'll need a champion (or more) who has enough time to keep an eye on this.

Trott · 2019-06-09T16:27:17Z

Arguably, this could be a part of the existing strategic initiative for Build WG resources.

mhdawson · 2019-06-11T13:35:35Z

I'm definitely +1 as a strategic initiative if we can find a champion or if @Trott is volunteering by integrating it into the existing Build strategic initiative. I agree raising the awareness of the flakes by reviewing the numbers in the TSC meeting would be a great first step.

Trott · 2019-06-11T16:27:24Z

I'm happy to highlight a particularly problematic test during each announcements section of the meeting when I'm in attendance. Maybe sam-github or someone else can do it when I'm not around. This week, it would have to be the way sequential/test-cpu-prof is now timing out at or nearly at 100% of the time on ubuntu1604_sharedlibs_debug_x64 making it nearly impossible to land anything that isn't a doc change. This started about 48 hours ago, so only stuff that had CI runs before then tends to be landing right now. :-(

sam-github · 2019-06-11T17:02:34Z

I'll think about what I have time to do (vs what I wish I could do with more time).

I looked at the build strategic initiative, and I'm not sure it is a fit for this. It is related, in that build infratructure failures can manifest as CI failures. Problems with host memory, disk, jenkins disconnections, tap2xml failures, etc. Those are within scope of build resources.

However, the flaky tests in particular seem to be solidly non-build, they need developer effort. The build team doesn't seem to me like it should take on the extra scope of fixing unreliable github.com/nodejs/node unit tests, its strained already.

Trott · 2019-06-11T17:08:32Z

I'm happy to highlight a particularly problematic test during each announcements section of the meeting when I'm in attendance. Maybe sam-github or someone else can do it when I'm not around. This week, it would have to be the way sequential/test-cpu-prof is now timing out at or nearly at 100% of the time on ubuntu1604_sharedlibs_debug_x64 making it nearly impossible to land anything that isn't a doc change. This started about 48 hours ago, so only stuff that had CI runs before then tends to be landing right now. :-(

Refs: nodejs/node#27611

Also, Joyee has a proposed fix as of 7 hours ago.
nodejs/node#28170

mhdawson · 2019-06-11T18:00:43Z

@Trott while highlighting an urgent problem will have some benefit, I think reporting on the trend in terms of CI stability would possibly be more useful. ie are we slowly degrading/getting better and how quickly versus simply. That is something I think the TSC should be on top of and work to try to influence as opposed informing the TSC that some issue needs more help right now and hoping the TSC members have time to help out on that specific issue. Just my 2cents though.

sam-github mentioned this issue Jun 7, 2019

test: add github refs to flaky tests nodejs/node#28123

Closed

4 tasks

refack mentioned this issue Jun 10, 2019

Windows CI failures nodejs/node#28152

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Initial discussion around CI failure handling #1

Initial discussion around CI failure handling #1

joyeecheung commented Jun 6, 2018 •

edited

Loading

joyeecheung commented Jun 6, 2018

sam-github commented Jun 7, 2019 •

edited by refack

Loading

refack commented Jun 7, 2019

sam-github commented Jun 7, 2019

sam-github commented Jun 7, 2019

Trott commented Jun 7, 2019

gireeshpunathil commented Jun 8, 2019

gireeshpunathil commented Jun 8, 2019

joyeecheung commented Jun 9, 2019 •

edited

Loading

Trott commented Jun 9, 2019

mhdawson commented Jun 11, 2019

Trott commented Jun 11, 2019

sam-github commented Jun 11, 2019

Trott commented Jun 11, 2019

mhdawson commented Jun 11, 2019

Initial discussion around CI failure handling #1

Initial discussion around CI failure handling #1

Comments

joyeecheung commented Jun 6, 2018 • edited Loading

joyeecheung commented Jun 6, 2018

sam-github commented Jun 7, 2019 • edited by refack Loading

refack commented Jun 7, 2019

sam-github commented Jun 7, 2019

sam-github commented Jun 7, 2019

Trott commented Jun 7, 2019

gireeshpunathil commented Jun 8, 2019

gireeshpunathil commented Jun 8, 2019

joyeecheung commented Jun 9, 2019 • edited Loading

Trott commented Jun 9, 2019

mhdawson commented Jun 11, 2019

Trott commented Jun 11, 2019

sam-github commented Jun 11, 2019

Trott commented Jun 11, 2019

mhdawson commented Jun 11, 2019

joyeecheung commented Jun 6, 2018 •

edited

Loading

sam-github commented Jun 7, 2019 •

edited by refack

Loading

joyeecheung commented Jun 9, 2019 •

edited

Loading