Creating a separate pipeline to stage onboarding new test targets #44031
Tagging subscribers to this area: @ViktorHofer
We would still run the separate pipeline per PR to gather a high volume of data, but mark it as non-failing so that it doesn't impact the PR's result or the main pipeline's pass rate.
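As a rough illustration of what "non-failing" could look like in Azure Pipelines YAML (a sketch, not the repo's actual pipeline definition; the job and script names are made up), a test step marked with continueOnError reports the run as "partially succeeded" instead of failing it:

```yaml
jobs:
- job: Staging_Android_emulator      # illustrative name for a staging leg
  pool:
    vmImage: ubuntu-latest
  steps:
  - script: ./run-staging-tests.sh   # placeholder for the real test invocation
    displayName: Run staging tests
    continueOnError: true            # a failure here yields "partially succeeded", not "failed"
```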
If we're about to run the envisioned "runtime-staging" pipeline per PR, then it should be "a delta to the existing runtime pipeline"; if it was "a staging version of the runtime pipeline that gets copied to the main one once in a while", we'd be duplicating most runs in each PR, which is probably out of scope for our lab budget.
Yes, the idea is a delta: just the platforms/devices that are fresh and likely to be flaky (e.g. the recent addition of Android runtime tests would have benefited from this).
I hope we can leverage this to help with CI of community-supported architectures - e.g. @nealef has been working on IBM zSystem support, and it'd be great to get builds going in our upstream CI, to help spot issues early.
Makes sense for this new pipeline. Depending on hardware budget, we could run certain jobs only for rolling builds or even scheduled builds.
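For the scheduled-builds option, Azure Pipelines supports cron schedules per pipeline; a hedged sketch with illustrative values (the branch name and cron expression are made up):

```yaml
# Run the expensive staging legs once a day instead of per PR or per rolling build.
schedules:
- cron: "0 6 * * *"        # illustrative: daily at 06:00 UTC
  displayName: Daily staging run
  branches:
    include:
    - main
  always: true             # run even if nothing changed since the last scheduled run
```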
Would https://github.com/microsoft/azure-pipelines-agent.git allow us to plug the existing Linux on Z virtual machines into this infrastructure? Those are currently Jenkins workers, but I can install whatever you like on them, and they already have Docker up and running if you need to do work in containers.
cc @mthalman @MichaelSimons @MattGal for the question above.
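For context on the question above: self-hosted machines can generally be registered into an Azure Pipelines agent pool (by installing the agent from the linked repository) and then targeted from a job. A minimal hedged sketch, with a hypothetical pool name:

```yaml
jobs:
- job: LinuxOnZ_tests
  pool:
    name: LinuxOnZ            # hypothetical self-hosted agent pool containing the IBM Z VMs
  steps:
  - script: ./run-tests.sh    # placeholder test invocation that runs on the self-hosted agent
```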
For the zSystem support I talked with @nealef and we'll set up a branch in dotnet/runtimelab to work on that. We can already cross-build thanks to dotnet/arcade#5863 and dotnet/dotnet-buildtools-prereqs-docker#351 so that should cover the build aspect. For running tests we'd need a Helix queue, but we might be able to use the s390x Docker images running via QEMU if we can't plug in a physical system.
IBM have previously offered a VM in their s390x cloud for CI, but someone with sufficient authority needs to sign a liability waiver (and someone needs to configure it to actually work).
Linux on Z is new to me; we'd have to try it out, but from reading it seems like we could just use Helix Docker support for testing. It's an easy experiment to try when we're ready to do it.
I want to start a discussion here for the next proposal after talking to @sunandabalu and @markwilkie and gathering some data points: make this build definition a two-purpose build definition.
Open Questions

These are questions that have come up from talking to people about this.

Should we just make the outerloop libraries builds part of this pipeline in PRs, without failing the build and only marking it as partially succeeded?

I gathered some numbers on the amount of machine time we spend to run a Windows x64 test payload using CoreCLR for innerloop tests vs. outerloop tests. The result was: "InnerloopTime": 00:51:30.2820000. Based on this I would be very cautious about doing that, as adding ~1:32 of machine time to each Helix queue per PR would be very bad; I don't think we have enough capacity to manage that.

How are we going to ensure the build is relevant to people in their PRs?

The build is always going to be green, which means someone can break a staging test or an innerloop test on these new platforms. How are we going to keep the build from being constantly broken, and keep test failures from piling up to the point where the data is just too big for this to be useful?

Should we add two pipelines instead of one?

Because of the concern from the previous question, there was an idea of adding a

cc: @danmosemsft @dotnet/runtime-infrastructure @jkotas @stephentoub for opinions?
In my opinion, reliability should be the metric that defines in which pipeline a test/leg runs, and cost should be the metric that defines how often a test/leg runs. What's the benefit of introducing yet another pipeline?
In my opinion, it would be being intentional about what tests run on that pipeline. Outerloop tests are not tests that are flaky or in quarantine trying to make their way back to the main pipeline. Outerloop tests are tests that take a really long time, modify machine state, require admin elevation, etc. I think of the Outerloop pipeline as having a very different purpose than this one. For example, that one is used by the networking team because there are tests that will never be in the main pipeline because of network reliability. So how would you separate new platforms, tests that are staged, and tests that will never be reliable enough to run on CI or that take a really long time?
To echo what @safern has said above, each of the pipelines has a different purpose, life span, and outcome for the tests running in it. It also helps in setting clear owners and expectations around test runs. Plus it's easier to look at test histories in each of the pipelines and spot trends, etc.
I'm a little unclear whether what's being proposed is a solution to bringing up new platforms (which makes sense) or some generalized staging system for flaky or unproven tests. I am not sure of the value of the latter, at least for libraries. Our flakiness seems evenly distributed among new and old tests, and among test/product bugs vs. environmental problems (OS, network). The latter can only be addressed by some of the resilience ideas we've discussed before, e.g. reruns on other machines, or statistical measures.
I think the onboarding approach is a good idea and can be helpful both for new platforms and for any that may significantly misbehave after the fact. I can see some benefits to the flaky pipeline, but maybe, as Dan said, built-in resilience solves most of the problems.
Agreed that the right long-term solution to handle flaky/transient tests is to have an in-build resiliency mechanism like auto-retry. Moving them out / sequestering them to a separate pipeline is an intermediate band-aid to get the main PR builds green and have a path to be able to not merge on red (which is happening 50% of the time now).
We have an existing interim solution: disable the flaky test and open an issue on it. Why is that not good enough as an interim solution? I think it would be better to spend energy on creating the long-term solution.
Now that these points are brought up, I agree with them. Moving tests to a "Staging" category would require a code change, so opening an issue and disabling the flaky test as we do today achieves the same. I think we should just focus on the new platform onboarding pipeline and just disable flaky tests, or add retry to the ones that can't be fixed to be more reliable (I believe there are some crypto ones that hit an OS bug that repros once every million times).
The same way we separate platforms that only run per rolling build from platforms that run per PR build. We have that mechanism already in place. It's possible to only run a set of tests on the same pipeline and filter on the type of build: rolling, PR, scheduled, triggered by comment. Not saying we should do that, but we should definitely consider the pros and cons of adding a new pipeline vs. reusing an existing one.
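As a rough illustration of the filtering mechanism described above (a sketch, not the repo's actual configuration; the job name and script are made up), a job in the same pipeline can be gated on the build reason:

```yaml
jobs:
- job: NewPlatformBringup          # hypothetical bring-up leg that should not run per PR
  # Only run for rolling (CI) and scheduled builds, never for pull request validation.
  condition: in(variables['Build.Reason'], 'IndividualCI', 'BatchedCI', 'Schedule')
  pool:
    vmImage: ubuntu-latest
  steps:
  - script: ./build.sh --test      # placeholder for the real build/test invocation
    displayName: Run bring-up tests
```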
But if we do that, we're moving away from the purpose of this issue, which is onboarding new platforms. Running outerloop tests in PRs without failing the build provides no value to that scenario whatsoever; in fact, it just pollutes the build with data: if there are a bunch of outerloop test failures, investigating what is needed to get the new platform stable will be harder. I think outerloop should stay where it is. If we want to move it to run as rolling, that's a separate discussion and not part of this issue.
I never said we should run tests that are currently marked as Outerloop per PR.
Oh, that is what I was understanding from previous offline discussions. Then it seems like this issue has the info I need for new platform onboarding.
Outerloop tests should continue to run as is. My point is that we should consider using the outerloop pipeline for the platform bring-up as well. Again, I'm not saying we should do that, but I would like to discuss the pros and cons of it. From the point of view of a naïve developer who interacts with dotnet/runtime, it's already confusing that some PR legs link to entirely different build pipelines. In addition to that, I don't think it's clear what the difference of

On a broader topic, is there documentation available with best practices on when to set up a new build pipeline vs. using an existing one?
Pros of a new pipeline:
Cons of a new pipeline:
That I agree with... I think we should rename,
No, there is not. But I think each pipeline should have its own purpose. So adding a new pipeline vs. using an existing one should be based on: what's the purpose of the new pipeline, and can we achieve that with an existing pipeline? If achieving it with an existing pipeline would defeat that pipeline's purpose, then add a new one. The less "special" a pipeline is and the fewer things it does, the better, in my opinion. Also, another open question I have for this discussion: for new platforms, should we only ignore test failures for the build outcome? That is, should the build fail in PRs if there is a build failure? I would say yes, as we should at least give the feature crew some protection for building the vertical on that new platform.
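One way to implement "only ignore test failures for the build outcome", as asked above, is to scope continueOnError to the test step while leaving the build step fatal. A sketch under the same assumptions as the earlier snippets (script names are placeholders):

```yaml
steps:
- script: ./build.sh               # build step: a failure here still fails the PR leg
  displayName: Build the vertical for the new platform

- script: ./run-tests.sh           # test step: failures only yield "partially succeeded"
  displayName: Run tests on the new platform
  continueOnError: true
```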
The thing I would ask is that we're able to query (or somehow know) which tests have been deemed "flaky", which will let us verify assumptions and retry logic for the resiliency work. This is because there are two types of flakiness: tests which fail intermittently, and random tests which fail intermittently. I presume only the former would be annotated as "flaky"? The immediate goal here is to get the noise level down by taking care of the stuff we know of (e.g. new platform bring-up, etc.). Once that's complete, we'll have much better optics into the nature of the flakiness such that we can start to implement mechanisms to be more resilient.
Based on the conversation above, we're ready to at least create the new staging pipeline and use it for new test platforms until they stabilize. We have a good reason now: we will be moving the Android test runs that use an emulator. Plus, this will be used for new targets we plan to add to our CI; interpreter legs for iOS and Android, and also tvOS, will come soon. Working on this so that we can get something working next week.
Our repo has been working diligently to increase test coverage for all of our various platforms, but very often when we first onboard a new test target (e.g. WASM on Browser tests, Android runtime tests, etc.) we experience flakiness over the first few days due to issues in the underlying infrastructure that only pop up at the scale we test (#43983 is a recent example).
In CI Council, we discussed possibly creating a separate "runtime-staging" pipeline of sorts where we initially include the new testing legs. Once the new legs stabilize, we'd move them into the regular "runtime" pipeline. This would enable us to continue to add new platforms without the spurious failures that only appear at scale affecting the main PR/CI pipelines and tanking our pass rate.
cc: @dotnet/runtime-infrastructure @ViktorHofer