Create the set of criteria (a test plan) for marking builds 'good' #186
A good discussion, Shelley. Let me try to describe how I might define those criteria you have suggested.

Constraints

The factors that impact our test plans include: the length of time available to run the tests, the number and types of machines available to run tests, and the number of tests we have at our disposal. Any more constraints?

Quality Levels

In theory we can introduce more quality levels, such as builds that pass a longer set of sanity checks (called 1 hr sanity?), those that pass all our automated release tests on one fast platform but have not had any manual testing (called JCK candidates?), and binaries that have the assurance of a modest level of testing on at least every different CPU / OS type we distribute (called platform coverage?). However, there is a risk that introducing too many binary quality markers causes confusion, but it is worth thinking about what is going to be useful to our target users.

Practical Issues

So, as a starting point, here are my test playlist criteria:
Release builds are our gold standard of binaries. These have undergone the best testing we can perform, including functional tests, running applications, and checking that the performance is acceptable. The tests represent real-world usage of the binaries. Ideally, the test framework would accept containers configured with third-party test suites and a standard way of invoking the image with a candidate binary. If each of the third-party test suites (e.g. Tomcat tests, Eclipse tests, Lucene tests, Scala tests, etc.) reports success with the candidate, then we can be assured that it is capable of running real applications. Application owners can ask for their tests to be part of our release testing by following the container rules. At some point, the release builds are pushed into the JCK pipeline, which requires at least some level of manual testing before a build is flagged as a JCK-compatible release.
Nightly builds that are marked as good give developers and users confidence that the changes introduced since the last good nightly build have not caused any significant regression on any platform that we cover. The builds are limited by the number of tests we can run within the 24hr period between nightlies being produced, so they are predominantly limited by time and build farm capacity. These should be as close to release quality as possible. The tests run nightly will be those that give the best bug-finding value (based on historical evidence), are fully automated, and cover a broad spectrum of platform and functional assurances. In some cases, users may pick a nightly build to get a "hot" patch that is not yet available in a release, because releases are expensive and infrequent, and PR builds have significantly lower quality assurances. Nightlies are missing the long "burn-in" tests and heavy performance runs, and do not have the full suite of application and JCK tests applied, because we have to complete all the tests in a reasonable time. They are called "nightlies" to encourage feedback to the community within 24hrs.
PR builds are highly time sensitive. Ideally, the PR build is very quick so the developer gets immediate feedback before they move on to their next task. For example, by using an incremental build rather than a full clean and rebuild, the developer should know within 5 mins if the code has compiled and passed basic sanity checks on one platform. Within an hour the developer should know that the change has passed PR sanity checks on all platforms, and is permitted to go into the nightly testing regime. Ideally the test framework will target the most appropriate tests to run within the given time/machine budget, e.g. by figuring out what areas of the build are impacted by the change, such as the module that was modified, and selecting appropriate PR tests for that functional area and its dependencies. |
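The targeting idea for PR builds could be sketched roughly as follows. This is only an illustration under stated assumptions: the module names, the dependency map, the test names and their durations are all hypothetical, and nothing here reflects an existing AdoptOpenJDK tool. It walks from the modified modules to their dependencies, then fills the time budget with the shortest tests first.

```python
def affected_modules(changed, deps):
    """Return the changed modules plus everything they transitively depend on."""
    seen = set(changed)
    stack = list(changed)
    while stack:
        module = stack.pop()
        for dep in deps.get(module, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def select_pr_tests(changed, deps, tests_by_module, budget_minutes):
    """Pick tests for the affected modules that fit within budget_minutes."""
    modules = affected_modules(changed, deps)
    # Shortest tests first, so that more functional areas get some coverage.
    candidates = sorted(
        (minutes, name)
        for module in modules
        for name, minutes in tests_by_module.get(module, {}).items()
    )
    selected, used = [], 0.0
    for minutes, name in candidates:
        if used + minutes <= budget_minutes:
            selected.append(name)
            used += minutes
    return selected, used


# Hypothetical modules, dependencies and test durations (minutes).
deps = {"net": ["io", "core"], "io": ["core"], "gui": ["core"]}
tests_by_module = {
    "net": {"net_sanity": 10},
    "io": {"io_sanity": 15},
    "core": {"core_sanity": 20, "core_extended": 90},
    "gui": {"gui_sanity": 30},
}

selected, used = select_pr_tests(["net"], deps, tests_by_module, budget_minutes=60)
print(selected, used)
```

With a 60 minute budget and a change to the hypothetical `net` module, the sketch picks the three sanity tests for `net`, `io` and `core`, and skips `core_extended` because it would not fit. A real selector might walk the reverse map instead (modules that depend on the changed one); which direction is more useful is an open design question.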
I think @ShelleyLambert's suggested levels broadly make sense, and as @tellison mentions, the nightly builds will have to be able to complete in a reasonable timeframe (maybe that's an hour). All seems sound to me. |
Do we have the data yet to produce a table with execution times for the various test suites (with further breakdown into candidates for subsets of the test suites)? |
Those times vary enormously from one hardware and setup to another. If you, for example, run the whole test suite in a ramdisk, you can get to 1/2 of the time. Also, the time for jtreg or TCK runs is simply divided by the number of cores the machine has, until the moment X starts to fail. |
I believe that the set of tests should be named for each project x variant. |
As for releases versus release candidates: once Oracle stops tagging based on their good feeling/internal testing, the following could be applied:
This actually means that each release was tested at least two times before becoming public. That is good. I do similarly in RH: both sources and RPMs must pass everything to be considered release candidates. Only after that can such a build be published on the front page. |
We're looking at approximately 27 hours (single-threaded) for a JCK8 HotSpot run on the hardware we've got, running all the non-manual tests. |
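Taking the 27 hour single-threaded figure together with the earlier remark that jtreg/TCK time is roughly divided by the number of cores, a back-of-envelope estimate might look like the sketch below. The `efficiency` factor is an assumption added here to model less-than-perfect scaling; it is not a measured value, and the estimate says nothing about the point at which parallel runs start to fail.

```python
def estimated_wall_clock_hours(single_threaded_hours, workers, efficiency=1.0):
    """Estimate wall-clock time when a run is split across `workers` cores.

    `efficiency` (0 < efficiency <= 1) is a hypothetical fudge factor for
    imperfect scaling; 1.0 means the time divides exactly by the core count.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return single_threaded_hours / (workers * efficiency)


# 27 hours is the single-threaded JCK8 figure quoted in this thread.
for workers in (1, 3, 9):
    print(workers, estimated_wall_clock_hours(27, workers))
```

Under ideal scaling this gives 27, 9 and 3 hours for 1, 3 and 9 cores. Real numbers would have to come from timed runs on the actual machines.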
@sxa555 my understanding is that we are only obliged to run the full JCK on a major release. Security updates and bug fix releases etc do not require a full re-run; though we may want to include as much of the JCK as we can (time) afford to ensure any regressions are caught early. So I think we need to separate out the "full JCK recorded test run with interactive tests etc." as a special case that may require some out of band intervention for the regular release pipeline testing. |
JCK8 has 3 test suites available as executable JARs: Runtime, Devtools and Compiler.
Approximate execution time of these test suites on a single (good) machine: These can decrease further when the tests in the harness are run with the multi JVM group execution modes.
For JCK, the plan can be: release builds - nightly builds - Pull Request builds -
FYI: Manual/interactive tests can be done only once per platform per cycle. |
A few more points, after listening to "AdoptOpenJDK - Hangout 04/01/2018"
|
Updating this discussion with a few more pieces to this puzzle:
|
I am going to close this (and capture some of the main points of this discussion in the AQA doc linked to from #965). |
This issue is to discuss and decide what criteria we should use to mark an AdoptOpenJDK binary "good". To kick off the discussion, I propose the following goals:
For release builds, all tests at our disposal should pass, where "all" includes:
For nightly builds, a subset of all tests should be run and pass, where we explicitly state what tests are in the subset, and as more machines are available, we keep adding to the subset (to be as close to the entire list of tests that get run against release builds as we can, given the set of resources we have), starting off with:
For pull request builds, a small subset of tests should be run and pass. Ideally, this set is dynamic and selected to best test the change in the PR, but as a starting point, this set would be a short list that represents a sample from the broad spectrum of full tests we have.
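For the nightly goal above (an explicit subset that grows as more machines become available), one possible way to choose the subset is to rank suites by how many regressions they have historically caught per hour of run time, and add them until the time budget is used up. The sketch below assumes that approach; the suite names, durations and bug counts are invented for illustration, and treating extra machines as simply a larger hour budget is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suite:
    name: str
    hours: float            # estimated run time on the available hardware
    bugs_found: int         # historical count of regressions this suite caught
    automated: bool = True  # manual suites are never picked for nightlies


def pick_nightly_suites(suites, budget_hours):
    """Greedily pick automated suites by bugs found per hour within the budget."""
    candidates = [s for s in suites if s.automated and s.hours > 0]
    candidates.sort(key=lambda s: s.bugs_found / s.hours, reverse=True)
    chosen, used = [], 0.0
    for suite in candidates:
        if used + suite.hours <= budget_hours:
            chosen.append(suite)
            used += suite.hours
    return chosen, used


# Hypothetical figures, for illustration only.
example = [
    Suite("openjdk_regression", hours=6.0, bugs_found=30),
    Suite("system_tests", hours=10.0, bugs_found=20),
    Suite("long_burn_in", hours=48.0, bugs_found=12),
    Suite("manual_checks", hours=4.0, bugs_found=5, automated=False),
]

for budget in (24.0, 72.0):
    chosen, used = pick_nightly_suites(example, budget)
    print(budget, [s.name for s in chosen], used)
```

With a 24 hour budget the sketch keeps the two short suites (16 hours in total); raising the budget to 72 hours, as if more machines had been added, also admits the long burn-in suite. The greedy ranking is a heuristic, not an optimal packing.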
For background, here is a brief presentation on testing at AdoptOpenJDK: https://youtu.be/R3rdLIC089k