[develop] Common infrastructure for CIs and manual runs #378

Merged

danielabdi-noaa
Collaborator

@danielabdi-noaa danielabdi-noaa commented Sep 25, 2022

DESCRIPTION OF CHANGES:

This PR addresses the issue described in detail in issue #377. Basically, it makes both CIs (Jenkins and GitHub Actions) use the same build and test infrastructure, which in particular simplifies Jenkins a lot.

The fundamental test cases (the 9 cases used in the GitHub Actions CI) are all green[1] on all platforms. This covers the tests as well as the build. Here is the Jenkins pipeline for future reference. I think this PR is high priority because of this.

[1] Although Cheyenne tests are red, all tests completed successfully on the system. There is some weird issue with tarring the results that does not happen on other systems.

Update: The Cheyenne issues are fixed, and here is the latest Jenkins pipeline result, which succeeded on all platforms including Cheyenne.

Detailed description of changes:

  • Build for Jenkins is done with test/build.sh
  • WE2E tests for Jenkins are run with tests/WE2E/setup_WE2E_tests.sh
  • Configuration for both Jenkins and GitHub Actions is done with files named fundamental, comprehensive, or any other name such as custom. To create a configuration for a specific machine, for example when a test case cannot be run successfully on that machine, create a config file of the form fundamental.jet, etc.
  • Job status monitoring in Jenkins is done with existing script tests/WE2E/get_expts_status.sh
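
As a rough illustration of the machine-specific override described above, a lookup along these lines would prefer fundamental.jet over fundamental when both exist. The function name select_test_list and the directory argument are hypothetical and not taken from this PR; the actual scripts may resolve the files differently.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of this PR): pick the configuration file
# for a test suite, preferring a machine-specific variant when present.
select_test_list() {
  local suite="$1"     # e.g. fundamental, comprehensive, custom
  local machine="$2"   # e.g. jet, hera (lowercase, as in fundamental.jet)
  local dir="$3"       # directory holding the files (an assumption)

  if [ -f "${dir}/${suite}.${machine}" ]; then
    # Machine-specific file wins, e.g. fundamental.jet on Jet.
    printf '%s\n' "${dir}/${suite}.${machine}"
  elif [ -f "${dir}/${suite}" ]; then
    # Otherwise fall back to the generic file.
    printf '%s\n' "${dir}/${suite}"
  else
    echo "no configuration found for suite '${suite}'" >&2
    return 1
  fi
}
```

With this shape, a machine only needs its own file when a test case has to be dropped or swapped there; every other machine keeps using the shared one.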

With these changes, the Jenkins source code is simpler and more robust, because it no longer duplicates code and instead uses the existing, mature scripts for build and test.

Moreover, bug fixes for Jenkins tests include:

  • Bug fix for Gaea (pre-generated grid directory was not being set in machine/gaea.sh).
  • Bug fix for Cheyenne.
    • Conda was not being loaded because python was loaded in build_cheyenne_intel/gnu and then wflow_cheyenne did not unload python.
    • set +u needed for SLURM_JOB_ID check
  • The Orion issue seems to have gone away without any change on our side (maybe a software update on the system fixed it)
  • NOAA Cloud still has issues with a libpng version not being found (disabled for now; will come back to it after the issues are resolved)
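
On the set +u item above: under set -u (nounset), expanding an unset variable is a fatal error, and SLURM_JOB_ID is only defined inside a Slurm job. The following is a minimal sketch of a guarded check, assuming the calling script runs with nounset enabled. The helper name in_slurm_job is hypothetical and the PR's actual check may look different.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from this PR): succeed only inside a Slurm job.
in_slurm_job() {
  # Remember whether nounset was active so it can be restored afterwards.
  local had_nounset=0
  case "$-" in *u*) had_nounset=1 ;; esac

  set +u                          # allow expanding a possibly unset variable
  local job_id="$SLURM_JOB_ID"    # empty when not running under Slurm
  [ "$had_nounset" -eq 1 ] && set -u

  [ -n "$job_id" ]
}

# Demo in a subshell so strict mode does not leak into the caller.
(
  set -u
  if in_slurm_job; then
    echo "inside Slurm job ${SLURM_JOB_ID}"
  else
    echo "not inside a Slurm job"
  fi
)
```

An alternative that avoids toggling the option is the default-value expansion ${SLURM_JOB_ID:-}, which expands to an empty string when the variable is unset.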

I have left the selection of comprehensive test cases to get a green on all machines for a future PR. Jet is especially difficult since a test case may succeed or fail randomly on it for unknown reasons.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

  • hera.intel
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)

DEPENDENCIES:

None

DOCUMENTATION:

None

ISSUE:

#377

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

LABELS (optional):

A Code Manager needs to add the following labels to this PR:

  • Work In Progress
  • bug
  • enhancement
  • documentation
  • release
  • high priority
  • run_ci
  • run_we2e_fundamental_tests
  • run_we2e_comprehensive_tests
  • Needs Cheyenne test
  • Needs Jet test
  • Needs Hera test
  • Needs Orion test
  • help wanted

CONTRIBUTORS (optional):

@jessemcfarland @christinaholtNOAA

@danielabdi-noaa added labels on Sep 25, 2022: ci-hera-intel-WE (Kicks off automated workflow test on hera with intel), ci-jet-intel-WE (Kicks off automated workflow test on jet with intel), run_we2e_coverage_tests (Run the coverage set of SRW end-to-end tests)
@venitahagerty removed labels on Sep 25, 2022: ci-jet-intel-WE, ci-hera-intel-WE
@venitahagerty
Collaborator

venitahagerty commented Sep 25, 2022

Machine: hera
Compiler: intel
Job: WE
Repo location: /scratch1/BMC/zrtrr/rrfs_ci/autoci/pr/1066325011/20220925160516/ufs-srweather-app
Build was Successful
Rocoto jobs started
Long term tracking will be done on 1 experiments
If test failed, please make changes and add the following label back:
ci-hera-intel-WE
Experiment Succeeded on hera: custom_GFDLgrid
All experiments completed

@venitahagerty
Collaborator

venitahagerty commented Sep 25, 2022

Machine: jet
Compiler: intel
Job: WE
Repo location: /lfs1/BMC/nrtrr/rrfs_ci/autoci/pr/1066325011/20220925160511/ufs-srweather-app
Build was Successful
Rocoto jobs started
Long term tracking will be done on 1 experiments
If test failed, please make changes and add the following label back:
ci-jet-intel-WE
Experiment Succeeded on jet: custom_GFDLgrid

@danielabdi-noaa added labels on Sep 25, 2022: ci-hera-intel-WE, ci-jet-intel-WE
@venitahagerty removed labels on Sep 25, 2022: ci-jet-intel-WE, ci-hera-intel-WE
@danielabdi-noaa added labels on Sep 25, 2022: ci-hera-intel-WE, ci-jet-intel-WE
@venitahagerty removed labels on Sep 25, 2022: ci-hera-intel-WE, ci-jet-intel-WE
@venitahagerty
Collaborator

venitahagerty commented Sep 25, 2022

Machine: hera
Compiler: intel
Job: WE
Repo location: /scratch1/BMC/zrtrr/rrfs_ci/autoci/pr/1066325011/20220925182009/ufs-srweather-app
Build was Successful
Rocoto jobs started
Long term tracking will be done on 9 experiments
If test failed, please make changes and add the following label back:
ci-hera-intel-WE
Experiment Succeeded on hera: nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_RRFS_v1beta
All experiments completed

@ufs-community ufs-community deleted a comment from venitahagerty Sep 25, 2022
@ufs-community ufs-community deleted a comment from venitahagerty Sep 25, 2022
@venitahagerty
Collaborator

venitahagerty commented Sep 25, 2022

Machine: jet
Compiler: intel
Job: WE
Repo location: /lfs1/BMC/nrtrr/rrfs_ci/autoci/pr/1066325011/20220925182013/ufs-srweather-app
Build was Successful
Rocoto jobs started
Long term tracking will be done on 9 experiments
If test failed, please make changes and add the following label back:
ci-jet-intel-WE
Experiment Succeeded on jet: nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
Experiment Succeeded on jet: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR
Experiment Succeeded on jet: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
Experiment Succeeded on jet: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR
Experiment Succeeded on jet: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_RRFS_v1beta

Collaborator

@MichaelLueken MichaelLueken left a comment

@danielabdi-noaa One quick question with the modification in ush/launch_FV3LAM_wflow.sh - the comment states that this change is a hack for Gaea, but the logic is for Cheyenne, with a Cheyenne directory. Should the comment read:

Hack for Cheyenne

instead?

ush/launch_FV3LAM_wflow.sh (outdated, resolved)
@danielabdi-noaa
Collaborator Author

danielabdi-noaa commented Oct 5, 2022

@MichaelLueken I think this PR is now ready to go in. All the fundamental tests completed successfully on Cheyenne with both gnu and intel. All other systems except Hera are also green on the fundamental tests. Hera is unusually slow today; I've just checked and the jobs are still queued, so it will probably finish green by tomorrow. I will still push a commit fixing some comments, so here is the Jenkins pipeline for future reference. I've also put it in the PR description.

Jenkins fundamental test run

Edit: All machines have finished successfully with fundamental tests now.

Collaborator

@MichaelLueken MichaelLueken left a comment

Thank you very much for working through the issues on the tier-1 machines, @danielabdi-noaa and @jessemcfarland! I have submitted the fundamental WE2E tests on Hera and they all passed successfully. Before submitting my approval, I see this morning's Jenkins run has failed - https://jenkins-epic.woc.noaa.gov/blue/organizations/jenkins/ufs-srweather-app%2Fpipeline/detail/PR-378/28/pipeline. Will this need to be addressed before approval and merging, or is this an issue with either GitHub or Jenkins?

@danielabdi-noaa
Collaborator Author

@MichaelLueken Yes, it was a GitHub issue. There is an ongoing re-run that should finish in an hour or so.

Jesse McFarland added 2 commits October 6, 2022 12:03
Only remove the data directories to allow we2e cron jobs to complete
and clean up themselves correctly.
Collaborator

@christinaholtNOAA christinaholtNOAA left a comment

This is a really nice addition, @danielabdi-noaa. Thanks!

Just one possible correction below.

# Array of all optional rrfs_utl executables built
#-----------------------------------------------------------------------
executables_created=( adjust_soiltq.exe \
check_imssnow_fv3lam.exe \
Collaborator

Should this one also be a +=?

Collaborator Author

Thanks, just fixed the bug!
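
For readers unfamiliar with the distinction raised in this thread, the sketch below shows why = versus += matters for a bash array. It assumes the array had already been populated earlier in the script; core_a.exe and core_b.exe are placeholder names, and only the two rrfs_utl executables are taken from the quoted snippet.

```bash
#!/usr/bin/env bash
# Placeholder entries standing in for executables registered earlier.
executables_created=( core_a.exe core_b.exe )

# Copy of the array, used to show what plain assignment does.
overwritten=( "${executables_created[@]}" )

# Plain "=" replaces the whole array, silently dropping earlier entries.
overwritten=( adjust_soiltq.exe \
              check_imssnow_fv3lam.exe )

# "+=" appends to the existing array instead.
executables_created+=( adjust_soiltq.exe \
                       check_imssnow_fv3lam.exe )

echo "with  =: ${#overwritten[@]} entries"          # prints: with  =: 2 entries
echo "with +=: ${#executables_created[@]} entries"  # prints: with +=: 4 entries
```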

@@ -1040,7 +1040,7 @@ def generate_FV3LAM_wflow():
following line can be added to the user's crontab (use \"crontab -e\" to
edit the cron table):

-*/3 * * * * cd {EXPTDIR} && ./launch_FV3LAM_wflow.sh called_from_cron=\"TRUE\"
+*/{CRON_RELAUNCH_INTVL_MNTS} * * * * cd {EXPTDIR} && ./launch_FV3LAM_wflow.sh called_from_cron=\"TRUE\"
Collaborator

NICE!
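
The effect of the change in this hunk can be sketched in shell as follows: the relaunch interval becomes a parameter instead of the hard-coded 3 minutes. The function make_cron_line is hypothetical (the PR makes this change inside the Python function generate_FV3LAM_wflow), and the default of 3 only mirrors the value that was previously hard-coded.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the crontab entry that relaunches the workflow.
make_cron_line() {
  local exptdir="$1"          # experiment directory (EXPTDIR)
  local interval="${2:-3}"    # relaunch interval in minutes (CRON_RELAUNCH_INTVL_MNTS)

  # "*/N" in the minute field means "every N minutes".
  printf '*/%s * * * * cd %s && ./launch_FV3LAM_wflow.sh called_from_cron="TRUE"\n' \
    "$interval" "$exptdir"
}

make_cron_line /path/to/expt 5
# prints: */5 * * * * cd /path/to/expt && ./launch_FV3LAM_wflow.sh called_from_cron="TRUE"
```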

Collaborator

@MichaelLueken MichaelLueken left a comment

The rerun of the Jenkins pipeline has completed successfully. I will now give my approval to these changes.

Labels
Priority: HIGH run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests