-
Notifications
You must be signed in to change notification settings - Fork 317
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CLM deaths in restFile_enddef on cheyenne in some cases due to overflow of float to integer on restart #165
Comments
Ben Andre < andre > - 2017-04-14 11:52:42 -0600 We discussed this on the fates SW call, and the feeling was that the restart interface has change so much that it's not worth spending a lot of time on. If a quick look at this points to a an issue with fates, then please mark as expected fail and cheyenne will be tested and fixed with the upcoming fates reintegration. |
Erik Kluzek < erik > - 2017-04-18 17:46:51 -0600 In the DDT debugging session I learned more about the problem here. The error happens on the PIO end definition call. I think what the problem is that the dimension of some variables exceed 32-byte integers (multiplying the dimensions together). numCohorts for this case is about 33 million. So multiplying by a thousand you exceed a signed 4-byte integer. It looks like the open statements are supposed to be opening with large file support 64BIT_OFFSET, so this shouldn't be a limitation, and yet it is. The call it dies in is restFile_enddef in restFile_write...
I tried building everything with 8 byte integers with "-i8" to the compiler and that failed in trying to resolve overloaded functions based on the data types. |
Erik Kluzek < erik > - 2017-04-19 00:59:30 -0600 Jim suggested I try opening up 64BIT_DATA rather than 64BIT_OFFSET. So I tried that and it fails the same way. Index: ../../../components/clm/src/main/ncdio_pio.F90.in--- ../../../components/clm/src/main/ncdio_pio.F90.in (revision 84459)
|
Ben Andre < andre > - 2017-04-25 18:09:18 -0600 TL,DR: A work around for this problem is to change the edTest to use netcdf instead of pnetcdf. I followed up on Jim and Erik discussion in Comment 3 by setting both PIO_64BIT_OFFSET and PIO_64BIT_DATA, but it didn't fix the error:
This case uses pio1 and pnetcdf by default. Replace pnetcdf for netcdf then this will run to completion: $ ./xmlquery PIO_TYPENAME It looks like yellowstone and cheyenne are using:
Looking at the release notes for pnetcdf: https://trac.mcs.anl.gov/projects/parallel-netcdf/wiki/ReleaseNotes-1.7.0 Other updates:
According to the netcdf doc http://www.unidata.ucar.edu/software/netcdf/docs/netcdf_introduction.html NetCDF Classic Format
Looking at the hexdump of the file that I think is being written seems to indicate that this is a 64bit file:
So I guess I don't really understand the netcdf file formats, but it seems like a contradiction that pnetcdf is now consistent with netcdf v2 format, but limits the size to 32 bits. Maybe something changed between pnetcdf 1.6 and 1.8 that we need to update the internals of the I/O stack? |
Erik Kluzek < erik > - 2017-04-26 16:06:01 -0600 I talked to Ben about this afternoon. Jim has an idea about what is happening, and something bad in our definitions for ED is causing it to output in NetCDF2 format. This is also due to bad behavior on pnetcdf's, which is why going to netcdf works. What I realized in 2445, is that even on yellowstone some of the files are being produced but I can't read them (a ncdump tells me the file is in a unrecognized format). So this problem affects yellowstone as well as it's creating files you can't read. cheyenne is actually being more courteous and stopping rather than creating mystery files that are unreadable. Although CLM is apparently able to read them because restart tests seem to work. |
Jim Edwards < jedwards > - 2017-04-26 16:09:44 -0600 I think that you'll find that the files are unreadable only in certain versions of ncdump. You may need to load a netcdf mopdule so that you point to a newer version of this tool. |
Erik Kluzek < erik > - 2017-04-26 17:43:26 -0600 (In reply to Jim Edwards from comment #6)
OK, I was able to find a version of ncdump that would work. Although only the version 4.4 versions would work, all of the older ones wouldn't. They give this... ncdump -h $MYSCR/SMS_D_Ld5.f45_f45.ICLM45ED.yellowstone_intel.clm-edTest.20170424_133019_2sh1t7/run/SMS_D_Ld5.f45_f45.ICLM45ED.yellowstone_intel.clm-edTest.20170424_133019_2sh1t7.clm2.r.0001-01-06-00000.nc But, then tools built on top of netcdf -- like ncview -- won't necessarily work presumably depending on the version of netcdf they were built with. |
Ben Andre < andre > - 2017-04-28 19:18:14 -0600 TL,DR : My recommendation is to remove this test and mark the issue as won't fix. Jim pointed out that the CDF2 format issue I was seeing was because 64bit offset and data are mutually exclusive. I reverted to 64bit data and it is correctly generating a CDF5 format file. So that was a red herring. I verified that this is a problem with ED specific code by trying a couple of permutations on the test without ED and commenting out the calls to ED restart. More debugging led me further into the ED restart stack which has been massively refactored in the upcoming fates tag. I ran the failing test on my incoming branch and it passes. I'm running the full ED/fates test suite on cheyenne this weekend to verify my branch is OK, and I'll ensure ED/Fates tests work on cheyenne when my tag comes in. My recommendation is to remove this test and mark the issue as won't fix. |
Ben Andre < andre > - 2017-05-02 11:38:13 -0600 Update: I ran all ed and clm tests over the weekend and the results were as follows: yellowstone: ed : cheyenne: So, there's still a problem on cheyenne with high res, but only with clm5fates, not clm45ed. Differences between clm45 and clm5 from the same test run: hexdump -C ../ERS_D_Ld5.f09_g16.ICLM45ED.cheyenne_intel.clm-edTest.C.04281953-edi/run/ERS_D_Ld5.f09_g16.ICLM45ED.cheyenne_intel.clm-edTest.C.04281953-edi.clm2.r.0001-01-04-00000.nc | head hexdump -C ../ERS_D_Ld5.f09_g16.ICLM50FATES.cheyenne_intel.clm-edTest.C.04281953-edi/run/ERS_D_Ld5.f09_g16.ICLM50FATES.cheyenne_intel.clm-edTest.C.04281953-edi.clm2.r.0001-01-04-00000.nc | head Diff-ing the cases, it looks like with clm45 is using netcdf by default, while clm5 is using pnetcdf by default. Reviewing old fates history to figure out why, we encounter an issue a year ago that resulted in just setting all ED compsets to default to netcdf instead of pnetcdf. So, this is probably still broken for all ED/FATES compsets in the upcoming branch. I spend a all day yesterday trying to track this down further by isolating the variables that cause the problem. I identified a couple of them, but haven't figured out yet why they are different. I'll try to look at this more, but I think I'm going to need a debug build of the I/O stack to understand what pnetcdf is unhappy with. I need to do some other cleanup stuff to stay on track with the fates reintegration. |
Erik Kluzek < erik > - 2017-06-02 11:52:20 -0600 Chris is now seeing this same problem in F compsets... ERP_Ln9.f09_f09.F1850_DONOTUSE.cheyenne_intel.cam-outfrq9s_clm5 |
Chris Fischer < fischer > - 2017-06-12 14:39:50 -0600 The only place I'm seeing this error now is in ERP_Ln9.f09_f09_mg17.FHIST_DEV.cheyenne_intel.cam-outfrq9s_clm5 |
Chris Fischer < fischer > - 2017-06-12 14:40:32 -0600 I should also add this is for cesm2_0_alpha06o. |
Cheryl Craig < cacraig > - 2017-06-12 15:40:28 -0600 So did CLM come up with a fix for this bug, or did it just "go away". Looking back over this exchange, it's not sounding like its an easy bug to find and fix. |
Ben Andre < andre > - 2017-06-12 15:49:47 -0600 It went away. It was originally only occurring when ED/fates was on. My original thought was that ED was doing something wrong with the indexing or data to cause the overflow, but after debugging it for a while and talking with Jim, we started thinking it was another issue where a struct was being incorrectly copied and something was getting corrupted or dropped. At some point when I was merging this stopped being reproducible, so I assumed it was actually a fates problem that was fixed by the fates restart refactor. |
Jim Edwards < jedwards > - 2017-06-12 20:40:23 -0600 It's there in the F compset, I've been able to reproduce with the simpler test An error in the enddef statement indicates that there was something wrong with an attribute.
at around lines 527 of file clm_InstMod.F90 commenting out that line causes the enddef to complete. |
Chris Fischer < fischer > - 2017-06-14 14:45:12 -0600 I was able to get FHIST_DEV to run by changing the number of nodes being used from 4 to 10. This matches the layout being used on yellowstone. |
Erik Kluzek < erik > - 2017-06-14 15:03:40 -0600 (In reply to Chris Fischer from comment #16)
This sounds like we just aren't giving enough processors for these cases on cheyenne, since memory (mostly) scales with processor count for CLM. So I'm wondering if the FATES case, might run if we ran it with more processors? |
Chris Fischer < fischer > - 2017-06-15 15:49:08 -0600 Brian E. was able to get f09_f09_mg17.FHIST_DEV.cheyenne_intel to fail using 37: pio_support::pio_die:: myrank= -1 : ERROR: nf_mod.F90: 1520 : |
Erik Kluzek < erik > - 2017-06-15 15:54:56 -0600 (In reply to Chris Fischer from comment #18)
Chris if this is a size/memory issue using threads would run into it quicker. Can you try with 30 nodes NTASKS=360 and NTHRDS=3? If that works, it might be useful to see if it seems to consistently fail for smaller layouts and work for larger ones. And then see at what point it starts to fail... |
Chris Fischer < fischer > - 2017-06-15 16:19:31 -0600 NTASKS=360 and NTHRDS=3 worked. |
Bill Sacks < sacks > - 2017-07-12 19:25:14 -0600 Renaming this issue to reflect its more general incarnation |
Bill Sacks < sacks > - 2017-07-12 19:29:32 -0600 It sounds like Jim E was on the track of this bug earlier (comment 15). There are only 10 or so variables in the offending routine. Would it make sense to comment them out one by one to see which one is the problem? |
I'm going to close this, because I think we must have fixed it, or else we wouldn't be running now. |
82d3b247f Merge pull request #178 from johnpaulalex/test_doc 3223f49ea Additional documentation of system tests - global variables, method descriptions. 45b7c01c3 Merge pull request #177 from jedwards4b/git_workflow ace90b2c2 try setting credentials this way f4d6aa933 try setting credentials this way 1d61a6944 use this to set git credentials 7f9d330e1 use this to set git credentials 5ac731b85 add tmate code 836847be7 get git workflow working dcd462d71 Merge pull request #176 from jedwards4b/add_github_testing 2d2479e9d Merge pull request #175 from johnpaulalex/fix 711a53fdf add github testing of prs and automatic tagging of main cfe0f888a fix typos 5665d6140 Fix broken checkout behavior introduced by PR #172. 27909e255 Merge pull request #173 from johnpaulalex/readall 00ad0440b Further tiny refactorings and docs of checkout API (no-op). Remove unused load_all param in _External.checkout(). Rename _External.checkout_externals() to checkout_subexternals(), to remove the ambiguity about whether the main external pointed to by the _External is itelf checked out (it is not) Clarify load_all documentation - it’s always recursive, but applies different criteria at each level. Rename variables in checkout.py (e.g. ext_description) to match the equivalent code in sourcetree.py. 2ea3d1a3a Merge pull request #172 from johnpaulalex/fixit 43bf8092c Merge pull request #171 from johnpaulalex/docstatus e6aa7d21e Merge pull request #170 from johnpaulalex/printdir adbd71557 On checkout, refresh locally installed optional packages regardless of whether -o is passed in. add074593 Comment tweaks, and fix 'ppath' typo 696527cb8 Document the format of various status dictionaries, and the various paths and path components within an _External. c677b9403 When processing an external, print out its path in addition to the base filename (to disambiguate all the externals.cfg's) 975d7fd5a Merge pull request #169 from johnpaulalex/docfix_branch 09709e36d Document _Externals.status(). The original comment was apparently copy-pasted from checkout(). 1d880e090 Merge pull request #167 from billsacks/fix_svn_on_windows 3510da848 Tweak a unit test to improve coverage eb7fc1368 Handle the possibility that the URL already ends with '/' 02ea87e3d Fix svn URLs on Windows b1c02ab54 Merge pull request #165 from gold2718/doc_fix 9f4be8c7b Add documentation about externals = None feature git-subtree-dir: manage_externals git-subtree-split: 82d3b247fc8fb3b70458bf2b4570ee6678754c4d
0f884bfec Merge pull request #205 from jedwards4b/sunset_svn_git_access 82a5edf79 merge in billsacks:svn_testing_no_github 17532c160 Use a local svn repo for testing 9c904341a different method to determine if in tests 539952ebd remove debug print statement cc5434fa7 fix submodule testing 1d7f28840 remove broken tests 04e94a519 provide a meaningful error message 38bcc0a8c Merge pull request #201 from jedwards4b/partial_match b4466a5aa remove debug print statement c3cf3ec35 fix issue with partial branch match 7b6d92ef6 Merge pull request #198 from johnpaulalex/gitdir 927ce3a98 Merge pull request #197 from johnpaulalex/testpath a04f1148f Merge pull request #196 from johnpaulalex/readmod d9c14bf25 Change the rest of the methods to use -C. Still some usage of getcwd in test_unit_repository_git. 332b10640 Fix incorrect logged path of checkout_externals in test_sys_checkout: it was basically the parent of the current directory, which varies throughout the test. (it called abspath with '{0}/../../', which adds arbitrary and not-interpolated subdir '{0}' to the path, then removes it and removes one more level). 932a7499b Remove printlog from read_gitmodules_file since read_externals_description_file() already has a nearly-the-same printlog (but add it to the other caller). 5d13719ed Merge pull request #195 from johnpaulalex/check_repo 423395449 Update utest to mock _git_remote_verbose in a new way, since it is now called via the GitRepository class rather than on the specific GitRepository instance. d7a42ae96 Check that desired repo was actually checked out. 71596bbc1 Merge pull request #194 from johnpaulalex/manic2 4c96e824e Make the MANIC_TEST_BARE_REPO_ROOT env var special - give it a constant for easy tracking, and automatically tear it down after each test. 259bfc04d test_sys_checkout: use actual paths in on-the-fly configs rather than MANIC_TEST_BARE_REPO_ROOT env var. This will make it easier to test (in the near future) that checkout_externals actually checked out the desired repo dir. 557bbd6eb Merge pull request #193 from johnpaulalex/manic 5314eede1 Remove MANIC_TEST_TMP_REPO_ROOT environment variable in favor of module-level variable. 345fc1e14 Merge pull request #191 from johnpaulalex/test_doc12 2117b843c test_sys_checkout: verify that basic by-tag/branch/hash tests actually take us to the correct git tag/branch/hash. 94d6e5f2b Merge pull request #190 from johnpaulalex/test_doc11 3ff33a6a8 Inline local-path-creation methods 47dea7f64 Merge pull request #189 from johnpaulalex/test_doc10 9ea75cbf8 Grab-bag of renamings: Remove redundant _NAME from repo constants, and consistently add _REPO suffix (This causes the majority of diffs). c0c847ec8 Merge pull request #188 from johnpaulalex/test_doc9 2dd5ce0f7 test_sys_checkout.py: only check for correct 'required' or 'optional' state in the test that exercises required vs optional behavior. Removed a lot of boilerplate. eb3085984 Merge pull request #187 from johnpaulalex/test_doc8 1832e1f84 test_sys_checkout: Simplify many tests to only use a single external. 8689d61ec Merge pull request #186 from johnpaulalex/test_doc7 fbee4253e Grab bag of test_sys_checkout cleanups: Doc inside of each test more clearly/consistently. TestSysCheckoutSVN didn’t get the inlining-of-helper-methods treatment, now it has that. Move various standalone repo helper methods (like create_branch) into a RepoUtils class. README.md was missing newlines when rendered as markdown. Doc the return value of checkout.main Fix test_container_exclude_component - it was looking for the wrong key (which is never present); now it looks for the correct key. f0ed44a6e Merge pull request #185 from johnpaulalex/test_doc6 a3d59f5f2 Merge pull request #184 from johnpaulalex/test_doc5 5329c8ba7 test_sys_checkout: Inline config generation functions that are only called once. 464f2c7a7 test_sys_checkout: Inline another layer (per-config-file checks). Rename the 4 methods that are used multiple times, to reflect what they do rather than what they're called. 8872c0df6 Merge pull request #183 from johnpaulalex/doc_test4 c045335f6 Merge pull request #182 from johnpaulalex/doc_test3 c583b956e Merge pull request #181 from johnpaulalex/doc_test2 e01cfe278 test_sys_checkout: less confusing handling of return values from checkout_externals. Specifically, when doing a checkout, don't return tree_status from _before_ the checkout. Make a new wrapper to call checkout_externals a second time, to calculate the new status after a checkout (very frequent pattern). 23286818c test_sys_checkout: Remove another layer (which generates test component names) c3717b6bc Merge pull request #180 from johnpaulalex/doc_test 36d7a4434 test_sys_checkout.py: remove one layer of functions (that check for local status enums). No-op. 2c4584bf7 More documentation about tests: * contents of test repositories (n a new README.md) * various constants in test_sys_checkout.py that point to those contents, and terminology like container/simple/mixed. * in each test method, the scenarios being tested. * The coupling between test methods. 55e74bd0a Merge pull request #179 from johnpaulalex/circ 66be84290 Remove circular dependency by making _External stop doing tricky things with sourcetrees. 82d3b247f Merge pull request #178 from johnpaulalex/test_doc 3223f49ea Additional documentation of system tests - global variables, method descriptions. 45b7c01c3 Merge pull request #177 from jedwards4b/git_workflow ace90b2c2 try setting credentials this way f4d6aa933 try setting credentials this way 1d61a6944 use this to set git credentials 7f9d330e1 use this to set git credentials 5ac731b85 add tmate code 836847be7 get git workflow working dcd462d71 Merge pull request #176 from jedwards4b/add_github_testing 2d2479e9d Merge pull request #175 from johnpaulalex/fix 711a53fdf add github testing of prs and automatic tagging of main cfe0f888a fix typos 5665d6140 Fix broken checkout behavior introduced by PR #172. 27909e255 Merge pull request #173 from johnpaulalex/readall 00ad0440b Further tiny refactorings and docs of checkout API (no-op). Remove unused load_all param in _External.checkout(). Rename _External.checkout_externals() to checkout_subexternals(), to remove the ambiguity about whether the main external pointed to by the _External is itelf checked out (it is not) Clarify load_all documentation - it’s always recursive, but applies different criteria at each level. Rename variables in checkout.py (e.g. ext_description) to match the equivalent code in sourcetree.py. 2ea3d1a3a Merge pull request #172 from johnpaulalex/fixit 43bf8092c Merge pull request #171 from johnpaulalex/docstatus e6aa7d21e Merge pull request #170 from johnpaulalex/printdir adbd71557 On checkout, refresh locally installed optional packages regardless of whether -o is passed in. add074593 Comment tweaks, and fix 'ppath' typo 696527cb8 Document the format of various status dictionaries, and the various paths and path components within an _External. c677b9403 When processing an external, print out its path in addition to the base filename (to disambiguate all the externals.cfg's) 975d7fd5a Merge pull request #169 from johnpaulalex/docfix_branch 09709e36d Document _Externals.status(). The original comment was apparently copy-pasted from checkout(). 1d880e090 Merge pull request #167 from billsacks/fix_svn_on_windows 3510da848 Tweak a unit test to improve coverage eb7fc1368 Handle the possibility that the URL already ends with '/' 02ea87e3d Fix svn URLs on Windows b1c02ab54 Merge pull request #165 from gold2718/doc_fix 9f4be8c7b Add documentation about externals = None feature a3b3a0373 Merge pull request #162 from ESMCI/fischer/python3 d4f1b1e8d Change shebang lines to python3 2fd941abc Merge pull request #158 from billsacks/modified_solution de08dc2ee Add another option for when an external is in a modified state e954582d0 Merge pull request #156 from billsacks/onbranch_show_hash 952e44d51 Change output: put tag/hash before branch name 10288430f Fix pre-existing pylint issues 01b13f78f When on a branch, show tag/hash, too 39ad53263 Merge pull request #150 from gold2718/fix_combo_config 75f8f02f5 Merge pull request #152 from jedwards4b/sort_by_local_path 42687bd53 remove commented code 29e26af81 fix pylint issues 7c9f3c613 add a test for nested repo checkout 75c5353d2 fix spacing 24a3726a1 improve sorting, checkout externals with each comp 29f45b086 remove py2 test and fix super call 880a4e765 remove decode 1c53be854 no need for set call 36c56dbac simplier fix for issue dc67cc682 simpler solution b32c6fca9 fix to allow submodule name different from path 5b5e1c2b0 Merge pull request #144 from billsacks/improve_errmsg c983863c4 Add another option for dealing with modified externals 59ce252cf Add some details to the error message when externals are modified be5a1a4d7 Merge pull request #143 from jedwards4b/add_exclude 2aa014a1b fix lint issue 49cd5e890 fix lint issues 418173ffd Added tests for ExternalsDescriptionDict afab352c8 fix lint issue be85b7d1b fix the test a580a570b push test d43710864 add a test 21affe33c fix formatting issue 72e6b64ae add an exclude option git-subtree-dir: manage_externals git-subtree-split: 0f884bfec8e43d0c02261de858d6ec3f6d855e51
Corrects CIME path in buildnml scripts.
Erik Kluzek < erik > - 2017-04-12 10:57:58 -0600
Bugzilla Id: 2442
Bugzilla CC: andre, cacraig, eaton, fischer, jedwards, mvertens, rfisher, sacks,
The following test fails on cheyenne due to overflow in conversion to a 4-byte integer.
ERS_D_Ld5.f09_g16.ICLM45ED.cheyenne_intel.clm-edTest
The traceback is...
1: RTM river discharge into ocean: ICE
1: pionfput_mod.F90 128 1 4 255 1
1: m3/s
1: pionfput_mod.F90 128 1 4 1 1 A
1: pio_support::pio_die:: myrank= -1 : ERROR: nf_mod.F90: 1508 :
1: Overflow when type cast to 4-byte integer.
73: pio_support::pio_die:: myrank= -1 : ERROR: nf_mod.F90: 1508 :
73: Overflow when type cast to 4-byte integer.
37: pio_support::pio_die:: myrank= -1 : ERROR: nf_mod.F90: 1508 :
37: Overflow when type cast to 4-byte integer.
109: pio_support::pio_die:: myrank= -1 : ERROR: nf_mod.F90: 1508 :
109: Overflow when type cast to 4-byte integer.
37:Image PC Routine Line Source
37:cesm.exe 0000000003A595ED Unknown Unknown Unknown
37:cesm.exe 0000000003498576 pio_support_mp_pi 120 pio_support.F90
37:cesm.exe 0000000003496920 pio_utils_mp_chec 59 pio_utils.F90
37:cesm.exe 0000000003465CDD nf_mod_mp_pio_end 1508 nf_mod.F90
37:cesm.exe 0000000000B9E118 ncdio_pio_mp_ncd_ 386 ncdio_pio.F90.in
37:cesm.exe 0000000000C28127 restfilemod_mp_re 731 restFileMod.F90
37:cesm.exe 0000000000C20C30 restfilemod_mp_re 109 restFileMod.F90
37:cesm.exe 00000000008B18E2 clm_driver_mp_clm 1197 clm_driver.F90
The text was updated successfully, but these errors were encountered: