Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fx files are selected differently depending on what OS the code runs on #1159

Closed
valeriupredoi opened this issue Jun 7, 2021 · 17 comments · Fixed by #1169 or #1216
Closed

Fx files are selected differently depending on what OS the code runs on #1159

valeriupredoi opened this issue Jun 7, 2021 · 17 comments · Fixed by #1169 or #1216
Assignees
Labels
bug Something isn't working testing
Milestone

Comments

@valeriupredoi
Copy link
Contributor

valeriupredoi commented Jun 7, 2021

Bug summary

this line loops through all possible MIPs - for the test fail the explanation is that on a Debian-based OS (Ubuntu like on my machine and Github Actions) the order in which the tables hence mip list is passes is random (but fixed each time the test runs) and it happens that IyrAnt is the last actual mip in the loop that is valid, whereas for CentOS (JASMIN and the CI machine), the tables hence the mips are ordered by UNIX order type: numerals, capitals then lower case last, so fx is the last valid mip so the test passes. This, however, uncovers a potential vulnerability - this is a rather random process of selecting the mip, and is not solid in my view - iike in the test case, for sftgif we have both fx and IyrAnt files available, and the file selection is performed depending on the OS where the code runs on - this is dodgy. Any ideas how to select the file that we need and is consistent across OS's? This potentially impactful since fx is not time-dependent whereas eg Oyr or IyrAnt are time-dependent!

Test fail that discovered the issue

This is an odd one: that test fails only on Ubuntu and OSX machines. The Ubuntu fail is reported in this closed test PR with the test failing on both my machine and GA both Ubuntu and the OSX fail is here - the test is failing since the filename key returns a file from a list of files that does not have fx in it. I am not sure if this is an issue with the actual preprocessor or if it's just the test object that misbehaves on different platforms. I'd like to investigate more but I need to write some slides now instead 😖 This is, however, a sneaky bug 🪲 @zklaus the anaconda package build may fail coz of this (depending what machine the tests are run on)

@valeriupredoi valeriupredoi added bug Something isn't working testing labels Jun 7, 2021
@valeriupredoi
Copy link
Contributor Author

valeriupredoi commented Jun 8, 2021

OK I found the bugger: this line loops through all possible MIPs - for the test fail the explanation is that on a Debian-based OS (Ubuntu like on my machine and Github Actions) the order in which the tables hence mip list is passes is random (but fixed each time the test runs) and it happens that IyrAnt is the last actual mip in the loop that is valid, whereas for CentOS (JASMIN and the CI machine), the tables hence the mips are ordered by UNIX order type: numerals, capitals then lower case last, so fx is the last valid mip so the test passes. This, however, uncovers a potential vulnerability - this is a rather random process of selecting the mip, and is not solid in my view - iike in the test case, for sftgif we have both fx and IyrAnt files available, and the file selection is performed depending on the OS where the code runs on - this is dodgy. Any ideas how to select the file that we need and is consistent across OS's? This potentially impactful since fx is not time-dependent whereas eg Oyr or IyrAnt are time-dependent!

@sloosvel @schlunma could you guys pls give priority to this issue, it's quite important for the FX functionality, and my apologies I've not spotted this at review point. @zklaus you may want to flag this for release, mate

@sloosvel
Copy link
Contributor

sloosvel commented Jun 8, 2021

I don't know, as far as I have seen in our experiments we only generate the fx vars in one of the possible mips. I think this test is more complicated because somehow it returns files in all of the possible mips. Whereas in a "real life" application the code would go through all mips but only find one possible output. Although I am not sure if in other institutes the fx vars are outputed in all the possible mips. But this should be solvable by specifying in the recipe which one you want to use.

@valeriupredoi
Copy link
Contributor Author

@sloosvel the data is inherently different - there are fx mip-ed sftgif's and there are IyrAnt mip-ed stfgif's - they are not mutually exclusive, models produce data for both mip's - the random nature of selecting the file based on how the mip is named on different OS's is NOT the right way

@zklaus zklaus added this to the v2.3.0 milestone Jun 8, 2021
@valeriupredoi valeriupredoi changed the title test_recipe.test_fx_list_mip_change_cmip6 fails on Ubuntu and OSX Fx files are selected differently depending on what OS the code runs on Jun 8, 2021
@zklaus
Copy link

zklaus commented Jun 8, 2021

I have assigned the milestone. I will still cut the release branch soon, but this week is reserved for testing and fixing exactly this kind of issue before next week's release, so I agree that it would be good to address this soon.

I also agree that the order can not be relied upon since it is arbitrary in dicts. You may be interested to learn that it is arbitrary in globbing as well, see, e.g., here.

@valeriupredoi
Copy link
Contributor Author

valeriupredoi commented Jun 8, 2021

cheers @zklaus - I would suggest we try and understand the functional behaviour of selecting the fx mip first - do we want to use and load the first available fx mip or we want to select a preferred one? This will influence how #1160 gets solved too I think. Unfortunately I don't have the data knowledge to propose a solution, that's why I'm asking you guys that have worked more closely with this sort of data than meself. Definitely not the last come in the loop selection though 😁

@valeriupredoi
Copy link
Contributor Author

possibly ping @ledm since he's been seeing FX stuff for a living, in oceans 🐟

@zklaus
Copy link

zklaus commented Jun 8, 2021

Looking at the data request for (IyrAnt, sftgif) as an example (here), this lives on a polar stereographic grid. In other words, I think the *Ant* and *Gre* files are applicable if and only if the grid_label of the data variable contains the a or g suffix, c.f. last point of note 11 on page 11 of "CMIP6 Global Attributes, DRS, Filenames, Directory Structure, and CV’s".

@zklaus
Copy link

zklaus commented Jun 8, 2021

As for fx vs mon/dec, this depends on the model and the data should only appear either in *fx or *mon. This is somewhat documented in the "processing" instructions of the variables in the data request, see for example cell thickness thkcello in Ofx and in Omon.

@valeriupredoi
Copy link
Contributor Author

cheers for the clarifications @zklaus 🍺 Here's an idea - can we set a "preferred" mip dictionary mapping variable mip to a set of preferred fx mips? This way we'd always select the ones that do make sense first then look for some exotic fx mips after?

@schlunma
Copy link
Contributor

schlunma commented Jun 8, 2021

I had a more detailed look at this. I think that #999 introduced a bug here:

def _search_fx_mip(tables, found_mip, variable, fx_info, config_user):
fx_files = None
for mip in tables:
fx_cmor = tables[mip].get(fx_info['short_name'])
if fx_cmor:
found_mip = True
fx_info['mip'] = mip
fx_info = _add_fxvar_keys(fx_info, variable)
logger.debug(
"For fx variable '%s', found table '%s'",
fx_info['short_name'], mip)
fx_files = _get_input_files(fx_info, config_user)[0]
if fx_files:
logger.debug(
"Found fx variables '%s':\n%s",
fx_info['short_name'], pformat(fx_files))
return found_mip, fx_info, fx_files

In the old implementation, there used to be a break at l. 359 if fx files are found. That ensured that the first fx files that are actually found are returned. Right now, fx_files is overwritten over and over for all searched tables and only the last value is returned, regardless if it's empty or not. We should definitely fix that first.

I agree that a default order for checking the mip tables would be good. Maybe fx first and then simply alphabetically to ensure a consistent behavior across machines? In the old implementation we also preferred the original mip of the variable. That could be achieved by modifying this line (e.g. with a sorted()):

project_tables = CMOR_TABLES[var_project].tables

@valeriupredoi
Copy link
Contributor Author

valeriupredoi commented Jun 8, 2021

@schlunma very cool - I remember now our old implementation (and the discussion we had over it). I agree with your points fully. What do you think @sloosvel - would you be able/willing to do such a thing plss? Cheers guys, this issue looks to be fully elucidated 👻

@zklaus
Copy link

zklaus commented Jun 9, 2021

I think in terms of preferred order, we should use fx files from *Ant* tables for *Ant* variables, from *Gre* tables for *Gre* variables, and *fx, *mon, *dec for the rest. This is because Ant and Gre live on South and North polar stereographic grids, which should be identifiable also from a or g as the last letter in the grid_label, but probably isn't.

@sloosvel
Copy link
Contributor

sloosvel commented Jun 9, 2021

I don't know. The mips are something that depend on the tables, that change from project to project. And even some variables exist in some projects, some others don't. This criteria would work fine for CMIP6, but may be too specific for all the projects.

@valeriupredoi
Copy link
Contributor Author

OK I suggest we fix that test in a way it doesn't fail on any OS, and keep the discussion here. If build tests run on a Debian machine, Klaus will be having issues building the package 👍

@sloosvel
Copy link
Contributor

sloosvel commented Jun 9, 2021

If it's just to fix the test, specifying the mip in the content of the recipe should be enough. Would that work for you?

@valeriupredoi
Copy link
Contributor Author

silly GH closing the issue on its own 🤭 I just merged a temporary fix that skips the tests that are failing in #1169 (cheers for reviewing it, Saskia!). This issue is still ongoing though...

@zklaus
Copy link

zklaus commented Jul 7, 2021

To summarize the issue:

  • This occurs when a requested variable exists in more than one "fx" table (which may even be a yr, mon, or dec table in reality).
  • This ambiguity can not be resolved in general.
  • Specifying the table manually per @sloosvel's comment above resolves the problem.

Is this accurate? If that is the case, I suggest improving the documentation to clarify this and fix the test permanently by including the explicit fx table there as well. I think @schlunma's #1216 is a good starting point for documentation improvements.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment