md5 hash collision in annual climatology files #65

Closed
corviday opened this issue Mar 8, 2018 · 3 comments

@corviday
Contributor

corviday commented Mar 8, 2018

Some annual-resolution climdex climatology files that have the same model, variable, period, and run but different emissions scenarios hash to the same md5sum when only the first MB is hashed. This seems to be due to more than 1 MB of reserved space in the netCDF header. The file structure appears to be:

  1. dimension declarations
  2. variable declarations
  3. reserved space
  4. global attributes
  5. variable data

Hashing the first MB of the file means that only dimension declarations, variable declarations, and empty padding go into the hash, and these can be identical across files that differ only in emissions scenario.
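
To illustrate the mechanism described above, here is a minimal sketch using synthetic byte strings rather than real netCDF files. The layout (declarations, then more than 1 MB of zero padding, then attributes and data) follows the structure listed in this comment; the function name `md5_of_prefix` and the scenario labels are hypothetical and are not taken from modelmeta.

```python
import hashlib
import io

MB = 1024 * 1024


def md5_of_prefix(stream, nbytes=MB):
    """Return the md5 hex digest of the first nbytes of a binary stream.

    Hypothetical helper for illustration; not the modelmeta implementation.
    """
    return hashlib.md5(stream.read(nbytes)).hexdigest()


# Synthetic stand-ins for two files that share declarations and padding
# but differ in global attributes and data (assumed layout).
declarations = b"dimension declarations; variable declarations;"
padding = b"\x00" * (2 * MB)  # reserved space larger than the hashed prefix
file_a = declarations + padding + b"scenario=A;" + b"data for scenario A"
file_b = declarations + padding + b"scenario=B;" + b"data for scenario B"

prefix_hash_a = md5_of_prefix(io.BytesIO(file_a))
prefix_hash_b = md5_of_prefix(io.BytesIO(file_b))

prefixes_collide = prefix_hash_a == prefix_hash_b
full_hashes_differ = (
    hashlib.md5(file_a).hexdigest() != hashlib.md5(file_b).hexdigest()
)

print("first-MB hashes collide:", prefixes_collide)
print("full-file hashes differ:", full_hashes_differ)
```

In this sketch the first MB of both inputs contains only the shared declarations and zero padding, so the prefix hashes match even though the full contents differ.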

Attempting to index a file whose md5 hash collides with a file already in the database yields this error message:

2018-03-06 11:06:25 INFO: Processing file: /storage/data/projects/comp_support/climate_explorer_data_prep/climatological_means/climdex/rx1dayETCCDI_aClim_BCCAQ_CanESM2_historical+rcp85_r1i1p1_19610101-19901231.nc
2018-03-06 11:06:25 ERROR: Encountered an unanticipated case:
2018-03-06 11:06:25 ERROR: id_match.id = None
2018-03-06 11:06:25 ERROR: hash_match.id = 10833
2018-03-06 11:06:25 ERROR: filename_match.id = None
2018-03-06 11:06:25 ERROR: old_filename_exists = True; normalized_filenames_match = False; index_up_to_date = True

2018-03-06 11:06:25 ERROR: Traceback (most recent call last):
  File "/home/lzeman/Code/modelmeta-generic/modelmeta/venv/lib/python3.5/site-packages/mm_cataloguer/index_netcdf.py", line 1060, in index_netcdf_file
    data_file = find_update_or_insert_cf_file(session, cf)
  File "/home/lzeman/Code/modelmeta-generic/modelmeta/venv/lib/python3.5/site-packages/mm_cataloguer/index_netcdf.py", line 1045, in find_update_or_insert_cf_file
    raise ValueError('Unanticipated case. See log for details.')
ValueError: Unanticipated case. See log for details.

Short term, we've removed the reserved space from the un-indexable files needed for a current project (72 out of the >500 affected) and have been able to index them, but this problem will recur and needs a more permanent solution, possibly either

  1. hashing more than 1 MB and updating hashes on all existing files
  2. stripping the padding as part of the climatology generating process: if the purpose of the reserved header space is to make file modification faster, it isn't needed here, since generated climatologies aren't typically modified further (and if they were, they're small enough not to have the problem reserved space solves)
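
As a rough sketch of the first option, the following hashes a file in chunks, either in full or up to a configurable byte limit larger than 1 MB. The function name `md5_of_file` and its parameters are hypothetical and are not the modelmeta API; the demo uses a temporary file with synthetic contents.

```python
import hashlib
import os
import tempfile

MB = 1024 * 1024


def md5_of_file(path, max_bytes=None, chunk_size=MB):
    """Return the md5 hex digest of a file, read in chunks.

    If max_bytes is None the whole file is hashed; otherwise only the
    first max_bytes bytes are. Hypothetical helper, for illustration only.
    """
    digest = hashlib.md5()
    remaining = max_bytes
    with open(path, "rb") as f:
        while remaining is None or remaining > 0:
            to_read = chunk_size if remaining is None else min(chunk_size, remaining)
            chunk = f.read(to_read)
            if not chunk:
                break  # end of file reached
            digest.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
    return digest.hexdigest()


# Demo on a synthetic file: 2 MB of zero padding followed by a short tail.
contents = b"\x00" * (2 * MB) + b"distinguishing tail"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(contents)
    tmp_path = tmp.name

try:
    first_mb_hash = md5_of_file(tmp_path, max_bytes=MB)
    whole_file_hash = md5_of_file(tmp_path)
finally:
    os.remove(tmp_path)

print("first MB only:", first_mb_hash)
print("whole file:   ", whole_file_hash)
```

Reading in chunks keeps memory use bounded for large files. Any change to the amount hashed would still require recomputing the stored hashes for existing files, as noted in the first option.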
@corviday
Contributor Author

corviday commented Mar 8, 2018

I don't know why this problem only affects annual files; as I understand the cause, there's no reason these collisions shouldn't happen between seasonal and monthly files as well.

Some climatology files from the HadGEM-CC and HadGEM-ES models also have md5 hash collisions when the only parameter that differs between the files is the model. I haven't looked into this.

@rod-glover
Contributor

Just FYI:

  1. According to Unidata, "NetCDF-4 files are created with the HDF5 library, and are HDF5 files in every way."
  2. HDF5 files have complex low-level structure and file format. I'm not sure their documentation advances this investigation much, but in case it does, here it is.

@corviday
Contributor Author

corviday commented Aug 29, 2019

This bug has been accidentally resolved, hooray! I recalculated some of the affected climatologies for unrelated reasons, as part of updating climdex variables that represent minimums and maximums (pacificclimate/climate-explorer-data-prep#81).

With the most recent versions of modelmeta and climate-explorer-data-prep, datasets that previously caused this error no longer do.
