Implement `DataValidator.apply()` #368

danielhuppmann · 2024-08-19T08:17:16Z

This is the implementation for the apply method of the DataValidator.

The current implementation writes each validation-item to the log separately, together with the failing data rows.

2024-08-19 10:10:33 ERROR    Failed data validation (file validate_data/validate_data_fails.yaml):
  Criteria: variable: ['Primary Energy'], upper_bound: 5.0
         model scenario region        variable   unit  year  value
    0  model_a   scen_a  World  Primary Energy  EJ/yr  2010    6.0
    1  model_a   scen_b  World  Primary Energy  EJ/yr  2010    7.0

  Criteria: variable: ['Primary Energy|Coal'], lower_bound: 2.0
         model scenario region             variable   unit  year  value
    0  model_a   scen_a  World  Primary Energy|Coal  EJ/yr  2005    0.5

  Criteria: variable: ['Primary Energy'], year: [2005], upper_bound: 1.9, lower_bound: 1.1
         model scenario region        variable   unit  year  value
    0  model_a   scen_a  World  Primary Energy  EJ/yr  2005    1.0
    1  model_a   scen_b  World  Primary Energy  EJ/yr  2005    2.0

I initially had all failing validations (per yaml file) concatenated into one DataFrame, but I then thought that this would not be helpful, because users could not see which specific criteria their data fails.

If you agree with this approach in principle, I can make the error-message a bit nicer (indentation, more concise criteria representation).
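For illustration, here is a minimal standalone sketch of the per-criteria reporting described above. It uses plain dicts instead of a pyam `IamDataFrame`, so `failing_rows` and `collect_failures` are hypothetical helpers, not part of nomenclature or pyam. The third criteria item is made up to show that passing criteria produce no log entry.

```python
import logging

logger = logging.getLogger(__name__)


def failing_rows(rows, variable, upper_bound=None, lower_bound=None):
    """Return the rows for `variable` whose value lies outside the bounds."""
    failed = []
    for row in rows:
        if row["variable"] != variable:
            continue
        too_high = upper_bound is not None and row["value"] > upper_bound
        too_low = lower_bound is not None and row["value"] < lower_bound
        if too_high or too_low:
            failed.append(row)
    return failed


def collect_failures(rows, criteria_items):
    """Pair each criteria dict with its failing rows; skip criteria that pass."""
    failures = []
    for criteria in criteria_items:
        failed = failing_rows(rows, **criteria)
        if failed:
            failures.append((criteria, failed))
    return failures


# Values taken from the log excerpt above (columns reduced for brevity)
rows = [
    {"scenario": "scen_a", "variable": "Primary Energy", "year": 2010, "value": 6.0},
    {"scenario": "scen_b", "variable": "Primary Energy", "year": 2010, "value": 7.0},
    {"scenario": "scen_a", "variable": "Primary Energy|Coal", "year": 2005, "value": 0.5},
]
criteria_items = [
    {"variable": "Primary Energy", "upper_bound": 5.0},
    {"variable": "Primary Energy|Coal", "lower_bound": 2.0},
    {"variable": "Primary Energy|Coal", "upper_bound": 10.0},  # hypothetical, passes
]

failures = collect_failures(rows, criteria_items)
for criteria, failed in failures:
    # One entry per failing criteria, so users see exactly which check failed
    logger.error("Criteria %s failed for %d row(s)", criteria, len(failed))
```

Keeping the criteria next to their failing rows is what a single concatenated DataFrame would lose.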

phackstock

Looks good to me, some smaller comments in line and one file to delete (I think).
After that, good to merge.

tests/data/validation/Untitled.ipynb

nomenclature/processor/data_validator.py

phackstock · 2024-08-19T10:46:06Z

nomenclature/processor/data_validator.py

+        error_list = []
+
+        with adjust_log_level():
+            for item in self.criteria_items:


You could turn this into a single list comprehension and use the walrus operator:

    if error_list := [
        " Criteria: "
        + ", ".join([f"{key}: {value}" for key, value in item.criteria.items()])
        + "\n"
        + textwrap.indent(str(df.validate(**item.criteria)), prefix=" ")
        + "\n"
        for item in self.criteria_items
        if df.validate(**item.criteria) is not None
    ]:
        logger.error(
            "Failed data validation (file %s):\n%s",
            get_relative_path(self.file),
            "\n".join(error_list),
        )

not sure if that's more readable though.
Feel free to keep whatever is most readable to you.

This seems less readable than the current implementation, so I suggest keeping the current version unless we add a utility function that moves the string concatenation somewhere else, maybe together with a refactor of the RequiredDataValidator?
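A rough sketch of what such a utility function could look like, assuming (as in the snippet above) that the validation call returns `None` when all data passes. The names `format_failure`, `format_failures`, and `fake_validate` are hypothetical and not part of nomenclature. The sketch also calls the validation only once per item, instead of twice as in the list comprehension.

```python
import textwrap


def format_failure(criteria, failing_data, indent="    "):
    """Build the log entry for one criteria item and its failing data."""
    criteria_line = ", ".join(f"{key}: {value}" for key, value in criteria.items())
    return (
        f"  Criteria: {criteria_line}\n"
        + textwrap.indent(str(failing_data), prefix=indent)
        + "\n"
    )


def format_failures(validate, criteria_items):
    """Return one formatted entry per criteria item that fails validation."""
    messages = []
    for criteria in criteria_items:
        failing_data = validate(**criteria)  # evaluated once per item
        if failing_data is not None:
            messages.append(format_failure(criteria, failing_data))
    return messages


def fake_validate(**criteria):
    """Stand-in for the real validation: fails only if an upper bound is set."""
    return "row 0\nrow 1" if "upper_bound" in criteria else None


messages = format_failures(
    fake_validate,
    [
        {"variable": ["Primary Energy"], "upper_bound": 5.0},
        {"variable": ["Primary Energy|Coal"]},
    ],
)
```

With a helper like this, the calling method would only need to check whether `messages` is non-empty and pass `"\n".join(messages)` to `logger.error`.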

danielhuppmann added 9 commits August 16, 2024 12:14

Remove empty line at end of file

2db6a18

Make sure that validation-with-codelist passes if no criteria are given

17017e8

Harmonize notation

31f7baa

Add criteria attribute

402817d

Add initial apply implementation

433415d

Add a test for showing how to fail validation

1dcfa11

Add a test for showing how to fail validation

cb37ff3

Make black

8d3d822

Write failing validation for each item to log with criteria

68accb8

danielhuppmann requested a review from phackstock August 19, 2024 08:17

danielhuppmann self-assigned this Aug 19, 2024

danielhuppmann added 3 commits August 19, 2024 10:38

Don't add upper/lower bound columns explicitly

710497b

Make more concise log error messages

b3288b9

Add a test

b190f96

danielhuppmann marked this pull request as ready for review August 19, 2024 09:18

danielhuppmann and others added 4 commits August 19, 2024 11:34

Fix failing test

856a412

Simplify test to one assertion

766787e

Check if console-with is causing the problems

37f689a

Fix validate data path

a61b7c0

phackstock approved these changes Aug 19, 2024


danielhuppmann added 3 commits August 19, 2024 13:28

Remove dev notebook

7868fd5

Implement review suggestion by @phackstock

0f19b56

Remove unnecessary todo

9ae85dc

danielhuppmann merged commit 18c0b12 into main Aug 20, 2024
12 checks passed

danielhuppmann deleted the feature/validate-data-apply branch August 20, 2024 11:36
