Export aggregation difference to excel #237

phackstock · 2023-04-13T13:29:34Z

Prompted by a feature request from @orichters (here: https://github.com/iiasa/ngfs-phase-4-internal-workflow/discussions/6), I have implemented a first version of exporting the differences found in region aggregation to Excel.
To be precise, this tackles the case where a model natively reports a region (e.g. World) that is also recalculated as part of our region aggregation. So far, only the top rows of the difference data frame are printed as part of the log file (e.g. https://data.ece.iiasa.ac.at/ngfs-phase-4-internal/#/uploads/jobs/21/log):

2023-04-11 16:55:14 WARNING  Difference between original and aggregated data:
                                                                                                              original  aggregated
model                 scenario  region variable                                        unit            year                       
REMIND-MAgPIE 3.1-4.6 d_delfrag World  Agricultural Demand|Crops|Energy                million t DM/yr 2005  25.324800   25.325200
                                       Agricultural Demand|Crops|Energy|2nd generation million t DM/yr 2005   0.666700    0.667200
                                       Agricultural Demand|Livestock|Other             million t DM/yr 2015  24.059500   24.059800
                                       Agricultural Production|Energy|Crops            million t DM/yr 2005   0.680300    0.680400
                                       Capacity Additions|Electricity|Biomass          GW/yr           2060   0.000038    0.000038
...                                                                                                                ...         ...
                      o_lowdem  World  Yield|Sugarcrops                                t DM/ha/yr      2060  42.070800   46.233420
                                                                                                       2070  43.330200   48.201466
                                                                                                       2080  44.277000   49.474088
                                                                                                       2090  45.033500   49.908712
                                                                                                       2100  43.943600   50.042956

I have added an export_difference option to RegionProcessor.apply() that writes the data to a file called difference.xlsx.

The workflow I envision so far is as follows:

A user uploads data to a Scenario Explorer and the processing finds differences in aggregated and model native data.
There is a warning issued with the top few rows of the differences.
After that warning there is another message:

If you want to get the differences, run the following command on your local machine:
`nomenclature run-region-processing your_data.xlsx --export-differences`
from the processing workflow directory. 
The differences will be exported to `difference.xlsx`.

The user then clones the corresponding workflow directory locally and installs the requirements.
Using `nomenclature run-region-processing your_data.xlsx --export-differences`, they then get the differences in a file called `difference.xlsx` for closer inspection.

@orichters and @danielhuppmann, does this sound like a good workflow to you?

Update

For the data frame of the differences between original and aggregated data I now added a relative (absolute) difference column, calculated as 100*abs((original-aggregated)/original).
I also sort the data frame by the newly created column to show the user the biggest differences first.
Finally, I changed the relative tolerance from 1e-5 to 1% to reduce the rate of false positives.
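
The three changes in this update can be illustrated with a minimal pandas sketch. This is not the code from the PR: the function name `compare_with_tolerance`, the column name `difference (%)`, and the way the tolerance is applied (filtering on the computed column) are assumptions made for illustration. The numbers are taken from the log excerpt above.

```python
import pandas as pd


def compare_with_tolerance(
    original: pd.Series, aggregated: pd.Series, rtol: float = 0.01
) -> pd.DataFrame:
    """Hypothetical helper: rows that differ by more than rtol, largest first."""
    compare = pd.DataFrame({"original": original, "aggregated": aggregated})
    # Relative (absolute) difference in percent, following the formula above.
    # Note: rows where original is 0 would yield inf or NaN here.
    compare["difference (%)"] = 100 * (
        (compare["original"] - compare["aggregated"]) / compare["original"]
    ).abs()
    # Keep only rows outside the relative tolerance (assumed filtering rule)
    outside = compare[compare["difference (%)"] > 100 * rtol]
    # Show the biggest differences first
    return outside.sort_values("difference (%)", ascending=False)


index = pd.MultiIndex.from_tuples(
    [
        ("Agricultural Demand|Crops|Energy", 2005),
        ("Yield|Sugarcrops", 2060),
        ("Yield|Sugarcrops", 2100),
    ],
    names=["variable", "year"],
)
original = pd.Series([25.3248, 42.0708, 43.9436], index=index)
aggregated = pd.Series([25.3252, 46.23342, 50.042956], index=index)

result = compare_with_tolerance(original, aggregated)
print(result)
```

With a tolerance of 1%, the `Agricultural Demand|Crops|Energy` row (a difference of roughly 0.002%) is dropped, and the two `Yield|Sugarcrops` rows remain, with the year 2100 row listed first.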

orichters · 2023-04-13T13:41:30Z

Workflow seems fine from my side. "clones the corresponding workflow directory locally" refers in my case to ngfs-phase-4-internal-workflow? Where is the explanation how I can "installs the requirements"? The rest sounds feasible :)

phackstock · 2023-04-13T13:47:48Z

> "clones the corresponding workflow directory locally" refers in my case to ngfs-phase-4-internal-workflow?

Exactly, as this is just a first quick draft. I'll work on making the instructions better.

> Where is the explanation how I can "installs the requirements"?

You would run `python3 -m pip install -r requirements.txt` to ensure that you can run the workflow locally and have the corresponding nomenclature command line option `run-region-processing` installed.

danielhuppmann

Thanks @phackstock, a few suggestions below.

danielhuppmann · 2023-04-14T04:44:59Z

One more comment: please make sure to reset_index() before writing compare to xlsx, otherwise it is very cumbersome to do some analysis later in Excel (e.g., filtering hides the model/etc index values).

phackstock · 2023-04-14T10:06:49Z

> One more comment: please make sure to reset_index() before writing compare to xlsx, otherwise it is very cumbersome to do some analysis later in Excel (e.g., filtering hides the model/etc index values).

I'm not sure I understand why this is needed; the difference file I currently get looks like the one attached. That should be fine, no?
difference.xlsx

danielhuppmann · 2023-04-14T10:11:53Z

Thanks @phackstock for showing the file, I guess that pandas changed their output structure recently...

But I still think that df.reset_index().to_excel(index=False) will avoid the bold formatting, so that looks cleaner.
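
A minimal sketch of that suggestion, assuming the differences live in a data frame with the usual model/scenario/region/variable/unit/year index. The sample values and the helper name `export_difference` are made up for illustration; `to_excel` needs a target path and an Excel writer engine such as openpyxl.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("model_a", "scen_a", "World", "Primary Energy", "EJ/yr", 2005)],
    names=["model", "scenario", "region", "variable", "unit", "year"],
)
compare = pd.DataFrame({"original": [1.0], "aggregated": [1.1]}, index=index)

# Turn the index levels into ordinary columns so that every row carries
# its own model/scenario/... values when filtering in Excel
flat = compare.reset_index()
print(flat.columns.tolist())


def export_difference(df: pd.DataFrame, path: str = "difference.xlsx") -> None:
    """Hypothetical helper: write the flattened frame without an index column."""
    # Requires an Excel writer engine (e.g. openpyxl) to be installed
    df.reset_index().to_excel(path, index=False)
```

Calling `export_difference(compare)` would then write a flat table whose first six columns are the former index levels.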

phackstock · 2023-04-18T08:10:37Z

Not sure why the tests are failing; they're running fine on my local machine ...
I'll check.

phackstock · 2023-04-18T11:08:35Z

Tests are passing now. There are, however, some API changes that I had to make and that I'm not super happy with:

In order to get the differences all the way "out" of the region processing I've had to adjust the interface of RegionProcessor.apply() from returning an IamDataFrame to returning a tuple of an IamDataFrame and a pandas DataFrame.
Since variable validation for the cli run-processing command is a nice feature, and to avoid code duplication, I used the core.process function directly. For this, I had to adjust the return value of core.process to return the same tuple.
This is not ideal as we're planning on adding different processors (e.g. for required variables) soon so this messes up the unified Processor.apply() -> IamDataFrame.
I could of course create something like RegionProcessor.apply_and_capture_differences() -> tuple[IamDataFrame, pd.DataFrame] and change the apply function to:

def apply(self, ...):
   return self.apply_and_capture_differences()[0]

This would preserve the interface.
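
A self-contained sketch of that idea, using plain lists as stand-ins for `IamDataFrame` and `pd.DataFrame`. The classes `Processor` and `SketchRegionProcessor` and the toy rounding logic are hypothetical and only illustrate how the thin `apply()` wrapper keeps the unified return type.

```python
from typing import List, Tuple

Data = List[float]  # stand-in for pyam.IamDataFrame
Differences = List[Tuple[int, float]]  # stand-in for pandas.DataFrame


class Processor:
    """Unified interface: every processor maps data to data."""

    def apply(self, df: Data) -> Data:
        raise NotImplementedError


class SketchRegionProcessor(Processor):
    def apply_and_capture_differences(self, df: Data) -> Tuple[Data, Differences]:
        # Toy "processing": round the values and record what changed
        result = [round(value, 1) for value in df]
        differences = [
            (position, value - rounded)
            for position, (value, rounded) in enumerate(zip(df, result))
            if value != rounded
        ]
        return result, differences

    def apply(self, df: Data) -> Data:
        # Drop the differences so the Processor.apply() signature is preserved
        return self.apply_and_capture_differences(df)[0]


processor = SketchRegionProcessor()
processed = processor.apply([1.0, 2.26])
_, captured = processor.apply_and_capture_differences([1.0, 2.26])
```

Callers that only need the processed data keep using `apply()`, while callers interested in the differences use the second method explicitly.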

The issue with that is that, for running the processing locally with variable validation and potentially a number of different validators, using the process function is quite convenient.
However, then piping out the differences is difficult as there could be a situation where we don't have any region processing, i.e. no differences data frame.

I can also keep it as is for now and cross that bridge when we get there, which will be very soon.

Co-authored-by: Daniel Huppmann <[email protected]>

phackstock · 2023-05-25T14:45:25Z

I have implemented the following updates:

Added a function called check_region_aggregation; it takes a pyam data frame and returns the differences between aggregated and original data.
RegionProcessor.apply() now always returns only one pyam data frame, which is the processing result.
Attributes for altering the behavior of RegionProcessor.apply() are gone.
Added a user guide to the documentation on how to get the differences locally.
Updated the log message, which now links to the documentation.

phackstock · 2023-05-25T14:47:34Z

@danielhuppmann since you think the cli is too much for the non-expert user should I just remove the option?

danielhuppmann · 2023-05-26T06:40:38Z

> @danielhuppmann since you think the cli is too much for the non-expert user should I just remove the option?

No, keep the CLI - I think having both the CLI implementation and the Python-code-example is the most user-friendly approach!

danielhuppmann

Thanks, a few minor suggestions and clarifications below…

phackstock · 2023-05-30T13:05:10Z

Updated the return value of check_region_aggregation; it now returns both the processing results and the differences.
Changed the width of the pandas console output so that it will write the full length of the differences data frame to the console.
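
For reference, one way to make pandas print a complete data frame is a temporary display-option override. This is a sketch of the general technique, not necessarily the change made in this PR; the sample frame and the width value of 1000 are arbitrary choices.

```python
import pandas as pd

# Arbitrary sample data with more rows than the default display limit
difference = pd.DataFrame(
    {
        "original": [float(i) for i in range(100)],
        "aggregated": [i * 1.05 for i in range(100)],
    }
)

# With default options, long frames are shortened with an ellipsis row
truncated = str(difference)

# Temporarily lift the row/column limits and widen the output
with pd.option_context(
    "display.max_rows", None,
    "display.max_columns", None,
    "display.width", 1000,
):
    full = str(difference)

print(full)
```

`difference.to_string()` is an alternative that renders the whole frame regardless of the display options.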

danielhuppmann · 2023-05-30T13:38:53Z

Almost there...

danielhuppmann

Looks good, thanks!

phackstock added the enhancement label (New feature or request) Apr 13, 2023

phackstock requested a review from danielhuppmann April 13, 2023 13:29

phackstock self-assigned this Apr 13, 2023

danielhuppmann reviewed Apr 13, 2023

phackstock requested a review from danielhuppmann April 18, 2023 10:56

phackstock force-pushed the feature/export-aggregation-diff-to-excel branch from 624f364 to 6adf8ca Compare May 17, 2023 15:02

phackstock and others added 17 commits May 23, 2023 15:51

Add option to export differences from region aggregation

6c64257

Add test for exporting differences

b3d6fe9

Add cli command run-region-processing with export-differences option

79b9c25

Set limit for relative tolerance to 1%

7348dc3

Add and sort by relative difference

1e26c83

Adjust tests for relative difference

9983501

Make rtol a keyword argument

247fe58

Export result and difference

d066297

Change process to return both result and difference

cb54a6a

Update cli to use process function

26af5f1

Adjust tests

f427ced

Apply suggestions from code review

cb804fc

Co-authored-by: Daniel Huppmann <[email protected]>

Add debug info for failing GitHub tests

df5309f

Clean up difference output

c31a6f0

Adjust tests

eee75cd

Shorten difference column name

0292bff

Adjust tests

16455ce

phackstock added 2 commits May 25, 2023 16:40

Update differences log

295c2f8

Add check_region_aggregation docstring

3fbe03f

phackstock requested a review from danielhuppmann May 25, 2023 14:46

Appease stickler

3001c38

danielhuppmann reviewed May 26, 2023

phackstock and others added 4 commits May 26, 2023 10:11

Apply suggestions from code review

c7d8193

Co-authored-by: Daniel Huppmann <[email protected]>

Update run-region-processing cli

2e546ff

Update return value of check_region_aggregation

c8b7ed2

Update cli

27c5678

phackstock requested a review from danielhuppmann May 30, 2023 13:04