Export aggregation difference to excel #237

@phackstock phackstock commented Apr 13, 2023

Prompted by a feature request from @orichters (here: https://github.com/iiasa/ngfs-phase-4-internal-workflow/discussions/6), I have implemented a first version of exporting the differences found in region aggregation to Excel.
To be precise, this tackles the case where a model natively reports a region, e.g. World, that is also recalculated as part of our region aggregation. So far, only the top rows of the difference data frame are printed as part of the log file (e.g. https://data.ece.iiasa.ac.at/ngfs-phase-4-internal/#/uploads/jobs/21/log):

2023-04-11 16:55:14 WARNING  Difference between original and aggregated data:
                                                                                                              original  aggregated
model                 scenario  region variable                                        unit            year                       
REMIND-MAgPIE 3.1-4.6 d_delfrag World  Agricultural Demand|Crops|Energy                million t DM/yr 2005  25.324800   25.325200
                                       Agricultural Demand|Crops|Energy|2nd generation million t DM/yr 2005   0.666700    0.667200
                                       Agricultural Demand|Livestock|Other             million t DM/yr 2015  24.059500   24.059800
                                       Agricultural Production|Energy|Crops            million t DM/yr 2005   0.680300    0.680400
                                       Capacity Additions|Electricity|Biomass          GW/yr           2060   0.000038    0.000038
...                                                                                                                ...         ...
                      o_lowdem  World  Yield|Sugarcrops                                t DM/ha/yr      2060  42.070800   46.233420
                                                                                                       2070  43.330200   48.201466
                                                                                                       2080  44.277000   49.474088
                                                                                                       2090  45.033500   49.908712
                                                                                                       2100  43.943600   50.042956

I have added an export_difference option to RegionProcessor.apply() that writes the data to a file called difference.xlsx.

The workflow I envision is as follows:

  1. A user uploads data to a Scenario Explorer and the processing finds differences between aggregated and model-native data.
  2. A warning is issued with the top few rows of the differences.
  3. After that warning there is another message:
     If you want to get the differences, run the following command on your local machine:
     `nomenclature run-region-processing your_data.xlsx --export-differences`
     from the processing workflow directory.
     The differences will be exported to `difference.xlsx`.
  4. The user then clones the corresponding workflow directory locally and installs the requirements.
  5. Using nomenclature run-region-processing your_data.xlsx --export-differences, they then write the differences to a file called difference.xlsx for closer inspection.

@orichters and @danielhuppmann, does this sound like a good workflow to you?

Update

To the data frame of the differences between original and aggregated data, I have now added a relative difference column (in percent, as an absolute value), calculated as 100*abs((original-aggregated)/original).
I also sort the data frame by the newly created column to show the user the biggest differences first.
Finally, I changed the relative tolerance from 1e-5 to 1% to reduce the rate of false positives.
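
As a rough illustration of the calculation described in this update, the following pandas sketch adds the relative difference column, keeps only rows above the 1% tolerance, and sorts them so the largest deviations come first. The column name `difference (%)`, the constant `RTOL_PERCENT`, and the reduced two-level index are assumptions made for this example and need not match the implementation in this PR. The values are copied from the log excerpt above.

```python
import pandas as pd

# A few rows copied from the log excerpt; the index is reduced to two levels
index = pd.MultiIndex.from_tuples(
    [
        ("Agricultural Demand|Crops|Energy", 2005),
        ("Yield|Sugarcrops", 2060),
        ("Yield|Sugarcrops", 2100),
    ],
    names=["variable", "year"],
)
compare = pd.DataFrame(
    {
        "original": [25.3248, 42.0708, 43.9436],
        "aggregated": [25.3252, 46.23342, 50.042956],
    },
    index=index,
)

# Assumed name for the tolerance, expressed in percent
RTOL_PERCENT = 1.0

# Relative difference in percent, as an absolute value
compare["difference (%)"] = 100 * (
    (compare["original"] - compare["aggregated"]) / compare["original"]
).abs()

# Keep only rows above the tolerance and show the largest differences first
flagged = compare[compare["difference (%)"] > RTOL_PERCENT].sort_values(
    "difference (%)", ascending=False
)
print(flagged)
```

With these rows, the first entry differs by far less than 1% and is filtered out, while the two Yield|Sugarcrops rows remain. Rows where `original` is zero would produce infinite or NaN values in this sketch and would need separate handling.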

@phackstock phackstock added the enhancement New feature or request label Apr 13, 2023
@phackstock phackstock self-assigned this Apr 13, 2023
@orichters

Workflow seems fine from my side. "clones the corresponding workflow directory locally" refers in my case to ngfs-phase-4-internal-workflow? Where is the explanation how I can "installs the requirements"? The rest sounds feasible :)

@phackstock
Contributor Author

phackstock commented Apr 13, 2023

"clones the corresponding workflow directory locally" refers in my case to ngfs-phase-4-internal-workflow?

Exactly. As this is just a first quick draft, I'll work on making the instructions better.

Where is the explanation how I can "installs the requirements"?

You would run python3 -m pip install -r requirements.txt to ensure that you can run the workflow locally and have the corresponding nomenclature command line option run-region-processing installed.

Member

@danielhuppmann danielhuppmann left a comment

Thanks @phackstock, a few suggestions below.

@danielhuppmann
Member

One more comment: please make sure to reset_index() before writing compare to xlsx, otherwise it is very cumbersome to do some analysis later in Excel (e.g., filtering hides the model/etc index values).

@phackstock
Contributor Author

One more comment: please make sure to reset_index() before writing compare to xlsx, otherwise it is very cumbersome to do some analysis later in Excel (e.g., filtering hides the model/etc index values).

I'm not sure I understand why this is needed. The difference file I currently get looks like the one attached. That should be fine, no?
difference.xlsx

@danielhuppmann
Member

Thanks @phackstock for showing the file, I guess that pandas changed their output structure recently...

But I still think that df.reset_index().to_excel(index=False) will avoid the bold formatting, so that looks cleaner.
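
To illustrate the suggestion, here is a minimal sketch of what reset_index() changes before an export. The frame contents and the helper name `export_difference` are hypothetical. Writing .xlsx files also requires an Excel engine such as openpyxl, so the helper is only defined here and not called.

```python
import pandas as pd

# Made-up comparison frame with the usual IAMC index levels
index = pd.MultiIndex.from_tuples(
    [("model_a", "scen_a", "World", "Primary Energy", "EJ/yr", 2005)],
    names=["model", "scenario", "region", "variable", "unit", "year"],
)
compare = pd.DataFrame({"original": [10.0], "aggregated": [11.0]}, index=index)

# reset_index() turns every index level into an ordinary column,
# so each row carries its own model, scenario, region, etc.
flat = compare.reset_index()


def export_difference(compare: pd.DataFrame, path: str = "difference.xlsx") -> None:
    """Hypothetical helper: write the comparison as a flat table."""
    # index=False drops the (now meaningless) integer index from the sheet
    compare.reset_index().to_excel(path, index=False)
```

Calling `export_difference(compare)` would then write one plain header row followed by fully populated data rows, which is easier to filter in Excel.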

@phackstock
Contributor Author

Not sure why the tests are failing; they're running fine on my local machine ...
I'll check.

@phackstock
Contributor Author

phackstock commented Apr 18, 2023

Tests are passing now. There are, however, some API changes that I've had to make that I'm not super happy with:

  • In order to get the differences all the way "out" of the region processing, I've had to change the interface of RegionProcessor.apply() from returning an IamDataFrame to returning a tuple of an IamDataFrame and a pandas DataFrame.
  • Since variable validation for the cli run-processing command is a nice feature, and to avoid code duplication, I used the core.process function directly. For this, I had to adjust the return value of core.process to return the same tuple.
  • This is not ideal, as we're planning on adding different processors (e.g. for required variables) soon, so this breaks the unified Processor.apply() -> IamDataFrame interface.
  • I could of course create something like RegionProcessor.apply_and_capture_differences() -> tuple[IamDataFrame, pd.DataFrame] and change the apply function to:
def apply(self, ...):
   return self.apply_and_capture_differences()[0]

This would preserve the interface.

  • The issue with that is that, for running the processing locally with variable validation and potentially a number of different validators, using the process function is quite convenient.
    However, piping out the differences is then difficult, as there could be a situation where we don't have any region processing, i.e. no differences data frame.

I can also keep it as is for now and cross that bridge when we get there, which will be very soon.
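
The wrapper idea discussed in this comment can be sketched with a toy class. Everything here is hypothetical: `ToyRegionProcessor` works on a plain dict of region values rather than an IamDataFrame, and its aggregation and tolerance logic only loosely mimics the real RegionProcessor. The point is just that apply() keeps its single return value while the differences stay available through a second method.

```python
from typing import Dict, Tuple

Data = Dict[str, float]
Differences = Dict[str, Tuple[float, float]]


class ToyRegionProcessor:
    """Toy stand-in for illustration, not nomenclature's RegionProcessor."""

    def __init__(self, rtol: float = 0.01) -> None:
        self.rtol = rtol

    def apply_and_capture_differences(self, data: Data) -> Tuple[Data, Differences]:
        # Recompute "World" as the sum of all other regions
        aggregated = sum(v for region, v in data.items() if region != "World")
        original = data["World"]
        differences: Differences = {}
        # Record the pair only if it exceeds the relative tolerance
        if abs(original - aggregated) > self.rtol * abs(original):
            differences["World"] = (original, aggregated)
        # Toy choice: the processed data keeps the natively reported value
        return dict(data), differences

    def apply(self, data: Data) -> Data:
        # Same single return value as before; the differences are dropped
        return self.apply_and_capture_differences(data)[0]


data = {"World": 10.0, "Region A": 6.0, "Region B": 5.0}
processor = ToyRegionProcessor()
result, differences = processor.apply_and_capture_differences(data)
print(differences)
```

Callers that only need the processed data keep using `apply()`, while a local export step can call the second method and write `differences` to a file.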

@phackstock phackstock force-pushed the feature/export-aggregation-diff-to-excel branch from 624f364 to 6adf8ca Compare May 17, 2023 15:02
@phackstock
Contributor Author

I have implemented the following updates:

  • Added a function called check_region_aggregation; it takes a pyam data frame and returns the differences between aggregated and original data.
  • RegionProcessor.apply() now always returns only one pyam data frame, which contains the processing results.
  • Attributes for altering RegionProcessor.apply behavior are gone.
  • Added a user guide to the documentation on how to get the differences locally.
  • Updated the log message, which now links to the documentation.

@phackstock
Contributor Author

@danielhuppmann since you think the cli is too much for the non-expert user should I just remove the option?

@danielhuppmann
Member

@danielhuppmann since you think the cli is too much for the non-expert user should I just remove the option?

No, keep the CLI - I think having both the CLI implementation and the Python-code-example is the most user-friendly approach!

Member

@danielhuppmann danielhuppmann left a comment

Thanks, a few minor suggestions and clarifications below…

@phackstock
Contributor Author

Updated the return value of check_region_aggregation; it now returns both the processing results and the differences.
Changed the width of the pandas console output so that the differences data frame is written to the console in full.
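
One way to print a data frame without truncation is pandas' option_context. Whether the PR uses these exact options is an assumption, and the frame below is made up for illustration.

```python
import pandas as pd

# Made-up frame that is long enough to be truncated by default
df = pd.DataFrame({"original": range(100), "aggregated": range(100)})

# With default settings, pandas shortens the repr of long frames
default_text = repr(df)

# Temporarily lift the row limit and widen the console output;
# the previous settings are restored when the block exits
with pd.option_context("display.max_rows", None, "display.width", 250):
    full_text = repr(df)

print(len(default_text.splitlines()), len(full_text.splitlines()))
```

Using a context manager keeps the change local to the log statement instead of altering the display options for the whole process.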

@danielhuppmann
Member

Almost there...

Member

@danielhuppmann danielhuppmann left a comment

Looks good, thanks!

@danielhuppmann danielhuppmann merged commit c9978f2 into IAMconsortium:main May 31, 2023