Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor GCBM API #163

Open
wants to merge 8 commits into
base: master
Choose a base branch
from
Open

Refactor GCBM API #163

wants to merge 8 commits into from

Conversation

aornugent
Copy link
Contributor

@aornugent aornugent commented Jul 31, 2022

Description

@ankitaS11 has been developing a new API for creating GCBM simulations. Here is how it applies in the FLINTcloud context.

There are a few notable changes between the implementation at https://github.com/ankitaS11/gcbm_preprocessing

  1. The rest_api_gcbm endpoints store uploaded files in a flat directory structure, rather than the nested directories of GCBM_Demo_Run.zip
  2. The FLINTcloud config templates are incomplete and require additional modifications. We wish to accept variable numbers of input files (e.g. with and without disturbances) rather than hard coding the names of the input files into our templates.
  3. Not all input configuration can be autogenerated. User specified is enabled by the set_*_attributes methods.

Testing:

@ankitaS11 has thoughtfully written a short script to mock up the API and compare it to a known reference. This process is described in local/rest_api_gcbm/tests/

This approach to testing is fast. It will allow us to iterate quickly on the API methods necessary to match the reference.

However, it is important to bear in mind that the ultimate target is to surface the API through REST endpoints via Flask (in app.py). Please ensure that the proposed changes are aligned with the workflow of FLINTui.

Lastly, it is important to verify that the generated config. is valid meaning that the GCBM Demo Run simulation will run. This is a final integration test - we assume in the meantime that config that matches the reference is also valid. A working FLINT installation is not needed for development.

aornugent and others added 2 commits July 31, 2022 13:55
@aornugent
Copy link
Contributor Author

aornugent commented Jul 31, 2022

@ankitaS11 - I should have made this update on #152 sorry 🤦

Are you able to reconcile the two sets of changes?

@ankitaS11
Copy link
Contributor

@ankitaS11 - I should have made this update on #152 sorry : facepalm:

Are you able to reconcile the two sets of changes?

Hi, @aornugent - No issues at all. I appreciate the details in the description as well as your efforts in creating this PR. Let me try merging these changes into mine, or if you prefer, I can push any changes in this PR only, whatever that you suggest.

YashKandalkar
YashKandalkar previously approved these changes Jul 31, 2022
Copy link

@YashKandalkar YashKandalkar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks awesome!

@aornugent
Copy link
Contributor Author

aornugent commented Jul 31, 2022

Let me try merging these changes into mine, or if you prefer, I can push any changes in this PR only, whatever that you suggest.

Whichever is easiest for you.

Have you been able to checkout this branch? There's quite a few little fixes needed to get compare_json to pass but I don't think it'll take long. 🤞 Your new API should be fast to extend.

@YashKandalkar - are you able to advise on how we manage the sim = GCBMSimulation() object in Flask? Can we expect to persist the object between REST API calls or do we need to create a new instance every time?

@YashKandalkar
Copy link

Hey @aornugent, creating a global dict with simulation titles as keys and instances of the GCBMSimulation class as values should work till the server runs. If, for some reason, the server restarts, we'd lose all the instances.

Maybe, in the future, we can have a function run when the server starts which will construct these instances again depending on the files stored from the previous uploads.

@Namyalg
Copy link
Member

Namyalg commented Aug 1, 2022

Maybe, in the future, we can have a function run when the server starts which will construct these instances again depending on the files stored from the previous uploads.

As per my discussion with @HarshCasper , the next steps in the GCBM API development will most likely handle aspects like caching, error handling etc, so we can develop a robust system over all

Also sharing a few of my observations and queries here :

  1. The OO implementation of the pre-processing is a good idea in comparison to the existing structure with very little modularisation and abstraction

  2. The class GCBMSimulation handles all the operations related to the adding/editing of the files while the interaction with these functions is implemented by subclasses for disturbances, classifiers, misc

  3. The current usage of the flat structure will save us the effort of copying the files from the folders created currently to the path input/name-of-sim , and as I understand, here, the folder templates/config will be used as the folder

  4. With respect to editing the JSON, are we providing a provision only to edit the attributes section in the classifier and disturbance files?

  5. The function to account for hardcoded values, https://github.com/ankitaS11/gcbm_preprocessing/blob/3decee75dd7bec8f549fd4e0e85126070892c249/gcbm.py#L56, can this be generalised ? Doesn't this hold good only for the current set of inputs?

@Namyalg
Copy link
Member

Namyalg commented Aug 1, 2022

How exactly will the compare-json functionality be generalised ?
For these set of inputs, we have the required configuration to compare against, but what if the user has a different set of input files and configurations?

@Crystalsage
Copy link
Contributor

Hi @aornugent nice work!

For the GCBMSimulation object, in case if a server restarts - we can have these instances recreated when the server launches, as @YashKandalkar suggested.

I think a good way to store the GCBMSimulation object would be to serialize it by pickling it with the pickle library. This way, we can do a deserialization routine when the server launches, and recover the object.

@ankitaS11
Copy link
Contributor

@aornugent - Thanks for sharing your observations, and @Namyalg @YashKandalkar for your review and questions.

I'll make an attempt to address all of them, in case I miss any, please feel free to point me to them again. Let me start with some context on what I'm thinking:

The current state of the GCBM API only supports categories: disturbances, classifiers and miscellaneous, and it will support the input configuration in the next few commits, hopefully by this week. So the API might be tinkered a bit to support generating the study_area config as well, although the design should remain the same. The test file is not really testing anything, but just comparing the two JSONs - one that was existing before, and the other that was generated, as rightly pointed by @Namyalg. My idea of writing tests was:

  1. Use pytest for testing.
  2. There are two types of tests we need to do:
    • To ensure that the correct meta-data is written to the JSON files. Example: if we read the width as X from the raster file, it should be correctly to the JSON file.
    • To ensure that in case of pre-existing config, the API does or does not overwrite (depending on what's the right action) the config.
    • To ensure that correct meta-data is read from the raster file. (IMO: we can skip this one, as we can safely rely on the rasterio library to help us do it.)

I'll be working on this as well, hopefully have some results this week, and rest of them in the next.

With respect to editing the JSON, are we providing a provision only to edit the attributes section in the classifier and disturbance files?

Good question, after taking a look at the config, (I may lack some context here, so please correct me if I'm wrong):

The following should not be allowed to be edited by the user:

"tileLatSize"
"tileLonSize"
"blockLatSize
"blockLonSize
"cellLatSize"
"cellLonSize"

We can allow the following for edits:

"layer_type"
"layer_data"
"nodata",
"attributes": {...}

The API will be flexible enough to do this, please let me know if this is the way we want it to behave.

The function to account for hardcoded values, https://github.com/ankitaS11/gcbm_preprocessing/blob/3decee75dd7bec8f549fd4e0e85126070892c249/gcbm.py#L56, can this be generalised ? Doesn't this hold good only for the current set of inputs?

I hope I answered this question above in my testing plan. This was just a temporary solution to help us validate that the API doesn't break what we had before :) @Namyalg - Please let me know if my testing plan doesn't answer your concern. One help I would need is an example of a custom (whose hard-coded values don't exist in this PR) TIFF file and expected JSON config from it - just to validate my design further. Can anyone help with this?

@Crystalsage:

I think a good way to store the GCBMSimulation object would be to serialize it by pickling it with the pickle library. This way, we can do a deserialization routine when the server launches, and recover the object.

We don't want to pickle everything we have in the object. This will take me some time to make some data unpickable, I think instead - we can pickle only some data (files - which stores the config as well as the paths) - and have an API in the GCBMSimulation class that allows re-loading if the pickle file already exists. This is more like a UX thing, and I'll also be curious to hear from @YashKandalkar and others involved for their opinion.

@aornugent - If you don't mind, I would like to keep my opened PR as the source of further discussion as well as changes, and I can merge all the changes that you have made in the codebase to my branch - only if that's fine by you? (PR: #152)

@Namyalg
Copy link
Member

Namyalg commented Aug 1, 2022

Thanks for the detailed explanation @ankitaS11
It answers all my questions, to summarise, this comparison is done to ensure we're generating the correct configs, and the input is as expected (and can be removed at a later stage) and the existing test suite can be extended to perform the testing of the new API design.
Regarding the modification of the attributes, I agree with your observations

@aornugent
Copy link
Contributor Author

@aornugent - If you don't mind, I would like to keep my opened PR as the source of further discussion as well as changes, and I can merge all the changes that you have made in the codebase to my branch - only if that's fine by you? (PR: #152)

Fine by me! I hope it's useful to demonstrate the desired target.

A few quick comments:

File structure

The current usage of the flat structure will save us the effort of copying the files from the folders created currently to the path input/name-of-sim , and as I understand, here, the folder templates/config will be used as the folder

I do not think we should use templates/config as the root directory for a simulation. These are templates and should be read only. My temporary hack to demonstrate this is to create input/test-run and copy files there to be modified. A folder per simulation can be created using gcbm/new. My preference for naming would be /simulations/simulation-title/input so that we can also have simulations/simulation-title/output.

Caching state

Creating a global dict with simulation titles as keys and instances of the GCBMSimulation class as values should work till the server runs. If, for some reason, the server restarts, we'd lose all the instances.

This sounds great. Fine to clear cache on restart for now.

We can pickle only some data (files - which stores the config as well as the paths) - and have an API in the GCBMSimulation class that allows re-loading if the pickle file already exists.

Agree - input files can be large but their config will always be small. We can keep a cache of prior simulations and indicate to the user where additional files need to be uploaded. But this is an advanced feature - first lets get one simulation running.

Simulation requirements

So the API might be tinkered a bit to support generating the study_area config as well, although the design should remain the same.

I think you can ignore study_area. I don't think it's used as an input (try Ctrl + F study_area.json and see if it's mentioned anywhere in the config.)

One help I would need is an example of a custom (whose hard-coded values don't exist in this PR) TIFF file and expected JSON config from it - just to validate my design further. Can anyone help with this?

Check out GCBM.Belize and GCBM.Colombia. These have different inputs to the GCBM New Demo Run.

Thanks everyone!
In order of priority, I would say we want to demonstrate:

  • GCBM New Demo Run
  • GCBM New Demo Run with no disturbances
  • GCBM Belize

@Namyalg
Copy link
Member

Namyalg commented Aug 2, 2022

Yes, we can create a folder per simulation.
Are we planning to test the GCBM example API on the Belize inputs ?

Copy link
Contributor

@ankitaS11 ankitaS11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi all!

@aornugent - I realized it will be better if I make changes here only, as a lot of discussions happened here + I noticed you made significant and important changes. (Thanks for that)

Made some updates, notably:

  1. Added tests_simulation.py - which tests 2 tasks:
    • Whether all the files are added successfully for all the runs.
    • Whether all the files are updated when set attributes is called.
  2. Updated gcbm.py to allow passing category as an argument while adding files, (as some file names may not have the category in their file names). This will allow us support runs like GCBM.Belize.

In order to run the tests:

pip install pytest pytest-mock
cd local/rest_api_gcbm/tests && pytest tests_simulations.py

Looking forward to your feedback! Need to fix a few more TODOs, but hopefully this is a step in right direction.

local/rest_api_gcbm/gcbm.py Outdated Show resolved Hide resolved
local/rest_api_gcbm/gcbm.py Outdated Show resolved Hide resolved
self.json_paths = {}

# TODO: Once categories are changed from strings to Enums, we should find a better way to have supported categories
self.supported_categories = ["classifiers", "disturbances", "miscellaneous"]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to support input databases also. If you rename /tests/tiffs/new_demo_run $\rightarrow$ tests/reference/new_demo_run then it makes sense to include files in addition to tiffs.

Copy link
Contributor Author

@aornugent aornugent Aug 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PS. thank you for changing this to lowercase - CamelCase filenames drive me nuts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P.P.S I also got sick of the unnecessary _moja suffix and removed it from the linux-demo reference simulation on #165

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to support input databases also.

Got it! I understand that it's a TODO, and I'll add this feature in the next commit. Thanks for letting me know about the _moja suffix removal, since it's merged now - I'll rebase and fix these suffixes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed! :)

# We can also raise an error here
raise UserWarning(f"Given category {category} not supported, supported categories are: {disturbances, classifiers, miscellaneous}")
else:
print(f"Category wasn't provided, attempting to deduce it from the file path: {file_path}")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another scenario is that categories are provided as part of an HTTP request, via call to the /gcbm/upload REST endpoint.

@YashKandalkar - would you prefer to support individual file uploads on the front end, or the bulk upload option we currently use an an example (see: curl.md)

local/rest_api_gcbm/gcbm.py Outdated Show resolved Hide resolved
Comment on lines 387 to 389
# TODO: Check how to handle multiple attributes entries (L442-451 of `app.py:master`)
# sim.set_disturbance_attributes("disturbances_2013_moja.tiff",
# {"year": 2013, "disturbance_type": "Wildfire", "transition": 1})
Copy link
Contributor Author

@aornugent aornugent Aug 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is important too - the shortest fix for adding multiple attributes might be to combine both into a single payload.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have taken a note of this, and will fix this TODO in the next commit. :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! :) Hope it looks good?

Copy link
Contributor

@ankitaS11 ankitaS11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @aornugent and everyone else, the recent commit updates this branch with the following new features:

  1. Passing attributes (payloads) as nested dictionaries is now allowed. The script also checks if year/{category}_type are in the payloads, and sets has_year/has_type attributes in the output config.
  2. The API is now able to generate provider_config.json, modules_cbm.json, variables.json correctly. One needs to call sim.generate_configs() when they want this.
  3. Tests are also passing, so nothing breaks so far. However, I've changed the categories from disturbances -> disturbance, classifiers -> classifier - I just felt it's better, but this might break what we have in app.py, so I'm thinking to revert this particular change, but open to any feedback or comments.

I'll appreciate it, if anyone wants to test this, please do: python3 gcbm.py in local/rest_api_gcbm folder of this branch, and check the input/test-run folder for all the output files generated. Please let me know if I missed anything.

Tests, and more features/my attempts to solve the inline TODOs will follow in the next couple of days, and I'll update whenever I've any news :)

Comment on lines +487 to +498
sim.set_attributes(
file_path="disturbances_2013.tiff",
category="disturbance",
payload={
"1": {
"year": 2013,
"disturbance_type": "Mountain pine beetle — Very severe impact",
"transition": 1,
},
"2": {"year": 2013, "disturbance_type": "Wildfire", "transition": 1},
},
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Passing attributes in nested dictionary is now supported. It also checks if year/{category}_type are anywhere in the nested dictionary, and sets has_year/has_type keys in the config file respectively.

Comment on lines +378 to +381
def generate_configs(self):
self._generate_provider_config(self.dirpath)
self._generate_modules_cbm_config(self.dirpath)
self._add_variables(self.dirpath)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function should be called if someone wants to update the provider/modules_cbm/variables files. I cross-checked that all the files are generated correctly, but I'll appreciate another set of eyes.

TODO for me: write tests to cross-check this :)

Comment on lines +174 to +176
def _update_config(self):
for idx, _ in enumerate(self.config):
self.config[idx]["SQLite"] = {"path": self.files[idx], "type": "SQLite"}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need this, I'll remove in the next commit.

@staticmethod
def safe_read_json(file_path):

# TODO: add read method for gcbm_config.cfg and logging.conf
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aornugent - Thanks for adding these TODOs in the code, they are extremely useful for me to track what needs to be done. However, I'm a bit confused about this one - do we want to just read these files, and store them in another folder? Or in a JSON? Or do we want to extract any information, and use it somewhere else? I'm not sure as I couldn't find a relevant part of it in app.py, so I'll really appreciate it if you can help me with this.

Copy link
Contributor Author

@aornugent aornugent Aug 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ankitaS11 - this one is a little cryptic. The two files need to be copied from templates and the default behaviour was to traverse all the config. files in the folder with safe_read_json, read their contents into the files dictionary, and then save a copy in the simulation folder.

But safe_read_json() only reads JSON, and was erroring out when given the .cfg and .conf files.

The contents of these files don't change very often (if ever!) but presently we have no way of reading them as part of the simulation configuration.

I'd appreciate your advice on how they can be handled.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @aornugent. I'm not Ankita, but I think configparser library would be a good choice for parsing the .conf files. They provide easy parsing for .ini style syntax.

For .cfg files, we can prepend a pseduo-section and then use configparser.

The code would look something like this:

import configparser

# For conf files
parser = configparser.ConfigParser()
parser.read('what.conf')
log = parser['Core']['DisableLogging'])

# For cfg files
parser = ConfigParser()
with open("gcbm_config.cfg") as config:
    parser.read_string("[top]\n" + config.read())

The problem is that all keys in the gcbm_config.cfg have the same name. This causes an error too.

config=localdomain.json
config=pools_cbm.json
config=modules_cbm.json
config=modules_output.json
config=spinup.json
config=variables.json
config=internal_variables.json

we can maybe rename some fields? Like localdomain_config=localdomain.json? :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Crystalsage - that's a good suggestion. I don't think we can change the field names, because they're in the format that the GCBM understands.

Maybe we can just read these files in as strings for now?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aornugent Ah I see. In that case - what do you think about preparing a makeshift .cfg config in memory for better parsing and using that throughout the API? This way, we won't need to restructure the config file - and we can always change a few things around the API. A very rough sketch looks something like this:

import configparser

parser = configparser.ConfigParser()

gcbm_config_file = open("gcbm_config.cfg").read()
config_file_new = ""

# Build a makeshift config
for config in config_file.splitlines():
    # Get the json file name
    file_name = config.split('=')[1]

    # Trim off the extension to get the config type
    config_name = file_name[:-5]

    # Format into .ini style
    # e.g. localdomain_config=localdomain.json
    config_file_new += f"{config_name}_config={file_name}\n"

parser.read_string("[top]\n" + config_record)

# Returns 'localdomain.json'
parser['top']['localdomain_config']

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Crystalsage - sorry for the delay. Your proposal looks great!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Crystalsage - are you append your suggested changes to this branch please?

@aornugent aornugent mentioned this pull request Dec 20, 2022
12 tasks
@aornugent
Copy link
Contributor Author

@YashKandalkar @Namyalg - I need some help remembering where this got to. Is anyone able to run this locally for me please? I don't have access to my usual workstation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants