
Developer Notes

Additional notes for contributors that are not covered elsewhere

Nbdev and Pip

We use a combination of nbdev and pip to maintain the package dependencies of the project. Please make sure to upgrade to the latest version of pip (version 22.1 or greater) to ensure support for developing geowrangler as an editable install. Follow the instructions in the DEVELOPMENT and CONTRIBUTING documents to set up a local development environment.

Jupyter Notebooks

One of the things that make geowrangler different from other geospatial packages (or even other projects) is its use of nbdev, which makes it possible to develop Python packages alongside their documentation using Jupyter notebooks. See the section on the Documentation site below.

  • All the module code in geowrangler is built from Jupyter notebooks residing in the notebooks folder. However, not all notebooks in the notebooks folder contribute a code module for geowrangler; some of them are tutorials or provide an overview of the geowrangler project.

    • The implementation notebooks follow the 'XX_<module_name>.ipynb' naming format, where XX is an arbitrary (possibly repeating) number.
    • The tutorial notebooks follow the 'tutorial.<topic>.ipynb' naming format and are usually shown in the documentation sidebar under the Tutorials section.
    • The overview notebook, 'index.ipynb', becomes the overview (index.html) page of the documentation site.
  • To extract the code from the implementation notebooks into the module code residing in the geowrangler folder, run nbdev_export. If the module code already exists, it is overwritten with the latest version of the notebook. However, if a notebook is deleted, the module code it generated is not automatically deleted -- it must be removed manually.

This also makes it possible to create module code directly (without a matching notebook) if need be, as long as no notebook targets the same module in its default_exp comment. A sample implementation-notebook cell is shown after this list.

  • The tutorial notebooks are an important component of geowrangler's documentation -- along with the reference documentation, they provide examples of how geowrangler's modules can be used to wrangle geospatial data.
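
As noted in the list above, implementation notebooks are wired up for nbdev_export with nbdev directives. The sketch below illustrates the pattern; the notebook path, module name, and function are made up for illustration and are not taken from the geowrangler codebase, while the #| default_exp and #| export directives are standard nbdev (nbdev2) syntax that must appear at the top of their cells.

    #| default_exp demo
    # The directive above goes in the first cell of a hypothetical
    # notebooks/00_demo.ipynb; nbdev_export then writes all exported
    # cells into geowrangler/demo.py.

    #| export
    def describe_grid(rows: int, cols: int) -> str:
        "Return a short description of a rows x cols grid (illustrative only)."
        return f"{rows} x {cols} grid with {rows * cols} cells"

    # Cells without an #| export directive (examples, prose, plots, asserts)
    # stay in the notebook and only show up in the documentation:
    assert describe_grid(2, 3) == "2 x 3 grid with 6 cells"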

Open in Colab

  • We also want to encourage exploration of the geowrangler package by providing an Open in Colab button for all the Jupyter notebooks (especially the tutorials).

    Note: Currently, each notebook must be edited manually to include an "Open in Colab" button.

    • To make a notebook runnable in Colab, a few additional steps have to be taken, such as pip-installing the geowrangler package.
      • We also need to make sure these extra steps run only when the notebook is executed in Colab. This is done with the bash test expression [ -e /content ], which relies on the assumption that only the Colab environment has a root-level directory named content. So to run a command like pip install <my-package> only in Colab, we add ! [ -e /content ] && pip install <my-package>, which executes the pip install only if the /content directory exists (see the sample cell after this list).
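
Put together, a typical Colab-only setup cell near the top of a notebook might look like the sketch below (assuming the package is published on PyPI under the name geowrangler; the ! prefix passes the line to the shell from a Jupyter/Colab notebook cell).

    # Runs the install only on Colab, where the /content directory exists
    ! [ -e /content ] && pip install geowrangler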

Sample Tutorial Datasets

  • Another way to encourage exploration is to provide sample datasets (usually .geojson files) for the tutorials. In the repository, these sample datasets are stored in the data directory, so if we clone the repo there is a ../data directory (relative to the notebooks directory where the tutorial and implementation notebooks reside).
    • If the notebooks are copied and loaded individually into a Jupyter environment (such as Colab), the ../data directory might not have been created and the datasets might not have been downloaded. The tutorial notebooks therefore include another check that verifies the datasets in the ../data directory are present and downloads them if they are not (see the sketch after this list).
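
A data-loading cell with such a guard might look roughly like the sketch below; the file name and URL are placeholders for illustration, not the actual dataset locations used in the tutorials.

    from pathlib import Path
    from urllib.request import urlretrieve

    DATA_DIR = Path("../data")
    SAMPLE_FILE = DATA_DIR / "sample_areas.geojson"               # placeholder file name
    SAMPLE_URL = "https://example.com/data/sample_areas.geojson"  # placeholder URL

    # Create ../data if the notebook was copied outside the repo (e.g. into Colab),
    # then download the dataset only if it is not already present.
    DATA_DIR.mkdir(parents=True, exist_ok=True)
    if not SAMPLE_FILE.exists():
        urlretrieve(SAMPLE_URL, SAMPLE_FILE)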

Unit tests

  • The geowrangler project maintains a set of unit tests in the tests folder. We also check that the unit tests cover at least 80 percent of the module code, and we strongly encourage contributors to keep coverage as close as possible to 100 percent. To check that the unit tests are passing and that test coverage is at least 80 percent, run the following command:
pytest --cov --cov-config=.coveragerc --cov-fail-under=80 -n auto --cov-report=html

This will not only check whether the tests pass and whether code coverage is at least 80 percent, but will also generate an HTML report in the htmlcov folder showing which lines of code were not executed during the test run.

CI/CD Pipelines

  • The project also has several automated CI/CD pipelines enabled (see the .github/workflows folder):
    • pytest.yaml, which checks that code being merged into master has passing unit tests and at least 80 percent code coverage

    • deploy.yaml, which builds the final version of the documentation and publishes it to geowrangler's documentation site (geowrangler.thinkingmachin.es)

The Documentation site

  • Providing good, up-to-date documentation is a top priority for the geowrangler project and is one of the primary reasons we adopted nbdev -- in our workflow, the implementation notebooks that generate geowrangler's modules are the same notebooks that generate the reference documentation. This means that all of geowrangler's classes, methods, and parameters are easily kept in sync with the documentation.

This also has the benefit of making all the tutorials and reference docs "executable", which (along with the Open in Colab button) significantly lowers the barrier to exploring geowrangler's modules.

  • Since the documentation is generated from the notebooks anyway, the geowrangler project has been set up so that the generated pages are NOT checked into the repository. We have also adopted the latest version of nbdev (nbdev2), which uses Quarto to generate the site.