Merge branch 'master' of https://github.com/datacommonsorg/data into statvar
ajaits committed Nov 16, 2023
2 parents f55b6eb + f34efbf commit 6fa62ff
Showing 34 changed files with 1,644 additions and 2,029 deletions.
3 changes: 3 additions & 0 deletions .gitmodules
@@ -1,3 +1,6 @@
 [submodule "scripts/un/sdg/sdg-dataset"]
 	path = scripts/un/sdg/sdg-dataset
 	url = https://code.officialstatistics.org/undata2/data-commons/sdg-dataset.git
+[submodule "scripts/un/sdg/sssom-mappings"]
+	path = scripts/un/sdg/sssom-mappings
+	url = https://code.officialstatistics.org/undata2/sssom-mappings.git
5 changes: 3 additions & 2 deletions requirements.txt
@@ -20,10 +20,10 @@ google-cloud-scheduler==2.10.0
 gspread
 lxml==4.9.1
 matplotlib==3.3.0
-netCDF4
+netCDF4==1.6.4
 numpy
 openpyxl==3.0.7
-pandas==1.3.5
+pandas
 pylint
 pytest
 rasterio
@@ -39,3 +39,4 @@ xlrd==1.2.0
 yapf
 zipp
 beautifulsoup4
+ratelimit
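
The newly added `ratelimit` dependency suggests throttled HTTP calls somewhere in the import pipeline. A minimal sketch of the usage pattern this package provides — the URL, call budget, and period below are illustrative assumptions, not taken from this commit:

```python
import requests
from ratelimit import limits, sleep_and_retry

@sleep_and_retry              # sleep until the current window has capacity
@limits(calls=10, period=60)  # hypothetical budget: 10 calls per minute
def fetch_json(url):
    """Rate-limited GET; the URL and limits are illustrative only."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()
```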
1 change: 1 addition & 0 deletions scripts/un/sdg/.gitattributes
@@ -1,3 +1,4 @@
 csv/* filter=lfs diff=lfs merge=lfs -text
 schema/* filter=lfs diff=lfs merge=lfs -text
 dc_generated/* filter=lfs diff=lfs merge=lfs -text
+geography/* filter=lfs diff=lfs merge=lfs -text
50 changes: 44 additions & 6 deletions scripts/un/sdg/README.md
@@ -1,15 +1,36 @@
 # UN Stats Sustainable Development Goals
 
-This import includes country, city, and select region-level data from the [UN SDG Global Database](https://unstats.un.org/sdgs/dataportal). Data is read from the submodule `sdg-dataset` which is managed by UN Stats.
+This import includes data from the [UN SDG Global Database](https://unstats.un.org/sdgs/dataportal). Data is read from the submodule `sdg-dataset`, which is managed by UN Stats. Geography mappings are read from the submodule `sssom-mappings`, which is also managed by UN Stats. Please ensure the submodules stay up to date.
 
 ## One-time Setup
 
-To generate city dcids:
+Initialize submodules:
 ```
-python3 cities.py <DATACOMMONS_API_KEY>
+git submodule update --init --remote sdg-dataset
+git submodule update --init --remote sssom-mappings
 ```
-(Note: many of these cities will require manual curation, so this script likely should not be rerun.)
 
-To process data and generate artifacts:
+## Data Refresh
+
+Update submodules:
+```
+git submodule update --remote sdg-dataset
+git submodule update --remote sssom-mappings
+```
+
+Generate place mappings:
+```
+python3 geography.py
+```
+Produces:
+* geography/ folder:
+    * un_places.mcf (place mcf)
+    * un_containment.mcf (place containment triples)
+    * place_mappings.csv (map of SDG code -> dcid)
+
+Note that `place_mappings.csv` is required before running the `process.py` script.
+
+Process data and generate artifacts:
 ```
 python3 process.py
 ```
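
The `place_mappings.csv` generated by `geography.py` above is the bridge from SDG geography codes to Data Commons dcids. A minimal loading sketch for a downstream consumer, assuming hypothetical column names `sdg_code` and `dcid` — check the generated file for the actual header:

```python
import csv

def load_place_mappings(path="geography/place_mappings.csv"):
    """Return a dict keyed by SDG geography code with dcid values."""
    with open(path, newline="") as f:
        # Column names "sdg_code" and "dcid" are hypothetical; verify them
        # against the header of the generated file before relying on this.
        return {row["sdg_code"]: row["dcid"] for row in csv.DictReader(f)}

place_map = load_place_mappings()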
@@ -23,9 +44,26 @@ Produces:
 * unit.mcf
 * csv/ folder:
     * [CODE].csv
-(Note that the `schema/` folder is not included in the repository but can be regenerated by running the script.)
+
+(Note that these folders are not included in the repository but can be regenerated by running the script.)
+
+When refreshing the data, the `geography`, `schema`, and `csv` folders might all get updated and will need to be resubmitted to g3. The corresponding TMCF file is `sdg.tmcf`.
 
 To run unit tests:
 ```
 python3 -m unittest discover -v -s ../ -p "*_test.py"
 ```
+
+Notes:
+* We currently drop certain series and variables (refer to `util.py` for the list) which have been identified by the UN as potentially containing outliers.
+
+## SDMX
+
+For reference, the `sdmx/` folder contains an earlier version of the import scripts, which used the UN API (SDMX). Please note that these scripts may have errors and do not use the most up-to-date schema format, so they should only be used as an illustration of the SDMX -> MCF mapping and **should not actually be run**.
+
+As a quick overview:
+* `preprocess.py` downloads all the raw input CSVs to an `input/` folder and adds all dimensions and attributes to a `preprocessed/` folder.
+* `cities.py` reads the input CSVs and matches cities with dcids.
+* `process.py` reads the input CSVs and concepts, and generates a cleaned CSV and schema.
+* `util.py` has various shared util functions and constants.
+* `m49.csv` has country code mappings.
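
To make the archived SDMX flow concrete, a rough sketch of the kind of per-series download `preprocess.py` performed; the UN SDG API base URL, endpoint, and parameter name here are assumptions for illustration and are not verified against the archived script:

```python
import pathlib
import requests

# Assumed base URL for the UN SDG API; verify against the archived scripts.
API_BASE = "https://unstats.un.org/sdgapi/v1/sdg"

def download_series(series_code, out_dir="input"):
    """Fetch one SDG series as CSV and save it under input/ (sketch only)."""
    response = requests.get(
        f"{API_BASE}/Series/Data",           # endpoint name is an assumption
        params={"seriesCode": series_code},  # parameter name is an assumption
        timeout=60,
    )
    response.raise_for_status()
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    path = out / f"{series_code}.csv"
    path.write_bytes(response.content)
    return path
```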