- Environment Setup:
  - Clone the master branch of the repo
  - Create the conda environment from bash with `conda env create -f environment.yml`
  - Activate the conda environment from bash with `conda activate philmappingenv`
  - You're good to go
  - If you update / add packages, once the environment is activated, add them to the `environment.yml`, save the file, and from bash run `conda env update --file environment.yml`
- Contributions:
  - To contribute to this repo, please create a development branch (i.e. `myname_dev`) and open a PR against master for any contributions
  - Best practices for PR creation and management:
    - Ensure you're working on the branch you created with (from bash) `git checkout myname_dev`
    - Your working branch at this stage should be in sync with master exactly
    - Make your code changes and, once ready to open the PR, execute the below (from bash):

      ```bash
      git add .
      git commit -m "<your commit message here>"
      git push
      ```

    - At this point, open GitHub in the browser and go to the "branches" tab of the repo
    - Click "New Pull Request"
    - Fill out the details, add a reviewer unless your changes are known to the repo managers, and click through to open the PR
    - After it has been merged, execute the below (from bash) to ensure your branch is synced with master:

      ```bash
      git pull
      git merge origin/master
      git push
      ```

    - Rinse and repeat; you're good to go
The goal of this workstream is to use a single source of truth (SSOT) file of province/city/barangay pairings to clean up another file -- also containing these three levels of geographic granularity -- matching as many occurrences in the second (unclean) file as possible to the SSOT file.
- Repo Setup and Environment Management:
  - Create environment management files for repo:
  - Create branching setup for the repo:
    - Create a paul_dev branch to ensure we're working with best SDLC practices
- Build Geo Label Matching Logic:
  - Import and manage file that screens out non-ICM regions -- `raw_data/non_icm_loc.csv`:
    - Set Batangas and Bulacan to NOT be removed (done in Excel)
    - Add First, Second, Third, and Fourth (districts of Manila) to the file and set them to be removed (done in Excel)
    - Import the file and drop all NaNs so that all provinces listed can be used as a negative screen (sketched below)
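      A minimal pandas sketch of this step; the column names `province` and `remove` are assumptions about the file's schema, not confirmed by the repo:

      ```python
      import pandas as pd

      # Load the negative-screen file; "province" and "remove" are assumed column names.
      non_icm_df = pd.read_csv("raw_data/non_icm_loc.csv")

      # Drop NaNs so that every remaining province can act as a negative screen.
      non_icm_df = non_icm_df.dropna(subset=["province"])
      ```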
  - Create SSOT file (see the sketch after this list):
    - Import the `raw_data/new_locations.csv` file (raw data taken from this source here) and remove all rows whose province is contained in the `raw_data/non_icm_loc.csv` file accompanied by the value True (meaning it should be removed, as it is not part of the regions ICM serves)
    - Save out the newly created SSOT file -- `processed_data/ssot_df.csv` -- for reference and future use
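    A sketch of this negative screen, under the same assumed `province`/`remove` schema for the screen file plus an assumed `province` column in `new_locations.csv`:

    ```python
    import pandas as pd

    non_icm_df = pd.read_csv("raw_data/non_icm_loc.csv").dropna(subset=["province"])

    # Provinces flagged True are outside the regions ICM serves.
    excluded = non_icm_df.loc[non_icm_df["remove"] == True, "province"]

    new_locations_df = pd.read_csv("raw_data/new_locations.csv")
    ssot_df = new_locations_df[~new_locations_df["province"].isin(excluded)]
    ssot_df.to_csv("processed_data/ssot_df.csv", index=False)
    ```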
  - Clean up the unclean file -- `raw_data/original_locations.csv` -- and add new correct geo-mapping fields:
    - As done with the SSOT file, remove all rows from the unclean file whose province is contained in the `raw_data/non_icm_loc.csv` file accompanied by the value True (meaning it should be removed, as it is not part of the regions ICM serves)
    - Create `under_construction_df` -- a new go-forward DF which will be a copy of the unclean file with the new cleaned columns added
    - Create a new column -- `province_cleaned` -- to be appended to the `under_construction_df`, with the correct name for the province associated with each row (see the sketch after this list):
      - Create `province_mapping_df`, which will eventually serve as a mapping dictionary of unclean to clean names, but will start by simply storing all unique province names in the unclean file
      - Iterate over each unique province name in the unclean file, and check whether the value in the `province` column matches a value contained in the `province` column of the SSOT file (accounting for capitalization differences)
      - After making all the automated matches possible, perform the manual matching necessary based on additional research:
        - Manually match "City of Isabela (Capital)", "Cotabato", and "Davao Occidental"
      - Use the `province_mapping_df` (and any other custom logic needed) to create the new `province_cleaned` column
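      A sketch of the mapping logic above, assuming `under_construction_df` and `ssot_df` are already loaded and both use a `province` column; the targets of the manual matches come from the research and are left as placeholders here:

      ```python
      import pandas as pd

      # Map uppercased SSOT province names back to their canonical spelling.
      canonical = {p.upper(): p for p in ssot_df["province"].dropna().unique()}

      # province_mapping_df starts as all unique unclean province names...
      province_mapping_df = pd.DataFrame(
          {"unclean": under_construction_df["province"].dropna().unique()}
      )
      # ...then gains a "clean" column via case-insensitive matching.
      province_mapping_df["clean"] = province_mapping_df["unclean"].map(
          lambda name: canonical.get(str(name).strip().upper())
      )

      # Manual matches from the additional research; the true targets are not
      # reproduced in this log, so the values below are placeholders.
      manual_fixes = {
          "City of Isabela (Capital)": "...",
          "Cotabato": "...",
          "Davao Occidental": "...",
      }
      for unclean, clean in manual_fixes.items():
          province_mapping_df.loc[
              province_mapping_df["unclean"] == unclean, "clean"
          ] = clean

      # Apply the completed mapping to create the new column.
      mapping = dict(zip(province_mapping_df["unclean"], province_mapping_df["clean"]))
      under_construction_df["province_cleaned"] = under_construction_df["province"].map(mapping)
      ```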
    - Write out `processed_data/under_construction_df.csv` to log the work done so far
    - City and Barangay are trickier, so we need to solve them all at once with more detailed analysis:
      - Prior to jumping in, we'll prep by cleaning up some object formatting issues (e.g. the sketch below)
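        The exact fixes aren't enumerated in this log; a typical prep pass of this kind (column names assumed) might be:

        ```python
        # Normalize the comparison columns so string equality is meaningful:
        # cast to str, trim whitespace, and uppercase.
        for col in ["province", "city", "barangay"]:
            under_construction_df[col] = (
                under_construction_df[col].astype(str).str.strip().str.upper()
            )
            ssot_df[col] = ssot_df[col].astype(str).str.strip().str.upper()
        ```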
      - First we go for all the low-hanging fruit -- cities that we can match to the `ssot_df` because we can find an exact pairing of province, city, and barangay between the `just_geo_names_df` df and the `ssot_df`. We'll perform this matching via a left join of the `ssot_df` onto the `just_geo_names_df`. We'll then flag all the rows that were matched successfully with this simple method
      - Create a df with just the geo names we couldn't match to the `ssot_df` across all 3 geos so we can count the records still left to match. We'll do this multiple times from here on out until we arrive at 0 records we can't match (the join and the count are sketched below)
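        A sketch of the join-and-flag step plus the unmatched-record count, assuming the three key columns are named `province`, `city`, and `barangay` in both dfs:

        ```python
        keys = ["province", "city", "barangay"]  # assumed column names

        # Left join the SSOT onto the unclean names; the indicator column tells
        # us which rows found an exact province/city/barangay pairing.
        matched = just_geo_names_df.merge(
            ssot_df[keys].drop_duplicates(), on=keys, how="left", indicator=True
        )
        matched["matched_simple"] = matched["_merge"] == "both"

        # Rows still left to fix; recompute this after every round of cleaning.
        problematic_geo_names = matched.loc[~matched["matched_simple"], keys]
        print(f"{len(problematic_geo_names)} records left to match")
        ```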
      - (Round 1 of ad hoc research) Now let's make any fixes we noticed through ad hoc exploration and see how that affects our match rate
      - (Round 1 of ad hoc research) It appears we've spotted one trend that can be corrected algorithmically -- we should look for instances of city names that use the formulation "CITY OF xxxxxxx" and replace them with the formulation "xxxxxxx CITY" (sketched below)
      - (Round 1 of ad hoc research) Looks like the formulation change from "CITY OF xxxxxxx" to "xxxxxxx CITY" fixed 952 -- (8451-7499) -- records!
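        A sketch of that correction, assuming city names live in an already-uppercased `city` column:

        ```python
        # Rewrite "CITY OF X" as "X CITY" to match the SSOT's formulation.
        under_construction_df["city"] = under_construction_df["city"].str.replace(
            r"^CITY OF\s+(.+)$", r"\1 CITY", regex=True
        )
        ```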
      - (Round 2 of ad hoc research) It appears we've spotted another trend that can be corrected algorithmically -- we should look for instances where the barangay name doesn't match the SSOT because of the addition of the word or abbreviation for "population" -- poblacion (sketched below)
      - (Round 2 of ad hoc research) Looks like the change to strip all barangays of the (POB.) string fixed 1,375 -- (7499-6124) -- records.
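        A sketch of the strip, assuming the marker appears as the literal string "(POB.)" in a `barangay` column:

        ```python
        # Remove the "(POB.)" marker and any surrounding whitespace.
        under_construction_df["barangay"] = (
            under_construction_df["barangay"]
            .str.replace(r"\s*\(POB\.\)", "", regex=True)
            .str.strip()
        )
        ```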
      - (Round 3 of ad hoc research) It appears we've spotted another trend that can be corrected algorithmically -- we should look for instances where the barangay name doesn't match the SSOT because the barangay name is "empty", and delete those rows
      - (Round 3 of ad hoc research) Looks like the change to delete all rows where the "barangay" value was "EMPTY" fixed 203 -- (6124-5921) -- records.
      - (Round 4 of ad hoc research) It appears that my previous logic for cleaning up province names failed slightly, as it didn't account for instances of duplication (i.e. it tagged the province name of "LEYTE\n LEYTE" as correct because it does CONTAIN "LEYTE"). It should be a quick fix to manually remove these instances (both rounds are sketched below)
      - (Round 4 of ad hoc research) Looks like the change to remedy the duplicated "LEYTE" cleaned province names fixed 1,499 -- (5921-4422) -- records.
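        Sketches of the Round 3 and Round 4 fixes; the "EMPTY" sentinel and the duplicated-name pattern are as described above, and the column names are assumed:

        ```python
        # Round 3: drop rows whose barangay value is the literal string "EMPTY".
        under_construction_df = under_construction_df[
            under_construction_df["barangay"] != "EMPTY"
        ]

        # Round 4: collapse duplicated single-word province names such as
        # "LEYTE\n LEYTE" down to one token (multi-word names pass through).
        def dedupe_name(name):
            parts = str(name).split()
            if parts and all(part == parts[0] for part in parts):
                return parts[0]
            return name

        under_construction_df["province_cleaned"] = (
            under_construction_df["province_cleaned"].map(dedupe_name)
        )
        ```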
      - (Round 5 of ad hoc research) ... whatever I discover next in digging into the `problematic_geo_names` df to identify trends in mismatches between the `under_construction_df` and the `ssot_df` will go here.
    - Write out `processed_data/under_construction_df.csv` to log the work done so far
- The file `Region-Province-Names.pdf` (downloaded from this link here) is the complete official list of the current names for geographies as of 12/31/2019 per the official PSA (Philippine Statistics Authority)
  - We need to follow these names as the official region/province names. Neither `original_locations.csv` nor `new_locations.csv` may follow this naming schema, but it is the official Philippine naming convention (SSOT)
- LUZON, VISAYAS, and MINDANAO are not official regions but actually just the 3 subsections of the Philippines (Top, Middle, Bottom in that order)
- We should focus on the Province-City-Barangay match; but Regions can help us subsection the data