Indian Districts for COVID19 #328
Conversation
Force-pushed from 0e49568 to c8b4623
I'll review the script after getting a better idea from the current comments!
"Arunachal Pradesh": "Q1162",
"Andhra Pradesh": "Q1159",
"Andaman and Nicobar Islands": "Q40888"
}
How did you come up with this map?
The places' dcids are resolved by wikidataId.
The map of State -> District -> wikidataId was generated as follows:
- The wikidataId for each place is queried using the place_resolver.go script.
- A script is run against wikidata.org/wiki/${wikidataId} to verify that each place is both a District and part of India.
- A manual check was performed to ensure that the names match.
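The verification step in the second bullet could look roughly like this. This is a hypothetical reconstruction, not the PR's actual script: `is_in_india` and `claim_qids` are made-up names, and the entity JSON is assumed to have been fetched separately (e.g. from Wikidata's `Special:EntityData/${wikidataId}.json` endpoint). `P17` (country) and `Q668` (India) are real Wikidata identifiers; the "is a District" half would inspect the `P31` (instance of) claims the same way.

```python
from typing import Dict, Iterator

# Real Wikidata identifiers: Q668 = India, P17 = country.
INDIA_QID = "Q668"
COUNTRY_PROP = "P17"


def claim_qids(entity: Dict, prop: str) -> Iterator[str]:
    """Yield the item QIDs asserted for a property in an entity's claims."""
    for claim in entity.get("claims", {}).get(prop, []):
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value", {})
        if isinstance(value, dict) and "id" in value:
            yield value["id"]


def is_in_india(entity: Dict) -> bool:
    """True if the entity's country claims (P17) include India (Q668)."""
    return INDIA_QID in set(claim_qids(entity, COUNTRY_PROP))
```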
- I see. Does the data have any better IDs? This is fine if it's name only.
- Do you have that code too? Should we check it in?
- How many places did you manually check? Ballpark is fine.
@pradh do you have any opinions about manifesting this import to prod if we are using place name resolver to get the Wikidata IDs of Indian districts?
@tjann I have added the script that checks each wikidataId to ensure it is correct.
It goes through all the wikidataIds and exports a CSV with the Wikidata name and whether the place belongs to India.
That makes it easy to manually verify that all entries are correct. I added a README.md too.
This dataset does not have any IDs, right? So the approach of mapping to wikidataId via some heuristics (including the place name resolver) plus a manual check sounds reasonable...
…e name, now it uses state abbreviation instead for simplicity
Ready for re-review!
scripts/covid19indiaORG/run_tests.py
Outdated
# Read the CSV file and generate a DataFrame with it.
actual_df = pd.read_csv(output_path)
expected_df = pd.read_csv(expected_path)

# Assert that both dataframes are equal, regardless of order and dtype.
pd.testing.assert_frame_equal(
    actual_df.sort_index(axis=1),
    expected_df.sort_index(axis=1),
    check_dtype=False)
Can just do string checking instead of reading into pd dataframe
I can do that, but then I'd have to ensure the column order is the same.
Oh I see. This is a result of using pd df's in the library? We can keep this then
Yeah, it is, but I changed it: the columns are now always exported in alphabetical order, so the CSV files should be identical.
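The alphabetical-export idea amounts to something like the sketch below (`export_csv` is a hypothetical name; the PR's actual export code may differ). Sorting the columns before serializing makes two CSVs produced from the same data byte-identical, so plain string comparison works in tests.

```python
import pandas as pd


def export_csv(df: pd.DataFrame) -> str:
    """Serialize a DataFrame with columns sorted alphabetically, so CSVs
    built from the same data always compare equal as plain strings."""
    return df.sort_index(axis=1).to_csv(index=False)
```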
downloaded_data: Dict[str, Dict] = _download_data(data_source)

# If there is no wikidataId for the state, skip it.
if iso_code not in STATES:
Can you tell me roughly how often this happens? Maybe also leave a note in the README.
It doesn't really happen at all; it's just an edge case in case they add some other form of state that isn't in the hashmap.
Ready for re-review!
Thanks
Import for Indian Districts and States from covid19india.org.
Each state has its own API.
Place names are resolved by wikidataId.