
Indian Districts for COVID19 #328

Closed
wants to merge 25 commits into from

Conversation

Contributor

@edumorlom edumorlom commented Oct 26, 2020

Import for Indian Districts and States from covid19india.org.
Each state has its own API.
Place names are resolved by wikidataId.



@google-cla google-cla bot added the cla: yes label Oct 26, 2020
Contributor

@tjann tjann left a comment

I'll review the script after getting a better idea from the current comments!

"Arunachal Pradesh":"Q1162",
"Andhra Pradesh":"Q1159",
"Andaman and Nicobar Islands":"Q40888"
}
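For context, the fragment above is the tail of the State-to-wikidataId map. A minimal sketch of how such a map can be turned into place dcids, assuming the usual Data Commons `wikidataId/` dcid prefix for wikidata-resolved places (`state_dcid` is a hypothetical helper, not code from this PR):

```python
from typing import Optional

# Illustrative fragment of the State -> wikidataId map from this PR.
STATES = {
    "Arunachal Pradesh": "Q1162",
    "Andhra Pradesh": "Q1159",
    "Andaman and Nicobar Islands": "Q40888",
}


def state_dcid(state_name: str) -> Optional[str]:
    """Return the Data Commons dcid for a state, or None if unmapped."""
    qid = STATES.get(state_name)
    return f"wikidataId/{qid}" if qid else None
```

For example, `state_dcid("Andhra Pradesh")` yields `"wikidataId/Q1159"`, while an unmapped name returns `None` so the caller can skip it.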
Contributor

How did you come up with this map?

Contributor Author

The places' dcids are resolved by wikidataId.
The map of State->District->wikidataId was generated as follows:

  1. The wikidataId for each place was queried using the place_resolver.go script.
  2. A script was run against wikidata.org/wiki/${wikidataId} to verify that each place is both a District and part of India.
  3. A manual check was performed to ensure that the names match.
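Step 2 above could be sketched as follows. This is a hedged illustration, not the PR's actual script: it assumes a wbgetentities-style entity payload from the Wikidata API, where Q668 is India and P17 is the "country" property; the fetch itself is omitted so the check is pure logic.

```python
from typing import Any, Dict, Iterator

INDIA_QID = "Q668"  # the Wikidata item for India


def _claim_qids(entity: Dict[str, Any], prop: str) -> Iterator[str]:
    """Yield the item QIDs asserted for a property in a wbgetentities-style entity."""
    for claim in entity.get("claims", {}).get(prop, []):
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value", {})
        if isinstance(value, dict) and "id" in value:
            yield value["id"]


def belongs_to_india(entity: Dict[str, Any]) -> bool:
    """True if the entity's country (property P17) is India."""
    return INDIA_QID in _claim_qids(entity, "P17")
```

A check for "is a District" would work the same way against the instance-of claims, and the results can be written to a CSV for the manual pass in step 3.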

Contributor

@tjann tjann Nov 25, 2020

  1. I see. Does the data have any better IDs? This is fine if it's name only.
  2. Do you have that code too? Should we check it in?
  3. How many places did you manually check? Ballpark is fine.

Contributor

@pradh do you have any opinions about manifesting this import to prod if we are using place name resolver to get the Wikidata IDs of Indian districts?

Contributor Author

@tjann I have added the script that checks against wikidata to ensure each wikidataId is correct.
It goes through all the wikidataIds and exports a CSV with the wikidata name and whether the place belongs to India.
That makes it easy to manually verify that all of them are correct. I added a README.md too.

Contributor

@pradh pradh Dec 17, 2020

This dataset does not have any IDs, right? So the approach of mapping to wikidataId via some heuristics (including the place name resolver) plus a manual check sounds reasonable...

Review comments (now resolved) on:
  scripts/covid19indiaORG/Config.py
  scripts/covid19indiaORG/README.md
@edumorlom edumorlom requested a review from tjann November 25, 2020 21:49
@edumorlom
Contributor Author

Ready for re-review!

  1. Added 3 unit tests; instead of making API calls, they read JSON from a file.
  2. Minor modifications to the code to allow testing.
  3. Modifications to README.md.
  4. I no longer use the State name, only the State isoCode and district name.
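One way the "read JSON from a file instead of calling the API" pattern in point 1 could look, as a minimal sketch (the function and field names here are hypothetical, not the PR's actual code): the parser takes any file-like object, so a test can feed it an `io.StringIO` fixture rather than a live endpoint.

```python
import io
import json


def parse_state_payload(fp) -> dict:
    """Parse a covid19india-style district payload from a file-like object.

    Because the function takes a file object rather than a URL, unit tests
    can pass a fixture file or an io.StringIO instead of making an API call.
    """
    data = json.load(fp)
    return {d["district"]: d["confirmed"] for d in data["districts"]}


# Example: feeding the parser an in-memory fixture.
fixture = io.StringIO(
    '{"districts": [{"district": "Nicobars", "confirmed": 5}]}')
counts = parse_state_payload(fixture)
```

In a real test the fixture would come from a checked-in JSON file, keeping the tests hermetic.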

@edumorlom edumorlom requested a review from tjann December 2, 2020 21:38
Comment on lines 67 to 75
# Read the CSV files and generate DataFrames from them.
actual_df = pd.read_csv(output_path)
expected_df = pd.read_csv(expected_path)

# Assert that both dataframes are equal, regardless of column order and dtype.
pd.testing.assert_frame_equal(
    actual_df.sort_index(axis=1),
    expected_df.sort_index(axis=1),
    check_dtype=False)
Contributor

You could just do string comparison instead of reading into a pd DataFrame.

Contributor Author

I can do that, but then I'd have to ensure the column order is the same.

Contributor

Oh, I see. This is a result of using pd DataFrames in the library? We can keep this, then.

Contributor Author

Yeah, it is, but I changed it: it now always exports the columns in alphabetical order, so the CSV files should be identical.
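The alphabetical-column export described here can be sketched in a couple of lines with pandas (`export_csv` is a hypothetical helper name; the PR's actual function may differ):

```python
import io

import pandas as pd


def export_csv(df: pd.DataFrame, path_or_buf) -> None:
    """Write a DataFrame with columns sorted alphabetically.

    Sorting the columns before export makes the output CSV deterministic,
    so two exports of the same data compare equal byte-for-byte regardless
    of the order in which the columns were assembled.
    """
    df.sort_index(axis=1).to_csv(path_or_buf, index=False)


# Example: columns were built as b, a but are exported as a, b.
buf = io.StringIO()
export_csv(pd.DataFrame({"b": [2], "a": [1]}), buf)
```

With deterministic column order, a test can compare raw CSV strings instead of parsing both files back into DataFrames.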

downloaded_data: Dict[str, Dict] = _download_data(data_source)

# If there is no wikidataId for the state, skip it.
if iso_code not in STATES:
Contributor

Can you tell me roughly how often this happens? Maybe also leave a note in the README.

Contributor Author

It doesn't really happen at all; it's just an edge case in case they add some other form of state that we don't have in the hashmap.

@tjann tjann requested a review from shifucun December 4, 2020 19:17
@edumorlom edumorlom requested a review from tjann December 7, 2020 05:39
@edumorlom
Contributor Author

Ready for re-review!

  1. Added the script that checks against wikidata to ensure the wikidataIds are correct; it includes a README.md.
  2. Removed default parameters.
  3. Tests now use StringIO.
  4. Minor changes to code and documentation.

Thanks

@edumorlom edumorlom requested a review from pradh December 18, 2020 06:34
@google-cla google-cla bot added cla: no and removed cla: yes labels Oct 18, 2023
@edumorlom edumorlom closed this Jun 7, 2024