feat(ingest): Use pycountry and fuzzywuzzy to format countries #3026

anna-parker · 2024-10-18T16:08:36Z

resolves #

preview URL: https://use-pycountry.loculus.org/

Discussion: https://docs.google.com/document/d/1TQiE66Hk6WjgkMvvMMhu8uMSEhHFP2AwINMAIsyODCw/edit

Summary

Pycountry is a popular Python library that lists all ISO 3166 countries, subdivisions, and historic countries, including their official names, country codes, and subdivision level. Fuzzywuzzy is another popular Python library that matches strings using the Levenshtein distance (an edit-distance algorithm). Its scoring can be adjusted to weight matches higher when individual words in the string match.
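
For readers unfamiliar with edit distance, here is a minimal pure-Python sketch of the idea. It is not fuzzywuzzy's implementation: the `similarity` function and its 0-100 scaling are assumptions made for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> int:
    """Hypothetical 0-100 score derived from the edit distance (case-insensitive)."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    return round(100 * (1 - levenshtein(a, b) / longest))
```

With this scoring, a single dropped letter in a short region name still scores high, which is why a threshold has to be tuned to avoid false matches.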

Takes an INSDC geolocation string in the format country: division.
Returns the country and attempts to split the division into geoLocAdmin1 and geoLocAdmin2.
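
As a rough illustration of the first step, the split on the first colon could look like the sketch below. The function name `split_geo_location` is hypothetical and is not taken from the ingest code.

```python
from typing import Tuple


def split_geo_location(raw: str) -> Tuple[str, str]:
    """Split 'country: division' on the first colon; division is '' if absent."""
    country, _, division = raw.partition(":")
    return country.strip(), division.strip()
```

For example, `split_geo_location("Germany: Bavaria, Munich")` yields the country `Germany` and the division `Bavaria, Munich`, which is then passed on for the geoLocAdmin1 and geoLocAdmin2 split.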

Use pycountry for the official list of geoLocAdmin1 options and abbreviations
Attempt an exact match of each division substring (split by ",") to a geoLocAdmin1 name
Attempt an exact match of each division substring (split by "\s" or ",") to a geoLocAdmin1 abbreviation
Attempt a fuzzy match of each division substring (after removing common administrative division terms) to a geoLocAdmin1 name
If there is no match, return the division as geoLocAdmin2
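
The matching cascade above can be sketched roughly as follows. This is a sketch under stated assumptions, not the PR's code: the `ADMIN1_BY_ABBR` table stands in for the pycountry subdivision list, `difflib.SequenceMatcher` stands in for fuzzywuzzy, and the administrative terms, the threshold, and the handling of leftover text as geoLocAdmin2 are all hypothetical.

```python
import re
from difflib import SequenceMatcher
from typing import Optional, Tuple

# Hypothetical stand-in for one country's subdivisions (abbreviation -> name).
ADMIN1_BY_ABBR = {"BY": "Bavaria", "BE": "Berlin", "HE": "Hesse"}
# Hypothetical administrative terms stripped before fuzzy matching.
ADMIN_TERMS = re.compile(r"\b(state|province|region|district|county)\b", re.IGNORECASE)
FUZZY_THRESHOLD = 0.85  # assumed cut-off, not the value used in the PR


def match_division(division: str) -> Tuple[Optional[str], str]:
    """Return (geoLocAdmin1, geoLocAdmin2) for one division string."""
    by_name = {name.lower(): name for name in ADMIN1_BY_ABBR.values()}
    parts = [p.strip() for p in division.split(",") if p.strip()]

    def rest(index: int) -> str:
        # Assumption: the unmatched comma-separated parts become geoLocAdmin2.
        return ", ".join(p for i, p in enumerate(parts) if i != index)

    # 1. Exact name match on comma-separated substrings.
    for i, part in enumerate(parts):
        if part.lower() in by_name:
            return by_name[part.lower()], rest(i)

    # 2. Exact abbreviation match on whitespace- or comma-separated tokens.
    tokens = [t for t in re.split(r"[\s,]+", division) if t]
    for i, token in enumerate(tokens):
        if token in ADMIN1_BY_ABBR:
            remaining = " ".join(t for j, t in enumerate(tokens) if j != i)
            return ADMIN1_BY_ABBR[token], remaining

    # 3. Fuzzy match after removing administrative terms.
    for i, part in enumerate(parts):
        cleaned = " ".join(ADMIN_TERMS.sub(" ", part).split()).lower()
        for key, name in by_name.items():
            if cleaned and SequenceMatcher(None, cleaned, key).ratio() >= FUZZY_THRESHOLD:
                return name, rest(i)

    # 4. No match: the whole division is returned as geoLocAdmin2.
    return None, division.strip()
```

Running the exact steps before the fuzzy step keeps unambiguous inputs from being affected by the threshold. Only strings that fail both exact checks are compared approximately.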

Tests

The comparison script at https://github.com/loculus-project/loculus/blob/show_geoloc_diffs/ingest/scripts/comparison.py can be used to compare the output against main and augur curate.

Here you can see a sorted TSV of the changes the two options will make:

PR Checklist

Update version_comment field with previous geolocation information if changed due to fuzzy matching
Add an additional geoLocAdmin1 curation list for cases that are not caught, e.g. Attica.

More clean up Move checks from snakefile to config fix config update deployment update tests ci Add trigger from db option Fix cronjob Fix link to config-file fix deployment install package in dockerfile install at correct location Remove snakemake as no longer needed Add missing dependency try to debug Create an XmlNone dataclass - this is required since package update test threads stop revert exception test test upload to ena dev still works on preview Make sure test is set correctly!!! remove debug print statements Improve logs Fix merge errors Update ena-submission/README.md Co-authored-by: Cornelius Roemer <[email protected]> Apply suggestions from code review Co-authored-by: Cornelius Roemer <[email protected]> Cronjob: create results directory before writing to it format authors in prepro Fix ingest try to fix pattern simplify regex fix check Add tests # Conflicts: # preprocessing/nextclade/tests/test.py Add to ena submission fix fix other edge case Update ena-submission/scripts/ena_submission_helper.py Co-authored-by: Cornelius Roemer <[email protected]> Update ena-submission/scripts/ena_submission_helper.py Update ena-submission/scripts/ena_submission_helper.py Co-authored-by: Cornelius Roemer <[email protected]> Update ena-submission/scripts/ena_submission_helper.py Update ingest/scripts/prepare_metadata.py Update preprocessing/nextclade/src/loculus_preprocessing/processing_functions.py rename Update reformat_authors_from_genbank_to_loculus Additionally format authors with correct white space Improve error message add tests fix missing pattern improve error logs fix error Update preprocessing/nextclade/src/loculus_preprocessing/processing_functions.py improve logging more feat(ingest): Do not use processed tsv but raw jsonl when ingesting data from NCBI Virus (#2990) * Use raw jsonl instead of generated tsv when ingesting data from NCBI virus * Do not require authors list to end in ';', capitalize names correctly. 
* Add tests for capitalization * Add a warning if author list might be in wrong format * Add ascii specific warning * Add tests for warnings and errors * Only capitalize if full authors string is upper case * Properly capitalize initial * Move titlecase option to ingest only - add ingest tests Move author formatting functions to format_ncbi_metadata as this is a more logical location Remove duplicate group name # Conflicts: # ena-submission/scripts/get_ena_submission_list.py # ena-submission/src/ena_deposition/config.py

anna-parker added 2 commits October 18, 2024 16:58

Use pycountry to check if geoLocAdmin1 is valid

3c78e72

anna-parker added the preview Triggers a deployment to argocd label Oct 18, 2024

anna-parker added 16 commits October 18, 2024 18:40

fix

9f51abf

wupps

235ec75

Kosovo is not in library

8632ca8

add abbreviations

eca9109

Use fuzzy matching

5da13c2

import fuzzywuzzy

166bed0

Only accept ASCII characters in region names

e67f204

Be stricter

78e4aa2

decrease again

4c736a7

Use fuzz.partial_ratio

cc8f1e8

Improve match by removing administrative regions from string

633e69d

fix logic bug

d0b63e6

make match stricter

787d74a

tweak

f8f52a0

improve

8be30c7

Only use top level subdivision (some countries have ISO codes for geo…

58176c9

…LocAdmin2 as well)

anna-parker mentioned this pull request Oct 20, 2024

feat(ingest): Format ingested country: division into geoLocCountry, geoLocAdmin1, geoLocAdmin2 #2991

Closed

2 tasks

anna-parker added 5 commits October 20, 2024 20:27

Refactor and also check historical country list

5ab4d10

Add compare script

b44ba1a

small fix

8938ab9

Fix tests

61821b7

Add country codes of historic countries

2f9a9c6

anna-parker changed the title ~~Use_pycountry~~ feat(ingest): Use pycountry and fuzzywuzzy to format countries Oct 21, 2024

delete comparison script

0211de6

anna-parker force-pushed the format_authors branch 2 times, most recently from c3cb79c to 6e05226 Compare October 22, 2024 09:38

anna-parker removed the preview Triggers a deployment to argocd label Oct 24, 2024

anna-parker force-pushed the format_authors branch from 8fc0a47 to 0f60fd4 Compare October 25, 2024 16:11

anna-parker mentioned this pull request Oct 29, 2024

Curate NCBI geolocation metadata #3105

Open

Base automatically changed from format_authors to main October 29, 2024 12:16
