normalize package names #24

Open
cmungall opened this issue Sep 29, 2020 · 4 comments

@cmungall
Collaborator

This should be done as a pre-processing step, as part of the overall ETL pipeline, so that individual analyses do not each need to do their own normalization.

Currently done for water packages here:
https://nbviewer.jupyter.org/github/INCATools/biosample-analysis/blob/master/src/notebooks/water-package-profiling.ipynb

I am envisioning a general toolkit that performs this kind of repair on the whole TSV
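A minimal sketch of what such a pre-processing step could look like, using only the Python standard library. The function names `add_normalized_column` and `simple_normalize` are hypothetical, the trim-and-lower-case rule is only a placeholder, and the output column name `norm_env_package` is borrowed from the discussion further down this thread. The original column is left untouched and the repaired value is written alongside it.

```python
import csv
import io


def add_normalized_column(src, dst, normalize, column="env_package",
                          new_column="norm_env_package"):
    """Copy a TSV from src to dst, appending a normalized copy of one column."""
    reader = csv.DictReader(src, delimiter="\t")
    fieldnames = list(reader.fieldnames) + [new_column]
    writer = csv.DictWriter(dst, fieldnames=fieldnames, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Keep the original value; write the repaired one next to it.
        row[new_column] = normalize(row[column] or "")
        writer.writerow(row)


def simple_normalize(value):
    # Placeholder rule (an assumption): trim whitespace and lower-case.
    return value.strip().lower()


# Tiny in-memory example; a real run would pass open file handles instead.
sample = "id\tenv_package\nS1\tHuman-gut\nS2\twater\n"
out = io.StringIO()
add_normalized_column(io.StringIO(sample), out, simple_normalize)
result = out.getvalue()
```

Passing the normalization rule in as a function keeps the TSV handling separate from the mapping decisions, which are still under discussion here.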

@wdduncan
Collaborator

My workflow for normalizing the water package names was like this:

  • find all env_packages containing the string 'water'
  • drop packages that don't seem to be for water; e.g. 'freshwater sediment', 'wastewater sludge'
  • Map obvious names to water; e.g., 'MIGS/MIMS/MIMARKS.water' => 'water'
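The three steps above can be sketched roughly as follows. This is not the notebook's code, just an illustration under stated assumptions: the function name `normalize_water` is hypothetical, and the two lookup sets contain only the example values mentioned in this comment rather than the full lists used in the actual analysis.

```python
# Values that contain 'water' but do not seem to be the water package
# (only the two examples given in the comment).
NOT_WATER = {"freshwater sediment", "wastewater sludge"}

# Obvious spellings of the water package.
OBVIOUS_WATER = {"water", "migs/mims/mimarks.water"}


def normalize_water(env_package):
    """Return the normalized value, or None if the value is not kept."""
    key = env_package.strip().lower()
    if "water" not in key:
        return None  # step 1: not a candidate at all
    if key in NOT_WATER:
        return None  # step 2: contains 'water' but is a different package
    if key in OBVIOUS_WATER:
        return "water"  # step 3: obvious name, map to 'water'
    return env_package  # anything else keeps its original value


values = ["MIGS/MIMS/MIMARKS.water", "freshwater sediment", "sea water", "soil"]
normalized = {v: normalize_water(v) for v in values}
```

Values such as 'sea water' fall through to the last branch and keep their original spelling, which matches the behaviour described in the rest of this comment.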

I put the mapped env_package values in a field named 'norm_env_package'. Perhaps we should reference the MIxS package label instead; e.g., 'mixs5_env_package'. That would make it clearer which env_package name we are normalizing on.

Also, I left some env_package values as their original value in the norm_env_package field; e.g., 'sea water', 'waste water'. On reflection, I think I should have mapped these to 'water' because that is the name of the MIxS package. The original values are still in the env_package field.

But do we also want to normalize names within the subset of 'water' packages? For example, do we want to map 'wastewater' and 'waste water' to a single normalized name?

My proposal for normalization mappings:

  1. Standardize differences in spelling, capitalization, etc. in the normalized_env_package field. E.g., map 'waste water', 'wastewater', and 'MIGS/MIMS/MIMARKS.wastewater' to normalized_env_package: 'waste water'.

  2. Create a 'mixs5_env_package' field to map to the env packages in the standard. E.g., 'waste water' would map to 'water'.
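A small sketch of how the two proposed fields could be filled in, assuming simple lookup tables. The tables `SPELLING` and `MIXS5` and the function `normalize` are hypothetical and cover only the examples from this comment; a real mapping would need to be curated against the full list of observed values.

```python
PREFIX = "MIGS/MIMS/MIMARKS."

# Step 1: spelling variants -> one normalized spelling (illustrative only).
SPELLING = {
    "wastewater": "waste water",
    "waste water": "waste water",
}

# Step 2: normalized name -> env package in the MIxS standard (illustrative only).
MIXS5 = {
    "water": "water",
    "waste water": "water",
    "sea water": "water",
}


def normalize(env_package):
    """Return the original value together with both proposed fields."""
    value = env_package.strip()
    if value.startswith(PREFIX):
        value = value[len(PREFIX):]  # drop the checklist prefix
    value = value.lower()
    normalized = SPELLING.get(value, value)
    return {
        "env_package": env_package,             # original value is preserved
        "normalized_env_package": normalized,   # step 1
        "mixs5_env_package": MIXS5.get(normalized),  # step 2, None if unmapped
    }


row = normalize("MIGS/MIMS/MIMARKS.wastewater")
```

Keeping the two steps in separate tables means the finer-grained name ('waste water') stays available for analysis even after it has been rolled up to the MIxS package.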

cc @cmungall @realmarcin

@wdduncan
Collaborator

@cmungall In the short term we can normalize on the controlled terms in the MIxS standard. But in the long term it would be good to normalize the package names by referencing URIs in the mixs-rdf project. We haven't created URIs for package names yet, but this seems like the next logical step.

@wdduncan
Collaborator

Ignore the link to the ENVO issue about medical infrastructure. I accidentally posted it here.

@wdduncan
Collaborator

wdduncan commented Apr 7, 2021

Counts of distinct package names, provided by @cmungall in the file target/distinct-env_package.tsv:

49254  host-associated
47921  human-gut
16367  water
13706  human-skin
12391  built environment
11976  soil
11715  misc environment
8453  missing
7882  human-oral
5969  sediment
3786  MIGS/MIMS/MIMARKS.soil
3167  human-associated
2988  MIGS/MIMS/MIMARKS.host-associated
2499  MIGS/MIMS/MIMARKS.human-gut
2129  microbial mat/biofilm
2076  plant-associated
1837  MIGS/MIMS/MIMARKS.water
1417  MIGS/MIMS/MIMARKS.human-associated
1189  MIGS/MIMS/MIMARKS.sediment
1154  MIGS/MIMS/MIMARKS.human-oral
1077  human-vaginal
1063  MIGS/MIMS/MIMARKS.plant-associated
741   MIGS/MIMS/MIMARKS.microbial
611   miscellaneous natural or artificial environment
558   MIGS/MIMS/MIMARKS.miscellaneous
479   mimarks
417   wastewater|sludge
406   mouse-gut
385   MIGS/MIMS/MIMARKS.wastewater
357   wastewater/sludge
283   unknown
212   Human-associated
206   MIGS/MIMS/MIMARKS.air
201   Human-oral
172   gut
171   host_associated
152   air
135   MIGS/MIMS/MIMARKS.human-skin
114   biofilm
111   Human-gut
107   human-not providedsopharyngeal
90   wastewater sludge
87   mice gut
61   built
60   CV
59   human gut
59   Human_Gut
51   default
48   microbial mat|biofilm
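To illustrate how much of this list is spelling variation, here is a sketch that merges a few of the rows above under one canonical key. The `canonical` rule (drop the 'MIGS/MIMS/MIMARKS.' prefix, lower-case, treat '_' and space as '-') is an assumption chosen for these particular rows, not an agreed normalization; applied blindly it would also rewrite names such as 'built environment'.

```python
from collections import Counter

PREFIX = "migs/mims/mimarks."

# A subset of the counts listed above.
COUNTS = {
    "human-gut": 47921,
    "MIGS/MIMS/MIMARKS.human-gut": 2499,
    "Human-gut": 111,
    "human gut": 59,
    "Human_Gut": 59,
    "host-associated": 49254,
    "MIGS/MIMS/MIMARKS.host-associated": 2988,
    "host_associated": 171,
}


def canonical(name):
    """Collapse prefix, case, and separator differences (assumed rule)."""
    key = name.strip().lower()
    if key.startswith(PREFIX):
        key = key[len(PREFIX):]
    return key.replace("_", "-").replace(" ", "-")


merged = Counter()
for name, count in COUNTS.items():
    merged[canonical(name)] += count
```

With this rule the five 'human-gut' spellings collapse into one entry with 50649 records, and the three 'host-associated' spellings into one entry with 52413.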
