Set analysis of ENVO terms used in INSDC Biosample Metadata vs curated subsets spreadsheet #3
@wdduncan @cmungall @elishawc @jagadishcs @mslarae13 We could meet to discuss these findings at the Wednesday July 14th NMDC Sync meeting, or sooner if you prefer. Here's my analysis of the EnvO terms recommended per package and MIxS slot vs. the counts of EnvO terms found in the INSDC BioSample metadata after being repaired/mapped. Out of 921 rows in my table, there are ~200 EnvO terms that were recommended for a package/slot combo but never used that way in the INSDC BioSample metadata, and ~550 EnvO terms that appeared in the INSDC BioSample metadata with a package/slot combo at least twice but were not explicitly recommended in the recent review.
In summary, of 921 combinations of packages, slots and EnvO terms, only 80 were both recommended by the review team and observed at least two times in the INSDC dataset. 200 + 550 + 80 != 921 because of rounding and because I excluded combinations that were observed only once in the INSDC dataset. The analysis was implemented in this notebook.
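The three-way split described above can be sketched in a few lines of Python. This is only an illustration of the set logic, not the code from the notebook: the function name `compare_usage`, the `min_count` parameter, and the toy inputs are all assumptions, and the `ENVO:9999999x` IDs are placeholders rather than real terms.

```python
from collections import Counter


def compare_usage(recommended, observed_counts, min_count=2):
    """Split (package, slot, term) combos into the three groups discussed above.

    recommended:     set of combos recommended in the review (hypothetical input)
    observed_counts: mapping of combo -> number of BioSamples using it
    min_count:       minimum observations for a combo to count as "observed"
    """
    # Recommended, but never seen in the BioSample metadata.
    never_used = {c for c in recommended if observed_counts.get(c, 0) == 0}
    # Seen at least min_count times, but not recommended.
    not_recommended = {
        c for c, n in observed_counts.items()
        if n >= min_count and c not in recommended
    }
    # Recommended and seen at least min_count times.
    both = {c for c in recommended if observed_counts.get(c, 0) >= min_count}
    return {
        "recommended_never_used": never_used,
        "observed_not_recommended": not_recommended,
        "both": both,
    }


# Toy, made-up inputs. Only ENVO:00001998 comes from this issue.
recommended = {
    ("soil", "env_medium", "ENVO:00001998"),
    ("soil", "env_broad_scale", "ENVO:99999991"),
}
observed_counts = Counter({
    ("soil", "env_medium", "ENVO:00001998"): 5,
    ("soil", "env_medium", "ENVO:99999992"): 3,
    ("soil", "env_local_scale", "ENVO:99999993"): 1,  # singleton, excluded
})

result = compare_usage(recommended, observed_counts)
for group, combos in result.items():
    print(group, len(combos))
```

Combos observed exactly once land in none of the three groups, which is one reason the group sizes need not sum to the total number of rows.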
Want confirmation, especially regarding the repair process? Run something like this against the July build of the harmonized biosample SQLite database:
SELECT
scoping_col,
scoping_value,
biosample_col_to_map,
raw,
consensus_id,
consensus_lab,
count(1) as sample_count
from
repaired_long rl1
where
scoping_col = 'env_package_normalization.EnvPackage'
and scoping_value = 'soil'
and biosample_col_to_map = 'env_medium'
and consensus_id = 'ENVO:00001998'
group by
scoping_col,
scoping_value,
biosample_col_to_map,
raw,
consensus_id,
consensus_lab
order by
count(1) desc;
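For anyone who would rather run this kind of check from Python, here is a minimal sketch using the standard-library `sqlite3` module. It assumes a `repaired_long` table with the columns named in the query above; the helper name `raw_value_counts`, the in-memory database, and the toy rows are made up for illustration. To run it against the real build, the `sqlite3.connect` call would need to point at that database file instead. The query is also simplified slightly by dropping the constant scoping columns from the output.

```python
import sqlite3

# Same filter as the query above, with the package, slot and term as parameters.
QUERY = """
SELECT raw, consensus_id, consensus_lab, count(1) AS sample_count
FROM repaired_long
WHERE scoping_col = 'env_package_normalization.EnvPackage'
  AND scoping_value = ?
  AND biosample_col_to_map = ?
  AND consensus_id = ?
GROUP BY raw, consensus_id, consensus_lab
ORDER BY count(1) DESC;
"""


def raw_value_counts(conn, package, slot, term_id):
    """Return (raw, consensus_id, consensus_lab, sample_count) rows."""
    return conn.execute(QUERY, (package, slot, term_id)).fetchall()


# Stand-in for the real database: an in-memory table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE repaired_long ("
    "scoping_col TEXT, scoping_value TEXT, biosample_col_to_map TEXT, "
    "raw TEXT, consensus_id TEXT, consensus_lab TEXT)"
)
scope = "env_package_normalization.EnvPackage"
toy_rows = [
    (scope, "soil", "env_medium", "soil", "ENVO:00001998", "soil"),
    (scope, "soil", "env_medium", "soil", "ENVO:00001998", "soil"),
    (scope, "soil", "env_medium", "Soil [ENVO:00001998]", "ENVO:00001998", "soil"),
    # Different package, so the filter should leave this row out.
    (scope, "water", "env_medium", "soil", "ENVO:00001998", "soil"),
]
conn.executemany("INSERT INTO repaired_long VALUES (?, ?, ?, ?, ?, ?)", toy_rows)

rows = raw_value_counts(conn, "soil", "env_medium", "ENVO:00001998")
for row in rows:
    print(row)
```

Each output row shows one raw submitter-supplied string and how many samples it was repaired/mapped from, which makes it easy to spot-check whether the repair step assigned the consensus term sensibly.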
PS I only used the |
AKA
https://www.ncbi.nlm.nih.gov/biosample/
vs
EnvO_triad_terms_MIxS_soil_package_review_05182021
@cmungall @elishawc @wdduncan There was some unfinished discussion about which repo would be the best home for this issue.