Set analysis of ENVO terms used in INSDC Biosample Metadata vs curated subsets spreadsheet #3

Open
turbomam opened this issue Jun 30, 2021 · 3 comments
@turbomam

AKA

https://www.ncbi.nlm.nih.gov/biosample/

vs

EnvO_triad_terms_MIxS_soil_package_review_05182021

@cmungall @elishawc @wdduncan There was some unfinished discussion about which repo was the best home for this issue

@turbomam

turbomam commented Jul 8, 2021

@wdduncan @cmungall @elishawc @jagadishcs @mslarae13

We could meet to discuss these findings at the Wednesday July 14th NMDC Sync meeting, or sooner if you prefer.


Here's my analysis of the EnvO terms recommended per package and MIxS slot, vs the count of EnvO terms found in the INSDC BioSample metadata, after being repaired/mapped.

Out of 921 rows in my table, you will find that there are ~ 200 EnvO terms that were recommended for a package/slot combo, but never used in that way in the INSDC BioSample metadata

| package | slot | class | label | count | reccomended |
| --- | --- | --- | --- | --- | --- |
| water | env_local_scale | ENVO:00000061 | underground water body | 0.0 | True |

and ~ 550 EnvO terms that appeared in the INSDC BioSample metadata in combination with a package/slot combo at least twice, but not explicitly recommended in the recent review.

| package | slot | class | label | count | reccomended |
| --- | --- | --- | --- | --- | --- |
| soil | env_medium | ENVO:00001998 | soil | 13249.0 | False |

In summary, of the 921 combinations of package, slot, and EnvO term, only 80 were both recommended by the review team and observed at least twice in the INSDC dataset.

200 + 550 + 80 != 921 because the counts are rounded and because I excluded combinations that were observed only once in the INSDC dataset.
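
For readers who want to see the shape of the comparison, here is a minimal sketch of the three-way split using plain Python sets. It is not the notebook's code. The function name `classify`, the threshold constant, and the two `EXAMPLE:` rows are hypothetical. Only the water and soil rows come from the tables above.

```python
# Minimal sketch of the set comparison; NOT the notebook's implementation.
# Keys are (package, slot, EnvO class) tuples.

MIN_OBSERVATIONS = 2  # combinations observed only once are left out of the "observed" set


def classify(recommended, observed_counts, min_obs=MIN_OBSERVATIONS):
    """Split package/slot/term combinations into three groups."""
    observed = {key for key, n in observed_counts.items() if n >= min_obs}
    never_used = {key for key in recommended if observed_counts.get(key, 0) == 0}
    return {
        "recommended_not_observed": never_used,
        "observed_not_recommended": observed - recommended,
        "both": recommended & observed,
    }


# Rows 1 and 2 are taken from the tables above; the EXAMPLE rows are made up.
recommended = {
    ("water", "env_local_scale", "ENVO:00000061"),
    ("soil", "env_medium", "EXAMPLE:0000001"),      # hypothetical
    ("soil", "env_broad_scale", "EXAMPLE:0000002"),  # hypothetical
}
observed_counts = {
    ("soil", "env_medium", "ENVO:00001998"): 13249,
    ("soil", "env_medium", "EXAMPLE:0000001"): 5,       # hypothetical
    ("soil", "env_broad_scale", "EXAMPLE:0000002"): 1,  # hypothetical, seen once
}

groups = classify(recommended, observed_counts)
for name, keys in groups.items():
    print(name, len(keys))
```

In this toy input, the combination that was recommended but observed only once lands in none of the three groups, which is one reason the group sizes need not add up to the number of rows.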

The analysis was implemented in this notebook

@turbomam

turbomam commented Jul 8, 2021

Want confirmation, especially regarding the repair process?

Run something like this against the July build of the harmonized data biosample SQLite database:

```sql
SELECT
    scoping_col,
    scoping_value,
    biosample_col_to_map,
    raw,
    consensus_id,
    consensus_lab,
    COUNT(1) AS sample_count
FROM
    repaired_long rl1
WHERE
    scoping_col = 'env_package_normalization.EnvPackage'
    AND scoping_value = 'soil'
    AND biosample_col_to_map = 'env_medium'
    AND consensus_id = 'ENVO:00001998'
GROUP BY
    scoping_col,
    scoping_value,
    biosample_col_to_map,
    raw,
    consensus_id,
    consensus_lab
ORDER BY
    COUNT(1) DESC;
```

| scoping_col | scoping_value | biosample_col_to_map | raw | consensus_id | consensus_lab | sample_count |
| --- | --- | --- | --- | --- | --- | --- |
| env_package_normalization.EnvPackage | soil | env_medium | soil | ENVO:00001998 | soil | 12596 |
| env_package_normalization.EnvPackage | soil | env_medium | Soil | ENVO:00001998 | soil | 439 |
| env_package_normalization.EnvPackage | soil | env_medium | ENVO:soil | ENVO:00001998 | soil | 214 |
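
As a quick cross-check, the three raw spellings returned by the query should add up to the 13249 reported for soil / env_medium / ENVO:00001998 in the earlier table. The sketch below only redoes that arithmetic. The variable names are hypothetical, and it assumes the summary count was computed from the same build of the database.

```python
# Per-raw-value sample counts copied from the query result above.
raw_counts = {
    "soil": 12596,
    "Soil": 439,
    "ENVO:soil": 214,
}

# Count reported for soil / env_medium / ENVO:00001998 in the summary table.
reported_total = 13249

# All three raw strings were repaired to the same consensus_id,
# so their counts should sum to the reported total.
repaired_total = sum(raw_counts.values())
print(repaired_total, repaired_total == reported_total)
```

If the totals agree, the repair step accounts for every sample behind the summary count, at least for this one package/slot/term combination.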

@turbomam

turbomam commented Jul 8, 2021

PS I only used the Subset_EnvO_Broad_Local_Medium_terms_062221 tab from EnvO_triad_terms_MIxS_soil_package_review_05182021
