Set analysis of ENVO terms used in INSDC Biosample Metadata vs curated subsets spreadsheet #3

turbomam · 2021-06-30T20:56:36Z

AKA

https://www.ncbi.nlm.nih.gov/biosample/

vs

EnvO_triad_terms_MIxS_soil_package_review_05182021

@cmungall @elishawc @wdduncan There was some unfinished discussion about which repo is the best home for this issue.

turbomam · 2021-07-08T14:40:54Z

@wdduncan @cmungall @elishawc @jagadishcs @mslarae13

We could meet to discuss these findings at the Wednesday July 14th NMDC Sync meeting, or sooner if you prefer.

Here's my analysis of the EnvO terms recommended per package and MIxS slot, compared with the counts of EnvO terms found in the INSDC BioSample metadata after repair/mapping.

Of the 921 rows in my table, roughly 200 are EnvO terms that were recommended for a package/slot combination but never used that way in the INSDC BioSample metadata. For example:

package	slot	class	label	count	reccomended
water	env_local_scale	ENVO:00000061	underground water body	0.0	True

There are also roughly 550 EnvO terms that appeared in the INSDC BioSample metadata with a given package/slot combination at least twice but were not explicitly recommended in the recent review. For example:

package	slot	class	label	count	reccomended
soil	env_medium	ENVO:00001998	soil	13249.0	False

In summary, of 921 combinations of packages, slots and EnvO terms, only 80 were both recommended by the review team and observed at least twice in the INSDC dataset.

200 + 550 + 80 != 921 because the first two figures are rounded and because I excluded combinations that were observed only once in the INSDC dataset.
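For anyone who wants to follow the bucketing without opening the notebook, here is a minimal sketch of the logic as described above. It is not the notebook's code: the `Row` tuple, the `classify` function and the bucket names are all hypothetical, and the threshold of two observations is taken from the description in this comment. The two sample rows are the ones quoted above.

```python
from collections import Counter
from typing import NamedTuple


class Row(NamedTuple):
    package: str
    slot: str
    term: str
    count: float       # times the term was seen for this package/slot in INSDC BioSample
    recommended: bool  # True if the review recommended it for this package/slot


def classify(row: Row, min_count: int = 2) -> str:
    """Assign a row to one bucket. Bucket names are illustrative only."""
    observed = row.count >= min_count
    if row.recommended and row.count == 0:
        return "recommended_never_used"
    if row.recommended and observed:
        return "recommended_and_observed"
    if not row.recommended and observed:
        return "observed_not_recommended"
    return "excluded"  # e.g. combinations observed only once


rows = [
    Row("water", "env_local_scale", "ENVO:00000061", 0.0, True),
    Row("soil", "env_medium", "ENVO:00001998", 13249.0, False),
]
buckets = Counter(classify(r) for r in rows)
print(dict(buckets))

# The approximate totals quoted above leave a remainder:
unaccounted = 921 - (200 + 550 + 80)
print(unaccounted)  # 91
```

The remainder of about 91 rows is what the rounding and the "observed only once" exclusion have to account for.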

The analysis was implemented in this notebook

turbomam · 2021-07-08T15:12:07Z

Want confirmation, especially regarding the repair process?

Run something like this against the July build of the harmonized data biosample SQLite database:

SELECT
	scoping_col,
	scoping_value,
	biosample_col_to_map,
	raw,
	consensus_id,
	consensus_lab,
	COUNT(1) AS sample_count
FROM
	repaired_long rl1
WHERE
	scoping_col = 'env_package_normalization.EnvPackage'
	AND scoping_value = 'soil'
	AND biosample_col_to_map = 'env_medium'
	AND consensus_id = 'ENVO:00001998'
GROUP BY
	scoping_col,
	scoping_value,
	biosample_col_to_map,
	raw,
	consensus_id,
	consensus_lab
ORDER BY
	COUNT(1) DESC;

scoping_col	scoping_value	biosample_col_to_map	raw	consensus_id	consensus_lab	sample_count
env_package_normalization.EnvPackage	soil	env_medium	soil	ENVO:00001998	soil	12596
env_package_normalization.EnvPackage	soil	env_medium	Soil	ENVO:00001998	soil	439
env_package_normalization.EnvPackage	soil	env_medium	ENVO:soil	ENVO:00001998	soil	214
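To try the shape of this query without the full database, the sketch below builds an in-memory SQLite table using the column names from the query above and runs a trimmed-down version of it. The rows are toy data invented for illustration (not real BioSample records), and the column types are an assumption. The last lines check that the three counts reported above add up to the 13249 shown for soil / env_medium / ENVO:00001998 in the earlier table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column names follow the query above; the TEXT types are an assumption.
conn.execute(
    """
    CREATE TABLE repaired_long (
        scoping_col TEXT,
        scoping_value TEXT,
        biosample_col_to_map TEXT,
        raw TEXT,
        consensus_id TEXT,
        consensus_lab TEXT
    )
    """
)

# Toy rows only: three raw spellings that all repair to the same EnvO term.
toy_counts = [("soil", 3), ("Soil", 2), ("ENVO:soil", 1)]
for raw, n in toy_counts:
    row = (
        "env_package_normalization.EnvPackage",
        "soil",
        "env_medium",
        raw,
        "ENVO:00001998",
        "soil",
    )
    conn.executemany("INSERT INTO repaired_long VALUES (?, ?, ?, ?, ?, ?)", [row] * n)

query = """
    SELECT raw, consensus_id, COUNT(1) AS sample_count
    FROM repaired_long
    WHERE scoping_col = ?
      AND scoping_value = ?
      AND biosample_col_to_map = ?
      AND consensus_id = ?
    GROUP BY raw, consensus_id
    ORDER BY COUNT(1) DESC
"""
params = ("env_package_normalization.EnvPackage", "soil", "env_medium", "ENVO:00001998")
result = conn.execute(query, params).fetchall()
print(result)

# Cross-check of the real counts reported above against the earlier table.
reported = [12596, 439, 214]
print(sum(reported))  # 13249
```

Grouping by `raw` is what exposes the repair step: each distinct raw string appears on its own line next to the consensus ID it was mapped to.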

turbomam · 2021-07-08T15:29:47Z

PS I only used the Subset_EnvO_Broad_Local_Medium_terms_062221 tab from EnvO_triad_terms_MIxS_soil_package_review_05182021

turbomam self-assigned this Jun 30, 2021

This was referenced Jul 6, 2021

Can't find issue for reconciling curated NMDC ENVO subsets with INSDC ustilized terms INCATools/biosample-analysis#66

Open

Issue #61 xquery INCATools/biosample-analysis#72

Merged

turbomam mentioned this issue Feb 15, 2024

Identify what terms need to be included in this ontology by querying the NMDC database #22

Closed
