This repository has been archived by the owner on Apr 19, 2024. It is now read-only.

Usage of slots and columns across authorities #398

Open
2 of 3 tasks
turbomam opened this issue Sep 27, 2021 · 2 comments

@turbomam
Member

turbomam commented Sep 27, 2021

Gather input, code and output for

  • MIxS soil vs Montana's Example Soil Google Sheet
  • Per-package slot usage within INSDC Biosamples vs MIxS guidance
  • review of @mslarae13's Index of Terms for a potential EMSL DataHarmonizer template
@turbomam turbomam self-assigned this Sep 27, 2021
@turbomam
Member Author

turbomam commented Sep 27, 2021

MIxS soil vs Montana's Example Soil Google Sheet

I did an accounting of the differences between Montana's Example Soil Google Sheet and my (MIxS-based) DataHarmonizer Soil template. It started out as a Python script but became pretty manual. The work had two parts:

  • Determining which columns are "associated" with soil samples in Montana's sheet. I don't have a definition for "associated"; it could mean required, recommended, or optional. For example, I would say that agrochem_addition is associated with soil samples based on cells A292:B292 of the MenuTerms tab. I made the associations by manually inspecting the MenuTerms tab. I have some preliminary code that does this by searching the sheet for keywords and clustering the matching cell locations to find dense mentions of agrochem_addition or soil. I think the associations could also be determined by analyzing the formulae in tabs like Metadata and EnvironmentalMetadata, but I don't know how to do that.
  • Looking for patterns that explain the mismatches between Montana's soil-associated columns and MIxS' soil slots.
    Since the analysis started out as a script, the output is JSON. I added structure and notes to explain the mismatches.
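
For readers who want a feel for the comparison step, here is a minimal sketch. It is not the notebook's code: the column and slot names are made-up placeholders, and the function name `compare_columns_to_slots` is hypothetical. It only illustrates partitioning names into matches and mismatches and emitting the result as JSON.

```python
import json

# Placeholder inputs (assumptions): not the real sheet columns or MIxS soil slots.
sheet_columns = {"agrochem_addition", "depth", "sample_name"}
mixs_soil_slots = {"agrochem_addition", "depth", "ph"}


def compare_columns_to_slots(columns, slots):
    """Partition names into shared, sheet-only, and MIxS-only groups."""
    return {
        "shared": sorted(columns & slots),
        "sheet_only": sorted(columns - slots),
        "mixs_only": sorted(slots - columns),
    }


report = compare_columns_to_slots(sheet_columns, mixs_soil_slots)
print(json.dumps(report, indent=2))
```

The `sheet_only` and `mixs_only` lists are the mismatches that then need manual notes, which is roughly the difference between the computed and curated outputs linked below.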

code: https://github.com/microbiomedata/nmdc-metadata/blob/issue-398/notebooks/soil_slot_column_analysis.ipynb

computed output: https://github.com/microbiomedata/nmdc-metadata/blob/issue-398/notebooks/soil_slot_column_analysis.json

curated output: https://github.com/microbiomedata/nmdc-metadata/blob/issue-398/notebooks/soil_slot_column_analysis_curated.json

@turbomam
Member Author

turbomam commented Sep 28, 2021

Per-package slot usage within INSDC Biosamples vs MIxS guidance

code: https://github.com/microbiomedata/nmdc-metadata/blob/issue-398/notebooks/insdc_per_package_column_usage.ipynb

Three sub-steps are included in the notebook

  1. Determine which annotations are applied to the INSDC Biosamples on a package-by-package basis. These are called "columns" in the code, but that isn't quite right: technically, they start out as attributes in biosample_set.xml.gz. Only attributes that have a harmonized name are included. This is then merged with a table of which MIxS slots (which largely overlap with the attributes) are associated with each package. With that, you can query for the slots that MIxS associates with a package but that are least frequently used in INSDC Biosamples, or for the attributes that are most frequently used for samples from some package in INSDC even though MIxS doesn't associate that slot with that package.
  2. PCA plot of the per-package Biosample attribute usage. Intended as a conversation starter: there are 20M biosamples but only 251k are annotated with an env_package. Could we use the attribute usage alone to predict the packages for all of those other samples?
  3. What are the common values for all of those attributes? Determined with Pandas Profiling.
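
The first sub-step can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's implementation: the inline XML is a tiny invented sample, the element and attribute layout (`BioSample`, `Attribute`, `harmonized_name`) is my assumption about the shape of biosample_set.xml.gz, the package is assumed to be read from an `env_package` attribute, and `mixs_slots` is a hypothetical stand-in for the MIxS package-to-slot table. It uses only the standard library rather than Pandas.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Invented sample data (assumption): real records are far larger and richer.
SAMPLE_XML = """<BioSampleSet>
  <BioSample accession="EXAMPLE1">
    <Attributes>
      <Attribute attribute_name="env_package" harmonized_name="env_package">soil</Attribute>
      <Attribute attribute_name="depth" harmonized_name="depth">10 cm</Attribute>
      <Attribute attribute_name="my note">free text</Attribute>
    </Attributes>
  </BioSample>
  <BioSample accession="EXAMPLE2">
    <Attributes>
      <Attribute attribute_name="env_package" harmonized_name="env_package">soil</Attribute>
      <Attribute attribute_name="depth" harmonized_name="depth">5 cm</Attribute>
      <Attribute attribute_name="elevation" harmonized_name="elev">100 m</Attribute>
    </Attributes>
  </BioSample>
  <BioSample accession="EXAMPLE3">
    <Attributes>
      <Attribute attribute_name="depth" harmonized_name="depth">1 m</Attribute>
    </Attributes>
  </BioSample>
</BioSampleSet>"""

# Hypothetical MIxS package-to-slot table (assumption).
mixs_slots = {"soil": {"depth", "ph"}}


def usage_by_package(xml_text):
    """Count harmonized attribute usage per env_package value."""
    usage = defaultdict(Counter)  # package -> attribute -> sample count
    totals = Counter()            # package -> number of samples
    for sample in ET.fromstring(xml_text).iter("BioSample"):
        # Keep only attributes that carry a harmonized name.
        names = {
            a.get("harmonized_name"): (a.text or "").strip()
            for a in sample.iter("Attribute")
            if a.get("harmonized_name")
        }
        package = names.pop("env_package", None)
        if package is None:
            continue  # samples without an env_package are skipped here
        totals[package] += 1
        usage[package].update(names.keys())
    return usage, totals


def merge_with_mixs(usage, totals, slots_by_package):
    """Join observed usage with the MIxS table, one row per package/attribute."""
    rows = []
    for package, n_samples in totals.items():
        expected = slots_by_package.get(package, set())
        for name in sorted(expected | set(usage[package])):
            rows.append({
                "package": package,
                "attribute": name,
                "fraction_used": usage[package][name] / n_samples,
                "in_mixs": name in expected,
            })
    return rows


usage, totals = usage_by_package(SAMPLE_XML)
rows = merge_with_mixs(usage, totals, mixs_slots)
for row in rows:
    print(row)
```

Rows with `in_mixs` true and a low `fraction_used` correspond to slots that MIxS associates with a package but that are rarely used; rows with `in_mixs` false and a high `fraction_used` are the opposite case. For the full file, streaming with `gzip.open` and `xml.etree.ElementTree.iterparse` would likely be needed instead of loading everything into memory.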
