get repaired environmental triad values back into MongoDB or DuckDB #31

Open
turbomam opened this issue Nov 16, 2024 · 2 comments · May be fixed by #32
@turbomam
Member

i.e. extracting ontology class CURIEs from env_broad_scale, env_local_scale and env_medium

follow up with Jasper, Peter, Paramvir and Mikaela

https://www.kbase.us/team/

  • prototype in Jupyter if necessary, but create an application, possibly in another repo (NMDC or BBOP?)
  • take an 80/20 low hanging fruit approach
  • save intermediate data; make it easy to backtrack the results of the repairs
  • keep track of possible losses "already added by me"
    • NCBI XML -> JSON/MongoDB (Jasper's or Mark's similar approaches) -> DuckDB
    • not all paths are added to DuckDB yet
    • attributes may be concatenated with ||| when writing into DuckDB ???
  • new potential losses for benefit of performance/manageability
    • lowercase and normalize whitespace?
  • are there user-provided attributes that NCBI "failed" to harmonize
  • take a stance on what the reasonable values are (see submission schema)
  • suggest opportunities to route unreasonable values into other slots/fields/attributes?
  • split wherever CURIEs were found and make a unique collection of strings. annotate with multiple ontologies using OAK, either with one combined backend or iteratively?
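
A minimal sketch of the lossy normalization step from the list above, written so that each repair can be traced back to its raw input. The function names are hypothetical, and the `|||` separator is taken from the note above, where its use is still marked as uncertain.

```python
import re


def split_attribute(raw, sep="|||"):
    """Split a possibly concatenated attribute value into its non-empty parts."""
    return [part for part in raw.split(sep) if part.strip()]


def normalize(value):
    """Lowercase and collapse runs of whitespace (a deliberate, lossy step)."""
    return re.sub(r"\s+", " ", value).strip().lower()


def unique_normalized(raw_values):
    """Map each normalized string to the set of raw strings it came from.

    Keeping the raw strings makes it possible to backtrack from a repaired
    value to every original spelling that produced it.
    """
    seen = {}
    for raw in raw_values:
        for part in split_attribute(raw):
            seen.setdefault(normalize(part), set()).add(part)
    return seen
```

The keys of the returned dict form the unique collection of strings to annotate; the values are the intermediate data worth saving alongside the results.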

try to include drag remediation?

assume 0 or more occurrences of CURIEs of the form {prefix}{delimiter}{local_id}

where prefix and local_id can be letters, numbers or both, each with a minimum and maximum length

delimiters: mostly expect : or _. may include some whitespace

there are some without delimiters, e.g. ENVO1234567... how many?

just ENV instead of ENVO

ENVO:label pattern

NCIT etc with different length local_ids or with letters

some prefixes with numbers
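
The pattern notes above could be prototyped roughly as follows. This is a sketch under assumptions, not a settled design: the length bounds (prefix of 2 to 10 characters, local id of up to 10 digits with one optional leading letter), the `ENV` to `ENVO` alias table, and all names are hypothetical choices for illustration. It deliberately skips the ENVO:label case, since a label is not a local id, and it will produce false positives (for example `depth_10`) that would need filtering against a list of known prefixes.

```python
import re

# {prefix}{delimiter}{local_id}, delimiter ':' or '_' with optional whitespace.
DELIMITED = re.compile(
    r"\b(?P<prefix>[A-Za-z][A-Za-z0-9]{1,9})"  # prefix may contain digits
    r"\s*[:_]\s*"
    r"(?P<local_id>[A-Za-z]?[0-9]{2,10})\b"    # allows NCIT-style C12345
)
# No delimiter at all, e.g. ENVO1234567 (letters-only prefix, longer id).
UNDELIMITED = re.compile(r"\b(?P<prefix>[A-Za-z]{2,10})(?P<local_id>[0-9]{5,10})\b")

# Assumed repair table for truncated prefixes; extend as cases are found.
PREFIX_ALIASES = {"ENV": "ENVO"}


def extract_curies(text):
    """Return (normalized_curie, raw_match) pairs in order of appearance."""
    found = []
    for pattern in (DELIMITED, UNDELIMITED):
        for m in pattern.finditer(text):
            prefix = m.group("prefix").upper()
            prefix = PREFIX_ALIASES.get(prefix, prefix)
            curie = f"{prefix}:{m.group('local_id').upper()}"
            found.append((m.start(), curie, m.group(0)))
    found.sort()
    return [(curie, raw) for _, curie, raw in found]
```

Returning the raw match next to the normalized CURIE keeps the repair reversible, and counting matches from `UNDELIMITED` alone would answer the "how many?" question for the delimiter-free case.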

@turbomam turbomam self-assigned this Nov 16, 2024
@turbomam
Member Author

dev tools

  • dbeaver
  • a mongodb GUI browser
  • data science tool like DataGrip?

@turbomam
Member Author

include breakdown of NCBI package values
