re-annotation and by-CURIE extraction of environmental context values into DuckDB #32

@turbomam commented Nov 19, 2024

The NCBI Biosample DuckDB files can be downloaded (and updated) at https://portal.nersc.gov/project/m3408/biosamples_duckdb/, which corresponds to /global/cfs/cdirs/m3408/www/biosamples_duckdb on the host.


[mam@<HOST> biosamples_duckdb]$ ls -lh

total 5.4G
-rw-r--r-- 1 mam m3408 2.7G Sep 23 09:37 ncbi_biosamples_2024-09-23.duckdb.gz
-rw-r--r-- 1 mam m3408 2.7G Sep 23 09:53 ncbi_biosamples_2024-09-23.duckdb.zip
drwxrwsrwx 2 mam m3408 4.0K Sep 24 12:39 new
drwxrwsrwx 2 mam m3408 4.0K Sep 24 12:38 old
-rwxrwxr-x 1 mam m3408  987 Sep  9 11:53 README
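
The .gz file in the listing above has to be decompressed before a DuckDB client can open it. A minimal sketch of that step, assuming the archive has already been downloaded from the portal directory into the working directory; the helper name `gunzip` and the chunk size are illustrative and not part of this PR:

```python
import gzip
import shutil
from pathlib import Path


def gunzip(src: Path, dest: Path, chunk_size: int = 1024 * 1024) -> Path:
    """Stream-decompress a .gz file to dest.

    Copies in fixed-size chunks so a multi-gigabyte database file
    is never held in memory all at once.
    """
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout, chunk_size)
    return dest
```

For example, `gunzip(Path("ncbi_biosamples_2024-09-23.duckdb.gz"), Path("ncbi_biosamples_2024-09-23.duckdb"))` should leave a file that can be opened with the DuckDB CLI or another DuckDB client.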

Newest output:

@turbomam linked an issue Nov 19, 2024 that may be closed by this pull request
@turbomam

The code in this PR and the resulting database are very preliminary. There is (or needs to be) a lot of joining on string columns now. The performance is remarkably good in DuckDB (at least on my system) but a lot of best practices are being stretched or broken.

@turbomam

Should probably column-normalize and row-combine these tables:

  • by_curie_extraction
  • by_re_annotation

These are the intermediate tables from which the above were derived:

  • context_content_count
  • curie_free_string_annotations
  • curie_free_string_counts
  • curie_free_strings
  • curie_labels
  • extracted_curies
  • normalized_context_content_count

These are the tables that were in previous builds:

  • attributes
  • harmonized_attributes_wide
  • ids
  • links
  • organism
  • overview
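
The suggestion to column-normalize and row-combine by_curie_extraction and by_re_annotation could look roughly like the sketch below. The real columns of those tables are not shown in this thread, so the shared column names (borrowed from the ontology_class_candidates query) and the rename maps are assumptions made purely for illustration:

```python
# Hypothetical shared schema; the actual table columns are not shown in this PR.
SHARED_COLUMNS = ("harmonized_name", "concluded_curie", "concluded_label")


def column_normalize(row, rename):
    """Rename source-specific columns to the shared names.

    Columns missing from a source become None; extra columns are dropped.
    """
    renamed = {rename.get(key, key): value for key, value in row.items()}
    return {col: renamed.get(col) for col in SHARED_COLUMNS}


def row_combine(sources):
    """Stack rows from several tables into one list.

    `sources` maps a table name to a (rename_map, rows) pair. Each output
    row carries a `source` column recording which table it came from.
    """
    combined = []
    for name, (rename, rows) in sources.items():
        for row in rows:
            out = column_normalize(row, rename)
            out["source"] = name
            combined.append(out)
    return combined
```

Keeping a `source` column means the combined table can still be split back into the two derivation paths. In DuckDB itself the same idea would most likely be a `UNION ALL` over two selects with aligned column lists.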

@turbomam

I think most people will be interested in the ontology_class_candidates view.

This query against it is interesting too:

-- count rows per package, attribute, and concluded term
select
	EnvPackage,
	harmonized_name,
	concluded_curie,
	concluded_label,
	count(1)
from
	main.ontology_class_candidates occ
group by
	EnvPackage,
	harmonized_name,
	concluded_curie,
	concluded_label;

@turbomam

  • Follow-ups: which namespaces are most prevalent? This will be biased by the CURIE parsing rules and by the ontologies that were used for annotation.
    • For now, the extraction rules exclude long prefixes or local_ids, such as snomed, and I am only annotating with ENVO.
  • Which concluded_curies don't have a label because the extracted CURIE is invalid or not defined in the provided ontologies?
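
Both follow-up questions can be prototyped outside the database. The sketch below is only an illustration: the regular expression and the two length limits are hypothetical stand-ins for the PR's actual extraction rules, which are not shown in this thread, and `labels` stands for whatever CURIE-to-label mapping the provided ontologies yield.

```python
import re
from collections import Counter

# Hypothetical limits; the real extraction rules are not shown in this PR.
MAX_PREFIX_LEN = 5
MAX_LOCAL_ID_LEN = 8

# Letters-only prefix, a colon, then a digits-only local id.
CURIE_RE = re.compile(r"\b([A-Za-z]+):(\d+)\b")


def extract_curies(text):
    """Return CURIE-like tokens from text, skipping long prefixes or local ids."""
    found = []
    for match in CURIE_RE.finditer(text):
        prefix, local_id = match.group(1), match.group(2)
        if len(prefix) <= MAX_PREFIX_LEN and len(local_id) <= MAX_LOCAL_ID_LEN:
            found.append(f"{prefix.upper()}:{local_id}")
    return found


def namespace_counts(curies):
    """Count how often each prefix (namespace) occurs."""
    return Counter(curie.split(":", 1)[0] for curie in curies)


def unlabeled(curies, labels):
    """Return the distinct CURIEs with no entry in the label mapping, sorted."""
    return sorted({curie for curie in curies if curie not in labels})
```

With these limits a six-letter prefix such as snomed is dropped before counting, which is exactly the kind of bias in namespace prevalence that the first bullet warns about.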

I am not making any effort to remove "drag stretch" CURIEs, i.e. annotated CURIEs that correspond to long stretches of sequences of extracted CURIEs.

@turbomam marked this pull request as ready for review January 8, 2025 13:48
@turbomam merged commit 55ba4cb into main Jan 8, 2025
@turbomam deleted the 31-get-repaired-environmental-triad-values-back-into-mongodb-or-duckdb branch January 8, 2025 13:56
Development

Successfully merging this pull request may close these issues.

get repaired environmental triad values back into MongoDB or DuckDB