re-annotation and by-CURIE extraction of environmental context values in DuckDB #32

turbomam · 2024-11-19T15:40:35Z

The NCBI Biosample DuckDB files can be retrieved (and updated) at https://portal.nersc.gov/project/m3408/biosamples_duckdb/

That URL corresponds to /global/cfs/cdirs/m3408/www/biosamples_duckdb on the host:

[mam@<HOST> biosamples_duckdb]$ ls -lh

total 5.4G
-rw-r--r-- 1 mam m3408 2.7G Sep 23 09:37 ncbi_biosamples_2024-09-23.duckdb.gz
-rw-r--r-- 1 mam m3408 2.7G Sep 23 09:53 ncbi_biosamples_2024-09-23.duckdb.zip
drwxrwsrwx 2 mam m3408 4.0K Sep 24 12:39 new
drwxrwsrwx 2 mam m3408 4.0K Sep 24 12:38 old
-rwxrwxr-x 1 mam m3408  987 Sep  9 11:53 README

Newest output:

https://portal.nersc.gov/project/m3408/biosamples_duckdb/ncbi_biosamples_2024-11-19T1545EST-pre-alpha.duckdb.gz

turbomam · 2024-11-19T15:59:32Z

The code in this PR and the resulting database are very preliminary. There is (or needs to be) a lot of joining on string columns now. The performance is remarkably good in DuckDB (at least on my system) but a lot of best practices are being stretched or broken.

turbomam · 2024-11-19T17:07:31Z

Should probably column-normalize and row-combine these tables:

by_curie_extraction
by_re_annotation
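
As a rough illustration of what "column-normalize and row-combine" could mean here, the sketch below renames each table's columns onto one shared schema, tags every row with its source table, and stacks the rows. The column names (`extracted_curie`, `annotated_curie`, and so on) and the toy rows are assumptions, since the real schemas of by_curie_extraction and by_re_annotation are not shown in this thread.

```python
# Hypothetical shared schema; the real column names are not shown in this PR.
SHARED_COLUMNS = ("normalized_context_content", "curie", "label", "source")


def normalize_row(row, column_map, source):
    """Rename columns per column_map, fill missing ones with None, tag the source."""
    renamed = {column_map.get(key, key): value for key, value in row.items()}
    renamed["source"] = source
    return {column: renamed.get(column) for column in SHARED_COLUMNS}


def row_combine(tables):
    """Stack rows from several (rows, column_map, source) triples into one list."""
    combined = []
    for rows, column_map, source in tables:
        combined.extend(normalize_row(row, column_map, source) for row in rows)
    return combined


# Toy rows, invented for illustration only.
by_curie_extraction = [
    {"normalized_context_content": "envo:00001998",
     "extracted_curie": "ENVO:00001998", "extracted_label": "soil"},
]
by_re_annotation = [
    {"normalized_context_content": "biome",
     "annotated_curie": "ENVO:00000428", "annotated_label": "biome"},
]

combined = row_combine([
    (by_curie_extraction,
     {"extracted_curie": "curie", "extracted_label": "label"},
     "by_curie_extraction"),
    (by_re_annotation,
     {"annotated_curie": "curie", "annotated_label": "label"},
     "by_re_annotation"),
])

for row in combined:
    print(row)
```

Keeping a `source` column means the combined table can still be filtered back to either method.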

These are the intermediate tables from which the above were derived:

context_content_count
curie_free_string_annotations
curie_free_string_counts
curie_free_strings
curie_labels
extracted_curies
normalized_context_content_count

These are the tables that were in previous builds:

attributes
harmonized_attributes_wide
ids
links
organism
overview

turbomam · 2024-11-19T22:18:56Z

I think most people will be interested in the ontology_class_candidates view.

This query against it is interesting too:

-- count rows per package, attribute, and concluded ontology class
select
	EnvPackage,
	harmonized_name,
	concluded_curie,
	concluded_label,
	count(1)
from
	main.ontology_class_candidates occ
group by
	EnvPackage,
	harmonized_name,
	concluded_curie,
	concluded_label;
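
To try the shape of this query without downloading the multi-gigabyte database, here is a small sketch that runs the same grouping on toy rows. It uses Python's built-in `sqlite3` as a stand-in for DuckDB, which is an assumption of convenience: the SQL is plain enough that it should behave the same way in both. The table is a real table here rather than a view, the rows are invented, and the `row_count` alias and `order by` are additions for readability.

```python
import sqlite3

# In-memory stand-in for the DuckDB file; SQLite's default schema is also "main".
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    create table ontology_class_candidates (
        EnvPackage text,
        harmonized_name text,
        concluded_curie text,
        concluded_label text
    )
    """
)

# Toy rows, invented for illustration only (not real query results).
toy_rows = [
    ("soil", "env_medium", "ENVO:00001998", "soil"),
    ("soil", "env_medium", "ENVO:00001998", "soil"),
    ("water", "env_broad_scale", "ENVO:00000428", "biome"),
]
conn.executemany(
    "insert into ontology_class_candidates values (?, ?, ?, ?)", toy_rows
)

query = """
    select
        EnvPackage,
        harmonized_name,
        concluded_curie,
        concluded_label,
        count(1) as row_count
    from
        main.ontology_class_candidates occ
    group by
        EnvPackage,
        harmonized_name,
        concluded_curie,
        concluded_label
    order by
        row_count desc
"""
results = conn.execute(query).fetchall()
for row in results:
    print(row)
```

Sorting by the count puts the most common package, attribute, and class combinations first, which is probably the most useful view for spotting candidate classes.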

turbomam · 2024-11-19T22:33:33Z

Followups:
- What namespaces are most prevalent? This will be biased by the CURIE parsing rules and by the ontologies that were used for annotation.
- For now, the extraction rules exclude long prefixes or local_ids like snomed, and I am only annotating with ENVO.
- Which concluded_curies don't have a label because the extracted CURIE is invalid or not defined in the provided ontologies?
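
Both followup questions could be explored with something like the sketch below, which counts CURIE prefixes and lists the CURIEs that have no label. It assumes the (concluded_curie, concluded_label) pairs have already been pulled out of the ontology_class_candidates view; the pairs here are toy values, and the two unlabeled identifiers are hypothetical.

```python
from collections import Counter

# Toy (concluded_curie, concluded_label) pairs; None marks a missing label.
# These values are invented for illustration, not real query results.
candidates = [
    ("ENVO:00001998", "soil"),
    ("ENVO:00000428", "biome"),
    ("ENVO:99999999", None),  # hypothetical id, not defined in the ontology
    ("PO:0000000", None),     # hypothetical id from an ontology that was not loaded
]


def prefix_of(curie):
    """Return the namespace part of a CURIE, upper-cased for consistent counting."""
    return curie.split(":", 1)[0].upper()


# Question 1: which namespaces are most prevalent?
namespace_counts = Counter(prefix_of(curie) for curie, _ in candidates)

# Question 2: which concluded CURIEs have no label?
unlabeled = sorted({curie for curie, label in candidates if label is None})

print(namespace_counts.most_common())
print(unlabeled)
```

A missing label alone does not say whether the CURIE is invalid or simply belongs to an ontology that was not provided; telling those cases apart would need a check of the prefix against the loaded ontologies.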

I am not making any effort to remove "drag stretch" CURIEs, i.e. annotated CURIEs that correspond to long stretches of sequences of extracted CURIEs.


by re-annotation and by curie extraction in duckdb

15b2483

turbomam linked an issue Nov 19, 2024 that may be closed by this pull request

get repaired environmental triad values back into MongoDB or DuckDB #31

Closed

ontology_class_candidates view with package names

bad76e6

Merge branch 'main' into 31-get-repaired-environmental-triad-values-b…

baf2dfd

…ack-into-mongodb-or-duckdb

turbomam marked this pull request as ready for review January 8, 2025 13:48

turbomam mentioned this pull request Jan 8, 2025

see followup actions from PR #32 #40

Open

turbomam merged commit 55ba4cb into main Jan 8, 2025

turbomam deleted the 31-get-repaired-environmental-triad-values-back-into-mongodb-or-duckdb branch January 8, 2025 13:56
