Superseded by https://github.com/turbomam/biosample-xmldb-sqldb (which, despite it's name, doesn't use an XML database)
Using XQueries in a BaseX database to convert NCBI's BioSample database from XML to SQLite.
Querying the native BioSample database requires familiarity with, and access to tools designed for XPath, XQuery or XSLT. The SQLite database generated by this software can be queried from Python programs or a multitude of commercial, open-source, CLI and GUI tools, using the most common query language. No citation provided :-)
When a scientist submits data (like taxonomic identification or gene expression) to NCBI, they are also required to submit metadata about the BioSamples from which that data was collected.
NCBI uses the term attribute to describe certain named information about the BioSamples. Submitters can tag this information about their BioSamples with any attribute names they choose. The NCBI performs curation of the BioSamples, and attributes that appear to share the same meaning as controlled terms from standards like the GSC's MIxS
are given harmonized names.
These BioSample attribute
s should not be confused with XML markup construct that is also called attribute
. In fact, the BioSample attributes are actually XML element
s (from the path BioSampleSet/BioSample/Attributes/Attribute
) although @attribute_name
and @harmonized_name
are XML attribute
s.
There are several other paths with element
s, and attribute
s of elements that describe the BioSamples. For example, see the Path Index section of reports/biosample_set_1_info_index.txt
As of early January, 2022, the number of XML nodes necessary to model the 22,786,924 BioSamples was 2,397,333,775. This repo uses the BaseX open-source XML database, which is feature rich and performs well. However, BaseX does have per-database limits, such as no more than 2,147,483,648 nodes per database. Therefore, NCBI's biosample_set.xml.gz is split into two chunks with sed
and then loaded in two BaseX databases. Queries written in the XQuery
language are then executed over the two databases, generating various TSV
files.
After running the XQueries, the generated TSV
s are loaded into a SQLite database. SQLite was selected because the entire, indexed and ready-to-use database can be shared as a single, compressible file and because of the plethora of compatible tools, which do not require setting up a persistent database server. Examples include Python's built-in sqlite module, the sqlite
command line client (which can be downloaded from https://www.sqlite.org/download.html or installed with package managers that are included in many Unix-like operating systems), or with a graphical tool like DBeaver.
Note: some column names like id
and temp
may be reserved words. These should be wrapped in double quotes when writing queries.
- all_attribs
- All "attribute" elements from all BioSamples, whether they have a harmonized name or not, in a long format.
- Pretty slow to query, despite indexing several columns.
CREATE INDEX all_attribs_idx on all_attribs("harmonized_name", "raw_id", "attribute_name");
- Maybe the indexes weren't set up properly? (I haven't tried querying without indexes.)
- 324,806,062 rows
- columns (
TEXT
unless otherwise specified):- raw_id (
INTEGER
) - attribute_name
- harmonized_name
- value
- raw_id (
- env_package_repair
- Standardizes the user-submitted
env_package
values - From env_package_repair_curated.tsv
- The
Makefile
has a step to create atarget/env_package_repair_new.tsv
. No tools are provided yet for automatically reconciling the new file with the previous curated file. - 99 rows
- columns (
TEXT
unless otherwise specified):- env_package_raw
- count (
INTEGER
) - env_package (indexed)
- checklist_fragment
- curation_confidence
- notes
- curation_complete
- Standardizes the user-submitted
- harmonized_wide
- Just the values of those attributes that have harmonized names, as
TEXT
, in wide format. Also includes the BioSampleraw_id
s, asINTEGER
s. - 22,608,173 rows
- 516 columns including
raw_id
- Provided with indices on
raw_id_idx
andenv_package
. More could be added for better queries at the cost of disk footprint.
- Just the values of those attributes that have harmonized names, as
- harmonized_wide_repaired
- Exact same shape as
harmonized_wide
. Serves as a place to collect values that have gone though some cleanup/repair process. Currentlyenv_package
is the only column repaired by this repo, by way of theenv_package_repair
table. - Provided with an index on
raw_id_idx
- Exact same shape as
- non_attribute_metadata
- Metadata about the BioSamples that doesn't come from "attribute" entities. External IDs and links have been simplified compared to previous versions. Among other things,
DOI
s have been omitted , since they are only available for ~ 400 out of the 20+ million BioSamples. - 24,580,658 rows. That's greater than the number of
harmonized_wide
rows as some BioSamples don't have any attributes with harmonized names. - columns (
TEXT
unless otherwise indicated):- id (Not native to the
biosample_set.xml
file. Built from the prefix "BIOSAMPLE:" and theprimary_id
) - accession
- raw_id (
INTEGER
, primary key) - primary_id
- sra_id
- bp_id (while these are stored as
TEXT
, they are numerical BioProject IDs and should really be joined with the BioProject XML database to get BioProject accessions.) - model
- package
- package_name
- status
- status_date
- taxonomy_id
- taxonomy_name
- title
- paragraph
- id (Not native to the
- Metadata about the BioSamples that doesn't come from "attribute" entities. External IDs and links have been simplified compared to previous versions. Among other things,
- biosample_basex_merged view
non_attribute_metadata
+harmonized_wide
tables
As of 2022-01-03
- biosample_basex.db.gz: 7.7 GB
- biosample_basex.db: 67.9 GB
BaseX and SQLite can be installed with homebrew on Macs or apt-get
on Ubuntu Linux machines. They're both open-source software. SQLite3 3.32 or greater is required.
Downloading the BaseX .zip
archive makes it a little more straightforward to increase the amount of memory allocated by the launch scripts. For example, on a 32 GB machine, basex/bin/basex
might look like this:
#!/usr/bin/env bash
# Path to this script
FILE="${BASH_SOURCE[0]}"
while [ -h "$FILE" ] ; do
SRC="$(readlink "$FILE")"
FILE="$( cd -P "$(dirname "$FILE")" && \
cd -P "$(dirname "$SRC")" && pwd )/$(basename "$SRC")"
done
MAIN="$( cd -P "$(dirname "$FILE")/.." && pwd )"
# Core and library classes
CP=$MAIN/BaseX.jar:$MAIN/lib/custom/*:$MAIN/lib/*:$CLASSPATH
# Options for virtual machine (can be extended by global options)
BASEX_JVM="-Xmx24g $BASEX_JVM"
# Run code
exec java -cp "$CP" $BASEX_JVM org.basex.BaseX "$@"
https://docs.basex.org/wiki/Main_Page
make all
takes roughly 8 hours.
cd /global/cfs/cdirs/m3513/endurable/biosample
git clone [email protected]:turbomam/biosample-basex.git
wget
the latest BaseX zip archive- 2022-01-02:
wget https://files.basex.org/releases/9.6.4/BaseX964.zip
- 2022-01-02:
unzip BaseX964.zip
- increase the Java memory allocation in
basex/bin/basex
BASEX_JVM="-Xmx96g $BASEX_JVM"
- That may be more than necessary. Seems to run OK on an Intel MacBook Pro with 24 GB allocated.
cp biosample-basex/template.env biosample-basex/.env
- edit
biosample-basex/.env
BASEXCMD = ../basex/bin/basex
- The value for
del_from
should be roughly half the number of BioSamples. The suggested value of 12500001 should be reasonable but could be revised after downloading and unpacking the BioSample XML file. Run something liketail -n 50 target
at the shell prompt, note theid
attribute of the last<BioSample>
, and setdel_from
to an integer close to one half of that lastid
value.
- Start a session manager like
screen
. If you get disconnected, it may be tricky to log back into the same cori node because of load balancing, but scripts started before the disconnection should run to completion. It doesn't hurt to force your client computer to stay awake with something likecaffeine
. module load python
- On cori, this makes common Python packages like
pandas
available. On other systems, users should create avenv
virtual environment, enter it, and runpip install -r requirments.txt
- On cori, this makes common Python packages like
- cd
biosample-basex/
make all
After running make final_sqlite_gz_dest
, the SQLite will be available at https://portal.nersc.gov/project/m3513/biosample/