This repo contains scripts that facilitate the anwsers to the following questions:
- how many taxa at a specific taxonomic level?
- how many samples for MGnify?
- how many samples from MG-RAST?
- how many annotated genomes from JGI?
- how many processes?
- how many envinronments?
- how many associations (in total, taxon-process, taxon-environmets)?
The components of this repo:
- pubmed statistics
- filtering taxa (unicellular, multicellular, higher taxonomy)
- entities statistics (distinct ids per type i.e NCBI, ENVO, GO, KO)
- association statistics and associations summaries per id (e.g. the channels and association types of each id)
- visualization and other statistics
The pubmed_statistics.awk
script summarises some info of the PubMed tsv files.
date ; gunzip -c /data/databases/pubmed/pubmed*.tsv.gz | ./pubmed_statistics.awk ; dat
e
PubMed tsv files contain 6 fields. There are case that some of these are empty or even duplicated. The above script finds these inconcistencies and calculates the number of abstracts.
Fields: PMID|DOI Authors Journal.volume:pages year title Abstract
Questions:
- how many abstracts?
- how many unique abstracts
- how many no-empty PMIDs
This group of scripts takes as input the /data/dictionary/database_groups.tsv
file which contains child - parent
relationships of all the ids of PREGO. It is fully flatted, so a parent Id is associated with all its' children directly.
For the unicellular organisms an additional file is used : METdb_GENOMIC_REFERENCE_DATABASE_FOR_MARINE_SPECIES.csv
which contains
NCBI ids for some eukaryotic microbes in marine environements curated by METdb.
The script filter-unicellular_ncbi.awk
is important because it sets the scope of PREGO.
At the BEGIN
section a manually curated array of eukaryotic unicellular NCBI ids is populated.
Lated this array is further enriched with all Bacteria and Archaea children ids from the /data/dictionary/database_groups.tsv
file.
How to run
./filter_unicellular_ncbi.awk \
/data/dictionary/METdb_GENOMIC_REFERENCE_DATABASE_FOR_MARINE_SPECIES.csv \
/data/dictionary/database_groups.tsv > /data/dictionary/prego_unicellular_ncbi.tsv
There are some variations of this script with different scope. The approach remains the same.
These scripts are for statistics purposes; summarise the phyla and other higher taxa contents of PREGO
Additional file of ncbi taxonomy is required to obtain the ranks. This file is /data/dictionary/ncbi/ncbi_taxonomy/nodes.dmp
As of 2021-10-02
Text Mining, MGnify markegene, MG-RAST markegene and metagenome, JGI isolated genemes, Struo genomes.
Goal of this script is to summarize the different entities that PREGO has associations between them.
It takes as input the prego_unicellular_ncbi.tsv
, the taxonomy ranks file
/data/dictionary/ncbi/ncbi_taxonomy/nodes.dmp
and the database_pairs.tsv
files.
The latter are must follow the schema that is used throughout PREGO.
Then it automaticaly exports the number of distinct entities per entity types, per source and per file.
Multiple files are supported.
How to run:
./entities_statistics.awk /data/dictionary/prego_unicellular_ncbi.tsv \
/data/dictionary/ncbi/ncbi_taxonomy/nodes.dmp \
/data/experiments/database_pairs.tsv /data/textmining/database_pairs.tsv \
/data/knowledge/database_pairs.tsv
The basic script is the associations_statistics.awk
. It functions similarly with the entities
script.
It automaticaly exports the number of associations entities per entity types pairs, per source and per file.
Additionaly it summarises the associations with the ncbi taxonomic rank.
How to run:
./associations_statistics.awk /data/dictionary/prego_unicellular_ncbi.tsv \
/data/dictionary/ncbi/ncbi_taxonomy/nodes.dmp \
/data/textmining/database_pairs.tsv /data/experiments/database_pairs.tsv \
/data/knowledge/database_pairs.tsv
This section is about summarising the types of associations and channels of information per id.
To reach that level of summary the basic script is associations_intersection.awk
.
How to run:
./associations_intersection.awk \
/path/to/prego_unicellular_ncbi.tsv \
/path/to/ncbi/ncbi_taxonomy/nodes.dmp \
/path/to/database_pairs.tsv /path/to/database_pairs.tsv \
/path/to/database_pairs.tsv > associations_intersection_all.tsv
As of 2022-09-07 the execution time is 56 minutes for the whole knowledge base.
This creates a file with this structure:
type1 id #association type2 channel rank(if type1==-2)
To add the names of the ids
gawk -F"\t" 'FNR==NR{names[$2]=$3;next}($2 in names){print $0 "\t" names[$2]}' \
/data/dictionary/database_preferred.tsv associations_intersection_all.tsv > associations_intersection.tsv
To gather the multiple channels of each association (if any) type run the following one-liner:
gawk -F"\t" '{names[$2"\t"$4]=$7; association[$2"\t"$4]+=$3; channels[$2"\t"$4]=(channels[$2"\t"$4]?channels[$2"\t"$4]"|"$5:$5)}END{for (i in channels){print i FS channels[i] FS association[i] FS names[i]}}' associations_intersection.tsv > associations_intersection_channels.tsv
To gather the multiple association types of an id (if any) :
gawk -F"\t" '{name[$2"\t"$5]=$7; association[$2"\t"$5]+=$3; types[$2"\t"$5]=(types[$2"\t"$5]?types[$2"\t"$5]"|"$4:$4)}END{for (i in types){print i FS types[i] FS association[i] FS name[i]}}' associations_intersection.tsv > associations_intersection_types.tsv
This document aims to complement the R script for the plots and statistics of Mgnify data.
Mgnifymarker gene data have a layered structure of studies, samples and runs/experiments. Each study has multiple samples and each sample can have multiple runs. All this data are stored in JSON files as downloaded from the API to our oxygen server. The basic unit for our species data is samples. This means that for each sample the extract_mgnify_markergene_associations.py script summarizes the organisms found in this sample with the metadata describing them. Metadata are extracted from the tagger executed in the samples metadata and they contain term ids from ENVO (type -27), BTO tissues (type -25), GO biological process (type -21) and host organism (type -2).
Samples are the connecting element of species with their metadata. This makes samples the currency of all the distributions that arise, thereby everything is being counted in terms of samples abundance. Hence there are 3 levels of information : samples associated with species, samples associated with metadata and samples that metadata and species co-occur. Not all samples have species and not all samples have metadata, hence the background distributions are different for metadata and species in terms of samples.
The background of each NCBI id is the number of samples it appears. Respectively, the background of each metadatum term is the abundance of samples it has been found. Evidence, on the other hand, refers to a specific association of a NCBI id with a specific metadatum term id. More specifically, evidence is the abundance of samples that these specific terms co-occur. Using terminology from set theory, evidence is the intersection of samples of NCBI ids with the samples with metadata ids.
We downloaded ~60000 samples from MGnify which contain 3391 NCBI ids. From these ids the 1081 belong to genera and the 1790 belong to species. Regarding metadata, the tagger assigned 442 different terms to samples. Using samples as the medium, we found 365997 associations of NCBI ids with metadata. From these associations, the ~130000 are species-level and more specifically the 124583 are bacterial.
The size of samples with NCBI ids spans 3 orders of magnitude. There is a peak of 390 samples with 94 NCBI ids. Fifty percent of the samples have less than 500 NCBI ids.
The score of Lars for MGnify uses the background of terms to normalize the intersection of the associations. The plot of the number of associations against the score value shows a very dense area until score=5 (mean=4.5, median=2.5 and 3rd quantile=5.2) with a maximum of ~100. Also in cases where the background of metadata is 1 the score is 2 which is very high. Statistically it would be preferable to spread the score distribution across it's range of values, reduce oversampling and undersampling biases.
For these reasons we implemented a different score, mutual information (MI). MI is a measure of the dependence between the two random variables and is insensitive to the size of the data sets. In addition, MI is zero if and only if the two random variables are strictly independent (Mutual Information between Discrete and Continuous Data Sets).
The distribution of MI is more spread across associations compared with the previous score. Is it more relevant? This is the next thing to explore.