MongoDB schema
The variation data is stored in MongoDB. The schema consists of 2 collections, described below.
It is recommended to store a single genome assembly per database. When a set of variants is processed, they can be annotated using the Ensembl Variant Effect Predictor (VEP). If variants at the same genomic coordinates, but from different assemblies or species, were stored in the same database, the annotation would be incorrect for some of them.
The first collection stores the variants. The root of each document contains the core data about a reported variant: the genomic position where it was observed, the reference and alternate alleles, identifiers such as RS IDs (if any) and, for simple cases, the HGVS representation.
{
"_id" : "1_14147_T_C",
"chr" : "1",
"start" : 14147,
"end" : 14147,
"len" : 1,
"ref" : "T",
"alt" : "C",
"type" : "SNV",
"ids" : [ "TBGI000006" ],
"hgvs" : [
{
"type" : "genomic",
"name" : "1:g.14147T>C"
}
],
...
}
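The example above suggests how the root fields relate to each other. The following is a minimal sketch, assuming the _id is the chromosome, start, reference and alternate alleles joined by underscores, and that len equals end - start + 1. Both rules are inferred from this single SNV example; the function names are hypothetical and the pipeline may treat other variant types differently.

```python
def variant_id(chrom: str, start: int, ref: str, alt: str) -> str:
    """Compose an identifier shaped like the _id in the example (assumed pattern)."""
    return f"{chrom}_{start}_{ref}_{alt}"


def genomic_hgvs_snv(chrom: str, start: int, ref: str, alt: str) -> str:
    """Genomic HGVS-style name for a single-nucleotide substitution."""
    return f"{chrom}:g.{start}{ref}>{alt}"


def variant_length(start: int, end: int) -> int:
    """Length of the variant, assuming inclusive start and end coordinates."""
    return end - start + 1


print(variant_id("1", 14147, "T", "C"))
print(genomic_hgvs_snv("1", 14147, "T", "C"))
```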
The files nested document contains an array, with one entry per VCF file that reported the variant. A unique study ID (sid) and file ID (fid) must be specified at the time of running the pipeline. In the EVA, we use study and analysis accessions issued by the ENA.
The attrs sub-document stores the information from the QUAL, FILTER and INFO columns, as well as an src field containing the first 8 columns as a compressed blob.
fm contains the FORMAT column, and samp the sample genotypes. The most frequently observed genotype is stored under the def (default) key. Every other genotype is a key whose value lists the (0-based) positions, in the VCF, of the samples that carry it. For instance, if only the first sample was heterozygous, it would be represented as "0/1" : [ 0 ]. The -1 value replaces the missing-allele character (.) used in VCF, because MongoDB does not handle dots in keys well.
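The genotype encoding described above can be read back with a short lookup. This is a sketch only: genotype_at is a hypothetical helper, not part of the pipeline, and it assumes every occurrence of -1 in a key stands for a missing allele.

```python
def genotype_at(samp: dict, position: int) -> str:
    """Return the VCF-style genotype of the sample at a 0-based column position."""
    for genotype, positions in samp.items():
        # Non-default genotypes list the positions of the samples carrying them.
        if genotype != "def" and position in positions:
            return genotype.replace("-1", ".")
    # Samples not listed anywhere carry the default genotype.
    return samp["def"].replace("-1", ".")


samp = {
    "def": "0/0",
    "-1/-1": [3, 4, 8, 11, 12, 13, 15, 16, 18],
    "1/1": [5, 14],
}
print(genotype_at(samp, 0))  # default genotype
print(genotype_at(samp, 3))  # missing genotype, shown with dots again
print(genotype_at(samp, 5))  # homozygous alternate
```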
{
"_id" : "1_14147_T_C",
...
"files" : [
{
"fid" : "ERZ312859",
"sid" : "PRJEB14378",
"attrs" : {
"PR" : "",
"src" : BinData(0,"H4sIAAAAAAAAADPkNDQxNDHnDHFy9zQAATPOEE5nTj0gDAgCAB7kZJgdAAAA")
},
"fm" : "GT",
"samp" : {
"def" : "0/0",
"-1/-1" : [ 3, 4, 8, 11, 12, 13, 15, 16, 18 ],
"1/1" : [ 5, 14 ]
}
},
{
"fid" : "ERZ123186",
"sid" : "PRJEB10964",
"attrs" : {
"QUAL" : "255.0",
"CNV" : "8",
"TA" : "Synonymous",
"TCO" : "428.1",
"TGN" : "LOC_Os01g01030",
"TID" : "LOC_Os01g01030.1",
"src" : BinData(0,"H4sIAAAAAAAAADPkNDQxNDHn1OMM4XTmNDI1BbKc/cJsLaxDHG2DK/Py8ypz80uLrUOc/W1NjCz0DK1D3P1sffyd4/2LDQzTDQwNjA2sQzxd0IT0DAEoRmL1WQAAAA==")
},
"fm" : "GT:GL:GP:GQ:DP:AAC:LP",
"samp" : {
"def" : "0/0",
"-1/-1" : [ 0, 22, 26 ],
"1/1" : [ 34, 37, 38, 43, 46, 48, 49, 50, 53, 54, 59, 60, 63, 68, 69, 70, 79, 82, 88 ],
"0/1" : [ 33, 40, 81 ]
}
}
],
...
}
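The src values in the example above begin with the gzip magic bytes once base64-decoded, so the compressed blob can presumably be read back as sketched below. decode_src is a hypothetical helper. It takes the base64 text shown inside BinData(0, "..."); a driver will usually hand back the raw bytes already, in which case only the gzip step applies.

```python
import base64
import gzip


def decode_src(blob_base64: str) -> str:
    """Decode the base64 text of an src blob and gunzip it into a string."""
    compressed = base64.b64decode(blob_base64)
    return gzip.decompress(compressed).decode("utf-8")


def encode_src(line: str) -> str:
    """Inverse operation, used here only to demonstrate a round trip."""
    return base64.b64encode(gzip.compress(line.encode("utf-8"))).decode("ascii")


# Round trip on a made-up line with 8 tab-separated columns (not real data).
example_line = "1\t14147\tTBGI000006\tT\tC\t.\t.\tPR"
assert decode_src(encode_src(example_line)) == example_line
```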
The st nested document contains another array with population statistics. These statistics are uniquely identified by a study ID (sid), file ID (fid) and cohort/population ID (cid).
Values stored include genotype counts, Minor Allele Frequency (MAF), Minor Genotype Frequency (MGF), number of missing alleles and number of missing genotypes.
A default population "ALL" is always created, in one of 2 ways depending on the data available in the VCF:
- For VCF files with genotypes, all the sample columns are taken as input.
- For aggregated VCF files, the INFO fields must contain the relevant information (AC, AF and AN).
To create extra populations, a PED file listing how the samples are grouped must be provided.
{
"_id" : "1_14147_T_C",
...
"st" : [
{
"maf" : 0.1818181872367859,
"mgf" : 0,
"mafAl" : "C",
"mgfGt" : "0/1",
"missAl" : 18,
"missGt" : 9,
"numGt" : {
"0/0" : 9,
"1/1" : 2,
"-1/-1" : 9
},
"cid" : "ALL",
"sid" : "PRJEB14378",
"fid" : "ERZ312859"
},
{
"maf" : 0.20297029614448547,
"mgf" : 0.029702970758080482,
"mafAl" : "C",
"mgfGt" : "0/1",
"missAl" : 6,
"missGt" : 3,
"numGt" : {
"0/1" : 3,
"1/1" : 19,
"0/0" : 79,
"-1/-1" : 3
},
"cid" : "ALL",
"sid" : "PRJEB10964",
"fid" : "ERZ123186"
}
],
...
}
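The figures in the example above are consistent with the following calculation from the genotype counts. It is a sketch under stated assumptions, not the pipeline's implementation: it covers only biallelic, diploid genotypes, treats a genotype as missing if any of its alleles is -1, considers 0/0, 0/1 and 1/1 as candidate genotypes even when their count is zero, and breaks ties arbitrarily. population_stats is a hypothetical name.

```python
def population_stats(num_gt: dict, ref: str, alt: str) -> dict:
    """Derive MAF, MGF and missing counts from a numGt-style mapping."""
    allele_counts = {"0": 0, "1": 0}
    genotype_counts = {"0/0": 0, "0/1": 0, "1/1": 0}
    miss_al = 0
    miss_gt = 0
    for genotype, count in num_gt.items():
        alleles = genotype.replace("|", "/").split("/")
        miss_al += count * alleles.count("-1")
        for allele in alleles:
            if allele != "-1":
                allele_counts[allele] += count
        if "-1" in alleles:
            miss_gt += count
        else:
            # Normalise the allele order so that 1/0 and 0/1 are counted together.
            genotype_counts["/".join(sorted(alleles))] += count

    total_alleles = sum(allele_counts.values())
    total_genotypes = sum(genotype_counts.values())
    minor_allele = min(allele_counts, key=allele_counts.get)
    minor_genotype = min(genotype_counts, key=genotype_counts.get)
    return {
        "maf": allele_counts[minor_allele] / total_alleles if total_alleles else 0.0,
        "mafAl": ref if minor_allele == "0" else alt,
        "mgf": genotype_counts[minor_genotype] / total_genotypes if total_genotypes else 0.0,
        "mgfGt": minor_genotype,
        "missAl": miss_al,
        "missGt": miss_gt,
    }


stats = population_stats({"0/1": 3, "1/1": 19, "0/0": 79, "-1/-1": 3}, "T", "C")
print(stats)
```

With the counts of the second entry, this gives 41 alternate alleles out of 202 called alleles and 3 heterozygous genotypes out of 101 called genotypes, which agrees with the stored maf and mgf up to single-precision rounding.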
The annot nested document contains 2 arrays with information as reported by Ensembl VEP:
- ct: Consequence type, including the Sequence Ontology (SO) number
- xrefs: Transcripts and genes where the variant was placed
At the moment there is no explicit tracking of which version of VEP was used to generate this information.
{
"_id" : "1_14147_T_C",
...
"annot" : {
"ct" : [
{
"ensg" : "OS01G0100400",
"enst" : "OS01T0100400-01",
"codon" : "Tta/Cta",
"strand" : "+",
"bt" : "protein_coding",
"cDnaPos" : 1335,
"cdsPos" : 1282,
"aaPos" : 428,
"aaChange" : "L",
"so" : [
1819
]
},
...
{
"ensg" : "OS01G0100466",
"enst" : "OS01T0100466-00",
"codon" : "-",
"strand" : "-",
"bt" : "protein_coding",
"aaChange" : "-",
"so" : [
1631
]
}
],
"xrefs" : [
{
"id" : "OS01T0100500-01",
"src" : "ensemblTranscript"
},
...
{
"id" : "OS01G0100100",
"src" : "ensemblGene"
}
]
},
...
}
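The so values in the example above are stored as plain integers. A small sketch of turning them back into Sequence Ontology accessions, which are written as SO: followed by a 7-digit, zero-padded number; so_accession is a hypothetical helper, and the two term names in the lookup are the SO terms for those accessions, listed only for illustration.

```python
def so_accession(number: int) -> str:
    """Format a stored SO number as a Sequence Ontology accession."""
    return f"SO:{number:07d}"


# Illustrative lookup for the two numbers that appear in the example.
SO_TERMS = {
    "SO:0001819": "synonymous_variant",
    "SO:0001631": "upstream_gene_variant",
}

for number in (1819, 1631):
    accession = so_accession(number)
    print(accession, SO_TERMS.get(accession, "unknown"))
```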
The _at nested document contains fields used for query optimization.
{
"_id" : "1_14147_T_C",
...
"_at" : {
"chunkIds" : [
"1_14_1k",
"1_1_10k"
]
}
}
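The chunk IDs in the example above look like the chromosome, the start position divided by the chunk size, and a size label. The sketch below reproduces them and shows how a region filter could use them; it assumes chunks are derived from the start position only, and both function names and the filter shape are hypothetical rather than taken from the pipeline.

```python
CHUNK_SIZES = [(1_000, "1k"), (10_000, "10k")]


def variant_chunk_ids(chrom: str, start: int) -> list:
    """Chunk IDs for a variant, one per chunk size (assumed pattern)."""
    return [f"{chrom}_{start // size}_{label}" for size, label in CHUNK_SIZES]


def region_filter(chrom: str, start: int, end: int,
                  size: int = 1_000, label: str = "1k") -> dict:
    """Build a MongoDB filter that narrows a region query by chunk first."""
    chunks = [f"{chrom}_{i}_{label}" for i in range(start // size, end // size + 1)]
    return {
        "_at.chunkIds": {"$in": chunks},
        "chr": chrom,
        "start": {"$gte": start, "$lte": end},
    }


print(variant_chunk_ids("1", 14147))
print(region_filter("1", 14000, 15999))
```

Matching on a short list of chunk IDs lets an index on that field discard most documents before the range condition on start is evaluated, which is presumably the optimization these fields exist for.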
The second collection contains information about the files, including the study they belong to, the filename, the unique identifiers and the date when each file was processed.
The samp nested document lists the samples, mapping each sample name to the position it occupied in the VCF file.
{
"_id" : ObjectId("5763bf69a748d1a4374ec016"),
"fname" : "merged_ref_v7_2_garys_cut_sorted_nochrprefix.vcf.gz",
"fid" : "ERZ312859",
"sname" : "Genomewide SNP variation reveals relationships among landraces and modern varieties of rice",
"sid" : "PRJEB14378",
"stype" : "Collection",
"date" : ISODate("2016-06-30T13:46:54.275Z"),
"samp" : {
"Aswina" : 0,
"Azucena" : 1,
"Cypress" : 2,
"Dom_Sufid" : 3,
"Dular" : 4,
"FR_13_A" : 5,
"IR_64" : 6,
"LTH" : 7,
"M_202" : 8,
"Minghui_63" : 9,
"Moroberekan" : 10,
"N_22" : 11,
"Nippon_Bare" : 12,
"Pokkali" : 13,
"Rayada" : 14,
"SHZ2" : 15,
"Sadu_Cho" : 16,
"Swarna" : 17,
"Tainung_67" : 18,
"Zhen_Shan_97" : 19
},
...
}
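The name-to-position mapping above is what turns the positions stored in a variant's samp sub-document back into sample names. A minimal sketch, assuming the variant's files entry and the file document have already been matched on fid and sid; samples_with_genotype is a hypothetical helper.

```python
def samples_with_genotype(file_samp: dict, variant_samp: dict, genotype: str) -> list:
    """Names of the samples that carry a genotype, in VCF column order."""
    names_by_position = {position: name for name, position in file_samp.items()}
    if genotype == variant_samp["def"]:
        # The default genotype applies to every sample not listed elsewhere.
        listed = {
            p
            for key, positions in variant_samp.items()
            if key != "def"
            for p in positions
        }
        positions = [p for p in sorted(names_by_position) if p not in listed]
    else:
        positions = variant_samp.get(genotype, [])
    return [names_by_position[p] for p in positions]


# Abbreviated mapping, just the entries needed for this lookup.
file_samp = {"FR_13_A": 5, "IR_64": 6, "Rayada": 14}
variant_samp = {"def": "0/0", "1/1": [5, 14]}
print(samples_with_genotype(file_samp, variant_samp, "1/1"))
```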
The meta nested document contains the VCF header in 2 different forms:
- In plain text, for direct consumption
- Split into multiple sub-documents, to simplify the separate retrieval of some of them
{
"_id" : ObjectId("5763bf69a748d1a4374ec016"),
...
"meta" : {
"header" : "##fileformat=VCFv4.2\n##fileDate=20150721\n##source=PLINKv1.90\n##INFO=<ID=PR,Number=0,Type=Flag,Description=\"Provisional reference allele, may not be based on real reference genome\">\n##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tAswina\tAzucena\tCypress\tDom_Sufid\tDular\tFR_13_A\tIR_64\tLTH\tM_202\tMinghui_63\tMoroberekan\tN_22\tNippon_Bare\tPokkali\tRayada\tSHZ2\tSadu_Cho\tSwarna\tTainung_67\tZhen_Shan_97\n",
"FILTER" : [ ],
"FORMAT" : [
{
"id" : "GT",
"number" : "1",
"type" : "String",
"description" : "Genotype"
}
],
"fileformat" : "VCFv4.2",
"fileDate" : "20150721",
"source" : "PLINKv1.90",
"INFO" : [
{
"id" : "PR",
"number" : "0",
"type" : "Flag",
"description" : "Provisional reference allele, may not be based on real reference genome"
}
]
},
...
}
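The split form shown above can be approximated from the plain-text header as follows. This is only a sketch of the idea, not the pipeline's parser: parse_header is a hypothetical name, the lower-cased sub-keys mirror the example, and header lines other than simple key=value or key=<...> entries are ignored.

```python
import re

STRUCTURED = re.compile(r"^##(\w+)=<(.*)>$")
PLAIN = re.compile(r"^##(\w+)=(.*)$")
# A value is either a double-quoted string (which may contain commas) or bare text.
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|[^,]*)')


def parse_header(header: str) -> dict:
    """Split a VCF header into a meta-like dictionary."""
    meta = {"header": header, "FILTER": [], "FORMAT": [], "INFO": []}
    for line in header.splitlines():
        match = STRUCTURED.match(line)
        if match:
            key, body = match.groups()
            entry = {}
            for name, value in PAIR.findall(body):
                if value.startswith('"'):
                    value = value[1:-1]  # drop the surrounding quotes
                entry[name.lower()] = value
            meta.setdefault(key, []).append(entry)
            continue
        match = PLAIN.match(line)
        if match:
            meta[match.group(1)] = match.group(2)
    return meta


header = (
    "##fileformat=VCFv4.2\n"
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n'
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tAswina\n"
)
print(parse_header(header)["FORMAT"])
```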
The st nested document contains the statistics associated with the file:
- Number of samples
- Number of variants
- Number of SNPs
- Number of short insertions and deletions (INDELs)
- Number of variants that passed all the filters (if any filters applied)
- Number of transitions
- Number of transversions
- Average quality (if quality scores were provided)
{
"_id" : ObjectId("5763bf69a748d1a4374ec016"),
...
"st" : {
"nSamp" : 20,
"nVar" : 163811,
"nSnp" : 163811,
"nIndel" : 0,
"nPass" : 0,
"nTi" : 103609,
"nTv" : 60202,
"meanQ" : -1
}
}
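In the example above, nTi and nTv add up to nSnp (103609 + 60202 = 163811). A short sketch of how single-nucleotide changes are classified as transitions or transversions; this is the standard definition, but the function names are hypothetical and this is not the pipeline's code.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def count_ti_tv(snvs) -> tuple:
    """Count transitions and transversions in an iterable of (ref, alt) pairs."""
    n_ti = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    n_tv = sum(1 for ref, alt in snvs if ref.upper() != alt.upper()) - n_ti
    return n_ti, n_tv


print(is_transition("T", "C"))
print(count_ti_tv([("T", "C"), ("A", "G"), ("A", "C")]))
```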