MongoDB schema

Cristina Yenyxe Gonzalez Garcia edited this page Jan 11, 2017 · 1 revision

Variation data is stored in MongoDB. The schema comprises two collections, Variants and Files, described below.

It is recommended to store a single genome assembly per database. Variants can be annotated with the Ensembl Variant Effect Predictor (VEP) while they are processed, and if variants at the same genomic coordinates but from different assemblies or species were stored in the same database, the annotation would be incorrect for some of them.

Variants

The root of this collection contains the core data about a reported variant: the genomic position where it was observed, the reference and alternate alleles, identifiers such as RS IDs (if any) and, for simple cases, the HGVS representation.

{
    "_id" : "1_14147_T_C",
    "chr" : "1",
    "start" : 14147,
    "end" : 14147,
    "len" : 1,
    "ref" : "T",
    "alt" : "C",
    "type" : "SNV",
    "ids" : [ "TBGI000006" ],
    "hgvs" : [
        {
            "type" : "genomic",
            "name" : "1:g.14147T>C"
        }
    ],
    ...
}
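
As a minimal sketch of how the identifying fields relate to each other, the snippet below rebuilds the `_id` and the genomic HGVS name of the example document from `chr`, `start`, `ref` and `alt`. The `chr_start_ref_alt` pattern is inferred from this single SNV example; other variant types may be handled differently, and both function names are hypothetical.

```python
def variant_id(chrom: str, start: int, ref: str, alt: str) -> str:
    """Join the core fields in the order seen in the example `_id` (assumed pattern)."""
    return f"{chrom}_{start}_{ref}_{alt}"


def genomic_hgvs_snv(chrom: str, start: int, ref: str, alt: str) -> str:
    """Genomic HGVS name for a single-nucleotide substitution, as in the example."""
    return f"{chrom}:g.{start}{ref}>{alt}"


example_id = variant_id("1", 14147, "T", "C")
example_hgvs = genomic_hgvs_snv("1", 14147, "T", "C")
```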

The files nested document contains an array, with one entry per VCF file that reported the variant. A unique study ID (sid) and file ID (fid) must be specified at the time of running the pipeline. In the EVA, we use study and analysis accessions issued by the ENA.

The attrs sub-document stores the information from the QUAL, FILTER and INFO columns, plus a src field containing the first 8 columns of the VCF line as a compressed blob.

fm contains the FORMAT column, and samp the sample genotypes. The most frequently observed genotype is stored under the def (default) key. Every other genotype is a key whose value lists the (0-based) positions in the VCF of the samples with that genotype. For instance, if only the first sample was heterozygous, it would be represented as "0/1" : [ 0 ]. The value -1 replaces the missing-allele character (.) of VCF, because MongoDB does not handle dots in keys well.
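
The base64 payloads of the src fields in the example documents start with `H4sI`, which is how a gzip header looks once base64-encoded, so the blob appears to be gzip-compressed. Under that assumption, and assuming the text is UTF-8, a minimal sketch for reading it back could look like this (both function names are hypothetical):

```python
import base64
import gzip


def decode_src(blob: bytes) -> str:
    """Decompress a gzip-compressed `src` blob into the original VCF text."""
    return gzip.decompress(blob).decode("utf-8")


def decode_src_base64(encoded: str) -> str:
    """Same as decode_src, for the base64 text shown inside BinData(0, "...")."""
    return decode_src(base64.b64decode(encoded))
```

A driver that returns the BinData value as raw bytes can pass it straight to `decode_src`; the base64 variant is only needed when copying the string from shell output.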

{
    "_id" : "1_14147_T_C",
    ...
    "files" : [
        {
            "fid" : "ERZ312859",
            "sid" : "PRJEB14378",
            "attrs" : {
                "PR" : "",
                "src" : BinData(0,"H4sIAAAAAAAAADPkNDQxNDHnDHFy9zQAATPOEE5nTj0gDAgCAB7kZJgdAAAA")
            },
            "fm" : "GT",
            "samp" : {
                "def" : "0/0",
                "-1/-1" : [ 3, 4, 8, 11, 12, 13, 15, 16, 18 ],
                "1/1" : [ 5, 14 ]
            }
        },
        {
            "fid" : "ERZ123186",
            "sid" : "PRJEB10964",
            "attrs" : {
                "QUAL" : "255.0",
                "CNV" : "8",
                "TA" : "Synonymous",
                "TCO" : "428.1",
                "TGN" : "LOC_Os01g01030",
                "TID" : "LOC_Os01g01030.1",
                "src" : BinData(0,"H4sIAAAAAAAAADPkNDQxNDHn1OMM4XTmNDI1BbKc/cJsLaxDHG2DK/Py8ypz80uLrUOc/W1NjCz0DK1D3P1sffyd4/2LDQzTDQwNjA2sQzxd0IT0DAEoRmL1WQAAAA==")
            },
            "fm" : "GT:GL:GP:GQ:DP:AAC:LP",
            "samp" : {
                "def" : "0/0",
                "-1/-1" : [ 0, 22, 26 ],
                "1/1" : [ 34, 37, 38, 43, 46, 48, 49, 50, 53, 54, 59, 60, 63, 68, 69, 70, 79, 82, 88 ],
                "0/1" : [ 33, 40, 81 ]
            }
        }
    ],
    ...
}
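
The compact samp encoding can be expanded back into one genotype per sample. The sketch below does this for the first file entry above; `expand_genotypes` is a hypothetical helper, and the number of samples (20) is taken from the matching document in the Files collection.

```python
def expand_genotypes(samp: dict, n_samples: int) -> list:
    """Return one genotype string per sample, ordered by position in the VCF."""
    # Every sample starts with the default (most frequent) genotype.
    genotypes = [samp["def"]] * n_samples
    for genotype, positions in samp.items():
        if genotype == "def":
            continue
        # -1 stands for the missing allele, written as "." in VCF.
        vcf_genotype = genotype.replace("-1", ".")
        for position in positions:
            genotypes[position] = vcf_genotype
    return genotypes


samp = {
    "def": "0/0",
    "-1/-1": [3, 4, 8, 11, 12, 13, 15, 16, 18],
    "1/1": [5, 14],
}
genotypes = expand_genotypes(samp, n_samples=20)
```

Storing only the non-default genotypes keeps documents small when most samples share the same genotype, at the cost of needing the sample count to rebuild the full list.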

The st nested document contains another array, with population statistics. Each entry is uniquely identified by a study ID (sid), file ID (fid) and cohort/population ID (cid).

Values stored include genotype counts (numGt), Minor Allele Frequency (maf), Minor Genotype Frequency (mgf), and the number of missing alleles (missAl) and missing genotypes (missGt).

A default population "ALL" is always created, in one of 2 ways depending on the data available in the VCF:

  • For VCF files with genotypes, all the sample columns are taken as input.
  • For aggregated VCF files, the INFO fields must contain the relevant information (AC, AF and AN).

To create extra populations, a PED file listing how the samples are grouped must be provided.

{
    "_id" : "1_14147_T_C",
    ...
    "st" : [
        {
            "maf" : 0.1818181872367859,
            "mgf" : 0,
            "mafAl" : "C",
            "mgfGt" : "0/1",
            "missAl" : 18,
            "missGt" : 9,
            "numGt" : {
                "0/0" : 9,
                "1/1" : 2,
                "-1/-1" : 9
            },
            "cid" : "ALL",
            "sid" : "PRJEB14378",
            "fid" : "ERZ312859"
        },
        {
            "maf" : 0.20297029614448547,
            "mgf" : 0.029702970758080482,
            "mafAl" : "C",
            "mgfGt" : "0/1",
            "missAl" : 6,
            "missGt" : 3,
            "numGt" : {
                "0/1" : 3,
                "1/1" : 19,
                "0/0" : 79,
                "-1/-1" : 3
            },
            "cid" : "ALL",
            "sid" : "PRJEB10964",
            "fid" : "ERZ123186"
        }
    ],
    ...
}
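
The following sketch shows how the stored values relate to the genotype counts. The formulas are inferred from the two example entries above, not taken from the pipeline's code, and the function only handles biallelic variants; `population_stats` is a hypothetical name. Treating a genotype as missing when any of its alleles is -1 is also an assumption.

```python
def population_stats(num_gt: dict, ref: str, alt: str) -> dict:
    """Derive allele and genotype frequencies from a numGt sub-document."""
    allele_counts = {"0": 0, "1": 0}
    genotype_counts = {"0/0": 0, "0/1": 0, "1/1": 0}
    miss_al = 0
    miss_gt = 0
    for genotype, count in num_gt.items():
        alleles = genotype.replace("|", "/").split("/")
        missing = alleles.count("-1")
        if missing:
            # Missing calls are tallied separately and excluded from frequencies.
            miss_al += missing * count
            miss_gt += count
            continue
        key = "/".join(sorted(alleles))  # treat 1/0 and 1|0 like 0/1
        genotype_counts[key] = genotype_counts.get(key, 0) + count
        for allele in alleles:
            allele_counts[allele] = allele_counts.get(allele, 0) + count
    total_alleles = sum(allele_counts.values())
    total_genotypes = sum(genotype_counts.values())
    minor_allele = min(allele_counts, key=allele_counts.get)
    minor_genotype = min(genotype_counts, key=genotype_counts.get)
    return {
        "maf": allele_counts[minor_allele] / total_alleles if total_alleles else 0.0,
        "mafAl": {"0": ref, "1": alt}.get(minor_allele, minor_allele),
        "mgf": genotype_counts[minor_genotype] / total_genotypes if total_genotypes else 0.0,
        "mgfGt": minor_genotype,
        "missAl": miss_al,
        "missGt": miss_gt,
    }


stats = population_stats({"0/1": 3, "1/1": 19, "0/0": 79, "-1/-1": 3}, ref="T", alt="C")
```

The results match the stored maf and mgf values up to rounding; the digits in the examples suggest the stored numbers were computed in single precision.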

The annot nested document contains 2 arrays with information as reported by Ensembl VEP:

  • ct: Consequence type, including the Sequence Ontology (SO) number
  • xrefs: Transcripts and genes where the variant was placed

At the moment there is no explicit tracking of which version of VEP was used to generate this information.

{
    "_id" : "1_14147_T_C",
    ...
    "annot" : {
        "ct" : [
            {
                "ensg" : "OS01G0100400",
                "enst" : "OS01T0100400-01",
                "codon" : "Tta/Cta",
                "strand" : "+",
                "bt" : "protein_coding",
                "cDnaPos" : 1335,
                "cdsPos" : 1282,
                "aaPos" : 428,
                "aaChange" : "L",
                "so" : [
                    1819
                ]
            },
            ...
            {
                "ensg" : "OS01G0100466",
                "enst" : "OS01T0100466-00",
                "codon" : "-",
                "strand" : "-",
                "bt" : "protein_coding",
                "aaChange" : "-",
                "so" : [
                    1631
                ]
            }
        ],
        "xrefs" : [
            {
                "id" : "OS01T0100500-01",
                "src" : "ensemblTranscript"
            },
            ...
            {
                "id" : "OS01G0100100",
                "src" : "ensemblGene"
            }
        ]
    },
    ...
}

The _at nested document contains fields used for query optimization.

{
    "_id" : "1_14147_T_C",
    ...
    "_at" : {
        "chunkIds" : [
            "1_14_1k",
            "1_1_10k"
        ]
    }
}
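
The chunk identifiers in the example are consistent with dividing the start position by a chunk size of 1,000 and 10,000 bases. A minimal sketch of that scheme, inferred from this one example, follows; `chunk_ids` and its `chunk_sizes` parameter are hypothetical.

```python
def chunk_ids(chrom: str, start: int, chunk_sizes=(1000, 10000)) -> list:
    """Build one chunk ID per chunk size, e.g. position 14147 falls in 1 kb chunk 14."""
    return [f"{chrom}_{start // size}_{size // 1000}k" for size in chunk_sizes]


ids = chunk_ids("1", 14147)
```

Grouping variants into fixed-size chunks lets a region query match a short list of chunk IDs instead of scanning a range of positions.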

Files

This collection contains information about the files, including the study they belong to, the filename, the unique identifiers and the date when they were processed.

The samp nested document maps each sample name to the (0-based) position its column occupied in the VCF file.

{
    "_id" : ObjectId("5763bf69a748d1a4374ec016"),
    "fname" : "merged_ref_v7_2_garys_cut_sorted_nochrprefix.vcf.gz",
    "fid" : "ERZ312859",
    "sname" : "Genomewide SNP variation reveals relationships among landraces and modern varieties of rice",
    "sid" : "PRJEB14378",
    "stype" : "Collection",
    "date" : ISODate("2016-06-30T13:46:54.275Z"),
    "samp" : {
        "Aswina" : 0,
        "Azucena" : 1,
        "Cypress" : 2,
        "Dom_Sufid" : 3,
        "Dular" : 4,
        "FR_13_A" : 5,
        "IR_64" : 6,
        "LTH" : 7,
        "M_202" : 8,
        "Minghui_63" : 9,
        "Moroberekan" : 10,
        "N_22" : 11,
        "Nippon_Bare" : 12,
        "Pokkali" : 13,
        "Rayada" : 14,
        "SHZ2" : 15,
        "Sadu_Cho" : 16,
        "Swarna" : 17,
        "Tainung_67" : 18,
        "Zhen_Shan_97" : 19
    },
    ...
}
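
Because the Variants collection stores genotypes by sample position, this mapping is what turns those positions back into sample names. As a small sketch, the snippet below inverts the mapping and resolves the samples listed under "1/1" for file ERZ312859 in the variant example; `names_at` is a hypothetical helper.

```python
def names_at(samp: dict, positions: list) -> list:
    """Resolve VCF column positions to sample names using the Files samp mapping."""
    name_by_position = {position: name for name, position in samp.items()}
    return [name_by_position[position] for position in positions]


file_samp = {
    "Aswina": 0, "Azucena": 1, "Cypress": 2, "Dom_Sufid": 3, "Dular": 4,
    "FR_13_A": 5, "IR_64": 6, "LTH": 7, "M_202": 8, "Minghui_63": 9,
    "Moroberekan": 10, "N_22": 11, "Nippon_Bare": 12, "Pokkali": 13, "Rayada": 14,
    "SHZ2": 15, "Sadu_Cho": 16, "Swarna": 17, "Tainung_67": 18, "Zhen_Shan_97": 19,
}
# Positions stored under "1/1" for fid ERZ312859 in the variant document.
homozygous_alt = names_at(file_samp, [5, 14])
```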

The meta nested document contains the VCF header in 2 different forms:

  • In plain text for direct consumption
  • Split into multiple sub-documents, to simplify the separate retrieval of some of them

{
    "_id" : ObjectId("5763bf69a748d1a4374ec016"),
    ...
    "meta" : {
        "header" : "##fileformat=VCFv4.2\n##fileDate=20150721\n##source=PLINKv1.90\n##INFO=<ID=PR,Number=0,Type=Flag,Description=\"Provisional reference allele, may not be based on real reference genome\">\n##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tAswina\tAzucena\tCypress\tDom_Sufid\tDular\tFR_13_A\tIR_64\tLTH\tM_202\tMinghui_63\tMoroberekan\tN_22\tNippon_Bare\tPokkali\tRayada\tSHZ2\tSadu_Cho\tSwarna\tTainung_67\tZhen_Shan_97\n",
        "FILTER" : [ ],
        "FORMAT" : [
            {
                "id" : "GT",
                "number" : "1",
                "type" : "String",
                "description" : "Genotype"
            }
        ],
        "fileformat" : "VCFv4.2",
        "fileDate" : "20150721",
        "source" : "PLINKv1.90",
        "INFO" : [
            {
                "id" : "PR",
                "number" : "0",
                "type" : "Flag",
                "description" : "Provisional reference allele, may not be based on real reference genome"
            }
        ]
    },
    ...
}
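
To illustrate how the plain-text header relates to the split form, here is a rough sketch that parses header lines into a similar structure. It is not the pipeline's parser: `parse_header` is a hypothetical function, it does not unescape quoted values, and unlike the example document it does not emit empty arrays for absent line types such as FILTER.

```python
import re

# Structured lines such as ##INFO=<ID=PR,Number=0,...>: capture the key and the body.
STRUCTURED = re.compile(r"^##(\w+)=<(.*)>$")
# key=value pairs; a value is either a double-quoted string or runs to the next comma.
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|[^,]*)')
# Simple lines such as ##fileformat=VCFv4.2.
SIMPLE = re.compile(r"^##(\w+)=(.*)$")


def parse_header(header: str) -> dict:
    """Split a VCF header into simple values and lists of structured entries."""
    meta = {"header": header}
    for line in header.splitlines():
        structured = STRUCTURED.match(line)
        if structured:
            key, body = structured.groups()
            fields = {}
            for name, value in PAIR.findall(body):
                if value.startswith('"'):
                    value = value[1:-1]  # drop the surrounding quotes
                fields[name.lower()] = value
            meta.setdefault(key, []).append(fields)
            continue
        simple = SIMPLE.match(line)
        if simple:
            meta[simple.group(1)] = simple.group(2)
    return meta


header = (
    "##fileformat=VCFv4.2\n"
    "##source=PLINKv1.90\n"
    '##INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, '
    'may not be based on real reference genome">\n'
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n'
)
meta = parse_header(header)
```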

The st nested document contains the statistics associated with the file:

  • Number of samples (nSamp)
  • Number of variants (nVar)
  • Number of SNPs (nSnp)
  • Number of short insertions and deletions, or INDELs (nIndel)
  • Number of variants that passed all the filters, if any filters were applied (nPass)
  • Number of transitions (nTi)
  • Number of transversions (nTv)
  • Average quality, if quality scores were provided (meanQ)

{
    "_id" : ObjectId("5763bf69a748d1a4374ec016"),
    ...
    "st" : {
        "nSamp" : 20,
        "nVar" : 163811,
        "nSnp" : 163811,
        "nIndel" : 0,
        "nPass" : 0,
        "nTi" : 103609,
        "nTv" : 60202,
        "meanQ" : -1
    }
}
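
For reference, transitions are substitutions within the purines (A, G) or within the pyrimidines (C, T), and every other single-base substitution is a transversion. The sketch below counts both for a list of (ref, alt) pairs; it illustrates the definition only and is not the pipeline's code, and the function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True for A<->G or C<->T substitutions, False for any other pair of bases."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def ti_tv_counts(snvs: list) -> tuple:
    """Count transitions and transversions in a list of (ref, alt) SNV pairs."""
    n_ti = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    return n_ti, len(snvs) - n_ti


n_ti, n_tv = ti_tv_counts([("T", "C"), ("A", "G"), ("A", "C")])
```

In the example document, nTi and nTv add up to nSnp (103609 + 60202 = 163811), which is consistent with every SNP being classified as one or the other.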