Mucor3 an iteration on the original Mucor. Mucor3 encompasses a range of processes involved with not only creation of VCF variant reports but also line-delimited JSON manipulation. Mucor3 translates VCF files into tabular data and aggregates it into useful pivoted tables. VCFs are converted to line-delimited json objects. This allows for great flexibility in filtering the VCF data before pivoting the data. After combining all variant jsonl into one file Mucor3 can convert it to a tabular format and generate pivoted tables that show by default each variant pivoted by sample while display the allele frequency of that variant for a particular sample. depthgauge serves to create a pivoted spreadsheet that shows the read depth at all positions in the pivoted allele frequency table. Mucor3 is broken into several steps that can be performed by a variety of programs to suit your needs. Generally the steps involve annotation, atomization, filtering, manipulation, and report generation.
git clone --recurse-submodules https://github.com/blachlylab/mucor3.git
make
cd mucor3
python setup.py install
In order for VCFs to contain useful information about the mutations they contain, annotation is usually a neccessary step not included in most variant callers. Annotation can be performed by a variety of programs, however contrary to previous versions of this software, mucor3 does not. This is to keep the goals of this project smaller and within scope. Also there are many existing programs to do this that are optimized for this purpose. Some notable annotation software we use is:
- snpEff
- snpSift
- vcfanno
- vep
To allow for greater flexibility in the tools we can use with mucor3, we decided to use JSON as an intermediate representation. So we have atomizers to convert tables and VCFs to line-delimited JSON. The VCF atomizer will convert your vcfs into line delimited json objects. A single json object represents an individual VCF record for a single sample or an individual annotation for a single VCF record for a single sample (if you intend on using elasticsearch for filtering). Read more about the VCF atomizer here.
atomization/atomize_vcf/atomize_vcf sample1.vcf.gz >sample1.jsonl
vcf_atomizer sample2.vcf >sample2.jsonl
cat sample1.jsonl sample2.jsonl ... > data.jsonl
You can use the table atomizer to create JSON records from a sample spreadsheet.
table_atomizer samples.tsv > samples.jsonl
table_atomizer samples.xlsx >samples.jsonl
This data can then be used with the previously generated VCF data to link sample information to VCF variant data. After this data is linked, it can be used for filtering in later steps.
code for linking here
VCFs are often filled with millions of variants, and consequently can make the tables generated by Mucor3 very large. Filtering allows us to potentially reduce the spreadsheets to on variants that are important to us. A number of different tools can be used to approach this. Some programs that can be used to perform filtering:
- jq
- elasticsearch
- apache drill
- couchdb
We provide python scripts to aid in the use of elasticsearch for filtering. We also provide a program called varquery that can perform filtering in a similar way to elasticsearch without the bulk of a full database.
varquery index data.jsonl > data.index
varquery query data.index data.jsonl "/AF > 0.5 AND /INFO/ANN/EFFECT=(missense OR 5_prime_utr)"
More info on using varquery for filtering can be found here. More info on using elasticsearch for filtering can be found here.
Provide Mucor3 with your combined data and an output folder.
mucor3 data.jsonl output_folder
Mucor3 will output a pivoted table that is every variant pivoted by sample and should have this general format:
CHROM | POS | REF | ALT | ANN_gene_name | ANN_effect | sample1 | sample2 |
---|---|---|---|---|---|---|---|
chr1 | 2 | G | T | foo | missense | . | 0.7 |
chr1 | 5 | C | T | foo | synonymous | 1 | 0.25 |
chr1 | 1000 | TA | T | bar | ... | 0.45 | . |
chr1 | 3000 | G | GATAGC | oncogene | ... | 0.01 | . |
The values under sample1 and sample2 are the values from the AF field of FORMAT region of the VCF.
The master table however would represent this same data in this format:
CHROM | POS | REF | ALT | AF | sample | ANN_gene_name | ANN_hgvs_p | ANN_effect |
---|---|---|---|---|---|---|---|---|
chr1 | 2 | G | T | 0.7 | sample2 | foo | p.Met1Ala | missense |
chr1 | 5 | C | T | 1 | sample1 | foo | ... | synonymous |
chr1 | 5 | C | T | 0.25 | sample2 | foo | ... | ... |
chr1 | 1000 | TA | T | 0.45 | sample1 | bar | ... | ... |
chr1 | 3000 | G | GATAGC | 0.01 | sample1 | oncogene | ... | ... |
Note: The ANN_ fields will not be present for VCFs that have not been annotated using SnpEff.
Before running depthgauge we need to know what the first sample name column is in our AF.tsv spreadsheet.
In the above data the column number is 7 for column sample1. We also provide a folder which contains the BAM files
needed to run depthgauge. Important: The bams must have the same name as sample in the spreadsheet and VCFs and must be sorted and indexed.
For our example we would expect the BAMs folder to contain sample1.bam, sample2.bam, sample1.bam.bai, and sample2.bam.bai
.
depthgauge -t 4 output_folder/AF.tsv 7 BAMS/ depthgauge.tsv
This will create an identical table to our first except with read depths instead of allele frequencies.
CHROM | POS | REF | ALT | ANN_gene_name | ANN_effect | sample1 | sample2 |
---|---|---|---|---|---|---|---|
chr1 | 2 | G | T | foo | missense | 10 | 37 |
chr1 | 5 | C | T | foo | synonymous | 100 | 4 |
chr1 | 1000 | TA | T | bar | ... | 20 | 45 |
chr1 | 3000 | G | GATAGC | oncogene | ... | 300 | 78 |
The key advancement of using JSONL as a intermediate data type is it flexibility and use in noSQL datastores. When using a large number of samples or a more permanent dataset that may be analyzed several times, a noSQL database may offer more flexibility and robustness. We have provided python scripts that can be used to upload data to an Elasticsearch instance and query VCF data from an Elasticsearch instance. Other JSONL querying mechanisms can be used i.e. Apache Drill, AWS Athena, newer versions of PostgreSQL, and many others.
The main mucor3 python script creates a pivot table by taking the jsonl directly from the vcf_atomizer and setting the fields CHROM, POS, REF, ALT
as an index, pivoting on sample
, and displaying the AF
for the combination of "index" and "pivot on" value. Using the mucor scripts directly allows for greater flexibility and manipulation. All scripts with the exception of jsonlcsv.py take jsonl as input and output jsonl.
merge.py helps combine rows together to ensure that when a pivot is performed that rows are unique to avoid duplications. The main mucor3 script uses this to ensure we have unique rows for any given variant so we should only have one occurrence of any combination of CHROM, POS, REF, ALT, and sample. merge.py can be used to combine rows in other ways, simply by specifying what column combinations should define a unique row. mucor3's merge would appear as such using the script directly:
cat data.jsonl | python merge.py sample CHROM POS REF ALT
merge.py will concatenate columns for rows that are duplicate based on the provided indices.
CHROM | POS | REF | ALT | AF | sample | ANN_gene_name | ANN_hgvs_p | ANN_effect | ANN_transcript_id |
---|---|---|---|---|---|---|---|---|---|
chr1 | 2 | G | T | 0.7 | sample2 | foo | p.Met1Ala | missense | 1 |
chr1 | 2 | G | T | 0.7 | sample2 | foo | synonymous | 2 | |
chr1 | 5 | C | T | 1 | sample1 | foo | synonymous | 3 |
The above table would be changed to this:
CHROM | POS | REF | ALT | AF | sample | ANN_gene_name | ANN_hgvs_p | ANN_effect | ANN_transcript_id |
---|---|---|---|---|---|---|---|---|---|
chr1 | 2 | G | T | 0.7 | sample2 | foo | missense;synonymous | 1;2 | |
chr1 | 5 | C | T | 1 | sample1 | foo | synonymous | 3 |
This step is neccesary as the vcf_atomizer reports duplicate variant results for multiple SnpEff annotations as this is most efficient for filtering data using Elasticsearch or jq. We must use merge.py to later coelesce the rows back to representing a single variant.