Bug Reports and Issue Tracking
mucor is software to aggregate variant information sourced from multiple VCF files (and some others, see below) into a variety of summary files with varying levels of detail and statistics. The outputs range from high-level summaries to full, detailed reports in text and Microsoft Excel formats. The intended audience is both biologists and bioinformaticians.
The following modules are required:
- numpy (http://www.numpy.org/)
- pandas (http://pandas.pydata.org/)
- HTSeq (http://www-huber.embl.de/HTSeq/)
The following modules are optional:
- pytabix (https://github.com/slowkow/pytabix)
- XlsxWriter (https://github.com/jmcnamara/XlsxWriter)
- xlwt (https://pypi.python.org/pypi/xlwt)
- bgzip and Tabix (optional, see Databases, below; https://github.com/samtools/htslib)
mucor requires a reference annotation file (in GTF or GFF3 format) for feature definition. Contig (chromosome) names must match between the input VCF and reference annotation file; for example, for Homo sapiens, UCSC uses the form chr17 while Ensembl uses the form 17.
Protip: If your human data are aligned and variants called with chr style contig names but you wish to use Ensembl genes, you can try the GENCODE annotation which contains most Ensembl genes but using chr style contig numbering.
mucor can optionally check variants against any number of supplied databases and report presence (including identifier) or absence in the database. Common choices would be one or more releases of dbSNP (e.g., dbSNP 137 and latest dbSNP; dbSNP clinvar and dbSNP all), 1000 Genomes, NHLBI 6500 Exomes, or COSMIC. Users may also supply their own custom databases.
Databses must meet the following requirements: * Conform to VCF standard format * be compressed with bgzip (not gzip; see below) * be indexed with tabix (see below).
bgzip and tabix provide massive speedups looking up variants in these
external files. The database lookup feature is not available if files
are not correctly bgzipped or tabix indexed, or if the Pytabix module is
not installed. bgzip
and tabix
are available as part of
htslib
(https://github.com/samtools/htslib)
Beginning with an example VCF dbSNP.vcf
, execute the following:
$ bgzip dbSNP.vcf $ tabix -p vcf dbSNP.vcf.gz
Note that bgzipped files can be treated as regular .gz files by other tools.
These data are the data that you wish to aggregate and summarize. In theory, mucor will work with any well-formed generic VCF, but it has been tested with and contains specific code to handle VCF files generated by the following tools:
- snpEff
- Ion Torrent PGM (default machine output)
- Illumina Miseq (default machine output)
- GATK
- GATK SomaticIndelDetector
- GATK HaplotypeCaller
- MuTect
- VarScan
- FreeBayes
- Samtools
In addition, mucor can read the more detailed .out files produced by MuTect.
clone from https://github.com/blachlylab
Running mucor to aggregate and summarize variants is a two step process:
- Project setup / configuration
mucor_config.py
creates a JSON config file
- Variant Aggregation!
mucor.py
The configuration step is completed using the provided
mucor_config.py
utility. It accepts the following command line
arguments and creates a JSON file that will be passed to the main mucor
script.
-ex, --example
Print a valid, example JSON config file and exit.
Function that will write a template of the JSON config. It can be edited
manually and supplied to mucor.
-g GFF, --gff GFF
Reference annotation GFF/GTF for feature binning.
Required
-f FEATURETYPE, --featuretype FEATURETYPE
Feature type into which to
bin. Gencode GTF example: gene_name, gene_id, transcript_name,
transcript_id, etc. Required
-db DATABASES, --databases DATABASES
Colon delimited name and path
to variant database in bgzipped VCF format. Can be declared >= 0 times.
Ex: -db name1:/full/user/path/name1.vcf.gz. Optional
-s SAMPLES, --samples SAMPLES
Text file containing sample names. One
sample per line. mucor_config.py
attempts to guess which files
belong with which sample IDs using globbing (wildcard filename
matching). This means that the auto-configuration may be incorrect if
any sample names are contained within another sample name. Ex: U-23 and
U-238. U-23 would erroneously identify U-238 files, requiring manual
modification of the JSON file. Sample IDs U-023 and U-238 would not
exhibit this problem. Required
-d PROJECT_DIRECTORY, --project_directory PROJECT_DIRECTORY
Working/project directory, in which to find variant call files to
aggregate. Variant calls can be in the provided directory, or any of its
subdirectories. Default: current working directory
-vcff VCF_FILTERS, --vcf_filters VCF_FILTERS
Comma separated list of
VCF filters to allow. Default: PASS
-a ARCHIVE_DIRECTORY, --archive_directory ARCHIVE_DIRECTORY
Specify
directory in which to read/write archived annotations. This step will
significantly speed up future runs that use the same annotation and
feature type, even if the sample data changes. Undefined will prevent
using the annotation archive features. Optional
-r REGIONS, --regions REGIONS
Comma separated list of bed regions
and/or bed files by which to limit output. Bed regions can be specific
positions, or entire chromosomes. Ex:
chr1:10230-10240,chr2,my_regions.bed. Optional
-u, --union
Join all items with same ID for feature_type (specified
by -f) into a single, continuous bin. For example, if you want intronic
variants counted with a gene, use this option. WARNING, this will lead
to spurious results for features that are duplicated on the same contig.
When feature names are identical, the bin will range from the beginning
of the first instance to the end of the last, even if they are several
megabases apart. Refer to the documentation for a resolution using
'detect_union_bin_errors.py.' Optional.
-jco JSON_CONFIG_OUTPUT, --json_config_output JSON_CONFIG_OUTPUT
Name of JSON configuration output file. This is the configuration file
fed into mucor. Required
-outd OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
Name of
directory in which to write mucor output. Required
-outt OUTPUT_TYPE, --output_type OUTPUT_TYPE
Comma separated list of
desired output types. Options include: counts, txt, longtxt, xls,
longxls, bed, featXsamp, featmutXsamp, all. Default: counts,txt (See the
detailed description of Output File Formats,
below)
Example:
python ./mucor_config.py -g ~/references/human/gencode/gencode.v19.annotation.gtf -f gene_name -s samples.txt -a ./fast -u -jco mucor_config.json -outd ./mucor_output -outt all -db 1000G:~/references/human/1000_genomes/chrm_1000Genomes.20130502.genotypes.vcf.gz -db dbsnp:~/references/human/dbsnp/common_all.hg19.sorted.leftalign.vcf.gz
mucor_config.py
produces a JSON-formatted configuration file
embodying the selected options for the subsequent mucor run. The
configuration could be edited manually (e.g., to tweak a database name
or path) or programmatically (e.g., if mucor_config.py
's guesses
about sample IDs were incomplete) at this stage.
Alternatively, mucor_config.py -ex
produces a syntactically valid
example JSON file for editing.
Producing a configuration file is intended to facilitate the following: * Consistency between runs * Documentation of settings * Easier use with dozens to thousands of input files
Execution: mucor.py <config.json>
mucor is executed by launching the main mucor.py
script, passing as
a single parameter the name of the previously generated/edited JSON
file.
Output files of the specified type are placed in the output directory specified during configuration.
Output types are specified at the time of configuration. The user may select any number and combination of output types from the list below.
all Execute all output types
counts Print counts of mutations per feature. Output: counts.txt
txt Print all information about each variant, with one-per-row,
irrespective of how many samples in which it appears. Useful for
variant-centric projects. Identical to xls in layout. Output:
variant_details.txt
longtxt Similar to txt above, but writes each instance of a variant
to a new row. Each variant is written once per source file, instead of
combining recurrent variations into 1 unique row. Identical to longxls
in layout. Output: long_variant_details.txt
xls Print all information about each variant, with one-per-row,
irrespective of how many samples in which it appears. Useful for
variant-centric projects. Identical to txt in layout. Output:
variant_details.xls/xlsx
longxls Similar to xls above, but writes each instance of a variant
to a new row. Each variant is written once per source file, instead of
combining recurrent variations into 1 unique row. Identical to longtxt
in layout. Output: long_variant_details.xls/xlsx
NB: The XLS format has a hard limit of 2^16 rows; in long record
format, a moderate sized study could exceed this (2,000 total
variants/sample * 32 samples = 65,536 rows). mucor can use Python's
xlwt
module to write .xls format, but it is preferrable to have
XlsxWriter
or openPyxl
installed for .xlsx support.
bed Print bed file of the variant locations Output:
variant_locations.bed
vcf Print vcf file of the variant locations, features, depths, and
variant frequencies. Output: variant_locations.vcf
featXsamp Print table of mutation counts per feature per sample.
Samples are in columns, while features are in rows. The count of unique
mutations per sample per feature are the table values. This output is
useful for examining patterns in variation across samples, for example,
to look at combinatoric mutation status for selected recurrently mutated
genes. This output could be used directly to make a heatmap. Output:
feature_by_sample.txt
mutXsamp Print table of mutations per sample. Unlike featXsamp,
this differentiates among different variants within the same features.
For example, in acute leukemia, the functional effect of mutations in
DNMT3A depends on whether it is an R882 mutation or non-R882 mutation.
As before, samples are in columns, with features in rows. However, rows
2-4 contain information about chromosome, position, ref, and alt. The
table values are boolean: 1 for present mutation, 0 for missing
mutation. This output could be used directly or with appropriate
filtering to make an Oncoprint. Output:
feature_and_mutation_by_sample.txt
mutXsampVAF Print table of mutations per sample. Identical to
mutXsamp except boolean values are replaced by the respective
variant VAF. Output: feature_and_mutation_by_sample.txt
DepthGauge, a companion utility to mucor, measures coverage at regions of interest to increase confidence in variant calls. The purpose of this tool is not only to verify sufficient depth at mutations, but more importantly, to verify sufficient depth in places that are not called mutant. The lack of a mutation call may indicate no mutation, or simply that the region of interest had insufficient coverage for analysis. Like mucor, DepthGauge is dependent only on the JSON formatted configuration file which may be entirely hand created, entirely automatically generated by mucor_config.py, or automatically generated and hand-tuned. DepthGauge has three additional options, the latter of which can override the regions, if any, specified in the JSON configuration file.
depth_gauge.py config_file.json [-p] [-c] [-r REGIONS]
-p, --point Instead of reporting an average depth for every location within a window (default), take the middle coordinate within the range and calculate the depth at that point as a surrogate for the entire region
-c, --count Instead of reporting an average depth for every location within a window (default), or taking a point estimate from the middle (-p), instead count the total number of reads mapped within the region.
-r REGIONS This option specifies a list of regions to query for depth. If present, it overrides the region(s) specified in the JSON configuration file. As an alternative, the JSON configuration file could be edited before running DepthGauge.
Output: Depth_of_Coverage.xlsx
The --union feature will behave inappropriately when genomic feature names are duplicated on the same contig. For example, if gene "ABC" is duplicated on the beginning and the end of chromosome 1, the feature bin for gene "ABC" will cover the whole contig (from the beginning of the first copy of "ABC", to the end of the last copy). Users may select another feature type, such as gene_id, which is unique to every copy of a gene. Otherwise, users may run the included python script [detect_union_bin_errors.py] to detect potential bin errors. It accepts a feature_type, a GTF/GFF annotation, and an output directory. The output is a text file list of feature names likely to cause large bin errors. Place this text document into the working directory where mucor will be executed. mucor will automatically search for the text file by name ['union_incompatible_genes.txt'] in the current directory and print a message indicating when the file is found.
python ./detect_union_bin_errors.py -o ~/projects/mucor -g ~/references/human/gencode/gencode.v19.annotation.gtf -f gene_name
There is an issue when a variant file presents a contig that the pickled
(archived) annotation does not have. This will throw a warning that
shows how many contigs were unknown, and how many mutations were
encountered on these contigs. The solution is to disable the archive
feature by omitting the -a
or --archive_directory
option. This
will permit the unknown contig in output, but all mutations on the
unknown contig will be shown as having no feature.
*** WARNING: 18 Contigs and 39 mutations are in areas unknown to the genomic array of sets. If using --fast, perhaps try again without it. ***
The VCF files need to have columns #CHROM, POS … etc. The configuration script checks each VCF file for proper columns and will print a warning if any are missing or wrong. However, it does not halt execution and will include any malformed column VCF files and attempt to process them regardless. The main script may finish execution with the malformed VCF, but the output may be perturbed or useless.
File "mucor/inputs.py", in parse_VarScan position = int(row[fieldId['POS']]) KeyError: 'POS'
Users may not supply vcf files that have inconsistent 'effect' and/or
'functional consequence' for the same variant. Presumably, if the
variant has the same location and reference and alternate allele, the
effect and functional consequences would be predicted to be the same.
The problem lies in collapsing mutations in the variant_details
output type(s). This issue may arise when different platforms or
pipelines are used for samples containing a common variant within a
single run of mucor. The current solution is to annotate all VCF files
with the same functional consequence decoration software.
Check the issue tracker at the github repository. Please report bugs and request new features there for better tracking.