neoepiscope
is peer-reviewed open-source software for predicting neoepitopes from DNA sequencing (DNA-seq) data. Where most neoepitope prediction software confines attention to neoepitopes arising from at most one somatic mutation, often just an SNV, neoepiscope
uses assembled haplotype output of HapCUT2 to also enumerate neoepitopes arising from more than one somatic mutation. neoepiscope
also takes into account frameshifting from indels and permits personalizing the reference transcriptome using germline variants.
Read our paper in Bioinformatics
neoepiscope
v0.2.x has a critical bug where homozygous variants are not phased with heterozygous variants. Please update to the most recent version.
neoepiscope
is licensed under the MIT license. See LICENSE
for more details.
Portions of neoepiscope---specifically, segments of code in transcript.py
, bowtie_index.py
, and download.py
---are taken from Rail-RNA, which is copyright (c) 2015 Abhinav Nellore, Leonardo Collado-Torres, Andrew Jaffe, James Morton, Jacob Pritt, José Alquicira-Hernández, Christopher Wilks, Jeffrey T. Leek, and Ben Langmead and licensed under the MIT License.
or email [email protected].
neoepiscope
is compatible with Python 3.6 and higher. To install, run
pip install neoepiscope
Note: if this fails on macOS 10 (Catalina) or newer, the required pysam installation may be unable to find the C compiler. To solve this, you can try either 1) running xcode-select --install
or 2) installing pysam via conda (e.g. conda install -c bioconda pysam
) before trying pip install neoepiscope
again.
To download compatible reference annotation files (hg19, GRCh38, and/or mouse mm9) and link installations of relevant optional softwares to neoepiscope
(e.g. netMHCpan), you will need to use our download functionality. Run the command:
neoepiscope download
and respond to the prompts as relevant for your needs.
To make sure that the software is running properly, clone this repository, and from within it run:
python setup.py test
If you aren't using human hg19 or GRCh38 or mouse mm9 reference builds from our download functionality, you will need to download and prepare your own annotation files. Before calling any neoepitopes, run neoepiscope in index
mode to prepare dictionaries of transcript data used in neoepitope prediction:
neoepiscope index -g <GTF> -d <DIRECTORY TO HOLD PICKLED DICTIONARIES>
Options:
-g, --gtf
path to GTF file
-d, --dicts
path to write pickled dictionaries
To call neoepitopes from somatic mutations, ensure that the column with data for the tumor sample in your VCF file precedes the column with data from a matched normal sample. If it does not, run neoepiscope in swap
mode to produce a new VCF:
neoepiscope swap -i <INPUT VCF> -o <SWAPPED VCF>
Options:
-i, --input
path to input VCF
-o, --output
path to swapped VCF
If you would like to include germline variation in your neoepitope prediction, merge
your somatic and germline VCFs for a sample prior to phasing variants:
neoepiscope merge -g <GERMLINE VCF> -s <SOMATIC VCF> -o <MERGED VCF>
Options:
-g, --germline
path to germline VCF
-s, --somatic
path to somatic VCF
-o, --output
path to write merged VCF
-t, --tumor-id
tumor ID (matching sample in tumor BAM file's read group field)
If you plan to use GATK's ReadBackedPhasing for haplotype phasing (see below), make sure to specify a tumor ID using the -t
flag. It should match the sample name in the header of your tumor BAM file (the SM value in the read group field).
Next, run HapCUT2 with your merged or somatic VCF and your tumor BAM file (make sure to use --indels 1
when running extractHAIRS
if you wish to predict neoepitopes resulting from insertions and deletions). Before calling neoepitopes, prep
your HapCUT2 output to included unphased mutations as their own haplotypes and flag germline variants if relevant:
neoepiscope prep -v <VCF> -c <HAPCUT2 OUTPUT> -o <ADJUSTED HAPCUT OUTPUT>
Options:
-v, --vcf
path to VCF file used to generate HapCUT2 output
-c, --hapcut2-output
path to original HapCUT2 output
-o, --output
path to output file
-p, --phased
flag input VCF as phased with GATK ReadBackedPhasing
Alternatively, you may perform phasing using GATK's ReadBackedPhasing on your merged or somatic VCF. If you phased variants with GATK instead of HapCUT2, make sure to use the -p
flag when running neoepiscope prep
to format your output:
neoepiscope prep -v <VCF> -o <ADJUSTED HAPCUT OUTPUT> -p
You may also predict neoepitopes without phasing by preparing your merged or somatic VCF:
neoepiscope prep -v <VCF> -o <ADJUSTED HAPCUT OUTPUT>
Finally, call
neoepitopes:
neoepiscope call -b <GENOME BUILD> -c <PREPPED HAPCUT2 OUTPUT> [options]
Options:
-x, --bowtie-index
path to bowtie index of reference genome
-d, --dicts
path to directory containing pickled dictionaries generated in index
mode
-b, --build
which genome build to use (human hg19 or GRCh38 or mouse mm9; overrides -x
and -d
options)
-c, --merged-hapcut2-output
path to HapCUT2 output adjusted by neoepiscope prep
-v, --vcf
path to VCF file used to generate HapCUT2 output
-o, --output
path to output file
-f, --fasta
output additional fasta file output
-k, --kmer-size
kmer size for neoepitope prediction (default 8-11 amino acids)
-p, --affinity-predictor
software to use for MHC binding predictions (default MHCflurry v1 with rank and affinity scores)
-a, --alleles
alleles to use for MHC binding predictions
-n, --no-affinity
do not run binding affinity predictions, overrides the -p
and -a
options
-g, --germline
how to handle germline mutations (by default includes as background variation)
-s, --somatic
how to handle somatic mutations (by default includes for neoepitope enumeration)
-e, --rna-edits
path to directory containing REDIportal-formatted RNA edits file
-u, --upstream-atgs
handling of translation from upstream start codons - ("novel" (default) only, "all", "none", "reference" only)
-i, --isolate
isolate mutations - disables phasing of mutations which share a haplotype
--nmd
enumerate neoepitopes from nonsense mediated decay transcripts
--pp
enumerate neoepitopes from polymorphic pseudogene transcripts
--igv
enumerate neoepitopes from IG V transcripts
--trv
enumerate neoepitopes from TR V transcripts
--allow-nonstart
enumerate neoepitopes from transcripts without annotated start codons
--allow-nonstop
enumerate neoepitopes from transcripts without annotated stop codons
--rna-bam
path to paired end RNA-seq alignment file
--transcript-counts
path to file containing per-transcript read counts
--tpm-threshold
minimum transcript TPM required to retain neoepitope
Using the --build
option requires use of our download
functionality to procure and index the required reference files for human hg19, human GRCh38, and/or mouse mm9. If using an alternate genome build, you will need to download your own bowtie index and GTF files for that build and use the neoepiscope index
mode to prepare them for use with the --dicts
and --bowtie-index
options.
Haplotype information should be included using -c /path/to/haplotype/file
. This in the form of HapCUT2 output, generated either from your somatic VCF or a merged germline/somatic VCF made with our neoepiscope merge
functionality. The HapCUT2 output should be adjusted using our neoepiscope prep
functionality to ensure that mutations that lack phasing data are still included in analysis.
If you wish to extract variant allele frequency information from your somatic VCF to be output with relevant epitopes, include the path to the somatic VCF you used to create your merged VCF using -v /path/to/VCF
.
To specify the output file, use -o /path/to/output_file
. If no output file is specified, the output will be written to standard out. By default, only data on neoepitopes is output in the file. By using the --fasta
option, an additional file, /path/to/output_file.fasta, will be made. This is a FASTA file specifying the full-protein sequences from each mutation-affected transcript. The header in the FASTA will give the name of the transcript from which the protein originated, followed by "v[#]" for every version of the transcript. This option is only available when writing output to a file, not standard out.
The default kmer size for neoepitope enumeration is 8-11 amino acids, but a custom range can be specified using the --kmer-size
argument with the minimum and maximum epitope size separated by commas (e.g. --kmer-size 8,20
to get epitopes ranging from 8 to 20 amino acids in length).
For affinity prediction, neoepiscope
currently supports predictions from MHCflurry
v1, MHCnuggets
v2, netMHC
v4, netMHCpan
v3 or v4, netMHCIIpan
v3, netMHCII
v2, PickPocket
v1, netMHCstabpan
v1, and PSSMHCpan
v1. When installing our software with pip
, MHCflurry
and MHCnuggets
are automatically installed or updated. Optional integration of netMHC
, netMHCpan
, netMHCIIpan
, netMHCII
, PickPocket
, netMHCstabpan
, or PSSMHCpan
must be done from your own installation of these softwares using our download functionality (see "Installing neoepiscope" above). Note that gawk
may be required for the use of these additional tools. Please note that MHCflurry and MHCnuggets require the use of TensorFlow, which was limited compatibility with python v3.7. If you would like to use these tools, please use python v3.6 or lower to run neoepiscope
.
The default affinity prediction software for neoepiscope
is MHCflurry
v1. To specify a custom suite of binding prediction softwares, use the -p
argument for each software followed by its name, version, and desired scoring output(s) (e.g. -p mhcflurry 1 affinity,rank -p mhcnuggets 2 affinity
). To forgo binding affinity predictions, use the --no-affinity
command line option.
Germline and somatic mutations can be handled in a variety of ways. They can be excluded entirely (e.g. --germline exclude
), included as background variation to personalize the reference transcriptome (e.g. --germline background
), or included as variants from which to enumerate neoepitopes (e.g. --somatic include
). The default value for --germline
is background
, and the default value for --somatic
is include
.
The choice of start codon for a transcript can also be handled with flexibility. By default, the value for the --upstream-atgs
argument is none
, which specifies preferential use of the reference start codon for a transcript, or alternatively the nearest ATG downstream of it in the case of a disrupted reference start codon. Alternatively, the use of --upstream-atgs novel
allows for the use of a novel ATG upstream of the reference start codon in the case of a disrupted start codon. A less conservative --upstream-atgs all
uses the most upstream ATG, regardless of its novelty. For a conservative option, --upstream-atgs reference
requires use of only the reference start codon, preventing enumeration of neoepitopes from a transcript if the reference start codon is disrupted.
By default, neoepiscope
only enumerates neoepitopes from protein coding transcripts with annotated start and stop codons. However, by specifying the --nmd
, --pp
, --igv
, and/or --trv
flags, you can additionally enumerate neoepitopes from nonsense mediated decay transcripts, polymorphic pseudogene transcripts, immunoglobulin variable transcripts, and/or T cell receptor variable transcripts, respectively. For further flexibility, you can add the --allow-nonstart
and/or --allow-nonstop
to enumerate neoepitopes from transcripts without annotated start and/or stop codons, respectively.
Two options exist for quantifying expression of neoepitopes: 1) providing transcript read counts to calculate transcript-level expression in TPM or 2) providing an RNA alignment to calculate direct read-level support of the source mutation. Both options may be used simultaneously. To calculate transcript-level expression, use the --transcript-counts
option and provide the path to a tab-seperated file with transcript identifiers in the first column and read counts in the second column (e.g. the output from HTseq
's htseq-count
program). This will provide the TPM value(s) for the transcript(s) a neoepitope is associated with. To additionally filter out neoepitopes from poorly expressed transcripts, you can use the --tpm-threshold
option to set a minimum TPM requirement. To calculate mutation-level expression, you can provide a paired-end RNA-seq BAM alignment file. This will provide the number of reads supporting the mutation, the number of reads covering the position of the mutation, and the percent of reads covering the mutation which support that mutation. NOTE: mutation-level expression requires samtools
to be installed and in your PATH.
neoepiscope
output is a TSV file, either written to standard out by default, or the file named with the --output
option. The 1st column lists the neoepitope sequence. The 2nd column lists the chromosome on which the source mutation occurs, and the 3rd column lists the position of the mutation on that chromosome. The 4th column lists the reference nucleotide sequence at that position (*
for insertions), and the 5th column lists the alternative nucleotide sequence at that position (*
for deletions). The 6th column lists the type of variant - V
for SNVs/MNVs, I
for insertions, and D
for deletions. The 7th column lists the VAF for that mutation (if available), and the 8th column lists the paired normal epitope for neoepitopes resulting from SNVs/MNVs. The 9th column lists any warnings associated with the neoepitope or its transcript(s) of origin (e.g. if the reference start codon was disrupted and an alternative start codon was used), the 10th column lists the Ensembl identifier(s) of the transcript(s) of origin for the neoepitope, and the 11th column lists the transcript type(s) of the transcript(s) of origin. The 12th column lists the Ensembl identifier(s) of any genes associated with the transcript(s) of origin, and the 13th column lists the gene name(s). The 14th column lists the TPM(s) expression levels for the transcript(s) associated with that epitope. The 15th column lists the number of RNA-seq reads supporting the source mutation The 16th column lists the number of RNA-seq reads covering the position of the source mutation. The 17th column lists the percentage of reads covering the position of the source mutation which support that mutation. The 18th column lists the IEDB identifier(s) associated with the epitope if it is a known sequence, with any relevant peptide modifications listed. If any MHC binding predictions were run for neoepitopes, the following columns list the binding affinities of the neoepitope for that HLA allele/binding prediction tool combination as labeled (e.g. mhcnuggets_HLA-A*02:01_affinity
represents the binding affinity in nM of that neoepitope for the allele HLA-A*02:01 as predicted by MHCnuggets
).
If the --fasta
option was specified, a fasta file will also be written to the file specified with the --output
option, with the additional extension .fasta
. Sequence names will be transcript identifiers followed by _vX
, where X
is a version number. Sequences are the amino acid sequences derived from translation of that transcript.