Home
Tumor cells can arise from non-synonymous coding mutations. A non-synonymous coding mutation in the DNA alters the amino acid sequence of an endogenous protein. During antigen presentation, these proteins are degraded by the proteasome into peptides. The peptides are loaded onto antigen presentation complexes, in particular the major histocompatibility complex (MHC), which is then transported to the cell surface. Mutated peptides presented this way are referred to as neoantigens.
T-cell receptors can bind specifically to the peptides displayed on the MHC and trigger an immune response. Because these mutation-derived peptides are aberrant, T cells recognize them as “non-self”, or foreign, and eliminate the cell. Personalized vaccines that leverage this mechanism have shown great promise in driving an immune response against tumor cells.
MHC genes have a high level of allelic diversity, so MHC molecules vary in their affinity for a given mutated peptide. Identifying suitable neoantigen targets for a patient's MHC alleles is therefore essential for developing personalized therapies.
Given the interest in personalized immunotherapy, pVACtools was developed by the Griffith Lab at Washington University in St. Louis to help identify and visualize these tumor neoantigens. pVACseq is a tool within the pVACtools toolkit that identifies and prioritizes neoantigens using tumor-normal DNA and RNA data.
The Trinity Cancer Transcriptome Analysis Toolkit (CTAT) provides tools for leveraging RNA-seq to gain insights into the biology of cancer transcriptomes. CTAT-pVACseq uses the pVACseq framework to identify neoantigens from RNA-seq data. As a best practice, users are encouraged to run the CTAT-Mutations pipeline and use its outputs as inputs for CTAT-pVACseq.
More information on pVACseq can be found on the pVACtools website.
CTAT-pVACseq requires the following programs in order to run:
- Java
- Cromwell
- Docker
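Before launching a workflow, it can save time to confirm that these prerequisites are actually reachable. The following is a minimal sketch, not part of CTAT-pVACseq: the function name `missing_prerequisites` is hypothetical, and it assumes Java and Docker are expected on `PATH` while Cromwell is used as a local jar file (as in the commands on this page).

```python
import shutil
from pathlib import Path


def missing_prerequisites(cromwell_jar, executables=("java", "docker")):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    for exe in executables:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(exe) is None:
            problems.append(f"{exe} not found on PATH")
    if not Path(cromwell_jar).is_file():
        problems.append(f"Cromwell jar not found: {cromwell_jar}")
    return problems


if __name__ == "__main__":
    # The jar name matches the commands on this page; adjust to your download.
    for problem in missing_prerequisites("cromwell-71.jar"):
        print("MISSING:", problem)
```

This only checks that the programs can be found; it does not verify versions or that the Docker daemon is running.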
Running CTAT-pVACseq is a three-step process.
- Run the CTAT-Mutations pipeline to generate a BAM alignment and a VCF.
- Preprocess the inputs for pVACseq.
- Run pVACseq.
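Steps 2 and 3 are both launched through Cromwell with the same command shape, so they can be chained from a small driver. The sketch below is illustrative only: the WDL paths and jar name are the ones used in the commands on this page, while the two input file names (`preprocessing_inputs.json`, `pvacseq_inputs.json`) and the helper names are hypothetical. It defaults to a dry run that only prints the commands.

```python
import subprocess


def cromwell_command(wdl, inputs_json, cromwell_jar="cromwell-71.jar"):
    """Build the argument list for `java -jar <jar> run <wdl> -i <inputs>`."""
    return ["java", "-jar", cromwell_jar, "run", wdl, "-i", inputs_json]


def run_steps(steps, dry_run=True):
    """Run (or just print) one Cromwell command per (wdl, inputs_json) pair."""
    commands = [cromwell_command(wdl, inputs) for wdl, inputs in steps]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops the chain if a workflow fails.
            subprocess.run(command, check=True)
    return commands


# Input file names are hypothetical; keep one inputs file per workflow.
STEPS = [
    ("CTAT-pVACseq/WDL/preprocessing_Main_RNAseq.wdl", "preprocessing_inputs.json"),
    ("CTAT-pVACseq/WDL/pVACseq.wdl", "pvacseq_inputs.json"),
]

if __name__ == "__main__":
    run_steps(STEPS, dry_run=True)
```

Step 1 (CTAT-Mutations) is left out on purpose, since it is run separately, and the pVACseq inputs file still has to be filled in with the VCF paths produced by preprocessing.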
For the first two steps, the Trinity CTAT project provides a public reference library that holds most of the needed reference files. The library can be found here. Please make sure to use the same version of the library throughout. The only additional reference file required is the VEP resource file.
The first step of running the CTAT-pVACseq pipeline for a sample is to run the RNA-seq data through the CTAT-Mutations pipeline to get the BAM alignment and VCF calls we need.
The easiest way to do that is to use one of the containers we provide.
Following this, before running pVACseq itself, we process the CTAT-Mutations outputs to add a number of annotations. In order to run the preprocessing step, the user must have the proper reference files as described above.
The following command will run this preprocessing step:
java -jar cromwell-71.jar \
run CTAT-pVACseq/WDL/preprocessing_Main_RNAseq.wdl \
-i inputs.json
Users have to update the inputs.json file so that it includes the correct reference paths.
The required input files include the following:
Inputs | Description |
---|---|
HaplotypeCaller_VCF | VCF (Variant Call Format) file containing variants, output by CTAT-Mutations |
HaplotypeCaller_VCF_index | Index file for the VCF input |
BAM | BAM alignment file, output by CTAT-Mutations |
BAM_index | Index file for the BAM input |
GTF | GTF (Gene Transfer Format) annotation file |
RNA_editing_VCF | VCF file containing RNA-editing sites |
gnomadVCF | gnomAD VCF file |
gnomadVCFindex | Index file for the gnomAD VCF |
ref_dict | Dictionary for the reference genome file |
ref_fasta | Reference genome Fasta file |
ref_fasta_index | Index for the reference Fasta file |
VEP_Reference | VEP (Variant Effect Predictor) resource file |
Tumor_ID | ID for the given tumor sample |
sample_id | ID for the given sample |
All of the reference files can be found in the Trinity CTAT Resource Library described above. The VEP resource file can be downloaded using the following instructions.
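To catch a forgotten entry before Cromwell starts, the preprocessing inputs file can be generated and checked programmatically. This is a sketch under assumptions: Cromwell expects input keys qualified by the workflow name (`<workflow>.<input>`), but the workflow name used here is a placeholder that must be read from the WDL file, and the helper names are hypothetical. The inputs.json shipped with the repository remains the authoritative template.

```python
import json

# Input names copied from the table above.
REQUIRED_INPUTS = [
    "HaplotypeCaller_VCF", "HaplotypeCaller_VCF_index", "BAM", "BAM_index",
    "GTF", "RNA_editing_VCF", "gnomadVCF", "gnomadVCFindex", "ref_dict",
    "ref_fasta", "ref_fasta_index", "VEP_Reference", "Tumor_ID", "sample_id",
]


def build_inputs(workflow_name, values):
    """Return a Cromwell-style inputs dict, or raise if any input is missing."""
    missing = [name for name in REQUIRED_INPUTS if not values.get(name)]
    if missing:
        raise ValueError("missing inputs: " + ", ".join(missing))
    # Cromwell input keys are qualified as "<workflow name>.<input name>".
    return {f"{workflow_name}.{name}": values[name] for name in REQUIRED_INPUTS}


def write_inputs(path, workflow_name, values):
    """Write the validated inputs to a JSON file for `-i`."""
    with open(path, "w") as handle:
        json.dump(build_inputs(workflow_name, values), handle, indent=2)
```

For example, `write_inputs("inputs.json", "<workflow name from the WDL>", values)` with `values` mapping each name in the table to a path or ID.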
The preprocessing workflow will output two VCFs: annotated_TXGX.vcf and <ID>_decomposed_output.vcf.
The output VCFs from the above preprocessing step can then be fed into the pVACseq workflow. The following command is used to run the pVACseq workflow.
java -jar cromwell-71.jar \
run CTAT-pVACseq/WDL/pVACseq.wdl \
-i inputs.json
Users have to update the inputs.json file and add the paths to the VCF outputs of the preprocessing step. The annotated_TXGX.vcf file will be your input VCF and <ID>_decomposed_output.vcf will be your input phased VCF.
Within the inputs.json file, users can choose which HLA types and epitope lengths to use in the pVACseq analysis. Users can also choose which prediction algorithms to use.
Inputs | Description |
---|---|
VCF | VCF (Variant Call Format) file containing variants; the annotated_TXGX.vcf output from the preprocessing step |
phased_VCF | VCF file that contains the phased proximal variant information; the <ID>_decomposed_output.vcf output from the preprocessing step |
HLAs | Name of the allele to use for epitope prediction. Multiple alleles can be specified using a comma-separated list. ex.) HLA-A*11:01,HLA-A*29:02,HLA-B*08:01,HLA-B*45:01,HLA-C*07:01,HLA-C*06:02 |
epitope_prediction_algorithms | The epitope prediction algorithms to use. Multiple prediction algorithms can be specified, separated by spaces. ex.) MHCflurry MHCnuggetsI MHCnuggetsII NNalign NetMHC NetMHCIIpan NetMHCcons NetMHCpan PickPocket SMM SMMPMBEC SMMalign |
epitope_lengths_I | Length of MHC Class I subpeptides (neoepitopes) to predict. Multiple epitope lengths can be specified using a comma-separated list. Typical epitope lengths vary between 8-15. Required for Class I prediction algorithms. default: 8,9,10,11 |
sample_id | The ID for the given sample. |
cpus | CPU count to use |
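The three list-valued options use different separators (commas for HLAs and epitope lengths, spaces for algorithms), which is easy to get wrong by hand. The sketch below formats them from Python lists and applies a light sanity check. It assumes the workflow takes these values as plain strings, as the table suggests; the function name and the HLA pattern are illustrative and not taken from pVACseq.

```python
import re

# Loose pattern for names such as HLA-A*11:01 (an assumption, not pVACseq's own check).
HLA_PATTERN = re.compile(r"^HLA-[A-Z]+[0-9]*\*\d{2,3}:\d{2,3}$")


def format_pvacseq_options(hlas, algorithms, lengths=(8, 9, 10, 11)):
    """Return the three option strings in the separators described in the table."""
    bad = [hla for hla in hlas if not HLA_PATTERN.match(hla)]
    if bad:
        raise ValueError("unexpected HLA format: " + ", ".join(bad))
    if not algorithms:
        raise ValueError("at least one prediction algorithm is required")
    if any(not isinstance(n, int) or n < 1 for n in lengths):
        raise ValueError("epitope lengths must be positive integers")
    return {
        "HLAs": ",".join(hlas),                              # comma-separated
        "epitope_prediction_algorithms": " ".join(algorithms),  # space-separated
        "epitope_lengths_I": ",".join(str(n) for n in lengths),  # comma-separated
    }
```

The default lengths mirror the default listed in the table. Algorithm names are passed through unchecked, so a misspelled name will only surface when pVACseq runs.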
If you already know the patient's HLA types from another source, you can use them directly in the pVACseq options. If you do not, you will need to run another tool to infer them from your RNA-seq data. Some of the tools available for bulk RNA-seq data are:
- arcasHLA (pub: https://academic.oup.com/bioinformatics/article/36/1/33/5512361)
- OptiType (pub: https://academic.oup.com/bioinformatics/article/30/23/3310/206910)
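Typing tools often report alleles at a higher resolution and without the `HLA-` prefix shown in the HLAs example above. The sketch below converts such calls to that style. It assumes the calls are available as a mapping from gene to allele strings such as `A*11:01:01`; that layout is an assumption for illustration, so check it against the real output of the tool you use. The function name is hypothetical.

```python
# Only class I genes are kept, since the workflow here predicts MHC class I epitopes.
CLASS_I_GENES = ("A", "B", "C")


def to_pvacseq_hlas(genotype, genes=CLASS_I_GENES):
    """Convert {'A': ['A*11:01:01', ...], ...} to 'HLA-A*11:01,...'."""
    alleles = []
    for gene in genes:
        for allele in genotype.get(gene, []):
            name, _, fields = allele.partition("*")
            # Keep the first two fields, e.g. 11:01:01 -> 11:01.
            two_field = ":".join(fields.split(":")[:2])
            formatted = f"HLA-{name}*{two_field}"
            # Homozygous calls would otherwise appear twice.
            if formatted not in alleles:
                alleles.append(formatted)
    return ",".join(alleles)
```

The returned string can be used as the HLAs value in inputs.json.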
After completion, CTAT-pVACseq writes its results to the directory output. Inside it, the directory MHC_Class_I holds the MHC class I epitope predictions.
Output | Description |
---|---|
< SampleID >.all_epitopes.tsv | All predicted epitopes and their binding affinity scores, with additional variant information. |
< SampleID >.filtered.tsv | Epitopes after applying filters; cleavage site, stability predictions, and reference proteome similarity metrics added. |
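The TSV outputs can be inspected with a few lines of Python. The following sketch ranks rows by a binding score column, lowest first, on the assumption that the score is an IC50-style value where lower means stronger predicted binding. Column names vary between pVACseq versions, so the score column is passed in by the caller rather than hard-coded; the function name is hypothetical.

```python
import csv


def top_epitopes(lines, score_column, n=10):
    """Return the n rows with the lowest value in score_column.

    `lines` is any iterable of text lines, e.g. an open file handle.
    """
    rows = list(csv.DictReader(lines, delimiter="\t"))
    if rows and score_column not in rows[0]:
        raise KeyError(f"column not found: {score_column}")

    def score(row):
        try:
            return float(row[score_column])
        except (TypeError, ValueError):
            # Non-numeric entries such as NA are ranked last.
            return float("inf")

    return sorted(rows, key=score)[:n]
```

For example, `top_epitopes(open("output/MHC_Class_I/<SampleID>.filtered.tsv"), "<score column>")`, with the score column name taken from the header line of your own file.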