RNA-SSNV

RNA-SSNV is a scalable and efficient method for detecting RNA somatic mutations from paired RNA-WES (tumor-normal) sequencing data. It uses Mutect2 as the core caller, followed by a multi-filtering strategy and a machine-learning model, to maximize precision and recall. Once the configuration files and sample information are set up properly, it runs in a highly automated fashion and reports an aggregated mutation file (standard maf format) to facilitate downstream analysis and clinical decision-making.

Important: this GitHub repository stores the manual and the necessary code for RNA-SSNV. The practical application is located in our OneDrive storage.

Pre-requirements

Required Python Packages

pip install -r requirements.txt --user

Modify config & table file

tables/project_RNA_somatic_calling_info.tsv: project-related sequencing data information. Modify it according to the instructions below.

Project RNA somatic calling sample info

Columns (in this order):

file_name: name of the bam file.
file_id: name of the folder that contains the bam file.
aliquots_id: ID of the sequenced aliquot (must be unique to the specific bam file).
case_id: ID of the corresponding patient's case.
sample_type: type of the aliquot's origin. Beware: tumor samples should be "Primary Tumor", and paired-normal samples should be "Solid Tissue Normal" or "Blood Derived Normal". The "Best Normal" sample must also be chosen and included to support multi-sample calling.
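
The column rules above can be checked before launching the pipeline. The following is a minimal sketch, not part of RNA-SSNV itself: the function name check_sample_info and the demo rows are hypothetical, and treating "Best Normal" as a sample_type value expected once per case is one interpretation of the instructions above.

```python
import csv
import io

REQUIRED_COLUMNS = ["file_name", "file_id", "aliquots_id", "case_id", "sample_type"]
# Assumption: "Best Normal" is accepted as a sample_type value alongside the others.
ALLOWED_TYPES = {"Primary Tumor", "Solid Tissue Normal", "Blood Derived Normal", "Best Normal"}


def check_sample_info(handle):
    """Return a list of human-readable problems found in a tab-separated sample table."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return ["missing columns: " + ", ".join(missing)]
    problems = []
    seen_aliquots = set()
    types_per_case = {}
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        aliquot = row["aliquots_id"]
        if aliquot in seen_aliquots:
            problems.append(f"line {line_no}: duplicate aliquots_id {aliquot}")
        seen_aliquots.add(aliquot)
        if row["sample_type"] not in ALLOWED_TYPES:
            problems.append(f"line {line_no}: unexpected sample_type {row['sample_type']!r}")
        types_per_case.setdefault(row["case_id"], set()).add(row["sample_type"])
    # Assumption: every case needs one row labelled "Best Normal".
    for case_id, types in sorted(types_per_case.items()):
        if "Best Normal" not in types:
            problems.append(f"case {case_id}: no 'Best Normal' sample")
    return problems


# Hypothetical demo table with one duplicated aliquot and one case lacking a normal.
demo = (
    "file_name\tfile_id\taliquots_id\tcase_id\tsample_type\n"
    "t1.bam\tfolder_t1\tA1\tC1\tPrimary Tumor\n"
    "n1.bam\tfolder_n1\tA2\tC1\tBest Normal\n"
    "t2.bam\tfolder_t2\tA2\tC2\tPrimary Tumor\n"
)
problems = check_sample_info(io.StringIO(demo))
for problem in problems:
    print(problem)
```

For a real table, pass an open file handle instead of the in-memory demo string.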

configs/project_config.yaml: general framework-related configurations. Modify it accordingly.

Run Framework

Once configured correctly, the framework is ready to go. Execute the following commands step by step, and make sure each one finishes without problems before moving on.

Call and annotate raw RNA somatic mutations

We assume that the RNA and DNA sequence data available to most users are aligned RNA-seq data (bam format) and co-cleaned, analysis-ready DNA-seq data (bam format), which is the standard pre-processing for TCGA. Once the framework is configured correctly, calling and annotation of raw RNA somatic mutations are conducted automatically.

# do a dry run first to check that the mutation calling pipeline will work as expected
snakemake --cores {num_of_cores} \
-ns rules/RNA-Somatic-tsv-Snakefile.smk \
--configfile configs/project_RNA_Somatic_config.yaml

# run the pipeline
snakemake --cores {num_of_cores} \
-s rules/RNA-Somatic-tsv-Snakefile.smk \
--configfile configs/project_RNA_Somatic_config.yaml

Note: because snakemake can resume an interrupted run, the framework keeps the files of finished steps and automatically deletes corrupted files after an accidental disruption (power failure or unintended termination). Just re-run the command and the framework will continue where it left off.

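If folders remain locked after an accidental disruption, the --unlock and --rerun-incomplete options can be added during a dry run: --unlock removes the stale lock on the working directory, and --rerun-incomplete re-runs jobs whose output files were left incomplete.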

Extract features for raw RNA somatic mutations

All path parameters should be absolute paths to files or folders.

# run feature-extraction codes
python lib/own_data_vcf_info_retriver.py \
--cancer_type {your_specified_cancer_type} \
--RNA_calling_info {your_RNA_calling_info} \
--project_folder {your_project_folder} \
--exon_interval {your_exon_interval} \
--output_table_path {your_specified_feature_table_path} \
--num_threads {num_of_threads}
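
Since the script expects absolute paths, it can help to check the values before building the command. The sketch below is illustrative only: the helper non_absolute and all example paths are hypothetical, and it assumes a POSIX-style file system.

```python
import os


def non_absolute(params):
    """Return the names of parameters whose values are not absolute paths."""
    return sorted(name for name, value in params.items() if not os.path.isabs(value))


# Hypothetical parameter values; one is relative on purpose.
params = {
    "RNA_calling_info": "/data/project/tables/project_RNA_somatic_calling_info.tsv",
    "project_folder": "/data/project/results",
    "exon_interval": "my_exons.bed",
    "output_table_path": "/data/project/features.table",
}

bad = non_absolute(params)
print(bad)

# Relative values can be resolved against the current working directory.
fixed = {name: os.path.abspath(value) for name, value in params.items()}
```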

Predict reliable RNA somatic mutations

In the generated result, records with pred_label equal to 1 should be considered reliable RNA somatic mutations: they were predicted to be positive at the default threshold of 0.5.

# run model predicting codes
python model_utilize.py \
--REDIportal resources/REDIportal_main_table.hg38.bed \
--DARNED resources/DARNED_hg19_to_bed_to_hg38_rm_alt.bed \
--raw_RNA_mutations {your_specified_feature_table_path} \
--model_path model/exon_RNA_analysis_newer.model \
--one_hot_encoder_path model/exon_RNA_analysis_newer.one_hot_encoder \
--training_columns_path model/exon_RNA_analysis_newer.training_data_col \
--output_table_path {your_specified_predicted_table_path}
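
To keep only the reliable calls from the predicted table, the rows can be filtered on pred_label. This is a minimal sketch under stated assumptions: the table is assumed to be tab-separated with a header row, and every column in the demo other than pred_label is hypothetical.

```python
import csv
import io


def reliable_rows(handle, delimiter="\t"):
    """Return the rows whose pred_label equals 1."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return [row for row in reader if float(row["pred_label"]) == 1]


# Hypothetical excerpt of a predicted table.
demo = (
    "Chromosome\tStart_Position\tpred_label\n"
    "chr1\t1543964\t1\n"
    "chr2\t200\t0\n"
    "chr3\t300\t1\n"
)
reliable = reliable_rows(io.StringIO(demo))
for row in reliable:
    print(row["Chromosome"], row["Start_Position"])
```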

Visualize contribution of important features using SHAP library

To inspect the feature contributions for a single prediction, the predicted table path and the row index of the prediction record are required. An svg image showing the feature contributions will be generated.

python lib/result_explainer.py \
--explain_type datarow \
--data_info {your_specified_predicted_table_path} \
--model_path model/exon_RNA_analysis_newer.model \
--one_hot_encoder_path model/exon_RNA_analysis_newer.one_hot_encoder \
--training_columns_path model/exon_RNA_analysis_newer.training_data_col \
--explain_plot_path {your_specified_feature_contribution_svg_image_path} \
--explain_row_index {row_index}

To inspect the feature contributions for multiple predictions, only the predicted table path is required. An svg image showing the feature contributions will be generated.

python lib/result_explainer.py \
--explain_type dataset \
--data_info {your_specified_predicted_table_path} \
--model_path model/exon_RNA_analysis_newer.model \
--one_hot_encoder_path model/exon_RNA_analysis_newer.one_hot_encoder \
--training_columns_path model/exon_RNA_analysis_newer.training_data_col \
--explain_plot_path {your_specified_feature_contribution_svg_image_path}

Pairwise analysis for DNA and RNA somatic mutations (only do it with DNA evidence)

Combining DNA and RNA somatic mutations achieves the best performance for mutational investigation. By incorporating DNA evidence into the RNA somatic mutations, users can easily examine their overlap and validate their existence.

Step 0: Prepare for essential data

python lib/Mutect2_calls_prepare_to_table.py \
--cancer_type {your_cancer_type} \
--project_folder {your_project_folder} \
--RNA_calling_info {your_RNA_calling_info} \
--output_file_path {your_specified_path_for_RNA_mutations_to_table}

Step 1: Generate RNA-omitted DNA mutations to force-call

Use the DNA evidence (mutations) to generate the RNA-omitted DNA mutations (DNA mutations missing from the RNA calls), which are then force-called to retrieve their status within the RNA sequence data.

DNA mutations' required columns (maf format): "Tumor_Sample_UUID", "Chromosome", "Start_Position", "Reference_Allele", "Tumor_Allele1", "Tumor_Allele2"

Demo DNA evidence (header-row required)

Tumor_Sample_UUID	Chromosome	Start_Position	Reference_Allele	Tumor_Allele1	Tumor_Allele2
TCGA-05-4244	chr1	1543964	T	G	T
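
A DNA evidence file can be checked for the required header columns before it is passed to --DNA_info. The sketch below is only an illustration: the helper missing_columns is hypothetical, and the file is assumed to be tab-separated like the demo above.

```python
import csv
import io

REQUIRED = [
    "Tumor_Sample_UUID",
    "Chromosome",
    "Start_Position",
    "Reference_Allele",
    "Tumor_Allele1",
    "Tumor_Allele2",
]


def missing_columns(handle):
    """Return the required columns that are absent from the header row."""
    header = next(csv.reader(handle, delimiter="\t"), [])
    return [column for column in REQUIRED if column not in header]


# The demo DNA evidence shown above.
demo = (
    "Tumor_Sample_UUID\tChromosome\tStart_Position\tReference_Allele\tTumor_Allele1\tTumor_Allele2\n"
    "TCGA-05-4244\tchr1\t1543964\tT\tG\tT\n"
)
complete = missing_columns(io.StringIO(demo))
# A header that lacks the last column, for comparison.
incomplete = missing_columns(io.StringIO("Tumor_Sample_UUID\tChromosome\tStart_Position\tReference_Allele\tTumor_Allele1\n"))
print(complete, incomplete)
```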

python model_analyze_with_DNA.py \
--step 1 \
--cancer_type {your_cancer_type} \
--DNA_info {your_DNA_mutations} \
--RNA_info {your_specified_predicted_table_path} \
--WXS_target_interval resources/whole_exome_agilent_1.1_refseq_plus_3_boosters.targetIntervals_add_chr_to_hg38_rm_alt.bed \
--exon_interval resources/GRCh38_GENCODE_v22_exon_rm_alt.bed \
--RNA_calling_info {your_RNA_calling_info} \
--RNA_bam_folder {your_project_folder}/{your_cancer_type}/RNA/apply_BQSR \
--Mutect2_target_detected_sites {your_specified_path_for_RNA_mutations_to_table} \
--project_folder {your_project_folder} \
--num_threads {num_of_threads} \
--output_file_path {your_specified_temporary_analysis_class_object}

Step 1.1: Force-call all DNA-only mutations and extract features

Modify config file for force-calling process

configs/project_force_call_config.yaml: framework-related configurations for force-calling, modify it accordingly.

Afterwards, run the commands sequentially to force-call with Mutect2 and query the RNA coverage and allele depths of the DNA-only mutations.

# dry run to see if the mutation calling pipeline works
snakemake --cores {num_of_cores} \
-ns rules/RNA-Somatic-tsv-Snakefile-force-call.smk \
--configfile configs/project_force_call_config.yaml

# run formally
snakemake --cores {num_of_cores} \
-s rules/RNA-Somatic-tsv-Snakefile-force-call.smk \
--configfile configs/project_force_call_config.yaml

# run feature extraction codes for force-called mutations' info
python lib/force_call_data_vcf_info_retriver.py \
--cancer_type {your_cancer_type} \
--RNA_calling_info {your_RNA_calling_info} \
--project_folder {your_project_folder} \
--exon_interval resources/GRCh38_GENCODE_v22_exon_rm_alt.bed \
--output_table_path {your_specified_force_called_table_path} \
--num_threads {num_of_threads}

Step 2: Combine RNA force-called results with RNA somatic mutations to finalize RNA-DNA integrative analysis

python model_analyze_with_DNA.py \
--step 2 \
--force_call_RNA_info {your_specified_force_called_table_path} \
--instance_path {your_specified_temporary_analysis_class_object} \
--model_path models/exon_RNA_analysis_newer.model \
--one_hot_encoder_path models/exon_RNA_analysis_newer.one_hot_encoder \
--training_columns_path models/exon_RNA_analysis_newer.training_data_col \
--output_file_path {your_specified_final_table_path}

Step 3: Add DNA coverage info from DNA sequence data to compare DNA-level and RNA-level coverage (requires DNA sequence data)

python lib/result_adder.py \
--result_info {your_specified_final_table_path} \
--output_info {your_specified_final_table_with_DNA_coverage_path} \
--add_type DNA \
--DNA_calling_info {your_DNA_calling_info} \
--DNA_tumor_folder {your_DNA_sequence_data_folder} \
--num_threads {num_of_threads}

Train your own discriminant model

Although we trained our discriminant model on RNA-WES paired data from 511 TCGA LUAD cases, RNA somatic mutations from non-cancerous samples or from non-bulk RNA-Seq data (other sequencing technologies) may exhibit different patterns of false-positive (FP) calls. In that case our model may not perform as expected, and you can train a customized model on your own data.

Data-preparation

Make sure all RNA somatic mutations within your project have been called and annotated using the commands in "Call and annotate raw RNA somatic mutations".
Prepare gold-standard true-positive (TP) mutations for your project (maf format) with five required columns: "Chromosome", "Start_Position", "Tumor_Allele2", "Tumor_Allele1", "Tumor_Sample_UUID"
- Tumor_Allele2: same as the reference allele
- Tumor_Allele1: same as the alternative allele
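
The allele convention above can be made concrete with a small sketch that builds a lookup key from a gold-standard row and labels candidate calls against it. How own_model_construct.py actually matches mutations is not shown in this manual, so the functions mutation_key and label_calls and the candidate rows below are hypothetical.

```python
def mutation_key(row):
    """Build a (sample, chromosome, position, ref, alt) key from a gold-standard row."""
    return (
        row["Tumor_Sample_UUID"],
        row["Chromosome"],
        int(row["Start_Position"]),
        row["Tumor_Allele2"],  # reference allele, per the convention above
        row["Tumor_Allele1"],  # alternative allele, per the convention above
    )


def label_calls(candidate_keys, gold_rows):
    """Return 1 for candidates found in the gold standard and 0 otherwise."""
    gold = {mutation_key(row) for row in gold_rows}
    return [1 if key in gold else 0 for key in candidate_keys]


# Gold-standard row reusing the values of the demo DNA evidence in this manual.
gold_rows = [
    {
        "Tumor_Sample_UUID": "TCGA-05-4244",
        "Chromosome": "chr1",
        "Start_Position": "1543964",
        "Tumor_Allele2": "T",
        "Tumor_Allele1": "G",
    }
]
# Hypothetical candidate RNA calls as (sample, chromosome, position, ref, alt).
candidates = [
    ("TCGA-05-4244", "chr1", 1543964, "T", "G"),
    ("TCGA-05-4244", "chr1", 1543999, "A", "C"),
]
labels = label_calls(candidates, gold_rows)
print(labels)
```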

Train customized model

Use the gold-standard TP mutations together with their corresponding RNA somatic mutations to train a customized model. Performance metrics for model training are reported in the output.

# run feature-extraction codes
python lib/own_data_vcf_info_retriver.py \
--cancer_type {your_cancer_type} \
--RNA_calling_info {your_RNA_calling_info} \
--project_folder {your_project_folder} \
--exon_interval resources/GRCh38_GENCODE_v22_exon_rm_alt.bed \
--output_table_path {your_specified_feature_table_path} \
--num_threads {num_of_threads}

# train your own model
python own_model_construct.py \
--REDIportal resources/REDIportal_main_table.hg38.bed \
--DARNED resources/DARNED_hg19_to_bed_to_hg38_rm_alt.bed \
--raw_RNA_mutations {your_specified_feature_table_path} \
--DNA_mutations {your_DNA_mutations} \
--model_folder_path {your_specified_folder_path_to_store_trained_model}

Utilize customized model

Go back to the beginning of the framework, replace the model paths with the absolute paths of your customized model, and run the framework again.

Output folders & files

The framework outputs several folders containing intermediate files and a final project-level mutation annotation file (standard maf format). The schema of the results/ folder is described in detail below.

Sequencing data pre-process

results/project_name/RNA/marked_duplicates: temporary folder containing MarkDuplicates tool's output.
results/project_name/RNA/splited_n_cigar_reads: temporary folder containing SplitNCigarReads tool's output.
results/project_name/RNA/base_reclibrate: temporary folder containing BaseRecalibrator tool's output.
results/project_name/RNA/apply_BQSR: permanent folder containing ApplyBQSR tool's output, final files (bam format) used to call RNA somatic mutations, applicable for other analysis.

Calling process - called RNA somatic mutation

results/project_name/RNA/RNA_somatic_mutation/Mutect2: permanent folder containing Mutect2 tool's output.
results/project_name/RNA/RNA_somatic_mutation/GetPileupSummaries: permanent folder containing GetPileupSummaries tool's output (best normal sample's pileup summary info).
results/project_name/RNA/RNA_somatic_mutation/FilterMutectCalls: permanent folder containing FilterMutectCalls tool's output, final files (vcf format) used to discriminate true RNA somatic mutations, applicable for other filtering strategy.
results/project_name/RNA/RNA_somatic_mutation/Funcotator/SNP: permanent folder containing Funcotator's annotation info for raw RNA SNP calls.
results/project_name/RNA/RNA_somatic_mutation/SelectVariants/SNP_WES_interval: permanent folder containing raw RNA SNP calls subsetted via given WES target intervals.
results/project_name/RNA/RNA_somatic_mutation/SelectVariants/SNP_WES_interval_exon: permanent folder containing final raw RNA SNP calls subsetted by given WES target intervals and exon regions.
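
After a run, the permanent folders listed above can be checked for existence. This is a minimal sketch rather than a tool shipped with the framework: the helper missing_outputs is hypothetical, and the demo builds a throwaway directory tree instead of reading a real results/ folder.

```python
import tempfile
from pathlib import Path

# Permanent folders from the schema above, relative to results/project_name.
EXPECTED = [
    "RNA/apply_BQSR",
    "RNA/RNA_somatic_mutation/Mutect2",
    "RNA/RNA_somatic_mutation/GetPileupSummaries",
    "RNA/RNA_somatic_mutation/FilterMutectCalls",
    "RNA/RNA_somatic_mutation/Funcotator/SNP",
    "RNA/RNA_somatic_mutation/SelectVariants/SNP_WES_interval",
    "RNA/RNA_somatic_mutation/SelectVariants/SNP_WES_interval_exon",
]


def missing_outputs(project_dir):
    """Return the expected folders that do not exist under project_dir."""
    root = Path(project_dir)
    return [rel for rel in EXPECTED if not (root / rel).is_dir()]


# Demo: a temporary project in which only apply_BQSR has been produced so far.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "results" / "project_name"
    (project / "RNA" / "apply_BQSR").mkdir(parents=True)
    missing = missing_outputs(project)

for rel in missing:
    print("missing:", rel)
```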

Framework explanation

Essential codes

rules/RNA-Somatic-tsv-Snakefile.smk & rules/RNA-Somatic-tsv-Snakefile-force-call.smk: snakemake rules describing the whole RNA somatic mutation calling pipeline (modify at your own risk!).
lib/own_data_vcf_info_retriver.py & lib/force_call_data_vcf_info_retriver.py: Python code to extract features (variant, genotype and annotation level) from different data sources.
model_utilize.py: Python code to predict the probabilities and labels of given Mutect2 calls.

Pre-trained models

models/exon_RNA_analysis_newer.one_hot_encoder: one-hot encoder fitted for the model below.
models/exon_RNA_analysis_newer.model: random forest discriminant model trained on the whole TCGA LUAD project data.
models/exon_RNA_analysis_newer.training_data_col: column names used in model training and prediction.

Resource files

resources/whole_exome_agilent_1.1_refseq_plus_3_boosters.targetIntervals_add_chr_to_hg38_rm_alt.bed: bed-format interval file for paired-normal Whole Exome Sequencing (WES) targets (canonical for TCGA projects).
resources/GRCh38_GENCODE_v22_exon_rm_alt.bed: bed-format interval file for GENCODE v22 exon regions.

Publication

Long, Q., Yuan, Y., Li, M. (2022). RNA-SSNV: A reliable somatic single nucleotide variant identification framework for bulk RNA-Seq data. Frontiers in Genetics.
