Skip to content

A bioinformatics pipeline to identify the best available PDB structures for all available variants for specified genes of interest

Notifications You must be signed in to change notification settings

Utilon/MutaPipe_Repo

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 

Repository files navigation

🧬🧬🧬 M u t a P i p e 🧬🧬🧬

A bioinformatics pipeline to identify high quality mutant and WT PDB structures for genes of interest

+NOTE: We are always working to improve MutaPipe, so any bug reports or suggestions are highly welcome.

Table of Contents

  1. Introduction
  2. Documentation
  3. Citation

Introduction

MutaPipe is a fast and efficient bioinformatics pipeline to screen the Protein Data Bank (PDB) for genes of interest and retrieve the highest quality protein structures (best resolution) for each unique sequence associated with a gene. Additionally, whenever corresponding data is available, variants will be annotated using information on variant pathogenicity from ClinVar.

This allows researchers to efficiently screen the PDB for the most suitable template structures for a specific WT or mutant protein, which can then be used for further in silico analysis such as homology modeling, mutagenesis experiments or molecular dynamics simulations.

Additionally, MutaPipe will download the AlphaFold predicted WT protein structure for each of the input genes (to a separate directory).

alt text

Figure 1. Pipeline overview. MutaPipe takes a list of gene names as input. The pipeline firstly performs a search of the entire Protein Data Bank (PDB) and retrieves information on all available protein structures associated with the input genes (and the species: Homo sapiens). Subsequently, a BLASTp against the canonical protein reference sequence (obtained from UniProt) is performed for each identified PDB sequence. The BLASTp alignment is used to identify mismatches, including single amino variants (SAVs), in the amino acid sequence of the PDB structures relative to the canonical sequence. Next, the data is filtered and rearranged to output a table listing the n highest quality structures for all available sequences in the PDB which are associated with the input genes. Additionally, if a given amino acid change has corresponding variant data in ClinVar, this information will be added to the output table.

Documentation

Workflow

MutaPipe uses the inputted gene names to find all associated PDB IDs (using the PDB API) and downloads all corresponding fasta, mmCif, and pdb files for further anaylses (will require disk space depending on the number and size of the available structures). In a next step, MutaPipe parses all the downloaded files and combines all relevant information (e.g. resolution, sequence, unsolved residues in PDB structures). Subsequently, MutaPipe accessess the canonical protein sequences for all input genes (from the UniProt fasta file) and uses BLASTp (e-value threshold: .05) to identify all mismatches, incl. SAVs, available in the identified PDB structures. BLASTp is performed separately for every sequence in every structure against the UniProt reference sequence for the input gene associated with this structure (e.g. BLASTp of all sequences in a given FUS structure against the canonical protein sequence for FUS). The .xml output files generated by BLASTp are parsed and used to extract information, incl.: e-value, bit, alignment length, mismatches, close mismatches, gaps, indels. This information will be used by MutaPipe to identify all unique sequences in the PDB which are associated with the input genes and subsequently get the best n structures (the ones with the highest resolution) for each of these sequences.

MutaPipe achieves this by sorting the data to get:

  • all available wildtype (WT) structures for the input genes
  • the best n structures per SAV (structures with only this one amino acid change/mutation and no other amino acid changes/mutations)
  • the best n structures per unique sequence/combination of mutations available in the PDB (including WT/no mutation)
  • the best n structures for any specific mutation, regardless of other mutations in the same structure

Finally, whenever corresponding data is available, MutaPipe will annotate variants using variant information from ClinVar, incl. information on variant pathogenicity.

Filtering
To ensure sequences not associated with the gene of interest which are present in a PDB structure (e.g. if a protein of interest is co-crystallised with another peptide) do not get included in the final output, MutaPipe also has an additional step to filter out sequences:

  1. which are shorter than a given percentage of the reference sequence (set variable relative_sequence_length; default value = 0.5)
  2. whose best High Scoring Segment Pair (HSP) covers less than a given percentage of the reference sequence (set variable hsp_coverage; default value = 0.1) (Note: A high-scoring segment pair, or HSP, is a subsegment of a pair of sequences, either nucleotide or amino acid, that share high level of similarity. The level of similarity between the sequences depends on the settings of the local or global alignment algorithm which generated them.)

Incorporated Scripts

There are currently 9 different python scripts incorporated in MutaPipe:

Script Input Operations Output
00 gene names specified using the -g flag
(e.g. -g "SOD1 ALS2 FUS")
- creates a directory called Results in the current
working directory where all further outputs will be stored

- sends a GET request to the PDB API to perform an exact search for
each input gene name and the species Homo sapiens to identify all
available protein structures.
- directory: Results/

- 00_search_overview_PDBids.csv
contains all gene names and corresponding PDB IDs if available

- 00_search_overview_availability.csv
contains a boolean value for each gene indicating whether if PDB data is available or not
01 00_search_overview_PDBids.csv - creates a directory in the Results directory for each input gene with available PDB data to store gene-specific outputs

- uses Biopython (Bio.PDB.PDBList module) to download all available mmCIF and PDB files for each gene into the respective gene folder

- sends a GET request to the PDB API download all available fasta files for each gene into the respective gene folder
- directory per gene: Results/GENENAME_[*n available]structures/

In each respective gene folder:
- all available mmCIF, pdb, and fasta files for the identifies structures
stored in format pdbID.cif/pdb/fasta

In the Results folder:
- 01_search_overview_folders.csv
lists all the the newly created folders and their contents
- 01_search_overview_n_structures.csv
lists number of structures retrieved per gene
02 01_search_overview_folders.csv
- searches for mmCIF files in all folders listed in 01_search_overview_folders.csv

Parses all available mmCIF files, extracting:
- resolution (999 for missing values/NMR structures)
- experimental methods (e.g. X-Ray Crystallography)
- polypeptide sequences (uninterrupted AA sequences in the structures with no unsolved residues)
- sequence information in fasta format (stored in files named [PDB_id]_ex.fasta)
In each respective gene folder:
- _ex.fasta file for every structure
fasta file created from the mmCIF file
- GENENAME_02_resolutions.csv
contains the resolution for all structures associated with this gene
- GENENAME_02_poly_seq.csv
contains all polypeptide sequences for all structures associated with this gene
- GENENAME_02_structure_info.csv
lists all available header information for all structures associated with this gene

In the Results folder:
- 02_all_resolutions.csv
contains the resolutions of all parsed structures for all genes
- 02_all_poly_seq.csv
contains all polypeptide sequences for all structures of all genes
- 02_structure_info.csv
lists all available header information for all structures associated with all genes
03 01_search_overview_folders.csv - searches for PDB files in all folders listed in 01_search_overview_folders.csv

- extracts info on missing residues / residues which have not been solved in the crystal structure
In the Results folder:
- 03_unsolved_residues_per_structure.csv
lists all unsolved residues in all structures for all genes (one row for each structure)
- 03_unsolved_residues_per_chain.csv
lists all unsolved residues in all chains of all structures for all genes (one row for each chain)
04 01_search_overview_folders.csv - searches for fasta and _ex.fasta files in all folders listed in 01_search_overview_folders.csv

- extracts info from fasta files, incl:
- chain name
- description
- species
- sequence

- extracts info from _ex.fasta files, incl:
- chain name
- description
- uniprot id
- sequence

- combines info from the two files
In each respective gene folder:
- GENENAME_04_fasta_info.csv
contains information extracted from all fasta files for this gene
- GENENAME_04_fasta_ex_info.csv
contains information extracted from all _ex.fasta files for this gene
- GENENAME_04_fasta_combined_info.csv
contains combined information extracted from all fasta and _ex.fasta files for this gene

In the Results folder:
- 04_fasta_info.csv
contains information extracted from all fasta files for all genes
- 04_fasta_ex_info.csv
contains information extracted from all _ex.fasta files for all genes
- 04_fasta_combined_info.csv
contains combined information extracted from all fasta and _ex.fasta files for all genes
05 04_fasta_combined_info.csv - creates a directory called RefSeqs in the Results directory
............................................................................................................

- loops over input csv file (one row for each unique sequence in all PDB structures for all genes)

- extracts the reference sequence for each gene from the Uniprot reference fasta into the RefSeqs directory

- writes a fasta file for each unique sequence/chain in the input csv into the RefSeqs directory

- performs BLASTp of each sequence against its reference sequence (e.g. FUS canonical sequence serves as reference for all sequences in all FUS structures) (output stored in .xml format in RefSeqs)

- parses the BLASTp output (xml files) to identify mismatches, gaps etc.
- directory: Results/RefSeqs
- directory: Resutls/RefSeqs/PDB_seqs_and_blastp_outputs

In the RefSeqs folder:
- A fasta file for each input gene containing the reference sequence
in format GENE_reference.fasta

In the RefSeqs/PDB_seqs_and_blastp_outputs folder:
- A fasta file for every unique sequence in the identified PDB structures
in format GENE_pdbID_Chains.fasta
- BLASTp output files (.xml) for all BLASTp runs
in format GENE_pdbID_Chains.xml

In each respective gene folder:- 05_blastp_results.csv
lists all the information in the input file and the corresponding BLASTp results for this gene
- 05_refeseq_warnings.csv
will contain data if there is no or more than one identified reference sequence for this gene (only first one is used for further analyses)
- 05_blastp_warnings.csv
will only contain data if there are warnings regarding BLASTp for this gene, including if BLASTp failed or if there is more than one alignment (should only be one as only one reference is used)

In the Results folder:
- 05_blastp_results.csv
lists all the information in the input file and the corresponding BLASTp results
- 05_refeseq_warnings.csv
lists genes with no or more than one identified reference sequence (only first one is used for further analyses)
- 05_blastp_warnings.csv
lists warnings regarding BLASTp, including if BLASTp failed or if there is more than one alignment (should only be one as only one reference is used)
06_a 00_search_overview_availability.csv - uses Entrez Direct (GET request) to query ClinVar for information on each input gene

- downloads xml files with ids for all variants for each input gene

- parses xml files with variant ids for each gene to construct search urls for 250 ids at a time (to ensure url doesn't exceed maximum lenght)

- uses constructed urls to download variant data (xml format) from ClinVar for all variant ids associated with each gene
- directory: Results/ClinVar_Annotations

In the Results/ClinVar_Annotations folder:
- one xml file for each gene containing all assiociated variant ids in ClinVar
in format 06_a_ClinVar_[gene]_ids.xml

- xml files containing variant information for 250 variants at a time
in format 06_a_ClinVar_[gene]_data_batch_[*i*]_of_[*n* batches].xml

In the Results folder:
- 06_a_ClinVar_Annotations_genes_no_data_retrieved.txt
lists all genes for which no ClinVar annotations were retrieved
06_b (takes no input) - parse the xml batch files containing variant information from ClinVar (created by the previous script 06_a)

- create a dataframe with ClinVar information for all variants for all input genes
In each respective gene folder:
- GENENAME_06_b_ClinVar_Annotations.csv
lists ClinVar annotations for all variants in this input genes

In the Results folder:
- 06_b_ClinVar_Annotations.csv
lists ClinVar annotations for all variants in all input genes
07 02_structure_info.csv

03_unsolved_residues_per_chain.csv

05_blastp_results.csv

06_b_ClinVar_Annotations.csv
- combine the information in the 3 dfs 02_structure_info.csv, 03_unsolved_residues_per_chain.csv, 05_blastp_results.csv (according to PDBid and chain)

- filter out sequences which are shorter than a given percentage of the reference sequence (set variable relative_sequence_length)

- filter out sequences whose best hsp covers less than a given percentage of the reference sequence (set variable hsp_coverage)

-sort/filter the df in order to get:
- n best structures (best resolution) for all single amino acid variants (SAVs) (structures with only this one mutation and no other mutations)
- n best structures (best resolution) for all unique combinations of mutations available in the PDB
- n best structures (best resolution) for any specific mutation, regardless of other mutations in the same structure
- all wildtype structures (defined as HSP covering 99% of reference sequence, 100% similarity, no mismatches)

- add all available ClinVar annotations to all three n_best_structure tables/dfs
In each respective gene folder:
- GENENAME_07_best_structures_per_SAV.csv
lists n best structures for each SAV (one mutation per structure) for this gene (incl. ClinVar annotations)
- GENENAME_07_best_structures_all_unique_combinations.csv
lists n best structures for all unique sequences/mismatch combinations for this gene (incl. ClinVar annotations)
- GENENAME_07_best_structures_any_mutation.csv
lists n best structures for any variant/mismatch in this gene regardless of other mismatches in the same sequence (incl. ClinVar annotations)
- GENENAME_07_wildtype_structures
lists all available WT structures for this gene

In the Results folder:
- 07_best_structures_per_SAV.csv
lists n best structures for each SAV (one mutation per structure) for all genes (incl. ClinVar annotations)
- 07_best_structures_all_unique_combinations.csv
lists n best structures for all unique sequences/mismatch combinations for all genes (incl. ClinVar annotations)
- 07_best_structures_any_mutation.csv
lists n best structures for any variant/mismatch in all genes regardless of other mismatches in the same sequence (incl. ClinVar annotations)
- 07_wildtype_structures
lists all available WT structures for all genes
08 gene names specified using the -g flag
(e.g.-g "SOD1 ALS2 FUS")
- gets the corresponding UniProt ID for each gene name (in Homo Sapiens) via the UniProt API
- creates a directory called AlphaFold_structures
- downloads all AlphaFold2 predicted structures for the identified UniProt IDs
- outputs a csv file called 08_AlphaFold_structures indicating download status for each structure
In the Results folder:
- 08_AlphaFold_structures.csv
lists information on downloaded AlphaFold predicted structures for all input genes

In the Results/AlphaFold_structures folder:
- AlphaFold predicted structures (WT) for all input genes (whenever available in AlphaFold database)

Minimum Requirements

  • stable internet connection
  • python > 3.6.8 (see Table below for required modules, packages and libraries)
    • Biopython
  • NCBI BLAST+
  • RAM: >16GB
  • Space required by the installation: 36 MB
  • Scratch space for usage: depends on the number and size of PDB structures associated with the input genes NOTE: To check how many structures are available for your genes of interest, install MutaPipe and run the command python3 00_search_pdb.py -g GENES_OF_INTEREST from with the MutaPipe directory. This will generate console output as well as csv files both of which provide an overview of how many structures have been identified for your input genes.
  • Reference Proteome (default version included in MutaPipe) NOTE: To run MutaPipe a fasta file containing the reference proteome / canonical protein sequences for the organism in question (= Homo sapiens) is required (see further below for download instructions).

The table below list all requirements (including required Python modules, packages and libraries) for each of the 8 Python scripts in MutaPipe:

Python Script Required Modules / Packages / Libraries
(* = not included in Python)
Other requirements
All scripts argparse
datetime
os
pandas*
sys
Python > 3.6.8

RAM: >16GB
00 requests* Internet access (PDB API)
01 ast
Biopython*
requests*
Internet access (PDB API)
02 Biopython*
03 Biopython*
04 Biopython*
05 Biopython*
numpy*
re
UniProt Reference Proteome (fasta)

NCBI BLAST+
06_a math
requests*
xml.etree.ElementTree
Internet access (for accessing the ClinVar API via Entrez Direct)
06_b xml.etree.ElementTree
07 ast
numpy*
08 requests* Internet access (for UniProt API and to download AF structures from AlphaFold database)

Local Deployment

To obtain MutaPipe please use git to download the most recent development tree:

git clone https://github.com/Utilon/MutaPipe_Repo.git

Usage

MutaPipe can be run in two different ways (from within the MutaPipe directory):

  1. Using the bash script MutaPipe.sh
  2. Using the MutaPipe python scripts individually

Using the bash script

MutaPipe can be run using the bash script MutaPipe.sh. MutaPipe.sh executes all python scripts incorporated in the pipeline one after another. (If instead you wish to run any of the pipeline stages separately, please follow the instructions further below).

All arguments for the MutaPipe.sh script are OPTIONAL. The script has preset default values, i.e. it can be run without setting any of the available flags. To view the default values set in MutaPipe.sh, use the command ./MutaPipe.sh -h.

  Usage: [-h|l|t|g|o|a|f|p|d|b|u|r|c|n|e]
   
  General options:
  -h	HELP				Print this help message and exit
  -l	LOG				Write console output to log file in target directory if set to True. Default = False
  -t	TARGET_DIRECTORY  		Specify target directory where output will be stored. Default = current_working_directory
  
  MutaPipe options:
  -g	GENES				Specify genes of interest. To to pass a file containing all gene names use -g \$(cat filename). Default = ['OPTN', 'ERBB4', 'DCTN1']
  -o	ORGANISM			Set species for which to search pdb structures. Default = Homo sapiens
  -a	ALL_PDB_IDS		    	Specify whether to retrieve all (True) or max. 10 PDB IDs (False) per gene. Default = True
  -f	FORMAT			  	Specify file formats to download. Default = [cif pdb fasta]. Options = [cif pdb fasta]
  -p	POLYPEPTIDES			Specify whether to extract polypeptide sequence (True) or not (False). Default = True
  -d    DELETE_FILES   		     	Specify whether to delete mmCIF, pdb and fasta files after parsing (True) or not (False). Default = True
  -b	BLASTp_PATH			Set path to blastp on your system. Default = blastp
  -u	UNIPROT_REFSEQS			Set path to reference proteome fasta file. Default = MutaPipe_Repo/MutaPipe/Uniprot_reference_seqs/
  -r	RELATIVE_SEQUENCE_LENGTH	Set to filter out sequences shorter than a given percentage of the reference sequence (0.1-1.0). Default = 0.5
  -c	HSP_COVERAGE			Set to filter out sequences whose best hsp covers less than a given percentage of the reference sequence (0.1-1.0). Default = 0.1   
  -n    N_BEST_STRUCTURES  		Set number of best structures to be listed in output for each sequence/variant. Default = 1
  -e    EXCLUDE_UNSOLVED_MISMATCHES     Indicate whether to exclude cases where the mismatch of interest is not solved in the crystal structure (True) or not (False). Default = False
Usage example bash

To run MutaPipe for gene NEK1 and set the paramaters to filter out

  1. sequences shorter than 70% of the reference sequence and
  2. sequences whose best HSP covers less than 30% of the reference sequence, use the command:
./MutaPipe.sh -g NEK1 -r 0.7 -c 0.3

To run MutaPipe with more than one input gene, e.g. for NEK1 and SOD1, and the same parameters as before, use the command:

./MutaPipe.sh -g "NEK1 SOD1" -r 0.7 -c 0.3

To run MutaPipe passing a file containing the genes of interest, e.g. this example file called genes.txt, and the same parameters as before, use the command:

./MutaPipe.sh -g $(cat genes.txt) -r 0.7 -c 0.3

Using the python scripts

MutaPipe can be run manually, by running the 9 python scripts one after another (script 00, 01, 02, 03, 04, 05, 06_a, 06_b, 07, 08). Use the -h flag on any MutaPipe python script to see instructions and available arguments (e.g. use the command: python3 script_name -h).

All arguments for the MutaPipe python scripts are OPTIONAL. The scripts have preset default values, i.e. they can be run without setting any of the available flags. To view the default values set in each MutaPipe python script, use the command python3 script_name -h.

Script dependencies: Although the scripts can be run individually, they mostly require at least some output from previous scripts in the pipeline. The table below lists the script dependencies in more detail:

+Table to be added here

The following options are available for all the MutaPipe python scripts:

-h, --help		show help message and exit
-l, --log		write console output to log file in current directory if set to True, default = False
-t, --target		specify target directory, default = current_working_directory

Moreover, additional arguments can be set during different stages of MutaPipe:

# additional arguments for script 00_search_pdb.py
-g, --genes		Specify genes for which to search pdb structures, default = ['OPTN', 'ERBB4', 'DCTN1']; to pass a file containing all genes use -g $(cat filename)
-o, --organism		Specify species for which to search pdb structures, default = Homo sapiens
-a, --all		Retrieve all (True) vs max. 10 pdb IDs per gene (False), default = True

# additional arguments for script 01_download_files.py
-f, --format		Specify file format to be downloaded. For mmCif files (.cif) use 'cif' ; for pdb files (.pdb) use 'pdb' ; for fasta files (.fasta) use 'fasta' ; default = cif pdb fasta

# additional arguments for script 02_parse_cif_files.py
-pp, --polypeptides	Specify whether to extract polypeptide sequence (True) or not (False), default = True
-del, --delete_files 	Specify whether to delete mmCIF files after parsing (True) or not (False), default = True

# additional arguments for script 03_parse_pdb_files_extract_unsolved_residues.py
-del, --delete_files 	Specify whether to delete mmCIF files after parsing (True) or not (False), default = True

# additional arguments for script 04_parse_fasta_files.py
-del, --delete_files 	Specify whether to delete mmCIF files after parsing (True) or not (False), default = True

# additional arguments for script 05_blast_against_reference.py
-bp, --blastp_path		Specify the path to blastp on your system ; default = blastp       
-refseq, --reference_sequences	Specify path to uniprot reference fasta, default = MutaPipe_Repo/MutaPipe/Uniprot_reference_seqs/UP000005640_9606.fasta    
# additional arguments for script 07_combine_data_to_get_best_n_structures_per_sequence.py
-rsl, --relative_sequence_length filter out sequences shorter than a given percentage of the reference sequence, default = 0.5
-cov, --hsp_coverage		 filter out sequences whose best HSP covers less than a given percentage of the reference sequence, default = 0.1
-n_best, --n_best_structures	 Specify number of top structures per sequence/variant to be included in final output, default = 5
Usage example python

Let's assume we want to run MutaPipe for the genes SOD1, ERBB4, and DCTN1 using the python scripts one after another, so we have more direct control over the options at each stage of the pipeline.

Our genes of interest for this example are stored in a file called genes.txt

Using the first MutaPipe python script 00_search_pdb.py and the -g flag, we can search the PDB for structures associated with our genes of interest. Let's assume we also want to restrict our search to only retrieve max. 10 PDB ID's per input gene. We can do this by setting the -a flag to False.

python3 00_search_pdb.py -g $(cat genes.txt) -a False

The output files generated by 00_search_pdb.py reveal that there are PDB structures for all our genes of interest. More specifically, our search retrieved:

  1. 10 PDB IDs for SOD1 (setting -a to True results in 127 PDB ID's)
  2. 10 PDB IDs for ERBB4
  3. 9 PDB IDs for DCTN1

Next, we use the scipt 01_download_files.py to download all relevant mmCif, pdb, and fasta files (for the PDB IDs identified in the previous step). While we could use the -fflag here to specify the desired file format to be downloaded, we don't set the flag as the default option (set to download mmCif, pdb, and fasta files) is required to run the full MutaPipe pipeline. Changing the file format to be downloaded is not recommended unless the rest one doesn't intend to run the rest of the pipeline - Remember that a complete MutaPipe run always requires downloading all three file formats (mmCif, pdb, fasta).

python3 01_download_files.py

Looking at our Results directory, we now see a new subfolder for each of our input genes (SOD1, ERBB4, and DCTN1) which contains all associated files retrieved from the PDB for this gene (mmCif, pdb, fasta).

Now we are going to parse the downloaded mmCif files using the script 02_parse_cif_files.py. We can set the flag -pp to True or False to indicate whether we want to also extract the polypeptide sequences from the mmCif files or not. The polypeptide sequences are continuous chains of amino acids without any missing residues in the crystal structure (comparable to how the sequence is shown in PyMOL). It is not required to extract the polypeptide sequences for MutaPipe to run, so let's set the -ppflag to False:

python3 02_parse_cif_files.py -pp False

The output files generated by 02_parse_cif_files.py list important information on all the structures, including structure name, resolution, structure method, deposition date, and classification.

In a next step, we use script 03_parse_pdb_files_extract_unsolved_residues.py to extract all unsolved residues in each crystall structure from the previously downloaded pdb files.

python3 03_parse_pdb_files_extract_unsolved_residues.py

The output files generated by python3 03_parse_pdb_files_extract_unsolved_residues.py provide an overview of all unsolved residues per structure and per chain. Unresolved/missing residues are amino acids which are absent in the crystall structure (no atomic coordinates available).

Next, we are going to use the script 04_parse_fasta_files.py to parse the downloaded fasta files as well as the fasta files created from the mmCif files (all in format _ex.fasta).

python3 04_parse_fasta_files.py

The output files generated by python3 04_parse_fasta_files.py list important information on all the sequences/chains in each fasta file, including chain name, UniProt ID, description, species, and amino acid sequence.

Now that all the files have been parsed, we can use the script 05_blast_against_reference.py to perform BLASTp. The script will perform a BLASTp for every sequence in every structure associated with our input genes against the corresponding reference sequence. By default, the reference sequence for each of our input genes will be retrieved from the UniProt fasta file containing the human reference proteome called UP000005640_9606.fasta which is included in MutaPipe (version December 2021; see instructions on how to download the latest version of this file here). If we wanted to use a different file, we could set the -refseq flag and specify the path to the fasta file containing the reference proteome we'd like to use.

python3 05_blast_against_reference.py

Looking at our Results directory, we can now see that a new subdirectory called RefSeqs has been created. It contains the canonical amino acid sequences for all of our input genes (SOD1, ERBB4, and DCTN1) as well as a subdirectory called PDB_seqs_and_blastp_outputs where all the PDB sequences used as query sequences in the BLASTp are stored in fasta format, and their corresponding blastp output files are stored in xml format.

Moreover, the output files generated by script 05_blast_against_reference.py nicely summarise all BLASTp results. Note that there are two files containing potential warnings: One in case no or more than one reference sequence is identified for an input gene and one in case BLASTp fails or does not produce any significant alignements. In our case 3 warnings for the gene DCTN1 have been produced, namely:

  1. No alignments for DCTN1_2hl3_ChainC.fasta
  2. No alignments for DCTN1_2hqh_ChainsE,F,G,H.fasta
  3. No alignments for DCTN1_3e2u_ChainsE,F,G,H.fasta To investigate potential reasons for why there were no alignments for these sequences, we can have a look at these sequences in the output file called 05_blastp_results.csv. In our case, we can infer that the warnings have been generated because the sequences in question are actually not DCTN1 sequences, but are other protein sequences which happen to be also present in/co-crystallised with a PDB structure associated with our input genes. The sequences which produced warnings in our case are not DCTN1 sequences, but are instead (1) Microtubule-associated protein RP/EB family member 1, (2) Restin, and (3) CAP-Gly domain-containing linker protein 1.

The next two steps consist of downloading and parsing ClinVar annotations for all variants of our genes of interest. This information is used to annotate all variants available in the identified PDB sequences for the genes of interest. The script 06_a_download_ClinVar_data.py incorporates Entrez Direct to access ClinVar and download data on all variant ids associated with our input genes. Subsequently, script 06_b_parse_ClinVar_data.py will parse the data downloaded from ClinVar.

python3 06_a_download_ClinVar_data.py
python3 06_b_parse_ClinVar_data.py

Script 06_a_download_ClinVar_data.py produces a file called 06_a_ClinVar_Annotations_genes_no_data_retrieved.txt which contains a list of all genes for which no data has been retrieved from ClinVar. Script 06_b_parse_ClinVar_data.py generates a table called 06_b_ClinVar_Annotations.csv with information on all variants in ClinVar associated with our input genes.

Finally, the last script 07_combine_data_to_get_best_n_structures_per_sequence.py will combine the data from ClinVar, the PDB and the BLASTp results, and subsequently, filter, rearrange and order it to get the n best structures for each variant/sequence.

Here we can filter out:

  1. sequences which are considered too short (compared to the reference sequence) to be valuable template structures. We can use the relative sequence length flag -rsl to set this value (0.1-1.0).
  2. sequences whose best High Scoring Segment Pair (HSP) covers less than a given percentage of the reference sequence. We can use the hsp coverage flag -cov to set this value (0.1-1.0). (Note: A high-scoring segment pair, or HSP, is a subsegment of a pair of sequences, either nucleotide or amino acid, that share high level of similarity. The level of similarity between the sequences depends on the settings of the local or global alignment algorithm which generated them.)

For our example we'll stick to the default settings which are -rsl 0.5 and -cov 0.1, so we don't have to set the flags.

python3 07_combine_data_to_get_best_n_structures_per_sequence.py

Script 07_combine_data_to_get_best_n_structures_per_sequence.py produces a file called 07_wildtype_structures.csv which contains all available WT structures associated with the input genes (top row = highest quality structure). The other 3 csv output files generated by script 07_combine_data_to_get_best_n_structures_per_sequence.py neatly list the n best PDB structures for each sequence/variant available in the PDB which is associated with our input genes:

  • The best structures for all SAVs are contained in 07_best_structures_per_SAV.csv
  • The best structures for all unique sequences associated with the input genes are contained in 07_best_structures_all_unique_combinations.csv
  • The best structures for any available variant/amino acid change in the PDB regardless of other variants/AA changes in the same structure are contained in 07_best_structures_any_mutation.csv

Output

MutaPipe creates a directory called Results where all intermediary and final main output files are stored (Note: the -t flag lets you specify where you'd like this folder to be created, default = current working directory). All relevant files and data retrieved or extracted from the reference proteome fasta file, the ClinVar database, and the PDB are stored in additional subdirectories in the Results directory called ClinVar_Annotations, RefSeqs, and [GENENAME]_[n]structures (see Figure 2 below for an example).

Each of the eigth python scripts incorporated in MutaPipe produces at least one main output file which will be stored directly in the Results directory. NOTE: All MutaPipe output filenames start with the numerical identifier of the corresponding MutaPipe script, making it easy to understand which outputs have been generated at which stage of the pipeline.

After a complete MutaPipe run with the default settings the following main output files are stored in the Results directory:

- 00_search_overview_availability.csv 			* lists all input genes with a boolean value indicating whether corresponding PDB data exists
- 00_search_overview_PDBids.csv 			* lists all input genes with available PDB data and their corresponding PDB IDs
- 01_search_overview_folders.csv 			* lists all newly created directories (on for each input gene with available PDB data) and their content
- 01_search_overview_n_structures.csv 			* lists number of structures retrieved per gene (simple overview)
- 02_all_poly_seq.csv 					* lists polypeptide sequences for all PDB structures associated with the input genes
- 02_all_resolutions.csv 				* lists resolutions for all PDB structures associated with the input genes
- 03_unsolved_residues_per_chain.csv 			* lists all unsolved residues in all chains of all PDB structures for all genes (one row for each chain)
- 03_unsolved_residues_per_structure.csv 		* lists all unsolved residues in all PDB structures for all genes (one row for each structure)
- 04_fasta_combined_info.csv 				* lists combined information extracted from all fasta and fasta_ex files for all available sequences in the PDB associated with the input genes
- 04_fasta_info.csv 					* lists information extracted from all fasta files (downloaded from PDB) for all available sequences in the PDB associated with the input genes
- 04_fasta_ex_info.csv 					* lists information extracted from all _ex.fasta files (extracted from mmCif files) for all available sequences in the PDB associated with the input genes
- 05_blastp_results.csv 				* lists all the information in 04_fasta_combined_info.csv and the corresponding blastp results for every sequence
- 05_blastp_warnings.csv 				* lists blastp warnings (if blastp failed or if there is more than one alignment)
- 05_refeseq_warnings.csv 				* lists genes with no or more than one identified reference sequence (only first one is used for further analyses)
- 06_a_ClinVar_Annotations_genes_no_data_retrieved.txt 	* lists all genes for which no ClinVar annotations could be retrieved
- 06_b_ClinVar_Annotations.csv 				* lists all variants availalbe in ClinVar and their annotations for the input genes
- 07_best_structures_all_unique_combinations.csv 	* lists best structure for all unique sequences/mismatch combinations for all input genes, incl. ClinVar annotations
- 07_best_structures_any_mutation.csv 			* lists best structure for any mismatch regardless of other mismatches in the same structure for all input genes, incl. ClinVar annotations
- 07_best_structures_per_SAV.csv 			* lists best structure for each SAV (one mutation per structure) for all input genes, incl. ClinVar annotations
- 07_wildtype_structures.csv 				* lists all available WT structures for all input genes
+NOTE: A section explaining all columns in main output files will be added shortly!

To avoid cluttering the directory containing the main output files, MutaPipe creates subfolders in the Results directory where it stores any additional files and relevant data retrieved or extracted from the reference proteome fasta files, the ClinVar database, and the PDB.

Note: For each input gene with available PDB data, MutaPipe creates a subdirectory containing (1) data from the PDB retrieved via the PDB API (i.e. the corresponding mmCif, fasta, and pdb files) as well as (2) gene-specific output files generated at different steps in the pipeline.

MutaPipe creates the following subdirectories in Results:

- ClinVar_Annotations 			* subdirectory where all ClinVar data is stored
- RefSeqs 				* subdirectory where all the reference sequences and other files used/generated by BLASTp are stored
- [GENENAME]_[n]structures 		* subdirectory for each input gene where all corresponding data from the PDB as well as gene-specific output files are stored

Ouput example

Let's assume we want to get the best PDB structures for all availbale NEK1-related sequences in the PDB. Apart from ensuring all the mmCIF, PDB, and fasta files get deleted again after parsing to save space by setting the the -d option, we stick to the default settings and thus use the following command to obtain results:

./MutaPipe.sh -g NEK1 -d True

Consequently, we see that a new folder called Results has been created in our current working directory. It includes all the main output files as well as three additional subdirectories called ClinVar_Annotations, RefSeqs, and NEK1_2structures.

alt text

Figure 2. Output generated by MutaPipe a) looking at the content in Results b) looking at the content in subdirectories ClinVar_Annotations, RefSeqs, and NEK1_2structures. Note: MutaPipe's main output files and gene-specific output files are highlighted in yellow in 3a. and 3b., respectively. If we didn't set the -d option to delete the files after parsing, the NEK1_2structures folder would also contain all NEK1 mmCIF, PDB and fasta files.

How to Download the Reference Proteome

MutaPipe will need a fasta file with all canonical protein sequences of the species of interest.

NOTE: When installing MutaPipe (see Local Deployment) a fasta file of the reference proteome for Homo sapiens will be included (version from December 2021). Please feel free to update to a newer version if one is available (we will also regularly update the reference proteome file on here, but this is currently not done automatically).

In order to run MutaPipe a fasta file containing the reference proteome / canonical protein sequences for the organism in question (= Homo sapiens) will have to be downloaded from UniProt and stored in a folder called Uniprot_reference_seqs in the MutaPipe directory (where the python scripts are stored). The filename should be UP000005640_9606.fasta for the script to work properly. The latest version of this fasta file can be accessed via the UniProt ftp client.

Citation

MutaPipe has been developed by Deborah Ness, an NIHR Maudsley Biomedical Research Centre PhD Student, in collaboration with and supervised by Dr Alfredo Iacoangeli, for her PhD project at the Department of Basic and Clinical Neuroscience at King's College London.

This work has not yet been published (April 2022).

About

A bioinformatics pipeline to identify the best available PDB structures for all available variants for specified genes of interest

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published