A bioinformatics pipeline to identify high quality mutant and WT PDB structures for genes of interest
+NOTE: We are always working to improve MutaPipe, so any bug reports or suggestions are highly welcome.
MutaPipe is a fast and efficient bioinformatics pipeline to screen the Protein Data Bank (PDB) for genes of interest and retrieve the highest quality protein structures (best resolution) for each unique sequence associated with a gene. Additionally, whenever corresponding data is available, variants will be annotated using information on variant pathogenicity from ClinVar.
This allows researchers to efficiently screen the PDB for the most suitable template structures for a specific WT or mutant protein, which can then be used for further in silico analysis such as homology modeling, mutagenesis experiments or molecular dynamics simulations.
Additionally, MutaPipe will download the AlphaFold predicted WT protein structure for each of the input genes (to a separate directory).
Figure 1. Pipeline overview. MutaPipe takes a list of gene names as input. The pipeline firstly performs a search of the entire Protein Data Bank (PDB) and retrieves information on all available protein structures associated with the input genes (and the species: Homo sapiens). Subsequently, a BLASTp against the canonical protein reference sequence (obtained from UniProt) is performed for each identified PDB sequence. The BLASTp alignment is used to identify mismatches, including single amino variants (SAVs), in the amino acid sequence of the PDB structures relative to the canonical sequence. Next, the data is filtered and rearranged to output a table listing the n highest quality structures for all available sequences in the PDB which are associated with the input genes. Additionally, if a given amino acid change has corresponding variant data in ClinVar, this information will be added to the output table.
MutaPipe uses the inputted gene names to find all associated PDB IDs (using the PDB API) and downloads all corresponding fasta
, mmCif
, and pdb
files for further anaylses (will require disk space depending on the number and size of the available structures).
In a next step, MutaPipe parses all the downloaded files and combines all relevant information (e.g. resolution, sequence, unsolved residues in PDB structures). Subsequently, MutaPipe accessess the canonical protein sequences for all input genes (from the UniProt fasta file) and uses BLASTp (e-value threshold: .05) to identify all mismatches, incl. SAVs, available in the identified PDB structures. BLASTp is performed separately for every sequence in every structure against the UniProt reference sequence for the input gene associated with this structure (e.g. BLASTp of all sequences in a given FUS structure against the canonical protein sequence for FUS). The .xml
output files generated by BLASTp are parsed and used to extract information, incl.: e-value, bit, alignment length, mismatches, close mismatches, gaps, indels. This information will be used by MutaPipe to identify all unique sequences in the PDB which are associated with the input genes and subsequently get the best n structures (the ones with the highest resolution) for each of these sequences.
MutaPipe achieves this by sorting the data to get:
- all available wildtype (WT) structures for the input genes
- the best n structures per SAV (structures with only this one amino acid change/mutation and no other amino acid changes/mutations)
- the best n structures per unique sequence/combination of mutations available in the PDB (including WT/no mutation)
- the best n structures for any specific mutation, regardless of other mutations in the same structure
Finally, whenever corresponding data is available, MutaPipe will annotate variants using variant information from ClinVar, incl. information on variant pathogenicity.
Filtering
To ensure sequences not associated with the gene of interest which are present in a PDB structure (e.g. if a protein of interest is co-crystallised with another peptide) do not get included in the final output, MutaPipe also has an additional step to filter out sequences:
- which are shorter than a given percentage of the reference sequence (set variable relative_sequence_length; default value = 0.5)
- whose best High Scoring Segment Pair (HSP) covers less than a given percentage of the reference sequence (set variable hsp_coverage; default value = 0.1) (Note: A high-scoring segment pair, or HSP, is a subsegment of a pair of sequences, either nucleotide or amino acid, that share high level of similarity. The level of similarity between the sequences depends on the settings of the local or global alignment algorithm which generated them.)
There are currently 9 different python scripts incorporated in MutaPipe:
Script | Input | Operations | Output |
---|---|---|---|
00 | gene names specified using the -g flag(e.g. -g "SOD1 ALS2 FUS" ) |
- creates a directory called Results in the current working directory where all further outputs will be stored - sends a GET request to the PDB API to perform an exact search for each input gene name and the species Homo sapiens to identify all available protein structures. |
- directory: Results/ - 00_search_overview_PDBids.csv contains all gene names and corresponding PDB IDs if available - 00_search_overview_availability.csv contains a boolean value for each gene indicating whether if PDB data is available or not |
01 | 00_search_overview_PDBids.csv |
- creates a directory in the Results directory for each input gene with available PDB data to store gene-specific outputs - uses Biopython (Bio.PDB.PDBList module) to download all available mmCIF and PDB files for each gene into the respective gene folder - sends a GET request to the PDB API download all available fasta files for each gene into the respective gene folder |
- directory per gene: Results/GENENAME_[*n available]structures/ In each respective gene folder: - all available mmCIF, pdb, and fasta files for the identifies structures stored in format pdbID.cif/pdb/fasta In the Results folder: - 01_search_overview_folders.csv lists all the the newly created folders and their contents - 01_search_overview_n_structures.csv lists number of structures retrieved per gene |
02 | 01_search_overview_folders.csv |
- searches for mmCIF files in all folders listed in 01_search_overview_folders.csv Parses all available mmCIF files, extracting: - resolution (999 for missing values/NMR structures) - experimental methods (e.g. X-Ray Crystallography) - polypeptide sequences (uninterrupted AA sequences in the structures with no unsolved residues) - sequence information in fasta format (stored in files named [PDB_id]_ex.fasta ) |
In each respective gene folder: - _ex.fasta file for every structure fasta file created from the mmCIF file - GENENAME_02_resolutions.csv contains the resolution for all structures associated with this gene - GENENAME_02_poly_seq.csv contains all polypeptide sequences for all structures associated with this gene - GENENAME_02_structure_info.csv lists all available header information for all structures associated with this gene In the Results folder: - 02_all_resolutions.csv contains the resolutions of all parsed structures for all genes - 02_all_poly_seq.csv contains all polypeptide sequences for all structures of all genes - 02_structure_info.csv lists all available header information for all structures associated with all genes |
03 | 01_search_overview_folders.csv |
- searches for PDB files in all folders listed in 01_search_overview_folders.csv - extracts info on missing residues / residues which have not been solved in the crystal structure |
In the Results folder: - 03_unsolved_residues_per_structure.csv lists all unsolved residues in all structures for all genes (one row for each structure) - 03_unsolved_residues_per_chain.csv lists all unsolved residues in all chains of all structures for all genes (one row for each chain) |
04 | 01_search_overview_folders.csv |
- searches for fasta and _ex.fasta files in all folders listed in 01_search_overview_folders.csv - extracts info from fasta files, incl: - chain name - description - species - sequence - extracts info from _ex.fasta files, incl: - chain name - description - uniprot id - sequence - combines info from the two files |
In each respective gene folder: - GENENAME_04_fasta_info.csv contains information extracted from all fasta files for this gene - GENENAME_04_fasta_ex_info.csv contains information extracted from all _ex.fasta files for this gene - GENENAME_04_fasta_combined_info.csv contains combined information extracted from all fasta and _ex.fasta files for this gene In the Results folder: - 04_fasta_info.csv contains information extracted from all fasta files for all genes - 04_fasta_ex_info.csv contains information extracted from all _ex.fasta files for all genes - 04_fasta_combined_info.csv contains combined information extracted from all fasta and _ex.fasta files for all genes |
05 | 04_fasta_combined_info.csv |
- creates a directory called RefSeqs in the Results directory ............................................................................................................ - loops over input csv file (one row for each unique sequence in all PDB structures for all genes) - extracts the reference sequence for each gene from the Uniprot reference fasta into the RefSeqs directory - writes a fasta file for each unique sequence/chain in the input csv into the RefSeqs directory - performs BLASTp of each sequence against its reference sequence (e.g. FUS canonical sequence serves as reference for all sequences in all FUS structures) (output stored in .xml format in RefSeqs) - parses the BLASTp output (xml files) to identify mismatches, gaps etc. |
- directory: Results/RefSeqs - directory: Resutls/RefSeqs/PDB_seqs_and_blastp_outputs In the RefSeqs folder: - A fasta file for each input gene containing the reference sequence in format GENE_reference.fasta In the RefSeqs/PDB_seqs_and_blastp_outputs folder: - A fasta file for every unique sequence in the identified PDB structures in format GENE_pdbID_Chains.fasta - BLASTp output files (.xml) for all BLASTp runs in format GENE_pdbID_Chains.xml In each respective gene folder:- 05_blastp_results.csv lists all the information in the input file and the corresponding BLASTp results for this gene - 05_refeseq_warnings.csv will contain data if there is no or more than one identified reference sequence for this gene (only first one is used for further analyses) - 05_blastp_warnings.csv will only contain data if there are warnings regarding BLASTp for this gene, including if BLASTp failed or if there is more than one alignment (should only be one as only one reference is used) In the Results folder: - 05_blastp_results.csv lists all the information in the input file and the corresponding BLASTp results - 05_refeseq_warnings.csv lists genes with no or more than one identified reference sequence (only first one is used for further analyses) - 05_blastp_warnings.csv lists warnings regarding BLASTp, including if BLASTp failed or if there is more than one alignment (should only be one as only one reference is used) |
06_a | 00_search_overview_availability.csv |
- uses Entrez Direct (GET request) to query ClinVar for information on each input gene - downloads xml files with ids for all variants for each input gene - parses xml files with variant ids for each gene to construct search urls for 250 ids at a time (to ensure url doesn't exceed maximum lenght) - uses constructed urls to download variant data (xml format) from ClinVar for all variant ids associated with each gene |
- directory: Results/ClinVar_Annotations In the Results/ClinVar_Annotations folder: - one xml file for each gene containing all assiociated variant ids in ClinVar in format 06_a_ClinVar_[gene]_ids.xml - xml files containing variant information for 250 variants at a time in format 06_a_ClinVar_[gene]_data_batch_[*i*]_of_[*n* batches].xml In the Results folder: - 06_a_ClinVar_Annotations_genes_no_data_retrieved.txt lists all genes for which no ClinVar annotations were retrieved |
06_b | (takes no input) | - parse the xml batch files containing variant information from ClinVar (created by the previous script 06_a) - create a dataframe with ClinVar information for all variants for all input genes |
In each respective gene folder: - GENENAME_06_b_ClinVar_Annotations.csv lists ClinVar annotations for all variants in this input genes In the Results folder: - 06_b_ClinVar_Annotations.csv lists ClinVar annotations for all variants in all input genes |
07 | 02_structure_info.csv 03_unsolved_residues_per_chain.csv 05_blastp_results.csv 06_b_ClinVar_Annotations.csv |
- combine the information in the 3 dfs 02_structure_info.csv , 03_unsolved_residues_per_chain.csv , 05_blastp_results.csv (according to PDBid and chain)- filter out sequences which are shorter than a given percentage of the reference sequence (set variable relative_sequence_length )- filter out sequences whose best hsp covers less than a given percentage of the reference sequence (set variable hsp_coverage )-sort/filter the df in order to get: - n best structures (best resolution) for all single amino acid variants (SAVs) (structures with only this one mutation and no other mutations) - n best structures (best resolution) for all unique combinations of mutations available in the PDB - n best structures (best resolution) for any specific mutation, regardless of other mutations in the same structure - all wildtype structures (defined as HSP covering 99% of reference sequence, 100% similarity, no mismatches) - add all available ClinVar annotations to all three n_best_structure tables/dfs |
In each respective gene folder: - GENENAME_07_best_structures_per_SAV.csv lists n best structures for each SAV (one mutation per structure) for this gene (incl. ClinVar annotations) - GENENAME_07_best_structures_all_unique_combinations.csv lists n best structures for all unique sequences/mismatch combinations for this gene (incl. ClinVar annotations) - GENENAME_07_best_structures_any_mutation.csv lists n best structures for any variant/mismatch in this gene regardless of other mismatches in the same sequence (incl. ClinVar annotations) - GENENAME_07_wildtype_structures lists all available WT structures for this gene In the Results folder: - 07_best_structures_per_SAV.csv lists n best structures for each SAV (one mutation per structure) for all genes (incl. ClinVar annotations) - 07_best_structures_all_unique_combinations.csv lists n best structures for all unique sequences/mismatch combinations for all genes (incl. ClinVar annotations) - 07_best_structures_any_mutation.csv lists n best structures for any variant/mismatch in all genes regardless of other mismatches in the same sequence (incl. ClinVar annotations) - 07_wildtype_structures lists all available WT structures for all genes |
08 | gene names specified using the -g flag(e.g. -g "SOD1 ALS2 FUS" ) |
- gets the corresponding UniProt ID for each gene name (in Homo Sapiens) via the UniProt API - creates a directory called AlphaFold_structures - downloads all AlphaFold2 predicted structures for the identified UniProt IDs - outputs a csv file called 08_AlphaFold_structures indicating download status for each structure |
In the Results folder: - 08_AlphaFold_structures.csv lists information on downloaded AlphaFold predicted structures for all input genes In the Results/AlphaFold_structures folder: - AlphaFold predicted structures (WT) for all input genes (whenever available in AlphaFold database) |
- stable internet connection
- python > 3.6.8 (see Table below for required modules, packages and libraries)
- Biopython
- NCBI BLAST+
- RAM: >16GB
- Space required by the installation: 36 MB
- Scratch space for usage: depends on the number and size of PDB structures associated with the input genes
NOTE: To check how many structures are available for your genes of interest, install MutaPipe and run the command
python3 00_search_pdb.py -g GENES_OF_INTEREST
from with the MutaPipe directory. This will generate console output as well as csv files both of which provide an overview of how many structures have been identified for your input genes. - Reference Proteome (default version included in MutaPipe) NOTE: To run MutaPipe a fasta file containing the reference proteome / canonical protein sequences for the organism in question (= Homo sapiens) is required (see further below for download instructions).
The table below list all requirements (including required Python modules, packages and libraries) for each of the 8 Python scripts in MutaPipe:
Python Script | Required Modules / Packages / Libraries (* = not included in Python) |
Other requirements |
---|---|---|
All scripts | argparse datetime os pandas* sys |
Python > 3.6.8 RAM: >16GB |
00 | requests* |
Internet access (PDB API) |
01 | ast Biopython* requests* |
Internet access (PDB API) |
02 | Biopython* |
|
03 | Biopython* |
|
04 | Biopython* |
|
05 | Biopython* numpy* re |
UniProt Reference Proteome (fasta) NCBI BLAST+ |
06_a | math requests* xml.etree.ElementTree |
Internet access (for accessing the ClinVar API via Entrez Direct) |
06_b | xml.etree.ElementTree |
|
07 | ast numpy* |
|
08 | requests* |
Internet access (for UniProt API and to download AF structures from AlphaFold database) |
To obtain MutaPipe please use git to download the most recent development tree:
git clone https://github.com/Utilon/MutaPipe_Repo.git
MutaPipe can be run in two different ways (from within the MutaPipe
directory):
- Using the bash script
MutaPipe.sh
- Using the MutaPipe python scripts individually
MutaPipe can be run using the bash script MutaPipe.sh
. MutaPipe.sh executes all python scripts incorporated in the pipeline one after another. (If instead you wish to run any of the pipeline stages separately, please follow the instructions further below).
All arguments for the MutaPipe.sh script are OPTIONAL. The script has preset default values, i.e. it can be run without setting any of the available flags. To view the default values set in MutaPipe.sh, use the command ./MutaPipe.sh -h
.
Usage: [-h|l|t|g|o|a|f|p|d|b|u|r|c|n|e]
General options:
-h HELP Print this help message and exit
-l LOG Write console output to log file in target directory if set to True. Default = False
-t TARGET_DIRECTORY Specify target directory where output will be stored. Default = current_working_directory
MutaPipe options:
-g GENES Specify genes of interest. To to pass a file containing all gene names use -g \$(cat filename). Default = ['OPTN', 'ERBB4', 'DCTN1']
-o ORGANISM Set species for which to search pdb structures. Default = Homo sapiens
-a ALL_PDB_IDS Specify whether to retrieve all (True) or max. 10 PDB IDs (False) per gene. Default = True
-f FORMAT Specify file formats to download. Default = [cif pdb fasta]. Options = [cif pdb fasta]
-p POLYPEPTIDES Specify whether to extract polypeptide sequence (True) or not (False). Default = True
-d DELETE_FILES Specify whether to delete mmCIF, pdb and fasta files after parsing (True) or not (False). Default = True
-b BLASTp_PATH Set path to blastp on your system. Default = blastp
-u UNIPROT_REFSEQS Set path to reference proteome fasta file. Default = MutaPipe_Repo/MutaPipe/Uniprot_reference_seqs/
-r RELATIVE_SEQUENCE_LENGTH Set to filter out sequences shorter than a given percentage of the reference sequence (0.1-1.0). Default = 0.5
-c HSP_COVERAGE Set to filter out sequences whose best hsp covers less than a given percentage of the reference sequence (0.1-1.0). Default = 0.1
-n N_BEST_STRUCTURES Set number of best structures to be listed in output for each sequence/variant. Default = 1
-e EXCLUDE_UNSOLVED_MISMATCHES Indicate whether to exclude cases where the mismatch of interest is not solved in the crystal structure (True) or not (False). Default = False
To run MutaPipe for gene NEK1 and set the paramaters to filter out
- sequences shorter than 70% of the reference sequence and
- sequences whose best HSP covers less than 30% of the reference sequence, use the command:
./MutaPipe.sh -g NEK1 -r 0.7 -c 0.3
To run MutaPipe with more than one input gene, e.g. for NEK1 and SOD1, and the same parameters as before, use the command:
./MutaPipe.sh -g "NEK1 SOD1" -r 0.7 -c 0.3
To run MutaPipe passing a file containing the genes of interest, e.g. this example file called genes.txt, and the same parameters as before, use the command:
./MutaPipe.sh -g $(cat genes.txt) -r 0.7 -c 0.3
MutaPipe can be run manually, by running the 9 python scripts one after another (script 00
, 01
, 02
, 03
, 04
, 05
, 06_a
, 06_b
, 07
, 08
). Use the -h
flag on any MutaPipe python script to see instructions and available arguments (e.g. use the command: python3 script_name -h
).
All arguments for the MutaPipe python scripts are OPTIONAL. The scripts have preset default values, i.e. they can be run without setting any of the available flags. To view the default values set in each MutaPipe python script, use the command python3 script_name -h
.
Script dependencies: Although the scripts can be run individually, they mostly require at least some output from previous scripts in the pipeline. The table below lists the script dependencies in more detail:
+Table to be added here
The following options are available for all the MutaPipe python scripts:
-h, --help show help message and exit
-l, --log write console output to log file in current directory if set to True, default = False
-t, --target specify target directory, default = current_working_directory
Moreover, additional arguments can be set during different stages of MutaPipe:
# additional arguments for script 00_search_pdb.py
-g, --genes Specify genes for which to search pdb structures, default = ['OPTN', 'ERBB4', 'DCTN1']; to pass a file containing all genes use -g $(cat filename)
-o, --organism Specify species for which to search pdb structures, default = Homo sapiens
-a, --all Retrieve all (True) vs max. 10 pdb IDs per gene (False), default = True
# additional arguments for script 01_download_files.py
-f, --format Specify file format to be downloaded. For mmCif files (.cif) use 'cif' ; for pdb files (.pdb) use 'pdb' ; for fasta files (.fasta) use 'fasta' ; default = cif pdb fasta
# additional arguments for script 02_parse_cif_files.py
-pp, --polypeptides Specify whether to extract polypeptide sequence (True) or not (False), default = True
-del, --delete_files Specify whether to delete mmCIF files after parsing (True) or not (False), default = True
# additional arguments for script 03_parse_pdb_files_extract_unsolved_residues.py
-del, --delete_files Specify whether to delete mmCIF files after parsing (True) or not (False), default = True
# additional arguments for script 04_parse_fasta_files.py
-del, --delete_files Specify whether to delete mmCIF files after parsing (True) or not (False), default = True
# additional arguments for script 05_blast_against_reference.py
-bp, --blastp_path Specify the path to blastp on your system ; default = blastp
-refseq, --reference_sequences Specify path to uniprot reference fasta, default = MutaPipe_Repo/MutaPipe/Uniprot_reference_seqs/UP000005640_9606.fasta
# additional arguments for script 07_combine_data_to_get_best_n_structures_per_sequence.py
-rsl, --relative_sequence_length filter out sequences shorter than a given percentage of the reference sequence, default = 0.5
-cov, --hsp_coverage filter out sequences whose best HSP covers less than a given percentage of the reference sequence, default = 0.1
-n_best, --n_best_structures Specify number of top structures per sequence/variant to be included in final output, default = 5
Let's assume we want to run MutaPipe for the genes SOD1, ERBB4, and DCTN1 using the python scripts one after another, so we have more direct control over the options at each stage of the pipeline.
Our genes of interest for this example are stored in a file called genes.txt
Using the first MutaPipe python script 00_search_pdb.py
and the -g
flag, we can search the PDB for structures associated with our genes of interest. Let's assume we also want to restrict our search to only retrieve max. 10 PDB ID's per input gene. We can do this by setting the -a
flag to False.
python3 00_search_pdb.py -g $(cat genes.txt) -a False
The output files generated by 00_search_pdb.py
reveal that there are PDB structures for all our genes of interest. More specifically, our search retrieved:
- 10 PDB IDs for SOD1 (setting
-a
to True results in 127 PDB ID's) - 10 PDB IDs for ERBB4
- 9 PDB IDs for DCTN1
Next, we use the scipt 01_download_files.py
to download all relevant mmCif, pdb, and fasta files (for the PDB IDs identified in the previous step). While we could use the -f
flag here to specify the desired file format to be downloaded, we don't set the flag as the default option (set to download mmCif, pdb, and fasta files) is required to run the full MutaPipe pipeline. Changing the file format to be downloaded is not recommended unless the rest one doesn't intend to run the rest of the pipeline - Remember that a complete MutaPipe run always requires downloading all three file formats (mmCif, pdb, fasta).
python3 01_download_files.py
Looking at our Results directory, we now see a new subfolder for each of our input genes (SOD1, ERBB4, and DCTN1) which contains all associated files retrieved from the PDB for this gene (mmCif, pdb, fasta).
Now we are going to parse the downloaded mmCif files using the script 02_parse_cif_files.py
. We can set the flag -pp
to True or False to indicate whether we want to also extract the polypeptide sequences from the mmCif files or not. The polypeptide sequences are continuous chains of amino acids without any missing residues in the crystal structure (comparable to how the sequence is shown in PyMOL). It is not required to extract the polypeptide sequences for MutaPipe to run, so let's set the -pp
flag to False:
python3 02_parse_cif_files.py -pp False
The output files generated by 02_parse_cif_files.py
list important information on all the structures, including structure name, resolution, structure method, deposition date, and classification.
In a next step, we use script 03_parse_pdb_files_extract_unsolved_residues.py
to extract all unsolved residues in each crystall structure from the previously downloaded pdb files.
python3 03_parse_pdb_files_extract_unsolved_residues.py
The output files generated by python3 03_parse_pdb_files_extract_unsolved_residues.py
provide an overview of all unsolved residues per structure and per chain. Unresolved/missing residues are amino acids which are absent in the crystall structure (no atomic coordinates available).
Next, we are going to use the script 04_parse_fasta_files.py
to parse the downloaded fasta files as well as the fasta files created from the mmCif files (all in format _ex.fasta).
python3 04_parse_fasta_files.py
The output files generated by python3 04_parse_fasta_files.py
list important information on all the sequences/chains in each fasta file, including chain name, UniProt ID, description, species, and amino acid sequence.
Now that all the files have been parsed, we can use the script 05_blast_against_reference.py
to perform BLASTp. The script will perform a BLASTp for every sequence in every structure associated with our input genes against the corresponding reference sequence. By default, the reference sequence for each of our input genes will be retrieved from the UniProt fasta file containing the human reference proteome called UP000005640_9606.fasta
which is included in MutaPipe (version December 2021; see instructions on how to download the latest version of this file here). If we wanted to use a different file, we could set the -refseq
flag and specify the path to the fasta file containing the reference proteome we'd like to use.
python3 05_blast_against_reference.py
Looking at our Results directory, we can now see that a new subdirectory called RefSeqs
has been created. It contains the canonical amino acid sequences for all of our input genes (SOD1, ERBB4, and DCTN1) as well as a subdirectory called PDB_seqs_and_blastp_outputs
where all the PDB sequences used as query sequences in the BLASTp are stored in fasta format, and their corresponding blastp output files are stored in xml format.
Moreover, the output files generated by script 05_blast_against_reference.py
nicely summarise all BLASTp results. Note that there are two files containing potential warnings: One in case no or more than one reference sequence is identified for an input gene and one in case BLASTp fails or does not produce any significant alignements. In our case 3 warnings for the gene DCTN1 have been produced, namely:
- No alignments for
DCTN1_2hl3_ChainC.fasta
- No alignments for
DCTN1_2hqh_ChainsE,F,G,H.fasta
- No alignments for
DCTN1_3e2u_ChainsE,F,G,H.fasta
To investigate potential reasons for why there were no alignments for these sequences, we can have a look at these sequences in the output file called05_blastp_results.csv
. In our case, we can infer that the warnings have been generated because the sequences in question are actually not DCTN1 sequences, but are other protein sequences which happen to be also present in/co-crystallised with a PDB structure associated with our input genes. The sequences which produced warnings in our case are not DCTN1 sequences, but are instead (1) Microtubule-associated protein RP/EB family member 1, (2) Restin, and (3) CAP-Gly domain-containing linker protein 1.
The next two steps consist of downloading and parsing ClinVar annotations for all variants of our genes of interest. This information is used to annotate all variants available in the identified PDB sequences for the genes of interest. The script 06_a_download_ClinVar_data.py
incorporates Entrez Direct to access ClinVar and download data on all variant ids associated with our input genes. Subsequently, script 06_b_parse_ClinVar_data.py
will parse the data downloaded from ClinVar.
python3 06_a_download_ClinVar_data.py
python3 06_b_parse_ClinVar_data.py
Script 06_a_download_ClinVar_data.py
produces a file called 06_a_ClinVar_Annotations_genes_no_data_retrieved.txt
which contains a list of all genes for which no data has been retrieved from ClinVar. Script 06_b_parse_ClinVar_data.py
generates a table called 06_b_ClinVar_Annotations.csv
with information on all variants in ClinVar associated with our input genes.
Finally, the last script 07_combine_data_to_get_best_n_structures_per_sequence.py
will combine the data from ClinVar, the PDB and the BLASTp results, and subsequently, filter, rearrange and order it to get the n best structures for each variant/sequence.
Here we can filter out:
- sequences which are considered too short (compared to the reference sequence) to be valuable template structures. We can use the relative sequence length flag
-rsl
to set this value (0.1-1.0). - sequences whose best High Scoring Segment Pair (HSP) covers less than a given percentage of the reference sequence. We can use the hsp coverage flag
-cov
to set this value (0.1-1.0). (Note: A high-scoring segment pair, or HSP, is a subsegment of a pair of sequences, either nucleotide or amino acid, that share high level of similarity. The level of similarity between the sequences depends on the settings of the local or global alignment algorithm which generated them.)
For our example we'll stick to the default settings which are -rsl 0.5
and -cov 0.1
, so we don't have to set the flags.
python3 07_combine_data_to_get_best_n_structures_per_sequence.py
Script 07_combine_data_to_get_best_n_structures_per_sequence.py
produces a file called 07_wildtype_structures.csv
which contains all available WT structures associated with the input genes (top row = highest quality structure).
The other 3 csv output files generated by script 07_combine_data_to_get_best_n_structures_per_sequence.py
neatly list the n best PDB structures for each sequence/variant available in the PDB which is associated with our input genes:
- The best structures for all SAVs are contained in
07_best_structures_per_SAV.csv
- The best structures for all unique sequences associated with the input genes are contained in
07_best_structures_all_unique_combinations.csv
- The best structures for any available variant/amino acid change in the PDB regardless of other variants/AA changes in the same structure are contained in
07_best_structures_any_mutation.csv
MutaPipe creates a directory called Results
where all intermediary and final main output files are stored (Note: the -t
flag lets you specify where you'd like this folder to be created, default = current working directory). All relevant files and data retrieved or extracted from the reference proteome fasta file, the ClinVar database, and the PDB are stored in additional subdirectories in the Results
directory called ClinVar_Annotations
, RefSeqs
, and [GENENAME]_[n]structures
(see Figure 2 below for an example).
Each of the eigth python scripts incorporated in MutaPipe produces at least one main output file which will be stored directly in the Results
directory. NOTE: All MutaPipe output filenames start with the numerical identifier of the corresponding MutaPipe script, making it easy to understand which outputs have been generated at which stage of the pipeline.
After a complete MutaPipe run with the default settings the following main output files are stored in the Results
directory:
- 00_search_overview_availability.csv * lists all input genes with a boolean value indicating whether corresponding PDB data exists
- 00_search_overview_PDBids.csv * lists all input genes with available PDB data and their corresponding PDB IDs
- 01_search_overview_folders.csv * lists all newly created directories (on for each input gene with available PDB data) and their content
- 01_search_overview_n_structures.csv * lists number of structures retrieved per gene (simple overview)
- 02_all_poly_seq.csv * lists polypeptide sequences for all PDB structures associated with the input genes
- 02_all_resolutions.csv * lists resolutions for all PDB structures associated with the input genes
- 03_unsolved_residues_per_chain.csv * lists all unsolved residues in all chains of all PDB structures for all genes (one row for each chain)
- 03_unsolved_residues_per_structure.csv * lists all unsolved residues in all PDB structures for all genes (one row for each structure)
- 04_fasta_combined_info.csv * lists combined information extracted from all fasta and fasta_ex files for all available sequences in the PDB associated with the input genes
- 04_fasta_info.csv * lists information extracted from all fasta files (downloaded from PDB) for all available sequences in the PDB associated with the input genes
- 04_fasta_ex_info.csv * lists information extracted from all _ex.fasta files (extracted from mmCif files) for all available sequences in the PDB associated with the input genes
- 05_blastp_results.csv * lists all the information in 04_fasta_combined_info.csv and the corresponding blastp results for every sequence
- 05_blastp_warnings.csv * lists blastp warnings (if blastp failed or if there is more than one alignment)
- 05_refeseq_warnings.csv * lists genes with no or more than one identified reference sequence (only first one is used for further analyses)
- 06_a_ClinVar_Annotations_genes_no_data_retrieved.txt * lists all genes for which no ClinVar annotations could be retrieved
- 06_b_ClinVar_Annotations.csv * lists all variants availalbe in ClinVar and their annotations for the input genes
- 07_best_structures_all_unique_combinations.csv * lists best structure for all unique sequences/mismatch combinations for all input genes, incl. ClinVar annotations
- 07_best_structures_any_mutation.csv * lists best structure for any mismatch regardless of other mismatches in the same structure for all input genes, incl. ClinVar annotations
- 07_best_structures_per_SAV.csv * lists best structure for each SAV (one mutation per structure) for all input genes, incl. ClinVar annotations
- 07_wildtype_structures.csv * lists all available WT structures for all input genes
+NOTE: A section explaining all columns in main output files will be added shortly!
To avoid cluttering the directory containing the main output files, MutaPipe creates subfolders in the Results
directory where it stores any additional files and relevant data retrieved or extracted from the reference proteome fasta files, the ClinVar database, and the PDB.
Note: For each input gene with available PDB data, MutaPipe creates a subdirectory containing (1) data from the PDB retrieved via the PDB API (i.e. the corresponding mmCif, fasta, and pdb files) as well as (2) gene-specific output files generated at different steps in the pipeline.
MutaPipe creates the following subdirectories in Results
:
- ClinVar_Annotations * subdirectory where all ClinVar data is stored
- RefSeqs * subdirectory where all the reference sequences and other files used/generated by BLASTp are stored
- [GENENAME]_[n]structures * subdirectory for each input gene where all corresponding data from the PDB as well as gene-specific output files are stored
Let's assume we want to get the best PDB structures for all availbale NEK1-related sequences in the PDB. Apart from ensuring all the mmCIF, PDB, and fasta files get deleted again after parsing to save space by setting the the -d option, we stick to the default settings and thus use the following command to obtain results:
./MutaPipe.sh -g NEK1 -d True
Consequently, we see that a new folder called Results
has been created in our current working directory. It includes all the main output files as well as three additional subdirectories called ClinVar_Annotations
, RefSeqs
, and NEK1_2structures
.
Figure 2. Output generated by MutaPipe a) looking at the content in Results
b) looking at the content in subdirectories ClinVar_Annotations
, RefSeqs
, and NEK1_2structures
. Note: MutaPipe's main output files and gene-specific output files are highlighted in yellow in 3a. and 3b., respectively. If we didn't set the -d option to delete the files after parsing, the NEK1_2structures
folder would also contain all NEK1 mmCIF, PDB and fasta files.
MutaPipe will need a fasta file with all canonical protein sequences of the species of interest.
NOTE: When installing MutaPipe (see Local Deployment) a fasta file of the reference proteome for Homo sapiens will be included (version from December 2021). Please feel free to update to a newer version if one is available (we will also regularly update the reference proteome file on here, but this is currently not done automatically).
In order to run MutaPipe a fasta file containing the reference proteome / canonical protein sequences for the organism in question (= Homo sapiens) will have to be downloaded from UniProt and stored in a folder called Uniprot_reference_seqs
in the MutaPipe
directory (where the python scripts are stored). The filename should be UP000005640_9606.fasta
for the script to work properly.
The latest version of this fasta file can be accessed via the UniProt ftp client.
MutaPipe has been developed by Deborah Ness, an NIHR Maudsley Biomedical Research Centre PhD Student, in collaboration with and supervised by Dr Alfredo Iacoangeli, for her PhD project at the Department of Basic and Clinical Neuroscience at King's College London.
This work has not yet been published (April 2022).