wbps_scratch

Scratch scripts for performing analysis related to WormBase ParaSite data.

`./parse_pdb.py pdb_dir`

pdb_dir should contain .pdb files of AlphaFold (AF) models. Each file is parsed using BioPython, and statistics are derived from the extracted pLDDT scores. Prints CSV-formatted lines to the console in the form "pdb_filename,mean,median,stdev,var,max,min,perc_confident", where perc_confident is the percentage of AF model residues with a pLDDT above the 70 ("Confident") threshold.
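
As a rough illustration of the statistics listed above, the sketch below reads pLDDT values from the B-factor column of CA atoms and formats one CSV line. It is not the code in parse_pdb.py: it parses the fixed-width PDB columns directly instead of using BioPython, the function names are hypothetical, and it assumes sample (n-1) standard deviation and variance and two-decimal formatting.

```python
import statistics

CONFIDENT_THRESHOLD = 70.0  # pLDDT cut-off for the "Confident" band


def plddt_from_pdb_lines(lines):
    """Collect one pLDDT value per residue from the B-factor column of CA atoms."""
    scores = []
    for line in lines:
        # Fixed-width PDB format: atom name in columns 13-16, B-factor in 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    return scores


def summarise(filename, scores):
    """Return 'filename,mean,median,stdev,var,max,min,perc_confident' as a CSV line."""
    n_confident = sum(score > CONFIDENT_THRESHOLD for score in scores)
    fields = [
        statistics.mean(scores),
        statistics.median(scores),
        statistics.stdev(scores),      # assumption: sample standard deviation
        statistics.variance(scores),   # assumption: sample variance
        max(scores),
        min(scores),
        100.0 * n_confident / len(scores),
    ]
    return filename + "," + ",".join(f"{value:.2f}" for value in fields)
```

Calling `summarise("model.pdb", plddt_from_pdb_lines(open("model.pdb")))` would give one line per model; at least two residues are needed for the sample statistics.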

`pdb_analysis.ipynb`

Copy the (large) datasets with scp from MARS:/users/whh2g/sharedscratch/parse_pdb/ and place them in data/from_MARS/. Uses matplotlib to produce plots in the output directory plots/.

`./get_all_species.py output_dir`

Download all gzipped gff3 and genomic fasta files for all species from the current WBPS release, and store in output_dir.

`./prepare_geenuff_inputs.py input_dir output_dir`

Create the directory structure needed to be compatible with the import2geenuff.py script from GeenuFF. input_dir should contain the raw gzipped gff3 and genomic fasta files as gathered by get_all_species.py.

`determine_training_species.ipynb`

Uses TSV of BUSCO scores for annotation and assembly of all WBPS species to select candidates for reannotation, model training and validation in Helixer. Outputs text lists of candidate species to data/helixer_training_species/.

`./prepare_helixer_training_inputs.py train_set_file valid_set_file train_dir`

Prepare symbolic links in train_dir with the following hierarchy, using train_set_file and valid_set_file species lists:

<train_dir>
├── training_data.species_01.h5 -> ../h5s/speciesA/test_data.h5
├── training_data.species_02.h5 -> ../h5s/speciesB/test_data.h5
├── validation_data.species_03.h5 -> ../h5s/speciesC/test_data.h5
└── validation_data.species_04.h5 -> ../h5s/speciesD/test_data.h5
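
The hierarchy above could be produced along the following lines. This is a minimal sketch, not the script itself: the function name is hypothetical, the species lists are passed in as Python lists where the script reads them from train_set_file and valid_set_file, and it assumes the numbering runs continuously across the training and validation sets, as in the example tree.

```python
from pathlib import Path


def link_h5_files(train_dir, train_species, valid_species):
    """Create numbered symlinks in train_dir pointing at ../h5s/<species>/test_data.h5."""
    train_dir = Path(train_dir)
    train_dir.mkdir(parents=True, exist_ok=True)

    labelled = [("training_data", name) for name in train_species]
    labelled += [("validation_data", name) for name in valid_species]

    created = []
    for index, (prefix, species) in enumerate(labelled, start=1):
        link = train_dir / f"{prefix}.species_{index:02d}.h5"
        target = Path("..") / "h5s" / species / "test_data.h5"
        if link.is_symlink():
            link.unlink()  # replace a stale link from an earlier run
        # The target is relative to train_dir and may dangle until the h5 file exists.
        link.symlink_to(target)
        created.append(link.name)
    return created
```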

`./get_gene_map.py gffcmp_refmap`

Extract gene IDs from given GFFCompare .refmap file gffcmp_refmap using a regex pattern. Prints TSV-formatted lines to console of ref_gene_id target_gene_id.
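
A regex-based extraction of this kind might look like the sketch below. The pattern used in get_gene_map.py is not shown here, so this one is an assumption: it expects four tab-separated columns (ref_gene_id, ref_id, class_code, qry_id_list), with qry_id_list holding comma-separated `gene_id|transcript_id` entries. The function name is hypothetical.

```python
import re

# Assumed row layout: ref_gene_id <TAB> ref_id <TAB> class_code <TAB> qry_id_list
ROW = re.compile(r"^(?P<ref_gene>[^\t]+)\t[^\t]+\t[^\t]+\t(?P<qry_list>.+)$")


def gene_pairs(lines):
    """Return unique (ref_gene_id, target_gene_id) pairs in order of first appearance."""
    pairs = []
    for line in lines:
        match = ROW.match(line.rstrip("\n"))
        if match is None or match["ref_gene"].startswith("ref_gene"):
            continue  # skip malformed rows and the header row
        for entry in match["qry_list"].split(","):
            # Each entry is assumed to be "gene_id|transcript_id"; keep the gene part.
            pair = (match["ref_gene"], entry.split("|")[0])
            if pair not in pairs:
                pairs.append(pair)
    return pairs
```

Printing `"\t".join(pair)` for each returned pair gives the two-column TSV described above.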

`./busco_preprocessing.sh seqfile annfile`

Sort annotation GFF3 annfile and translate CDS to protein sequences using genome fasta seqfile. These can then be used for running BUSCO in "proteins" mode.

`./get_prot_seq_for_uniprot_acc.py uniprot_acc`

Call the EBI AlphaFold API to get the protein sequence for a given uniprot_acc.
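
For orientation, a request of this kind can be made with the standard library alone. The sketch below is not the script's code: the function names are hypothetical, and it assumes the prediction endpoint returns a JSON list whose first entry carries the sequence under a `uniprotSequence` key, which should be checked against the current API documentation.

```python
import json
import urllib.request

# Assumed endpoint of the AlphaFold Protein Structure Database API.
API_ROOT = "https://alphafold.ebi.ac.uk/api/prediction/"


def sequence_from_payload(payload):
    """Pull the sequence out of a decoded response (a list of prediction entries)."""
    if not payload:
        raise ValueError("no AlphaFold entry returned")
    # Assumption: the sequence is stored under the "uniprotSequence" key.
    return payload[0]["uniprotSequence"]


def get_protein_sequence(uniprot_acc, timeout=30):
    """Fetch the prediction entry for uniprot_acc and return its protein sequence."""
    with urllib.request.urlopen(API_ROOT + uniprot_acc, timeout=timeout) as response:
        return sequence_from_payload(json.load(response))
```

Keeping the parsing separate from the network call lets the former be checked against a saved response without contacting the server.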

`./find_microexon_genes.py output_dir`

Prints to the console the genes that meet the criteria defined within the script, followed by their total number. If the optional output_dir is specified, writes MD-formatted exon lengths for each microexon gene.

`./run_omark.sh input db`

Run omamer search and omark for a given protein FASTA file input and OMAmer H5 file db. Ensure these commands are available in PATH, either through virtualenv or otherwise.

`compare_gene_maps.ipynb`

Analyse the gene maps between Strongyloides stercoralis WBPS18 and WBPS19 releases, produced from my own running of Liftoff (see get_gene_map.py) and also the official WBPS mapping pipeline.

`./filter_Xx_ctg_for_Artemis.sh ctg_num`

Filter FASTA and GFF files for given ctg_num of species Xx (so far Ac - Ancylostoma ceylanicum and Sm - Schistosoma mansoni) and launch Artemis with these as inputs. The hard-coded files need to be available in the directory from which the script is run.

`./schisto_orthogroup_pipeline_N.py [OPTIONS]`

For an OrthoFinder orthogroup (HOG) output table, iterate through all orthogroups which contain at least one orthologous transcript from each of a selection of N species, write .bed files of CDS boundaries for each species' transcript, and use pyGenomeTracks to plot the tracks. Plotting is only carried out when the --do-plot option is given, while --overwrite will overwrite a plot even when it already exists. An additional TSV output file, containing analysis for each HOG, is generated on each run in data/schistosome_orthogroups/. Several of the TSV columns are only filled when the option --load-blast is supplied (as they rely on BLAST outputs from the OrthoFinder working directory). In addition, supplying --clade integer will filter these BLAST-specific data to only the relevant species clade.
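
The first step, keeping only orthogroups with at least one transcript from every selected species, can be sketched as follows. This is an illustration, not the pipeline's code: the function name is hypothetical, and it assumes a tab-separated table with a `HOG` column plus one column per species holding comma-separated transcript IDs.

```python
import csv


def hogs_covering_all(table, species):
    """Yield (hog_id, {species: [transcripts]}) for rows where every species is present."""
    reader = csv.DictReader(table, delimiter="\t")
    for row in reader:
        members = {
            name: [t.strip() for t in (row.get(name) or "").split(",") if t.strip()]
            for name in species
        }
        # An empty list is falsy, so a single missing species rejects the row.
        if all(members.values()):
            yield row["HOG"], members
```

Here `table` is any open text file (or other iterable of lines), and `species` is the list of the N column names to require.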

`schistosome_orthologue_analysis.ipynb`

Analyse interesting cases of orthogroup exons based on given statistical metrics. Uses the outputs from analyse_schistosome_orthogroups.py.

`./find_biologically_interesting_genes.py pfamout_path [OPTIONS]`

For a given pfamout_path, determine whether each protein qualifies as "biologically interesting" and report such genes (to the console, or to --output if given) that are unique to a given tool (i.e. Helixer or BRAKER) based on OrthoFinder orthogroups. If the --combined option is given, report genes predicted by both tools but not existing in WBPS. If the --odb option is given, consider only proteins that have orthologues inside/outside Nematoda ODB10 when it is equal to "exists"/"missing", respectively.

`venn3_orthogroups.ipynb`

Create 3-way venn diagram plots of orthogroups from OrthoFinder outputs.
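
The counts behind a 3-way Venn diagram are plain set arithmetic. The sketch below is a hypothetical helper, not taken from the notebook: it computes the seven region sizes from three collections of orthogroup IDs, which could then be handed to whichever plotting library is in use.

```python
def venn3_region_sizes(a, b, c):
    """Return the size of each of the seven regions of a 3-way Venn diagram."""
    a, b, c = set(a), set(b), set(c)
    return {
        "A only": len(a - b - c),
        "B only": len(b - a - c),
        "C only": len(c - a - b),
        "A and B only": len((a & b) - c),
        "A and C only": len((a & c) - b),
        "B and C only": len((b & c) - a),
        "A and B and C": len(a & b & c),
    }
```

The seven values always sum to the size of the union of the three sets, which is a quick sanity check before plotting.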

`global_pident_inference.ipynb`

Test various formulae for estimating global alignment pident just from BLAST outputs. Outputs residual plots. If there is a good model, it will negate the need to run the inefficient Needleman-Wunsch algorithm for each all-vs-all pairing of sequences to determine global alignment identity.

`./filter_longest_transcripts.py input_gff3`

Iterates through all genes in a given input_gff3 and, for each gene, selects the transcript with the longest protein-coding length. Prints each gene, its longest transcript and all of that transcript's child features to stdout.
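
The selection step can be sketched as below. This is a simplified illustration with hypothetical function names, not the script's implementation: it assumes transcripts are `mRNA` or `transcript` features with `ID` and `Parent` attributes, measures coding length as the summed length of `CDS` features, keeps the first transcript seen on a tie, and returns a gene-to-transcript mapping without printing the GFF3 records.

```python
from collections import defaultdict


def attributes(column):
    """Parse a GFF3 attribute column ('key=value;key=value') into a dict."""
    return dict(field.split("=", 1) for field in column.strip().split(";") if "=" in field)


def longest_transcripts(gff3_lines):
    """Map each gene ID to the ID of its transcript with the greatest total CDS length."""
    parent_gene = {}               # transcript ID -> gene ID
    cds_length = defaultdict(int)  # transcript ID -> summed CDS length
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        ftype, start, end, attrs = cols[2], int(cols[3]), int(cols[4]), attributes(cols[8])
        if ftype in ("mRNA", "transcript") and "ID" in attrs and "Parent" in attrs:
            parent_gene[attrs["ID"]] = attrs["Parent"]
        elif ftype == "CDS" and "Parent" in attrs:
            for parent in attrs["Parent"].split(","):
                cds_length[parent] += end - start + 1  # GFF3 coordinates are inclusive

    best = {}
    for transcript, gene in parent_gene.items():
        if gene not in best or cds_length[transcript] > cds_length[best[gene]]:
            best[gene] = transcript
    return best
```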

`reannotation_X_all.ipynb`

Perform analysis on reannotation of species X, comparing output of automated tools with canonical WBPS annotation. Parses OrthoFinder results and derives statistics such as InterPro accession frequencies.
