Prepare GTDB Tk data
- Get all genomes used to generate Archaeal and Bacterial tree:
cat data_from_db/gtdb_bac_taxonomy.tsv |awk '{print $1}' > raw_genomes.lst
cat data_from_db/gtdb_arc_taxonomy.tsv |awk '{print $1}' >> raw_genomes.lst
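The tree-creation commands further down take raw_bacterial.lst and raw_archaeal.lst, which are not shown being created here. A minimal sketch, assuming those two files are simply the per-domain first columns of the same taxonomy files (the function name `write_domain_lists` is hypothetical):

```shell
# Write one accession list per domain from the first column of each taxonomy file.
# Assumption: raw_bacterial.lst and raw_archaeal.lst are just these per-domain lists.
write_domain_lists() {
    bac_tsv=$1   # e.g. data_from_db/gtdb_bac_taxonomy.tsv
    arc_tsv=$2   # e.g. data_from_db/gtdb_arc_taxonomy.tsv
    awk '{print $1}' "$bac_tsv" > raw_bacterial.lst
    awk '{print $1}' "$arc_tsv" > raw_archaeal.lst
    # The combined list is the bacterial list followed by the archaeal list.
    cat raw_bacterial.lst raw_archaeal.lst > raw_genomes.lst
    # Print any accession that appears more than once in the combined list.
    sort raw_genomes.lst | uniq -d
}
```

Calling `write_domain_lists data_from_db/gtdb_bac_taxonomy.tsv data_from_db/gtdb_arc_taxonomy.tsv` writes the three lists into the current directory; empty output means no duplicated accessions.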
- Pull the fna files of those genomes in a folder:
gtdb genomes pull --batchfile raw_genomes.lst --genomic --output fastani/
- Archive all genomes in the postprocessed/fastani folder:
pgzip *.fna
Copy the create_genome_paths.sh script from the scripts folder to the fastani folder (above database/) and run it.
GTDB-Tk also needs the untrimmed version of each MSA:
gtdb tree create --no_trim --no_tree --genome_batchfile raw_bacterial.lst --guaranteed_batchfile raw_bacterial.lst --output bacterial_msa --marker_set_ids 1 --classic_header
gtdb tree create --no_trim --no_tree --genome_batchfile raw_archaeal.lst --guaranteed_batchfile raw_archaeal.lst --output archaeal_msa --marker_set_ids 19 --classic_header
- Copy new msa files to GTDB-Tk package
cp bacterial_msa/gtdb_concatenated.faa gtdbtk_package/msa/gtdb_r<#>_bac120.faa
cp archaeal_msa/gtdb_concatenated.faa gtdbtk_package/msa/gtdb_r<#>_ar53.faa
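Before shipping the copied MSA files, it may be worth confirming that each one is still a valid alignment, meaning every sequence has the same number of columns. A minimal sketch (the function name `check_msa` is hypothetical, not part of any GTDB tool):

```shell
# Print the number of sequences and the alignment width of a FASTA MSA.
# Returns non-zero if the sequences do not all have the same length.
check_msa() {
    awk '
        # Close the record collected so far and compare its length to the first one.
        function finish_record() {
            if (name == "") return
            n++
            if (n == 1) width = len
            else if (len != width) {
                printf "MISMATCH %s: %d vs %d\n", name, len, width
                bad = 1
            }
        }
        /^>/ { finish_record(); name = $0; len = 0; next }
        { len += length($0) }   # sequences may be wrapped over several lines
        END { finish_record(); printf "%d sequences, %d columns\n", n, width; exit bad }
    ' "$1"
}
```

For example, `check_msa gtdbtk_package/msa/gtdb_r<#>_bac120.faa` with the release number filled in. Files with Windows line endings would count the carriage return as a column, so this assumes Unix line endings.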
Get the original masks from the original run from
/srv/projects/gtdb/release207/bacteria/pre_curation/bac120/20211110/msa/gtdb_r207_bac120_mask.txt
We use the original trees (before they are imported into ARB) as the reference trees, because ARB rounds the branch lengths of the tree from 6 to 4 decimal places.
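One rough way to tell whether a tree file has already been through ARB is to look at how many decimal places its branch lengths carry. The sketch below is only a heuristic: it assumes branch lengths are written as plain decimals after a colon (lengths in scientific notation are not measured correctly), and the function name `max_branch_decimals` is hypothetical.

```shell
# Print the largest number of decimal places found among the
# plain-decimal branch lengths (":<digits>.<digits>") of a Newick file.
max_branch_decimals() {
    grep -oE ':[0-9]+\.[0-9]+' "$1" |
        awk -F. '{ if (length($2) > max) max = length($2) } END { print max + 0 }'
}
```

A result of 6 is consistent with an original tree, while 4 suggests the branch lengths have been rounded.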
- Decorate the rooted tree with the taxonomy:
phylorank decorate gtdb_r207_bac120.rooted.fullids.tree ../../taxonomy/bac120_taxonomy_r207_reps.tsv gtdb_r207_bac120_decorated_fullids.tree --skip_rd_refine
TODO: convert the original tree (Arc and bac) from canonical ids to full ids.
phylorank outliers gtdb_r207_bac120_decorated_fullids.tree ../../taxonomy/bac120_taxonomy_r207_reps.tsv phylorank_outliers --skip_mpld3
- Get the 2 dictionaries from the output of the outliers command and paste them into the metadata.txt file
- Edit version variable
The pplacer packages are created using the official tree and the official trimmed MSA.
- Optional: remove dummy node using gtdb_validation_tk.
gtdb_validation_tk remove_dummy gtdb_<release>_ar_curated.tree gtdb_<release>_ar_no_dummy.tree
- Strip the taxonomy from the decorated tree:
conda activate genometreetk-0.1.8
genometreetk strip gtdb_<release>_bac_no_dummy.tree bac120_<release>_stripped.tree
genometreetk strip gtdb_<release>_ar_no_dummy.tree ar53_<release>_stripped.tree
- Use FastTree to generate a fitting log, only for the archaeal tree:
FastTreeMP -wag -nome -mllen -intree ar53_<release>_stripped.tree -log fitting_stats.log < ar_msa_<release>.faa > ar_<release>_fitted.tree
We are using the original FastTree log file for the bacterial tree.
- Unroot the tree:
hatchet unroot --input_tree gtdb_r207_bac120_decorated_fullids.tree --output_tree gtdb_r207_bac120_decorated_unrooted.tree
- Remove spaces from gtdb_r207_bac120_decorated_unrooted.tree
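No command is given for the space-removal step. A minimal sketch using `sed`, assuming that "remove" means deleting every space outright; an optional third argument substitutes a replacement string such as an underscore instead. The function name and the output file name in the usage note are hypothetical.

```shell
# Copy a Newick tree, replacing every space with the given string.
# With no third argument, spaces are deleted.
# The replacement must not contain "/" or "&", which sed treats specially.
strip_tree_spaces() {
    in_tree=$1
    out_tree=$2
    replacement=${3-}
    sed "s/ /$replacement/g" "$in_tree" > "$out_tree"
}
```

For example, `strip_tree_spaces gtdb_r207_bac120_decorated_unrooted.tree unrooted_nospace.tree`. Writing to a separate file keeps the original tree available for comparison.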
- Generate pkg folder:
conda activate taxtastic-0.9.0
taxit create -l gtdbtk.refpkg -P gtdbtk.refpkg --aln-fasta <msa_file> --tree-stats <fasttree_log_file> --tree-file <decorated_unrooted.tree>
- Copy the pplacer package in GTDB-Tk data folder
conda activate hatchet-0.0.2
hatchet hatchet_wf -d bac -t ../phylorank/gtdb_r220_bac120.decorated.fullids.tree --msa bac120_msa_r220.faa --tax ../../taxonomy_files_reps/bac120_taxonomy_r220_reps.tsv -o split/ --red_file ../phylorank/phylorank_outliers_bac120/gtdb_r220_bac120.decorated.fullids.node_rd.tsv --original_log gtdb_r220_bac120_fasttree.log --metadata ../../metadata_files/bac120_metadata_r220.tsv
Copy the output directory contents to the GTDB-Tk package: high_level/gtdbtk_package_backbone.refpkg/, high_level/high_red_value.tsv, species_level/gtdbtk.package.*.refpkg/, species_level/red_value*.tsv, species_level/tree_mapping.tsv
cat sp_clusters.tsv | awk 'BEGIN {FS="\t"}; {printf ("%s\t%s\t%s\n", $2, $1, $4)}' > gtdb_radii.tsv
Rename versions:
find . -type l -name 'ar*' -exec rename 's/86/86.1/' {} \;
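Because the substitution replaces the first "86" it finds, running it a second time would turn "86.1" into "86.1.1", so a dry run is a sensible first step. The sketch below previews the change without touching anything. It is an approximation: it applies the substitution to the link's base name only, whereas `rename` is handed the full path from `find`. The function name `preview_version_rename` is hypothetical.

```shell
# List each symlink under the current directory whose name starts with "ar",
# followed by the name it would get after replacing the first "86" with "86.1".
# Nothing is renamed.
preview_version_rename() {
    find . -type l -name 'ar*' | while IFS= read -r link; do
        dir=$(dirname "$link")
        base=$(basename "$link")
        new=$(printf '%s\n' "$base" | sed 's/86/86.1/')
        # Only report links whose name would actually change.
        [ "$base" = "$new" ] || printf '%s -> %s/%s\n' "$link" "$dir" "$new"
    done
}
```

Run `preview_version_rename` from the directory that holds the links and check the listed pairs before running the real command.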