Python package to simulate and estimate Y-STR mutation models
####Running MUTEA MUTEA's analyses require two main sources of input: a VCF file containing STR calls generated by HipSTR and a rooted phylogenetic tree in Newick format relating the samples in the VCF file. The VCF file must be bgzip compressed and indexed using tabix. To generate the phylogenetic tree, we recommed that you use RAxML. In the examples that follow, we use the data from our recent publication "Population-Scale Sequencing Data Enable Precise Estimates of Y-STR Mutation Rates" to demonstrate how to run MUTEA.
First, let's download a subset of the data used in the publication:
git clone https://github.com/tfwillems/ystr-mut-rates
Next, we want to estimate a stutter model for each STR in the VCF. These models capture the error profile present at each STR. We estimate these profiles as follows:
python main.py --tree ystr-mut-rates/1000Y.MLtree.lengths.bootstraps.rooted.rotated.20140826.nwk
--out example
--vcf ystr-mut-rates/1kg_calls.variable.min_qual.vcf.gz
--calc_stutter_probs
Here, --tree indicates the path to the phylogenetic tree while --vcf indicates the path to the VCF file. Upon completion, every line in example.stutter_probs.txt contains a stutter model for a Y-STR in the VCF. We can now use these stutter models to estimate the mutation rates of Y-STRs in the VCF. We do this using the command:
python main.py --tree ystr-mut-rates/1000Y.MLtree.lengths.bootstraps.rooted.rotated.20140826.nwk
--out example
--vcf ystr-mut-rates/1kg_calls.variable.min_qual.vcf.gz
--use_read_counts example.stutter_probs.txt
--loc_range chrY:1-100000000
--max_tmrca 84000
--gen_time 30
--calc_mut_rates
Here, --use_read_counts indicates that we want to use the read information in the VCF in conjunction with the stutter models we estimated above. The option --loc_range is required and indicates the region of the chromosome that we're interested in analyzing. This option can be used to analyze all Y-STRs by setting the range to the entire chromosome or can be used to parallelize analyses by restricting separate threads to different regions of the chromosome. Importantly, you should also specify the options --max_tmrca and --gen_time. The ratio of max_tmrca/gen_time
is used to scale each branch in the tree from a numerical value to an equivalent number of generations. Without appropriate values for these parameters, the estimated mutation rates will not be valid. The values for these two options are entirely dependent on the phylogeny being used and therefore default values should not be used. To determine appropriate values, we suggest that you run MUTEA on a subset of Y-STRs using --gen_time 1
and a range of values for --max_tmrca
. The estimates obtained for each value of --max_tmrca
can then be compared to estimates present in large databases such as the YHRD to determine the value that best matches previously published mutation rates.
After running the command depicted above, each line in example.mut_rates.txt will contain the estimated mutation model for each Y-STR. The first column in these files indicates the name of the Y-STR, while the second column indicates the log10 mutation rate.
For more information about additional MUTEA options, please type python main.py -h