-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Some curious observations on STITCH results #81
Comments
Nice writeup and great plots! First small question, do you know the recombination rate and landscape for this species? Is it similar to mice and humans? There are some heuristics internally which are based on a recombination rate landscape like seen in mice and humans. If things are different (in particular if there are a higher density of recombination hotspots), changing some settings might be useful. Second question, are the results stable? If you re-impute using K=8 a few times in the same region (and the other K's), do you see similar results? Same with K=16. STITCH has an option "S" to average across multiple runs internally. I hope it still works, I can't quite remember if it's fully functional. I admit I don't normally use it as the performance gains are normally very small. Third question, can you try re-imputing one of the regions but turn on The comment about how you see changes in the ~0-0.2 range is interesting. If it's a perfect copy, I wonder if there is something like a duplication (or multiple fold change) in coverage. Do you see anything different about the depth of coverage across all samples at this SNP? Otherwise, in general, variants that impute well, but the dosage is so low suggests to me that two ancestral haplotypes are inaccurately collapsed together. This suggests the internal heuristics aren't working optimally. Another explanation might be if one or more of your original 8 inbred lines weren't actually fully inbred in some genomic locations? I'm a bit surprised your p-value plots aren't block-ier, given the recent founding. This does make me wonder if recombination rate is high in this species? Your final plot also suggests a weird absence of blocks. You have a primary signal around ~19 Mbp, which then decays over the rest of the chromosome. But you lose signal in some large chunks, around ~15 Mbp. I wonder why that is, maybe the haplotype that the signal is on becomes recently IBD with another founding haplotype so the signal is attenuated? Hopefully not something like an uncorrect genome assembly problem. Note that you can also impute / get out the imputed founder haplotype using |
Thanks a lot for the very quick response! I will think about those points and come back here. |
So the second one which is the ancestral haplotype frequencies looks pretty good. There are a lot of INFO = 1 SNPs (I assume monomorphic alt) which you can remove as they just waste time in imputation. There are also a lot of likely false positives which you could probably clean up in a two round imputation (like in the original paper, re-impute using only SNPs at INFO > 0.4 or similar). I think the heuristics are working OK but they could do better. You can try setting |
I have some curious observation that maybe you can help me with. I am using STITCH to impute ~1600 fish samples at variable levels of coverage (0.2x to 3x). These samples are F2 individuals from a complex cross setup of from 8 inbred lines (one line crossed to all the others, and some of the remaining lines crossed to each other). The number of individuals per cross varies but is in the n=100 range. I used stitch with 8, 16, 32 ancestral haplotypes (K), and the info score looks better the higher the value of K (see plot). I do not have an external validation.
I also run a GWAS on the imputed dosage for each round of imputation, and I noticed that in many loci what is a single peak in the 16 haplotypes model becames 2 peaks in the 8 haplotypes model (as an example see images - first one with K=16, second with K=8).
Do you have any idea how can this happen as a result of the way STITCH imputes the dosages? I also imputed the samples with a custom HMM and the results agree more with the K=16 model overall.
Another curious observation is the following - I was plotting the relationship between imputed dosage and phenotype for a specific locus, split up by the F2 cross membership of the samples. I noticed that at some loci the association signal is very strong, but the actual dosage is almost 0 for all the samples. Digging in, I realized that this is due to the fact that the dosages are all almost 0 but not quite, and the small nudges from 0 that they have perfectly copy the pattern that I observe at other loci in the 0-2 dosage range, but this time in a tiny ~0-0.2 range. Do you have any guess on how can this happen, and how can I avoid such cases from giving a spurious association signal?
I suspected that maybe some haplotype has a non-0 allele at the SNP, and it has a non-0 likelihood and so nudges the dosages a bit in a way that is correlated to the status of some SNP with a true association signal. The first image shows the genotype to phenotype relationship in the 0-2 scale, the second image the same thing for the same locus but zooming in in the genotype axis.
The text was updated successfully, but these errors were encountered: