Plot muts #64
Conversation
I see you are expecting the sequences to all be the same length, or you throw an error.
From my understanding, p-distance is the number of differing sites between two sequences divided by the sequence length. It's not really an alignment; it's a limited metric that presupposes the sequences are the same length and already aligned. We could align them prior to this if we wanted to, or approach it differently.
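For concreteness, p-distance as described here can be sketched like this (a hypothetical helper, not code from this repo), including the same-length check mentioned above:

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two equal-length,
    pre-aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    diffs = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return diffs / len(seq_a)

print(p_distance("ACGT", "ACGA"))  # -> 0.25
```

Note this only makes sense on aligned input: any indel shifts every downstream site and inflates the distance.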
From talking with @InaMBerry and @mmelendrez, I don't think we can assume the sequences were pre-aligned; they will probably need to be aligned, as I did in the MutationCount project.
They will require an investigator check after alignment, since not all aligners behave the same. The investigator can also input an alignment, in which case alignment is assumed.
Hmmm, this looks suspicious. Most of the references should be within the CI, since the interval was calculated on the references themselves. In addition, I calculated the CI for a normal distribution for these, and the query and all the refs were within it, so this does not make sense. I agree, this is not a good test dataset. In fact it is not a test set at all; it is real data. I am working on making better ones...
I calculated the CI purely from the line-of-fit, which doesn't fit very well...
CI is usually calculated from the standard deviation of the population, which in this case should be the distances from the base reference to the other references. Distance from the base reference to itself (0) should not be included in the calculations. Maybe that's the problem?
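That calculation could look like the following sketch (`normal_ci` is a hypothetical helper, not code from this repo), using a normal approximation and excluding the base reference's self-distance of 0 as suggested:

```python
import math

def normal_ci(distances, z=1.96):
    """Approximate 95% CI from the mean and sample standard deviation
    of the base-reference-to-other-reference distances.

    The base reference's distance to itself (0) must NOT be included
    in `distances`, or it will drag the mean down and widen the spread.
    """
    n = len(distances)
    mean = sum(distances) / n
    sd = math.sqrt(sum((d - mean) ** 2 for d in distances) / (n - 1))
    return mean - z * sd, mean + z * sd

# distances from the base reference to four other references
lo, hi = normal_ci([0.031, 0.045, 0.038, 0.052])
```

By construction most of the reference distances fall inside this interval, which matches the expectation voiced above.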
Indeed, this is different from how I calculated it.
Here is a Google spreadsheet that shows roughly what we think we are looking for. At least it is something we know we need to look for in our final graphic.
There is a second tab at the bottom for Jun's sequence (which I didn't notice at first).
Yes, that looks about right. But I would also like to see all the dots for the reference and query sequences on the graph. I was also wondering if it would be possible to write the names of query sequences that fall outside of the Poisson interval to the output file? It is easy to figure out with a small dataset, but if we have lots of query sequences it will be more difficult, and it is important to know which sequence is potentially bad.
I've corrected the spreadsheet with our actual example, which is very messy because we have a sample from 2011 with few mutations. @InaMBerry, that is certainly possible. I'm not sure how we would infer the interval, but I welcome ideas. If we figure that out, printing the outliers is no problem (and would already work in those cases where the query is not the newest sequence).
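One minimal way to report the requested names — a sketch with hypothetical names (`outlier_names`, the example query labels), assuming the interval bounds have already been computed elsewhere:

```python
def outlier_names(values, lo, hi):
    """Names whose distance/mutation count falls outside [lo, hi].

    `values` maps query-sequence name -> its value; returns the sorted
    names of potential outliers so they can be written to the output file.
    """
    return sorted(name for name, v in values.items() if not lo <= v <= hi)

queries = {"query_2011": 2, "query_2013": 9, "query_2015": 14}
print(outlier_names(queries, lo=6, hi=19))  # -> ['query_2011']
```

Scaling-wise this is trivial, so it stays cheap even with many query sequences; the hard part, as noted above, is inferring the interval itself.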
…t_muts (conflicts: tests/test_deprecation.py)
Closes #63
I want to write a test for do_plot which uses some standard numerical data and makes sure that I am using the Poisson confidence interval correctly. Caveats:
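As a sketch of that kind of standalone numerical check — pure stdlib, with hypothetical helper names, and making no assumption about do_plot's actual signature — one could verify empirically that the central Poisson interval covers roughly 95% of simulated counts:

```python
import math
import random

def poisson_ppf(q, mu):
    """Smallest k with P(X <= k) >= q for X ~ Poisson(mu), by direct
    inversion of the CDF (no SciPy needed)."""
    cdf, k, term = 0.0, 0, math.exp(-mu)
    while True:
        cdf += term
        if cdf >= q:
            return k
        k += 1
        term *= mu / k

def sample_poisson(mu, rng):
    """Draw one Poisson(mu) variate via Knuth's multiplication method."""
    limit, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def test_interval_coverage():
    mu, alpha = 12, 0.05
    lo = poisson_ppf(alpha / 2, mu)
    hi = poisson_ppf(1 - alpha / 2, mu)
    rng = random.Random(0)
    counts = [sample_poisson(mu, rng) for _ in range(5000)]
    covered = sum(lo <= c <= hi for c in counts) / len(counts)
    # true central mass is slightly above 1 - alpha because the
    # distribution is discrete; allow generous slack either way
    assert 0.94 <= covered <= 0.99

test_interval_coverage()
```

The same idea extends to the plotting test: feed counts drawn at a known rate into the plotting code and assert that the interval it draws brackets the expected fraction of them.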