Plot muts #64

averagehat · 2015-10-06T22:09:50Z

Closes #63

I want to write a test for do_plot that uses some standard numerical data to make sure I am using the Poisson confidence interval correctly.

Caveats:

In order for p-distance to make sense, all sequences must be the same length.
Currently the year is taken from the FASTA ID, so the ID should be free of numerical characters that aren't the year. We could create some other way of communicating that information if we really wanted to.
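
To illustrate the second caveat, here is a minimal sketch of pulling a year out of a FASTA ID. It is not the code in this PR: `year_from_id` is a hypothetical helper, and the rule it applies (exactly one standalone four-digit number starting with 19 or 20) is an assumption chosen for the example.

```python
import re

def year_from_id(fasta_id):
    """Return the single four-digit year found in a FASTA ID.

    Hypothetical helper: assumes the year starts with 19 or 20 and is not
    part of a longer run of digits. Raises ValueError when the ID holds
    zero or several candidate years, since the year would be ambiguous.
    """
    matches = re.findall(r"(?<!\d)(?:19|20)\d{2}(?!\d)", fasta_id)
    if len(matches) != 1:
        raise ValueError(
            "expected exactly one year in %r, found %r" % (fasta_id, matches)
        )
    return int(matches[0])
```

An ID with no recognizable year, or with two of them, raises an error instead of silently guessing.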

necrolyte2 · 2015-10-07T11:43:23Z

I see you are expecting the sequences to all be the same length and throwing an error otherwise.
I thought the idea was to align all sequences to the base reference and then count the differences after that?

averagehat · 2015-10-07T13:43:23Z

From my understanding p-distance is the number of different sites between two sequences divided by the sequence length. It's not really an alignment, it's a limited metric that presupposes sequences are the same length. It also presupposes that the sequences are aligned.
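
As a rough sketch of the definition above (not necessarily how this PR computes it), p-distance over two pre-aligned, equal-length sequences could look like this. The function name `p_distance` is hypothetical, and every differing character, including a gap, is assumed to count as one difference.

```python
def p_distance(seq_a, seq_b):
    """Proportion of sites that differ between two aligned sequences.

    Assumes the sequences are already aligned and of equal length;
    any differing character (gaps included) counts as one difference.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / float(len(seq_a))
```

For example, `p_distance("ACGT", "ACGA")` has one differing site out of four.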

We could align them prior to this if we wanted to, or approach it differently.

necrolyte2 · 2015-10-07T13:48:21Z

From talking with @InaMBerry and @mmelendrez, I don't think we can assume the sequences were pre-aligned. They will probably need to be aligned, as I did in the MutationCount project.

mmelendrez · 2015-10-07T13:53:05Z

They will require an investigator check after alignment. Not all aligners do a good job; it depends entirely on the sequences. The investigator will always have to double-check the alignment and usually trim the sequences so they are all the same length. Occasionally they may also have to remove divergent sequences and realign.

The investigator can input an alignment, so alignment can be assumed if you want. Otherwise there will have to be a checkpoint after alignment for the investigator before the program moves forward.

averagehat · 2015-10-07T20:16:57Z

Here is an example using a small number of references with a sample Jun provided. All references are much older than the base reference, from around 1980. Not the best data to test this with.

InaMBerry · 2015-10-09T13:00:05Z

Hmmm, this looks suspicious. Most of the references should be within the CI, since the interval was calculated on the references themselves. In addition, I calculated the CI for a normal distribution for these, and the query and all the refs were within it, so this does not make sense.

I agree, this is not a good test dataset. In fact it is not a test set at all, it is real data. I am working on making better ones....

averagehat · 2015-10-09T13:24:24Z

I calculated the CI purely from the line-of-fit, which doesn't fit very well . . .

InaMBerry · 2015-10-09T13:55:04Z

CI is usually calculated from the standard deviation of the population, which in this case should be the distances from the base reference to the other references. Distance from the base reference to itself (0) should not be included in the calculations. Maybe that's the problem?
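
One possible reading of this suggestion, sketched under stated assumptions: take the mutation counts from the base reference to every other reference, leave out the base reference's zero distance to itself, and report mean ± z × standard deviation. The name `normal_interval`, the dict input, the z value of 1.96, and the use of the sample standard deviation (`statistics.stdev`) are all assumptions for illustration.

```python
import statistics

def normal_interval(distances, base_id, z=1.96):
    """Return (low, high) as mean +/- z * sd of the reference distances.

    distances: dict mapping reference ID -> mutation count vs the base.
    base_id:   ID of the base reference, whose self-distance is excluded.
    Uses the sample standard deviation; this is an assumption.
    """
    values = [d for ref_id, d in distances.items() if ref_id != base_id]
    if len(values) < 2:
        raise ValueError("need at least two references besides the base")
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return mean - z * sd, mean + z * sd
```

Excluding by ID rather than by value keeps any other reference that happens to be identical to the base.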

averagehat · 2015-10-09T14:20:16Z

Indeed this is different from how I calculated it.


necrolyte2 · 2015-10-14T17:07:02Z

Here is a Google spreadsheet that shows roughly what we think we are looking for:
https://docs.google.com/spreadsheets/d/17fW7oELqevb_Au7cTjZgQxLzqYLUVJI4lVPsUytIeTI/edit?usp=sharing

At least it is something we know we need to look for in our final graphic

averagehat · 2015-10-14T21:01:30Z

There is a second tab at the bottom for Jun's sequence (which I didn't notice at first)

InaMBerry · 2015-10-15T12:05:54Z

Yes, that looks about right. But I would also like to see all the dots for reference and query sequences on the graph. I was also wondering if it would be possible to write the names of query sequences that fall outside of the Poisson interval to the output file? It is easy to figure out with a small dataset, but if we have lots of query sequences it will be more difficult. And it is important to know which sequence is potentially bad.

averagehat · 2015-10-20T20:17:04Z

I've corrected the spreadsheet with our actual example, which is very messy because we have a sample in 2011 with few mutations.

@InaMBerry , that is certainly possible.
The problem in this case is that the query sequence is later than all the other sequences, so it doesn't actually fall within the interval (in this case the newest reference is 2011 and the query is 2012). We would have to infer where the interval is headed in order to determine whether or not it would fit. You can see how that is a problem with our current set from the spreadsheet. This would become more of an issue if the query sequence is much newer than the newest reference.

I'm not sure how we would infer the interval, but I welcome ideas. If we figure that out, printing the outliers is no problem (and would already work in those cases where the query is not the newest sequence).
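
One idea, offered only as a sketch and not as what this PR does: if mutations are assumed to accumulate at a constant rate and the fitted line is forced through the base reference (y-intercept of 0), the expected count at any year, including one past the newest reference, is rate × years since base, and a Poisson interval can be placed around that expectation. The names `poisson_interval`, `fit_rate`, and `flag_outliers` are hypothetical, and the constant-rate assumption may well not hold for real data.

```python
import math

def poisson_interval(mean, conf=0.95):
    """Central interval (lo, hi) of a Poisson distribution with this mean.

    Walks the cumulative distribution upward from k = 0. Suitable for
    modest means only; exp(-mean) underflows for very large means.
    """
    if mean <= 0:
        return 0, 0
    tail = (1.0 - conf) / 2.0
    pmf = math.exp(-mean)  # P(X = 0)
    cdf = pmf
    k = 0
    lo = None
    while True:
        if lo is None and cdf >= tail:
            lo = k
        if cdf >= 1.0 - tail:
            return lo, k
        k += 1
        pmf *= mean / k  # P(X = k) from P(X = k - 1)
        cdf += pmf

def fit_rate(years_since_base, counts):
    """Least-squares slope of a line forced through the origin."""
    numerator = sum(x * y for x, y in zip(years_since_base, counts))
    denominator = sum(x * x for x in years_since_base)
    if denominator == 0:
        raise ValueError("need at least one reference from a different year")
    return numerator / float(denominator)

def flag_outliers(rate, queries, conf=0.95):
    """Return names of queries whose count lies outside the interval.

    queries: iterable of (name, years_since_base, mutation_count).
    """
    flagged = []
    for name, years, count in queries:
        lo, hi = poisson_interval(rate * years, conf)
        if not lo <= count <= hi:
            flagged.append(name)
    return flagged
```

Because the interval is computed from the fitted rate instead of from the references nearest in time, the same check applies whether or not the query is the newest sequence.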


averagehat added 2 commits October 6, 2015 17:59

added working plot_muts script

a243083

refactored out plotting from data collection

d7b9315

averagehat added 2 commits October 7, 2015 13:03

fixed confidence interval

cb76882

CI now based on y-intercept of 0

93a6325

averagehat added 12 commits November 10, 2015 16:36

alignment reconstruction from reference

f41f119

fixed not-equal option in vcfcat

e627f4e

Merge branch 'rename-project-i60' of https://github.com/VDBWRAIR/bio_…

9244c92

…pieces into samtools

more cigar functionality

4672e12

removed test failed bc required double import

0594f96

added support for haplotype grouping

8283172

documentation

406e362

renamed

0f85466

Merge branch 'plot_muts' of https://github.com/averagehat/bio_pieces …

46a7184

…into samtools

added test references

29fbfa9

guess missing year

6f36d31

plot muts

d194284

necrolyte2 mentioned this pull request Dec 29, 2015

Next release #70

Merged

averagehat added 2 commits December 30, 2015 11:08

Merge branch 'dev' of https://github.com/VDBWRAIR/bio_pieces into plo…

23b0ddd

…t_muts Conflicts: tests/test_deprecation.py

optional imports

9b7f7bf

necrolyte2 closed this Jan 4, 2016
