Plot muts #64

Closed
wants to merge 18 commits into from


averagehat
Contributor

Closes #63

I want to write a test for do_plot that uses some standard numerical data, to make sure that I am using the Poisson confidence interval correctly.

Caveats:

  • For p-distance to make sense, all sequences must be the same length.
  • Currently the year is taken from the FASTA ID, so the ID should contain no numerical characters other than the year. We could create some other way of communicating that information if we really wanted to.
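
The second caveat can be illustrated with a small sketch. This is not the code from the pull request; the function name `year_from_id` and the assumption that years fall between 1900 and 2099 are hypothetical. It shows why stray digits in the ID are a problem: a lookup like this only works when exactly one year-like number is present.

```python
import re


def year_from_id(seq_id):
    """Return the single four-digit year found in a FASTA ID.

    Hypothetical helper: assumes years lie in 1900-2099 and that the
    year is not embedded in a longer run of digits.
    """
    # Match 19xx or 20xx only when not surrounded by other digits.
    candidates = re.findall(r"(?<!\d)(?:19|20)\d{2}(?!\d)", seq_id)
    if len(candidates) != 1:
        raise ValueError(
            "expected exactly one year in %r, found %d"
            % (seq_id, len(candidates))
        )
    return int(candidates[0])
```

An ID with two year-like numbers, or with none, raises an error rather than guessing.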

@necrolyte2
Member

I see you are expecting the sequences to all be the same length, and throwing an error otherwise.
I thought the idea was to align all sequences to the base reference and then count the differences after that?

@averagehat
Contributor Author

From my understanding, p-distance is the number of differing sites between two sequences divided by the sequence length. It's not really an alignment; it's a limited metric that presupposes the sequences are the same length. It also presupposes that the sequences are already aligned.
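
As a minimal sketch of that definition (the function name `p_distance` is illustrative, and the pull request's own implementation may differ), assuming both sequences are already aligned and of equal length:

```python
def p_distance(seq_a, seq_b):
    """Proportion of sites at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    # Count positions where the two sequences disagree.
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return float(differences) / len(seq_a)
```

Note that this sketch treats every character, including a gap symbol, as an ordinary site; whether gaps should count as differences is a separate decision.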

We could align them prior to this if we wanted to, or approach it differently.

@necrolyte2
Member

From talking with @InaMBerry and @mmelendrez, I don't think we can assume the sequences were pre-aligned; they will probably need to be aligned as I did in the MutationCount project.

@mmelendrez
Member

They will require an investigator check after alignment. Not all aligners do a good job; it depends entirely on the sequences. The investigator will always have to double-check the alignment and will usually trim the sequences so they are all the same length. Occasionally they may also have to remove divergent sequences and realign.

The investigator can input an alignment, so alignment can be assumed if you want. Otherwise there will have to be a checkpoint after alignment for the investigator before the program moves forward.
(Email reply to @necrolyte2's comment of Oct 7, 2015, 09:48.)

@averagehat
Contributor Author

Here is an example using a small number of references with a sample Jun provided. All references are much older than the base reference, dating from around 1980. Not the best data to test this with.

[image: virus]

@InaMBerry

Hmmm, this looks suspicious. Most of the references should be within the CI, since the interval was calculated on the references themselves. In addition, I calculated the CI for a normal distribution for these, and the query and all the refs were within it, so this does not make sense.

I agree, this is not a good test dataset. In fact, it is not a test set at all; it is real data. I am working on making better ones...

@averagehat
Contributor Author

I calculated the CI purely from the line of fit, which doesn't fit very well...

@InaMBerry

CI is usually calculated from the standard deviation of the population, which in this case should be the distances from the base reference to the other references. Distance from the base reference to itself (0) should not be included in the calculations. Maybe that's the problem?
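
One possible reading of this suggestion, sketched with hypothetical names (`reference_distances`, `normal_interval`) and with two assumptions that the thread does not settle: the sample standard deviation is used, and the interval is the mean plus or minus 1.96 standard deviations under a normal approximation.

```python
import statistics


def reference_distances(base_id, distances_by_id):
    """Drop the base reference's distance to itself (always 0)."""
    return [d for seq_id, d in distances_by_id.items() if seq_id != base_id]


def normal_interval(distances, z=1.96):
    """Return (low, high) as mean -/+ z sample standard deviations."""
    if len(distances) < 2:
        raise ValueError("need at least two distances")
    mean = statistics.mean(distances)
    spread = statistics.stdev(distances)  # assumption: sample, not population, SD
    return mean - z * spread, mean + z * spread
```

Leaving the zero self-distance in would pull the mean down and inflate the standard deviation, which is one way the interval could end up misplaced.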

@averagehat
Contributor Author

Indeed this is different from how I calculated it.

(Email reply to @InaMBerry's comment of Fri, Oct 9, 2015, 9:55 AM.)

@necrolyte2
Member

Here is a Google spreadsheet that shows roughly what we think we are looking for:
https://docs.google.com/spreadsheets/d/17fW7oELqevb_Au7cTjZgQxLzqYLUVJI4lVPsUytIeTI/edit?usp=sharing

At least it is something we know we need to look for in our final graphic

@averagehat
Contributor Author

There is a second tab at the bottom for Jun's sequence (which I didn't notice at first)

@InaMBerry

Yes, that looks about right. But I would also like to see all the dots for the reference and query sequences on the graph. I was also wondering whether it would be possible to write the names of query sequences that fall outside the Poisson interval to the output file? It is easy to figure out with a small dataset, but with lots of query sequences it will be more difficult. And it is important to know which sequence is potentially bad.
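
A rough sketch of how such a report could work, using only the standard library. The names `poisson_ppf`, `poisson_interval`, and `names_outside_interval` are hypothetical, and the expected mutation count for each query is assumed to be supplied by the caller (for example from the line of fit); how the project actually derives it is not shown in this thread.

```python
import math


def poisson_ppf(q, lam):
    """Smallest k such that P(X <= k) >= q for X ~ Poisson(lam)."""
    if not 0 < lam <= 700:
        # exp(-lam) underflows for very large lam; this sketch avoids that case.
        raise ValueError("lam must be in (0, 700]")
    k = 0
    pmf = math.exp(-lam)
    cdf = pmf
    while cdf < q:
        k += 1
        pmf *= lam / k  # P(X = k) from P(X = k - 1)
        cdf += pmf
    return k


def poisson_interval(lam, coverage=0.95):
    """Equal-tailed interval of counts for a Poisson mean of lam."""
    alpha = 1.0 - coverage
    return poisson_ppf(alpha / 2, lam), poisson_ppf(1 - alpha / 2, lam)


def names_outside_interval(queries, coverage=0.95):
    """queries: iterable of (name, observed_count, expected_count)."""
    flagged = []
    for name, observed, expected in queries:
        low, high = poisson_interval(expected, coverage)
        if observed < low or observed > high:
            flagged.append(name)
    return flagged
```

The returned list of names could then be written to the output file, one name per line.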

@averagehat
Contributor Author

I've corrected the spreadsheet with our actual example, which is very messy because we have a sample in 2011 with few mutations.

@InaMBerry, that is certainly possible.
The problem in this case is that the query sequence is later than all the other sequences, so it doesn't actually fall within the interval (here the newest reference is from 2011 and the query is from 2012). We would have to infer where the interval is headed in order to determine whether or not the query fits. You can see from the spreadsheet how that is a problem with our current set. This would become more of an issue if the query sequence were much newer than the newest reference.

I'm not sure how we would infer the interval, but I welcome ideas. If we figure that out, printing the outliers is no problem (and would already work in those cases where the query is not the newest sequence).
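
One idea, offered only as a sketch and not as something decided in this thread: fit a least-squares line through the references' (year, mutation count) points and evaluate it at the query's year, then build the interval around that extrapolated expected count. The function names `fit_line` and `expected_count` are hypothetical.

```python
def fit_line(years, counts):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(years)
    if n < 2 or n != len(counts):
        raise ValueError("need at least two (year, count) pairs")
    mean_x = sum(years) / float(n)
    mean_y = sum(counts) / float(n)
    sxx = sum((x - mean_x) ** 2 for x in years)
    if sxx == 0:
        raise ValueError("all references share the same year")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, counts))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def expected_count(year, slope, intercept):
    """Expected mutation count at a year, possibly beyond the references."""
    return slope * year + intercept
```

The further the query year lies beyond the newest reference, the less trustworthy the extrapolated value becomes, so this would only partly address the concern about much newer queries.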

@necrolyte2 mentioned this pull request on Dec 29, 2015
@necrolyte2 closed this on Jan 4, 2016
4 participants