explanation of site_annotations output #548
Comments
I agree that the docs could use more detail: I was quite unsure what the various values meant, and I think I am the one who generated the file that gets read for gambiae. I want to check that my understanding is correct before I change the docs, though, so I hope @alimanfoo and @cclarkson can check my homework. Currently, the "Returns" part just says:
My suggestion would be something like:
Is that correct? Is that understandable? Do we want more details?
Just discussed: all looks good, but it might be worth checking exactly what the difference is between simple and complex 2-fold degenerate.
Thanks @jonbrenas and @alimanfoo. Based on those explanations, I still can't quite make sense of the outputs. For example, if I take the results of bob = ag3.site_annotations(region = '2L:10000000-10100000') and then build a contingency table with pd.crosstab(bob['codon_degeneracy'], bob['codon_nonsyn']), some of the counts puzzle me. For example, how can a codon position that can produce 3 amino acids in addition to the reference one be anything other than 0-fold degenerate (here, 72 of them are complex 2-fold)? Also, how can a position where 2 additional amino acids can be produced (i.e., 3 amino acids in total) be classed as simple 2-fold (379 cases)? Apologies if this is getting into the weeds too much. Feel free to ignore.
Thanks @EricRLucas! The weird values in the "complex 2-fold degenerate" row arise because that label appears to be the catch-all category. I am not sure what "complex 2-fold degenerate" is supposed to mean, and it is possible that the code is incorrect. What it does is put everything that doesn't fall into one of the 3 nice categories (i.e., all potential amino acids are different; all potential amino acids are the same; the potential amino acids form a nice (2,2) split) into the "complex 2-fold degenerate" category. Does anyone have a good definition of what a "complex 2-fold degenerate" site is? I have not found a good reference yet, so I am starting to think that it is a label that we made up, and not a very good one. Reading the code, I see my mistake regarding @alimanfoo, @cclarkson, @EricRLucas, opinions?
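To make the four-way split described above concrete, here is a minimal sketch of that logic. It is not the code used by site_annotations(): the function name classify_degeneracy, the 0-based position index, the string labels, and the choice to treat a stop codon ("*") as just another product are all assumptions made for illustration. Only the rule itself (all different, all the same, a (2,2) split, otherwise catch-all) comes from the comment above.

```python
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_degeneracy(codon: str, pos: int) -> str:
    """Hypothetical helper: label position `pos` (0, 1 or 2) of `codon`."""
    # Amino acids obtained by placing each of the four bases at this position.
    products = [CODON_TABLE[codon[:pos] + b + codon[pos + 1:]] for b in BASES]
    # Sizes of the groups of identical amino acids, e.g. [2, 2] or [1, 3].
    group_sizes = sorted(Counter(products).values())
    if group_sizes == [1, 1, 1, 1]:
        return "0-fold"          # all potential amino acids differ
    if group_sizes == [4]:
        return "4-fold"          # all potential amino acids are the same
    if group_sizes == [2, 2]:
        return "simple 2-fold"   # a clean (2,2) split
    return "complex 2-fold"      # catch-all: [1, 3] and [1, 1, 2] splits


print(classify_degeneracy("GAT", 1))  # 0-fold
print(classify_degeneracy("GGT", 2))  # 4-fold
print(classify_degeneracy("GAT", 2))  # simple 2-fold
print(classify_degeneracy("ATT", 2))  # complex 2-fold (Ile, Ile, Ile, Met)
print(classify_degeneracy("ATG", 0))  # complex 2-fold (Leu, Leu, Met, Val)
```

Under this rule the third position of ATT, which many texts would call 3-fold degenerate, and the first position of ATG, where three of the four bases change the amino acid, both end up under the same "complex 2-fold" label.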
@jonbrenas thanks, I think that I now understand what codon_nonsyn means, and so I can see how that contingency table can make sense. How about the following definition for codon_nonsyn: Number of possible nucleotide changes at this position that would result in an amino acid change from the reference.
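As a rough illustration of that proposed definition, the sketch below counts, for one codon position, how many of the three alternative nucleotides change the amino acid relative to the reference codon. The function name count_nonsyn, the 0-based position index, and the treatment of a stop codon ("*") as a distinct product are assumptions for this example, not the library's implementation.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_nonsyn(ref_codon: str, pos: int) -> int:
    """Hypothetical helper: nucleotide changes at `pos` that alter the amino acid."""
    ref_aa = CODON_TABLE[ref_codon]
    return sum(
        CODON_TABLE[ref_codon[:pos] + b + ref_codon[pos + 1:]] != ref_aa
        for b in BASES
        if b != ref_codon[pos]  # only the three alternative nucleotides
    )


print(count_nonsyn("GGT", 2))  # 0: every third-position change still gives Gly
print(count_nonsyn("ATT", 2))  # 1: only ATG (Met) differs from Ile
print(count_nonsyn("GAT", 2))  # 2: GAA and GAG give Glu instead of Asp
print(count_nonsyn("ATG", 0))  # 3: Leu, Leu and Val all differ from Met
```

Read this way, the count depends on which codon is the reference: the third position of ATT scores 1, while the third position of ATG scores 3, although both sites see the same set of four possible codons. A score of 0 requires all three alternatives to match the reference amino acid, which is exactly the 4-fold case.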
Please could the docs include information on how to interpret the output of site_annotations()? I can't work out what the different levels of codon_nonsyn mean (effect?). I also can't quite work out what degeneracy shows (all 4-fold degenerate positions have a 0 score for nonsyn, which makes sense, but none of the other degeneracies have a 0; I would have thought that a 3-fold degenerate position would have a lot of 0 values).