explanation of site_annotations output #548

Open
EricRLucas opened this issue Jun 11, 2024 · 6 comments
Labels
documentation Improvements or additions to documentation question Further information is requested

Comments

@EricRLucas

Please could the docs include information on how to interpret the output of site_annotations()? I can't work out what the different levels of codon_nonsyn mean (effect?). I also can't quite work out what degeneracy shows: all 4-fold degenerate positions have a 0 score for nonsyn, which makes sense, but none of the other degeneracies have a 0, and I would have thought that a 3-fold degenerate position would have a lot of 0 values.

@leehart
Collaborator

leehart commented Jun 14, 2024

@leehart leehart added documentation Improvements or additions to documentation question Further information is requested labels Jul 29, 2024
@jonbrenas
Collaborator

jonbrenas commented Aug 15, 2024

I agree that the docs could use more details as I was quite unsure what the various values meant ... and I think I am the one who generated the file that gets read for gambiae. I want to check that my understanding is correct before I change the docs, though, so I hope @alimanfoo and @cclarkson can check my homework.

Currently, the "Returns" part just says:

Dataset
A dataset of site annotations.

My suggestion would be something like:

Dataset
A dataset of site annotations containing 7 variables:
seq_cls: The feature class. There are 11 possible values:
        1: Upstream
        2: Downstream
        3: 5' UTR
        4: 3' UTR
        5: CDS (first)
        6: CDS (mid)
        7: CDS (last)
        8: Intron (first)
        9: Intron (mid)
        10: Intron (last)
        0: Unknown
seq_flen: The length of the feature.
seq_relpos_start: Relative position to the start of the feature. 0 if not in a feature.
seq_relpos_stop: Relative position to the end of the feature. 0 if not in a feature.
codon_position: Position within a triplet codon. -1 if not in a CDS.
codon_nonsyn: Number of different amino acids that can be obtained by changing this codon (does not include the amino acid that is encoded by the triplet codon as is). 0 if not in a CDS.
codon_degeneracy: The redundancy of the codon. Can take 5 different values:
        1: 0-fold degenerate, i.e., all nucleotides encode different amino acids
        2: simple 2-fold degenerate, i.e., 2 different amino acids can be encoded depending on which nucleotide is present; each amino acid is encoded by two different nucleotides
        3: complex 2-fold degenerate, i.e., 2 different amino acids can be encoded depending on which nucleotide is present but they are not paired
        4: 4-fold degenerate, i.e., all nucleotides encode the same amino acid
        -1: not in a CDS

Is that correct? Is that understandable? Do we want more details?
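
For readers who want to decode the integer codes, the draft above could be turned into lookup tables along these lines. This is only a sketch: the labels are copied from the draft docstring, which was still under review in this thread, and `describe_site` is a hypothetical helper, not part of the library.

```python
# Code-to-label lookups copied from the draft docstring in this thread.
# The labels are unconfirmed and may change once the docs are finalised.
SEQ_CLS_LABELS = {
    0: "Unknown",
    1: "Upstream",
    2: "Downstream",
    3: "5' UTR",
    4: "3' UTR",
    5: "CDS (first)",
    6: "CDS (mid)",
    7: "CDS (last)",
    8: "Intron (first)",
    9: "Intron (mid)",
    10: "Intron (last)",
}

CODON_DEGENERACY_LABELS = {
    -1: "not in a CDS",
    1: "0-fold degenerate",
    2: "simple 2-fold degenerate",
    3: "complex 2-fold degenerate",
    4: "4-fold degenerate",
}


def describe_site(seq_cls, codon_degeneracy):
    """Return readable labels for one site's codes (hypothetical helper)."""
    return (
        SEQ_CLS_LABELS.get(seq_cls, "unrecognised code"),
        CODON_DEGENERACY_LABELS.get(codon_degeneracy, "unrecognised code"),
    )


print(describe_site(6, 4))
```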

@alimanfoo
Member

Just discussed, all looks good but might be good to check exactly the difference between simple and complex 2-fold degenerate.

@EricRLucas
Author

EricRLucas commented Oct 2, 2024

Thanks @jonbrenas and @alimanfoo

Based on those explanations, I still can't quite make sense of the outputs. For example, if I take the results of

bob = ag3.site_annotations(region = '2L:10000000-10100000')

and then I get a contingency table with

pd.crosstab(bob['codon_degeneracy'], bob['codon_nonsyn'])

I get

Screenshot from 2024-10-02 13-04-40

So, for example, how can a codon that can produce an additional 3 amino acids to the reference one be anything other than 0-fold degenerate? (Here, 72 of them are complex 2-fold.) Also, how can a position where 2 additional amino acids can be produced (i.e., 3 amino acids in total) be classed as simple 2-fold (379 cases)?

Apologies if this is getting into the weeds too much. Feel free to ignore.

@jonbrenas
Collaborator

Thanks @EricRLucas !

The weird values for the "complex 2-fold degenerate" row are due to the fact that it appears to be the garbage bin of the labels. I am not sure what "complex 2-fold degenerate" is supposed to mean, and it is possible that the code is incorrect. What it does is throw everything that doesn't fall into one of the 3 nice categories (i.e., all potential amino acids are different; all potential amino acids are the same; the potential amino acids form a nice (2,2) split) into the "complex 2-fold degenerate" category. Does anyone have a nice definition of what a "complex 2-fold degenerate" site is? I have not found a good reference yet, so I am starting to think that it is a label that we made up, and not a very good one.

Reading the code, I see my mistake regarding codon_nonsyn. The definition should probably be:
Number of amino acids that can be obtained by changing this codon (does not include the amino acid that is encoded by the triplet codon as is). 0 if not in a CDS.

(I removed 'different'.) In the case of "simple 2-fold degenerate" sites, what we get is a list of the amino acids that can be obtained, which looks like ['X', 'X', 'Y', 'Y'] (order notwithstanding). The code then removes the amino acid that is coded by the reference (let's say 'X'), counts what's left (i.e., ['Y', 'Y']), and gets 2 amino acids (that are not different). I don't know if that is the expected behaviour.
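
The behaviour described above can be sketched in plain Python. This is a reconstruction from the description in this comment, not the library's actual code: `annotate_codon_site` and its 0-based `pos` argument are hypothetical, and the standard genetic code is assumed, with stop codons treated as just another product (`*`).

```python
from collections import Counter
from itertools import product

# Standard genetic code in TCAG order ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def annotate_codon_site(codon, pos):
    """Return (codon_nonsyn, codon_degeneracy) for one position of a codon.

    Hypothetical helper following the description in this thread;
    pos is a 0-based index into the codon.
    """
    ref_aa = CODON_TABLE[codon]
    # Amino acid obtained for each of the four nucleotides at this position.
    aas = [CODON_TABLE[codon[:pos] + base + codon[pos + 1:]] for base in BASES]
    # Drop every entry equal to the reference amino acid and count the rest.
    nonsyn = sum(aa != ref_aa for aa in aas)
    split = sorted(Counter(aas).values())
    if split == [1, 1, 1, 1]:
        degeneracy = 1  # 0-fold: every nucleotide gives a different amino acid
    elif split == [4]:
        degeneracy = 4  # 4-fold: every nucleotide gives the same amino acid
    elif split == [2, 2]:
        degeneracy = 2  # simple 2-fold: a clean (2,2) split
    else:
        degeneracy = 3  # "complex 2-fold": everything else
    return nonsyn, degeneracy


for codon, pos in [("GGT", 2), ("GAT", 2), ("GAT", 1), ("ATG", 0), ("ATT", 2)]:
    print(codon, pos, annotate_codon_site(codon, pos))
```

Under these assumptions, the first position of ATG yields Leu, Leu, Met and Val, so it has 3 non-synonymous changes yet lands in the catch-all category, and the third position of ATT (Ile, Ile, Ile, Met) lands there too with a count of 1.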

@alimanfoo, @cclarkson, @EricRLucas, opinions?

@EricRLucas
Author

@jonbrenas thanks, I think that I now understand what codon_nonsyn means, and so I can see how that contingency table can make sense. How about the following definition for codon_nonsyn:

Number of possible nucleotide changes at this position that would result in an amino acid change from the reference.
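
As a rough check of this definition, the sketch below enumerates every position of every codon in the standard genetic code and tabulates the count against how the four possible nucleotides split into amino acids. It is an illustration only: the function names are hypothetical, stop codons are treated as just another product (`*`), and whether the library handles them the same way is not confirmed in this thread.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code in TCAG order ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def mutate(codon, pos, base):
    """Return the codon with the nucleotide at 0-based pos replaced by base."""
    return codon[:pos] + base + codon[pos + 1:]


def nonsyn_changes(codon, pos):
    """Count the nucleotide changes at pos that alter the amino acid."""
    ref_aa = CODON_TABLE[codon]
    return sum(
        CODON_TABLE[mutate(codon, pos, base)] != ref_aa
        for base in BASES
        if base != codon[pos]
    )


def split_pattern(codon, pos):
    """Group sizes of the amino acids reachable at pos, e.g. (2, 2) or (1, 3)."""
    aas = [CODON_TABLE[mutate(codon, pos, base)] for base in BASES]
    return tuple(sorted(Counter(aas).values()))


# Cross-tabulate split pattern against nonsyn count over all 64 x 3 sites.
table = defaultdict(Counter)
for codon, pos in product(CODON_TABLE, range(3)):
    table[split_pattern(codon, pos)][nonsyn_changes(codon, pos)] += 1

for pattern in sorted(table):
    print(pattern, dict(sorted(table[pattern].items())))
```

With this definition, a count of 0 occurs only for the (4,) pattern, and a clean (2, 2) split always gives 2, which is consistent with the contingency table discussed above. The (1, 3) and (1, 1, 2) patterns, which the current labels would both place in the "complex 2-fold" category, give counts of 1, 2 or 3 depending on which amino acid the reference codes for.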


4 participants