explanation of site_annotations output #548

EricRLucas · 2024-06-11T11:31:28Z

Please could the docs include information on how to interpret the output of site_annotations()? I can't work out what the different levels of codon_nonsyn mean (effect?). I also can't quite work out what codon_degeneracy shows: all 4-fold degenerate positions have a 0 score for nonsyn, which makes sense, but none of the other degeneracies have a 0, and I would have thought that a 3-fold degenerate position would have a lot of 0 values.

leehart · 2024-06-14T08:45:31Z

Internal ref: https://github.com/malariagen/vector-ops/issues/1278

jonbrenas · 2024-08-15T13:46:24Z

I agree that the docs could use more details as I was quite unsure what the various values meant ... and I think I am the one who generated the file that gets read for gambiae. I want to check that my understanding is correct before I change the docs, though, so I hope @alimanfoo and @cclarkson can check my homework.

Currently, the "Returns" part just says:

Dataset
A dataset of site annotations.

My suggestion would be something like:

Dataset
A dataset of site annotations containing 7 variables:
seq_cls: The feature class. There are 11 possible values:
        1: Upstream
        2: Downstream
        3: 5' UTR
        4: 3' UTR
        5: CDS (first)
        6: CDS (mid)
        7: CDS (last)
        8: Intron (first)
        9: Intron (mid)
        10: Intron (last)
        0: Unknown
seq_flen: The length of the feature.
seq_relpos_start: Relative position to the start of the feature. 0 if not in a feature.
seq_relpos_stop: Relative position to the end of the feature. 0 if not in a feature.
codon_position: Position within a triplet codon. -1 if not in a CDS.
codon_nonsyn: Number of different amino acids that can be obtained by changing this codon (does not include the amino acid that is encoded by the triplet codon as is). 0 if not in a CDS.
codon_degeneracy: The redundancy of the codon. Can take 5 different values:
        1: 0-fold degenerate, i.e., all nucleotides encode different amino acids
        2: simple 2-fold degenerate, i.e., 2 different amino acids can be encoded depending on which nucleotide is present; each amino acid is encoded by two different nucleotides
        3: complex 2-fold degenerate, i.e., 2 different amino acids can be encoded depending on which nucleotide is present but they are not paired
        4: 4-fold degenerate, i.e., all nucleotides encode the same amino acid
        -1: not in a CDS

Is that correct? Is that understandable? Do we want more details?
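
For anyone wanting to apply the proposed labels while this is being checked, the code-to-label mappings above can be written as lookup tables. This is only a sketch of the proposal as stated in this comment and has not been verified against the library; the names SEQ_CLS_LABELS, CODON_DEGENERACY_LABELS and decode are hypothetical and not part of the API.

```python
# Lookup tables transcribed from the proposed docstring in this comment.
# The names are hypothetical; the codes and labels are the proposal's and
# have not been verified against the library.
SEQ_CLS_LABELS = {
    0: "Unknown",
    1: "Upstream",
    2: "Downstream",
    3: "5' UTR",
    4: "3' UTR",
    5: "CDS (first)",
    6: "CDS (mid)",
    7: "CDS (last)",
    8: "Intron (first)",
    9: "Intron (mid)",
    10: "Intron (last)",
}

CODON_DEGENERACY_LABELS = {
    -1: "not in a CDS",
    1: "0-fold degenerate",
    2: "simple 2-fold degenerate",
    3: "complex 2-fold degenerate",
    4: "4-fold degenerate",
}


def decode(codes, labels):
    """Map integer codes to labels, flagging any code the table does not list."""
    return [labels.get(int(c), f"unlisted code {int(c)}") for c in codes]


print(decode([5, 9, 0], SEQ_CLS_LABELS))
print(decode([4, -1, 3], CODON_DEGENERACY_LABELS))
```

The same helper should work on the values of the seq_cls or codon_degeneracy variable of the returned dataset, assuming those hold plain integer codes as described above.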

alimanfoo · 2024-09-09T14:11:51Z

Just discussed: all looks good, but it might be worth checking exactly what the difference is between simple and complex 2-fold degenerate.

EricRLucas · 2024-10-02T12:09:50Z

Thanks @jonbrenas and @alimanfoo

Based on those explanations, I still can't quite make sense of the outputs. For example, if I take the results of

bob = ag3.site_annotations(region = '2L:10000000-10100000')

and then I get a contingency table with

pd.crosstab(bob['codon_degeneracy'], bob['codon_nonsyn'])

I get

So, for example, how can a codon that can produce an additional 3 amino acids beyond the reference one be anything other than 0-fold degenerate? (Here, 72 of them are complex 2-fold.) Also, how can a position where 2 additional amino acids can be produced (i.e., 3 amino acids in total) be classed as simple 2-fold? (There are 379 such cases.)

Apologies if this is getting into the weeds too much. Feel free to ignore.

jonbrenas · 2024-10-02T13:43:31Z

Thanks @EricRLucas !

The weird values in the "complex 2-fold degenerate" row arise because that label appears to be the garbage bin of the labels. I am not sure what "complex 2-fold degenerate" is supposed to mean, and it is possible that the code is incorrect. What it does is throw everything that doesn't fall into one of the 3 nice categories (i.e., all potential amino acids are different; all potential amino acids are the same; the potential amino acids form a nice (2,2) split) into the "complex 2-fold degenerate" category. Does anyone have a nice definition of what a "complex 2-fold degenerate" site is? I have not found a good reference yet, so I am starting to think that it is a label that we made up, and not a very good one.

Reading the code, I see my mistake regarding codon_nonsyn. The definition should probably be:

Number of amino acids that can be obtained by changing this codon (does not include the amino acid that is encoded by the triplet codon as is). 0 if not in a CDS.

(I removed 'different'.) In the case of "simple 2-fold degenerate" sites, the list of amino acids that can be obtained looks like ['X', 'X', 'Y', 'Y'] (order notwithstanding). The code then removes the amino acid that is coded by the reference (let's say 'X'), counts what's left (i.e., ['Y', 'Y']) and gets 2 amino acids (that are not different). I don't know if that is the expected behaviour.
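
To make the behaviour described above concrete, here is a minimal sketch of the counting and the four-way classification. It is a reimplementation from the description in this comment, not the actual code that generated the annotations. It assumes the standard genetic code, treats a stop codon ('*') as just another symbol, and uses a 0-based position within the codon; the name annotate_site is hypothetical.

```python
from collections import Counter

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def annotate_site(codon, position):
    """Return (codon_nonsyn, codon_degeneracy) for one site.

    position is 0, 1 or 2 (an assumption of this sketch).
    """
    ref_aa = CODON_TABLE[codon]
    # Amino acid encoded for each of the four nucleotides at this position.
    aas = [CODON_TABLE[codon[:position] + b + codon[position + 1:]] for b in BASES]
    # Drop every entry equal to the reference amino acid and count what is left.
    nonsyn = sum(1 for aa in aas if aa != ref_aa)
    # How many nucleotides encode each distinct amino acid, e.g. [2, 2].
    counts = sorted(Counter(aas).values())
    if counts == [1, 1, 1, 1]:
        degeneracy = 1  # 0-fold: all four amino acids differ
    elif counts == [4]:
        degeneracy = 4  # 4-fold: all four amino acids are the same
    elif counts == [2, 2]:
        degeneracy = 2  # simple 2-fold: a (2,2) split
    else:
        degeneracy = 3  # "complex 2-fold": everything else
    return nonsyn, degeneracy


for codon, position in [("GGT", 2), ("GAT", 2), ("ATT", 2), ("ATG", 2), ("GAT", 0)]:
    print(codon, position, annotate_site(codon, position))
```

Under this reading, the third position of ATG (methionine) has three alternative nucleotides that all give isoleucine, so it gets a nonsyn of 3 while still landing in the catch-all category. That would be one way for a "complex 2-fold" site to have 3 in the contingency table, if the real code behaves like this sketch.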

@alimanfoo, @cclarkson, @EricRLucas, opinions?

EricRLucas · 2024-10-02T14:45:08Z

@jonbrenas thanks, I think that I now understand what codon_nonsyn means, and so I can see how that contingency table can make sense. How about the following definition for codon_nonsyn:

Number of possible nucleotide changes at this position that would result in an amino acid change from the reference.

leehart added the documentation and question labels on Jul 29, 2024

leehart assigned jonbrenas on Sep 24, 2024
