Recording variant status of genotypes (natural or engineered) #2346

Open
ValWood opened this issue Sep 7, 2020 · 55 comments
Labels: discuss, PHI-Canto, schema changes (Changes to database schema are required)

@ValWood
Member

ValWood commented Sep 7, 2020

For pathogens and host genotypes, many natural variants occur and for these, we want to capture the differences to some nominally WT variant. We need to be able to distinguish when a variant is naturally occurring or 'engineered' for any specific allele.

So, we would like an additional field in the genotype pop-up to select one of either
i) Natural variant (NV)
or
ii) Engineered variant (EV).

We also thought that later, if researchers took a natural variant and then engineered a different residue, we would be able to combine these as multi-allele phenotypes. Although I am not sure about this: it might imply that 2 copies of the gene are present. I have forgotten how this is specified.
If not, maybe this is something we can look into (this is more future-proofing; although the NV/EV distinction is required now, I don't think we have examples of editing to a natural variant right now. @CuzickA can confirm).

@jseager7 jseager7 changed the title from "Additional to pathogen and host genotype curation option (natural or engineered variation)" to "Recording variant status of genotypes (natural or engineered)" Sep 7, 2020
@jseager7
Collaborator

jseager7 commented Sep 7, 2020

This is almost certainly going to require database changes, and it sounds like it would make the most sense to record this as part of the genotype, probably on the same modal (pop-up) as where the strain is specified:

image

@kimrutherford Since this property seems to be specific to PHI-base, would it make sense to add a data column to the genotype table (following the convention used in the annotation table) to store all the miscellaneous data about genotypes? That way it will at least be easier to extend in future.
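A minimal sketch of the suggestion above, assuming the genotype table gains a text column named data that holds a JSON object. The table layout, the `variant_status` key and both helper functions are hypothetical, and an in-memory SQLite database stands in for the real one; the point is only that new genotype properties could then be added without further schema changes.

```python
import json
import sqlite3

# Hypothetical, simplified genotype table; only the JSON "data" column matters here.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE genotype (
        genotype_id INTEGER PRIMARY KEY,
        identifier  TEXT NOT NULL UNIQUE,
        data        TEXT NOT NULL DEFAULT '{}'
    )
    """
)


def set_genotype_property(conn, identifier, key, value):
    """Store one miscellaneous property in the genotype's JSON data column."""
    row = conn.execute(
        "SELECT data FROM genotype WHERE identifier = ?", (identifier,)
    ).fetchone()
    if row is None:
        raise KeyError(f"no genotype with identifier {identifier!r}")
    data = json.loads(row[0])
    data[key] = value
    conn.execute(
        "UPDATE genotype SET data = ? WHERE identifier = ?",
        (json.dumps(data), identifier),
    )


def get_genotype_property(conn, identifier, key, default=None):
    """Read one property back; missing genotypes or keys yield the default."""
    row = conn.execute(
        "SELECT data FROM genotype WHERE identifier = ?", (identifier,)
    ).fetchone()
    if row is None:
        return default
    return json.loads(row[0]).get(key, default)


# Illustrative identifier only, not a real PHI-Canto genotype.
conn.execute("INSERT INTO genotype (identifier) VALUES (?)", ("example-genotype-1",))
set_genotype_property(conn, "example-genotype-1", "variant_status", "NV")
print(get_genotype_property(conn, "example-genotype-1", "variant_status"))
```

A dedicated column would make the status easier to query and constrain; the JSON column trades that for flexibility.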

@ValWood ValWood added the discuss label Sep 7, 2020
@jseager7
Collaborator

jseager7 commented Sep 7, 2020

We also need to decide whether or not we want to include the variant type in the genotype display name, as we already do with the genotype background and allele type.

@ValWood
Member Author

ValWood commented Sep 7, 2020

We thought that NV or EV could be included (maybe mouse over to see the full expanded abbreviation)

@jseager7
Collaborator

jseager7 commented Sep 8, 2020

Here's an example of what the inline genotype display name currently looks like:

TRI5+[WT product] (bkg: background) (strain: PH-1)

(note that the example is in pathogen-host mode; in single organism mode you wouldn't see the strain information)

Where would you like to include the NV or EV abbreviation in this display name? Should we introduce new brackets for it, or include it in one of the existing sets of brackets?

@CuzickA
Collaborator

CuzickA commented Sep 11, 2020

Here's an example of what the inline genotype display name currently looks like:

TRI5+[WT product] (bkg: background) (strain: PH-1)

(note that the example is in pathogen-host mode; in single organism mode you wouldn't see the strain information)

Where would you like to include the NV or EV abbreviation in this display name? Should we introduce new brackets for it, or include it in one of the existing sets of brackets?

Maybe after [WT product] before (bkg: background)?

@CuzickA
Collaborator

CuzickA commented Sep 11, 2020

We also discussed in the case of 'nv' that it would be useful to have an option for 'nv-unknown' and 'nv-known' where the sequence variation could be recorded.

@ValWood
Member Author

ValWood commented Sep 11, 2020

Yes, after the genotype, before the background.

Note we said this would always be required, but for WT it would not be relevant (although I guess it would not harm to add NV).
We really only need NV for the known differences (i.e. an amino acid change) relative to some canonical form.

@ValWood
Member Author

ValWood commented Sep 11, 2020

We also discussed in the case of 'nv' that it would be useful to have an option for 'nv-unknown' and 'nv-known' where the sequence variation could be recorded.

I don't remember/understand this.
It would always be NV unless a change was made by an experimenter, and then it would be EV? Am I missing something?

@CuzickA
Collaborator

CuzickA commented Sep 11, 2020

We also discussed in the case of 'nv' that it would be useful to have an option for 'nv-unknown' and 'nv-known' where the sequence variation could be recorded.

I don't remember/understand this.
It would always be NV unless a change was made by an experimenter, and then it would be EV? Am I missing something?

We were discussing e.g. avrSEN1, where the nv resulted in an early STOP, vs the atr1/RPP1 paper, where there were 5 different pathogen strains which would be nv but the actual changes in sequence were not recorded.
In the avrSEN1 example, would we just have genotype WT, nv and strain name? Or do we also want to capture the detail of the truncation reported in the paper?

@ValWood
Member Author

ValWood commented Sep 11, 2020

The NV/EV can be associated with any allele type? (In fact it would not usually be WT.) This was proposed as an additional field? The allele variant would be curated (e.g. A123L) and NV could be associated with this. We decided that we wouldn't curate natural variation as WT?

@CuzickA
Collaborator

CuzickA commented Sep 11, 2020

The NV/EV can be associated with any allele type? (In fact it would not usually be WT.) This was proposed as an additional field? The allele variant would be curated (e.g. A123L) and NV could be associated with this. We decided that we wouldn't curate natural variation as WT?

Thanks @ValWood, that makes sense. So for the atr1/RPP1 example it probably would be 'WT nv strain', but for the avrSEN1 example we can capture the known allele variant in the genotype and label it nv.

@ValWood
Member Author

ValWood commented Sep 11, 2020

or unknown NV. It seems inconsistent to use WT for ones we don't know.... I haven't really thought about this though.

@CuzickA
Collaborator

CuzickA commented Sep 11, 2020

Maybe we need 'ev', 'nv' for the known variant alleles and 'nvu' for those captured as WT?

@jseager7 jseager7 self-assigned this Sep 14, 2020
@jseager7 jseager7 added the schema changes label (Changes to database schema are required) and removed the discuss label Sep 14, 2020
@jseager7
Collaborator

jseager7 commented Sep 14, 2020

Edit: updated to reflect that we actually agreed on two variant options.

Based on discussion in the last call, we've decided to add a new field to the genotype creation modal that will capture the variant status. My current plan is to allow two options:

  • Experimental variant (EV)
  • Natural variant (NV)

Once the variant status is selected, it will be shown in abbreviated form in the genotype display name (as EV or NV).

The field could be implemented either as radio buttons (see here for an example) or a drop-down menu:

  • If we use radio buttons, there would be two mutually exclusive options for EV or NV.

  • If we use a drop-down menu, we would have two options in the menu (as described above), and the menu would default to a placeholder.

@ValWood @CuzickA Are you happy with the names chosen for these options? Are you happy with using a drop-down menu for the field, or would you prefer radio buttons?


@kimrutherford Once this is all decided, I'm going to need your help making schema changes to the genotype table to allow this variant information to be recorded.

@jseager7
Collaborator

Note also that once this is implemented, the variant status will presumably have to be retroactively applied to every existing genotype in PHI-Canto. Based on querying the JSON export, we have approximately 300 genotypes (as of 14 September 2020). It would help if we had a sensible default for the cases where the variant status hasn't been curated; presumably 'Natural variant unknown' won't be suitable if some of these old genotypes are experimental variants.
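As a rough sketch of how the genotypes still needing a status could be listed from an export: the nested layout below (sessions containing genotypes) and the `variant_status` key are assumptions made for illustration, not the documented structure of the PHI-Canto JSON export.

```python
def genotypes_missing_variant_status(export):
    """Return sorted (session_key, genotype_key) pairs that have no variant status.

    The export layout used here is hypothetical and heavily simplified.
    """
    missing = []
    for session_key, session in export.get("curation_sessions", {}).items():
        for genotype_key, genotype in session.get("genotypes", {}).items():
            # Treat an absent, null or empty status as "not yet curated".
            if not genotype.get("variant_status"):
                missing.append((session_key, genotype_key))
    return sorted(missing)


# Made-up data for illustration only.
example_export = {
    "curation_sessions": {
        "session-a": {
            "genotypes": {
                "genotype-1": {"variant_status": "NV"},
                "genotype-2": {},
            }
        },
        "session-b": {"genotypes": {"genotype-3": {"variant_status": None}}},
    }
}

for session_key, genotype_key in genotypes_missing_variant_status(example_export):
    print(session_key, genotype_key)
```

Listing the gaps this way avoids having to pick a default status that might be wrong for some of the old genotypes.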

@CuzickA
Collaborator

CuzickA commented Sep 14, 2020

Hi @jseager7 thanks for writing up the above meeting notes.

I may have misunderstood but I thought that we were planning on having just the two options
Engineered variant (EV)
Natural variant (NV)

and then in the allele type dropdown list
wild type (reference)
wild type (other)

@jseager7
Collaborator

I may have misunderstood but I thought that we were planning on having just the two options

@CuzickA Thanks for the reminder. But what should we do about the old genotypes that have not been curated with a variant status? Will we need a placeholder option for these cases, or should we just leave the status empty?

@CuzickA
Collaborator

CuzickA commented Sep 14, 2020

@CuzickA Thanks for the reminder. But what should we do about the old genotypes that have not been curated with a variant status? Will we need a placeholder option for these cases, or should we just leave the status empty?

I'm not sure. What do you think will be the easiest way for me to make the appropriate edits to the genotypes? I guess if the status is empty I would know it needed updating.
It would be good to have a current JSON export of all the sessions in case I need to refer back to the 'old genotypes'.

I'm also happy to have a drop down menu for the new EV NV option.

@jseager7
Collaborator

I'm also happy to have a drop down menu for the new EV NV option.

I think as long as we only have two options, then radio buttons would be a better choice. (If we had three or four options, then either field type would be okay.)

What do you think will be the easiest way for me to make the appropriate edits to the genotypes? I guess if the status is empty I would know it needed updating.

It might be difficult to discern whether the variant status is empty if it's only shown embedded in the display name. Compare the following two examples:

  • TRI5+[WT product] (bkg: background) (strain: PH-1)
  • TRI5+[WT product][NV] (bkg: background) (strain: PH-1)

...and that's just for a simple genotype. However, if we add a column showing the variant type to the genotype tables (on the Genotype Management page), it would be much easier to check.

Note that allowing the field to be blank for old genotypes implies that the data is optional (and more or less requires it be optional at the database level). My question is: do we want to require a variant status for all new genotypes, or can it be optional there as well?

@CuzickA
Collaborator

CuzickA commented Sep 14, 2020

...and that's just for a simple genotype. However, if we add a column showing the variant type to the genotype tables (on the Genotype Management page), it would be much easier to check.

This seems like a good idea.

Note that allowing the field to be blank for old genotypes implies that the data is optional (and more or less requires it be optional at the database level). My question is: do we want to require a variant status for all new genotypes, or can it be optional there as well?

I think we want to require a variant status for all new genotypes. Once we start trialling these new options out we should be able to flag up any genotypes that don't fit into this schema.
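The rule being discussed (status required for new genotypes, allowed to be empty for genotypes created before the field existed) could be checked along these lines. This is a sketch only; the function name, the `is_new_genotype` flag and the error messages are hypothetical.

```python
VALID_STATUSES = {"NV", "EV"}  # natural variant, engineered variant


def check_variant_status(status, is_new_genotype):
    """Return an error message, or None if the status is acceptable."""
    if status is None or status == "":
        if is_new_genotype:
            return "variant status is required for new genotypes"
        # Old genotype that has not been re-curated yet: empty is tolerated.
        return None
    if status not in VALID_STATUSES:
        return f"unknown variant status: {status!r}"
    return None


print(check_variant_status(None, is_new_genotype=True))
print(check_variant_status(None, is_new_genotype=False))
print(check_variant_status("NV", is_new_genotype=True))
```

Keeping the check in the application rather than as a NOT NULL constraint is what lets the old genotypes stay empty until they are updated.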

@ValWood
Member Author

ValWood commented Sep 14, 2020

Screenshot 2020-09-14 at 21 35 13

note that [WT product] refers to the expression level.

Since NV/EV refers to the genotype allele, it should be

WT[NV] [WT product]
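The ordering proposed here (variant tag directly after the allele, before the expression level) can be sketched as a small formatting function. This is an illustration for a single-allele genotype only; the function and parameter names are hypothetical, and the spacing simply copies the examples written in this thread.

```python
def genotype_display_name(allele, variant_status=None, expression=None,
                          background=None, strain=None):
    """Build an inline display name for a single-allele genotype."""
    name = allele
    if variant_status:
        # The variant tag describes the allele, so it comes first.
        name += f"[{variant_status}]"
    if expression:
        # Existing names have no space before the expression bracket; the
        # examples with a variant tag put a space between the two brackets.
        name += f" [{expression}]" if variant_status else f"[{expression}]"
    if background:
        name += f" (bkg: {background})"
    if strain:
        name += f" (strain: {strain})"
    return name


print(genotype_display_name("TRI5+", expression="WT product",
                            background="background", strain="PH-1"))
print(genotype_display_name("TRI5+", "NV", "WT product",
                            background="background", strain="PH-1"))
```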

@ValWood
Member Author

ValWood commented Sep 14, 2020

I think we want to require a variant status for all new genotypes.

I agree, the assumption is that it will always be one or the other.

@CuzickA
Collaborator

CuzickA commented Sep 15, 2020

Screenshot 2020-09-14 at 21 35 13

note that [WT product] refers to the expression level.

Since NV/EV refers to the genotype allele, it should be

WT[NV] [WT product]

Yes, I agree with this.

@jseager7
Collaborator

Did we want the 'allele variant' to be above the 'allele type' or below it (but above the allele expression)?

I don't think we made a decision, but if the allele variant ever constrains the allele type (meaning we won't want to show some allele types for natural variants or engineered variants), then we should show it above the Allele type field. Otherwise, it doesn't matter where it goes.

@ValWood
Member Author

ValWood commented Sep 15, 2020

I think it's this option, as the '+' already indicates WT
yes

@CuzickA
Collaborator

CuzickA commented Oct 7, 2020

If the display looks like this
sgo1+[NV] [WT product]
for allele type 'wild type' being represented with '+'

how will it change with the proposed new allele types
wild type-reference
wild type-other

@CuzickA
Collaborator

CuzickA commented Oct 12, 2020

After today's meeting we decided to compile a list of reference proteome strains and add '-ref' or similar to these strain names as a tag. We are using the reference proteome selected by UniProt. Where there is more than one strain we will select the first published genome. (Is there a ticket for this?)

We are still finding it tricky to make a decision on capturing 'variant status nv/ev' and 'wild type -ref/other'.
Here is a list of possible terms that we could include in the allele type dropdown menu

wild type-ref (Query: is this for when the sequence is the same as the reference proteome, rather than indicating wild type function which is captured in AE infective ability in the gene-for-gene flow??)
wt-nv-deletion (assume wt=wt-other and not wt-ref?)
wt-nv-disruption
wt-nv-unknown
wt-nv-amino acid insertion
wt-nv-amino acid substitution(s)
wt-nv-amino acid insertion and substitution
wt-nv-amino acid insertion and deletion
wt-nv-partial deletion and amino acid change
wt-nv-partial deletion, amino acid
wt-nv-nucleotide insertion
wt-nv-nucleotide substitution(s)
wt-nv-partial deletion, nucleotide
wt-nv-nonsense mutation
wt-nv-other

ev-wild type (Query: is this needed for overexpression studies?)
ev-deletion
ev-disruption
ev-unknown
ev-amino acid insertion
ev-amino acid substitution(s)
ev-amino acid insertion and substitution
ev-amino acid insertion and deletion
ev-partial deletion and amino acid change
ev-partial deletion, amino acid
ev-nucleotide insertion
ev-nucleotide substitution(s)
ev-partial deletion, nucleotide
ev-nonsense mutation
ev-transformant
ev-other

What do you think @ValWood ?

I'm still a bit confused about the capturing of 'wild type'.
There seem to be several meanings
-reference strain
-other natural strain
-function of allele (KHK)

@ValWood
Member Author

ValWood commented Oct 12, 2020

I'm still a bit confused about the capturing of 'wild type'.

I am still confused by the meaning of this. It should really mean one thing, or if it means 2 things we need 2 separate data-types....

ev-wild type (Query: is this needed for overexpression studies?)
I don't think so, because the expression is captured separately from the allele type

@ValWood
Member Author

ValWood commented Oct 12, 2020

Another issue: if these are precomposed in a single pulldown, how will the user know what NV and EV refer to in these labels?

Also do you want wt in front of all of the natural variations? Some will not be wt?

@CuzickA
Collaborator

CuzickA commented Oct 13, 2020

I'm still a bit confused about the capturing of 'wild type'.

I am still confused by the meaning of this. It should really mean one thing, or if it means 2 things we need 2 separate data-types....

Yes, this is why we devised
wild type-reference
wild type-other

although in yesterday's discussion it sounded as if the team wanted to move away from this idea.

@CuzickA
Collaborator

CuzickA commented Oct 13, 2020

Another issue: if these are precomposed in a single pulldown, how will the user know what NV and EV refer to in these labels?

Also do you want wt in front of all of the natural variations? Some will not be wt?

I guess the NV and EV would have to be spelt out in full. (The mocked up example above with the NV, EV options would be clearer).

I wasn't sure about the 'wt' prefix. It depends on our definition of wild type. Are we defining WT as only the reference strain? or as any naturally occurring collected strain? If it is the latter, all the nv options would be WT-other.

From memory only ~50% of the species in the PHI4.8 data release had reference strain proteomes in UniProt. @jseager7 is repeating this search with the latest PHI4.10 dataset.

@ValWood
Member Author

ValWood commented Oct 13, 2020

I was more concerned that the WT in these cases is really an inference from WT-phenotype in another experiment. This seems a bit strange and possibly problematic, but I can see that you require the information that this allele behaves like a wt allele.

It might be OK as long as we are clear that what we are saying here is that this is a WT genotype AND it does mean that we would not be able to capture any of the natural variation WT if it is known. Although this may not matter because for the experimental outcome it isn't important. Also, the information should be accessible as we have the locus and strain information recorded which is the important part. So I think it is probably fine, I am probably worrying about this unnecessarily.

However, I find it odd that the natural variant changes have wt in front of them, because the reason we are recording these changes is that they don't behave like the reference WT, so here we would be using the wt meaning differently. Shouldn't these just omit the wt and be called nv? Retaining the WT designation for "any genotype that behaves like the WT reference strain"

You should also spell out nv and ev in full in the dropdown (although these can be abbreviated in the genotype view and in the annotations).

I think users will find it confusing that if you have a non-reference strain, with a 'wt-' acting allele where you did not know the sequence your option would be
wt-nv-unknown
it seems that this option should be
wt-non-reference
(because if it is 'unknown' sequence you won't know if there is any natural variation or not).
The nv options are for where you know the sequence of a naturally occurring variant and you can record differences from the canonical WT.

Does this make sense?

@ValWood
Member Author

ValWood commented Oct 13, 2020

Also, we might be able to remove some infrequently used allele types and use 'other'. This selection is largely based on PomBase observed types, but you don't need to keep them all.

@ValWood
Member Author

ValWood commented Oct 13, 2020

I guess the NV and EV would have to be spelt out in full. (The mocked up example above with the NV, EV options would be clearer).

I wasn't sure about the 'wt' prefix. It depends on our definition of wild type. Are we defining WT as only the reference strain? or as any naturally occurring collected strain? If it is the latter, all the nv options would be WT-other.

OK I see I am repeating myself! I don't know if that is good or bad.

@ValWood
Member Author

ValWood commented Oct 13, 2020

wild type-ref (Query: is this for when the sequence is the same as the reference proteome, rather than indicating wild type function which is captured in AE infective ability in the gene-for-gene flow??)

Yesterday, my understanding was that you wanted to use wt for any allele which behaved like the identified wt (which might be reference or non-reference), which is fine.
I keep wondering why you therefore need to distinguish between a reference-sequence and a non-reference-sequence wt?
I think somebody explained this yesterday, but could we confirm the reason?

I can only think of the scenarios
behaves like WT
or some natural variation
or some engineered variation

So we need to be clear about why we need to specify the 'reference' information in the workflow. Particularly since the information about the reference strain for the species will be recorded going forward with the strain information.

My question is: why do we need to say whether the wt is reference sequence or not? (It seems that these are really quite different things anyway; a WT allele for any given locus may or may not be WT in the reference strain.)

Am I oversimplifying by saying that all we really want to record is:
a) this locus behaves like WT for this allele (in which case we are not recording any variation info, but this would be available from the locus/strain data)
b) any natural variation that does not behave like WT
c) any engineered variation
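Read literally, the three cases above amount to a small decision rule. The sketch below is one possible reading: the names are hypothetical, and letting "engineered" take precedence over "behaves like WT" is an assumption, since the list does not say how an engineered allele with WT behaviour should be recorded.

```python
from enum import Enum


class AlleleRecord(Enum):
    WT_ACTING = "a) behaves like WT"
    NATURAL_VARIANT = "b) natural variation that does not behave like WT"
    ENGINEERED_VARIANT = "c) engineered variation"


def classify_allele(engineered, behaves_like_wt):
    """Map two yes/no observations onto the three cases listed above."""
    if engineered:
        # Assumption: any engineered change is case (c), whatever its behaviour.
        return AlleleRecord.ENGINEERED_VARIANT
    if behaves_like_wt:
        # No sequence detail recorded; locus and strain are stored elsewhere.
        return AlleleRecord.WT_ACTING
    return AlleleRecord.NATURAL_VARIANT


print(classify_allele(engineered=False, behaves_like_wt=True).value)
print(classify_allele(engineered=False, behaves_like_wt=False).value)
print(classify_allele(engineered=True, behaves_like_wt=False).value)
```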

Does this minimal set of information make it impossible to get out any of the information you want from gene-for-gene annotation?

@CuzickA
Collaborator

CuzickA commented Oct 13, 2020

Thanks @ValWood

So it sounds like we are moving towards defining 'WT' as the allele having 'wild type function', rather than exactly matching the sequence of a reference or other strain.

How do we conclude what the 'wild type function' of the gene is? For pathogen effectors I guess this would be to cause disease, for host resistance genes this would be to recognise at least one effector and trigger resistance. We can capture this information in the gene-for-gene AE which is good.

@jseager7
Collaborator

So it sounds like we are moving towards defining 'WT' as the allele having 'wild type function', rather than exactly matching the sequence of a reference or other strain.

It seems like the common (dictionary) definition of wild-type is some gene or allele that is most prevalent in a natural population. Is it likely to cause confusion if we use a definition based on gene function? Are these definitions even compatible?

@ValWood
Member Author

ValWood commented Oct 13, 2020

Are these definitions even compatible?

I don't know. I think the way Kim-H-K wants to use WT in gene-for-gene might not be compatible with the use in non-gene-for-gene annotation.

I don't know how we could know which allele is most prevalent. So, as Kim said, the community usually decides on a gene-by-gene basis what is WT based on observation.

We often have a 'known WT acting allele' and either a) we don't know the precise genotype or b) the sequence may not be exactly the same as the designated WT allele. My understanding is that in these cases we want to be able to say this is a WT-acting allele and we don't need to record any sequence detail. We are getting hung up on the semantics of how to name and define this.

We don't just want to say that it is some unknown natural variant, because in the gene-for-gene outcomes it is important to know whether the disease is a result of variation in the pathogen or host allele. In this case if the WT pathogen normally causes disease and there is no disease in the host we know this must be due to some change in the host.

Is this what we are trying to say:
WT allele
An allele that contains the designated WT sequence for a given locus, or any naturally occurring variant in a related strain which exhibits the same phenotype?

It seems that a precise definition of how PHI-base interprets a WT designation would be a good starting point. If we have this it should be easier to move to a solution.

Whatever the definition ends up being it needs to be true across the board for gene-for-gene and for non-gene for gene. We can't have 2 different uses of WT so if something different is meant we need a different label.

@CuzickA
Collaborator

CuzickA commented Oct 13, 2020

past idea
image
new idea?? based on WT allele function
image

here are the AE
image

In the new idea mockup, both ATR1(emoy2) and ATR1(cala2) would be 'WT', as they can function to cause disease on the correct host strain (Nd (not shown) and Ws, respectively).
ATR1(emoy2) is recognised by the host R gene in strain Ws, which blocks disease formation. If we are comparing 'cala2' to the reference strain 'emoy2' sequence we can say nv-unknown, but do we still say WT allele function?
The key information here is that there is natural variation between strains emoy2 and cala2 which determines whether RPP1 from host strain Ws can recognise it to trigger defence.

@CuzickA
Collaborator

CuzickA commented Oct 13, 2020

WT allele
An allele that contains the designated WT sequence for a given locus, or any naturally occurring variant in a related strain which exhibits the same phenotype?

The tricky part with the gene-for-gene interactions is the combination of both pathogen and host strains.
In the example above, both ATR1emoy2 and ATR1cala2 have 'WT effector function' of causing disease when they are on the correct host strain to enable this.

@ValWood
Member Author

ValWood commented Oct 13, 2020

WT allele
An allele that contains the designated WT sequence for a given locus, or any naturally occurring variant in a related strain which exhibits the same phenotype. In pathogen host interactions this includes an allele which is shown to induce disease on at least one susceptible host.

Note that I am not necessarily suggesting we use this as the WT definition. I'm only trying to figure out the usage scope.

@CuzickA
Collaborator

CuzickA commented May 18, 2021

It would be good to move forward and make a decision about this NV/EV tag. All the annotated sessions will need to be updated with NV/EV prior to making training materials and screenshots for the PHI-Canto publication.

I think the minimal information we are trying to disambiguate is this: when a genotype records an alteration, is it due to natural sequence variation between this strain and another strain, or due to engineered variation? One of the difficulties here is not linking to the exact sequence for a given strain.

Can we make a statement in our documentation that

  1. Different strain names may or may not indicate different allele sequences or function.
  2. When natural variation is responsible for a known alteration within the allele this is recorded with the genotype using a tag NV to indicate the alteration occurred in the wild rather than being engineered in the lab.
  3. When the NV tag is NOT used within an altered genotype, it can be assumed that the genotype was engineered in a lab.

I'm not sure if this helps the discussion but it seems to all be getting a bit complicated and I wanted to reduce it down to address the initial issue.

Let me know what you think :-)

@jseager7
Collaborator

Different strain names may or may not indicate different allele sequences or function.

Is this referring to the strain name on the genotype? I'm a bit concerned about the implication that the strain name doesn't indicate different allele sequences, because that begs the question: why indicate the strain at all? The only way I can see this not mattering is if the strain only contains sequence differences outside of the genes / alleles of interest (by 'alleles of interest' I mean the alleles curated in the session), but I would've expected most authors will be using a strain precisely because it contains existing variations to some allele of interest that they want to study. Is that true?

When natural variation is responsible for a known alteration within the allele this is recorded with the genotype using a tag NV to indicate the alteration occurred in the wild rather than being engineered in the lab.

This sounds fine, although we might want to define exactly what the scope of 'natural variation' is – would controlled breeding programmes count as natural variation? – and maybe include some examples of the reference point for natural variation. Maybe one example of natural variation is when a subset of a wheat population expresses greater resistance to some pathogen because of a mutation that was not experimentally (deliberately) induced.

When the NV tag is NOT used within an altered genotype, it can be assumed that the genotype was engineered in a lab.

I don't think this is necessary, because the current plan is to require a variant status for all genotypes: see #2346 (comment). The user would be forced to pick NV or EV, so there's no need to make assumptions.

@jseager7
Collaborator

I also don't think we've resolved the following points from @ValWood, at least not in this issue:

It seems that a precise definition of how PHI-base interprets a WT designation would be a good starting point. If we have this it should be easier to move to a solution.

Whatever the definition ends up being it needs to be true across the board for gene-for-gene and for non-gene for gene. We can't have 2 different uses of WT so if something different is meant we need a different label.

@CuzickA
Collaborator

CuzickA commented May 19, 2021

Different strain names may or may not indicate different allele sequences or function.

Is this referring to the strain name on the genotype? I'm a bit concerned about the implication that the strain name doesn't indicate different allele sequences, because that begs the question: why indicate the strain at all? The only way I can see this not mattering is if the strain only contains sequence differences outside of the genes / alleles of interest (by 'alleles of interest' I mean the alleles curated in the session), but I would've expected most authors will be using a strain precisely because it contains existing variations to some allele of interest that they want to study. Is that true?

Yes, I was referring to the strain name on the genotype. And yes, the rest of your comment follows my thinking here. In most cases there will be variation to the strain alleles being studied, but I thought it would be better to keep the option open in case there is no variation within the studied gene and the strain variation is elsewhere in the genome. Some studies may collect a variety of eg pathogen strains from the field and test on host for phenotype. We may want to curate this information but the authors themselves may not know whether the allele sequences are the same or not unless they sequence and this is not always done. Again it comes down to the difficulty of not knowing the allele sequence from the strain in many of the cases.

When the NV tag is NOT used within an altered genotype, it can be assumed that the genotype was engineered in a lab.

I don't think this is necessary, because the current plan is to require a variant status for all genotypes: see #2346 (comment). The user would be forced to pick NV or EV, so there's no need to make assumptions.

I couldn't quite decide here on whether it would be better to force a choice of NV or EV for all genotypes or just to add NV in the examples where we have a WT strain that has a known alteration that is captured in the genotype. In these cases the allele type would not be wild type it would be amino acid substitution or similar. I thought the NV with a clear definition would help explain these known natural variation genotypes. In cases where the strain sequence was unknown, the genotype would have the strain name and be wild type. In cases where the genotype were EV, the alleles would usually not be wild type and if they were they would have altered expression.
If we did decide to add NV/EV to all genotypes then all of the control genotypes would need the NV tag and then we have the issue mentioned above about specifying which strains are WT-reference or WT-other.

I'm not sure which idea would work best here, but I thought it was worth suggesting this alternative idea to try and move us away from needing to put too much emphasis on a WT sequence or function. This opens the can of worms about reference genomes, non-reference genomes and pan-genomes.

@jseager7
Collaborator

jseager7 commented May 20, 2021

Some studies may collect a variety of e.g. pathogen strains from the field and test them on the host for phenotype. We may want to curate this information, but the authors themselves may not know whether the allele sequences are the same unless they sequence them, and this is not always done.

Maybe this is a silly question, but if the authors don't know if the allele sequences are the same – presumably because they didn't perform any sequencing – how do they know what the strains are?

@jseager7
Collaborator

I'm not sure which idea would work best here, but I thought it was worth suggesting this alternative idea to try and move us away from needing to put too much emphasis on a WT sequence or function. This opens the can of worms about reference genomes, non-reference genomes and pan-genomes.

I don't know what the best answer is, but I suspect it would benefit community curators if we could simplify or reduce the data we need to curate (not to mention the benefit of not having to revisit every curation session). The fact that this issue has been so difficult to understand during its discussion makes me think that it may not be easy for community curators to reason about either.

Now that we have the ability to link metagenotypes to their controls, I'm not sure why it's important to continue to stress this distinction between reference strains and other strains. There are all kinds of problems with the reference strain distinction:

  • we already know that the coverage of reference proteomes is incomplete;
  • I suspect PHI-base will be covering a lot of pathogens that are less studied and probably less sequenced;
  • based on previous discussion, there seem to be pretty arbitrary rules about deciding which strain is the reference strain;
  • there are cases where the reference strain isn't even the most relevant strain for experimental study (which happened with Triticum aestivum, if I remember correctly); and so on.

It sounds like the distinction between mutations arising from natural variation (NV) and mutations caused by experiments (EV) could be useful (and it feels more straightforward), but I don't have the expertise to say how useful it is. Assuming the information is usually present in publications, it might at least be easier to curate.

@jseager7
Collaborator

Following the meeting today, we've decided on a simpler solution that mostly follows Alayne's suggestion.

@ValWood I'd appreciate your feedback on these suggestions, particularly points 3 and 4, because these may be difficult to change if we later decide to take another approach – of particular importance is whether we should treat the origin of the variation as a property of the allele or the genotype, especially in cases where a multi-allele genotype contains alleles of natural origin and alleles engineered by the experiment.

  1. We will focus on curating natural variation for mutant genotypes, where relevant. For example, the curator will only have to tick a box to indicate when the allele (i.e. single allele genotype) was caused by natural variation.

  2. Variation will not be specified for control metagenotypes, because it's too difficult to specify what the variation is relative to (leading us back into the WT-reference / WT-other problem). Mutant metagenotypes should not have this problem because the variation is understood to be relative to the control metagenotype.

  3. If an allele is not specified to be caused by natural variation, then it is assumed to be caused by engineered variation. Currently I think the plan is to omit the tag in the case of an engineered variation, but we could default to engineered variation on new alleles if we think that would be clearer. I'm a bit concerned about extending this 'engineered by default' assumption to control metagenotypes, because in that case we truly don't know (or don't care) what the origin of the variation is.

  4. Multi-allele genotypes containing at least one engineered allele will be classified as engineered genotypes. While we considered that users might want to see the origin of the variation for each allele in a genotype, we ultimately decided that the most notable case was when a phenotype arose exclusively from natural variation, and that it would be simpler (from a user interface perspective) to keep the variation at the level of the genotype, and merely classify genotypes including any form of engineered variation as engineered (or at least, non-natural).
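As a rough illustration of points 1 to 4, the rule could be sketched as follows. This is only a sketch in Python, not Canto code: `VariantStatus`, `Allele`, and `genotype_variant_status` are hypothetical names, and it assumes the natural variation tick box is recorded per allele before being rolled up to the genotype.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class VariantStatus(Enum):
    """Hypothetical labels for the origin of variation in a genotype."""
    NATURAL = "natural variant"
    ENGINEERED = "engineered variant"


@dataclass(frozen=True)
class Allele:
    """Hypothetical allele record; natural_variant is the tick box from point 1."""
    name: str
    natural_variant: bool = False  # point 3: unticked is assumed to be engineered


def genotype_variant_status(
    alleles: Sequence[Allele], is_control: bool = False
) -> Optional[VariantStatus]:
    """Roll allele-level tick boxes up to one genotype-level status."""
    if is_control:
        # Point 2: no variant status is recorded for controls.
        return None
    if not alleles:
        raise ValueError("a genotype needs at least one allele")
    if all(allele.natural_variant for allele in alleles):
        # Only genotypes arising exclusively from natural variation are natural.
        return VariantStatus.NATURAL
    # Point 4: any engineered allele makes the whole genotype engineered.
    return VariantStatus.ENGINEERED


# Placeholder allele names, not taken from any curation session.
natural = Allele("allele-1", natural_variant=True)
engineered = Allele("allele-2")
mixed_status = genotype_variant_status([natural, engineered])
```

With this rule, combining a natural single-allele genotype with an engineered one always yields an engineered genotype, so the per-allele origin is not recoverable from the genotype-level status alone. That is the trade-off raised in point 4.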

@ValWood
Member Author

ValWood commented May 24, 2021

It seems that it should apply to the allele. I haven't yet annotated any multi-allele genotypes for PHI-base, but I guess there will be cases later where people have a natural variant AND engineer another gene in the same species?

@kimrutherford
Member

Hi Val. We discussed that on the call. The consensus was that to keep things simple we'd attach the engineered vs natural flag to the genotypes. And if the user combines an engineered single allele genotype and a natural one in the interface, the resulting multi-allele genotype should have the engineered flag.

So we have a plan, but I think we should have another chat about this on Skype (including you this time) before starting the implementation. It involves changes to how things are stored in the database, so it would be good to be sure we've got it right.

@jseager7
Collaborator

Another factor that could affect this decision is how the variant status will be displayed in the user interface, depending on whether it's linked to each allele or the combined genotype.

Linking to alleles

Linking the variant status to the allele would be unambiguous in the annotation table rows:

image

and also in the drop-down menu when editing annotations:

image

Linking to genotypes

Linking the variant status to the genotype would mean we'd have to visually delimit the variant status from the individual alleles. For the annotation table rows, we could put the variant status on its own line:

image

but the display for the drop-down menu wouldn't be so simple. It seems the only sensible place for the variant status is after the final allele, delimited with extra white space:

image

but this display could be confused with the variant status only applying to the final allele in the list (TRI5+ in the example above).

I also thought about placing the variant status after the species information, but I thought this would make it seem like the variant status related to the species or strain, instead of the genotype:

image
