Recording variant status of genotypes (natural or engineered) #2346
Comments
This is almost certainly going to require database changes, and it sounds like it would make the most sense to record this as part of the genotype, probably on the same modal (pop-up) as where the strain is specified: @kimrutherford Since this property seems to be specific to PHI-base, would it make sense to add a |
We also need to decide whether or not we want to include the variant type in the genotype display name, as we already do with the genotype background and allele type. |
We thought that NV or EV could be included (maybe mouse over to see the full expanded abbreviation) |
Here's an example of what the inline genotype display name currently looks like: TRI5+[WT product] (bkg: background) (strain: PH-1) (note that the example is in pathogen-host mode; in single organism mode you wouldn't see the strain information) Where would you like to include the NV or EV abbreviation in this display name? Should we introduce new brackets for it, or include it in one of the existing sets of brackets? |
Maybe after [WT product] before (bkg: background)? |
We also discussed in the case of 'nv' that it would be useful to have an option for 'nv-unknown' and 'nv-known' where the sequence variation could be recorded. |
Yes, after the genotype, before the background. Note we said this would always be required, but for WT it would not be relevant (although I guess it would not harm to add NV). |
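As a rough sketch of the placement suggested above (variant abbreviation after the allele description, before the background), something like the following could build the display name. The function name `genotype_display_name` and its parameters are hypothetical and not taken from Canto; showing the abbreviation without its own brackets is also an assumption, since the bracket question is still open.

```python
from typing import Optional


def genotype_display_name(
    allele: str,
    variant: Optional[str] = None,     # "NV" or "EV"; None if not curated
    background: Optional[str] = None,
    strain: Optional[str] = None,
) -> str:
    """Build an inline display name (hypothetical helper, not Canto's code)."""
    parts = [allele]
    if variant:
        # Assumed placement: after the allele, before the background.
        parts.append(variant)
    if background:
        parts.append(f"(bkg: {background})")
    if strain:
        # Strain is only shown in pathogen-host mode.
        parts.append(f"(strain: {strain})")
    return " ".join(parts)


print(genotype_display_name("TRI5+[WT product]", "NV", "background", "PH-1"))
```

With the example genotype from this thread, this would give `TRI5+[WT product] NV (bkg: background) (strain: PH-1)`.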
I don't remember/understand this. |
We were discussing eg avrSEN1 where the nv resulted in an early STOP vs atr1/RPP1 paper where there were 5 different pathogen strains which would be nv but the actual changes in sequence were not recorded. |
The NV/EV call can be associated with any allele type? (In fact, it would not usually be WT.) This was proposed as an additional field? The allele variant would be curated (e.g. A123L) and NV could be associated with this. We decided that we wouldn't curate natural variation as WT? |
Thanks @ValWood that makes sense. So for the atr1/RPP1 example it probably would be 'WT nv strain' but for the avrSen1 example we can capture the known allele variant in the genotype and label nv. |
or unknown NV. It seems inconsistent to use WT for ones we don't know.... I haven't really thought about this though. |
Maybe we need 'ev', 'nv' for the known variant alleles and 'nvu' for those captured as WT? |
Edit: updated to reflect that we actually agreed on two variant options. Based on discussion in the last call, we've decided to add a new field to the genotype creation modal that will capture the variant status. My current plan is to allow two options:
Once the variant status is selected, it will be shown in abbreviated form in the genotype display name (as EV or NV). The field could be implemented either as radio buttons (see here for an example) or a drop-down menu:
@ValWood @CuzickA Are you happy with the names chosen for these options? Are you happy with using a drop-down menu for the field, or would you prefer radio buttons? @kimrutherford Once this is all decided, I'm going to need your help making schema changes to the |
Note also that once this is implemented, the variant status will presumably have to be retroactively applied to every existing genotype in PHI-Canto. Based on querying the JSON export, we have approximately 300 genotypes (as of 14 September 2020). It would help if we had a sensible default for the cases where the variant status hasn't been curated; presumably 'Natural variant unknown' won't be suitable if some of these old genotypes are experimental variants. |
Hi @jseager7 thanks for writing up the above meeting notes. I may have misunderstood but I thought that we were planning on having just the two options and then in the allele type dropdown list |
@CuzickA Thanks for the reminder. But what should we do about the old genotypes that have not been curated with a variant status? Will we need a placeholder option for these cases, or should we just leave the status empty? |
I'm not sure. What do you think will be the easiest way for me to make the appropriate edits to the genotypes? I guess if the status is empty I would know it needed updating. I'm also happy to have a drop down menu for the new EV NV option. |
I think as long as we only have two options, then radio buttons would be a better choice. (If we had three or four options, then either field type would be okay.)
It might be difficult to discern whether the variant status is empty if it's only shown embedded in the display name. Compare the following two examples:
...and that's just for a simple genotype. However, if we add a column showing the variant type to the genotype tables (on the Genotype Management page), it would be much easier to check. Note that allowing the field to be blank for old genotypes implies that the data is optional (and more or less requires it be optional at the database level). My question is: do we want to require a variant status for all new genotypes, or can it be optional there as well? |
This seems like a good idea.
I think we want to require a variant status for all new genotypes. Once we start trialling these new options out we should be able to flag up any genotypes that don't fit into this schema. |
I agree, the assumption is that it will always be one or the other. |
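To illustrate the rule discussed above (status required for new genotypes, but allowed to be blank for genotypes created before the field existed), here is a minimal sketch. All names (`VariantStatus`, `can_save_genotype`, `needs_curation`) are hypothetical, and this says nothing about how the field would actually be stored in the database.

```python
from enum import Enum
from typing import Dict, List, Optional


class VariantStatus(Enum):
    """The two options discussed in this issue."""
    NATURAL = "NV"
    ENGINEERED = "EV"


def can_save_genotype(status: Optional[VariantStatus], is_new: bool) -> bool:
    """New genotypes need a status; legacy genotypes may stay blank."""
    if status is None:
        return not is_new
    return True


def needs_curation(statuses: Dict[str, Optional[VariantStatus]]) -> List[str]:
    """Return the names of genotypes whose status is still blank."""
    return [name for name, status in statuses.items() if status is None]
```

A blank status then doubles as the "needs updating" marker for old genotypes, which is what a dedicated column on the Genotype Management page would surface.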
I don't think we made a decision, but if the allele variant ever constrains the allele type (meaning we won't want to show some allele types for natural variants or engineered variants), then we should show it above the Allele type field. Otherwise, it doesn't matter where it goes. |
|
If the display looks like this how will it change with the proposed new allele types |
After today's meeting we decided to compile a list of reference proteome strains and add '-ref' or similar to these strain names as a tag. We are using the reference proteome selected by UniProt. Where there is more than one strain we will select the first published genome. (Is there a ticket for this?) We are still finding it tricky to make a decision on capturing 'variant status nv/ev' and 'wild type -ref/other'. wild type-ref (Query: is this for when the sequence is the same as the reference proteome, rather than indicating wild type function, which is captured in AE infective ability in the gene-for-gene flow??) ev-wild type (Query: is this needed for overexpression studies?) What do you think @ValWood ? I'm still a bit confused about the capturing of 'wild type'. |
I am still confused by the meaning of this. It should really mean one thing, or if it means 2 things we need 2 separate data-types....
|
Another issue: if these are precomposed in a single pulldown, how will the user know what NV and EV refer to in these labels? Also, do you want wt in front of all of the natural variations? Some will not be wt? |
Yes, this is why we devised although in yesterday's discussion it sounded as if the team wanted to move away from this idea. |
I guess the NV and EV would have to be spelt out in full. (The mocked up example above with the NV, EV options would be clearer). I wasn't sure about the 'wt' prefix. It depends on our definition of wild type. Are we defining WT as only the reference strain? or as any naturally occurring collected strain? If it is the latter, all the nv options would be WT-other. From memory only ~50% of the species in the PHI4.8 data release had reference strain proteomes in UniProt. @jseager7 is repeating this search with the latest PHI4.10 dataset. |
I was more concerned that the WT in these cases is really an inference from WT-phenotype in another experiment. This seems a bit strange and possibly problematic, but I can see that you require the information that this allele behaves like a wt allele. It might be OK as long as we are clear that what we are saying here is that this is a WT genotype AND it does mean that we would not be able to capture any of the natural variation WT if it is known. Although this may not matter, because for the experimental outcome it isn't important. Also, the information should be accessible as we have the locus and strain information recorded, which is the important part. So I think it is probably fine; I am probably worrying about this unnecessarily. However, I find it odd that the natural variant changes have wt in front of them, because the reason we are recording these changes is that they don't behave like the reference WT, so here we would be using the wt meaning differently. Shouldn't these just omit the wt and be called nv? Retaining the WT designation for "any genotype that behaves like the WT reference strain". You should also spell out nv and ev in full in the dropdown (although these can be abbreviated in the genotype view and in the annotations). I think users will find it confusing that if you have a non-reference strain, with a 'wt-' acting allele where you did not know the sequence your option would be Does this make sense? |
Also, we might be able to remove some infrequently used allele types and use 'other'. This selection is largely based on PomBase observed types, but you don't need to keep them all. |
OK I see I am repeating myself! I don't know if that is good or bad. |
Yesterday, my understanding was that you wanted to use wt for any allele which behaved like the identified wt (which might be reference, or non-reference), which is fine. I can only think of the scenarios So we need to be clear about why we need to specify the 'reference' information in the workflow. Particularly since the information about the reference strain for the species will be recorded going forward with the strain information. My question is: why do we need to say whether the wt is reference sequence or not? (It seems that these are really quite different things anyway; a WT allele for any given locus may or may not be WT in the reference strain.) Am I oversimplifying by saying that all we really want to record is: Does this minimal set of information make any of the information you want to get out of a gene-for-gene interaction impossible? |
Thanks @ValWood So it sounds like we are moving towards defining 'WT' as the allele having 'wild type function', rather than exactly matching the sequence of a reference or other strain. How do we conclude what the 'wild type function' of the gene is? For pathogen effectors I guess this would be to cause disease; for host resistance genes this would be to recognise at least one effector and trigger resistance. We can capture this information in the gene-for-gene AE, which is good. |
It seems like the common (dictionary) definition of wild-type is some gene or allele that is most prevalent in a natural population. Is it likely to cause confusion if we use a definition based on gene function? Are these definitions even compatible? |
I don't know. I think the way Kim-H-K wants to use the WT in gene-for-gene might not be compatible with the use in non-gene-for-gene annotation. I don't know how we could know which allele is most prevalent. So as Kim said, the community usually decides on a gene-by-gene basis what is WT based on observation. We often have a 'known WT acting allele' and either a) we don't know the precise genotype or b) the sequence may not be exactly the same as the designated WT allele. My understanding is that in these cases we want to be able to say this is a WT-acting allele and we don't need to record any sequence detail. We are getting hung up on the semantics of how to name and define this. We don't just want to say that it is some unknown natural variant, because in the gene-for-gene outcomes it is important to know whether the disease is a result of variation in the pathogen or host allele. In this case, if the WT pathogen normally causes disease and there is no disease in the host, we know this must be due to some change in the host. Is this what we are trying to say: It seems that a precise definition of how PHI-base interprets a WT designation would be a good starting point. If we have this it should be easier to move to a solution. Whatever the definition ends up being, it needs to be true across the board for gene-for-gene and for non-gene-for-gene. We can't have 2 different uses of WT, so if something different is meant we need a different label. |
The tricky part with the gene-for-gene interactions is the combination of both pathogen and host strains. |
WT allele Note that I am not necessarily suggesting we use this as the WT definition. I'm only trying to figure out the usage scope. |
It would be good to move forward and make a decision about this NV/EV tag. All the annotated sessions will need to be updated with NV/EV prior to making training materials and screenshots for the PHI-Canto publication. I think the minimal information we are trying to disambiguate is when a genotype records an alteration is this due to natural sequence variation between this strain and another strain or due to engineered variation. One of the difficulties here is not linking to the exact sequence for a given strain. Can we make a statement in our documentation that
I'm not sure if this helps the discussion but it seems to all be getting a bit complicated and I wanted to reduce it down to address the initial issue. Let me know what you think :-) |
Is this referring to the strain name on the genotype? I'm a bit concerned about the implication that the strain name doesn't indicate different allele sequences, because that begs the question: why indicate the strain at all? The only way I can see this not mattering is if the strain only contains sequence differences outside of the genes / alleles of interest (by 'alleles of interest' I mean the alleles curated in the session), but I would've expected most authors will be using a strain precisely because it contains existing variations to some allele of interest that they want to study. Is that true?
This sounds fine, although we might want to define exactly what the scope of 'natural variation' is – would controlled breeding programmes count as natural variation? – and maybe include some examples of the reference point for natural variation. Maybe one example of natural variation is when a subset of a wheat population expresses greater resistance to some pathogen because of a mutation that was not experimentally (deliberately) induced.
I don't think this is necessary, because the current plan is to require a variant status for all genotypes: see #2346 (comment). The user would be forced to pick NV or EV, so there's no need to make assumptions. |
I also don't think we've resolved the following points from @ValWood, at least not in this issue:
|
Yes, I was referring to the strain name on the genotype. And yes, the rest of your comment follows my thinking here. In most cases there will be variation to the strain alleles being studied, but I thought it would be better to keep the option open in case there is no variation within the studied gene and the strain variation is elsewhere in the genome. Some studies may collect a variety of eg pathogen strains from the field and test on host for phenotype. We may want to curate this information but the authors themselves may not know whether the allele sequences are the same or not unless they sequence and this is not always done. Again it comes down to the difficulty of not knowing the allele sequence from the strain in many of the cases.
I couldn't quite decide here on whether it would be better to force a choice of NV or EV for all genotypes or just to add NV in the examples where we have a WT strain that has a known alteration that is captured in the genotype. In these cases the allele type would not be wild type; it would be amino acid substitution or similar. I thought the NV with a clear definition would help explain these known natural variation genotypes. In cases where the strain sequence was unknown, the genotype would have the strain name and be wild type. In cases where the genotype was EV, the alleles would usually not be wild type, and if they were they would have altered expression. I'm not sure which idea would work best here, but I thought it was worth suggesting this alternative idea to try and move us away from needing to put too much emphasis on a WT sequence or function. This opens the can of worms about reference genomes, non-reference genomes and pan-genomes. |
Maybe this is a silly question, but if the authors don't know if the allele sequences are the same – presumably because they didn't perform any sequencing – how do they know what the strains are? |
I don't know what the best answer is, but I suspect it would benefit community curators if we could simplify or reduce the data we need to curate (not to mention the benefit of not having to revisit every curation session). The fact that this issue has been so difficult to understand during its discussion makes me think that it may not be easy for community curators to reason about either. Now that we have the ability to link metagenotypes to their controls, I'm not sure why it's important to continue to stress this distinction between reference strains and other strains. There's all kinds of problems with the reference strain distinction:
It sounds like the distinction between mutations arising from natural variation (NV) and mutations caused by experiments (EV) could be useful (and it feels more straightforward), but I don't have the expertise to say how useful it is. Assuming the information is usually present in publications, it might at least be easier to curate. |
Following the meeting today, we've decided on a simpler solution that mostly follows Alayne's suggestion. @ValWood I'd appreciate your feedback on these suggestions, particularly points 3 and 4, because these may be difficult to change if we later decide to take another approach – of particular importance is whether we should treat the origin of the variation as a property of the allele or the genotype, especially in cases where a multi-allele genotype contains alleles of natural origin and alleles engineered by the experiment.
|
It seems that it should apply to the allele. I haven't yet annotated any multi-allele genotypes for PHI-base, but I guess there will be cases later where people have a natural variant AND engineer another gene in the same species? |
Hi Val. We discussed that on the call. The consensus was that to keep things simple we'd attach the engineered vs natural flag to the genotypes. And if the user combines an engineered single-allele genotype and a natural one in the interface, the resulting multi-allele genotype should have the engineered flag. So we have a plan, but I think we should have another chat about this on Skype (including you this time) before starting the implementation. It involves changes to how things are stored in the database, so it would be good to be sure we've got it right. |
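For reference, the combination rule described above (a multi-allele genotype built from any engineered genotype takes the engineered flag) could be sketched as below. The function name `combined_variant_status` and the plain string flags are hypothetical; this is only an illustration of the rule, not the planned implementation.

```python
from typing import Iterable


def combined_variant_status(statuses: Iterable[str]) -> str:
    """Flag for a multi-allele genotype built from single-allele genotypes.

    Each input is "NV" (natural) or "EV" (engineered). Following the rule
    from the call, any engineered component makes the result engineered.
    """
    flags = list(statuses)
    if not flags:
        raise ValueError("at least one component genotype is required")
    unknown = set(flags) - {"NV", "EV"}
    if unknown:
        raise ValueError(f"unexpected variant status: {sorted(unknown)}")
    return "EV" if "EV" in flags else "NV"
```

Note that this loses the information about which allele was natural, which is the trade-off raised in the question about linking the status to the allele instead.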
Another factor that could affect this decision is how the variant status will be displayed in the user interface, depending on whether it's linked to each allele or the combined genotype.
Linking to alleles
Linking the variant status to the allele would be unambiguous in the annotation table rows: and also in the drop-down menu when editing annotations:
Linking to genotypes
Linking the variant status to the genotype would mean we'd have to visually delimit the variant status from the individual alleles. For the annotation table rows, we could put the variant status on its own line: but the display for the drop-down menu wouldn't be so simple. It seems the only sensible place for the variant status is after the final allele, delimited with extra white space: but this display could be confused with the variant status only applying to the final allele in the list (TRI5+ in the example above). I also thought about placing the variant status after the species information, but I thought this would make it seem like the variant status related to the species or strain, instead of the genotype: |
For pathogens and host genotypes, many natural variants occur and for these, we want to capture the differences to some nominally WT variant. We need to be able to distinguish when a variant is naturally occurring or 'engineered' for any specific allele.
So, we would like an additional field in the genotype pop up to be able to select one of either
i) Natural variant (NV)
or
ii) Engineered variant (EV).
We also thought that later, if researchers took a natural variant and then engineered a different residue, we would be able to combine these as multi-allele phenotypes. Although I am not sure about this: it might imply that 2 copies of the gene are present. I have forgotten how this is specified.
If not, maybe this is something we can look into (this is more future-proofing; although the NV/EV distinction is required now, I don't think we have examples of editing to a natural variant right now. @CuzickA can confirm).