-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merge str harmonization update #153
Merge str harmonization update #153
Conversation
# Conflicts: # trtools/utils/tr_harmonizer.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need tests running mergeSTR on actual HipSTR VCFs to confirm that it properly handles records with flanking base pairs. The current tests don't confirm that there's been any upgrade in functionality. I think you can build tests taking appropriate records from the example-files/*hipstr*
files, but if that doesn't work I'm happy to provide example HipSTR output.
@@ -34,6 +34,16 @@ def __init__(self, chrom, pos, ref, alts=[], info = {}): | |||
self.ALTS = alts | |||
self.INFO = info | |||
|
|||
|
|||
class DummyHarmonizedRecord: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the compareSTR review is merged, let's import this from test_mergeutils instead of duplicating it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is just a temporary solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've also created a new test that confirms that flanking bps of hipstr records are ignored, and the repeat part is merged as before.
@@ -172,19 +179,44 @@ def GetRefAllele(current_records, mergelist): | |||
ref : str | |||
Reference allele string. Set to None if conflicting references are found. | |||
""" | |||
refs = [] | |||
|
|||
def DefaultKey(record: trh.TRRecord): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe harmonization should make it so that record.ref_allele is correct regardless of the vcftype, so you don't have to do this sort of check here. So you should just be able to pull that field. (Or if not, we should fix the harmonization to make that so).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, when I wrote this I was afraid that the REF field might contain something like , similarly to ALT fields from some callers, but now I realize that something like that wouldn't make any sense. Thanks for pointing that out!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reverted the changes but found an additional problem. It looks like EH records usually have only one BP as a value in the REF column. Harmonization of EH records expands the EH record motif and fabricates this REF allele. Use of ref from the harmonized record can cause differences like this:
Output line: CAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAG
Control line: C
A possible solution is to use the old REF field from the non-harmonized VCF record and to use the old REF and ALT fields from non harmonized records (reverting most of my changes), but this might cause some weird behavior, for example on HipSTR records with and without flanking BPs.
But I guess it depends on what you think is the intended behavior. Is it OK to remove the flanking BPs during the merge or not? In my opinion, removing the flanking BPs during the merge is OK, because it doesn't have an effect on the meaning of the record, and the resulting record still looks like a HipSTR record.
To sum up, using ref attribute from harmonized record breaks things, the question is, is it OK to use info about types of records to decide which attribute to use (the behavior I introduced), or to not use harmonized records during merge at all (only during the record picking phase).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we should remove flanking bps during the merge and make the explicit in the documentation.
I had forgotten about the different representations between callers issue. You're right, it doesn't make sense use harmonized records for outputting in general, it needs to look like the original caller's output. I think the solution in the current commit is fine then - check if it's a HipSTR record, and if so use the harmonized record's alleles which have the flanking bps removed (and are sequences which is the same type of output that HipSTR produces). Otherwise just proceed with what's in the record, as this problem only pertains to HipSTR.
That all seem reasonable to you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is. I've reverted my local changes and fixed the case problem from your other comment.
refs.append(current_records[i].REF.upper()) | ||
chrom = current_records[i].chrom | ||
pos = current_records[i].pos | ||
refs.append(ref_picker(current_records[i]).upper()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When this comes from the tr harmonizer, it'll already be uppercase
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conversion to upper case now only happens when non harmonized REF is used
This looks done! Please add an 'unreleased changes' section to RELEASE_NOTES.rst in the main directory with a bullet or two about what's changed here and I'll merge this into develop. |
Similar update to the one done here #147
MergeSTR is now able to correctly handle HipSTR records which may or may not have their flanking bps as part of the record