Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge str harmonization update #153

Merged

Conversation

Kulivox
Copy link
Contributor

@Kulivox Kulivox commented Feb 15, 2022

Similar update to the one done here #147
MergeSTR is now able to correctly handle HipSTR records which may or may not have their flanking bps as part of the record

@Kulivox Kulivox marked this pull request as ready for review February 17, 2022 12:08
@Kulivox Kulivox changed the base branch from master to develop February 22, 2022 09:29
Copy link
Contributor

@LiterallyUniqueLogin LiterallyUniqueLogin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need tests running mergeSTR on actual HipSTR VCFs to confirm that it properly handles records with flanking base pairs. The current tests don't confirm that there's been any upgrade in functionality. I think you can build tests taking appropriate records from the example-files/*hipstr* files, but if that doesn't work I'm happy to provide example HipSTR output.

@@ -34,6 +34,16 @@ def __init__(self, chrom, pos, ref, alts=[], info = {}):
self.ALTS = alts
self.INFO = info


class DummyHarmonizedRecord:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the compareSTR review is merged, let's import this from test_mergeutils instead of duplicating it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is just a temporary solution.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've also created a new test that confirms that flanking bps of hipstr records are ignored, and the repeat part is merged as before.

@@ -172,19 +179,44 @@ def GetRefAllele(current_records, mergelist):
ref : str
Reference allele string. Set to None if conflicting references are found.
"""
refs = []

def DefaultKey(record: trh.TRRecord):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe harmonization should make it so that record.ref_allele is correct regardless of the vcftype, so you don't have to do this sort of check here. So you should just be able to pull that field. (Or if not, we should fix the harmonization to make that so).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, when I wrote this I was afraid that the REF field might contain something like , similarly to ALT fields from some callers, but now I realize that something like that wouldn't make any sense. Thanks for pointing that out!

Copy link
Contributor Author

@Kulivox Kulivox Feb 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reverted the changes but found an additional problem. It looks like EH records usually have only one BP as a value in the REF column. Harmonization of EH records expands the EH record motif and fabricates this REF allele. Use of ref from the harmonized record can cause differences like this:

Output line: CAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAG
Control line: C

A possible solution is to use the old REF field from the non-harmonized VCF record and to use the old REF and ALT fields from non harmonized records (reverting most of my changes), but this might cause some weird behavior, for example on HipSTR records with and without flanking BPs.
But I guess it depends on what you think is the intended behavior. Is it OK to remove the flanking BPs during the merge or not? In my opinion, removing the flanking BPs during the merge is OK, because it doesn't have an effect on the meaning of the record, and the resulting record still looks like a HipSTR record.

To sum up, using ref attribute from harmonized record breaks things, the question is, is it OK to use info about types of records to decide which attribute to use (the behavior I introduced), or to not use harmonized records during merge at all (only during the record picking phase).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should remove flanking bps during the merge and make the explicit in the documentation.

I had forgotten about the different representations between callers issue. You're right, it doesn't make sense use harmonized records for outputting in general, it needs to look like the original caller's output. I think the solution in the current commit is fine then - check if it's a HipSTR record, and if so use the harmonized record's alleles which have the flanking bps removed (and are sequences which is the same type of output that HipSTR produces). Otherwise just proceed with what's in the record, as this problem only pertains to HipSTR.

That all seem reasonable to you?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is. I've reverted my local changes and fixed the case problem from your other comment.

refs.append(current_records[i].REF.upper())
chrom = current_records[i].chrom
pos = current_records[i].pos
refs.append(ref_picker(current_records[i]).upper())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When this comes from the tr harmonizer, it'll already be uppercase

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conversion to upper case now only happens when non harmonized REF is used

@LiterallyUniqueLogin
Copy link
Contributor

This looks done! Please add an 'unreleased changes' section to RELEASE_NOTES.rst in the main directory with a bullet or two about what's changed here and I'll merge this into develop.

@LiterallyUniqueLogin LiterallyUniqueLogin merged commit 261265f into gymrek-lab:develop Mar 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants