-
Notifications
You must be signed in to change notification settings - Fork 594
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change UpdateVCFSequenceDictionary to use the specified dictionary uniformly. #5093
Conversation
Codecov Report
@@ Coverage Diff @@
## master #5093 +/- ##
===============================================
- Coverage 87.037% 87.036% -0.001%
- Complexity 31728 31735 +7
===============================================
Files 1943 1943
Lines 146193 146213 +20
Branches 16141 16145 +4
===============================================
+ Hits 127242 127258 +16
- Misses 13064 13068 +4
Partials 5887 5887
|
@jonn-smith Can you do a review on this one when you get a chance? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of minor questions / comments.
final SAMSequenceDictionary masterDictionary = getMasterSequenceDictionary(); | ||
if (dictionarySource == null) { | ||
if (masterDictionary != null) { | ||
// We'll accept the master dictionary if one was specified. Using the master dictionary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you throw in a logging statement that the dictionary is being overridden? I know it'll be captured in the arguments used to call a tool, but it seems like another explicit warning would be good (particularly in cases where the specified dictionary is different from the embedded dictionary in the file / etc).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, done.
if (dictionarySource == null) { | ||
if (masterDictionary != null) { | ||
// We'll accept the master dictionary if one was specified. Using the master dictionary | ||
// arg will result in sequence dictionary validation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the case where invalid dictionaries for the data are used (such as in #5087), can we detect it and throw a warning?
If the user specifies a bad sequence dictionary we should let them know that something bad could happen.
(This is essentially the same request as the above comment.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tool does have code to validate the new SD (in the first few lines of the apply method), and there are also tests for that. The problem in #5087 was that the indexer was getting the old (invalid) SD via the standard GATK machinery that uses the SD from the input. The change in this PR was to unify those so the new one is used everywhere.
final SAMSequenceDictionary expectedSequenceDictionary = SAMSequenceDictionaryExtractor.extractDictionary( | ||
Paths.get(new File(testDir, "exampleFASTA.dict").getAbsolutePath())); | ||
|
||
// verify only the sequence names and lengths, since other attributes such as MD/UR will have been updated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cmnbroad Is it worth verifying that the other attributes are different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it is. In real life usage, whether or not these attributes these attributes change depends on the state of the original inputs. Since this test uses a starting vcf that has no sequence dictionary at all, the MD5 and UR attributes get added on, but we just need to validate that the correct SD got propagated.
Answers inline and made a minor update based on review comments. Back to @jonn-smith. Also this branch has conflict now I'll rebase and run test again once we're ready to merge. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 This looks good, in the interest of merging things today I'm going to rebase this and try to merge it for you.
0561069
to
772151c
Compare
Comments all addressed and approved.
Thanks @jonn-smith and @jamesemery. |
Fixes #5087