-
Notifications
You must be signed in to change notification settings - Fork 594
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure consistent output directory requirements in CNV tools. #4825
Comments
@sooheelee did you encounter this when running at the command line in a python environment, or when using the Docker image? And did the /home/shlee/gcc/hc24_soohee1k_chr1-model/ directory exist prior to running? |
@samuelklee, in the command line with
So definitely not a Docker run. |
Hello, i've had a similar issue:
I created the 190225.181217_K00178.CNVCaller folder beforehand because otherwise I'd get this error:
I'm running these command line in the conda environment explained by the gatk tutorial. Any insight? |
The first issue looks like an out of memory error. You may need to scatter your intervals into separate shards, as is done in the WDLs: https://github.com/broadinstitute/gatk/tree/master/scripts/cnv_wdl/germline The second issue regarding the output directory creation is by design---CNV tools require the output directory to exist beforehand. |
See discussion at https://gatkforums.broadinstitute.org/gatk/discussion/11711/germlinecnvcaller-interval-merging-rule-error#latest regarding output directory creation. |
@sooheelee do you remember what version you were using when you encountered this? As far as I can tell, we should fail early in the most recent version (as you can see from @Yu-jinKim's post). EDIT: Never mind, actually the fail-early check is only for existence and not for write permission. |
Changed the name of the issue. As discussed in the above forum thread, behavior is already consistent across CNV tools---except for DetermineGermlineContigPloidy, where the check is missing, so this issue is really just a reminder to add that (along with checks for write permission). We don't really have conventions for this sort of thing GATK-wide, though. |
@samuelklee Is this type of thing something you'd like for us to start documenting, e.g. in the repo wiki? There is some need for this type of document if we expect more external contributions going forward. We can ask @droazen to weigh in. |
We had some discussion about conventions in the hellbender Slack. Different tools have different requirements or expectations, but it would be good to converge on and document a couple of patterns. |
If you don't mind the list being public, the github wiki, e.g. like this existing article https://github.com/broadinstitute/gatk/wiki/GATK4-Documentation-Guidelines, seems like a good place to start a list. |
I made you a page to start collecting such reminders at https://github.com/broadinstitute/gatk/wiki/Checks-and-tests-guidelines. |
Excellent, thanks! |
Just finished switching over all of the CNV tools to fail early if directories are not writeable---or do not exist and cannot be created---only to realize that this behavior is inconsistent with that of Picard IntervalListTools (which is used in the gCNV pipeline). That tool fails early if the output directory is not writeable or does not exist, and although there is a code path later that suggests that output directories should be created, it is not reached due to this early fail. It might be that this inconsistency was introduced in broadinstitute/picard#1208 and I did not catch it in my PR review. @yfarjoun any opinions what the intended behavior should be? Are there any conventions for Picard tools in general? Perhaps we could enforce this at the engine level (maybe checks that are triggered by annotations such as suggested in #141, if possible)? But this would only work for GATK tools and would still rely on the diligence of developers. In any case, I'll decide on and document a convention for the CNV tools, but I think it might be a quixotic dream to enforce consistent behavior---especially without breaking things downstream which may rely on existing, inconsistent behavior... |
Picard tries to adhere to the following (but it's up to each tool to figure
out):
1. Check input and output files for readability/writability as soon as
possible.
2. Delete incomplete outputs in case of caught exception.
3. Return non-zero value in case of error.
…On Sun, Mar 3, 2019 at 4:45 PM samuelklee ***@***.***> wrote:
Just finished switching over all of the CNV tools to fail early if
directories are not writeable---or do not exist and cannot be
created---only to realize that this behavior is inconsistent with that of
Picard IntervalListTools (which is used in the gCNV pipeline).
That tool fails early if the output directory is not writeable or does not
exist, and although there is a code path later that suggests that output
directories should be created, it is not reached due to this early fail. It
might be that this inconsistency was introduced in
broadinstitute/picard#1208
<broadinstitute/picard#1208> and I did not catch
it in my PR. @yfarjoun <https://github.com/yfarjoun> any opinions what
the intended behavior should be? Are there any conventions for Picard tools
in general?
Perhaps we could enforce this at the engine level (maybe checks that are
triggered by annotations such as suggested in #141
<#141>, if possible)? But
this would only work for GATK tools and would still rely on the diligence
of developers.
In any case, I'll decide on and document a convention for the CNV tools,
but I think it might be a quixotic dream to enforce consistent
behavior---especially without breaking things downstream which may rely on
existing, inconsistent behavior...
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#4825 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ACnk0knrip3kZgweCbYoBj3fwGvlDdPEks5vTEKEgaJpZM4USOnb>
.
|
Thanks, @yfarjoun, I think those are reasonable. Just to be clear, the code for the tool mentioned above is a little confusing, in that an early fail for writability when the directory does not exist prevents us from reaching code that appears to be intended to create the directory. Not a big deal in the end (and I checked that this was also the case before the PR). But minor things like this can easily break downstream scripts, etc., as was demonstrated above, so we should take some care. I agree that it's fine to leave some decisions up to each tool, but we should try to document them for the benefit of users and future devs that might need to maintain the behavior of the tool. |
* Cleaned up intermediate files in gCNV WDL and fixed miscellaneous typos. (#5382) * Added output of MAD values as floats in somatic CNV WDL. (#5591) * Exposed boot disk space for Oncotator in somatic CNV WDL. (#3566) * Added check to skip outlier truncation if number of matrix elements exceeds Integer.MAX_VALUE in CreateReadCountPanelOfNormals. (#4734) * Miscellaneous boy scout activities. * Fixed some issues concerning intervals in DetermineGermlineContigPloidy documentation. * Fixed non-kebab-case argument in CollectAllelicCountsSpark and other minor issues. * Improved consistency of style and input/output validation across CNV tools. (#4825)
The exception is:
Posting this as per @samuelklee's request.
The text was updated successfully, but these errors were encountered: