gCNV WDLs should clean up intermediate files and directories. #5382
Comments
Yes, also recall that @asmirnov239 tried to address #4397, but that led to #5217, so we reverted. We can try to address all of these issues again correctly if it's low-hanging fruit (which it probably is) and if it'll bring the overall cost of the pipeline down significantly. However, for the most part, I think bringing down costs in the gCNV step will have more impact. Thanks for diagnosing and pointing out these issues. You should feel free to open PRs against the gCNV code as well!
@vruano As pointed out by @sooheelee, the current WDL does not clean up the intermediate CALLS_* and MODEL_* directories. This is fine for running on the cloud, but we should clean them up when running locally. Can you take care of this as well?
Dupe of #4397, but changing the name to reflect the issue mentioned in the previous comment.
CALLS_* and MODEL_* directories are actually cleaned up in #5414, but there are a few places where contig-ploidy calls are not cleaned up. We could also clean up the out directories generated by DetermineGermlineContigPloidy and GermlineCNVCaller, since the contents of these are sliced and tarred, but it's arguably nice to have all of the output for each shard in a single directory.
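For anyone running locally in the meantime, the cleanup discussed above can be sketched roughly as follows. This is only an illustration in Python, not what the WDL or #5414 actually does (a WDL command block would more likely use shell). The function name `remove_intermediate_dirs` and its `workdir` argument are hypothetical; only the CALLS_* and MODEL_* patterns come from this thread.

```python
import shutil
from pathlib import Path


def remove_intermediate_dirs(workdir, patterns=("CALLS_*", "MODEL_*")):
    """Delete directories directly under workdir that match the patterns.

    Hypothetical helper: returns the names of the directories removed.
    Plain files that happen to match a pattern are left alone.
    """
    removed = []
    for pattern in patterns:
        # sorted() keeps the order of removals deterministic
        for path in sorted(Path(workdir).glob(pattern)):
            if path.is_dir():
                shutil.rmtree(path)
                removed.append(path.name)
    return removed
```

Calling `remove_intermediate_dirs(".")` after the sliced and tarred outputs have been written would leave the out directories untouched, which keeps the per-shard output together as suggested above.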
* Cleaned up intermediate files in gCNV WDL and fixed miscellaneous typos. (#5382)
* Added output of MAD values as floats in somatic CNV WDL. (#5591)
* Exposed boot disk space for Oncotator in somatic CNV WDL. (#3566)
* Added check to skip outlier truncation if number of matrix elements exceeds Integer.MAX_VALUE in CreateReadCountPanelOfNormals. (#4734)
* Miscellaneous boy scout activities.
* Fixed some issues concerning intervals in DetermineGermlineContigPloidy documentation.
* Fixed non-kebab-case argument in CollectAllelicCountsSpark and other minor issues.
* Improved consistency of style and input/output validation across CNV tools. (#4825)
Due to the way the calling process is sharded upstream ... the current WDL spends 80-90% of its time copying files over ... so for a job that takes around 1 h, only about 10 minutes are spent running the GATK tool; the rest is spent staging the input files.
For example, if the genome intervals were sharded into ~600 different batches, this results in ~3600 files being transferred one by one, each with its own gsutil command. These are not batched in part because the files share names, and multi-file gsutil cp does not provide a means to indicate independent destination names for each input file. A recursive copy of a parent directory would drag in the information from all the samples, when each task deals with just one.
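One possible way around the name clash, sketched here in Python purely as an illustration: group the inputs so that no two files in a group share a basename, then issue one multi-file `gsutil -m cp` per group into its own destination directory. The helper `plan_copy_batches`, the `batch_N` directory layout, and the example URIs are all hypothetical, and the sketch only builds the command lines without running them. It also assumes the destination directories already exist and that the task can read its inputs from per-batch directories.

```python
def plan_copy_batches(uris, dest_root):
    """Build one multi-file `gsutil -m cp` command per collision-free batch.

    Hypothetical helper: each batch holds at most one file per basename,
    so copying a whole batch into a single directory cannot overwrite anything.
    """
    batches = []  # each batch maps basename -> source URI
    for uri in uris:
        name = uri.rsplit("/", 1)[-1]
        for batch in batches:
            if name not in batch:
                batch[name] = uri
                break
        else:
            # every existing batch already has this basename: start a new one
            batches.append({name: uri})

    commands = []
    for i, batch in enumerate(batches):
        dest = f"{dest_root}/batch_{i}/"
        commands.append(["gsutil", "-m", "cp", *batch.values(), dest])
    return commands


# Made-up URIs: two shards that both contain a file named x.tsv
example = plan_copy_batches(
    ["gs://b/s1/x.tsv", "gs://b/s2/x.tsv", "gs://b/s1/y.tsv"], "inputs"
)
```

With same-named files spread over many shards, the number of commands drops from one per file to one per batch. How much that helps depends on how many distinct basenames each shard contributes.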