Standardized use of mem parameters in CNV WDLs. #4193

Merged
samuelklee merged 3 commits into master from sl_standardize_mem_wdl
Jan 22, 2018

@samuelklee samuelklee commented Jan 17, 2018

We use the machine_mem/command_mem framework for most tasks; others are unlikely to need any special memory considerations.

There were some tasks for which machine_mem and command_mem seemed to be switched in the original WDL. @jsotobroad was there any reason for this? I'm guessing they were just typos. I'm also not sure that some of the tasks would've actually run even if they were switched, since they would've resulted in non-integer -Xmx arguments. I changed everything over to MB to avoid this.

Closes #4092.
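
For reference, a minimal sketch of the machine_mem/command_mem pattern described above, assuming the pre-1.0 WDL syntax used in the snippets under review. The task name `ExampleTask`, the 7 GB default, and the 1000 MB overhead are illustrative assumptions, and the `*_mem_gb`/`*_mem_mb` names are the ones agreed on later in this review. Keeping everything in integer MB avoids fractional heap sizes such as `-Xmx7.5g`, which the JVM rejects.

```wdl
# Illustrative sketch only; not taken from the PR diff.
task ExampleTask {
    String gatk_docker
    Int? mem_gb                # user-supplied machine memory, in GB

    # Convert to integer MB once, then derive the JVM heap from it.
    Int machine_mem_mb = select_first([mem_gb, 7]) * 1000
    Int command_mem_mb = machine_mem_mb - 1000   # assumed overhead, in MB

    command <<<
        set -e
        gatk --java-options "-Xmx${command_mem_mb}m" PreprocessIntervals --help
    >>>

    runtime {
        docker: "${gatk_docker}"
        memory: machine_mem_mb + " MB"
    }
}
```

Both the `-Xmx` argument and the runtime `memory` attribute are derived from the same declaration, so they cannot drift apart when `mem_gb` is or is not supplied.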

@samuelklee
Contributor Author

@davidbenjamin mind reviewing? @LeeTL1220 take note for the style guide, if necessary.

@samuelklee samuelklee force-pushed the sl_standardize_mem_wdl branch from 785f93d to f9b2701 Compare January 17, 2018 23:23
@codecov-io

codecov-io commented Jan 18, 2018

Codecov Report

Merging #4193 into master will increase coverage by 0.013%.
The diff coverage is n/a.

@@               Coverage Diff               @@
##              master     #4193       +/-   ##
===============================================
+ Coverage     78.475%   78.489%   +0.013%     
  Complexity     16645     16645               
===============================================
  Files           1061      1061               
  Lines          59866     59866               
  Branches        9756      9756               
===============================================
+ Hits           46980     46988        +8     
+ Misses          9103      9097        -6     
+ Partials        3783      3781        -2
| Impacted Files | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| ...park/sv/discovery/alignment/AlignmentInterval.java | 90.038% <0%> (+0.383%) | 74% <0%> (ø) | ⬇️ |
| ...nder/utils/runtime/StreamingProcessController.java | 71.193% <0%> (+0.823%) | 50% <0%> (ø) | ⬇️ |
| ...oadinstitute/hellbender/utils/gcs/BucketUtils.java | 80% <0%> (+1.29%) | 39% <0%> (ø) | ⬇️ |
| ...e/hellbender/engine/spark/SparkContextFactory.java | 73.973% <0%> (+2.74%) | 11% <0%> (ø) | ⬇️ |
| ...utils/smithwaterman/SmithWatermanIntelAligner.java | 90% <0%> (+10%) | 3% <0%> (ø) | ⬇️ |

Contributor

@jsotobroad jsotobroad left a comment

Yay standardization

@@ -92,6 +92,9 @@ task CollectCounts {
Int? preemptible_attempts
Int? disk_space_gb

Int machine_mem = select_first([mem, 8]) * 1000
Contributor

As far as small optimizations go, if these tasks that ask for 8000 MB by default can make do with 7500 MB, then you can save 25% on GCP compute (n1-standard-2 vs. n1-highmem-2): https://cloud.google.com/compute/pricing#machinetype

Contributor Author

OK, good to know! No particular reason these need Xmx8G, so I'll change them.

@@ -137,8 +140,7 @@ task CollectAllelicCounts {
Int? preemptible_attempts
Int? disk_space_gb

# Mem is in units of GB but our command and memory runtime values are in MB
Contributor

Somewhere it should be made clear that the user-provided mem input should be defined in units of GB.

Contributor Author

Any objection to simply using *_mem_gb and *_mem_mb everywhere? I'd rather not duplicate this comment everywhere.
@LeeTL1220 @davidbenjamin?

Contributor

@samuelklee Fine by me...

Contributor Author

Done.

Contributor

Any objection to simply using *_mem_gb and *_mem_mb everywhere?

That is sufficiently self-documenting for my tastes.

@@ -110,7 +113,7 @@ task CollectCounts {

runtime {
docker: "${gatk_docker}"
Contributor

As per @LeeTL1220's template, we should be putting in values for cpu even in the single-threaded case.

Contributor Author

I didn't check to make sure all of the tasks adhered to the template just yet, since I'm assuming that it might still change. (Actually, I don't think any of the somatic CNV tasks specify cpu.) I'll focus on the mem changes in this PR and overhaul the rest later (perhaps once tasks are automatically generated), if you don't mind!

Contributor

@samuelklee Fine by me.

@samuelklee samuelklee force-pushed the sl_standardize_mem_wdl branch from f9b2701 to 0bb9c95 Compare January 18, 2018 15:21
@samuelklee
Contributor Author

Will merge when tests pass, unless there are further objections.

@LeeTL1220
Contributor

@samuelklee From my perspective, merge away, assuming that you are doing the cpu change later.

Contributor

@davidbenjamin davidbenjamin left a comment

Consistency is the hobgoblin of my review.

String gatk_docker
Int? preemptible_attempts
Int? disk_space_gb

Int machine_mem_mb = select_first([mem_gb * 1000, 7500])
Int command_mem_mb = machine_mem_mb - 1000
Contributor

Why don't you do this in the tasks above?

Contributor Author

Thanks for catching this. Actually, I think this makes the tests fail; apparently you can't perform the multiply operation inside select_first.
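
A sketch of two ways around this, assuming the failure comes from doing arithmetic on the optional `mem_gb` before `select_first` has picked a defined value. `SelectFirstExample` and `machine_mem_mb_alt` are hypothetical names used only for illustration; the `if defined(...) then select_first([...]) else ...` form mirrors the idiom already present in the original CollectAllelicCounts declaration.

```wdl
# Illustrative sketch only; not taken from the PR diff.
task SelectFirstExample {
    Int? mem_gb

    # Option 1: resolve the optional first, then multiply.
    # The default is 7 GB, i.e. 7000 MB.
    Int machine_mem_mb = select_first([mem_gb, 7]) * 1000

    # Option 2: keep a 7500 MB default by branching on whether
    # mem_gb was supplied, so the multiply only sees a defined Int.
    Int machine_mem_mb_alt = if defined(mem_gb) then select_first([mem_gb]) * 1000 else 7500

    command <<<
        echo "machine: ${machine_mem_mb} MB, alternative: ${machine_mem_mb_alt} MB"
    >>>
}
```

Option 1 is shorter but cannot express a default that is not a whole number of GB; option 2 keeps the 7500 MB default at the cost of a longer declaration.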

@@ -69,7 +69,7 @@ task AnnotateIntervals {

runtime {
docker: "${gatk_docker}"
- memory: select_first([mem, 5]) + " GB"
+ memory: select_first([mem_gb, 5]) + " GB"
Contributor

This is strange because if mem_gb is supplied the machine memory and command memory are the same, but if it's not supplied they use different defaults.

Contributor Author

@samuelklee samuelklee Jan 18, 2018

Oops, good catch. I think for some of these more minor tasks we didn't do the whole machine_mem/command_mem thing, as noted above. I'll just go back and standardize everything.

@@ -21,7 +21,7 @@ task PreprocessIntervals {
set -e
export GATK_LOCAL_JAR=${default="/root/gatk.jar" gatk4_jar_override}

- gatk --java-options "-Xmx${default="2" mem}g" PreprocessIntervals \
+ gatk --java-options "-Xmx${default="2" mem_gb}g" PreprocessIntervals \
Contributor

I don't feel great about using Xmx_g in some places and Xmx_m in others.

@@ -137,8 +140,7 @@ task CollectAllelicCounts {
Int? preemptible_attempts
Int? disk_space_gb

# Mem is in units of GB but our command and memory runtime values are in MB
- Int machine_mem = if defined(mem) then select_first([mem]) else 8
- Float command_mem = machine_mem - 0.5
+ Int machine_mem_mb = select_first([mem_gb, 7]) * 1000
+ Int command_mem_mb = machine_mem_mb - 500
Contributor

Elsewhere you have a 1GB difference.

# ModelSegments seems to need at least 3GB of overhead to run
- Int command_mem = machine_mem - 3000
+ Int command_mem_mb = machine_mem_mb - 3000
Contributor

Does this make sense to you? I mean, shouldn't the difference between command and machine memory just be the OS, hence independent of the GATK command?

Contributor Author

Not sure, this was from @jsotobroad. I'd be fine with standardizing the difference everywhere if that works.

Contributor

@jsotobroad can answer this, @davidbenjamin .

@samuelklee
Contributor Author

Responded to @davidbenjamin. We still have differences of machine_mem_mb - command_mem_mb = 500, 1000, and 3000. @jsotobroad, any reason for this? Can we make everything the same?
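
One way the overhead could be made uniform, sketched under the assumption that a single default suits most tasks and that a task such as ModelSegments could override it. `OverheadExample` and `mem_overhead_mb` are hypothetical names and are not part of the existing WDLs; the 1000 MB default is likewise only a placeholder.

```wdl
# Illustrative sketch only; mem_overhead_mb is a hypothetical input.
task OverheadExample {
    Int? mem_gb
    Int? mem_overhead_mb   # MB reserved for everything outside the JVM heap

    Int machine_mem_mb = select_first([mem_gb, 7]) * 1000
    Int command_mem_mb = machine_mem_mb - select_first([mem_overhead_mb, 1000])

    command <<<
        echo "-Xmx${command_mem_mb}m"
    >>>

    runtime {
        memory: machine_mem_mb + " MB"
    }
}
```

With this shape, the 500/1000/3000 MB variants would become explicit input values rather than per-task constants, which would make any task-specific overhead visible at the call site.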

@LeeTL1220
Contributor

I'm good with this if @jsotobroad is good too.

@davidbenjamin
Contributor

Me too.

@davidbenjamin davidbenjamin removed their assignment Jan 18, 2018
@samuelklee
Contributor Author

@jsotobroad I'm going to go ahead and merge this and file an issue for the difference in memory overhead.

@samuelklee samuelklee merged commit 32f25b9 into master Jan 22, 2018
@samuelklee samuelklee deleted the sl_standardize_mem_wdl branch January 22, 2018 16:10
jonn-smith added a commit that referenced this pull request Jun 19, 2018
`XsvTableFeature` no longer removes an extra column if start and end in
the config file for a `LocatableXsv` data source are the same.
jonn-smith added a commit that referenced this pull request Jun 19, 2018
…ons (#4915)

`XsvTableFeature` no longer removes an extra column if start and end in
the config file for a `LocatableXsv` data source are the same.

Fixes #4193