
Invalid Docker spec when using GCP Batch Backend #7238

Closed
patmagee opened this issue Oct 16, 2023 · 10 comments


patmagee commented Oct 16, 2023

Bug

I am trying to run a workflow using the GCP Batch backend, but no matter what set of configurations I use, I cannot get it to succeed. The workflow's Batch job appears to fail on the third task, just after the Setup Container step. This causes essentially every task to fail with the same error:

docker: invalid spec: /mnt/disks/cromwell_root:/mnt/disks/cromwell_root:: empty section between colons.

This thread suggested that the logic in these two lines may be the culprit under specific conditions.
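The error itself is easy to understand in isolation: Docker splits a `-v`/`--volume` spec on colons into source, destination, and options, and the trailing colon in the generated spec leaves an empty third section, which Docker rejects. A minimal sketch of that parsing rule (a hypothetical validator for illustration, not Docker's actual code):

```python
def validate_volume_spec(spec: str) -> None:
    """Reject volume specs with empty colon-delimited sections,
    mimicking the Docker error seen in this issue."""
    parts = spec.split(":")
    if len(parts) > 3:
        raise ValueError(f"invalid spec: {spec}: too many colons")
    for part in parts:
        if part == "":
            raise ValueError(
                f"invalid spec: {spec}: empty section between colons"
            )

# The spec Cromwell generated ends with a stray colon, so the third
# section is empty and Docker rejects it:
bad = "/mnt/disks/cromwell_root:/mnt/disks/cromwell_root:"
# Dropping the trailing colon yields a valid source:destination spec:
good = "/mnt/disks/cromwell_root:/mnt/disks/cromwell_root"
validate_volume_spec(good)
```

This matches the fix discussed later in the thread: the mount string Cromwell builds from the GCP Batch response needs to not carry a trailing colon.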

Information

Cromwell Version: 87-c9d4ce4

Backend: GCP Batch

version 1.0

task hello {

  input {
   String name
  }
  command <<<
    echo 'hello ~{name}!'
  >>>

  output {
    File response = stdout()
  }

  runtime {
    docker: "ubuntu:latest"
    cpu: 1
    memory: "3.75 GB"
  }
}
workflow test {
  call hello

  output {
    File response = hello.response
  }
}
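For completeness, the workflow above takes a single input; an inputs file for it would look like the following (the value is illustrative), using Cromwell's standard `workflow.call.input` key convention:

```json
{
  "test.hello.name": "world"
}
```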
backend {
  default = "batch"
  providers {
    batch {
      actor-factory = "cromwell.backend.google.batch.GcpBatchBackendLifecycleActorFactory"
      config {

        # The Project To execute in
        project = "${compute_project}"

        # The bucket where outputs will be written to
        root = "gs://${bucket}"

        # Polling for completion backs-off gradually for slower-running jobs.
        # This is the maximum polling interval (in seconds):
        maximum-polling-interval = 600

        # Optional configuration to use high security network (Virtual Private Cloud) for running jobs.
        # See https://cromwell.readthedocs.io/en/stable/backends/Google/ for more details.
        # virtual-private-cloud {
        #  network-label-key = "network-key"
        #  auth = "application-default"
        # }

        # Global pipeline timeout
        # Defaults to 7 days; max 30 days
        # batch-timeout = 7 days

        genomics {
          auth = "cromwell-service-account"
          location: "${region}"
          compute-service-account = "${compute_service_account}"

          # Specifies the minimum file size for `gsutil cp` to use parallel composite uploads during delocalization.
          # Parallel composite uploads can result in a significant improvement in delocalization speed for large files
          # but may introduce complexities in downloading such files from GCS, please see
          # https://cloud.google.com/storage/docs/gsutil/commands/cp#parallel-composite-uploads for more information.
          #
          # If set to 0 parallel composite uploads are turned off. The default Cromwell configuration turns off
          # parallel composite uploads, this sample configuration turns it on for files of 150M or larger.
          parallel-composite-upload-threshold="150M"
        }

        filesystems {
          gcs {
            auth = "cromwell-service-account"

            # For billing
            project = "${billing_project}"

            caching {
              # When a cache hit is found, the following duplication strategy will be followed to use the cached outputs
              # Possible values: "copy", "reference". Defaults to "copy"
              # "copy": Copy the output files
              # "reference": DO NOT copy the output files but point to the original output files instead.
              #              Will still make sure than all the original output files exist and are accessible before
              #              going forward with the cache hit.
              duplication-strategy = "copy"
            }

          }
          http {}
        }

        # Important!! Some of the workflows take an excessive amount of time to run
        batch-timeout = 28 days

        default-runtime-attributes {
          cpu: 1
          failOnStderr: false
          continueOnReturnCode: 0
          memory: "2 GB"
          bootDiskSizeGb: 10
          # Allowed to be a String, or a list of Strings
          disks: "local-disk 10 SSD"
          noAddress: true
          preemptible: 0
          docker: "ubuntu:latest"
        }

        virtual-private-cloud {
          network-name = "${private_network}"
          subnetwork-name = "${private_subnet}"
        }
      }
    }
  }
}
@Michal-Babins

I am running into this exact same issue:

docker: invalid spec: /mnt/disks/cromwell_root:/mnt/disks/cromwell_root:: empty section between colons.

With a very similar backend config. I was able to run this workflow last week while testing, and only noticed this behavior this week. I'm not sure what GCP-side change may have caused it.

dspeck1 (Collaborator) commented Oct 19, 2023

A fix is in progress. The cause was a change in how GCP Batch returns the mount specification.

jgainerdewar (Collaborator)

Fixed in #7240

patmagee (Author)

Hey @dspeck1 or @jgainerdewar, just wondering if there is a plan for a patch release to fix the current Cromwell release?

jgainerdewar (Collaborator)

We don't typically patch releases, but we do make the latest version from develop available as a Docker image. Folks who want the latest changes between major releases are advised to use these development versions, e.g. 87-225ea5a, which are named with the next major version and the short hash of the merge commit.

https://hub.docker.com/r/broadinstitute/cromwell/tags


kopalgarg24 commented Feb 21, 2024

Thanks for the Docker image! Just wondering, when should we expect the next Cromwell release?

aednichols (Collaborator)

We recommend broadinstitute/cromwell:latest which updates regularly.

patmagee (Author)

@aednichols understood, however I would really recommend at least a patch release. Building downstream reliance on the latest Docker image of any software (especially when latest appears to represent a SNAPSHOT rather than a release) is a recipe for breaking your system, since it allows unanticipated changes to be applied.

aednichols (Collaborator)

I hear you @patmagee. Cromwell has had a SaaS continuous development model for a few years now, with new code going to Terra daily.

We learned that most standalone Cromwell users at the Broad upgrade infrequently, such as every 6-12 months. Thus, we committed to putting in the effort for two "shrink-wrapped" releases per year so we can balance SaaS with standalone.

aednichols (Collaborator)

Also, the shrink-wrapped releases never had any additional testing done compared to the daily snapshots; the daily system works for us because we run a ton of tests on every PR.
