
[WIP] Adding Cloud Pipelines #90

Merged

Conversation

vsoch
Contributor

@vsoch commented Nov 5, 2019

Do not merge - this is a draft pull request!

This PR will encompass early work to add Google Cloud Pipelines to Snakemake. Specifically this means that:

  • the user selects --google-life-sciences provider (boolean) when running snakemake
  • GOOGLE_APPLICATION_CREDENTIALS and GOOGLE_PROJECT should be exported
  • a tarball of the working directory, named by its content hash, is uploaded to Google Storage (see the sketch after this list)
  • the worker uses a custom image to download and extract the working directory, and then run the script for the step in question.
  • the storage tarballs are deleted after the workflow completes, only if the user hasn't specified a flag to keep the cache.
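
For illustration, here is a minimal sketch (not the PR's actual code) of the packaging step described in the third bullet; the bucket name and object path are placeholders, and it assumes the google-cloud-storage client library:

    import hashlib
    import tarfile

    from google.cloud import storage


    def package_and_upload_workdir(workdir, bucket_name="my-snakemake-bucket"):
        """Tar the working directory, name the archive by its content hash,
        and upload it to Google Storage so the worker VM can fetch it."""
        archive = "/tmp/workdir.tar.gz"
        with tarfile.open(archive, "w:gz") as tar:
            tar.add(workdir, arcname=".")

        # Name the archive by the hash of its bytes, as described above.
        sha256 = hashlib.sha256()
        with open(archive, "rb") as fh:
            for chunk in iter(lambda: fh.read(8192), b""):
                sha256.update(chunk)
        blob_name = "source/cache/%s.tar.gz" % sha256.hexdigest()

        client = storage.Client()  # picks up GOOGLE_APPLICATION_CREDENTIALS
        bucket = client.bucket(bucket_name)
        blob = bucket.blob(blob_name)
        if not blob.exists():
            blob.upload_from_filename(archive)
        return blob_name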

We have several technical questions that still need answering, which I'll summarize here:

  • upload of environment variables vs. secrets
  • another method to discover application credentials and project (without exporting the envar)
  • how to provide an entrypoint + command with && (we've tried almost all variations of this, and either the second command seems to run outside of the container, or the entire thing is parsed together and the result is "command not found"); see the sketch after this list
  • privileged containers OR custom addition of permissions (to allow running Singularity)
  • difference between action and resource environment
  • it would be cool if Google provided a function to choose a machine type based on these criteria, seems like it would be a common thing to need!
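
On the entrypoint/&& question above, one common workaround (a sketch, not necessarily what this PR settles on) is to make the shell the entrypoint and pass the whole chained command as a single argument to -c, so the && is interpreted inside the container rather than by the API layer; the image URI, storage URL, and commands here are placeholders:

    # Hypothetical Life Sciences v2beta action: the container image, storage
    # URL, and chained command are placeholders.
    action = {
        "imageUri": "snakemake/snakemake:latest",
        "entrypoint": "/bin/bash",
        "commands": [
            "-c",
            "wget -O /tmp/workdir.tar.gz $STORAGE_URL && "
            "tar -xzf /tmp/workdir.tar.gz -C /workdir && "
            "cd /workdir && snakemake",
        ],
    }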

A few notes for some choices I made for the implementation:

  • As discussed in our meeting, I was able to use the Google Discovery Build function against the manifest endpoint to create a client.
  • regions is another variable that could be derived from some call to the API (to get a default region), but most likely the requester for storage objects would want to choose a region that is closest to the objects, which might not be where they are. The default regions are all zones in the US, which I chose on the guess that the majority of Snakemake / GCP users are in the US. This might be entirely wrong, or a bad default.
  • the machine type is selected by comparing the requested memory and cores to those available, first eliminating the ones that are too small, and then choosing the smallest of those remaining. If the user provides a machine type prefix, we further filter the choices down to that (a sketch follows this list).
  • Google Storage (for public objects) often returns a 503 error (it popped up several times for me in testing yesterday); if this happens again, it might be a separate issue for the GS remote to look into handling - for most Google Cloud API calls, retrying is essential.
  • speaking of retrying - the functions here are run through a client function I wrote that retries three times and only then errors, because BrokenPipeError (and other random failures) are common enough to warrant it (also sketched below).
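
A rough sketch of the machine type selection just described (simplified; the function name is illustrative and the field names follow the Compute Engine machineTypes resource):

    def select_machine_type(machine_types, mem_mb, cores, prefix=None):
        """Pick the smallest machine type that satisfies the requested
        memory and cores, optionally restricted to a name prefix."""
        candidates = [
            m for m in machine_types
            if m["memoryMb"] >= mem_mb and m["guestCpus"] >= cores
        ]
        if prefix:
            candidates = [m for m in candidates if m["name"].startswith(prefix)]
        if not candidates:
            raise ValueError("No machine type satisfies the requested resources.")
        # Smallest remaining candidate: sort by cores, then by memory.
        return min(candidates, key=lambda m: (m["guestCpus"], m["memoryMb"]))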

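And a minimal version of the retry wrapper mentioned in the last bullet (a sketch, assuming a short, growing pause between attempts):

    import time


    def retry_request(func, *args, retries=3, wait=5, **kwargs):
        """Run an API request, retrying on transient failures such as
        BrokenPipeError, and only raise after the final attempt."""
        for attempt in range(1, retries + 1):
            try:
                return func(*args, **kwargs)
            except (BrokenPipeError, ConnectionError, OSError):
                if attempt == retries:
                    raise
                time.sleep(wait * attempt)
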
I suspect this will fail linting and I'll push again with fixes :)

Signed-off-by: Vanessa Sochat <[email protected]>
…eractive code bits to figure out where the entrypoint for an executor run is

Signed-off-by: Vanessa Sochat <[email protected]>
Signed-off-by: Vanessa Sochat <[email protected]>
…alk through and test for basic example

Signed-off-by: Vanessa Sochat <[email protected]>
…dir to storage cache, and then the instance can retrieve it.

Signed-off-by: Vanessa Sochat <[email protected]>
…rom pip and install from this branch

Signed-off-by: Vanessa Sochat <[email protected]>
Signed-off-by: Vanessa Sochat <[email protected]>
…achine type prefix, ready for draft PR

Signed-off-by: Vanessa Sochat <[email protected]>
@vsoch
Contributor Author

vsoch commented Nov 5, 2019

@johanneskoester this is a little confusing with black, because I installed the exact version that is run in the test suites, and there seems to be a difference in result. Here is what I'm doing locally:

$ black --check snakemake tests/*.py
All done! ✨ 🍰 ✨
51 files would be left unchanged.
$ black --version
black, version 19.3b0

Note that the version is the same as the one installed there. I also tried installing with conda in case it made a difference (it didn't). What am I missing?

…oogle cloud project on command line (derived from storage client or optional environment to override that)

Signed-off-by: Vanessa Sochat <[email protected]>
@vsoch
Contributor Author

vsoch commented Nov 5, 2019

That's so weird! The black formatting passed this time. I don't think I made any changes to that file. Must be ghosties.

@johanneskoester
Contributor

@vsoch did you select a draft PR? I still have a merge button here and the icon is green instead of grey in the list.

@vsoch
Contributor Author

vsoch commented Nov 5, 2019

oops I just totally forgot :/ Merging is still blocked so is it okay? I'll add WIP to the heading so others know.

@vsoch changed the title from "Adding Cloud Pipelines" to "[WIP] Adding Cloud Pipelines" Nov 5, 2019
@johanneskoester
Contributor

Yeah, no problem ;-).

@johanneskoester
Contributor

xref #91

@vsoch
Contributor Author

vsoch commented Apr 30, 2020

@johanneskoester I'll want to ask for your suggestions for how to reduce "cognitive complexity" of functions - I made them modular at a level I thought made sense, and most of the length for the ones flagged is due to really long WorkflowError messages. I'm happy to refactor in any way you think is better.

@vsoch
Contributor Author

vsoch commented May 8, 2020

Not sure what is going on with the build container image step, but it's been going for over 2 hours.


@johanneskoester
Contributor

That issue is resolved now.

@johanneskoester
Contributor

@johanneskoester I'll want to ask for your suggestions for how to reduce "cognitive complexity" of functions - I made them modular at a level I thought made sense, and most of the length for the ones flagged is due to really long WorkflowError messages. I'm happy to refactor in any way you think is better.

I have looked at them again, and I think they are indeed ok to remain like this. Sorry for the noise.

@johanneskoester
Contributor

GCloud credentials are now in place. I think you should be able to activate your test case here. In line with the kubernetes and tibanna tests, please move your test case into a separate file, tests/test_google_life_sciences.py, which shall be called in a separate job step.
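
For reference, a minimal sketch of what such a test file might look like, following the pattern of the existing kubernetes test; the run/dpath helpers come from tests/common.py, while the keyword arguments and remote prefix here are assumptions:

    import os
    import sys

    # Make the shared test helpers importable.
    sys.path.insert(0, os.path.dirname(__file__))
    from common import dpath, run


    def test_google_lifesciences():
        # Keyword names and the remote prefix are placeholders for this sketch.
        workdir = dpath("test_google_lifesciences")
        run(
            workdir,
            default_remote_prefix="snakemake-testing/test",
            google_lifesciences=True,
            google_lifesciences_cache=False,
        )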

@vsoch
Contributor Author

vsoch commented May 8, 2020

@johanneskoester what is the name of the envar secret - did you just call it GOOGLE_APPLICATION_CREDENTIALS? I'm adding the step in main.yml now, should be able to push to test as soon as I know that name!

@vsoch
Contributor Author

vsoch commented May 8, 2020

E.g., this is what I'd guess, let me know if I'm off

    env:
      AWS_AVAILABLE: ${{ secrets.AWS_ACCESS_KEY_ID }}
      GCP_AVAILABLE: ${{ secrets.GCP_SA_KEY }}
      GOOGLE_APPLICATION_CREDENTIALS: ${{ secrets.GOOGLE_APPLICATION_CREDENTIALS }}

@vsoch
Contributor Author

vsoch commented May 8, 2020

Dearest sonarcloud - you've had a change of heart! Because before you said my code was smelly. Maybe it's good smelly now :)

.github/workflows/main.yml - two review comments (outdated, resolved)
vsoch and others added 2 commits May 10, 2020 14:58
Reported to already be set without needing to be specified here.

Co-authored-by: Johannes Köster <[email protected]>
@sonarcloud

sonarcloud bot commented May 10, 2020

Kudos, SonarCloud Quality Gate passed!

  • Bugs: 0 (rating A)
  • Vulnerabilities: 0 (rating A), plus 2 Security Hotspots to review
  • Code Smells: 0 (rating A)
  • Coverage: no coverage information
  • Duplication: 0.1%

@vsoch
Contributor Author

vsoch commented May 11, 2020

@johanneskoester the test still did not run - I can't see settings/secrets, so I'll need your guidance about what needs to be checked.

@johanneskoester changed the base branch from master to google-life-sciences May 11, 2020 08:19
@johanneskoester
Contributor

You are on a fork. GitHub does not expose secrets to forks.

@johanneskoester merged commit 8fc06be into snakemake:google-life-sciences May 11, 2020
@johanneskoester
Contributor

Let's continue in PR #384.

johanneskoester added a commit that referenced this pull request May 11, 2020
* starting figuring out how to not remove wrappers

Signed-off-by: Vanessa Sochat <[email protected]>

* early work to add gcloud executor

Signed-off-by: Vanessa Sochat <[email protected]>

* getting basic bucket creation and test upload working, and adding interactive code bits to figure out where the entrypoint for an executor run is

Signed-off-by: Vanessa Sochat <[email protected]>

* updating to show debugging

Signed-off-by: Vanessa Sochat <[email protected]>

* moving change of working directory to after parsing config files

Signed-off-by: Vanessa Sochat <[email protected]>

* removing old cleanup wrappers logic

Signed-off-by: Vanessa Sochat <[email protected]>

* removing old cleanup wrappers logic

Signed-off-by: Vanessa Sochat <[email protected]>

* work on base skeleton - pipeline requests are created and I need to walk through and test for basic example

Signed-off-by: Vanessa Sochat <[email protected]>

* lots of work on executor for Google Life Sciences - we upload the workdir to storage cache, and then the instance can retrieve it.

Signed-off-by: Vanessa Sochat <[email protected]>

* release 5.7.4 has a bug looking for a workflow to have a workflow attribute

Signed-off-by: Vanessa Sochat <[email protected]>

* more bug fixes to google science pipeline, also want to use version from pip and install from this branch

Signed-off-by: Vanessa Sochat <[email protected]>

* adding docker base example to google cloud tests

Signed-off-by: Vanessa Sochat <[email protected]>

* omg successful run! ✨

Signed-off-by: Vanessa Sochat <[email protected]>

* restoring to original (working) execution with custom image, adding machine type prefix, ready for draft PR

Signed-off-by: Vanessa Sochat <[email protected]>

* fixing formatting errors with black

Signed-off-by: Vanessa Sochat <[email protected]>

* using same version of black as the testing

Signed-off-by: Vanessa Sochat <[email protected]>

* testing again after reformatting, removing need for user to specify google cloud project on command line (derived from storage client or optional environment to override that)

Signed-off-by: Vanessa Sochat <[email protected]>

* adding warning about environment not being secret for google life sciences

Signed-off-by: Vanessa Sochat <[email protected]>

* updating Google Life Sciences executor to programmatically select machineType and add verbose comments about envars not being secrets and selection of machine types

Signed-off-by: Vanessa Sochat <[email protected]>

* Update snakemake/executors.py

Co-Authored-By: Johannes Köster <[email protected]>

* changing default arguments for regions to be included in arg parser

Signed-off-by: Vanessa Sochat <[email protected]>

* adding hashlib function to snakemake/common.py, tweaks to default arguments and function names

Signed-off-by: Vanessa Sochat <[email protected]>

* renaming google_life_sciences to google_lifesciences and using job resources disk_mb with padding of 10gb

Signed-off-by: Vanessa Sochat <[email protected]>

* custom entrypoint / cmd is working for snakemake base!

Signed-off-by: Vanessa Sochat <[email protected]>

* imports went away?

Signed-off-by: Vanessa Sochat <[email protected]>

* updating common and executors to fix some SonarCloud linting issues

Signed-off-by: Vanessa Sochat <[email protected]>

* trivial use of exec_job to fix SonarCloud "bug"

Signed-off-by: Vanessa Sochat <[email protected]>

* updating machine type selection to take first in list before filtering to smallest

Signed-off-by: Vanessa Sochat <[email protected]>

* why was stylesheet changed?

Signed-off-by: Vanessa Sochat <[email protected]>

* why was stylesheet changed also in utils?

Signed-off-by: Vanessa Sochat <[email protected]>

* updating life science executor to only include source files and working directory .snakemake scripts, also accidentally black formatted python scripts in tests, should not be an issue

Signed-off-by: Vanessa Sochat <[email protected]>

* not sure why WorkflowError isn't defined for a file I didn't edit, how did it pass other CI?

Signed-off-by: Vanessa Sochat <[email protected]>

* fixing bug with merge with master - an extra few lines were kept with a function (and should not have been)

Signed-off-by: Vanessa Sochat <[email protected]>

* refactoring to not require additional scripts (passing exec_job directly to entrypoint) and then uploading only one archive

Signed-off-by: Vanessa Sochat <[email protected]>

* e2 prefix machines don't seem to work for the google life sciences api

Signed-off-by: Vanessa Sochat <[email protected]>

* gcp life sciences cannot currently support m1 or e2 instance types, adding m1 to be filtered out for machine selection

Signed-off-by: Vanessa Sochat <[email protected]>

* import of time should then use time.sleep

Signed-off-by: Vanessa Sochat <[email protected]>

* fixing bug where archive files were silently not being added, and adding a more robust function to derive the relative snakefile path

Signed-off-by: Vanessa Sochat <[email protected]>

* bump current container image to v5.10.0 and add debug statements to show

Signed-off-by: Vanessa Sochat <[email protected]>

* updating Google Life Sciences to use get_container_image for latest container version

Signed-off-by: vsoch <[email protected]>

* adding default-resources setting for google-lifesciences and better message when resources exceeded.

Signed-off-by: vsoch <[email protected]>

* be more specific about resource limits for LHS

Signed-off-by: vsoch <[email protected]>

* locations api now needs to be used since there is more than one location; '*' no longer works

Signed-off-by: vsoch <[email protected]>

* adding parameter for location, and default to region or prefix of region

Signed-off-by: vsoch <[email protected]>

* google import should not be at top of file, won't work locally

Signed-off-by: vsoch <[email protected]>

* import google for wrong function

Signed-off-by: vsoch <[email protected]>

* must use lowercase and not camel case for variables

Signed-off-by: vsoch <[email protected]>

* if resources are requested, update default resources

Signed-off-by: vsoch <[email protected]>

* updating event to use debug instead of info

Signed-off-by: vsoch <[email protected]>

* updating stderr event to use debug instead of error

Signed-off-by: vsoch <[email protected]>

* missing adding the --skip-script-cleanup argument to the base job, need to test

Signed-off-by: vsoch <[email protected]>

* adding cleaner implementation for --skip-script-cleanup to be shared amongst executors

Signed-off-by: vsoch <[email protected]>

* google_lifesciences needs to be treated as a non-local exec

Signed-off-by: vsoch <[email protected]>

* adding install of crc32c library for remote

Signed-off-by: vsoch <[email protected]>

* adding support for google-lifesciences gpu

To add support for GPUs, the user is limited to n1 (general purpose) machine
types, and we do this by setting self._machine_type_prefix to n1 (if it does
not already start with n1). We can then filter to accelerator types that
support at least as many GPUs as the user has requested, and choose from that
set (a sketch of this selection follows below). With the current implementation
the P100 type will be the first chosen, which makes sense as (I think) it is
the cheapest GPU option, something likely wanted by researchers running these
at scale. In the future (when someone asks for it) we can add a flag that
allows the user to further specify a GPU accelerator type, but for now this
seems reasonable. Once the accelerator is added to the VM, we can confirm
that it creates an instance with GPU support.

Signed-off-by: vsoch <[email protected]>
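
A simplified sketch of the accelerator selection described in the commit message above (illustrative only; the field name follows the Compute Engine acceleratorTypes resource):

    def select_accelerator(accelerator_types, gpu_count):
        """Choose the first accelerator type whose maximum card count covers
        the requested number of GPUs (e.g. nvidia-tesla-p100)."""
        usable = [
            a for a in accelerator_types
            if a["maximumCardsPerInstance"] >= gpu_count
        ]
        if not usable:
            raise ValueError("No accelerator type supports %s GPUs." % gpu_count)
        return usable[0]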

* snakefile in tests/common.py should switch between tempdir and original

currently, some tests use the snakefile in the temporary working
directory, and others use the original. There is a flag, no_tmpdir,
that can distinguish this for the working directory specification,
and the same needs to be done for the snakefile, as different
tests have different expectations for its location.

Signed-off-by: vsoch <[email protected]>

* fixing test to not expect file to be in same directory

Signed-off-by: vsoch <[email protected]>

* Google Life Sciences executor must use relative paths for build package

Currently, if the user provides a path for a snakefile that is not relative to
the working directory where the build package is generated, the replace function
(which replaces the workdir in each file path) will not work, and the build
package will be generated with a hard-coded path from the user's system. Since
our options are to muck around with moving files or to hackily change paths for
the included files (not ideal), a much cleaner solution is to do something
similar to what a docker build does and require the context to be at, or within
some subfolder of, the working directory (a sketch of the check follows below).
Most runs that I've seen will be for a Snakefile in a GitHub repo, meaning the
working directory includes the snakefile, so I think this will be a rare use
case. I still want to catch it, in case it does occur!

Signed-off-by: vsoch <[email protected]>
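
As an illustration of the constraint described above, the check could look roughly like this (a sketch with an illustrative function name, not the PR's exact code):

    import os

    from snakemake.exceptions import WorkflowError


    def check_workdir_context(snakefile, workdir):
        """Require the Snakefile to live at or below the working directory,
        much like a docker build requires files to be inside its context."""
        snakefile = os.path.realpath(snakefile)
        workdir = os.path.realpath(workdir)
        if not snakefile.startswith(workdir + os.sep):
            raise WorkflowError(
                "The Snakefile must be in the working directory (or a "
                "subfolder of it) to build the upload package."
            )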

* remove unneeded line

Signed-off-by: vsoch <[email protected]>

* adding machine and gpu_model job resources

this will add two optional job resources to the Google Life Sciences executor:
a machine (to mirror --machine-type-prefix) and a gpu_model (for the user to
select from the cloud or vendor-specific namespace of GPU models, e.g.,
nvidia-tesla-p4). If the user specifies a gpu_model without gpu or nvidia_gpu,
the count will default to 1. The --machine-type-prefix command line arg also
overrides the job resource, since this could be desired. The docs are also
updated with these changes (an example rule follows below).

Signed-off-by: vsoch <[email protected]>
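
For example, a rule using these resources might look like the following (a hypothetical rule; note that a later commit in this listing renames machine to machine_type, and per the note below the string values must be quoted):

    rule train_model:
        input:
            "data/training.csv",
        output:
            "results/model.bin",
        resources:
            machine="n1-standard-8",
            gpu_model="nvidia-tesla-p4",
            nvidia_gpu=1,
        shell:
            "python train.py {input} {output}"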

* typo model -> gpu_model

Signed-off-by: vsoch <[email protected]>

* string resources should be in quotes

Per #329, a string-defined resource like machine or gpu_model should include
quotes around the value

Signed-off-by: vsoch <[email protected]>

* gcloud beta life sciences client change in behavior

we used to be able to do gcloud beta lifesciences describe <id>
but now a full project string is required for this to work.

Signed-off-by: vsoch <[email protected]>

* updating GoogleLifeSciences Executor to look for machine_type instead of machine definition in the job resources

Signed-off-by: vsoch <[email protected]>

* initial changes after review from Johannes

Signed-off-by: vsoch <[email protected]>

* fixing reasonable code smells, the cognitive complexity ones I need suggestions for

Signed-off-by: vsoch <[email protected]>

* Adding persistence.aux_path to be used for workflow tar.gz

Co-authored-by: Johannes Köster <[email protected]>

* adding tests for google life sciences, taking a shot that I know the correct envar name

Signed-off-by: vsoch <[email protected]>

* Removing secrets.GOOGLE_APPLICATION_CREDENTIALS

Reported to already be set without needing to be specified here.

Co-authored-by: Johannes Köster <[email protected]>

* changing env.GOOGLE_ to just use GCP_AVAILABLE.

Co-authored-by: Johannes Köster <[email protected]>

Co-authored-by: Johannes Köster <[email protected]>

Co-authored-by: Vanessasaurus <[email protected]>