Welcome to Janis-Assistant

Janis is a workflow assistant designed to make the process of building and running workflows easier.

More specifically:

  • Janis core is a framework for specifying workflows that can be transpiled to CWL and WDL.
  • Janis assistant manages an engine to run these workflows and collects the results.

Quick start

pip3 install janis-pipelines

CWLTool

You can run a workflow in CWLTool with the following command line:

janis run --engine cwltool hello

To use CWLTool, you must have CWLTool in your path with either Docker or Node. See Engine support for more information.

Cromwell

Cromwell is the default engine, and can be run with:

janis run --engine cromwell hello

To use Cromwell, you must have Java 1.8 available. See Engine support for more information.
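
Before choosing an engine, it can help to confirm that the prerequisites named above are on your PATH. The following is a minimal sketch: `have` and `engine_ready` are hypothetical helper names, not part of Janis, and the check does not verify the Java version.

```shell
# Hypothetical helper: succeed if a command is available on PATH.
have() { command -v "$1" >/dev/null 2>&1; }

# Hypothetical helper: succeed if the named engine's prerequisites are present.
# Returns 2 for an engine name it does not know about.
engine_ready() {
  case "$1" in
    cwltool)  have cwltool && { have docker || have node; } ;;
    cromwell) have java ;;   # does not check that the version is 1.8
    *)        return 2 ;;
  esac
}

for engine in cwltool cromwell; do
  if engine_ready "$engine"; then
    echo "$engine: prerequisites found"
  else
    echo "$engine: prerequisites missing"
  fi
done
```

The loop only reports what it finds; it does not install anything or change your environment.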

CLI options:

  • run - Run a janis workflow (see the run parameters below)
  • watch - Watch an existing execution (folder or workflow ID)
  • abort - Issue an abort request to an existing execution
  • inputs - Generate an inputs file for a workflow
  • translate - Translate a workflow into CWL / WDL
  • metadata - Get the available metadata on an execution
  • version - Print the version of janis submodules.
  • spider - Print documentation for a tool (helps to trace problems with the Janis toolbox)

run

You can run a workflow with the run method. Here's how to run the hello world example:

janis run hello

View the help guide

# $ janis run -h

positional arguments:
  workflow              Run the workflow defined in this file or available
                        within the toolbox
  extra_inputs

optional arguments:
  -h, --help            show this help message and exit
  -i INPUTS, --inputs INPUTS
                        YAML or JSON inputs file to provide values for the
                        workflow (can specify multiple times)
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
                        This directory to copy outputs to. By default
                        intermediate results are within a janis/execution
                        subfolder (unless overriden by a template)
  -B, --background      Run the workflow engine in the background (or submit
                        to a cluster if your template supports it)
  --progress            Show the progress screen if running in the background
  --keep-intermediate-files
                        Do not remove execution directory on successful
                        complete
  --skip-file-check     Skip checking if files exist before the start of a
                        workflow.
  --allow-empty-container
                        Some tools you use may not include a container, this
                        would usually (and intentionally) cause an error.
                        Including this flag will disable this check, and empty
                        containers can be used.
  --development         Apply common settings (--keep-execution-dir + --mysql)
                        to support incremental development of a pipeline

input manipulation:
  -r RECIPE, --recipe RECIPE
                        Use a provided recipe from a provided template
  --max-cores MAX_CORES
                        maximum number of cores to use when generating
                        resource overrides
  --max-memory MAX_MEMORY
                        maximum GB of memory to use when generating resource
                        overrides

hints:
  --hint-captureType {targeted,exome,chromosome,30x,90x,300x}
  --hint-engine {cromwell}

workflow collection arguments:
  --toolbox             Skip looking through the search path, and only look in
                        the toolbox
  -n NAME, --name NAME  If you have multiple workflows in your file, you may
                        want to help Janis out to select the right workflow to
                        run
  --no-cache            Force re-download of workflow if remote

engine arguments:
  --engine {cwltool,cromwell}
                        Choose an engine to start
  --cromwell-url CROMWELL_URL
                        Location to Cromwell

filescheme arguments:
  -f {local,ssh}, --filescheme {local,ssh}
                        Choose the filescheme required to retrieve the output
                        files where your engine is located. By selecting SSH,
                        Janis will SCP the files using the --filescheme-ssh-
                        binding SSH shortcut.
  --filescheme-ssh-binding FILESCHEME_SSH_BINDING
                        Only valid if you've selected the ssh filescheme. (eg:
                        scp cluster:/path/to/output local/output/dir)

validation arguments:
  --validation-reference VALIDATION_REFERENCE
                        reference file for validation
  --validation-truth-vcf VALIDATION_TRUTH_VCF
                        truthVCF for validation
  --validation-intervals VALIDATION_INTERVALS
                        intervals to validate between
  --validation-fields VALIDATION_FIELDS [VALIDATION_FIELDS ...]
                        outputs from the workflow to validate

beta features:
  --mysql               BETA: Run MySQL for persistence with Cromwell

Configuration

It's possible to configure a number of attributes of janis_assistant. You can provide a YAML configuration file in two ways, and Janis also checks a default location:

  • CLI: --config /path/to/config.yml
  • Environment variable: JANIS_CONFIGPATH=/path/to/config.yml
  • Default: Janis will additionally look for a config at $HOME/.janis/janis.conf.

Configurations aren't currently cascaded, but the intention is that they will be.

Options

Defaults: janis_assistant/management/configuration.py

  • Config / DB directory: configDir: /path/to/configdir/

    • Overridden by the environment variable JANIS_CONFIGDIR, if set
    • Default: $HOME/.janis/
    • Database: {configDir}/janis.db - Janis global database
  • Execution directory: executionDir

    • Overridden by the environment variable JANIS_EXCECUTIONDIR, if set
    • Default: $HOME/janis/execution/
  • Search paths: searchPaths

    • Paths from the environment variable JANIS_SEARCHPATH are added as well
    • Default: $HOME/janis/
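
As a sketch of how these options fit together, the snippet below writes a small YAML config to a scratch location and points Janis at it through JANIS_CONFIGPATH. The paths are placeholders, and writing searchPaths as a YAML list is an assumption; check janis_assistant/management/configuration.py for the exact schema.

```shell
# Write the example config to a scratch directory (placeholder location).
config_file="$(mktemp -d)/janis.conf"

cat > "$config_file" <<'EOF'
# Keys taken from the options listed above; values are placeholders.
configDir: /path/to/configdir/
executionDir: /path/to/execution/
# Assumption: searchPaths accepts a list of directories.
searchPaths:
  - /path/to/workflows/
EOF

# Point Janis at the file through the documented environment variable.
export JANIS_CONFIGPATH="$config_file"
echo "JANIS_CONFIGPATH=$JANIS_CONFIGPATH"
```

Using the environment variable keeps the example independent of where the --config flag belongs on the command line.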

Engines

There are currently 2 engines that janis_assistant supports:

  1. CWLTool
  2. Cromwell

CWLTool (default)

Due to the way CWLTool provides metadata, support for CWLTool is very basic and limited to submitting workflows and linking the outputs. Janis can manage CWLTool in the background, but if CWLTool is terminated (for example, through a transient cluster error), Janis is unable to restart it.

Cromwell

Cromwell can be run in two modes:

  1. Connect to an existing instance (well supported) - include the --cromwell-url argument with the port to allow the Janis assistant to correctly connect to this instance.

  2. Run and manage its own instance. When the task is started, the process_id of the started Cromwell instance is stored in the taskdb; when the task finishes execution, Janis stops this Cromwell instance. Janis can manage a MySQL (in fact MariaDB) instance with the --mysql flag for durability and to reduce memory overhead.

Both of these options provide reporting and progress tracking due to Cromwell's extensive metadata endpoint. The TaskID (6 hex characters) is included as a label on the workflow.

janis watch $tid
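
For the first mode, here is a sketch of pointing Janis at an already-running Cromwell server and then following the run. The address `localhost:8000` is an assumed example (8000 is Cromwell's default server port), and the task ID `abc123` is a made-up placeholder for your own.

```shell
# Assumed address of an existing Cromwell server; include the port.
cromwell_url="localhost:8000"

# Submit the workflow to that instance instead of letting Janis start its own.
run_cmd="janis run --engine cromwell --cromwell-url $cromwell_url hello"

# Placeholder task ID (6 hex characters); substitute the ID of your own run.
tid="abc123"
watch_cmd="janis watch $tid"

# Print both commands for review rather than executing them.
printf '%s\n' "$run_cmd" "$watch_cmd"
```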

A screenshot of the example whole genome germline pipeline running (for a targeted sample) can be found below. All engines can support this through a generalised metadata semantic (TaskMetadata), but neither CWLTool nor Toil supports much polling of metadata.

Screenshot of janis progress screen

Extra Cromwell comments:

  • The TaskID is bound as a label on GCP instances (as wid, allowing you to query this information).
  • Janis uses the development spec of WDL, requiring Cromwell-42 or higher.
  • If asking Janis to start its own Cromwell instance, it requires the jar to be exported as $cromwelljar.
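
If Janis is to start its own Cromwell instance, a quick check that `$cromwelljar` is set and readable can save a failed launch. This is a minimal sketch: `cromwelljar_ok` is a hypothetical helper, and `/path/to/cromwell.jar` is a placeholder for wherever you saved the jar.

```shell
# Hypothetical helper: succeed if $cromwelljar names a readable file.
cromwelljar_ok() {
  [ -n "${cromwelljar:-}" ] && [ -r "$cromwelljar" ]
}

# Placeholder path; replace with the location of your downloaded jar.
export cromwelljar="/path/to/cromwell.jar"

if cromwelljar_ok; then
  echo "cromwelljar is set: $cromwelljar"
else
  echo "cromwelljar is unset or not readable: ${cromwelljar:-<unset>}"
fi
```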

Databases

This feature requires better documentation in the primary Janis documentation.

Some Cromwell features, such as call caching and resuming after cluster failures, require a database.

Previously, this was managed by automatically spinning up a MySQL instance with Docker / Singularity, but this proved unstable. As Cromwell supports a file-based database, this is now the default.

  • No options -> file-based DB
  • --no-database -> No database is run
  • --mysql -> Automatically provision and manage a mysql server (unchanged)
  • Configure an existing
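
The first three modes above can be summarised as a mapping from mode to flag. The sketch below assumes the flags are passed to `janis run`; `db_flag` and the mode names `file`, `none` and `mysql` are hypothetical labels used only for this illustration.

```shell
# Hypothetical helper: translate a database mode into the matching flag.
db_flag() {
  case "$1" in
    file)  echo "" ;;                # default: file-based DB, no flag needed
    none)  echo "--no-database" ;;
    mysql) echo "--mysql" ;;
    *)     echo "unknown database mode: $1" >&2; return 1 ;;
  esac
}

for mode in file none mysql; do
  # The substitution is left unquoted so an empty flag adds no argument.
  echo janis run $(db_flag "$mode") hello
done
```

The loop prints one candidate command per mode instead of running it.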

Call caching is enabled by default using the file-based method. We strongly recommend downloading Cromwell >50 and using fingerprint; see the call caching documentation for more information.

WARNING: fingerprint will become the default once Cromwell 50 has been released. This might break if you're using older versions of Cromwell.

Filesystem

There is a weak concept of a filesystem for where your workflow is executed. This tool is really only developed for using the LocalFileSystem.

Supported filesystems:

  • LocalFileScheme
  • SSHFileScheme (identifier, connectionstring) - I'd recommend creating an SSH shortcut to avoid persisting personal details in the database. Janis uses the connection string like so: scp connectionstring:/path/to/output /local/persist/path
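
An SSH shortcut is a Host entry in your SSH client config. The sketch below writes an example entry to a scratch file rather than touching ~/.ssh/config; the alias `cluster`, the host name and the user are all hypothetical values to replace with your own.

```shell
# Write the example to a scratch file instead of editing ~/.ssh/config directly.
ssh_config="$(mktemp)"

cat > "$ssh_config" <<'EOF'
# Hypothetical host details; replace with your own.
Host cluster
    HostName cluster.example.org
    User myuser
EOF

# With this entry in ~/.ssh/config, "cluster" can serve as the connection string.
binding="cluster"
echo "janis run -f ssh --filescheme-ssh-binding $binding hello"
echo "scp $binding:/path/to/output /local/persist/path"
```

Only the alias ends up on the command line, so the user name and host stay in your SSH config.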

Databases

Janis stores a global SQLite database at {configDir}/janis.db of environments and task pointers (default: ~/.janis/janis.db). When a task is started, a database and workflow files are copied to your specified output directory.