A Clojure Frontend to Drake

Aaron Crow edited this page May 23, 2014 · 7 revisions

Overview

The clj-frontend namespace provides the following functions for creating and executing Drake workflows in Clojure.

  • new-workflow -- Create a new workflow.
  • cmd-step -- Add a step with commands to a workflow.
  • method -- Add a method to a workflow.
  • method-step -- Add a step using a method to a workflow.
  • template -- Add a template to a workflow.
  • template-step -- Add a step using a template to a workflow.
  • set-var -- Set a variable in a workflow.
  • base -- Change the value of the "BASE" variable in a workflow.
  • run-workflow -- Run a workflow.

With the exception of new-workflow, all of these functions accept a workflow as their first argument and return a modified workflow, which makes them easy to chain with Clojure's -> threading macro. This API structure was inspired by honeysql. Let's see how this works in practice by translating a trivial Drake workflow.

Minimal Example

Add Drake to your project

Your project.clj dependencies should include the latest Drake library, e.g.:

[factual/drake "0.1.6"]

Workflow in Drake

out <-
  echo "We are writing to a file here" > $OUTPUT

Workflow Translated into Clojure

;; Bring all the clj-frontend functions into the current namespace.
(use 'drake.clj-frontend)

;; Define a workflow called minimal-workflow.
(def minimal-workflow
  (->
   (new-workflow)                       ;Create a new workflow
   (cmd-step                            ;Add a command step with the
                                        ;following arguments
    ["out"]                             ;Vector of outputs
    []                                  ;Vector of inputs
    ["echo \"We are writing to a file here\" > $OUTPUT"] ;Vector of commands
    )))

Here, (new-workflow) calls the new-workflow function to create a new workflow. The -> macro then passes this workflow to cmd-step as its first argument. The remaining arguments to cmd-step are vectors of outputs, inputs and commands. Just as in the original Drake workflow, outputs come before inputs.
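The threading behaviour can be checked without Drake at all. The following sketch uses two stand-in functions, new-workflow* and cmd-step* (hypothetical names that only mimic the argument order of the real functions, not their internals), to show that the -> form and the explicitly nested call produce the same value.

```clojure
;; Stand-ins for illustration only: the real drake.clj-frontend functions
;; build a richer workflow value than this plain map.
(defn new-workflow* []
  {:steps []})

(defn cmd-step* [w-flow outputs inputs cmds]
  ;; Return a new map with one more step appended; the argument is not mutated.
  (update-in w-flow [:steps] conj {:outputs outputs
                                   :inputs  inputs
                                   :cmds    cmds}))

;; Threaded form, as used throughout this page.
(def threaded
  (-> (new-workflow*)
      (cmd-step* ["out"] [] ["echo hi > $OUTPUT"])))

;; Equivalent nested form: the workflow is simply the first argument.
(def nested
  (cmd-step* (new-workflow*) ["out"] [] ["echo hi > $OUTPUT"]))

(println (= threaded nested)) ; prints true
```

Because each function takes the workflow first and returns a workflow, any number of steps can be chained inside a single -> form.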

With our workflow in hand, we can preview it at a REPL to see what would happen if we actually ran it.

(run-workflow minimal-workflow :preview true) should generate the following preview.

The following steps will be run, in order:
  1: out <-  [missing output]

If we are satisfied, we can actually run the workflow with the following command.

(run-workflow minimal-workflow)

This will run our workflow and produce output at the REPL similar to the following.

Workflow Started @ 16:35:58

1: out <-  [no-input step] Step Started @ 16:35:58
1: out <-  [no-input step] Step Finished @ 16:35:58

Workflow Finished @ 16:35:58

Practicalities

The most straightforward way to use clj-frontend is to create a new project with lein new and add Drake as a dependency in the project file, using a Clojars coordinate such as [factual/drake "0.1.6"]. Be sure to check Clojars for the most current coordinate. Then write your workflow in an appropriate namespace in a file in the src directory. By default, the inputs and outputs of your workflow will end up in the root directory of the Leiningen project. drake/demos/clj-frontend is a Leiningen project demonstrating this approach. Running lein repl inside drake/demos/clj-frontend lets you interact with the code examples from this page, which can be found in the clj-frontend.demo namespace in drake/demos/clj-frontend/src/clj_frontend/demo.clj.
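As a rough sketch of that setup, a project.clj might look like the following. The project name my-workflow, the description and the Clojure version are assumptions made for illustration; only the factual/drake coordinate comes from this page, and it may no longer be the latest.

```clojure
;; project.clj (hypothetical project name and Clojure version)
(defproject my-workflow "0.1.0-SNAPSHOT"
  :description "Drake workflows written with drake.clj-frontend"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [factual/drake "0.1.6"]]) ; check Clojars for the current version
```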

Alternatively, you can avoid making a Leiningen project by using lein-exec to create a standalone clj script. The downside of a standalone lein-exec script is that lein-exec does not currently let you open a REPL from the command line. If, however, you open the script in Emacs, you can use cider-jack-in to get a working REPL even though the script is not part of a Leiningen project and has no associated project.clj. Opening a lein-exec script in Emacs and then jacking into a REPL is a convenient way to run drake.clj-frontend.

Full Featured Example

This example features variables, methods, variable substitution and multiline commands.

Workflow in Drake

out1, out2 <- [-timecheck]
  echo "This is the first output." > $OUTPUT0
  echo "This is the second output." > $OUTPUT1

test_method()
  echo "Here we are using a method." > $OUTPUT

out_method <- [method:test_method]

test_var=TEST_VAR_VALUE
output_three=out3

$[output_three] <- out1
  echo "This is the third output." > $OUTPUT
  echo "test_var is set to $test_var -- $[test_var]." >> $OUTPUT
  echo "The file $INPUT contains:" | cat - $INPUT >> $[OUTPUT]

Workflow in Clojure

(def advanced-workflow
  (->
   (new-workflow)
   (cmd-step
    ["out1"
     "out2"]
    []
    ["echo \"This is the first output.\" > $OUTPUT0"
     "echo \"This is the second output.\" > $OUTPUT1"] ;multiple commands
    :timecheck false)                   ;options are key value pairs
   (method
    "test_method"
    ["echo \"Here we are using a method.\" > $OUTPUT"])
   (method-step
    ["out_method"]                      ;outputs
    []                                  ;inputs
    "test_method")                      ;method name
   (set-var "test_var" "TEST_VAR_VALUE") ;var name, var value
   (set-var "output_three" "out3")
   (cmd-step
    ["$[output_three]"]                 ;inputs and outputs can have
                                        ;$[XXX] substitution
    ["out1"]
    ;; $[XXX] substitution is allowed in commands.
    ["echo \"This is the third output.\" > $OUTPUT"
     "echo \"test_var is set to $test_var -- $[test_var].\" >> $OUTPUT"
     "echo \"The file $INPUT contains:\" | cat - $INPUT >> $[OUTPUT]"])))

(run-workflow advanced-workflow :preview true)

(run-workflow advanced-workflow)

Functional Programming with Workflows

clj-frontend becomes really powerful when you write functions that take and return workflows and then use reduce to generate a workflow from a collection. Here is a simple example.

Let's say you want to take several raw data sources from the internet and for each source you want to create a directory, download some data into it, and do several processing steps on the data. We will express this as a map called dir->url-map between the directory names we want to create and the raw data sources we want to process.

(def dir->url-map
  "Hash map of:
  Directory Names => URLs"
  {"Dir1" "http://url1"
   "Dir2" "http://url2"
   "Dir3" "http://url3"})

Now we need a function that takes an existing workflow and adds new steps to it for each directory => url pair from our dir->url-map.

(defn download-and-process
  "I take an existing workflow and add steps to download and process the data at
  url into the directory dir"
  [w-flow [dir url]]                    ;note the argument
                                        ;destructuring
  (-> w-flow
      (base "")                         ;make sure we are in top
                                        ;directory
      (cmd-step
       [dir]
       []
       ["mkdir -p $OUTPUT"])
      (base dir)                        ;move into dir for our
                                        ;subsequent commands
      (cmd-step
       ["raw_data"]
       []
       [(str "wget -O $OUTPUT " url)]   ;get the data
       :timecheck false)
      (cmd-step
       ["sorted_data"]
       ["raw_data"]
       ["sort -o $OUTPUT $INPUT"])     ;sort the data
      ;; more steps can be added here
      ))

Finally we can use reduce with download-and-process to add several workflow steps for each dir => url pair in dir->url-map.

(def reduce-workflow
  (reduce
   download-and-process
   (new-workflow)
   dir->url-map))

(run-workflow reduce-workflow :preview true) should now give you the following preview:

The following steps will be run, in order:
  1: Dir1 <-  [missing output]
  2: Dir1/raw_data <-  [missing output]
  3: Dir1/sorted_data <- Dir1/raw_data [projected timestamped]
  4: Dir2 <-  [missing output]
  5: Dir2/raw_data <-  [missing output]
  6: Dir2/sorted_data <- Dir2/raw_data [projected timestamped]
  7: Dir3 <-  [missing output]
  8: Dir3/raw_data <-  [missing output]
  9: Dir3/sorted_data <- Dir3/raw_data [projected timestamped]

Since the data for this workflow is fake, we can't actually run it.
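Although the full workflow cannot be run, the reduce pattern itself can be tried in isolation. In this sketch the workflow is replaced by a plain vector of step descriptions, and add-steps is a hypothetical stand-in for download-and-process. It relies only on the fact that reducing over a map passes each entry to the function as a [key value] pair.

```clojure
(def dir->url-map
  {"Dir1" "http://url1"
   "Dir2" "http://url2"
   "Dir3" "http://url3"})

(defn add-steps
  "Stand-in for download-and-process: takes an accumulated 'workflow'
  (here just a vector) and one [dir url] map entry, and returns the
  vector with three step descriptions appended."
  [w-flow [dir url]]
  (conj w-flow
        (str dir " <-")
        (str dir "/raw_data <- " url)
        (str dir "/sorted_data <- " dir "/raw_data")))

;; Start from an empty 'workflow' and fold every map entry into it.
(def steps (reduce add-steps [] dir->url-map))

(println (count steps)) ; prints 9, three steps for each of the three entries
```

Small literal maps like this one happen to be traversed in insertion order, but Clojure maps make no ordering guarantee in general. If the order of the generated steps matters, a vector of [dir url] pairs is a safer input to reduce.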