A Clojure Frontend to Drake
The clj-frontend namespace provides the following functions for creating and executing Drake workflows in Clojure.
- new-workflow -- Create a new workflow.
- cmd-step -- Add a step with commands to a workflow.
- method -- Add a method to a workflow.
- method-step -- Add a step using a method to a workflow.
- template -- Add a template to a workflow.
- template-step -- Add a step using a template to a workflow.
- set-var -- Set a variable in a workflow.
- base -- Change the value of the "BASE" variable in a workflow.
- run-workflow -- Run a workflow.
With the exception of new-workflow, all of these functions accept a workflow as their first argument and return a modified workflow. This API structure was inspired by honeysql. Let's see how this works in practice by translating a trivial Drake workflow.
Your project.clj dependencies should include the latest Drake library, e.g.:
[factual/drake "0.1.6"]
out <-
echo "We are writing to a file here" > $OUTPUT
;; Bring all the clj-frontend functions into the current namespace.
(use 'drake.clj-frontend)
;; Define a workflow called minimal-workflow.
(def minimal-workflow
  (->
   (new-workflow)           ;Create a new workflow
   (cmd-step                ;Add a command step with the
                            ;following arguments
    ["out"]                 ;Vector of outputs
    []                      ;Vector of inputs
    ["echo \"We are writing to a file here\" > $OUTPUT"]))) ;Vector of commands
What is happening here is that (new-workflow) creates a new workflow. The -> macro then passes this new workflow into the cmd-step function as its first argument. The subsequent arguments to cmd-step are vectors of outputs, inputs, and commands. Just like in the original Drake workflow, outputs come before inputs.
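The threading here is plain Clojure: -> simply rewrites the call so that each result becomes the first argument of the next step function. A minimal sketch of the expansion, using the same function names as above:

```clojure
;; The -> macro threads the result of each form into the next
;; form as its first argument, so the pipeline is really just a
;; nested function call:
(macroexpand-1
 '(-> (new-workflow)
      (cmd-step ["out"] [] ["echo \"hi\" > $OUTPUT"])))
;; => (cmd-step (new-workflow) ["out"] [] ["echo \"hi\" > $OUTPUT"])
```

This is why every clj-frontend function takes a workflow first and returns a workflow: it makes the whole API composable with ->.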
With our workflow in hand, if we are working at a REPL, we can preview the workflow to see what would happen if we actually ran it.
(run-workflow minimal-workflow :preview true)
should generate the following preview.
The following steps will be run, in order:
1: out <- [missing output]
If we are satisfied, we can actually run the workflow with the following command.
(run-workflow minimal-workflow)
This will run our workflow and produce output at the REPL similar to the following.
Workflow Started @ 16:35:58
1: out <- [no-input step] Step Started @ 16:35:58
1: out <- [no-input step] Step Finished @ 16:35:58
Workflow Finished @ 16:35:58
The most straightforward way to use clj-frontend is to create a new project with lein new and add Drake as a dependency to the project file using something similar to the Clojars coordinate [factual/drake "0.1.6"]. Be sure to check Clojars for the most current coordinate. Then write your workflow in an appropriate namespace in a file in the src directory. By default, the inputs and outputs of your workflow will then end up in the root directory of the Leiningen project. drake/demos/clj-frontend is a Leiningen project demonstrating this approach. Running lein repl inside drake/demos/clj-frontend will let you interact with the code examples from this page, which can be found in the clj-frontend.demo namespace in drake/demos/clj-frontend/src/clj_frontend/demo.clj.
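A minimal project.clj for this setup might look like the following sketch; the project name and both version numbers are placeholders, so check Clojars for the current factual/drake coordinate before copying it.

```clojure
;; project.clj -- a minimal sketch; my-drake-workflows and the
;; version strings are illustrative placeholders, not canonical.
(defproject my-drake-workflows "0.1.0-SNAPSHOT"
  :description "Drake workflows written with drake.clj-frontend"
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [factual/drake "0.1.6"]])
```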
Alternatively, you can avoid creating a Leiningen project by using lein-exec to write a standalone clj script. The downside of a standalone lein-exec script is that lein-exec won't currently let you open a REPL from the command line. If, however, you open the script in Emacs, you can cider-jack-in to get a working REPL even though the script is not part of a Leiningen project and has no associated project.clj. Opening a lein-exec script like this in Emacs and then jacking in to a REPL is a really nice way to run drake.clj-frontend.
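A standalone lein-exec script might look like the following sketch. The leiningen.exec/deps helper is lein-exec's documented mechanism for pulling in dependencies at script startup, but verify both it and the Drake coordinate against the lein-exec README and Clojars before relying on this.

```clojure
;; workflow.clj -- run with: lein exec workflow.clj
;; (a sketch; assumes the lein-exec plugin is installed)
(use '[leiningen.exec :only (deps)])

;; Fetch Drake when the script starts; check Clojars for the
;; current coordinate.
(deps '[[factual/drake "0.1.6"]])

(use 'drake.clj-frontend)

(def my-workflow
  (-> (new-workflow)
      (cmd-step ["out"] [] ["echo \"hello\" > $OUTPUT"])))

(run-workflow my-workflow)
```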
This example features variables, methods, variable substitution, and multiline commands.
out1, out2 <- [-timecheck]
echo "This is the first output." > $OUTPUT0
echo "This is the second output." > $OUTPUT1
test_method()
echo "Here we are using a method." > $OUTPUT
out_method <- [method:test_method]
test_var=TEST_VAR_VALUE
output_three=out3
$[output_three] <- out1
echo "This is the third output." > $OUTPUT
echo "test_var is set to $test_var -- $[test_var]." >> $OUTPUT
echo "The file $INPUT contains:" | cat - $INPUT >> $[OUTPUT]
(def advanced-workflow
  (->
   (new-workflow)
   (cmd-step
    ["out1"
     "out2"]
    []
    ["echo \"This is the first output.\" > $OUTPUT0"
     "echo \"This is the second output.\" > $OUTPUT1"] ;multiple commands
    :timecheck false)                    ;options are key value pairs
   (method
    "test_method"
    ["echo \"Here we are using a method.\" > $OUTPUT"])
   (method-step
    ["out_method"]                       ;outputs
    []                                   ;inputs
    "test_method")                       ;method name
   (set-var "test_var" "TEST_VAR_VALUE") ;var name, var value
   (set-var "output_three" "out3")
   (cmd-step
    ["$[output_three]"]                  ;inputs and outputs can have
                                         ;$[XXX] substitution
    ["out1"]
    ;; $[XXX] substitution is allowed in commands.
    ["echo \"This is the third output.\" > $OUTPUT"
     "echo \"test_var is set to $test_var -- $[test_var].\" >> $OUTPUT"
     "echo \"The file $INPUT contains:\" | cat - $INPUT >> $[OUTPUT]"])))
(run-workflow advanced-workflow :preview true)
(run-workflow advanced-workflow)
clj-frontend gets really powerful when you write functions that take and return workflows and then use reduce to generate a workflow from a collection. Here is a simple example.
Let's say you want to take several raw data sources from the internet, and for each source you want to create a directory, download some data into it, and run several processing steps on the data. We will express this as a map called dir->url-map from the directory names we want to create to the raw data sources we want to process.
(def dir->url-map
"Hash map of:
Directory Names => URLs"
{"Dir1" "http://url1"
"Dir2" "http://url2"
"Dir3" "http://url3"})
Now we need a function that takes an existing workflow and adds new steps to it for each directory => url pair in dir->url-map.
(defn download-and-process
  "Take an existing workflow and add steps to download and process the
  data at url into the directory dir."
  [w-flow [dir url]]       ;note the argument destructuring
  (-> w-flow
      (base "")            ;make sure we are in the top directory
      (cmd-step
       [dir]
       []
       ["mkdir -p $OUTPUT"])
      (base dir)           ;move into dir for our subsequent commands
      (cmd-step
       ["raw_data"]
       []
       [(str "wget -O $OUTPUT " url)] ;get the data
       :timecheck false)
      (cmd-step
       ["sorted_data"]
       ["raw_data"]
       ["sort -o $OUTPUT $INPUT"])    ;sort the data
      ;; more steps can be added here
      ))
Finally, we can use reduce with download-and-process to add several workflow steps for each dir => url pair in dir->url-map.
(def reduce-workflow
  (reduce
   download-and-process
   (new-workflow)
   dir->url-map))
(run-workflow reduce-workflow :preview true)
should now give you the following preview:
The following steps will be run, in order:
1: Dir1 <- [missing output]
2: Dir1/raw_data <- [missing output]
3: Dir1/sorted_data <- Dir1/raw_data [projected timestamped]
4: Dir2 <- [missing output]
5: Dir2/raw_data <- [missing output]
6: Dir2/sorted_data <- Dir2/raw_data [projected timestamped]
7: Dir3 <- [missing output]
8: Dir3/raw_data <- [missing output]
9: Dir3/sorted_data <- Dir3/raw_data [projected timestamped]
Since the URLs in this workflow are fake, we can't actually run it.