Tutorial
This tutorial is a work in progress. If there's a specific topic you'd like covered, please let us know via the Google Group for Drake.
Your Drake workflow file specifies what steps you want to run. Generally speaking, each step relies on one or more input sources, and is expected to create one or more output artifacts.
A Drake workflow file is organized primarily by step. In addition to specifying inputs and outputs, a step will generally contain explicit commands for that step, and possibly extra options.
Here's an example of a single step in a Drake workflow file:
; we only like lines with lowercase "i" in them
out.csv <- in.csv [shell]
  grep i $INPUT > $OUTPUT
The above step uses Drake's "shell" protocol, meaning the commands are shell commands. (There are other protocols available, which must be specified explicitly. But for this tutorial, we'll focus on using the shell protocol.)
Let's break down the specific elements of the above step:
- out.csv: the output file to produce
- in.csv: the input file to use
- [shell]: the brackets hold the options for the step. A very important option is the step protocol. In this case we're choosing the "shell" protocol, which allows us to run shell commands in this step.
- the indented line: indented lines following the first line of the step are the commands of the step. In this case, there's exactly one command, which performs line filtering. Note that the command is a shell command, per our use of the shell protocol.
- $INPUT: A Drake shell step automatically loads the shell environment variables with useful information before running the step's shell command(s). For example, it loads the INPUT environment variable with the file path of the first input specified by the step. Therefore, the step's shell commands have access to variables such as $INPUT.
- $OUTPUT: Similar to $INPUT, a Drake shell step automatically loads the OUTPUT environment variable with the file path of the first output specified by the step, before running the step's shell commands.
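To make the role of $INPUT and $OUTPUT concrete, the step can be approximated in plain shell without Drake. This is only a sketch of the idea, not how Drake itself runs steps: the scratch directory and the two sample lines are hypothetical, and the variables are set by hand here rather than by Drake.

```shell
# Scratch directory and sample data (both hypothetical, for illustration only).
workdir=$(mktemp -d)
printf 'Aaron,Crow,aaron\nAlvin,Chyan,alvin\n' > "$workdir/in.csv"

# Set the two environment variables by hand, then run the step's command
# in a child shell that reads them, much as a Drake shell step would.
INPUT="$workdir/in.csv" OUTPUT="$workdir/out.csv" \
  sh -c 'grep i "$INPUT" > "$OUTPUT"'

# Only the line containing a lowercase "i" survives.
cat "$workdir/out.csv"
```

The single quotes around the command matter: they stop the outer shell from expanding $INPUT and $OUTPUT, so the child shell reads them from its environment.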
A Drake workflow may have many steps, which might depend on each other in various ways. As a simple example, consider this additional step we could add to our workflow file:
; produce an extraordinarily fancy report
count.txt <- out.csv
  wc $INPUT > $OUTPUT
This step depends on out.csv (that is, it uses out.csv as its input file), and produces count.txt. Because of the dependence on out.csv, Drake will by default make sure out.csv is up-to-date. This means Drake will run the step(s) required to create out.csv if necessary. (This behaviour is a tenet of basic dependency management that we've come to know and love through tools like Make.)
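The Make-style rule behind this can be sketched in plain shell: rebuild an output when it is missing or older than its input. This is an assumption about the general technique, not a description of Drake's actual up-to-date check, which may differ. The function name needs_rebuild, the scratch directory, and the sample data are all hypothetical.

```shell
# needs_rebuild OUTPUT INPUT
# Succeeds when OUTPUT is missing or INPUT is newer than OUTPUT.
# (-nt is a common test extension in bash, dash, and ksh.)
needs_rebuild() {
  [ ! -e "$1" ] || [ "$2" -nt "$1" ]
}

# Hypothetical scratch directory and sample input.
workdir=$(mktemp -d)
printf 'Alvin,Chyan,alvin\nAaron,Crow,aaron\n' > "$workdir/in.csv"

# Step 1: out.csv <- in.csv
if needs_rebuild "$workdir/out.csv" "$workdir/in.csv"; then
  grep i "$workdir/in.csv" > "$workdir/out.csv"
fi

# Step 2: count.txt <- out.csv (runs after step 1, because it depends on out.csv)
if needs_rebuild "$workdir/count.txt" "$workdir/out.csv"; then
  wc "$workdir/out.csv" > "$workdir/count.txt"
fi
```

Running the two if-blocks a second time does nothing, because each output already exists and is not older than its input.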
Drake's command line interface allows us to specify which step we want to start with, along with various other target selection options. By default, however, Drake will attempt to run all the steps in your workflow.
For more details on Drake command line options, including target selection, please see the full user manual.
But we're getting ahead of ourselves. Let's learn by doing...
Drake is built to run data workflows. By default, it looks for your workflow file at ./Drakefile. This is why Drake will complain that it can't find your workflow file if you run it from a directory that does not contain a ./Drakefile file.
Let's start with a fresh workflow, in a new directory:
$ mkdir /myworkflow
$ cd /myworkflow
Now create a simple workflow to play with. Create a file named workflow.d and put this in it (stolen from the earlier example above):
; we only like lines with lowercase "i" in them
out.csv <- in.csv
  grep i $INPUT > $OUTPUT
That's a very simple Drake workflow, with exactly one step. The step runs a single shell command, using in.csv as the input file and writing the output to out.csv.
We don't have an input file yet, so let's create it. Create a file named in.csv and put some CSV lines in it, like so:
Artem,Boytsov,artem
Aaron,Crow,aaron
Alvin,Chyan,alvin
Maverick,Lou,maverick
Vinnie,Pepi,vinnie
Will,Lao,will
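If you'd rather create both files from the command line than in an editor, here-documents are one way to do it. This is a sketch only: it works in a temporary scratch directory (an assumption made so the example is self-contained), and the file contents are copied from the tutorial.

```shell
# Work in a scratch directory (hypothetical; use your own workflow directory instead).
workdir=$(mktemp -d)
cd "$workdir"

# The quoted 'EOF' delimiter stops the shell from expanding $INPUT and $OUTPUT,
# so they are written to the workflow file literally. The command line is indented.
cat > workflow.d <<'EOF'
; we only like lines with lowercase "i" in them
out.csv <- in.csv
  grep i $INPUT > $OUTPUT
EOF

# The sample input from the tutorial.
cat > in.csv <<'EOF'
Artem,Boytsov,artem
Aaron,Crow,aaron
Alvin,Chyan,alvin
Maverick,Lou,maverick
Vinnie,Pepi,vinnie
Will,Lao,will
EOF
```

With an unquoted delimiter, the shell would substitute $INPUT and $OUTPUT (most likely with empty strings) before writing the file, leaving a broken step.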
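Cool, now we have a Drake workflow and a simple input file on which to run the workflow. Because our workflow file is named workflow.d rather than Drakefile, we point Drake at it with the -w option. Let's run it!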
$ drake -w workflow.d
Let's check the output:
$ more out.csv
Alvin,Chyan,alvin
Maverick,Lou,maverick
Vinnie,Pepi,vinnie
Will,Lao,will