Skip to content
Mike Fisk edited this page May 27, 2014 · 19 revisions

Word Frequency Example

The classic map-reduce example is to compute word frequency in a text corpus:

fm map -i "/etext/*" "egrep -o '\S+' | fm partition -Zn 9 <> sort | uniq -c"

Let’s walk through this example step-by-step: First, the FileMap command is “fm”. You must always specify a sub-command and in this case it is “map”, which is used to specify and execute a computation. The “-i” option specifies a wildcard expression that specifies the input files to process. These files need to have been stored in advance into the replicated, virtual filesystem provided by FileMap. To do so, use commands like this:

fm mkdir /etext
fm store /tmp/*.txt /etext/

If you want to follow along, you can download a few e-texts from Project Gutenberg:

cd /tmp
for f in shplk10 sread10 spoem10 mysky10 leave10 gryms10; do wget --user-agent 'definitely-not-wget' http://www.gutenberg.org/dirs/etext05/$f.txt; done

(Mac users can use curl -O instead of wget.) The remainder of the command is an execution pipeline that is applied to each input file.

The first pipeline element uses egrep to extract a list of words, one per line, from a text document.

The “partition” command partitions an input file into a specified number of files (9 in this case). In this case, a hash is computed on each input line to determine which output partition the line belongs to. Importantly, the same input word always gets put in the same partition.

The ><> operators do a shuffle and reduce. First, all of the input files prior to this point in the pipeline are redistributed among the nodes. All files with the same suffix will be stored to the same node(s). The command on the right-hand side of the > operator is responsible for reducing all of the input files for a given partition. This means that it will be executed with a list of input files constituting a partition. Each partition is processed separately (in parallel). In the case of the Unix sort command, it produces a single sorted output from multiple inputs.

Finally, the output of each sorted partition is piped to a “uniq -c” instance, which is the Unix command to tally up the number of identical consecutive lines (words in this case). The “fm map” command implicitly concatenates the outputs of the last stage of the pipeline and outputs them to stdout. In this case, this prints our final word frequency list.

Looking at the intermediate results

Let’s do some directory listings to look at the input and intermediate files. The format is very similar to traditional POSIX ls, but with a few differences. First, the second column, which would be a hard-link count normally, shows which nodes contained the file or directory (* means all nodes). If more than one node contains a file of the same size, permissions, and ownership, then the file’s modification time is not shown (since it is unlikely to be the same for all nodes). Also, directories are always considered to have a 0 size.

$ ./fm ls -n /etext/* /etext/* /etext/*/* /etext/*/*/* /etext/*/*/*/* /etext/*/*/*/*/* /reduce/*/* /reduce/*/*/* reduce/*/*/*/* reduce/*/*/*/*/*  | grep '^-'
-rw-r--r--     20a    mfisk    ydata  8.3e4 May 18  2003 /etext/gryms10.txt
-rwxrwxr-x     20a    mfisk    ydata  8.9e4 Jan 02 17:25 /etext/gryms10.txt.d/egrep=20-o=20=27=5cS=2b=27/0
-rwxrwxr-x 10b,20a    mfisk    ydata      0            - /etext/gryms10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/0
-rw-rw-r-- 20a,24a    mfisk    ydata  1.6e4            - /etext/gryms10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/1
-rw-rw-r-- 20a,26a    mfisk    ydata  1.1e4            - /etext/gryms10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/2
-rw-rw-r-- 10a,20a    mfisk    ydata  8.9e3            - /etext/gryms10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/3
-rw-rw-r-- 20a,21a    mfisk    ydata  7.6e3            - /etext/gryms10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/4
-rw-rw-r-- 10a,20a    mfisk    ydata  7.2e3            - /etext/gryms10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/5
-rw-rw-r-- 10b,20a    mfisk    ydata    1e4            - /etext/gryms10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/6
-rw-rw-r-- 10b,20a    mfisk    ydata  9.7e3            - /etext/gryms10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/7
-rw-rw-r-- 20a,32a    mfisk    ydata  5.8e3            - /etext/gryms10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/8
-rw-rw-r-- 20a,29a    mfisk    ydata  1.3e4            - /etext/gryms10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/9
-rw-rw-r--     20a    mfisk    ydata  2.4e2 Jan 02 17:25 /etext/gryms10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/status
-rwxrwxr-x     20a    mfisk    ydata      0 Jan 02 17:25 /etext/gryms10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/stderr
-rw-rw-r--     20a    mfisk    ydata  2.2e2 Jan 02 17:25 /etext/gryms10.txt.d/egrep=20-o=20=27=5cS=2b=27/status
-rwxrwxr-x     20a    mfisk    ydata      0 Jan 02 17:25 /etext/gryms10.txt.d/egrep=20-o=20=27=5cS=2b=27/stderr
-rw-r--r--     23a    mfisk    ydata  1.2e5 Jun 28  2003 /etext/mysky10.txt
-rwxrwxr-x     23a    mfisk    ydata  1.3e5 Jan 02 17:25 /etext/mysky10.txt.d/egrep=20-o=20=27=5cS=2b=27/0
-rwxrwxr-x 10b,23a    mfisk    ydata      0            - /etext/mysky10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/0
-rw-rw-r-- 23a,24a    mfisk    ydata  2.1e4            - /etext/mysky10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/1
-rw-rw-r-- 23a,26a    mfisk    ydata  1.7e4            - /etext/mysky10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/2
-rw-rw-r-- 10a,23a    mfisk    ydata  1.4e4            - /etext/mysky10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/3
-rw-rw-r-- 21a,23a    mfisk    ydata  1.4e4            - /etext/mysky10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/4
-rw-rw-r-- 10a,23a    mfisk    ydata  1.2e4            - /etext/mysky10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/5
-rw-rw-r-- 10b,23a    mfisk    ydata  1.4e4            - /etext/mysky10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/6
-rw-rw-r-- 10b,23a    mfisk    ydata  1.2e4            - /etext/mysky10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/7
-rw-rw-r-- 23a,32a    mfisk    ydata  9.7e3            - /etext/mysky10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/8
-rw-rw-r-- 23a,29a    mfisk    ydata  1.8e4            - /etext/mysky10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/9
-rw-rw-r--     23a    mfisk    ydata  2.5e2 Jan 02 17:26 /etext/mysky10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/status
-rwxrwxr-x     23a    mfisk    ydata      0 Jan 02 17:26 /etext/mysky10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/stderr
-rw-rw-r--     23a    mfisk    ydata  2.3e2 Jan 02 17:25 /etext/mysky10.txt.d/egrep=20-o=20=27=5cS=2b=27/status
-rwxrwxr-x     23a    mfisk    ydata      0 Jan 02 17:25 /etext/mysky10.txt.d/egrep=20-o=20=27=5cS=2b=27/stderr
-rw-r--r--     23a    mfisk    ydata  5.4e5 Jul 25  2003 /etext/shplk10.txt
-rwxrwxr-x     23a    mfisk    ydata  5.7e5 Jan 02 17:25 /etext/shplk10.txt.d/egrep=20-o=20=27=5cS=2b=27/0
-rwxrwxr-x 10b,23a    mfisk    ydata      0            - /etext/shplk10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/0
-rw-rw-r-- 23a,24a    mfisk    ydata  8.4e4            - /etext/shplk10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/1
-rw-rw-r-- 23a,26a    mfisk    ydata  7.5e4            - /etext/shplk10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/2
-rw-rw-r-- 10a,23a    mfisk    ydata  6.2e4            - /etext/shplk10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/3
-rw-rw-r-- 21a,23a    mfisk    ydata    6e4            - /etext/shplk10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/4
-rw-rw-r-- 10a,23a    mfisk    ydata  4.3e4            - /etext/shplk10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/5
-rw-rw-r-- 10b,23a    mfisk    ydata  7.6e4            - /etext/shplk10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/6
-rw-rw-r-- 10b,23a    mfisk    ydata  6.3e4            - /etext/shplk10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/7
-rw-rw-r-- 23a,32a    mfisk    ydata  4.3e4            - /etext/shplk10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/8
-rw-rw-r-- 23a,29a    mfisk    ydata  6.9e4            - /etext/shplk10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/9
-rw-rw-r--     23a    mfisk    ydata  2.4e2 Jan 02 17:25 /etext/shplk10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/status
-rwxrwxr-x     23a    mfisk    ydata      0 Jan 02 17:25 /etext/shplk10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/stderr
-rw-rw-r--     23a    mfisk    ydata  2.3e2 Jan 02 17:25 /etext/shplk10.txt.d/egrep=20-o=20=27=5cS=2b=27/status
-rwxrwxr-x     23a    mfisk    ydata      0 Jan 02 17:25 /etext/shplk10.txt.d/egrep=20-o=20=27=5cS=2b=27/stderr
-rwxrwxr-x     23a    mfisk    ydata  5.7e5 Jan 02 17:22 /etext/shplk10.txt.d/sed=20=2Df=20=7Emfisk=2Fwords=2Esed/0
-rwxrwxr-x 10b,23a    mfisk    ydata      0            - /etext/shplk10.txt.d/sed=20=2Df=20=7Emfisk=2Fwords=2Esed/0.d/fm=20partition=20=2Dn=209/0
-rw-rw-r-- 23a,24a    mfisk    ydata  8.4e4            - /etext/shplk10.txt.d/sed=20=2Df=20=7Emfisk=2Fwords=2Esed/0.d/fm=20partition=20=2Dn=209/1
-rw-rw-r-- 23a,26a    mfisk    ydata  7.5e4            - /etext/shplk10.txt.d/sed=20=2Df=20=7Emfisk=2Fwords=2Esed/0.d/fm=20partition=20=2Dn=209/2
-rw-rw-r-- 10a,23a    mfisk    ydata  6.2e4            - /etext/shplk10.txt.d/sed=20=2Df=20=7Emfisk=2Fwords=2Esed/0.d/fm=20partition=20=2Dn=209/3
-rw-rw-r-- 21a,23a    mfisk    ydata    6e4            - /etext/shplk10.txt.d/sed=20=2Df=20=7Emfisk=2Fwords=2Esed/0.d/fm=20partition=20=2Dn=209/4
-rw-rw-r-- 10a,23a    mfisk    ydata  4.3e4            - /etext/shplk10.txt.d/sed=20=2Df=20=7Emfisk=2Fwords=2Esed/0.d/fm=20partition=20=2Dn=209/5
-rw-rw-r-- 10b,23a    mfisk    ydata  7.6e4            - /etext/shplk10.txt.d/sed=20=2Df=20=7Emfisk=2Fwords=2Esed/0.d/fm=20partition=20=2Dn=209/6
-rw-rw-r-- 10b,23a    mfisk    ydata  6.3e4            - /etext/shplk10.txt.d/sed=20=2Df=20=7Emfisk=2Fwords=2Esed/0.d/fm=20partition=20=2Dn=209/7
-rw-rw-r-- 23a,32a    mfisk    ydata  4.3e4            - /etext/shplk10.txt.d/sed=20=2Df=20=7Emfisk=2Fwords=2Esed/0.d/fm=20partition=20=2Dn=209/8
-rw-rw-r-- 23a,29a    mfisk    ydata  6.9e4            - /etext/shplk10.txt.d/sed=20=2Df=20=7Emfisk=2Fwords=2Esed/0.d/fm=20partition=20=2Dn=209/9
-rw-rw-r--     23a    mfisk    ydata  2.4e2 Jan 02 17:23 /etext/shplk10.txt.d/sed=20=2Df=20=7Emfisk=2Fwords=2Esed/0.d/fm=20partition=20=2Dn=209/status
-rwxrwxr-x     23a    mfisk    ydata      0 Jan 02 17:23 /etext/shplk10.txt.d/sed=20=2Df=20=7Emfisk=2Fwords=2Esed/0.d/fm=20partition=20=2Dn=209/stderr
-rw-rw-r--     23a    mfisk    ydata  2.3e2 Jan 02 17:22 /etext/shplk10.txt.d/sed=20=2Df=20=7Emfisk=2Fwords=2Esed/status
-rwxrwxr-x     23a    mfisk    ydata      0 Jan 02 17:22 /etext/shplk10.txt.d/sed=20=2Df=20=7Emfisk=2Fwords=2Esed/stderr
-rw-r--r--     10b    mfisk    ydata  1.5e5 Jul 02  2003 /etext/spoem10.txt
-rwxrwxr-x     10b    mfisk    ydata  1.6e5 Jan 02 17:25 /etext/spoem10.txt.d/egrep=20-o=20=27=5cS=2b=27/0
-rwxrwxr-x       *    mfisk    ydata      0            - /etext/spoem10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/0
-rw-rw-r-- 10b,24a    mfisk    ydata  2.9e4            - /etext/spoem10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/1
-rw-rw-r-- 10b,26a    mfisk    ydata  2.1e4            - /etext/spoem10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/2
-rw-rw-r-- 10a,10b    mfisk    ydata  1.6e4            - /etext/spoem10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/3
-rw-rw-r-- 10b,21a    mfisk    ydata  1.5e4            - /etext/spoem10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/4
-rw-rw-r-- 10a,10b    mfisk    ydata  1.2e4            - /etext/spoem10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/5
-rw-rw-r--       *    mfisk    ydata  1.9e4            - /etext/spoem10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/6
-rw-rw-r--       *    mfisk    ydata  1.5e4            - /etext/spoem10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/7
-rw-rw-r-- 10b,32a    mfisk    ydata  1.1e4            - /etext/spoem10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/8
-rw-rw-r-- 10b,29a    mfisk    ydata    2e4            - /etext/spoem10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/9
-rw-rw-r--     10b    mfisk    ydata  2.4e2 Jan 02 17:25 /etext/spoem10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/status
-rwxrwxr-x     10b    mfisk    ydata      0 Jan 02 17:25 /etext/spoem10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/stderr
-rw-rw-r--     10b    mfisk    ydata  2.3e2 Jan 02 17:25 /etext/spoem10.txt.d/egrep=20-o=20=27=5cS=2b=27/status
-rwxrwxr-x     10b    mfisk    ydata      0 Jan 02 17:25 /etext/spoem10.txt.d/egrep=20-o=20=27=5cS=2b=27/stderr
-rw-r--r--     18a    mfisk    ydata  6.9e5 Sep 03  2003 /etext/sread10.txt
-rwxrwxr-x     18a    mfisk    ydata  7.3e5 Jan 02 17:25 /etext/sread10.txt.d/egrep=20-o=20=27=5cS=2b=27/0
-rwxrwxr-x 10b,18a    mfisk    ydata      0            - /etext/sread10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/0
-rw-rw-r-- 18a,24a    mfisk    ydata  1.3e5            - /etext/sread10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/1
-rw-rw-r-- 18a,26a    mfisk    ydata  1.1e5            - /etext/sread10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/2
-rw-rw-r-- 10a,18a    mfisk    ydata  7.2e4            - /etext/sread10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/3
-rw-rw-r-- 18a,21a    mfisk    ydata  6.8e4            - /etext/sread10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/4
-rw-rw-r-- 10a,18a    mfisk    ydata  5.7e4            - /etext/sread10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/5
-rw-rw-r-- 10b,18a    mfisk    ydata  8.1e4            - /etext/sread10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/6
-rw-rw-r-- 10b,18a    mfisk    ydata  6.7e4            - /etext/sread10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/7
-rw-rw-r-- 18a,32a    mfisk    ydata    5e4            - /etext/sread10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/8
-rw-rw-r-- 18a,29a    mfisk    ydata    1e5            - /etext/sread10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/9
-rw-rw-r--     18a    mfisk    ydata  2.4e2 Jan 02 17:25 /etext/sread10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/status
-rwxrwxr-x     18a    mfisk    ydata      0 Jan 02 17:25 /etext/sread10.txt.d/egrep=20-o=20=27=5cS=2b=27/0.d/fm=20partition=20=2Dn=209/stderr
-rw-rw-r--     18a    mfisk    ydata  2.3e2 Jan 02 17:25 /etext/sread10.txt.d/egrep=20-o=20=27=5cS=2b=27/status
-rwxrwxr-x     18a    mfisk    ydata      0 Jan 02 17:25 /etext/sread10.txt.d/egrep=20-o=20=27=5cS=2b=27/stderr
-rwxrwxr-x     10b    mfisk    ydata      0 Jan 02 17:26 /reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/0/0
-rw-rw-r--     10b    mfisk    ydata    2e2 Jan 02 17:26 /reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/0/status
-rwxrwxr-x     10b    mfisk    ydata      0 Jan 02 17:26 /reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/0/stderr
-rwxrwxr-x     24a    mfisk    ydata  2.8e5 Jan 02 17:26 /reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/1/0
-rw-rw-r--     24a    mfisk    ydata  2.1e2 Jan 02 17:26 /reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/1/status
-rwxrwxr-x     24a    mfisk    ydata      0 Jan 02 17:26 /reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/1/stderr
-rwxrwxr-x     26a    mfisk    ydata  2.3e5 Jan 02 17:26 /reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/2/0
-rw-rw-r--     26a    mfisk    ydata  2.3e2 Jan 02 17:26 /reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/2/status
-rwxrwxr-x     26a    mfisk    ydata      0 Jan 02 17:26 /reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/2/stderr
-rwxrwxr-x     10a    mfisk    ydata  1.7e5 Jan 02 17:26 /reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/3/0
-rw-rw-r--     10a    mfisk    ydata  2.3e2 Jan 02 17:26 /reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/3/status
-rwxrwxr-x     10a    mfisk    ydata      0 Jan 02 17:26 /reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/3/stderr
-rwxrwxr-x     21a    mfisk    ydata  1.6e5 Jan 02 17:26 /reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/4/0
-rw-rw-r--     21a    mfisk    ydata  2.3e2 Jan 02 17:26 /reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/4/status
-rwxrwxr-x     21a    mfisk    ydata      0 Jan 02 17:26 /reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/4/stderr
-rwxrwxr-x     10a    mfisk    ydata  1.3e5 Jan 02 17:26 /reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/5/0
-rw-rw-r--     10a    mfisk    ydata  2.3e2 Jan 02 17:26 /reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/5/status
-rwxrwxr-x     10a    mfisk    ydata      0 Jan 02 17:26 /reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/5/stderr
-rwxrwxr-x     10b    mfisk    ydata    2e5 Jan 02 17:26 /reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/6/0
-rw-rw-r--     10b    mfisk    ydata  2.3e2 Jan 02 17:26 /reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/6/status
-rwxrwxr-x     10b    mfisk    ydata      0 Jan 02 17:26 /reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/6/stderr
-rwxrwxr-x     10b    mfisk    ydata  1.7e5 Jan 02 17:26 /reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/7/0
-rw-rw-r--     10b    mfisk    ydata  2.3e2 Jan 02 17:26 /reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/7/status
-rwxrwxr-x     10b    mfisk    ydata      0 Jan 02 17:26 /reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/7/stderr
-rwxrwxr-x     32a    mfisk    ydata  1.2e5 Jan 02 17:26 /reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/8/0
-rw-rw-r--     32a    mfisk    ydata  2.3e2 Jan 02 17:26 /reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/8/status
-rwxrwxr-x     32a    mfisk    ydata      0 Jan 02 17:26 /reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/8/stderr
-rwxrwxr-x     29a    mfisk    ydata  2.2e5 Jan 02 17:26 /reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/9/0
-rw-rw-r--     29a    mfisk    ydata  2.3e2 Jan 02 17:26 /reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/9/status
-rwxrwxr-x     29a    mfisk    ydata      0 Jan 02 17:26 /reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/9/stderr
-rwxrwxr-x     10b    mfisk    ydata      0 Jan 02 17:26 reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/0/0.d/uniq=20=2Dc/0
-rw-rw-r--     10b    mfisk    ydata    2e2 Jan 02 17:26 reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/0/0.d/uniq=20=2Dc/status
-rwxrwxr-x     10b    mfisk    ydata      0 Jan 02 17:26 reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/0/0.d/uniq=20=2Dc/stderr
-rwxrwxr-x     24a    mfisk    ydata  4.1e4 Jan 02 17:26 reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/1/0.d/uniq=20=2Dc/0
-rw-rw-r--     24a    mfisk    ydata  2.3e2 Jan 02 17:26 reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/1/0.d/uniq=20=2Dc/status
-rwxrwxr-x     24a    mfisk    ydata      0 Jan 02 17:26 reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/1/0.d/uniq=20=2Dc/stderr
-rwxrwxr-x     26a    mfisk    ydata  3.9e4 Jan 02 17:26 reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/2/0.d/uniq=20=2Dc/0
-rw-rw-r--     26a    mfisk    ydata  2.3e2 Jan 02 17:26 reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/2/0.d/uniq=20=2Dc/status
-rwxrwxr-x     26a    mfisk    ydata      0 Jan 02 17:26 reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/2/0.d/uniq=20=2Dc/stderr
-rwxrwxr-x     10a    mfisk    ydata  3.7e4 Jan 02 17:26 reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/3/0.d/uniq=20=2Dc/0
-rw-rw-r--     10a    mfisk    ydata  2.3e2 Jan 02 17:26 reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/3/0.d/uniq=20=2Dc/status
-rwxrwxr-x     10a    mfisk    ydata      0 Jan 02 17:26 reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/3/0.d/uniq=20=2Dc/stderr
-rwxrwxr-x     21a    mfisk    ydata    4e4 Jan 02 17:26 reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/4/0.d/uniq=20=2Dc/0
-rw-rw-r--     21a    mfisk    ydata  2.3e2 Jan 02 17:26 reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/4/0.d/uniq=20=2Dc/status
-rwxrwxr-x     21a    mfisk    ydata      0 Jan 02 17:26 reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/4/0.d/uniq=20=2Dc/stderr
-rwxrwxr-x     10a    mfisk    ydata  3.7e4 Jan 02 17:26 reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/5/0.d/uniq=20=2Dc/0
-rw-rw-r--     10a    mfisk    ydata  2.3e2 Jan 02 17:26 reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/5/0.d/uniq=20=2Dc/status
-rwxrwxr-x     10a    mfisk    ydata      0 Jan 02 17:26 reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/5/0.d/uniq=20=2Dc/stderr
-rwxrwxr-x     10b    mfisk    ydata  3.7e4 Jan 02 17:26 reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/6/0.d/uniq=20=2Dc/0
-rw-rw-r--     10b    mfisk    ydata  2.3e2 Jan 02 17:26 reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/6/0.d/uniq=20=2Dc/status
-rwxrwxr-x     10b    mfisk    ydata      0 Jan 02 17:26 reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/6/0.d/uniq=20=2Dc/stderr
-rwxrwxr-x     10b    mfisk    ydata  3.9e4 Jan 02 17:26 reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/7/0.d/uniq=20=2Dc/0
-rw-rw-r--     10b    mfisk    ydata  2.3e2 Jan 02 17:26 reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/7/0.d/uniq=20=2Dc/status
-rwxrwxr-x     10b    mfisk    ydata      0 Jan 02 17:26 reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/7/0.d/uniq=20=2Dc/stderr
-rwxrwxr-x     32a    mfisk    ydata  3.7e4 Jan 02 17:26 reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/8/0.d/uniq=20=2Dc/0
-rw-rw-r--     32a    mfisk    ydata  2.3e2 Jan 02 17:26 reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/8/0.d/uniq=20=2Dc/status
-rwxrwxr-x     32a    mfisk    ydata      0 Jan 02 17:26 reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/8/0.d/uniq=20=2Dc/stderr
-rwxrwxr-x     29a    mfisk    ydata  3.9e4 Jan 02 17:26 reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/9/0.d/uniq=20=2Dc/0
-rw-rw-r--     29a    mfisk    ydata  2.3e2 Jan 02 17:26 reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/9/0.d/uniq=20=2Dc/status
-rwxrwxr-x     29a    mfisk    ydata      0 Jan 02 17:26 reduce/0du3dWjGBIYbiA+PW1c9tOIUch8/9/0.d/uniq=20=2Dc/stderr

As you see, the output of a map operation is put in a file whose name is the map operation and whose location is a subdirectory of the input file (subdirectories have the same name as the input file, but with a .d added to the end). For example, the input file /etext/gryms10.txt was used as an input, the output will be under “gryms10.txt.d”. Since the command run was “sed -f ~/words.sed” operation, the stdout of that operation would be “/etext/gryms10.txt.d/sed -f ~/words.sed/0”. To simplify filenames, the Quoted-Printable encoding is used for all but simple alphanumeric characters in the filename. Thus, the actual file name is egrep=20-o=20=27=5cS=2b=27/0. In addition, you will find a file with the name stderr that contains the stderr output from the process and a status file created at completion containing statistics about the execution of the process. At the completion of a map operation, any stderr files for any stage of the entire pipeline are displayed.

Since the output of the sed stage was used as input to the “fm partition” stage, a “sed=20=2Df=20words=2Esed/0.d” directory is created. Any command can create multiple output files to support additional downstream parallelism. “fm partition” is one of these “partitioning” commands. So in addition to the 0 file for stdout, “fm partition” creates a number of output files numbered 1 through 9. (All map commands are run with their current working directory set to the output directory).

The shuffle operator, “<” (visualize the symbol as a fan-out) redistributes the data so that all files from the previous stage of the pipeline with the same basename reside on the same node for the reduce operation. The reduce operation itself is placed in a new directory since it is derived from multiple inputs. Each stage of a pipeline is called a job and the “fm jobs” command will list any currently-running jobs. The reduce output is placed in a subdirectory in /reduce which is named by the job-id. (Job IDs are digests of the internal job description files which can be found in /jobs). Reduce operations are run with a list of files as their arguments. In our case, one instance of sort will get all of the 0 inputs, another instance will get all of the 1 inputs, and so on.

The output of the reduce operation is input to the final pipeline stages using this the same process as the initial map operations. When the final stage of the pipeline completes, all of the final output files will be concatenated together and delivered on stdout.

Clone this wiki locally