Skip to content

Quick Start Guide

dhruvbird edited this page Nov 29, 2012 · 22 revisions

Quick Start Guide

Software required to run lib-face

Download (and extract) lib-face & its dependencies

$> wget "https://github.com/duckduckgo/cpp-libface/archive/master.tar.gz"
$> tar -zvxf cpp-libface-master.tar.gz
$> cd cpp-libface-master
$> cd deps
$> wget https://github.com/joyent/http-parser/archive/master.tar.gz -O http-parser-master.tar.gz
$> wget https://github.com/joyent/libuv/archive/master.tar.gz -O libuv-master.tar.gz
$> rmdir http-parser
$> rmdir libuv
$> mv http-parser-master http-parser
$> rmdir libuv-master libuv
$> cd ..

Or get it from the github repository as shown below:

$> git clone --recursive [email protected]:duckduckgo/cpp-libface.git
$> cd cpp-libface

Compile lib-face

$> make

Start lib-face

$> ./lib-face -f path_to_input_file.tsv

That's it!!

Input data file format

Your input data file should be a text file containing separate lines for each phrase you want auto-completion on. Each line should be formatted as:

FREQUENCY_COUNT\tPHRASE\tSNIPPET\n

The last component (\tSNIPPET) is optional and may be skipped.

For example (shows lines with and without the SNIPPET):

10        hello world        
15        How I met your mother
21        How do you do?            Not an answer
9         Great Days ahead          Show in blue background

The importer will convert every phrase (not snippet) to lowercase. Your input text file should be utf8 encoded.

Important URLs

You will mainly use the following URLs:

  1. The import URL: http://localhost:6767/face/import/?file=PATH_TO_INPUT_FILE

    This will import everything in the file at PATH_TO_INPUT_FILE into the currently running lib-face instance's memory.

  • Selecting the file to import: The file=PATH_TO_INPUT_FILE parameter controls which file to import data from.

  • Importing Sorted Files: You may also import a pre-sorted file to lib-face. lib-face will automatically deduce the sortedness of the input file and speed up the import process. Importing sorted files is much (up to 4x faster [actually O(log n)]) faster than importing files that are not sorted.

Example of importing a sorted file: http://localhost:6767/face/import/?file=PATH_TO_INPUT_FILE

  • Limiting Lines Imported: If you only want to import the first N lines of the input file, you can also pass in a limit=N GET parameter.

Example of importing only up to the first 10 lines of a file: http://localhost:6767/face/import/?file=PATH_TO_INPUT_FILE&limit=10

  1. The query URL: http://localhost:6767/face/suggest/?q=PREFIX&n=LIMIT&callback=CB
  • Setting the query prefix: The query prefix can be set by passing q=PREFIX as the query parameter

  • Limiting the result set: The n=LIMIT is optional and the maximum value you can supply is 32. If you specify a number greater than 32, you will receive only up to 32 results. This behaviour can be changed by modifying the handler's code. The return value is JSON formatted.

  • Passing a callback for JSONP: You may also optionally pass a callback=CB parameter which treats the request as a JSONP request and wraps the output in the callback named CB.

  1. The export URL: http://localhost:6767/face/export/?file=PATH_TO_INPUT_FILE

    Exports the in-memory data store to the file specified by the file= parameter. This file is written sorted so that the next time you import this file, the import will take much lesser time.

  2. The status URL: http://localhost:6767/face/stats/

    Displays various stats such as the current up time, number of items in the data store and the number of queries handled.

Command line options

You can import files and change the port on which lib-face listens for new HTTP connections using the optional command like arguments:

-h, --help           This screen
-f, --file=PATH      Path of the file containing the phrases
-p, --port=PORT      TCP port on which to start lib-face (default: 6767)
-l, --limit=LIMIT    Load only the first LIMIT lines from PATH (default: -1 [unlimited])

Compilation Options

You can pass compiler options to make using the COPT variable.

  1. RMQ=[SparseTable|SegmentTree|BenderRMQ]: The default is SegmentTree. This governs the Range Max. Query data structure to use. This table summarizes the cost for each structure.
Data Structure Space Required Build Cost Query Cost Build Time (sec) Query Time (sec)
SegmentTree O(n) = 4n O(n) O(lg n) = lg n 0.09 9.73
SpareseTable O(n lg n) = 2n lg n O(n lg n) O(1) = 2 0.49 0.70
BenderRMQ O(n) = 8n O(n) O(1) = 10 0.44 4.00

The constants after the = sign show the relative hidden constants for each method as implemented here. You can see that while BenderRMQ is asymptotically the best, it may not be the best for your data set because of the high hidden constant. That is the reason why SegmentTree is the default RMQ data structure. You should definitely experiment with your data set to find the best structure for your needs.

The running times are for querying 10M ranges in an array containing 4M elements. See the file tests/rmq_perf.cpp for more details. All benchmarks were run using the command make perf COPT="-O2 -DNDEBUG"

Data Structure Build Time (sec) Query Time (sec)
SegmentTree 0.09 9.73
SpareseTable 0.49 0.70
BenderRMQ 0.44 4.00

The following table shows the same information, but for querying 30M ranges in an array containing 8M elements.

Data Structure Build Time (sec) Query Time (sec)
SegmentTree 0.23 32.61
SpareseTable 1.16 2.96
BenderRMQ 1.14 12.76

You can read more about these RMQ data structures here.

  1. NMAX=N: This regulates the maximum number of results every returned to a query. The default value is 32. This parameter is mainly a security measure to prevent attackers from requesting very large result sets.

  2. INPUT_LINE_SIZE=N: The maximum length of a line expected to be seen in the input file. lib-face internally uses an FSM (Finite State Automata) to parse input lines. If any line in your input file is greater than this, then lib-face will either crash or you will get undefined behaviour. Please set this value carefully. The default is 8192.

You can pass these options to make as:

make COPT="-DRMQ=SegmentTree -DNMAX=40"