Data Wrangling

ASCII Numeric Data

Dense Data

The goal of data wrangling is to get data from outside BIDMach into a numeric form that it can process. In the simplest case, you may have some tabular numerical data in a text file that looks like this:

1.2 4.3 5.4 0.2 0.1
0.2 7.2 8.0 1.5 3.2
...

Assuming this is a regular table (same number of elements per row), you can read it straight into BIDMach with

val a = loadFMat("<filename>.txt")

It's important that the filename ends in ".txt": that's the clue to the loader that this is a text file. The file should contain numerical elements only (no letters or other symbols). The elements can be separated by spaces, commas or tabs (or a combination).

You can also read integer data using "loadIMat", assuming the filename ends with ".txt". For double precision data the function is "loadDMat".
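
For example, assuming similarly formatted text files of the appropriate element types (the file names here are just placeholders):

val counts = loadIMat("counts.txt")    // integer data
val vals = loadDMat("values.txt")      // double-precision data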

Sparse Matrices

Sparse matrices come in a variety of forms, which will often require some custom code. But one common format is row/column/value triples, which will look like this:

1 0 0.75
1 1 0.33
2 1 1.57
5 1 0.02
...

The data are of mixed types: two integers and one float or double per line. But the double reader can handle all of them, so you can load this with "loadDMat". Note that loadFMat will sometimes work, but the float type only holds 24 bits of integer precision, so it won't work if the indices are bigger than about 16 million. After loading the triples with loadDMat, you can convert to a sparse matrix using "cols2sparse" like this:

val a = loadDMat("<filename>")
val s = cols2sparse(a, [fliprc], [issorted], [ibase])

where the arguments in [] are optional. [fliprc] is a Boolean specifying whether the indices are really row, column or flipped (column, row). [issorted] is a Boolean which specifies whether the data are already sorted in column-major order, i.e. lexicographically by column, row. [ibase] is an integer {0,1} specifying whether the input data are zero- (default) or one-based. Zero-based row/column indices start at zero, while one-based indices start at one. For the example above (which must be zero-based), the data can be converted as

val s = cols2sparse(a)

since the parameters default to fliprc=false, issorted=true, ibase=0.
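
If, say, your triples were stored as (column, row, value) with one-based indices and were not already sorted, you could pass the options explicitly (a sketch, using the positional argument order shown above):

val s = cols2sparse(a, true, false, 1)   // fliprc=true, issorted=false, ibase=1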

Checking Sparse Matrices

The complexities of generating large data often lead to "glitches": indices that are apparently sorted may contain out-of-order elements, or indices may be negative or otherwise out of range. You can check the integrity of a sparse matrix with

s.check

which checks index order and range. It will throw an exception if any problem is found, so that it can be used in the middle of an application program.
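
If you would rather log the problem and keep going, for example while scanning many files, you can catch the exception (a minimal sketch):

try {
  s.check
} catch {
  case e: Exception => println("sparse matrix failed validation: " + e.getMessage)
}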

Table Parser (Tparse)

BIDMach includes a parser for common tabular data formats (tab- or comma-separated). The data should be in ASCII format, divided into rows, and with the same number and types of elements in each row. Empty fields are fine, but there should be exactly k-1 separators on each line of a k-column table. The data format is specified in a schema file. The schema file for the sparse data file from the last section looks like this:

int rows
int cols
float values

To parse this file, we use the "tparse" binary which will be in BIDMach/cbin. tparse is invoked like this:

tparse -i <infile> -f <format_file> -o <output_prefix>

where <output_prefix> is a partial file name for the collection of output files. The field names and field types are appended to the prefix to form the actual file names. e.g. if the prefix is "c:/data/nytimes." with the above schema, the output files will be:

c:/data/nytimes.rows.imat
c:/data/nytimes.cols.imat
c:/data/nytimes.values.fmat

Adding the "-c" option causes the output files to be compressed with gzip, generating the files:

c:/data/nytimes.rows.imat.gz
c:/data/nytimes.cols.imat.gz
c:/data/nytimes.values.fmat.gz

To parse comma-separated data, add the option -d ',' to change the separator from the default. In general, the -d flag changes the column delimiter to a value of your choice.

Both the compressed and uncompressed files are in BIDMat's binary format and can be loaded with "loadIMat" and "loadFMat". These loads are much faster (at least an order of magnitude) than loading the ASCII data.
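
Putting it together for the sparse-triple example, the tparse outputs can be loaded and assembled into a sparse matrix (a sketch, reusing the uncompressed file names above and the sparse() builder described later on this page):

val rows = loadIMat("c:/data/nytimes.rows.imat")
val cols = loadIMat("c:/data/nytimes.cols.imat")
val vals = loadFMat("c:/data/nytimes.values.fmat")
// build the sparse matrix; dimensions are taken from the largest zero-based indices
val s = sparse(rows, cols, vals, maxi(rows).v + 1, maxi(cols).v + 1)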

Word Fields

There are two field types for string data. The first is "word" and the second is "string". The word field type produces a single numeric token for each field, using a dictionary. e.g. "the white house" is encoded as a single token for the literal string "the white house". The word type produces two output matrices, an IMat containing the indices of the tokens, and an SBMat (sparse byte matrix) which is the dictionary. The IMat has the same number of rows (and one column) as the input table.

SBMat's are a fast way to encode a lot of string data. An SBMat does not have explicit row indices. Its elements are implicitly packed to the lowest row indices. i.e. the j^th element in a column is in row j. Each column encodes a single string token, and the column number is the index of the token.

Let's look at an example. Here is the schema:

word sometext

and here is a datafile:

hot
hot 
hot or
not

Running tparse on this data/format file produces two matrices:

data.sometext.imat
data.sometext.sbmat

and if we load them:

val a = loadIMat("data.sometext.imat")
val b = loadSBMat("data.sometext.sbmat")

we see

> a.t
0 0 1 2
> b
hot,hot or,not
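
The SBMat dictionary can be wrapped in BIDMat's Dict class for convenient token lookups (a sketch; the Dict(...) constructor and string lookup follow the pattern used in the Bag-of-Words section below):

val d = Dict(b)
d("hot or")      // returns 1, the index of the field "hot or" in the dictionary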

String Fields

The string field allows a field from a table to be split into individual tokens using sub-separator characters. These sub-separator(s) must be distinct from the field separator(s). The string field separator(s) are specified in the format file, on a separate line. So to parse the above data file as string data, we create a schema file that looks like this:

string text
<space>

where the space is literally a space character. You may need to use a low-level editor to make sure this whitespace isn't removed. The separator for fields can be altered with the "-d" option to tparse; it defaults to a tab character. In this example there is only one field, so the main separator isn't needed.

Running tparse on the new schema and the old input file produces two output files:

data.text.imat
data.text.sbmat

where the data file is now a two-column imat. The first column of this imat contains the row index from the input file, and each entry in the second column is a word_id from that row. The word_id is the dictionary position of that token, as a zero-based index (Note: pre-1.1 versions of BIDMach use one-based indexing). Thus if we do:

> val a = loadIMat("data.text.imat")
> val b = loadSBMat("data.text.sbmat")
> a
0 0
1 0
2 0
2 1
3 2
> b
hot,or,not

from which we can see that column 0 contains the string (row) indices 0 through 3, and column 1 contains the word ids. Most strings consist of a single token, but string 2 contains two word ids, for "hot" and "or".

If we want to reconstruct the input, we can't directly use SBMat, which can only represent a linear array of strings (since rows encode chars within strings). But there is a related type called CSMat (C-string matrix) which can hold 2D arrays with string values. If we construct a CSMat from b, the word ids in column 1 of a will directly index it. In other words

> val cs = CSMat(b)
hot,or,not
> cs(a(?,1)).t
hot, hot, hot, or, not

Why is string data stored in this apparently-transposed format in SBMats? It goes back to the difference between databases and scientific computing. The former typically stores instances as rows, while the latter has mostly used compressed sparse column (CSC) format (or at least Matlab and many numerical libraries have). With CSC format, one can easily access an instance (a single column), or a range of instances, if the instances are stored as columns. But there is no way to access a row or range of rows without scanning the entire array. For the same reason, SBMat stores the letters of a token as a column, which allows fast lookup.

Flex Tokenizer

Many data sources produce semi-structured data such as HTML, XML or JSON. It's possible to use parsers for each of these, but this is often challenging because of:

  • The need for an accurate schema specifying the types of objects.
  • The difficulty of recovering from format errors in badly-formed inputs.
  • The need to know and extract the data fields of interest at parse time.

Most of the datasets we have worked with use "shallow" formats, e.g. Wikipedia, TREC, Twitter etc., and it turns out to be very burdensome to use schemas with these sources. They also have a lot of non-compliant data, and it's easy to spend most of your time trying to recover from format errors.

Instead we have come to rely on separate tokenization functions in C++ which produce binary files, followed by "extraction" from the binary data in BIDMach. This workflow is quite effective even for very large datasets: we manage about 10TB of Twitter data on a single machine this way. The tokenizers use Flex, the Fast Lexical Analyzer, to break the data into tokens, build a dictionary from them, and create an output file containing binary indices for tokens into the dictionary. Each tokenizer produces three output files, such as:

mydata.imat
mydata.dict.sbmat
mydata.dict.imat

The first of these contains indices into the dictionary and interpreted entities. The second two form the dictionary, with strings in the SBMat and counts of tokens in the "dict" IMat. The tokenizers include an individual flex source file for that particular data format (such as xmltweet.flex), and common driver code in newparse.cpp and utils.cpp. As well as splitting the input into tokens which reference dictionary entries, the tokenizer recognizes and decodes common entities such as integers, floating point numbers and dates. Each non-entity token appears in the output as a positive 31-bit integer, which is its index into the dictionary. Entity tokens are truncated to 31 bits and stored with the sign bit set (i.e. as negative integers). Actually the representation is up to the user: you can store non-entity and entity objects with 32 bits if you are confident of being able to differentiate them from context, or you could add a special integer (which could not be a dictionary index) before each non-entity token.

The sample tokenizers (twitter, wikipedia and TREC) use the sign-bit approach. Integer and date entities lose their leading bit, while floating point numbers lose their trailing bit. This is a compromise which works well with a pretty wide variety of input data, but obviously has limitations. A tokenizer can be run like this:

> xmlwiki -i data.txt -o mydata.

and for a data file like:

<text> the first 3 digits of pi are 3.14 and the first 6 are 3.14159</text>

If we load

> val a = loadIMat("mydata.imat").t
1,2,3,-2147483645,4,5,6,7,-1608221983,8,2,3,-2147483642,7,-1608218648,9
> val d = loadSBMat("mydata.dict.sbmat")
<text>,the,first,digits,of,pi,are,and,</text>

we see that all the positive indices are one-based indices into the dictionary, and that negative entries directly hold entity values that can be extracted with some bit-fiddling. The floating point values have to be extracted from the integer fields by masking and right-shifting, which can be done in Java. To convert to floats, you need to "cast" them, which is possible but awkward in Java. The routine edu.berkeley.bid.UTILS.memcpy(len, ptr1, off1, ptr2, off2) copies and casts an Array[Int] ptr1 with a word offset of off1 into an Array[Float] ptr2 with a word offset of off2.
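
A rough sketch of separating the two kinds of tokens (assuming BIDMat's find function and element-wise comparison with a scalar, as used elsewhere on this page; decoding the entity bits themselves is tokenizer-specific and only hinted at here):

val toks = loadIMat("mydata.imat")
val wordids = toks(find(toks >= 0))    // positive values: one-based dictionary indices
val entities = toks(find(toks < 0))    // negative values: encoded entities
// for a single entity value v (a plain Int), clearing the sign bit recovers the 31 stored bits,
// but how those bits encode an integer, date or float depends on the particular tokenizer:
// val bits = v & 0x7fffffff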

Converting to Bag-of-Words representation

The most common representation of data for BIDMach is as a sparse matrix representing the bag of words for a sentence or string. The parsers above produce integer data with word ids. To convert to BoW, the first step is to partition the word ids into input samples, i.e. sentences or paragraphs. For tparse, this has already been done. i.e. if the IMat "parsed" contains the data for a given input file, then do

val words = parsed(?,1)
val samples = parsed(?,0)

For the XML tokenizers, some user logic is needed to set the appropriate sample id. e.g. if the data "parsed" has a <status> tag at the start of each sample, you can produce a vector of sample ids like this:

val d = Dict(loadSBMat("<the dictionary file>"))
val words = parsed
val samples = max(0, cumsum(parsed == d("<status>"))-1) // subtract one so sample index is zero-based

This code finds all the occurrences of the wordid of "<status>", and increments a running count (cumsum) at each one. Now to construct a BoW SMat for this data, do:

val m = sparse(words, samples, ones(words.length,1), nwords, nsamples);

where "nwords" is the number of distinct words, and "nsamples" is the number of distinct samples. Typically for a single data file:

val nwords = maxi(words).v + 1;
val nsamples = maxi(samples).v + 1;

but if you have multiple data files, nwords should match the dictionary size.

Sparse Value Encoding

Another important encoding (used e.g. by the LSTM code) represents the data with all the words ordered along the columns of a sparse matrix. e.g. given a dictionary "d" which encodes the words "one", "two",... with their integer values, i.e.

d(1->11)
"one","two","three","four",...

and some sentences such as:

two two
one five four
three one

The encoding is a sparse matrix with values:

2   1   3
2   5   1
    4

so that the i^th column is the sequence of wordids in the i^th sentence. To produce this encoding, you can do:

val a = sparse(null, samples, words, sentlen, nsamples)

The first argument "null" causes the sparse function to create a sparse matrix with implicit row indices. Implicit row indices start at zero in each column and increment with each data item. That means a column with k non-zeros has elements only in the first k rows. "sentlen" should be the maximum number of rows, here 3. Then we can examine the full matrix with:

> full(a)
2   1   3
2   5   1
0   4   0

By the way, if you want to compute the row indices explicitly, you can do it with:

val rowinds = cumsumByKey(iones(samples.length,1), samples) - 1; // subtract one so the row indices are zero-based

Caveats with Sparse Value Encoding

There are several caveats with sparse value encoding:

  • Sparse matrices do not encode "0", so you need to make sure the token with value "0" does not occur in this representation. Add a dummy first token to the dictionary if needed and add one to all the wordids (see the sketch after this list).
  • Single precision sparse matrices (SMats) have floating point values, so the largest wordid you can represent exactly is 2^24, or about 16 million. If you need more wordids, use SDMat (double precision sparse matrices).
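
A minimal sketch of the dummy-token shift, assuming "words", "samples", "sentlen" and "nsamples" are as in the previous section:

val shifted = words + 1          // wordid 0 becomes 1, so no zeros are stored in the sparse matrix
val a = sparse(null, samples, shifted, sentlen, nsamples)
// remember the offset when decoding: value i+1 corresponds to dictionary entry i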

Variable Types: Continuous, Discrete, Categorical

Continuous Data

Continuous (real-valued) data usually require minimal pre-processing. It may be valuable to normalize the data, i.e. subtract the population mean for each feature and divide by its standard deviation. But regression, SVMs and random forests are insensitive to feature offset and (relatively) insensitive to scale. If the input data is sparse, it's important to make sure any value transformation preserves sparseness. Offsets by the mean don't do this, but division by the standard deviation does.
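
A sketch of such normalization for dense data, assuming features are stored as rows and samples as columns in an FMat "a", and assuming BIDMat's mean and variance reductions and vector broadcasting are available:

val mu = mean(a, 2)              // per-feature means (one column)
val sd = sqrt(variance(a, 2))    // per-feature standard deviations
val anorm = (a - mu) / sd        // broadcast subtract and divide across the columns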

Discrete Data

Discrete data often derive from counts (e.g. word instances in a document), and can benefit from a few transformations. For example, one of the best transformations for count data used with SVMs is a square root, which normalizes the relative error of the counts. Log transformations, e.g. TF-IDF weighting, may also help.
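
For instance, a square-root transform of a dense count matrix is just (a sketch; for sparse count data you would transform the stored values, which keeps zeros at zero):

val damped = sqrt(counts)        // "counts" is a hypothetical FMat of raw counts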

Categorical Data

Categorical data include many types of unordered labels, e.g. document topic labels. Given a field that encodes a category, the tokenizers will produce an integer index for that field that points to a text representation of the category in a dictionary. This representation can't be used directly in most learning algorithms. Instead we need a binarized version that shows membership (or not) in one of k categories. This is called a one-hot encoding: the index value i is transformed into a vector [0, 0, 0, ..., 1, ..., 0, 0]^T with a 1 in the i-th position. For a sequence of category indices, we would like to produce a sequence of column vector features that can be combined with other feature types into column-based feature data. There is a utility function oneHot for this purpose. You can use it like this:

> val a = irow(0, 2, 5, 2, 1, 5, 3)
0,2,5,2,1,5,3
> val b = oneHot(a, 6)
(   0,   0)   1
(   2,   1)   1
(   5,   2)   1
(   2,   3)   1
(   1,   4)   1
(   5,   5)   1
(   3,   6)   1
> full(b)
1   0   0   0   0   0   0
0   0   0   0   1   0   0
0   1   0   1   0   0   0
0   0   0   0   0   0   1
0   0   0   0   0   0   0
0   0   1   0   0   1   0

where oneHot accepts the matrix to convert and an integer which specifies how many rows to use for the output.
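
Since the one-hot columns line up with the sample columns, they can be stacked under other column-based features. A sketch, where "contFeats" is a hypothetical matrix of other features with one column per sample, and "on" is BIDMat's vertical concatenation operator:

val feats = contFeats on full(b)   // continuous features stacked above the one-hot rows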