You want to create sample (dummy) data on the fly in Hadoop by specifying some simple rules? Yeah, so did I!
SampleDataInputFormat generates sample records, which are then passed to the map() calls of your MapReduce job. The key has no content, and the value is a string of the sample field values separated by ASCII character 1 (Ctrl-A, which is also Hive's default field delimiter).
Each field can be generated by one of three methods:

- Range: Specify a start and end value, and SampleDataInputFormat will pick a random value in the range.
- Enum: Specify a list of values, and SampleDataInputFormat will pick a random value from the list. Use a single-entry list to define a static value for the field.
- UUID: SampleDataInputFormat will use the Java UUID library to generate a random 128-bit value. This can be used for unique key fields.
You also specify the chance that each field value will be NULL.
The rules for SampleDataInputFormat can be passed either as MapReduce job properties or as Hive TBLPROPERTIES.
| Property | Definition | Domain |
|---|---|---|
| sampledata.mappers | Number of mappers (Hive forces 1) | >= 1 |
| sampledata.records | Number of records across all mappers | >= 1 |
| sampledata.fieldnames | Field names to use for later properties | Comma-separated list |
| sampledata.fields.{fieldname}.type | Data type | "string", "int", "double" or "date" |
| sampledata.fields.{fieldname}.date.format | Format of date strings | As per java.text.SimpleDateFormat, e.g. "yyyy/MM/dd" |
| sampledata.fields.{fieldname}.nulls.weight | Chance that a value will be NULL | 0.0 to 1.0 |
| sampledata.fields.{fieldname}.method | How to generate this field | "range", "enum" or "uuid" |
| sampledata.fields.{fieldname}.range.start | Lower bound of the range (not valid for string fields) | Inclusive |
| sampledata.fields.{fieldname}.range.end | Upper bound of the range (not valid for string fields) | Exclusive |
| sampledata.fields.{fieldname}.enum.values | List of enum values | Comma-separated list |
Caveat emptor: there is almost no error checking!
The easiest way to use SampleDataInputFormat is through Hive. An external table is created that points to a real HDFS directory, but no data is actually read from it.
The data generation rules are specified in the Hive table DDL's TBLPROPERTIES. Each SELECT from the table will dynamically bring back a new set of sample records.
The Hive DDL syntax can be gleaned from the example DDL script.
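As a rough illustration (the example DDL script remains the authoritative reference), a table definition might look something like the sketch below. The table name, column names, HDFS location, rule values and the fully qualified InputFormat class name are all placeholders for this example; only the sampledata.* property names come from the table above. Because the generated values are separated by ASCII character 1, Hive's default row format parses them without an explicit ROW FORMAT clause.

```sql
-- A minimal sketch, assuming the jar has been added to the session (ADD JAR ...) and that
-- the InputFormat's fully qualified class name is com.example.SampleDataInputFormat
-- (substitute the real class name from your build).
CREATE EXTERNAL TABLE sample_people (
  id      STRING,
  age     INT,
  score   DOUBLE,
  country STRING
)
STORED AS
  INPUTFORMAT  'com.example.SampleDataInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/tmp/sample_people'  -- a real HDFS directory; no data is actually read from it
TBLPROPERTIES (
  'sampledata.records'                    = '1000',
  'sampledata.fieldnames'                 = 'id,age,score,country',
  'sampledata.fields.id.type'             = 'string',
  'sampledata.fields.id.method'           = 'uuid',
  'sampledata.fields.age.type'            = 'int',
  'sampledata.fields.age.method'          = 'range',
  'sampledata.fields.age.range.start'     = '18',
  'sampledata.fields.age.range.end'       = '66',
  'sampledata.fields.score.type'          = 'double',
  'sampledata.fields.score.method'        = 'range',
  'sampledata.fields.score.range.start'   = '0.0',
  'sampledata.fields.score.range.end'     = '1.0',
  'sampledata.fields.score.nulls.weight'  = '0.1',
  'sampledata.fields.country.type'        = 'string',
  'sampledata.fields.country.method'      = 'enum',
  'sampledata.fields.country.enum.values' = 'AU,NZ,UK,US'
);
```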
When using Hive, it is important to force the query to use MapReduce. If you run a SELECT * with no filters or joins, such as SELECT * FROM table LIMIT 100;, Hive will skip MapReduce and use the InputFormat directly to return rows to the screen, but the generation rules will not be passed to it. You can force MapReduce by adding a tautology filter, such as SELECT * FROM table WHERE 1=1 LIMIT 100;
You can override the parameters from a script or within the Hive shell, for example: SET sampledata.records=100000;
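Putting the two together, a short interactive session might look like this (the table name matches the illustrative DDL sketch above):

```sql
-- Override a generation rule for this session only.
SET sampledata.records=100000;

-- The tautology filter (WHERE 1=1) forces Hive to plan a MapReduce job,
-- so the sampledata.* rules are passed through to SampleDataInputFormat.
SELECT * FROM sample_people WHERE 1=1 LIMIT 100;
```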
Note that due to some poor design decisions in Hive, extra code will need to be added to SampleDataInputFormat before Hive can use multiple mappers. Until that is added, Hive will only run this with a single mapper.
I have not tested this with plain MapReduce, but it should work fine if you pass all of the parameters as job configuration properties.