The Druid Data Driver is a Python script that simulates a workload and generates data for Druid ingestion. You describe the characteristics of the workload in a JSON config file, and the script uses that file to generate JSON records.
Here are the commands to set up the Python environment:
apt-get update
apt-get install -y python3
apt-get install -y python3-pip
pip install confluent-kafka
pip install python-dateutil
pip install kafka-python
pip install numpy
pip install sortedcontainers
Run the program as follows:
python DruidDataDriver.py <options>
Options include:
-f <configuration file name>
-s <start time in ISO format (optional)>
-n <total number of records to generate>
-t <duration for generating records>
Use the -f option to designate a configuration file name. If you omit the -f option, the script reads the configuration from stdin.
The -s option tells the driver to use simulated time instead of wall clock time (the default). The simulated clock starts at the time specified by the argument (or at the current time if no argument is specified) and advances based on the generated events (i.e., records). When used with the -t option, the duration is measured on the simulated clock. This option is useful for generating batch data as quickly as possible.
The other two options (-n and -t) control how long the script runs. If neither option is present, the script runs indefinitely. The -t and -n options are mutually exclusive (use one or the other). Time durations may be specified in seconds, minutes or hours. For example, specify 30 seconds as follows:
-t 30S
Specify 10 minutes as follows:
-t 10M
Or, specify 1 hour as follows:
-t 1H
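As an illustration of the duration format, the following minimal sketch converts strings such as 30S, 10M and 1H into seconds. The function name parse_duration is hypothetical, and the parsing rules are an assumption based only on the examples above; the driver's own parser may accept other forms.

```python
# Hypothetical helper, not part of DruidDataDriver.py.
# Assumes a duration is an integer followed by one of S, M or H.
_UNIT_SECONDS = {"S": 1, "M": 60, "H": 3600}


def parse_duration(text):
    """Convert a duration such as '30S', '10M' or '1H' into seconds."""
    text = text.strip()
    if len(text) < 2 or text[-1] not in _UNIT_SECONDS:
        raise ValueError(f"invalid duration: {text!r}")
    return int(text[:-1]) * _UNIT_SECONDS[text[-1]]


print(parse_duration("30S"))  # 30
print(parse_duration("10M"))  # 600
print(parse_duration("1H"))   # 3600
```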
The config file contains JSON describing the characteristics of the workload you want to simulate (see the examples folder for example config files). A workload consists of a state machine, where each state outputs a record. The state machine is probabilistic, which means that the next state is chosen at random according to configured probabilities. Each state in the state machine performs four operations:
- First, the state sets any variable values
- Next, the state emits a record (based on an emitter description)
- The state delays for some period of time (based on a distribution)
- Finally, the state selects and transitions to a different state (based on a probabilistic transition table)
Emitters are record generators that output records as specified in the emitter description. Each state employs a single emitter, but the same emitter may be used by many states.
The config file has the following format:
{
"target": {...},
"emitters": [...],
"interarrival": {...},
"states": [...]
}
The target object describes the output destination. The emitters list is a list of record generators. The interarrival object describes the inter-arrival times (i.e., inverse of the arrival rate) of entities to the state machine. The states list is a description of the state machine.
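To make the overall shape concrete, here is a minimal sketch that builds a small config as a Python dictionary and serializes it to JSON. All names (example_emitter, status, serial, state_1) are hypothetical, the field choices follow the formats described later in this document, and the result has not been verified against the driver.

```python
import json

# Hypothetical config: one emitter, one state that loops back to itself.
config = {
    "target": {"type": "stdout"},
    "emitters": [
        {
            "name": "example_emitter",
            "dimensions": [
                {
                    "type": "enum",
                    "name": "status",
                    "values": ["ok", "error"],
                    "cardinality_distribution": {"type": "uniform", "min": 0, "max": 1},
                },
                {"type": "counter", "name": "serial"},
            ],
        }
    ],
    # One arrival per second on average (constant inter-arrival time).
    "interarrival": {"type": "constant", "value": 1},
    "states": [
        {
            "name": "state_1",
            "emitter": "example_emitter",
            "delay": {"type": "constant", "value": 1},
            "transitions": [{"next": "state_1", "probability": 1.0}],
        }
    ],
}

config_text = json.dumps(config, indent=2)
print(config_text)
```

Saving the printed JSON to a file and passing that file with the -f option is one way to try it out.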
Use distribution descriptor objects to parameterize various characteristics of the config file (e.g., inter-arrival times, dimension values, etc.) according to the config file syntax described in this document.
There are four types of distribution descriptor objects: Constant, Uniform, Exponential and Normal. Here are the formats of each of these types:
The constant distribution generates the same single value.
{
"type": "constant",
"value": <value>
}
Where value is the value generated by this distribution.
Uniform distribution generates values uniformly between min and max (inclusive).
{
"type": "uniform",
"min": <value>,
"max": <value>
}
Where:
- min is the minimum value sampled
- max is the maximum value sampled
Exponential distributions generate values following an exponential distribution with the specified mean.
{
"type": "exponential",
"mean": <value>
}
Where mean is the resulting average value of the distribution.
Normal distributions generate values with a normal (i.e., bell-shaped) distribution.
{
"type": "normal",
"mean": <value>,
"stddev": <value>
}
Where:
- mean is the average value
- stddev is the standard deviation of the distribution
Note that negative values generated by the normal distribution may be forced to zero when necessary (e.g., interarrival times).
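The four descriptor types map naturally onto standard random number functions. The sketch below uses Python's random module to draw one sample from a descriptor; the function names sample and sample_non_negative are hypothetical, and this is an illustration of the semantics rather than the driver's implementation.

```python
import random


def sample(descriptor, rng=random):
    """Draw one value from a distribution descriptor object (illustrative only)."""
    kind = descriptor["type"]
    if kind == "constant":
        return descriptor["value"]
    if kind == "uniform":
        return rng.uniform(descriptor["min"], descriptor["max"])
    if kind == "exponential":
        # expovariate takes the rate (lambda), which is 1 / mean.
        return rng.expovariate(1.0 / descriptor["mean"])
    if kind == "normal":
        return rng.gauss(descriptor["mean"], descriptor["stddev"])
    raise ValueError(f"unknown distribution type: {kind}")


def sample_non_negative(descriptor, rng=random):
    """Sample a value and force negatives to zero, e.g. for interarrival times."""
    return max(0.0, sample(descriptor, rng))


print(sample({"type": "constant", "value": 5}))
print(sample({"type": "uniform", "min": 3, "max": 6}))
print(sample({"type": "exponential", "mean": 36}))
print(sample_non_negative({"type": "normal", "mean": 0, "stddev": 1}))
```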
There are four flavors of targets: stdout, file, kafka, and confluent.
stdout targets print the JSON records to standard out and have the form:
"target": {
"type": "stdout"
}
file targets write records to the specified file and have the following format:
"target": {
"type": "file",
"path": "<filename goes here>"
}
Where:
- path is the path and file name
kafka targets write records to a Kafka topic and have this format:
"target": {
"type": "kafka",
"endpoint": "<ip address and optional port>",
"topic": "<topic name>",
"topic_key": [<list of key fields>],
"security_protocol": "<protocol designation>",
"compression_type": "<compression type designation>"
}
Where:
- endpoint is the IP address and optional port number (e.g., "127.0.0.1:9092") - if the port is omitted, 9092 is used
- topic is the topic name as a string
- topic_key (optional) is the list of generated fields used to build the key for each message
- security_protocol (optional) a protocol specifier ("PLAINTEXT" (default if omitted), "SSL", "SASL_PLAINTEXT", "SASL_SSL")
- compression_type (optional) a compression specifier ("gzip", "snappy", "lz4") - if omitted, no compression is used
confluent targets write records to a Confluent topic and have this format:
"target": {
"type": "confluent",
"servers": "<bootstrap servers>",
"topic": "<topic name>",
"topic_key": [<list of key fields>],
"username": "<username>",
"password": "<password>"
}
Where:
- servers is the Confluent bootstrap servers (e.g., "pkc-lzvrd.us-west4.gcp.confluent.cloud:9092")
- topic is the topic name as a string
- topic_key (optional) is the list of generated fields used to build the key for each message
- username is the cluster API key
- password is the cluster API secret
The emitters list is a list of record generators. Each emitter has a name and a list of dimensions, where the list of dimensions describes the records the emitter will generate.
An example of an emitter list looks as follows:
"emitters": [
{
"name": "short-record",
"dimensions": [...]
},
{
"name": "long-record",
"dimensions": [...]
}
]
The dimensions list contains specifications for all dimensions (except the __time dimension, which is always the first dimension in the output record specified in UTC ISO 8601 format, and has the value of when the record is actually generated).
Many dimension types may include a Cardinality (the one exception is the enum dimension type). Cardinality defines how many unique values the driver may generate. Setting cardinality to 0 provides no constraint to the number of unique values. But, setting cardinality > 0 causes the driver to create a list of values, and the length of the list is the value of cardinality.
When cardinality is greater than 0, cardinality_distribution informs how the driver selects items from the cardinality list. We can think of the cardinality list as a list with zero-based indexing, and the cardinality_distribution determines how the driver will select an index into the cardinality list. After using the cardinality_distribution to produce an index, the driver constrains the index so as to be a valid value (i.e., 0 <= index < length of cardinality list). Note that while uniform and normal distributions make sense to use as distribution specifications, constant distributions only make sense if the cardinality list contains only a single value. Further, for cardinality > 0, avoid an exponential distribution, as the driver will round down any values that are too large and produce a distorted distribution.
As an example of cardinality, imagine the following String element definition:
{
"type": "string",
"name": "Str1",
"length_distribution": {"type": "uniform", "min": 3, "max": 6},
"cardinality": 5,
"cardinality_distribution": {"type": "uniform", "min": 0, "max": 4},
"chars": "abcdefg"
}
This defines a String element named Str1 with five unique values that are three to six characters in length. The driver will select from these five unique values uniformly (by selecting indices in the range of [0,4]). All values may only consist of strings containing the letters a-g.
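The Str1 example can be pictured with the following sketch: it pre-builds a list of five unique strings and then picks from that list by index, constraining the index to the list bounds. The helper names are hypothetical, and the integer uniform draw is an assumption; how the driver converts a sampled value into an index is not specified here.

```python
import random

rng = random.Random(42)
chars = "abcdefg"


def clamp_index(raw_index, list_length):
    """Constrain a sampled index so that 0 <= index < list_length."""
    return max(0, min(list_length - 1, int(raw_index)))


# Build the cardinality list: five unique values, three to six characters each.
values = []
while len(values) < 5:
    length = rng.randint(3, 6)
    candidate = "".join(rng.choice(chars) for _ in range(length))
    if candidate not in values:
        values.append(candidate)


def next_value():
    """Pick one of the five values uniformly by index in [0, 4]."""
    raw = rng.randint(0, 4)  # assumption: integer uniform draw
    return values[clamp_index(raw, len(values))]


print(values)
print([next_value() for _ in range(10)])
```

Every generated value comes from the same five-element list, which is what bounds the number of unique values for the dimension.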
Dimension list entries include:
Enum dimensions specify the set of all possible dimension values, as well as a distribution for selecting from the set. Enums have the following format:
{
"type": "enum",
"name": "<dimension name>",
"values": [...],
"cardinality_distribution": <distribution descriptor object>,
"percent_missing": <percentage value>,
"percent_nulls": <percentage value>
}
Where:
- name is the name of the dimension
- values is a list of the values
- cardinality_distribution informs the cardinality selection of the generated values
- percent_missing a value in the range of 0.0 and 100.0 (inclusive) indicating the stochastic frequency for omitting this dimension from records (optional - the default value is 0.0 if omitted)
- percent_nulls a value in the range of 0.0 and 100.0 (inclusive) indicating the stochastic frequency for generating null values (optional - the default value is 0.0 if omitted)
String dimension specification entries have the following format:
{
"type": "string",
"name": "<dimension name>",
"length_distribution": <distribution descriptor object>,
"cardinality": <int value>,
"cardinality_distribution": <distribution descriptor object>,
"chars": "<list characters used to build strings>",
"percent_missing": <percentage value>,
"percent_nulls": <percentage value>
}
Where:
- name is the name of the dimension
- length_distribution describes the length of the string values - Some distribution configurations may result in zero-length strings
- cardinality indicates the number of unique values for this dimension (zero for unconstrained cardinality)
- cardinality_distribution informs the cardinality selection of the generated values (omit if cardinality is zero)
- chars (optional) is a list (e.g., "ABC123") of characters that may be used to generate strings - if not specified, all printable characters will be used
- percent_missing a value in the range of 0.0 and 100.0 (inclusive) indicating the stochastic frequency for omitting this dimension from records (optional - the default value is 0.0 if omitted)
- percent_nulls a value in the range of 0.0 and 100.0 (inclusive) indicating the stochastic frequency for generating null values (optional - the default value is 0.0 if omitted)
Counter dimensions are values that increment each time they occur in a record (counters are not incremented when they are missing or null). Counters may be useful for dimensions simulating serial numbers, etc. Counter dimension specification entries have the following format:
{
"type": "counter",
"name": "<dimension name>",
"start": "<counter starting value (optional)>",
"increment": "<counter increment value (optional)>",
"percent_missing": <percentage value>,
"percent_nulls": <percentage value>
}
Where:
- name is the name of the dimension
- start is the initial value of the counter. (optional - the default is 0)
- increment is the amount to increment the value (optional - the default is 1)
- percent_missing a value in the range of 0.0 and 100.0 (inclusive) indicating the stochastic frequency for omitting this dimension from records (optional - the default value is 0.0 if omitted)
- percent_nulls a value in the range of 0.0 and 100.0 (inclusive) indicating the stochastic frequency for generating null values (optional - the default value is 0.0 if omitted)
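The interaction between counters and percent_missing or percent_nulls can be sketched as below. The class name CounterSketch and the MISSING sentinel are hypothetical, and the sketch assumes that the first emitted value equals start; it only illustrates the rule that the counter does not advance when the dimension is missing or null.

```python
import random

MISSING = object()  # hypothetical sentinel: omit the dimension from the record


class CounterSketch:
    """Illustrative counter that advances only when a value is actually emitted."""

    def __init__(self, start=0, increment=1, percent_missing=0.0,
                 percent_nulls=0.0, rng=None):
        self.value = start
        self.increment = increment
        self.percent_missing = percent_missing
        self.percent_nulls = percent_nulls
        self.rng = rng or random.Random()

    def next(self):
        if self.rng.uniform(0, 100) < self.percent_missing:
            return MISSING  # counter is not incremented
        if self.rng.uniform(0, 100) < self.percent_nulls:
            return None     # counter is not incremented
        current = self.value
        self.value += self.increment
        return current


counter = CounterSketch(start=5, increment=2)
print([counter.next() for _ in range(3)])  # [5, 7, 9]
```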
Integer dimension specification entries have the following format:
{
"type": "int",
"name": "<dimension name>",
"distribution": <distribution descriptor object>,
"cardinality": <int value>,
"cardinality_distribution": <distribution descriptor object>,
"percent_missing": <percentage value>,
"percent_nulls": <percentage value>
}
Where:
- name is the name of the dimension
- distribution describes the distribution of values the driver generates (rounded to the nearest int value)
- cardinality indicates the number of unique values for this dimension (zero for unconstrained cardinality)
- cardinality_distribution skews the cardinality selection of the generated values
- percent_missing a value in the range of 0.0 and 100.0 (inclusive) indicating the stochastic frequency for omitting this dimension from records (optional - the default value is 0.0 if omitted)
- percent_nulls a value in the range of 0.0 and 100.0 (inclusive) indicating the stochastic frequency for generating null values (optional - the default value is 0.0 if omitted)
Float dimension specification entries have the following format:
{
"type": "float",
"name": "<dimension name>",
"distribution": <distribution descriptor object>,
"cardinality": <int value>,
"cardinality_distribution": <distribution descriptor object>,
"percent_missing": <percentage value>,
"percent_nulls": <percentage value>,
"precision": <number of digits after decimal>
}
Where:
- name is the name of the dimension
- distribution describes the distribution of float values the driver generates
- cardinality indicates the number of unique values for this dimension (zero for unconstrained cardinality)
- cardinality_distribution skews the cardinality selection of the generated values
- percent_missing a value in the range of 0.0 and 100.0 (inclusive) indicating the stochastic frequency for omitting this dimension from records (optional - the default value is 0.0 if omitted)
- percent_nulls a value in the range of 0.0 and 100.0 (inclusive) indicating the stochastic frequency for generating null values (optional - the default value is 0.0 if omitted)
- precision (optional) the number of digits after the decimal - if omitted all digits are included
Timestamp dimension specification entries have the following format:
{
"type": "timestamp",
"name": "<dimension name>",
"distribution": <distribution descriptor object>,
"cardinality": <int value>,
"cardinality_distribution": <distribution descriptor object>,
"percent_missing": <percentage value>,
"percent_nulls": <percentage value>
}
Where:
- name is the name of the dimension
- distribution describes the distribution of timestamp values the driver generates
- cardinality indicates the number of unique values for this dimension (zero for unconstrained cardinality)
- cardinality_distribution skews the cardinality selection of the generated timestamps (optional - omit for unconstrained cardinality)
- percent_missing a value in the range of 0.0 and 100.0 (inclusive) indicating the stochastic frequency for omitting this dimension from records (optional - the default value is 0.0 if omitted)
- percent_nulls a value in the range of 0.0 and 100.0 (inclusive) indicating the stochastic frequency for generating null values (optional - the default value is 0.0 if omitted)
IP address dimension specification entries have the following format:
{
"type": "ipaddress",
"name": "<dimension name>",
"distribution": <distribution descriptor object>,
"cardinality": <int value>,
"cardinality_distribution": <distribution descriptor object>,
"percent_missing": <percentage value>,
"percent_nulls": <percentage value>
}
Where:
- name is the name of the dimension
- distribution describes the distribution of IP address values the driver generates
- cardinality indicates the number of unique values for this dimension (zero for unconstrained cardinality)
- cardinality_distribution skews the cardinality selection of the generated IP addresses (optional - omit for unconstrained cardinality)
- percent_missing a value in the range of 0.0 and 100.0 (inclusive) indicating the stochastic frequency for omitting this dimension from records (optional - the default value is 0.0 if omitted)
- percent_nulls a value in the range of 0.0 and 100.0 (inclusive) indicating the stochastic frequency for generating null values (optional - the default value is 0.0 if omitted)
Note that the data driver generates IP address values as ints according to the distribution, and then converts the int value to an IP address.
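The int-to-address conversion can be illustrated with Python's standard ipaddress module. The function name int_to_ipv4 is hypothetical, and both the IPv4 assumption and the clamping of out-of-range values are choices made for this sketch, not documented driver behavior.

```python
import ipaddress


def int_to_ipv4(value):
    """Convert an integer sample to a dotted-quad IPv4 string.

    Values outside the 32-bit range are clamped (an assumption of this sketch).
    """
    clamped = max(0, min(2**32 - 1, int(value)))
    return str(ipaddress.IPv4Address(clamped))


print(int_to_ipv4(2130706433))  # 127.0.0.1
print(int_to_ipv4(3232235777))  # 192.168.1.1
```

Under this mapping, a uniform distribution with min 2130706433 and max 2130706436 would produce addresses from 127.0.0.1 through 127.0.0.4.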
Variables are values that may be set by states and have the following format:
{
"type": "variable",
"name": "<dimension name>",
"variable": "<name of variable>"
}
Where:
- name is the name of the dimension
- variable is the name of a variable with a previously set value
Object dimensions create nested data. Object dimension specification entries have the following format:
{
"type": "object",
"name": "<dimension name>",
"cardinality": <int value>,
"cardinality_distribution": <distribution descriptor object>,
"percent_missing": <percentage value>,
"percent_nulls": <percentage value>,
"dimensions": [<list of dimensions nested within the object>]
}
Where:
- name is the name of the object
- cardinality indicates the number of unique values for this dimension (zero for unconstrained cardinality)
- cardinality_distribution skews the cardinality selection of the generated objects (optional - omit for unconstrained cardinality)
- percent_missing a value in the range of 0.0 and 100.0 (inclusive) indicating the stochastic frequency for omitting this dimension from records (optional - the default value is 0.0 if omitted)
- percent_nulls a value in the range of 0.0 and 100.0 (inclusive) indicating the stochastic frequency for generating null values (optional - the default value is 0.0 if omitted)
- dimensions is a list of nested dimensions
List dimensions create lists of dimensions. List dimension specification entries have the following format:
{
"type": "list",
"name": "<dimension name>",
"length_distribution": <distribution descriptor object>,
"selection_distribution": <distribution descriptor object>,
"elements": [<a list of dimension descriptions>],
"cardinality": <int value>,
"cardinality_distribution": <distribution descriptor object>,
"percent_missing": <percentage value>,
"percent_nulls": <percentage value>
}
Where:
- name is the name of the dimension
- length_distribution describes the length of the resulting list as a distribution
- selection_distribution informs the generator which elements to select for the list from the elements list
- elements is a list of possible dimensions the generator may use in the generated list
- cardinality indicates the number of unique values for this dimension (zero for unconstrained cardinality)
- cardinality_distribution skews the cardinality selection of the generated lists (optional - omit for unconstrained cardinality)
- percent_missing a value in the range of 0.0 and 100.0 (inclusive) indicating the stochastic frequency for omitting this dimension from records (optional - the default value is 0.0 if omitted)
- percent_nulls a value in the range of 0.0 and 100.0 (inclusive) indicating the stochastic frequency for generating null values (optional - the default value is 0.0 if omitted)
To summarize how the list fields work together: the generator produces a list whose length is a sample from the length_distribution. The type of each element is selected from the elements list by using an index that is determined by sampling from the selection_distribution. The other field values (e.g., cardinality, percent_nulls, etc.) operate as they do for the other types, but in this case apply to the entire list.
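The following sketch mirrors that procedure: sample a length, then for each position sample an index into the elements list and generate a value of that type. The element generators are hypothetical stand-ins for real dimension descriptions, and clamping the selection index to the list bounds is an assumption of this sketch.

```python
import random

rng = random.Random(7)

# Hypothetical stand-ins for the entries of the elements list.
element_generators = [
    lambda: rng.randint(0, 9),            # an int-like element
    lambda: rng.choice(["a", "b", "c"]),  # a string-like element
]


def generate_list(sample_length, sample_selection):
    """Build one list value from a length sampler and a selection sampler."""
    length = max(0, int(sample_length()))
    result = []
    for _ in range(length):
        # Assumption: out-of-range selections are clamped to valid indices.
        index = max(0, min(len(element_generators) - 1, int(sample_selection())))
        result.append(element_generators[index]())
    return result


items = generate_list(lambda: rng.randint(1, 4), lambda: rng.randint(0, 1))
print(items)
```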
The interarrival object is a distribution descriptor object that describes the inter-arrival times (in seconds) between records that the driver generates.
One can calculate the mean inter-arrival time by dividing a period of time by the number of records to generate during the time period. For example, 100 records per hour has an inter-arrival time of 36 seconds per record (1 hour * 60 minutes/hour * 60 seconds/minute / 100 records).
See the previous section on Distribution descriptor objects for the syntax.
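The inter-arrival arithmetic is easy to check in code. In this sketch the function name mean_interarrival_seconds is hypothetical, and the choice of an exponential distribution for the resulting interarrival object is only an example.

```python
def mean_interarrival_seconds(records, period_seconds):
    """Mean time between records, given a record count over a period."""
    return period_seconds / records


# 100 records per hour: 1 hour * 60 minutes/hour * 60 seconds/minute / 100 records.
mean = mean_interarrival_seconds(records=100, period_seconds=1 * 60 * 60)
print(mean)  # 36.0

# Example interarrival object using that mean.
interarrival = {"type": "exponential", "mean": mean}
print(interarrival)
```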
The states list is a list of state objects. These state objects describe each of the states in a probabilistic state machine. The first state in the list is the initial state in the state machine, or in other words, the state that the machine enters initially.
State objects have the following form:
{
"name": <state name>,
"emitter": <emitter name>,
"variables": [...],
"delay": <distribution descriptor object>,
"transitions": [...]
}
Where:
- name is the name of the state
- emitter is the name of the emitter this state will use to generate a record
- variables (optional) is a list of dimension objects where the dimension name is used as the variable name
- delay is a distribution descriptor object indicating how long (in seconds) to remain in the state before transitioning
- transitions is a list of transition objects specifying possible transitions from this state to the next state
Transition objects describe potential state transitions from the current state to the next state. These objects have the following form:
{
"next": <state name>,
"probability": <probability value>
}
Where:
- next is the name of the next state
- probability is a value greater than zero and less than or equal to one.
Note that the probabilities of all entries in the transitions list must add up to one. Use:
"next": "stop",
to transition to a terminal state.
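The transition rules can be sketched as follows: validate that each probability is in (0, 1] and that the probabilities sum to one, then pick the next state at random with those weights. The function names are hypothetical, the sketch ignores emitters, variables and delays, and it is not the driver's implementation.

```python
import math
import random


def validate_transitions(transitions):
    """Check each probability is in (0, 1] and that they sum to one."""
    for t in transitions:
        if not 0 < t["probability"] <= 1:
            raise ValueError(f"bad probability for {t['next']!r}")
    total = sum(t["probability"] for t in transitions)
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"probabilities sum to {total}, expected 1")


def choose_next(transitions, rng=random):
    """Select the name of the next state according to the probabilities."""
    names = [t["next"] for t in transitions]
    weights = [t["probability"] for t in transitions]
    return rng.choices(names, weights=weights, k=1)[0]


def walk(states, rng=random, max_steps=100):
    """Visit states starting from the first one until 'stop' or max_steps."""
    by_name = {s["name"]: s for s in states}
    current = states[0]["name"]  # the first state in the list is the initial state
    visited = []
    while current != "stop" and len(visited) < max_steps:
        visited.append(current)
        current = choose_next(by_name[current]["transitions"], rng)
    return visited


states = [
    {"name": "state_1", "transitions": [
        {"next": "state_2", "probability": 0.5},
        {"next": "stop", "probability": 0.5},
    ]},
    {"name": "state_2", "transitions": [{"next": "stop", "probability": 1.0}]},
]
for s in states:
    validate_transitions(s["transitions"])
print(walk(states, random.Random(3)))
```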