Because everybody loves test data.
es_test_data.py lets you generate and upload randomized test data to your ES cluster so you can start running queries, see what performance is like, and verify that your cluster can handle the load.
You can easily configure what the test documents look like: which data types they include and what the fields are named.
Let's assume you have an Elasticsearch cluster running. If not, set it up locally and point your browser to http://localhost:9200 to see if it's up.
The script uses Python with Tornado and NumPy. Run pip install tornado numpy to install Tornado and NumPy if you don't have them already.
It's as simple as this:
$ python es_test_data.py --es_url=http://localhost:9200
[I 150604 15:43:19 es_test_data:42] Trying to create index http://localhost:9200/test_data
[I 150604 15:43:19 es_test_data:47] Guess the index exists already
[I 150604 15:43:19 es_test_data:184] Generating 10000 docs, upload batch size is 1000
[I 150604 15:43:19 es_test_data:62] Upload: OK - upload took: 25ms, total docs uploaded: 1000
[I 150604 15:43:20 es_test_data:62] Upload: OK - upload took: 25ms, total docs uploaded: 2000
[I 150604 15:43:20 es_test_data:62] Upload: OK - upload took: 19ms, total docs uploaded: 3000
[I 150604 15:43:20 es_test_data:62] Upload: OK - upload took: 18ms, total docs uploaded: 4000
[I 150604 15:43:20 es_test_data:62] Upload: OK - upload took: 27ms, total docs uploaded: 5000
[I 150604 15:43:20 es_test_data:62] Upload: OK - upload took: 19ms, total docs uploaded: 6000
[I 150604 15:43:20 es_test_data:62] Upload: OK - upload took: 15ms, total docs uploaded: 7000
[I 150604 15:43:20 es_test_data:62] Upload: OK - upload took: 24ms, total docs uploaded: 8000
[I 150604 15:43:20 es_test_data:62] Upload: OK - upload took: 32ms, total docs uploaded: 9000
[I 150604 15:43:20 es_test_data:62] Upload: OK - upload took: 31ms, total docs uploaded: 10000
[I 150604 15:43:20 es_test_data:216] Done - total docs uploaded: 10000, took 1 seconds
[I 150604 15:43:20 es_test_data:217] Bulk upload average: 23 ms
[I 150604 15:43:20 es_test_data:218] Bulk upload median: 24 ms
[I 150604 15:43:20 es_test_data:219] Bulk upload 95th percentile: 31 ms
$
Without any command line options, it will generate and upload 1000 documents of the format
{
"name":<<str>>,
"age":<<int>>,
"last_updated":<<ts>>
}
to an Elasticsearch cluster at http://localhost:9200, into an index called test_data.
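To make the default format concrete, here is a minimal sketch of how one such document could be produced. It is not the script's own code: the helper names (random_str, random_ts, default_doc) are hypothetical, and the value ranges are taken from the field type defaults described further down (3 to 10 characters for str, 0 to 100000 for int, +/- 30 days for ts).

```python
import random
import string
import time

def random_str(min_len=3, max_len=10):
    """Random word of upper/lowercase letters and digits (hypothetical helper)."""
    alphabet = string.ascii_letters + string.digits
    length = random.randint(min_len, max_len)
    return "".join(random.choice(alphabet) for _ in range(length))

def random_ts(min_days=30, max_days=30):
    """Millisecond timestamp between now - min_days and now + max_days."""
    now = int(time.time())
    day = 24 * 60 * 60
    return random.randint(now - min_days * day, now + max_days * day) * 1000

def default_doc():
    """One document in the default name:str,age:int,last_updated:ts format."""
    return {
        "name": random_str(),
        "age": random.randint(0, 100000),
        "last_updated": random_ts(),
    }

doc = default_doc()
print(doc)
```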
python es_test_data.py --help gives you the full set of command line options. Here are the most important ones:
- --es_url=http://localhost:9200
  the base URL of your ES node, don't include the index name
- --count=###
  number of documents to generate and upload
- --index_name=test_data
  the name of the index to upload the data to. If it doesn't exist it'll be created with these options:
  - --num_of_shards=2
    the number of shards for the index
  - --num_of_replicas=0
    the number of replicas for the index
- --batch_size=###
  we use bulk upload to send the docs to ES, this option controls how many we send at a time
- --force_init_index=False
  if True it will delete and re-create the index
- --dict_file=filename.dic
  if provided, the dict data type will use words from the dictionary file. The format is one word per line. The entire file is loaded at start-up, so be careful with (very) large files. You can download wordlists, e.g. from here.
- --cities_file=filename.csv
  if provided, the cities will be loaded from the CSV file. Default is worldcities.csv, which can be downloaded from here.
- --num_of_cities
  if provided, sets the number of cities to use when generating city points. Default is to use all cities loaded via --cities_file.
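As a rough illustration of what --batch_size controls, the sketch below splits a list of documents into batches and builds a newline-delimited request body of the kind the Elasticsearch bulk endpoint accepts. This is an assumption about the general shape, not the script's actual code: the helper names (chunks, bulk_body) are hypothetical, and the exact action metadata (for example whether a type is required) depends on your ES version.

```python
import json

def chunks(docs, batch_size):
    """Yield consecutive slices of at most batch_size documents."""
    for start in range(0, len(docs), batch_size):
        yield docs[start:start + batch_size]

def bulk_body(docs, index_name):
    """Build an NDJSON body: one action line, then one source line per doc."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    # The bulk endpoint expects the body to end with a newline.
    return "\n".join(lines) + "\n"

docs = [{"name": "doc%d" % i, "age": i} for i in range(5)]
for batch in chunks(docs, batch_size=2):
    body = bulk_body(batch, "test_data")
    print(len(batch), "docs,", len(body), "bytes")
```

Larger batches mean fewer HTTP requests but bigger request bodies, so the best value depends on document size and cluster capacity.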
Glad you're asking, let's get to the doc format.
The doc format is configured via --format=<<FORMAT>>, with the default being name:str,age:int,last_updated:ts.
The general syntax looks like this:
<<field_name>>:<<field_type>>,<<field_name>>:<<field_type>>, ...
For every document, es_test_data.py will generate random values for each of the fields configured.
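The format syntax can be read as a comma-separated list of fields, each a colon-separated name, type, and optional arguments. The sketch below shows one way such a string could be split up; parse_format is a hypothetical helper written for illustration, not the parser the script uses.

```python
def parse_format(fmt):
    """Split a format string into (field_name, field_type, args) tuples."""
    fields = []
    for item in fmt.split(","):
        # The first two parts are name and type, anything after is arguments.
        name, field_type, *args = item.strip().split(":")
        fields.append((name, field_type, args))
    return fields

for field in parse_format("name:str,age:int:0:120,last_updated:ts:30:0"):
    print(field)
```

The arguments are kept as strings here; converting them to numbers would be up to each field type.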
Currently supported field types are:
- bool
  returns a random true or false
- ts:min_days:max_days
  a timestamp (in milliseconds), randomly picked between now - min_days and now + max_days. Defaults to +/- 30. If you want only days in the past, put zero for max_days (e.g. ts:30:0)
- ts_series:min_days:max_days:delta:interval
  a timestamp (in milliseconds) starting between now - min_days and now + max_days, incremented by delta ms for each data point. Starts over after interval number of points
- ipv4
  returns a random IPv4 address
- tstxt
  a timestamp in the "%Y-%m-%dT%H:%M:%S.000-0000" format, randomly picked between now +/- 30 days
- int:min:max
  a random integer between min and max. If min and max are not provided they default to 0 and 100000
- str:min:max
  a word (as in, a string), made up of min to max random upper/lowercase and digit characters. min and max are optional, defaulting to 3 and 10
- str_series:min:max:interval
  a word (as in, a string), made up of min to max random upper/lowercase and digit characters. min and max are optional, defaulting to 3 and 10. Stays the same for interval number of documents
- keyword:min:max
  a word (as in, a string), made up of min to max random upper/lowercase and digit characters. min and max are optional, defaulting to 3 and 10
- words:min:max
  a random number of strs, separated by space. min and max are optional, defaulting to 2 and 10
- dict:min:max
  a random number of entries from the dictionary file, separated by space. min and max are optional, defaulting to 2 and 10
- text:words:min:max
  a random number of words, separated by space, from a given list of - separated words. The words are optional, defaulting to text1, text2 and text3. min and max are optional, defaulting to 1 and 1
- geo_point:min_lat:max_lat:min_lon:max_lon
  a random geopoint between min_lat, max_lat, min_lon, and max_lon. If the min and max values are not provided they default to the entire world
- ellipse:major_mean:minor_mean:major_std:minor_std:num_points
  a random ellipse of random size and tilt at a random location, based on the mean and standard deviation provided. The ellipse is drawn as a polygon with num_points vertices
- cities:min_rad:max_rad
  returns a random geopoint between min_rad and max_rad meters from a randomly chosen city loaded via --cities_file
- cities_path_series:length:min_rad:max_rad:heading_std:speed_start:speed_std:interval:interval_std
  creates a series of geo_points of length points, starting at a random geopoint between min_rad and max_rad meters from a randomly chosen city loaded via --cities_file. The path starts at a random heading that varies with heading_std and has a starting speed of speed_start (m/s) that varies with speed_std. A new point is created every interval seconds, varying with interval_std
- ellipse_cities:major_mean:minor_mean:major_std:minor_std:num_points:sigma_degrees
  a random ellipse of random size and tilt near a random city, based on the mean and standard deviation provided. The ellipse is drawn as a polygon with num_points vertices. Centers are normally distributed around the city center with a standard deviation of sigma_degrees
- path:num_points:heading_std:speed_start:speed_std
  creates a path num_points long that starts at a random point on a random heading. It changes heading based on a normal distribution with heading_std as the standard deviation. It starts at speed_start (m/s) and changes speed based on a normal distribution with speed_std as the standard deviation
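The series types differ from the plain ones in that they keep state between documents. As a sketch of how ts_series could behave, the generator below picks a random start, steps by delta ms, and starts over after interval points. Several details are assumptions rather than documented behavior: the default values for delta and interval are made up, "starts over" is interpreted as picking a fresh random start, and the function is not the script's own implementation.

```python
import random
import time
from itertools import islice

DAY_MS = 24 * 60 * 60 * 1000

def ts_series(min_days=30, max_days=30, delta=1000, interval=10):
    """Yield millisecond timestamps in runs of `interval` points.

    Assumption: each run begins at a new random start between
    now - min_days and now + max_days, then advances by `delta` ms.
    """
    while True:
        now_ms = int(time.time() * 1000)
        start = random.randint(now_ms - min_days * DAY_MS,
                               now_ms + max_days * DAY_MS)
        for step in range(interval):
            yield start + step * delta

# Two runs of four points each, 500 ms apart within a run.
points = list(islice(ts_series(1, 1, delta=500, interval=4), 8))
print(points)
```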
- document the remaining cmd line options
- more different format types
- ...
All suggestions, comments, ideas, pull requests are welcome!