Skip to content
Charlie Martin edited this page Feb 14, 2017 · 19 revisions

InfluxDB is a DBMS optimized for time series applications. It is part of a suite of tools from influxdata. In their words, it is "...built from the ground up to handle high write and query loads."

Baron Schwartz has a good blog entry which discusses time-series database offerings, with a focus on InfluxDB.

Concepts

Carefully read the influxdb key concepts. It's very helpful.

  • Performance is achieved by implementing influxdb on a non-relational database. However, it has a query language that is very much in the flavor of SQL. They also provide a command interface that is useful for experimentation and administration.
  • Many of the administrative features of standard RDMS are included, such as user accounts, multiple databases, etc. If you find you need a certain administrative capability, it is probably there.
  • By design, the system is not efficient for arbitrary data updates and deletes. This is in keeping with its target as a time series application.
  • Retention policy is a key capability. Data older than the retention period will be automatically aged off. This is in keeping with the CHORDS philosophy, and should be very handy.
  • Data points are saved in a structure called a measurement, which feels very similar to an SQL table. (Comment: the similarity is stated by influxdb. Measurement does not seem to be a very appropriate name; it implies an individual observation rather than a table, But that's the way it is).
  • The measurement table contains the points, which are keyed by time, tags and fields. Time and tags are indexed, making it very efficient in search queries. Fields are not indexed. The collection of measurements, tags and fields could be considered the schema for your database.
  • A combination of the tag values is known as a series, and the number of unique series is the cardinality of the measurement. (Again, series seems like an odd identifier for this concept.)
  • The cardinality determines the database performances, and thus it pays to tailor this appropriately for the application. According to the docs, cardinalities of less than 100k are quite manageable on low end servers. For instance, this would support a schema of three tags, each with 464 distinct values (464^3 == 100k).
  • The influx program provides a command-line interface to the database server.

Shards and Retention Policies

A database is broken into storage chunks called shards. A shard is assigned a data retention policy (age), and once a shard is older than that period, it is dropped. A shard also has an assigned time period, which defaults to 7 days.

If not specified at database creation, a retention policy named autogen is created and assigned. autogen has an infinite duration. Databases can have multiple retention policies, with one of them tagged as the default.

When data is inserted, you can specify which retention policy to apply. If none is specified, the default policy is used. The chosen policy defines which shard the sample will go to. Likewise, when selecting data, if you don't specify a retention policy, data is fetched from the default one.

Data that are in a non-default policy can only be retrieved by specifying the policy name in the select statement.

Use the CREATE RETENTION POLICY statement to create a new retention policy. Use ALTER RETENTION POLICY "policy_name" DURATION <time> to change the length of an existing policy. Use ALTER RETENTION POLICY "policy_name" DEFAULT to select which policy is the default.

It seems that multiple retention policies for a single database could lead to huge confusion. For example, creating a new default policy will instantly make your existing data invisible to default queries. (Just use the ALTER RETENTION POLICY "old_policy_name" DEFAULT statement to return to the old policy, and your data will magically reappear.

If you start out with the infinite autogen policy, it seems that simply changing its retention period (with ALTER) is a rational approach for enabling auto-pruning of the database.

Database maintenance occurs approximately every 10 minutes. If you decrease the retention period, overage shards will be dropped at the next maintenance interval.

There are many useful commands available with influx for examining the retention and shard characteristics, e.g.:

Connected to http://localhost:8086 version 1.2.0
InfluxDB shell version: v1.1.1
> show shard groups
name: shard groups
id	database		retention_policy	start_time		end_time		expiry_time
--	--------		----------------	----------		--------		-----------
1	_internal		monitor			2017-02-07T00:00:00Z	2017-02-08T00:00:00Z	2017-02-15T00:00:00Z
...
20	chords_ts_production	autogen			2016-07-18T00:00:00Z	2016-07-25T00:00:00Z	2017-02-20T00:00:00Z
21	chords_ts_production	autogen			2016-07-25T00:00:00Z	2016-08-01T00:00:00Z	2017-02-27T00:00:00Z
...
60	chords_ts_production	autogen			2017-02-13T00:00:00Z	2017-02-20T00:00:00Z	2017-09-18T00:00:00Z

> show retention policies
name		duration	shardGroupDuration	replicaN	default
----		--------	------------------	--------	-------
autogen		5040h0m0s	168h0m0s		1		true
50_weeks	8400h0m0s	168h0m0s		1		false

A Schema for CHORDS

Retain and use the ActiveRecord schema for all models except Measurement.

Create an influxdb schema as:

Tag Field # of Unique CHORDS Active Record
site 10 Site.id
instrument 20 Instrument.id
var 200 Var.id 10 variables per instrument
test 2 Measurement.test
value N/A Measurement.value

This schema, with a generous number of unique values, results in a cardinality of 10x20x200 = 40,000.

The cardinality could be reduce by a factor of the number of instruments, if the variable id was repeated between instruments. Should we add an "inst_var_id" id to the rails schema? It would be an integer which runs from 1 to the number of variables for a given instrument. Then the number of unique var tags would be equal to the maximum number of variables for all instruments.

In the current Active Record implementation, the variable id is unique across all instruments and sites. This might be taken advantage of in order to reduce the cardinality.

Influxdata provides guidance for hardware sizing. Note that their minimum recommendation calls for at least 2 GB ram / 2 cpu cores. AWS t2.micro only provides 1GB/1core, and AWS t2.small provides 2GB/1core.

influxdb-ruby

influxdata/influxdb-ruby provides a ruby binding for InfluxDB. The InfluxDB::Client is the way that you access the database. A significant number of hashed values are provided to the constructor in order to establish the database configuration. These are described mostly in the InfluxDB::Config class.

influxer

palkan/influxer provides an ActiveRecord style interface to InfluxDB. It is built on influxdb-ruby. A gem is provided.

Queries return an Influxer::Relation object, which seems to be similar to ActiveRecord query results. It holds a collection of points. Convert it to an array (.to_a) in order to iterate through the points.

Administration

The database server provides a web interface that is great for database maintenance and query testing. It is available on port 8083. The CHORDS docker-compose exports this port from the chords_influxdb container. However, it is not currently it is not exported by the AWS CloudFormation template (bin/cloudformation/docker.template).

We are not currently making use of the authentication mechanisms that are supported. This will need to be implemented.

Development Status

Reported in an email 22 Nov 2016:

I've got the CHORDS influxdb prototype to the point where we can do some side by side 
performance testing. Most CHORDS features are available, but the following are 
notably absent:

- table relationships, I.e if you change an instrument, variable or site identifier, 
  these changes are not propagated to the time series database.
- "data trimming" is not implemented.
- the dashboard storage summary is not implemented.
- the json download format needs to be updated.

We now have docker hub images tagged as :mysql and :influxdb, reflecting the 
docker and influxdb branches in github.

I updated the cloud formation template so that you can select either branch when 
bringing up an AWS CHORDS system. 

I have two t2.micro instances running on Amazon:
MySQL image: http://<redacted>.compute.amazonaws.com
InfluxDB image: http://<redacted>.us-west-2.compute.amazonaws.com

I'm running the simulator to add measurements to the databases. We can compare 
performance and usability as the databases grow. My impression during development 
is that the dashboard generation and data download creation are significantly faster, 
but we need a true dual test to validate this.
Clone this wiki locally