InfluxDB Notes
InfluxDB is a DBMS optimized for time series applications. It is part of a suite of tools from InfluxData. In their words, it is "...built from the ground up to handle high write and query loads."
Baron Schwartz has a good blog entry which discusses time-series database offerings, with a focus on InfluxDB.
Carefully read the InfluxDB key concepts documentation; it's very helpful.
- Performance is achieved by implementing InfluxDB on a non-relational storage engine. However, it has a query language (InfluxQL) that is very much in the flavor of SQL. They also provide a command-line interface that is useful for experimentation and administration.
- Many of the administrative features of a standard RDBMS are included, such as user accounts, multiple databases, etc. If you find you need a certain administrative capability, it is probably there.
- By design, the system is not efficient for arbitrary data updates and deletes. This is in keeping with its target as a time series application.
- Retention policy is a key capability. Data older than the retention period will be automatically aged off. This is in keeping with the CHORDS philosophy, and should be very handy.
- Data points are saved in a structure called a measurement, which feels very similar to an SQL table. (Comment: the similarity is stated by InfluxDB. Measurement does not seem to be a very appropriate name; it implies an individual observation rather than a table, but that's the way it is.)
- The measurement contains the points, which are keyed by time and tags, and carry fields. Time and tags are indexed, making search queries very efficient. Fields are not indexed. The collection of measurements, tags and fields could be considered the schema for your database.
- A combination of the tag values is known as a series, and the number of unique series is the cardinality of the measurement. (Again, series seems like an odd identifier for this concept.)
- The cardinality determines the database performance, and thus it pays to tailor this appropriately for the application. According to the docs, cardinalities of less than 100k are quite manageable on low-end servers. For instance, this would support a schema of three tags, each with about 46 distinct values (46^3 ≈ 100k).
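The tag-budget arithmetic above is easy to check. A small sketch (plain arithmetic, no InfluxDB assumptions): with three tags and a series budget of 100k, each tag can have roughly the cube root of 100,000 distinct values.

```ruby
# Series cardinality is the product of the distinct values of each tag.
# For a budget of 100k series spread evenly over three tags, each tag
# can have about the cube root of 100,000 distinct values.
budget  = 100_000
per_tag = Integer(budget ** (1.0 / 3))  # truncates 46.4... to 46
series  = per_tag ** 3                  # 46^3 = 97_336, just under budget
puts "#{per_tag} values per tag -> #{series} series"
```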
- The `influx` program provides a command-line interface to the database server.
A database is broken into storage chunks called shards. A shard is assigned a data retention policy (age), and once a shard is older than that period, it is dropped. A shard also has an assigned time period, which defaults to 7 days.
If not specified at database creation, a retention policy named autogen is created and assigned. autogen has an infinite duration. Databases can have multiple retention policies, with one of them tagged as the default.
When data is inserted, you can specify which retention policy to apply. If none is specified, the default policy is used. The chosen policy defines which shard the sample will go to. Likewise, when selecting data, if you don't specify a retention policy, data is fetched from the default one.
Data that are in a non-default policy can only be retrieved by specifying the policy name in the select statement.
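For example, a fully qualified select (the measurement name `tsdata` is hypothetical; the database and policy names match the examples later on this page):

```sql
-- Qualify as database.retention_policy.measurement to read from a
-- non-default retention policy:
SELECT * FROM "chords_ts_production"."50_weeks"."tsdata"
  WHERE time > now() - 30d
```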
- Use the `CREATE RETENTION POLICY` statement to create a new retention policy.
- Use `ALTER RETENTION POLICY "policy_name" DURATION <time>` to change the length of an existing policy.
- Use `ALTER RETENTION POLICY "policy_name" DEFAULT` to select which policy is the default.
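Concretely, a sketch of these statements (names and durations are taken from the `show retention policies` example later on this page; note that the full 1.x syntax also requires an `ON <database>` clause and, for `CREATE`, a `REPLICATION` factor, which is 1 on a single node):

```sql
-- Create a 50-week (8400h) policy on the production database:
CREATE RETENTION POLICY "50_weeks" ON "chords_ts_production"
  DURATION 8400h REPLICATION 1

-- Make it the default policy:
ALTER RETENTION POLICY "50_weeks" ON "chords_ts_production" DEFAULT
```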
It seems that multiple retention policies for a single database could lead to huge confusion. For example, creating a new default policy will instantly make your existing data invisible to default queries. (Just use the `ALTER RETENTION POLICY "old_policy_name" DEFAULT` statement to return to the old policy, and your data will magically reappear.)
If you start out with the infinite autogen policy, it seems that simply changing its retention period (with ALTER) is a rational approach for enabling auto-pruning of the database.
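A sketch of that approach (the 5040h duration, i.e. 30 weeks, matches the autogen policy shown in the `show retention policies` output below; the database name matches the other examples on this page):

```sql
-- Give the infinite autogen policy a finite duration so old shards
-- are pruned at the next maintenance pass:
ALTER RETENTION POLICY "autogen" ON "chords_ts_production" DURATION 5040h
```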
Database maintenance occurs approximately every 10 minutes. If you decrease the retention period, shards that fall outside the new period will be dropped at the next maintenance interval.
There are many useful commands available with `influx` for examining the retention and shard characteristics, e.g.:
```
Connected to http://localhost:8086 version 1.2.0
InfluxDB shell version: v1.1.1
> show shard groups
name: shard groups
id  database              retention_policy  start_time            end_time              expiry_time
--  --------              ----------------  ----------            --------              -----------
1   _internal             monitor           2017-02-07T00:00:00Z  2017-02-08T00:00:00Z  2017-02-15T00:00:00Z
...
20  chords_ts_production  autogen           2016-07-18T00:00:00Z  2016-07-25T00:00:00Z  2017-02-20T00:00:00Z
21  chords_ts_production  autogen           2016-07-25T00:00:00Z  2016-08-01T00:00:00Z  2017-02-27T00:00:00Z
...
60  chords_ts_production  autogen           2017-02-13T00:00:00Z  2017-02-20T00:00:00Z  2017-09-18T00:00:00Z
> show retention policies
name      duration   shardGroupDuration  replicaN  default
----      --------   ------------------  --------  -------
autogen   5040h0m0s  168h0m0s            1         true
50_weeks  8400h0m0s  168h0m0s            1         false
```
Retain and use the ActiveRecord schema for all models except Measurement.
Create an InfluxDB schema as:

| Tag | Field | # of Unique | CHORDS Active Record | Notes |
|---|---|---|---|---|
| site | | 10 | Site.id | |
| instrument | | 20 | Instrument.id | |
| var | | 200 | Var.id | 10 variables per instrument |
| test | | 2 | Measurement.test | |
| | value | N/A | Measurement.value | |
This schema, with a generous number of unique values, results in a cardinality of 10x20x200 = 40,000.
The cardinality could be reduced by a factor of the number of instruments if the variable id were repeated between instruments. Should we add an "inst_var_id" to the rails schema? It would be an integer which runs from 1 to the number of variables for a given instrument. Then the number of unique var tags would be equal to the maximum number of variables over all instruments.
In the current Active Record implementation, the variable id is unique across all instruments and sites. This might be taken advantage of in order to reduce the cardinality.
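The arithmetic behind the proposed reduction, assuming the counts from the schema table above (10 sites, 20 instruments, 200 variables, i.e. 10 per instrument):

```ruby
sites, instruments, vars_total = 10, 20, 200
vars_per_instrument = vars_total / instruments        # 200 / 20 = 10

# Globally unique var ids: every (site, instrument, var) combination
# is a distinct series.
current  = sites * instruments * vars_total           # 40_000

# With an inst_var_id that restarts at 1 for each instrument, the var
# tag needs only vars_per_instrument distinct values.
proposed = sites * instruments * vars_per_instrument  # 2_000
puts "#{current} -> #{proposed} series"
```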
InfluxData provides guidance for hardware sizing. Note that their minimum recommendation calls for at least 2 GB RAM / 2 CPU cores. AWS t2.micro only provides 1 GB/1 core, and AWS t2.small provides 2 GB/1 core.
influxdata/influxdb-ruby provides a ruby binding for InfluxDB. The InfluxDB::Client class is the way that you access the database. A significant number of options are passed as a hash to the constructor in order to establish the database configuration. These are mostly described in the InfluxDB::Config class.
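A minimal client sketch, assuming the influxdb gem and a server on localhost; the host, credentials, and measurement name (`tsdata`) are illustrative, not actual CHORDS settings:

```ruby
require "influxdb"

# Connection options are passed as a hash to the constructor
# (see InfluxDB::Config for the full set).
client = InfluxDB::Client.new "chords_ts_production",
  host:     "localhost",
  port:     8086,
  username: "chords",
  password: "secret"

# Write one point: tags are indexed, values (fields) are not.
client.write_point "tsdata",
  tags:   { site: 1, instrument: 2, var: 3 },
  values: { value: 42.0 }
```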
palkan/influxer provides an ActiveRecord style interface to InfluxDB. It is built on influxdb-ruby. A gem is provided.
Queries return an Influxer::Relation object, which seems to be similar to ActiveRecord query results. It holds a collection of points. Convert it to an array (.to_a) in order to iterate through the points.
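A sketch of the influxer style, based on its README conventions (the metric class, series name, and attributes here are hypothetical, not the CHORDS models):

```ruby
# A metrics class plays the role of an ActiveRecord model.
class TsdataMetrics < Influxer::Metrics
  set_series :tsdata
  tags :site, :instrument, :var
  attributes :value
end

# Queries return an Influxer::Relation; convert to an array to iterate.
points = TsdataMetrics.where(site: 1).to_a
points.each { |pt| puts pt }
```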
The database server provides a web interface that is great for database maintenance and query testing. It is available on port 8083. The CHORDS docker-compose exports this port from the chords_influxdb container. However, it is not currently exported by the AWS CloudFormation template (bin/cloudformation/docker.template).
We are not currently making use of the authentication mechanisms that are supported. This will need to be implemented.
Reported in an email 22 Nov 2016:
I've got the CHORDS influxdb prototype to the point where we can do some side by side
performance testing. Most CHORDS features are available, but the following are
notably absent:
- table relationships, i.e. if you change an instrument, variable or site identifier,
these changes are not propagated to the time series database.
- "data trimming" is not implemented.
- the dashboard storage summary is not implemented.
- the json download format needs to be updated.
We now have docker hub images tagged as :mysql and :influxdb, reflecting the
docker and influxdb branches in github.
I updated the cloud formation template so that you can select either branch when
bringing up an AWS CHORDS system.
I have two t2.micro instances running on Amazon:
MySQL image: http://<redacted>.compute.amazonaws.com
InfluxDB image: http://<redacted>.us-west-2.compute.amazonaws.com
I'm running the simulator to add measurements to the databases. We can compare
performance and usability as the databases grow. My impression during development
is that the dashboard generation and data download creation are significantly faster,
but we need a true dual test to validate this.