[0.9.5-nightly] GROUP BY time() not handled correctly from mean() query on distributed cluster data (with nasty side effects) #4331

CVTJNII · 2015-10-05T21:50:07Z

In troubleshooting the following Grafana generated query:

SELECT mean("value") AS "value" FROM "Test Data" WHERE "host" = 'Datagen 0' AND time > now() - 15m GROUP BY time(1s)

I have found that, on sharded data, the time field returned from mean isn't handled correctly if the data is not on the host running the query.

I'm currently testing in a 5 node cluster, version 0.9.5-nightly-b079d20, branch master, commit b079d20, built '2015-09-28T04:00:39+0000', with the following replication policy:

> show retention policies on riemannDataGen
name    duration        replicaN        default
default 168h0m0s        2               true

This is causing the data to shard across 4/5 of the servers. Sharding is visible with 'show series' (likely part of the bug or a side effect, as I'd expect 'show series' to show all series):

On node 1:

> show shards
name: riemannDataGen
--------------------
id      start_time              end_time                expiry_time             owners
2       2015-10-05T00:00:00Z    2015-10-06T00:00:00Z    2015-10-13T00:00:00Z    3,4
3       2015-10-05T00:00:00Z    2015-10-06T00:00:00Z    2015-10-13T00:00:00Z    5,1

> show series
name: Test Data
---------------
_key                            host
Test\ Data,host=Datagen\ 1      Datagen 1
Test\ Data,host=Datagen\ 3      Datagen 3
Test\ Data,host=Datagen\ 5      Datagen 5
Test\ Data,host=Datagen\ 7      Datagen 7
Test\ Data,host=Datagen\ 9      Datagen 9

On node 3:

> show shards
name: riemannDataGen
--------------------
id      start_time              end_time                expiry_time             owners
2       2015-10-05T00:00:00Z    2015-10-06T00:00:00Z    2015-10-13T00:00:00Z    3,4
3       2015-10-05T00:00:00Z    2015-10-06T00:00:00Z    2015-10-13T00:00:00Z    5,1

> show series
name: Test Data
---------------
_key                            host
Test\ Data,host=Datagen\ 0      Datagen 0
Test\ Data,host=Datagen\ 2      Datagen 2
Test\ Data,host=Datagen\ 4      Datagen 4
Test\ Data,host=Datagen\ 6      Datagen 6
Test\ Data,host=Datagen\ 8      Datagen 8

The problem I'm seeing is that, on node 1, the select mean query does not handle time correctly which breaks the group by clause:

On node 1:

> SELECT mean("value") AS "value" FROM "Test Data" WHERE "host" = 'Datagen 0' AND time > now() - 15m
name: Test Data
---------------
time    value
0       57.01889591527808

> SELECT mean("value") AS "value" FROM "Test Data" WHERE "host" = 'Datagen 0' AND time > now() - 15m GROUP BY time(1s)
ERR: invalid character 'j' looking for beginning of value

On node 3 it works (limit clause added to truncate output):

> SELECT mean("value") AS "value" FROM "Test Data" WHERE "host" = 'Datagen 0' AND time > now() - 15m
name: Test Data
---------------
time                    value
1444080582551051963     56.90043095744928

> SELECT mean("value") AS "value" FROM "Test Data" WHERE "host" = 'Datagen 0' AND time > now() - 15m GROUP BY time(1s) LIMIT 2
name: Test Data
---------------
time                    value
1444080584000000000     
1444080585000000000     53.193519424975506

>

On node 1 this also breaks the JSON response in the API:

In [99]: r = requests.get(("http://tminflux01.vagrant.tjnii.local:8086/query?%s" % urllib.urlencode({'db': 'riemannDataGen', 'epoch': 'ms', 'q': 'SELECT mean("value") AS "value" FROM "Test Data" WHERE "host" = \'Datagen 0\' AND time > now() - 15m GROUP BY time(1s)'})), auth=('DataGenreader', password))

In [100]: r.text
Out[100]: u'json: error calling MarshalJSON for type httpd.Response: json: error calling MarshalJSON for type *influxql.Result: json: unsupported value: NaN'

In [101]: r.json()
<Snip trace>
ValueError: No JSON object could be decoded

This has the effect of breaking my Grafana dashboards on sharded data as Grafana combines queries across all shards into a single request. As such the response for the combined query is the MarshalJSON error from above and none of the dashboards work.

The text was updated successfully, but these errors were encountered:

beckettsean · 2015-10-06T00:02:55Z

the time field returned from mean isn't handled correctly if the data is not on the host running the query.

Not sure I quite follow this. I can see from your examples that without the GROUP BY time() the query succeeds, and with it it fails. How does that relate to "the time field returned from mean"? The epoch 0 timestamp is proper behavior for queries that have an aggregation but lack a GROUP BY time() statement.

ERR: invalid character 'j' looking for beginning of value

This is an error we see when there are NaNs in the source data: #4104. Is there a possibility of not-a-number values in your source?

Can you paste in the results of SHOW SERVERS?

CVTJNII · 2015-10-06T02:42:29Z

Can you paste in the results of SHOW SERVERS?

> show servers
id      cluster_addr            raft
1       192.168.253.137:8088    true
2       192.168.253.140:8088    true
3       192.168.253.141:8088    true
4       192.168.253.143:8088    false
5       192.168.253.142:8088    false

Not sure I quite follow this. I can see from your examples that without the GROUP BY time() the query succeeds, and with it it fails. How does that relate to "the time field returned from mean"? The epoch 0 timestamp is proper behavior for queries that have an aggregation but lack a GROUP BY time() statement.

So while troubleshooting I noticed that the time field returns 0 on the node without the shards but does return a time on the node with the shards. As I'd expect the query to return the same regardless of shard location I assumed that was root cause and stopped. As for the exact semantics of what that field should be, I'm going to admit I'm a but out of my depth as I'm troubleshooting a query built by Grafana, I'm still coming up to speed on InfluxDB query language. So if that part of the bug report is wrong or misleading I apologize, I tried to describe it as well as I could. I'll be happy to edit it.

As for my dataset containing NaNs, it shouldn't. This dataset is being fed from a test data generator which is returning floats 0<= f <= 100. The configuration works if InfluxDB is not clustered, I have not seen any errors like this with a single InfluxDB server, it crops up when I try and cluster InfluxDB. Assuming this query is correct the answer is no:

> select value from "test_data" where value=NaN
>

CVTJNII · 2015-10-06T17:57:57Z

I've reproduced this in a Docker Compose environment. See the following repo with details on how to reproduce: https://github.com/CVTJNII/Influx_4331_proof

otoolep · 2015-10-07T00:28:25Z

Thanks for the detailed information @CVTJNII -- I will try to take a look at this.

bwhaley · 2015-10-22T19:44:42Z

+1, seeing the same behavior with 0.9.5-nightly-3a5044d. Thanks for the detailed report, describes almost exactly my scenario (but with N=3, not N=5).

sergzin · 2015-11-10T14:17:54Z

+1 on 0.9.5-nightly-827c513 with 5 servers in cluster and replication factor 3.
I don't see similar issues on testdb with replication factor = No of servers in cluster

jsternberg · 2016-05-17T23:57:20Z

I'm going to close this since clustering is no longer supported in the open source version. Thank you.

beckettsean changed the title ~~Time not handled correctly from mean() query on sharded data (with nasty side effects)~~ [0.9.5-nightly] Time not handled correctly from mean() query on sharded data (with nasty side effects) Oct 5, 2015

beckettsean changed the title ~~[0.9.5-nightly] Time not handled correctly from mean() query on sharded data (with nasty side effects)~~ [0.9.5-nightly] GROUP BY time() not handled correctly from mean() query on distributed cluster data (with nasty side effects) Oct 5, 2015

beckettsean added area/queries category/clustering labels Oct 6, 2015

otoolep self-assigned this Oct 7, 2015

otoolep added this to the 0.9.5 milestone Oct 7, 2015

pauldix modified the milestones: 0.10.0, 0.9.5 Dec 8, 2015

jwilder unassigned otoolep Jan 15, 2016

jwilder removed this from the 0.10.0 milestone Feb 1, 2016

jsternberg closed this as completed May 17, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[0.9.5-nightly] GROUP BY time() not handled correctly from mean() query on distributed cluster data (with nasty side effects) #4331

[0.9.5-nightly] GROUP BY time() not handled correctly from mean() query on distributed cluster data (with nasty side effects) #4331

CVTJNII commented Oct 5, 2015

beckettsean commented Oct 6, 2015

CVTJNII commented Oct 6, 2015

CVTJNII commented Oct 6, 2015

otoolep commented Oct 7, 2015

bwhaley commented Oct 22, 2015

sergzin commented Nov 10, 2015

jsternberg commented May 17, 2016

[0.9.5-nightly] GROUP BY time() not handled correctly from mean() query on distributed cluster data (with nasty side effects) #4331

[0.9.5-nightly] GROUP BY time() not handled correctly from mean() query on distributed cluster data (with nasty side effects) #4331

Comments

CVTJNII commented Oct 5, 2015

beckettsean commented Oct 6, 2015

CVTJNII commented Oct 6, 2015

CVTJNII commented Oct 6, 2015

otoolep commented Oct 7, 2015

bwhaley commented Oct 22, 2015

sergzin commented Nov 10, 2015

jsternberg commented May 17, 2016