Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[0.9.5-nightly] GROUP BY time() not handled correctly from mean() query on distributed cluster data (with nasty side effects) #4331

Closed
CVTJNII opened this issue Oct 5, 2015 · 7 comments

Comments

@CVTJNII
Copy link

CVTJNII commented Oct 5, 2015

In troubleshooting the following Grafana generated query:

SELECT mean("value") AS "value" FROM "Test Data" WHERE "host" = 'Datagen 0' AND time > now() - 15m GROUP BY time(1s)

I have found that, on sharded data, the time field returned from mean isn't handled correctly if the data is not on the host running the query.

I'm currently testing in a 5 node cluster, version 0.9.5-nightly-b079d20, branch master, commit b079d20, built '2015-09-28T04:00:39+0000', with the following replication policy:

> show retention policies on riemannDataGen
name    duration        replicaN        default
default 168h0m0s        2               true

This is causing the data to shard across 4/5 of the servers. Sharding is visible with 'show series' (likely part of the bug or a side effect, as I'd expect 'show series' to show all series):

On node 1:

> show shards
name: riemannDataGen
--------------------
id      start_time              end_time                expiry_time             owners
2       2015-10-05T00:00:00Z    2015-10-06T00:00:00Z    2015-10-13T00:00:00Z    3,4
3       2015-10-05T00:00:00Z    2015-10-06T00:00:00Z    2015-10-13T00:00:00Z    5,1

> show series
name: Test Data
---------------
_key                            host
Test\ Data,host=Datagen\ 1      Datagen 1
Test\ Data,host=Datagen\ 3      Datagen 3
Test\ Data,host=Datagen\ 5      Datagen 5
Test\ Data,host=Datagen\ 7      Datagen 7
Test\ Data,host=Datagen\ 9      Datagen 9

On node 3:

> show shards
name: riemannDataGen
--------------------
id      start_time              end_time                expiry_time             owners
2       2015-10-05T00:00:00Z    2015-10-06T00:00:00Z    2015-10-13T00:00:00Z    3,4
3       2015-10-05T00:00:00Z    2015-10-06T00:00:00Z    2015-10-13T00:00:00Z    5,1

> show series
name: Test Data
---------------
_key                            host
Test\ Data,host=Datagen\ 0      Datagen 0
Test\ Data,host=Datagen\ 2      Datagen 2
Test\ Data,host=Datagen\ 4      Datagen 4
Test\ Data,host=Datagen\ 6      Datagen 6
Test\ Data,host=Datagen\ 8      Datagen 8

The problem I'm seeing is that, on node 1, the select mean query does not handle time correctly which breaks the group by clause:

On node 1:

> SELECT mean("value") AS "value" FROM "Test Data" WHERE "host" = 'Datagen 0' AND time > now() - 15m
name: Test Data
---------------
time    value
0       57.01889591527808

> SELECT mean("value") AS "value" FROM "Test Data" WHERE "host" = 'Datagen 0' AND time > now() - 15m GROUP BY time(1s)
ERR: invalid character 'j' looking for beginning of value

On node 3 it works (limit clause added to truncate output):

> SELECT mean("value") AS "value" FROM "Test Data" WHERE "host" = 'Datagen 0' AND time > now() - 15m
name: Test Data
---------------
time                    value
1444080582551051963     56.90043095744928

> SELECT mean("value") AS "value" FROM "Test Data" WHERE "host" = 'Datagen 0' AND time > now() - 15m GROUP BY time(1s) LIMIT 2
name: Test Data
---------------
time                    value
1444080584000000000     
1444080585000000000     53.193519424975506

> 

On node 1 this also breaks the JSON response in the API:

In [99]: r = requests.get(("http://tminflux01.vagrant.tjnii.local:8086/query?%s" % urllib.urlencode({'db': 'riemannDataGen', 'epoch': 'ms', 'q': 'SELECT mean("value") AS "value" FROM "Test Data" WHERE "host" = \'Datagen 0\' AND time > now() - 15m GROUP BY time(1s)'})), auth=('DataGenreader', password))

In [100]: r.text
Out[100]: u'json: error calling MarshalJSON for type httpd.Response: json: error calling MarshalJSON for type *influxql.Result: json: unsupported value: NaN'

In [101]: r.json()
<Snip trace>
ValueError: No JSON object could be decoded

This has the effect of breaking my Grafana dashboards on sharded data as Grafana combines queries across all shards into a single request. As such the response for the combined query is the MarshalJSON error from above and none of the dashboards work.

@beckettsean beckettsean changed the title Time not handled correctly from mean() query on sharded data (with nasty side effects) [0.9.5-nightly] Time not handled correctly from mean() query on sharded data (with nasty side effects) Oct 5, 2015
@beckettsean beckettsean changed the title [0.9.5-nightly] Time not handled correctly from mean() query on sharded data (with nasty side effects) [0.9.5-nightly] GROUP BY time() not handled correctly from mean() query on distributed cluster data (with nasty side effects) Oct 5, 2015
@beckettsean
Copy link
Contributor

the time field returned from mean isn't handled correctly if the data is not on the host running the query.

Not sure I quite follow this. I can see from your examples that without the GROUP BY time() the query succeeds, and with it it fails. How does that relate to "the time field returned from mean"? The epoch 0 timestamp is proper behavior for queries that have an aggregation but lack a GROUP BY time() statement.

ERR: invalid character 'j' looking for beginning of value

This is an error we see when there are NaNs in the source data: #4104. Is there a possibility of not-a-number values in your source?

Can you paste in the results of SHOW SERVERS?

@CVTJNII
Copy link
Author

CVTJNII commented Oct 6, 2015

Can you paste in the results of SHOW SERVERS?

> show servers
id      cluster_addr            raft
1       192.168.253.137:8088    true
2       192.168.253.140:8088    true
3       192.168.253.141:8088    true
4       192.168.253.143:8088    false
5       192.168.253.142:8088    false

Not sure I quite follow this. I can see from your examples that without the GROUP BY time() the query succeeds, and with it it fails. How does that relate to "the time field returned from mean"? The epoch 0 timestamp is proper behavior for queries that have an aggregation but lack a GROUP BY time() statement.

So while troubleshooting I noticed that the time field returns 0 on the node without the shards but does return a time on the node with the shards. As I'd expect the query to return the same regardless of shard location I assumed that was root cause and stopped. As for the exact semantics of what that field should be, I'm going to admit I'm a but out of my depth as I'm troubleshooting a query built by Grafana, I'm still coming up to speed on InfluxDB query language. So if that part of the bug report is wrong or misleading I apologize, I tried to describe it as well as I could. I'll be happy to edit it.

As for my dataset containing NaNs, it shouldn't. This dataset is being fed from a test data generator which is returning floats 0<= f <= 100. The configuration works if InfluxDB is not clustered, I have not seen any errors like this with a single InfluxDB server, it crops up when I try and cluster InfluxDB. Assuming this query is correct the answer is no:

> select value from "test_data" where value=NaN
> 

@CVTJNII
Copy link
Author

CVTJNII commented Oct 6, 2015

I've reproduced this in a Docker Compose environment. See the following repo with details on how to reproduce: https://github.com/CVTJNII/Influx_4331_proof

@otoolep
Copy link
Contributor

otoolep commented Oct 7, 2015

Thanks for the detailed information @CVTJNII -- I will try to take a look at this.

@otoolep otoolep self-assigned this Oct 7, 2015
@otoolep otoolep added this to the 0.9.5 milestone Oct 7, 2015
@bwhaley
Copy link
Contributor

bwhaley commented Oct 22, 2015

+1, seeing the same behavior with 0.9.5-nightly-3a5044d. Thanks for the detailed report, describes almost exactly my scenario (but with N=3, not N=5).

@sergzin
Copy link

sergzin commented Nov 10, 2015

+1 on 0.9.5-nightly-827c513 with 5 servers in cluster and replication factor 3.
I don't see similar issues on testdb with replication factor = No of servers in cluster

@pauldix pauldix modified the milestones: 0.10.0, 0.9.5 Dec 8, 2015
@jwilder jwilder removed this from the 0.10.0 milestone Feb 1, 2016
@jsternberg
Copy link
Contributor

I'm going to close this since clustering is no longer supported in the open source version. Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants