
Different query results depending on time WHERE clause #5693

Closed
jhedlund opened this issue Feb 15, 2016 · 32 comments

@jhedlund

This problem just cropped up today. I am using 0.10.0 (upgraded last Tuesday).

I am missing data depending on my WHERE clause. For example, if I run:

SELECT first(open) AS open, max(high) AS high, min(low) AS low, last(close) AS close FROM prices."default".ohlc1m WHERE time > now() - 1d AND symbol = 'EURUSD' GROUP BY time(1m)

I am currently missing data from 15:28 UTC to 16:52 UTC today (the last 84 minutes; the results have been cut off at 15:28 UTC ever since 15:28 UTC).

However, if I simply change the time WHERE clause from "1d" to "500m", I get all the data up to the current time (16:52 UTC), e.g.:

SELECT first(open) AS open, max(high) AS high, min(low) AS low, last(close) AS close FROM prices."default".ohlc1m WHERE time > now() - 500m AND symbol = 'EURUSD' GROUP BY time(1m)

What could cause that problem?

@jhedlund (Author)

This also happened earlier today, when the first query was stuck at 12:27 UTC, but it automatically started returning the missing data about an hour or two after 12:27 UTC.

@jhedlund (Author)

Now the "1d" query is returning results again for upto current (19:50 UTC). So it was stuck at 15:28 UTC for almost 4.5 hours... watching to see if it gets stuck again

Does group by perform any caching based on the time in the WHERE clause?

@jhedlund (Author)

It started happening again at 21:56 UTC. I don't see anything in the logs that looks problematic...

@e-dard (Contributor) commented Feb 16, 2016

@jhedlund just to be sure, could you provide the influxd logs?

@jhedlund (Author)

Here you go:
influxd.log.gz

It happened again a couple of times last night. Currently it is returning the data correctly to the minute, but I expect it to happen again...

@jhedlund (Author)

It's happening again. Playing around with the queries a bit, it's interesting: time > now() - 1d stops with data at 14:16 UTC. If I do time > now() - 4h, it has data up to 16:18 UTC. If I do time > now() - 2h, I get up-to-the-minute data (16:57 UTC).

@jhedlund (Author)

Looking again now, the previous comment still holds true: 1d still stops at 14:16 UTC, 4h still stops at 16:18 UTC, and 2h is complete up to the latest minute (17:42 UTC). It seems like it's reading from some sort of cache that isn't being updated properly?

@t-orb commented Feb 16, 2016

I have this problem too, and as in the comment above it looks like a caching issue to me as well. I run queries with different time-based WHERE clauses and get different results (data missing in all but the first one, usually). If I go away for a while and come back, the other WHERE clauses sometimes work fine. Go away for a long time and they all work for older data, but newer data sometimes has this problem again.

I can't reliably reproduce it, but when it strikes it sticks for some time. I'll have one query that reliably shows that I have all the data (usually I'll do a sum() of everything by time(5m)), but any other query trying to show the same data will show nothing or, more frequently, just a tiny fraction of the expected data. Stopping and restarting influxd makes no difference; I just have to wait and eventually it sorts itself out. Nothing unusual in the logs at all.
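(For concreteness, the "sum of everything" check looks roughly like this, with placeholder database, measurement, and field names, since mine aren't important:)

SELECT sum(value) FROM mydb."default".mymeasurement WHERE time > now() - 1d GROUP BY time(5m)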

@jhedlund (Author)

Agreed on stopping/starting; that doesn't help, and it was my first attempt at fixing the problem.

How long have you been seeing the problem? I did not see it until upgrading to 0.10, but it took almost a week to show itself. It ran smoothly last week.

@jwilder (Contributor) commented Feb 16, 2016

This might be an upgrade issue similar to #5474.

@jhedlund @t-orb Can you both run show servers and show shards, and cat the contents of your meta/node.json file? If you could put the output in a gist, that would be great.
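(For anyone following along: these can be run from the influx CLI with -execute, and node.json lives under the meta directory; the path below assumes a default Linux package install.)

influx -execute 'SHOW SERVERS'
influx -execute 'SHOW SHARDS'
cat /var/lib/influxdb/meta/node.json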

@jhedlund (Author)

@t-orb commented Feb 16, 2016

https://gist.github.com/t-orb/134916765f84abf1885a

I can rule out an upgrade issue (at least for me); this is a very simple and clean apt-get install of 0.10.

I haven't had this problem today, so it may be related to loading a lot of data. Last week I was inserting about a week's worth of data as fast as I could go (a bit slower than I'd have liked); since then I'm only inserting about 200k data points every 5 minutes (in just 5 POSTs). I was seeing this problem after all the data had been loaded, though there may be a lot of internal work going on for quite a while after a large load.

@jwilder (Contributor) commented Feb 16, 2016

Judging from both of those gists, this doesn't appear to be an upgrade issue.

@jhedlund (Author)

I write quite a bit of data all the time, but this problem just started happening this week. I don't think I'm writing any more data than usual...

Note: I have not yet upgraded my old data to tsm. I had previously tried twice, but I ran into an issue where a lot of data went missing (which I believe to be issue #5606).

I don't think the old database would be an issue here, though, since the missing "cached" (?) data is all stored in the new tsm anyway.

@jhedlund (Author)

I may have found a clue. I do my own rollup queries from the tick table into the minute table (the minute table is where this problem is occurring). Normally I query the last 10 minutes from the tick table and push the result into the minute table. I forgot that I had temporarily set it to look at the last 120 minutes this week, to backfill some missing data.

I've moved it back to 10 minutes and will check whether that recovers and stays recovered. The overwriting of the past 120 minutes of data could have been causing the data to not appear for some reason. (For my rollup query, the insert always overwrites the data for the past x minutes.)
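For illustration, such a rollup can be expressed as a single InfluxQL SELECT ... INTO query. This is only a sketch: the tick measurement name and price field below are assumptions, not the actual schema. Grouping by the symbol tag keeps it as a tag on the target measurement.

SELECT first(price) AS open, max(price) AS high, min(price) AS low, last(price) AS close INTO prices."default".ohlc1m FROM prices."default".tick WHERE time > now() - 10m AND symbol = 'EURUSD' GROUP BY time(1m), symbol

Because each run re-selects a window that was already rolled up, it re-writes the same series and timestamps, so those points are overwritten rather than duplicated.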

@jhedlund (Author)

Note: the change in the previous comment did not help much, if at all.

@jhedlund (Author)

This is continuing to cause production issues for me. Is there anything else I can do to "flush" the cache to force the data to be up to date? Reminder: Restarting influx does not help.

I'd really appreciate any help here, thanks.

@jwilder (Contributor) commented Feb 19, 2016

@jhedlund Do you have any way for us to reproduce this locally? Can you provide some sample writes, in line protocol format, that you're sending? #5751 is a fix for the cache, related to writing duplicate points (same timestamp) for the same series, that may help here.
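For reference, writes to this schema would look something like the following line protocol (the field values and timestamp here are made up for illustration, based on the queries above):

ohlc1m,symbol=EURUSD open=1.1135,high=1.1142,low=1.1130,close=1.1138 1455897600000000000

The trailing timestamp is in nanoseconds; writing the same measurement, tag set, and timestamp again replaces the point's field values.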

@jhedlund (Author)

Is the only way for me to test whether #5751 fixes my issue to download the latest code and compile it?

My points do eventually show up, but I don't know what causes them to appear. I also don't know whether this happened in #5751, but in my case the points are all there if I use a shorter time in the WHERE clause.

Another potential difference here: points only go missing when I use GROUP BY; the raw data points are all there if I don't GROUP BY.

@jwilder (Contributor) commented Feb 19, 2016

@jhedlund Ok. #5751 does not seem relevant to your issue if you see the data without a GROUP BY.

When you see the points missing, does it make any difference if you run each aggregate as a separate query? So instead of:

SELECT first(open) AS open, max(high) AS high, min(low) AS low, last(close) AS close FROM prices."default".ohlc1m WHERE time > now() - 1d AND symbol = 'EURUSD' GROUP BY time(1m)

You run:

SELECT first(open) AS open FROM prices."default".ohlc1m WHERE time > now() - 1d AND symbol = 'EURUSD' GROUP BY time(1m)
SELECT max(high) AS high FROM prices."default".ohlc1m WHERE time > now() - 1d AND symbol = 'EURUSD' GROUP BY time(1m)
SELECT min(low) AS low FROM prices."default".ohlc1m WHERE time > now() - 1d AND symbol = 'EURUSD' GROUP BY time(1m)
SELECT last(close) AS close FROM prices."default".ohlc1m WHERE time > now() - 1d AND symbol = 'EURUSD' GROUP BY time(1m)

@jhedlund (Author)

@jwilder I will try this the next time it happens, which I suspect will not be until Monday at this point, since no data is submitted over the weekend.

I'll let you know, thanks.

@jhedlund (Author)

I take that back, it is still happening.

Separate queries are not making a difference.

@jhedlund (Author)

Right now, with time > now() - 2d, the last data shows at 2016-02-18T12:05:00.

With time > now() - 1d, the last data is current as of 2016-02-19T21:45:00.

So with now() - 2d, about 2020 minutes of data are missing due to that WHERE clause difference.

@randywallace
The exact same thing happens to me, and oddly enough it seems to be related to #5751. I have a couple of metrics this happens on that I intentionally duplicate, because they are based on things happening in log files. Every 30s, I grep (and... a few 'other' things) some 2M lines of logs on ~16 servers for the last 2 minutes' worth of counts of things that happened, then push them to InfluxDB with rolled-up timestamps (i.e. at minute boundaries, so I can easily graph how many of a log message happen per minute without fighting with InfluxQL needlessly). I expect that every single metric is duplicated (occasionally actually updated) at least once.

What is odd is that the data all 'seems' to be there; it's just that the most recent data is only viewable in a shorter time range. I would rather not poll InfluxDB to test whether a metric already exists at the correct value prior to posting it... but it does seem like that would fix it, because the only metrics this happens to are the ones that I know have duplicate data. I have only started testing these new expected-duplicate metrics, and so far I have just been DROP MEASUREMENT'ing the set to clear out this behavior.
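To make the duplicate-write pattern concrete (the measurement, tag, field, and timestamp here are hypothetical), two identical writes to the same series at the same minute-boundary timestamp look like this in line protocol:

log_errors,host=web01 count=12i 1456185600000000000
log_errors,host=web01 count=12i 1456185600000000000

Because both lines have the same series and timestamp, the second write overwrites the first rather than creating a second point, which is the duplicate-point case targeted by #5751's cache fix.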

@jhedlund (Author)

@randywallace Have you tested whether #5751 fixes your issue?

I am overwriting data points, and probably quite often it's duplicate data, but it's only the last 10 points that I overwrite (about every minute).

I also noticed another clue this morning. When the data goes missing, it jumps backwards instantly; it's not a gradual process. For example, this morning at 7:25 ET the data was current, then at 7:28 ET the last couple of hours of data points disappeared, leaving me with a last data point of 5:39 ET.

@jwilder (Contributor) commented Feb 22, 2016

#5751 has been merged and will be in tonight's nightly.

@jhedlund (Author)

Well, I tried installing the nightly build, but it failed to start with an unknown engine error:

[filestore] 2016/02/23 18:02:49 /var/lib/influxdb/data/_internal/monitor/16293/000000004-000000003.tsm (#0) opened in 646.672µs
[filestore] 2016/02/23 18:02:49 /var/lib/influxdb/data/_internal/monitor/16293/000000008-000000003.tsm (#1) opened in 446.411µs
[filestore] 2016/02/23 18:02:49 /var/lib/influxdb/data/_internal/monitor/16293/000000012-000000003.tsm (#2) opened in 479.614µs
[filestore] 2016/02/23 18:02:49 /var/lib/influxdb/data/_internal/monitor/16293/000000014-000000002.tsm (#3) opened in 439.237µs
[filestore] 2016/02/23 18:02:49 /var/lib/influxdb/data/_internal/monitor/16293/000000015-000000001.tsm (#4) opened in 393.18µs
[cacheloader] 2016/02/23 18:02:49 reading file /var/lib/influxdb/wal/_internal/monitor/16293/_00046.wal, size 2191601
[cacheloader] 2016/02/23 18:02:50 reading file /var/lib/influxdb/wal/_internal/monitor/16293/_00059.wal, size 0
run: open server: open tsdb store: failed to open shard 1: new engine: unknown engine type

I assume that's some other issue going on, so I reverted back to v0.10.1 (where I then ran into #5772, but just restored the node.json property to get past that).

Should I open a new issue for the nightly failure above?

@jwilder (Contributor) commented Feb 23, 2016

@jhedlund That is expected for #5723. 0.11 removes support for b1 and bz1 shards. If you want to run the nightly and 0.11, you would need to convert your shards to tsm using influx_tsm.
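For anyone else hitting this, the conversion looks roughly like the following (the data path assumes a default Linux package install, and exact flags may vary by version, so check influx_tsm -h first):

# stop influxd, then convert b1/bz1 shards, keeping a backup of the originals
influx_tsm -backup /tmp/influxdb_backup /var/lib/influxdb/data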

@jhedlund (Author)

@jwilder Ok, thanks, I was not aware of that.

@jhedlund (Author)

I have upgraded to the nightly, and upon startup it has solved the problem. So perhaps #5751 has resolved this issue. I suppose I'll know more in the hours to come if it does not crop back up.

@jwilder (Contributor) commented Mar 11, 2016

Not sure whether this issue is still occurring, but #5996 may fix it.

@jwilder (Contributor) commented Mar 21, 2016

Assuming this is fixed via #5996. Let us know if this issue is still occurring with the latest nightlies.

jwilder closed this as completed Mar 21, 2016