
A solution for 'unstable system time making metricd ignore incoming measurements' #307

Closed
Asu4ni opened this issue Aug 31, 2017 · 17 comments

@Asu4ni
Contributor

Asu4ni commented Aug 31, 2017

Sometimes my NTP client overshoots when adjusting the system time, like this:
10:00 -> 11:30 -> 10:50 ....... (the current time is actually about 10:50)
And as expected, metricd did not start processing incoming measurements until 11:30.

Is there a way to work around this right now? For example, a command or API to delete all future measures (measures whose timestamp is later than the current time).
If not, should we implement such a feature? I guess waiting 30 minutes to recover from the unstable state is fine, but what if the system time was accidentally set several days ahead?

@chungg
Member

chungg commented Aug 31, 2017

huh? metricd doesn't work based on your system time. it just uses whatever timestamp you write in, as long as it fits into the back window (if the metric isn't new)

@Asu4ni
Contributor Author

Asu4ni commented Aug 31, 2017

Sorry, I didn't mention that I'm testing Gnocchi in an all-in-one OpenStack cloud, so the timestamps coming from Ceilometer were wrong as well.
I guess my problem can be extended to a more general one:
how do we delete or fix wrong measurements that were already sent to Gnocchi?
As far as I understand, the only way is to delete the metric directly in the storage backend. But I think it would be nice to have an API to delete measures within a certain period.

@Asu4ni
Contributor Author

Asu4ni commented Aug 31, 2017

Here are some logs from my system if anyone wants to take a look.

  • NTP (chrony) tracking.log
    add 8 hours to the times in this log, since they are in UTC.
2017-08-30 23:30:55 103.226.213.30   2    -27.107      0.049 -6.065e+03 N  1  1.496e+03  3.109e-10
2017-08-31 01:12:30 61.216.153.104   4    -27.107      0.049 -1.483e+03 N  1  1.486e+03  2.695e-10
2017-08-31 01:37:48 103.226.213.30   2    -27.107      0.049 -1.830e+02 N  1  2.092e+03  3.110e-10
2017-08-31 01:41:20 61.216.153.104   4    -27.107      0.049 -1.074e+03 N  1  2.392e+03  2.672e-10
2017-08-31 01:59:50 103.226.213.30   2    -27.107      0.049 -4.912e+02 N  1  2.487e+03  3.175e-10
2017-08-31 02:08:30 61.216.153.104   4    -27.107      0.049  1.909e+03 N  1  3.031e+03  2.600e-10
2017-08-31 01:37:16 103.226.213.30   2    -27.107      0.049 -3.777e+03 N  1  2.535e+03  3.177e-10
2017-08-31 02:40:42 0.0.0.0          0    -27.107      0.049  0.000e+00 ?  0  0.000e+00  2.573e-10
2017-08-31 02:44:32 61.216.153.104   4    -27.107      0.058  9.276e+03 N  2  1.510e-03  2.322e-09
2017-08-31 00:10:25 61.216.153.104   4    -27.106      0.062  1.467e-04 N  1  1.203e-03  3.520e-10
2017-08-31 00:11:29 61.216.153.104   4    -27.106      0.068 -9.852e-04 N  2  2.543e-03 -1.199e-04
  • graph of cpu_util of an instance (aggregation: count)
    the time is local (+8); the polling interval is 10s (6 points/min)

image

I have checked the Ceilometer logs and the Gnocchi API log. I restarted the Ceilometer services after the time stabilized, and the timestamps sent to Gnocchi after that were correct. The Gnocchi API receives batch measures every 10s, so I suppose the issue resides only on the metricd side.

@chungg
Member

chungg commented Aug 31, 2017

oh, i understand now. you sent a point whose timestamp was incorrectly set in the future.

hmm... i don't know if we can handle this. the reason is, once that future point is processed, gnocchi rebuilds the back window around it. therefore, even if we supported deleting points within the archive policy timespan, there is a very good chance the back window cannot be restored to the correct spot. this means the next point that comes in could wipe out good points.

i.e. if we do hourly aggregates with points at 10:30, 10:35, 12:00 <- wrong, 10:40: even if we delete 12:00, we no longer have the 10:30 and 10:35 points, so if we pass in 10:40 and process it, it will effectively throw away the 10:30 and 10:35 points
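To make the mechanics concrete, here is a minimal Python sketch (a toy model, not Gnocchi's actual implementation; `window_start` is a hypothetical helper) of how an epoch-aligned back window behaves once a future point has been processed, assuming hourly granularity and back_window=0:

```python
from datetime import datetime, timedelta

def window_start(latest_processed, granularity, back_window):
    """Oldest timestamp still accepted once `latest_processed` has been
    aggregated; periods are counted from the epoch (toy model)."""
    epoch = datetime(1970, 1, 1)
    periods = (latest_processed - epoch) // granularity
    return epoch + granularity * (periods - back_window)

granularity = timedelta(hours=1)

# the wrong 12:00 point gets processed first, moving the window forward
start = window_start(datetime(2017, 8, 31, 12, 0), granularity, back_window=0)
print(start)  # 2017-08-31 12:00:00

# the correct points now all fall outside the window and are dropped
for ts in [datetime(2017, 8, 31, 10, 30),
           datetime(2017, 8, 31, 10, 35),
           datetime(2017, 8, 31, 10, 40)]:
    print(ts, "accepted" if ts >= start else "ignored")
```

In this toy model, even after the 12:00 point is deleted from storage, the window marker has already advanced, which is why the earlier points cannot be recovered.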

@Asu4ni
Contributor Author

Asu4ni commented Aug 31, 2017

Yes, I understand the issue now. thanks!

This seems to be a difficult problem to solve, but I assume this kind of accident happens now and then, so we should handle it somehow.

And continuing your example, the current situation has something even worse:
points: 10:30 10:35 13:00(wrong) 10:40 10:45 .... 11:00 11:05....
10:40, 10:45 ... 10:55 cannot be processed since points like 10:30 were lost. What's worse is that 11:00, 11:05 are not processed either. I think this part is somewhat easier to solve: maybe offer a request to reset the current window to the current time.

Some possible solutions I came up with:

  • delete measures within a certain period, and reset the current window (maybe clearing the measurements already there as well)
  • reject measurements from the future (this only fixes timestamp errors; if the clock of the machine Gnocchi runs on is also wrong, the problem remains)

BTW, in the Grafana screenshot I posted above, I supposed that I shouldn't see any points appear until around 10:44 (02:44:32 + 8 hours), but some points seemed to pop up randomly. I don't know why; maybe it is a bug?

Here is another screenshot zoomed in closer (6 points/min is the normal case):
image

@jd
Member

jd commented Aug 31, 2017

But I assume that this kind of accident does happen now and then, so we should solve it somehow

Not really. A system with a wrong clock will always have plenty of other problems, so operators don't usually let that go into production.

Maybe offer a request to reset current window to the current time.

That would basically be a call to delete points in the timeseries. I've logged a feature request in #309.

reject measurement from the future. (this only fix timestamp error. If the time of the machine gnocchi is on is also incorrect, the problem is not solved)

Yeah, though if your computers are not completely synchronized, this could be a problem. I don't like the idea of tying anything to $NOW in general. Many TSDBs do that, and it makes them terrible in this regard.

BTW, in the screenshot of grafana I posted above, I supposed that I shouldn't see any metrics appeared until around 10:44 (02:44:32 + 8hour), but some metrics seemed to pop up randomly.

Why shouldn't there be any metric at 10:44? Do you have a history of the points you sent?

I don't know why, and maybe it is a bug?

So try to understand why and we'll know if it's a bug. :)

@jd jd added the question label Aug 31, 2017
@Asu4ni
Contributor Author

Asu4ni commented Aug 31, 2017

Why shouldn't be there any metric at 10:44? Do you have a history of the point you sent?

From my chrony log (also pasted above):
2017-08-31 02:44:32 61.216.153.104 4 -27.107 0.058 9.276e+03 N 2 1.510e-03 2.322e-09
The furthest time chrony jumped to is 02:44 (UTC), i.e. 10:44 local.
My granularity is 1 min and back_window=1, so I supposed that any incoming measurement sent to Gnocchi with a timestamp < 10:44:00 would be ignored. But as you saw, there are points here and there which appear irregularly.
Another interesting thing is that at 10:44, the number of points for each metric exceeds 6. I guess the measurements received while the time was accidentally set to 10:44, combined with the measurements received at the real 10:44, were aggregated together. I'm not sure whether this is intended behavior.

As for the history of the points I sent, I don't have the details since they were sent by Ceilometer. But I have checked the gnocchi-api log file, and the batch measures were correctly received every 10s.
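If the measures received while the clock wrongly read 10:44 and the measures received when the real time reached 10:44 land in the same bucket, a count aggregation would indeed exceed 6. A rough Python sketch of that idea (a hypothetical `count_aggregate` helper, not Gnocchi code), assuming 1-minute buckets aligned on the epoch:

```python
from collections import Counter
from datetime import datetime, timedelta

def count_aggregate(timestamps, granularity=timedelta(minutes=1)):
    """Toy 'count' aggregation: raw measures per epoch-aligned bucket."""
    epoch = datetime(1970, 1, 1)
    return Counter(epoch + granularity * ((ts - epoch) // granularity)
                   for ts in timestamps)

# six measures sent while the clock wrongly read 10:44, plus six more
# sent when the real time reached 10:44
fake = [datetime(2017, 8, 31, 10, 44, s) for s in range(0, 60, 10)]
real = [datetime(2017, 8, 31, 10, 44, s) for s in range(5, 60, 10)]
print(count_aggregate(fake + real))  # one 10:44 bucket with count 12
```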

@jd
Member

jd commented Sep 1, 2017

Chrony is an NTP daemon; its log does not tell us which measures you sent in which order. Without that, it's impossible to say whether there's a bug or not. The back window applies at processing time: if you send 2 days of measures in one batch, they won't be ignored, whatever the back window is.

The back window is a guarantee of what will be processed, not of what will be ignored.

I'm closing this because there are no actionable details here. If you have a way to actually reproduce what you think is a problem, please feel free to reopen with reproduction steps.

@jd jd closed this as completed Sep 1, 2017
@Asu4ni
Contributor Author

Asu4ni commented Sep 1, 2017

OK, thanks! I'll dig into this further sometime.

@Asu4ni
Contributor Author

Asu4ni commented Sep 6, 2017

(@jd please help reopen this, thanks!)

I did a little experiment (timezone +8):

  • granularity: 60s / 30 min (only points from the 60s granularity are shown below)
  • back_window: 0
  • aggregation: count
  • polling interval: 20s (3 points/min)

time: 14:30

start sending measures from ceilometer

time: 14:35

change the system time to 15:00 (incorrect)

time: 15:05(incorrect) / 14:40

change the system time back to 14:40
restart services, including ceilometer-*, metricd, and the api
image

time: 15:10

image

api.txt
The attachment is the Gnocchi API log file; I also output the timestamp of each incoming measurement. Check it out if you're interested. Note that the times are in UTC; add 8 hours before comparing with the screenshots.

I don't know if this behavior matches the intended design. Some metrics have 6 points, which is kind of weird.

@jd
Member

jd commented Sep 6, 2017

back_window: ?

Ah, so you want to debug a back window problem without knowing what its value is? Nice. :)

It's impossible to know the back window without knowing the full archive policy definition. You said "granularity: 60s" but is that the only definition in your AP?

If that's the case your back window is between 0 and 60s.

Also, you don't say how often you run metricd; you might be running it every hour or so. Again, I'm going to repeat myself: the back window is a guarantee of what will be processed, not of what will be ignored. So when metricd runs changes what will or will not be ignored.

First graph:

  • your 14:35-15:00 points never existed, since you bumped the clock to 15:00; Ceilometer did not generate any data for 14:35-15:00, so you have a gap.

Second graph:

  • you come back to 14:40, so Ceilometer generates points for 14:40, 14:41, etc… except these are outside the back window, so they are ignored. Gnocchi starts processing points again once they fall back into the back window, which seems to happen at 15:00 – which makes me think your back window is more like 5 minutes than 1 minute. Obviously that gives your graph 6 points, since Ceilometer sent data twice for 15:00, 15:01, etc.

@Asu4ni
Contributor Author

Asu4ni commented Sep 6, 2017

@jd, I put '?' as a placeholder while I was typing and just forgot to replace it. I have updated it.

@Asu4ni
Contributor Author

Asu4ni commented Sep 6, 2017

There's also another aggregation with 30-minute granularity.

@jd
Member

jd commented Sep 6, 2017

There's also another 30 minutes aggregation.

So your back window is between 0 and 30 minutes – I knew it was at least 5 minutes from your graph. In that case the intervals are 14:30-15:00 and then 15:00-15:30.

No bug then. Sorry!

@Asu4ni
Contributor Author

Asu4ni commented Sep 6, 2017

@jd No need to say sorry. ^^ I'm just trying to figure out the mechanism. Now I'm really curious about how time is divided. I originally supposed that the intervals start at the time the metric is created. For example, for a metric with 30 min granularity created at 08:09, the intervals would be 08:09 ~ 08:39, 08:39 ~ 09:09, etc. Then I found out that they aren't. What happens, then, if the granularity doesn't divide an hour evenly, 29 min for example?
Anyway, thanks for the help.
Anyway, thanks for the help.

@jd
Member

jd commented Sep 6, 2017

I originally supposed that the time interval starts at the time when the metric is created.

No, time is divided starting from the epoch (1970-01-01T00:00:00), adding your aggregation period repeatedly from there.
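In other words, every aggregation period begins a whole number of granularities after the epoch. A quick Python sketch of the idea (the `bucket_start` helper is hypothetical, not Gnocchi's API):

```python
from datetime import datetime, timedelta

def bucket_start(ts, granularity):
    """Start of the aggregation period containing `ts`, counting whole
    periods from the epoch (1970-01-01T00:00:00)."""
    epoch = datetime(1970, 1, 1)
    return epoch + granularity * ((ts - epoch) // granularity)

# a metric "created" at 08:09 still gets epoch-aligned 30-min buckets
print(bucket_start(datetime(2017, 9, 6, 8, 9), timedelta(minutes=30)))
# -> 2017-09-06 08:00:00

# a 29-min granularity doesn't divide an hour evenly, but the buckets
# are still well defined; their boundaries simply drift across hours
print(bucket_start(datetime(2017, 9, 6, 8, 9), timedelta(minutes=29)))
```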

@Asu4ni
Contributor Author

Asu4ni commented Sep 6, 2017

OK, got it. I'll post something if I find anything interesting.
