
A solution for 'unstable system time making metricd ignore incoming measurements' #307

Closed
Asu4ni opened this issue Aug 31, 2017 · 17 comments

@Asu4ni
Contributor

Asu4ni commented Aug 31, 2017

Sometimes my NTP client overshoots when adjusting the system time, like this:
10:00 -> 11:30 -> 10:50 ....... (the current time is actually about 10:50)
And as expected, metricd did not start processing incoming measurements until 11:30.

Is there a way to work around this right now? For example, a command or API to delete all future measures (measures whose timestamp is later than the current time).
If not, should we implement such a feature? I guess waiting 30 minutes to recover from the unstable state is fine, but what if the system time was accidentally set several days ahead?

@chungg
Member

chungg commented Aug 31, 2017

huh? metricd doesn't work based on your system time. it just uses whatever timestamp you write in, as long as it fits into the back window (if the metric isn't new)

@Asu4ni
Contributor Author

Asu4ni commented Aug 31, 2017

Sorry, I didn't mention that I'm testing Gnocchi in an all-in-one OpenStack cloud, so the timestamps coming from Ceilometer were wrong as well.
I guess my problem can be extended to a more general one:
how do we delete or fix wrong measurements that were already sent to Gnocchi?
As far as I understand, the only way is to delete the metric directly in the storage backend. But I think it would be nice to have an API to delete measures within a certain period.

@Asu4ni
Contributor Author

Asu4ni commented Aug 31, 2017

Here are some logs from my system if anyone wants to take a look.

  • NTP (chrony) tracking.log
    add 8 hours to the times in this log, since they are in UTC.
2017-08-30 23:30:55 103.226.213.30   2    -27.107      0.049 -6.065e+03 N  1  1.496e+03  3.109e-10
2017-08-31 01:12:30 61.216.153.104   4    -27.107      0.049 -1.483e+03 N  1  1.486e+03  2.695e-10
2017-08-31 01:37:48 103.226.213.30   2    -27.107      0.049 -1.830e+02 N  1  2.092e+03  3.110e-10
2017-08-31 01:41:20 61.216.153.104   4    -27.107      0.049 -1.074e+03 N  1  2.392e+03  2.672e-10
2017-08-31 01:59:50 103.226.213.30   2    -27.107      0.049 -4.912e+02 N  1  2.487e+03  3.175e-10
2017-08-31 02:08:30 61.216.153.104   4    -27.107      0.049  1.909e+03 N  1  3.031e+03  2.600e-10
2017-08-31 01:37:16 103.226.213.30   2    -27.107      0.049 -3.777e+03 N  1  2.535e+03  3.177e-10
2017-08-31 02:40:42 0.0.0.0          0    -27.107      0.049  0.000e+00 ?  0  0.000e+00  2.573e-10
2017-08-31 02:44:32 61.216.153.104   4    -27.107      0.058  9.276e+03 N  2  1.510e-03  2.322e-09
2017-08-31 00:10:25 61.216.153.104   4    -27.106      0.062  1.467e-04 N  1  1.203e-03  3.520e-10
2017-08-31 00:11:29 61.216.153.104   4    -27.106      0.068 -9.852e-04 N  2  2.543e-03 -1.199e-04
  • graph of cpu_util of an instance (aggregation: count)
    the time is local (+8); the polling interval is 10s (6 points/min)

image

I have checked the Ceilometer logs and the Gnocchi API log. I restarted the Ceilometer services after the time stabilized, and the timestamps sent to Gnocchi after that were correct. The Gnocchi API receives batch measures every 10s, so I suppose the issue resides only on the metricd side.

@chungg
Member

chungg commented Aug 31, 2017

oh, i understand now. you sent a point whose timestamp was incorrectly set in the future.

hmm... i don't know if we can handle this. the reason is, once that future point is processed, gnocchi rebuilds the back window around it. therefore, even if we supported deleting points within the archive policy timespan, there is a very good chance the back window cannot be restored to the correct spot. this means the next point that comes in could wipe out good points.

i.e. if we do hourly aggregates with points at 10:30, 10:35, 12:00 <- wrong, 10:40: even if we delete 12:00, we no longer have the 10:30 and 10:35 points, so if we pass in 10:40 and process it, it will effectively throw away the 10:30 and 10:35 points
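To make the mechanics concrete, here is a minimal Python sketch (a toy model, not Gnocchi's actual implementation; `window_start` is a hypothetical helper) of how an epoch-aligned back window behaves once a future point has been processed, assuming hourly granularity and back_window=0:

```python
from datetime import datetime, timedelta

def window_start(latest_processed, granularity, back_window):
    """Oldest timestamp still accepted once `latest_processed` has been
    aggregated; periods are counted from the epoch (toy model)."""
    epoch = datetime(1970, 1, 1)
    periods = (latest_processed - epoch) // granularity
    return epoch + granularity * (periods - back_window)

granularity = timedelta(hours=1)

# the wrong 12:00 point gets processed first, moving the window forward
start = window_start(datetime(2017, 8, 31, 12, 0), granularity, back_window=0)
print(start)  # 2017-08-31 12:00:00

# the correct points now all fall outside the window and are dropped
for ts in [datetime(2017, 8, 31, 10, 30),
           datetime(2017, 8, 31, 10, 35),
           datetime(2017, 8, 31, 10, 40)]:
    print(ts, "accepted" if ts >= start else "ignored")
```

In this toy model, even after the 12:00 point is deleted from storage, the window marker has already advanced, which is why the earlier points cannot be recovered.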

@Asu4ni
Contributor Author

Asu4ni commented Aug 31, 2017

Yes, I understand the issue now. thanks!

This seems to be a difficult problem to solve, but I assume this kind of accident happens now and then, so we should handle it somehow.

And continuing your example, the current situation has something even worse:
points: 10:30 10:35 13:00(wrong) 10:40 10:45 .... 11:00 11:05....
10:40, 10:45 ... 10:55 cannot be processed since points like 10:30 were lost. What's worse is that 11:00, 11:05 are not processed either. I think this part is somewhat easier to solve: maybe offer a request to reset the current window to the current time.

Some possible solutions I came up with:

  • delete measures within a certain period, and reset the current window (maybe clearing the measurements already there as well)
  • reject measurements from the future (this only fixes timestamp errors; if the clock of the machine Gnocchi runs on is also wrong, the problem remains)

BTW, in the Grafana screenshot I posted above, I supposed that I shouldn't see any points appear until around 10:44 (02:44:32 + 8 hours), but some points seemed to pop up randomly. I don't know why; maybe it is a bug?

Here is another screenshot zoomed in closer (6 points/min is the normal case):
image

@jd
Member

jd commented Aug 31, 2017

But I assume that this kind of accident does happen now and then, so we should solve it somehow

Not really. A system with a wrong clock will always have plenty of other problems, so operators don't usually let that go into production.

Maybe offer a request to reset current window to the current time.

That would basically be a call to delete points in the timeseries. I've logged a feature request in #309.

reject measurement from the future. (this only fix timestamp error. If the time of the machine gnocchi is on is also incorrect, the problem is not solved)

Yeah, though if your computers are not completely synchronized, this could be a problem. I don't like the idea of tying anything to $NOW in general. Many TSDBs do that, and it makes them terrible in this regard.

BTW, in the screenshot of grafana I posted above, I supposed that I shouldn't see any metrics appeared until around 10:44 (02:44:32 + 8hour), but some metrics seemed to pop up randomly.

Why shouldn't there be any metric at 10:44? Do you have a history of the points you sent?

I don't know why, and maybe it is a bug?

So try to understand why and we'll know if it's a bug. :)

@jd jd added the question label Aug 31, 2017
@Asu4ni
Contributor Author

Asu4ni commented Aug 31, 2017

Why shouldn't be there any metric at 10:44? Do you have a history of the point you sent?

From my chrony log (also pasted above):
2017-08-31 02:44:32 61.216.153.104 4 -27.107 0.058 9.276e+03 N 2 1.510e-03 2.322e-09
The furthest time chrony jumped to is 02:44 (UTC), i.e. 10:44 local.
My granularity is 1 min and back_window=1, so I supposed that any incoming measurement sent to Gnocchi with a timestamp < 10:44:00 would be ignored. But as you saw, there are points here and there which appear irregularly.
Another interesting thing is that at 10:44, the number of points for each metric exceeds 6. I guess the measurements received while the time was accidentally set to 10:44, combined with the measurements received at the real 10:44, were aggregated together. I'm not sure whether this is intended behavior.

As for the history of the points I sent, I don't have the details since they were sent by Ceilometer. But I have checked the gnocchi-api log file, and the batch measures were correctly received every 10s.
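If the measures received while the clock wrongly read 10:44 and the measures received when the real time reached 10:44 land in the same bucket, a count aggregation would indeed exceed 6. A rough Python sketch of that idea (a hypothetical `count_aggregate` helper, not Gnocchi code), assuming 1-minute buckets aligned on the epoch:

```python
from collections import Counter
from datetime import datetime, timedelta

def count_aggregate(timestamps, granularity=timedelta(minutes=1)):
    """Toy 'count' aggregation: raw measures per epoch-aligned bucket."""
    epoch = datetime(1970, 1, 1)
    return Counter(epoch + granularity * ((ts - epoch) // granularity)
                   for ts in timestamps)

# six measures sent while the clock wrongly read 10:44, plus six more
# sent when the real time reached 10:44
fake = [datetime(2017, 8, 31, 10, 44, s) for s in range(0, 60, 10)]
real = [datetime(2017, 8, 31, 10, 44, s) for s in range(5, 60, 10)]
print(count_aggregate(fake + real))  # one 10:44 bucket with count 12
```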

@jd
Member

jd commented Sep 1, 2017

Chrony is an NTP daemon; its log does not tell us which measures you sent in which order. Without that, it's impossible to say whether there's a bug or not. The back window applies at processing time: if you send 2 days of measures in one batch, they won't be ignored, whatever the back window is.

The back window is a guarantee of what will be processed, not of what will be ignored.

I'm closing this because there are no actionable details here. If you have a way to actually reproduce what you think is a problem, please feel free to reopen with reproduction steps.

@jd jd closed this as completed Sep 1, 2017
@Asu4ni
Contributor Author

Asu4ni commented Sep 1, 2017

OK, thanks! I'll dig into this further sometime.

@Asu4ni
Contributor Author

Asu4ni commented Sep 6, 2017

(@jd please help reopen this, thanks!)

I did a little experiment (timezone +8):

  • granularity: 60s / 30 min (only points from the 60s granularity are shown below)
  • back_window: 0
  • aggregation: count
  • polling interval: 20s (3 points/min)

time: 14:30

start sending measures from ceilometer

time: 14:35

change the system time to 15:00 (incorrect)

time: 15:05(incorrect) / 14:40

change the system time back to 14:40
restart services, including ceilometer-*, metricd, and the api
image

time: 15:10

image

api.txt
The attachment is the Gnocchi API log file; I also output the timestamp of each incoming measurement. Check it out if you're interested. Note that the times are in UTC; add 8 hours before comparing with the screenshots.

I don't know if this behavior matches the intended design. Some metrics have 6 points, which is kind of weird.

@jd
Member

jd commented Sep 6, 2017

back_window: ?

Ah, so you want to debug a back window problem without knowing what its value is? Nice. :)

It's impossible to know the back window without knowing the full archive policy definition. You said "granularity: 60s" but is that the only definition in your AP?

If that's the case your back window is between 0 and 60s.

Also, you don't say how often you run metricd; you might be running it every hour or so. Again, I'm going to repeat myself: the back window is a guarantee of what will be processed, not of what will be ignored. So when metricd runs changes what will or will not be ignored.

First graph:

  • your 14:35-15:00 points never existed, since you bumped the clock to 15:00; Ceilometer did not generate any data for 14:35-15:00, so you have a gap.

Second graph:

  • you come back to 14:40, so Ceilometer generates points for 14:40, 14:41, etc… except these are outside the back window, so they are ignored. Gnocchi starts processing points again once they fall back into the back window, which seems to happen at 15:00 – which makes me think your back window is more like 5 minutes than 1 minute. Obviously that gives your graph 6 points, since Ceilometer sent data twice for 15:00, 15:01, etc.

@Asu4ni
Contributor Author

Asu4ni commented Sep 6, 2017

@jd, I put '?' as a placeholder while I was typing and just forgot to replace it. I have updated it.

@Asu4ni
Contributor Author

Asu4ni commented Sep 6, 2017

There's also another aggregation with 30-minute granularity.

@jd
Member

jd commented Sep 6, 2017

There's also another 30 minutes aggregation.

So your back window is between 0 and 30 minutes – I knew it was at least 5 minutes from your graph. In that case the intervals are 14:30-15:00 and then 15:00-15:30.

No bug then. Sorry!

@Asu4ni
Contributor Author

Asu4ni commented Sep 6, 2017

@jd No need to say sorry. ^^ I'm just trying to figure out the mechanism. Now I'm really curious about how time is divided. I originally supposed that the intervals start at the time the metric is created. For example, for a metric with 30 min granularity created at 08:09, the intervals would be 08:09 ~ 08:39, 08:39 ~ 09:09, etc. Then I found out that they aren't. What happens, then, if the granularity doesn't divide an hour evenly, 29 min for example?
Anyway, thanks for the help.
Anyway, thanks for the help.

@jd
Member

jd commented Sep 6, 2017

I originally supposed that the time interval starts at the time when the metric is created.

No, time is divided starting from the epoch (1970-01-01T00:00:00), adding your aggregation period repeatedly from there.
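In other words, every aggregation period begins a whole number of granularities after the epoch. A quick Python sketch of the idea (the `bucket_start` helper is hypothetical, not Gnocchi's API):

```python
from datetime import datetime, timedelta

def bucket_start(ts, granularity):
    """Start of the aggregation period containing `ts`, counting whole
    periods from the epoch (1970-01-01T00:00:00)."""
    epoch = datetime(1970, 1, 1)
    return epoch + granularity * ((ts - epoch) // granularity)

# a metric "created" at 08:09 still gets epoch-aligned 30-min buckets
print(bucket_start(datetime(2017, 9, 6, 8, 9), timedelta(minutes=30)))
# -> 2017-09-06 08:00:00

# a 29-min granularity doesn't divide an hour evenly, but the buckets
# are still well defined; their boundaries simply drift across hours
print(bucket_start(datetime(2017, 9, 6, 8, 9), timedelta(minutes=29)))
```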

@Asu4ni
Contributor Author

Asu4ni commented Sep 6, 2017

OK, got it. I'll post something if I find anything interesting.
