A solution to solve 'instable system time making metricd ignore incoming measurements' #307
Comments
huh? metricd doesn't work based on your system time. it just uses whatever timestamp you write in and checks whether it fits into the back window (unless it's a new metric) |
Sorry that I didn't mention that I test gnocchi in an all-in-one openstack cloud. So actually the timestamps from the ceilometer were wrong as well. |
There are some logs on my system if anyone wants to take a look.
I have checked the ceilometer log and the gnocchi API log. I restarted the ceilometer service after the time stabilized, and the timestamps sent to gnocchi thereafter were correct. The gnocchi API gets batch measures every 10s. So I suppose the only issue resides on the metric daemon side. |
oh. i understand now. you sent a point with a timestamp that is incorrectly set to the future. hmm... i don't know if we can handle this. the reason is, once that future point is processed, gnocchi will rebuild the backwindow around that point. therefore, even if we support deleting points in the archive policy timespan, there is a very good chance the backwindow cannot be recovered to the correct spot. this means the next point that comes in could possibly wipe out good points. ie. if we do hourly aggregates: points at 10:30, 10:35, 12:00 <- wrong, 10:40. even if we delete 12:00, we don't have the 10:30 and 10:35 points anymore, so if we pass in 10:40 and process it, it will effectively throw away the 10:30 and 10:35 points |
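To make the hourly example above concrete, here is a toy model of how a future-dated point moves the cutoff. It is only a sketch and not Gnocchi's code: the function names `floor_to` and `process`, the date, and the assumption that each point is processed in its own metricd run are all made up for illustration.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def floor_to(ts, granularity):
    """Round a timestamp down to the start of its aggregation period."""
    return EPOCH + ((ts - EPOCH) // granularity) * granularity


def process(points, granularity, back_window=0):
    """Toy model (assumption): handle points one at a time and ignore any
    point older than the period of the newest point seen so far, minus
    back_window periods."""
    accepted, ignored = [], []
    newest = None
    for ts in points:
        if newest is not None:
            cutoff = floor_to(newest, granularity) - back_window * granularity
            if ts < cutoff:
                ignored.append(ts)
                continue
        accepted.append(ts)
        newest = ts if newest is None else max(newest, ts)
    return accepted, ignored


def day(hour, minute):
    # Arbitrary illustrative date.
    return datetime(2017, 8, 28, hour, minute)


hourly = timedelta(hours=1)
# 12:00 is the point with the wrong (future) timestamp.
points = [day(10, 30), day(10, 35), day(12, 0), day(10, 40)]
accepted, ignored = process(points, hourly)
```

In this model the 10:40 point lands in `ignored` because the 12:00 point has already pushed the cutoff forward, which is the same kind of stall described in the issue. With `back_window=2` the cutoff stays at 10:00 and 10:40 is accepted.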
Not really. A system with a wrong clock will always have plenty of other problems, so operators don't usually let that go into production.
That would basically be a call to delete point in the timeseries. I've logged a feature request in #309.
Yeah, though if your computers are not completely synchronized, this could be a problem. I don't like the idea of tying anything to
Why shouldn't there be any metric at 10:44? Do you have a history of the point you sent?
So try to understand why and we'll know if it's a bug. :) |
Since from my chrony log (also pasted above): As for the history of the point I sent, I don't have the details since they were sent from the ceilometer. But I have checked the log file of gnocchi-api, and the batch measures had been correctly received every 10s. |
Chrony is an NTP server; that does not tell us which measures you sent in which order. Without that it's impossible to assert whether there's a bug or not. The back window applies on processing: if you send 2 days of measures in one batch, they won't be ignored, whatever the back window is. The back window is a guarantee of what will be processed, not what will be ignored. I'm closing this because there are no interesting details. If you have a way to actually reproduce what you think is a problem, please feel free to reopen with a way to reproduce. |
OK, thanks! I'll dig into that further sometime. |
( @jd please help reopening this thanks! ) I did a little experiment: (timezone +8)
time: 14:30 start sending measures from ceilometer
time: 14:35 change system time to 15:00 (incorrect)
time: 15:05 (incorrect) / 14:40 change system back to 14:40
time: 15:10
api.txt
I don't know if the behavior corresponds to our intention. Some metrics have 6 points which is kind of weird. |
Ah, and you want to debug a back window problem without knowing which value it is? Nice. :) It's impossible to know the back window without knowing the full archive policy definition. You said "granularity: 60s" but is that the only definition in your AP? If that's the case your back window is between 0 and 60s. You also don't say how often you run metricd, e.g. every hour or so. Again, I'm gonna repeat myself but the back window is a guarantee of what will be processed, not what will be ignored. So when metricd runs will change what will or will not be ignored. First graph:
Second graph:
|
@jd , I put ? as a placeholder when I was typing, I just forgot to replace it. I have updated it. |
There's also another 30 minutes aggregation. |
So your back window is between 0 and 30 minutes – I knew it was at least 5 minutes according to your graph. In that case it is 14:30-15:00 and then 15:00-15:30. No bug then. Sorry! |
@jd No need to say sorry. ^^ I'm just trying to figure out the mechanism. And now I'm really curious about how the time is divided. I originally supposed that the time intervals start at the time when the metric is created. For example, if a metric with granularity 30min is created at 08:09, then the intervals should be 08:09 ~ 08:39, 08:39 ~ 09:09, etc. Then I found out that they aren't. What will happen if the granularity cannot divide an hour evenly, 29 min for example? |
No, time is divided starting at 0 (1970-01-01T00:00:00) and adding your aggregation period from that. |
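The epoch-aligned division described above can be sketched like this. The helper name `bucket_start` and the sample date are hypothetical, and this is plain `datetime` arithmetic rather than Gnocchi's own code:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def bucket_start(ts, granularity):
    """Start of the aggregation period containing ts, counting whole
    periods from 1970-01-01T00:00:00 UTC."""
    periods = (ts - EPOCH) // granularity  # whole periods since the epoch
    return EPOCH + periods * granularity


# Arbitrary illustrative timestamp: a metric created at 08:09 UTC.
created = datetime(2017, 8, 28, 8, 9, tzinfo=timezone.utc)

half_hour = bucket_start(created, timedelta(minutes=30))
odd = bucket_start(created, timedelta(minutes=29))
```

With a 30 minute granularity, `half_hour` is 08:00 on the same day, not 08:09. With 29 minutes the periods are still counted from the epoch, so their boundaries simply stop lining up with the hour marks.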
OK, got it. I'd post something if I find something interesting. |
Sometimes my NTP client changes the system time back and forth like this:
10:00 -> 11:30 -> 10:50 ....... (current time is actually about 10:50)
And as expected, not until 11:30 did metricd start processing incoming measurements.
Is there a way to solve the issue right now? For example, a command / API to delete all the future measures (measures that have a timestamp later than the current time).
If not, should we implement such a feature? I guess a wait for 30 minutes to recover from the unstable state is fine, but what if the system time was accidentally modified to several days later?