Objects are removed from IDO during reload with config change #7125
ref/NC/603111 |
ref/IC/13763 |
I think we're seeing the same sort of issue. In our case only hosts disappear for a while; the services seem to stay.
We have 1895 hosts and 40533 services. Every hour puppet refreshes the inventory and
And a similar 2 hours earlier
All these objects appear to be hosts, not services. So all hosts disappeared from the db for ~ 30 seconds. If helpful I could look into collecting debug logs. |
Hm. Can I see how the database was created?
Which MySQL/MariaDB version is used here? In order to trigger such a difference, the names between the config objects and the database objects need to be different. This cannot be the case, really, and my tests only show that one host object is added/removed, depending on the case. I suspect a problem with collations or strings here, something where Icinga thinks that everything has changed, and as such, first disabling all the host objects ... https://github.com/Icinga/icinga2/blob/master/lib/db_ido_mysql/idomysqlconnection.cpp#L425 ... and then finding out that all host objects need to be added again. https://github.com/Icinga/icinga2/blob/master/lib/db_ido_mysql/idomysqlconnection.cpp#L460 One idea while reading the code: if This is done beforehand in
https://github.com/Icinga/icinga2/blob/master/lib/db_ido/dbtype.cpp#L71 This takes name1 and name2 into account, name2 is empty for host objects, set for services. On the other hand, it may be the case that Meaning to say, the reconnect timer started when the config object activation is still in progress, and there are no active host objects yet. Coming from 13a8fa2 we introduced a load dependent order and c54e042 sets that. According to the code, the activation is just fine - 0 is checkable objects, 100 for features. That way the IDO starts after all normal config objects are loaded. And obviously I would see that fail in my tests, which brings me back to the database collation thing, and string comparisons. Cheers, |
I'm facing the same problem.. |
Please answer my questions from #7125 (comment) as well. |
I'm using version: 10.1.37-MariaDB-0+deb9u1 Debian 9.6 Not sure which is the database to query:
|
|
|
Full schema dump (mysqldump --no-data icinga) Package versions:
Everything is done with the supplied puppet modules and icinga packages. |
My database uses https://medium.com/@adamhooper/in-mysql-never-use-utf8-use-utf8mb4-11761243e434 This may have an influence here, next to the MySQL/MariaDB version of course. I can see that both of you use Maybe we can figure it out by querying for wrong column length.
all multi-byte values within UTF8 should not be shown, and can instead be queried with the The above query will also show plain strings, and might be longer than expected. Look for the I could imagine that especially host object names contain special characters, while services and others do not. Coming to a Director deployment (seen that in a customer ticket), where import sources generate the hostname, ... what if real utf8 is inserted into crap utf8? This will always lead to non-matching names in Icinga vs IDO DB objects.
More tests
utf8 vs utf8mb4
I've tried a simple test case with 5.5.60 MariaDB (CentOS) and 10.3.11-MariaDB on macOS.
I would expect MySQL/MariaDB to throw an error on insert if these problems happen, but I can only see that with 10.3 MariaDB. In 5.5.60 this generates a warning, but the resulting values are broken.
I'm not sure whether the strict mode plays a role here, https://www.linode.com/community/questions/17070/how-can-i-disable-mysql-strict-mode utf8
utf8mb4
|
Mysql seems fine:
We only have ASCII hostnames and the way we generate config from puppetdb (custom) should ensure (I'll check) ASCII as well. Config seems to be fully ASCII:
I've just noticed the same happens in our ACCP setup, I will try mb4 there. I've checked the debug log but I'm not sure what to look for. As far as I can tell DB updates are batched, making The log lines seem to arrive a while later than the db updates:
|
Yes, @fbs, this would be extremely helpful. All of you, please collect Icinga debug logs from start to re-appearing of the hosts and upload them here: |
This is an upload-only URL, no-one except Icinga devs working on this issue will have access to the files. |
And please upload also the output of |
Assigning this to 2.11 as a working task. No guarantee though - if we cannot find it, it won't block the release. |
Just taking a quick look into this (as I assumed this is expected behaviour as I saw the same as far as I remember). Having UTF8MB4 instead of UTF8MB3 becomes relevant only for characters outside of the BMP, and that only if the fields where they are inserted to/read from are not an 8bit charset that is completely covered within the BMP. And yes, for MySQL < 5.7 and MariaDB < 10.2.4 it's sadly a must to ensure strict mode to get an error instead of silent ignores for violation of length constraints etc. (And for more recent versions it could also be a must for applications to ensure these new proper defaults are not configured differently by mistake...) Anyway, regarding the problem...
Because this probably translates to |
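To make the BMP remark above concrete, here is a minimal sketch (not taken from Icinga or MySQL) of why MySQL's 3-byte `utf8` (utf8mb3) only breaks for characters outside the BMP: UTF-8 needs at most 3 bytes up to U+FFFF and 4 bytes beyond that. The function names `Utf8EncodedLength` and `FitsUtf8mb3` are hypothetical helpers introduced for illustration, and `FitsUtf8mb3` assumes well-formed UTF-8 input.

```cpp
#include <cstdint>
#include <string>

// Number of bytes UTF-8 needs to encode one Unicode code point.
// Returns 0 for values that are not valid Unicode scalar values.
int Utf8EncodedLength(std::uint32_t codePoint)
{
    if (codePoint >= 0xD800 && codePoint <= 0xDFFF)
        return 0; // surrogates are not encodable
    if (codePoint <= 0x7F)
        return 1; // ASCII
    if (codePoint <= 0x7FF)
        return 2;
    if (codePoint <= 0xFFFF)
        return 3; // rest of the BMP: still fits utf8mb3
    if (codePoint <= 0x10FFFF)
        return 4; // outside the BMP: needs utf8mb4
    return 0;
}

// True if a UTF-8 encoded string contains no 4-byte sequence,
// i.e. every character would fit a 3-byte utf8mb3 column.
// Assumption: the input is well-formed UTF-8, so only lead bytes are checked.
bool FitsUtf8mb3(const std::string& utf8)
{
    for (unsigned char c : utf8) {
        if (c >= 0xF0) // lead byte of a 4-byte sequence
            return false;
    }
    return true;
}
```

Pure ASCII host names, as reported earlier in the thread, always pass such a check, which is one reason the charset theory looked less likely as the discussion went on.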
Charsets were a route where changes over different versions could have crept in. The better chance is to analyse why GetOrCreateObjectByName() doesn't work in this regard. This method ensures that all active objects have a config object pointer set, within the loop and with ConfigObject::GetObject() https://github.com/Icinga/icinga2/blob/master/lib/db_ido_mysql/idomysqlconnection.cpp#L431
The later loop, which deactivates all objects, has a conditional continue which skips any db object that has a config object pointer. If that pointer is NULL, the update is_active=0 query happens. That's likely a race condition, and I am not yet sure what changed or how one can reproduce this. Seemingly it only affects host names, but not check commands where name2 also is a null value in MySQL.
It could also be the case that the IdoMysqlConnection object gets activated, and the reconnect timer starts "too soon", leaving the host objects not activated. The code speaks a different language with the defined activation order in a serial fashion. Likely this did not happen before as it was in a random order, and hosts were always activated before idomysqlconnections (alphanumeric sort in memory). With the new activation order and some sorting beforehand, this likely has an influence here.
A decisive test would be to implement a sleep into the host's start method, and see whether the IdoMysqlConnection object gets started and the reconnect timer starts to run.
Just some thoughts on the matter - I am not actively working on it now. |
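The deactivation loop described in the comment above can be modelled roughly as follows. This is a simplified sketch and not the code from idomysqlconnection.cpp: `DbRow`, `DeactivateOrphans` and the set of loaded config object names are hypothetical stand-ins for the database rows, the loop, and the config object pointer lookup.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for one row of the IDO objects table.
struct DbRow {
    std::string name1;
    bool isActive;
};

// Deactivates every active row that has no matching config object and
// returns how many rows were touched.
std::size_t DeactivateOrphans(std::vector<DbRow>& rows,
                              const std::set<std::string>& configObjects)
{
    std::size_t deactivated = 0;

    for (DbRow& row : rows) {
        if (!row.isActive)
            continue; // already inactive, nothing to do

        if (configObjects.count(row.name1) > 0)
            continue; // models "config object pointer is set": keep the row

        row.isActive = false; // models the "is_active = 0" update query
        ++deactivated;
    }

    return deactivated;
}
```

If the lookup set is complete, nothing is deactivated. If it is empty, for example because the host objects are not active at the moment the loop runs, every row is deactivated at once, which matches the symptom of all hosts vanishing together.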
More Analysis
I've received debug logs from a partner customer which reveal that the is_active=0 events are fired on reload shutdown, and not during the initial startup routine as described originally.
Code Analysis
DbConnection::UpdateObject() will be called for any runtime event, not only during the Reconnect() handling. This was changed with 2.4 where runtime objects could be created any time. It has a safety hook though - whenever the application is in shutdown mode, it will not work on such updates in order to prevent false object updates. That's fine and always has been.
Actually, there's a behavioural change here. Previously a reload would also just wait for an application shutdown in Application::RunEventLoop(). This was changed with the Systemd watchdog for reloads in #5996 and uses a new handler called SigUsr2Handler(). Wait, there was a problem with this handler already where reloads didn't write the state file in #6691. And if one looks closer, the ConfigObject::StopObjects() is duplicated into this scope.
But what about m_ShuttingDown? Is this really set to true when a reload is triggered? Obviously not, since this happens before the IcingaApplication is really shut down, and is controlled via the global variable l_Restarting. By exposing this variable and using it next to IsShuttingDown, one can mitigate the behaviour (and likely fix it).
Tests
Heavy config
My current "many.conf".
A long startup time
This is an important measurement for the reload then.
The nasty things
Well, not obviously in the log, I'm using a custom patch here after having analysed the problem from the code parts.
This goes on. CheckCommand objects are not evaluated against is_active in the database in Icinga Web 2 so this isn't seen by default. Look how shiny the hosts are deactivated though following the alpha-numeric order.
This goes on and on. HostGroups are inactive as well, this might be taken into account for permission filters in Icinga Web 2.
Well, and then there's the IcingaApplication. It is not about IdoMysqlConnection or reconnect handling here, we are in a fucking shutdown state. Well, and IcingaApplication calls Stop() which sets m_ShuttingDown to true.
If this object type existed in IDO MySQL, it would be deactivated. Obviously it isn't. Going further, IdoMysqlConnection gets deactivated too.
It has some queries in the queue, and the connection is dropped after that.
Conclusion
So, who can now explain why only hosts are affected? Right. They are the only objects in alphanumeric order whose is_active=0 state is visible in Icinga Web 2. The hostgroups are not that visible but likely a problem. In Git Master, the problem is more obvious with the changes from #6970 stopping the IcingaApplication at the latest point possible. Good that this issue got some customer priority. My analysis patch already includes safety hooks to not fire the query just to log the state (won't be enabled for production!).
Patch
Well. Too short for my nerves with the IDO code. |
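The mitigation proposed in the analysis (check a reload flag next to the shutdown flag before turning object events into is_active=0 updates) can be illustrated with a small model. This is a sketch under stated assumptions and not the actual patch: `AppState`, `QueueDeactivations` and the `checkRestartFlag` switch are hypothetical names, with the two booleans standing in for the shutdown and reload flags discussed above.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the application state discussed in the thread.
struct AppState {
    bool shuttingDown = false; // models the flag set by a real shutdown
    bool restarting = false;   // models the flag set when a reload is triggered
};

// Returns the object names for which an is_active=0 update would be queued
// when those objects are stopped. checkRestartFlag toggles the proposed guard.
std::vector<std::string> QueueDeactivations(const std::vector<std::string>& stoppedObjects,
                                            const AppState& state,
                                            bool checkRestartFlag)
{
    std::vector<std::string> queued;

    if (state.shuttingDown)
        return queued; // existing safety hook: ignore events during shutdown

    if (checkRestartFlag && state.restarting)
        return queued; // proposed guard: ignore events during reload as well

    for (const std::string& name : stoppedObjects)
        queued.push_back(name); // would become "is_active = 0" for this object

    return queued;
}
```

With `restarting` set and the guard disabled, every stopped object produces a deactivation, which mirrors the log excerpts in the analysis. With the guard enabled, the reload path behaves like the shutdown path and queues nothing.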
Another test round while cleaning the patch. 2.10
2.11
2.11 has the more imminent problem with the changed deactivation order, relying on the reload signal even more. |
This follows the same principle as with the shutdown handler, and was introduced with the changed reload handling with 2.9. Previously IsShuttingDown() was sufficient which got set at one location. SigUsr2 as handler introduced a new location where m_ShuttingDown is not necessarily set yet. Since this handler gets called when l_Restarting is enabled, we'll use this flag to avoid config update events resulting in object deactivation (object->IsActive() always returns false). refs #5996 refs #6691 refs #6970 fixes #7125
This follows the same principle as with the shutdown handler, and was introduced with the changed reload handling with 2.9. Previously IsShuttingDown() was sufficient which got set at one location. SigUsr2 as handler introduced a new location where m_ShuttingDown is not necessarily set yet. Since this handler gets called when l_Restarting is enabled, we'll use this flag to avoid config update events resulting in object deactivation (object->IsActive() always returns false). refs #5996 refs #6691 refs #6970 fixes #7125 (cherry picked from commit 78e24c5)
We have an issue where some minutes after a reload with new configuration is triggered, all objects (at least hosts and services) are removed from the IDO and Icinga Web 2 shows an empty list. Some minutes later all reappear at once.
This happens in a rather large setup.
Sorry for being sparse on information, I just wanted to open the ticket so we all can start putting information in it, so expect the detail data coming up.
All I can say now is that it happens on RHEL 7 with Icinga 2 2.10.4 and happened with other 2.10 versions as well.