Improve DB IDO HA failover behaviour
- Decrease Object Authority updates to 10s (was 30s)
- Decrease failover timeout to 30s (was 60s)
- Decrease cold startup (after (re)start) with no OA updates to 30s (was 60s)
- Immediately connect on Resume()
- Fix query priority which got broken with #6970
- Add more logging when a failover is in progress

```
[2019-03-29 16:13:53 +0100] information/IdoMysqlConnection: Last update by endpoint 'master1' was 8.33246s ago (< failover timeout of 30s). Retrying.

[2019-03-29 16:14:23 +0100] information/IdoMysqlConnection: Last update by endpoint 'master1' was 38.3288s ago. Taking over 'ido-mysql' in HA zone 'master'.
```

- Add more logging for reconnect and disconnect handling
- Add 'last_failover' attribute to IDO*Connection objects

refs #6970
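
The retry/takeover decision described by the log lines above can be sketched as a small standalone function. This is only an illustration under stated assumptions, not the code from this commit: `FailoverDecision` and `DecideFailover` are hypothetical names, and the comparison mirrors the `status_update_age < failoverTimeout` check in the `Reconnect()` hunk further down.

```cpp
#include <cassert>

// Hypothetical result type for this sketch; it is not part of Icinga 2.
enum class FailoverDecision { Retry, TakeOver };

// Decide whether this endpoint should keep waiting or take over the IDO
// connection, based on how long ago the other endpoint last updated its
// status row and on the configured failover timeout (both in seconds).
FailoverDecision DecideFailover(double statusUpdateAge, double failoverTimeout)
{
	// A non-positive timeout makes no sense for this sketch.
	assert(failoverTimeout > 0);

	// The other endpoint was seen recently enough: back off and retry later.
	if (statusUpdateAge < failoverTimeout)
		return FailoverDecision::Retry;

	// No update within the timeout: take over the connection.
	return FailoverDecision::TakeOver;
}
```

With the values from the log excerpt, an age of 8.33246s against a 30s timeout yields `Retry`, and an age of 38.3288s yields `TakeOver`.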
Michael Friedrich committed Apr 1, 2019
1 parent 34e0364 commit 149f640
Showing 8 changed files with 126 additions and 42 deletions.
16 changes: 14 additions & 2 deletions doc/09-object-types.md
@@ -875,7 +875,7 @@ Configuration Attributes:
instance\_name | String | **Optional.** Unique identifier for the local Icinga 2 instance. Defaults to `default`.
instance\_description | String | **Optional.** Description for the Icinga 2 instance.
enable\_ha | Boolean | **Optional.** Enable the high availability functionality. Only valid in a [cluster setup](06-distributed-monitoring.md#distributed-monitoring-high-availability-db-ido). Defaults to `true`.
failover\_timeout | Duration | **Optional.** Set the failover timeout in a [HA cluster](06-distributed-monitoring.md#distributed-monitoring-high-availability-db-ido). Must not be lower than 60s. Defaults to `60s`.
failover\_timeout | Duration | **Optional.** Set the failover timeout in a [HA cluster](06-distributed-monitoring.md#distributed-monitoring-high-availability-db-ido). Must not be lower than 30s. Defaults to `30s`.
cleanup | Dictionary | **Optional.** Dictionary with items for historical table cleanup.
categories | Array | **Optional.** Array of information types that should be written to the database.

@@ -924,6 +924,12 @@ by Icinga Web 2 in the table above.
In addition to the category flags listed above the `DbCatEverything`
flag may be used as a shortcut for listing all flags.

Runtime Attributes:

Name | Type | Description
----------------------------|-----------------------|-----------------
last\_failover | Timestamp | When the last failover happened for this connection (only available with `enable_ha = true`).

## IdoPgsqlConnection <a id="objecttype-idopgsqlconnection"></a>

IDO database adapter for PostgreSQL.
@@ -963,7 +969,7 @@ Configuration Attributes:
instance\_name | String | **Optional.** Unique identifier for the local Icinga 2 instance. Defaults to `default`.
instance\_description | String | **Optional.** Description for the Icinga 2 instance.
enable\_ha | Boolean | **Optional.** Enable the high availability functionality. Only valid in a [cluster setup](06-distributed-monitoring.md#distributed-monitoring-high-availability-db-ido). Defaults to `true`.
failover\_timeout | Duration | **Optional.** Set the failover timeout in a [HA cluster](06-distributed-monitoring.md#distributed-monitoring-high-availability-db-ido). Must not be lower than 60s. Defaults to `60s`.
failover\_timeout | Duration | **Optional.** Set the failover timeout in a [HA cluster](06-distributed-monitoring.md#distributed-monitoring-high-availability-db-ido). Must not be lower than 30s. Defaults to `30s`.
cleanup | Dictionary | **Optional.** Dictionary with items for historical table cleanup.
categories | Array | **Optional.** Array of information types that should be written to the database.

@@ -1012,6 +1018,12 @@ by Icinga Web 2 in the table above.
In addition to the category flags listed above the `DbCatEverything`
flag may be used as a shortcut for listing all flags.

Runtime Attributes:

Name | Type | Description
----------------------------|-----------------------|-----------------
last\_failover | Timestamp | When the last failover happened for this connection (only available with `enable_ha = true`).

## InfluxdbWriter <a id="objecttype-influxdbwriter"></a>

Writes check result metrics and performance data to a defined InfluxDB host.
7 changes: 4 additions & 3 deletions lib/base/workqueue.hpp
@@ -19,9 +19,10 @@ namespace icinga

enum WorkQueuePriority
{
PriorityLow,
PriorityNormal,
PriorityHigh
PriorityLow = 0,
PriorityNormal = 1,
PriorityHigh = 2,
PriorityImmediate = 4
};

using TaskFunction = std::function<void ()>;
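
To illustrate why the explicit enum values matter, here is a minimal sketch of a priority-ordered task queue. It rests on assumptions not shown in this diff: that the work queue dequeues tasks with a higher numeric priority first and falls back to insertion order within one priority. `SketchTask`, `SketchTaskLess` and `DrainOrder` are hypothetical names used only for this example.

```cpp
#include <cstdint>
#include <queue>
#include <string>
#include <utility>
#include <vector>

enum WorkQueuePriority
{
	PriorityLow = 0,
	PriorityNormal = 1,
	PriorityHigh = 2,
	PriorityImmediate = 4
};

// Hypothetical task record for this sketch.
struct SketchTask
{
	std::string Name;
	WorkQueuePriority Priority;
	std::uint64_t ID; // insertion counter, used as a FIFO tie-breaker
};

// Comparator for std::priority_queue: the "largest" element is popped first,
// so a task is "less" if it has a lower priority, or the same priority and a
// higher (newer) ID.
struct SketchTaskLess
{
	bool operator()(const SketchTask& a, const SketchTask& b) const
	{
		if (a.Priority != b.Priority)
			return a.Priority < b.Priority;

		return a.ID > b.ID;
	}
};

// Enqueue the given (name, priority) pairs in order and return the task
// names in the order in which the queue would hand them out.
std::vector<std::string> DrainOrder(const std::vector<std::pair<std::string, WorkQueuePriority>>& tasks)
{
	std::priority_queue<SketchTask, std::vector<SketchTask>, SketchTaskLess> queue;
	std::uint64_t nextID = 0;

	for (const auto& task : tasks)
		queue.push({task.first, task.second, nextID++});

	std::vector<std::string> order;

	while (!queue.empty()) {
		order.push_back(queue.top().Name);
		queue.pop();
	}

	return order;
}
```

Under these assumptions, a `Reconnect` task enqueued with `PriorityImmediate` overtakes already queued `PriorityNormal` query tasks, while a `Disconnect` task with `PriorityLow` runs only after the remaining work has drained.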
4 changes: 2 additions & 2 deletions lib/db_ido/dbconnection.cpp
@@ -435,8 +435,8 @@ void DbConnection::ValidateFailoverTimeout(const Lazy<double>& lvalue, const Val
{
ObjectImpl<DbConnection>::ValidateFailoverTimeout(lvalue, utils);

if (lvalue() < 60)
BOOST_THROW_EXCEPTION(ValidationError(this, { "failover_timeout" }, "Failover timeout minimum is 60s."));
if (lvalue() < 30)
BOOST_THROW_EXCEPTION(ValidationError(this, { "failover_timeout" }, "Failover timeout minimum is 30s."));
}

void DbConnection::ValidateCategories(const Lazy<Array::Ptr>& lvalue, const ValidationUtils& utils)
4 changes: 3 additions & 1 deletion lib/db_ido/dbconnection.ti
@@ -42,9 +42,11 @@ abstract class DbConnection : ConfigObject
};

[config] double failover_timeout {
default {{{ return 60; }}}
default {{{ return 30; }}}
};

[state, no_user_modify] double last_failover;

[no_user_modify] String schema_version;
[no_user_modify] bool connected;
[no_user_modify] bool should_connect {
64 changes: 47 additions & 17 deletions lib/db_ido_mysql/idomysqlconnection.cpp
@@ -65,15 +65,16 @@ void IdoMysqlConnection::StatsFunc(const Dictionary::Ptr& status, const Array::P

void IdoMysqlConnection::Resume()
{
DbConnection::Resume();

Log(LogInformation, "IdoMysqlConnection")
<< "'" << GetName() << "' resumed.";

SetConnected(false);

m_QueryQueue.SetExceptionCallback(std::bind(&IdoMysqlConnection::ExceptionHandler, this, _1));

/* Immediately try to connect on Resume() without timer. */
m_QueryQueue.Enqueue(std::bind(&IdoMysqlConnection::Reconnect, this), PriorityImmediate);

m_TxTimer = new Timer();
m_TxTimer->SetInterval(1);
m_TxTimer->OnTimerExpired.connect(std::bind(&IdoMysqlConnection::TxTimerHandler, this));
@@ -83,23 +84,30 @@ void IdoMysqlConnection::Resume()
m_ReconnectTimer->SetInterval(10);
m_ReconnectTimer->OnTimerExpired.connect(std::bind(&IdoMysqlConnection::ReconnectTimerHandler, this));
m_ReconnectTimer->Start();
m_ReconnectTimer->Reschedule(0);

/* Start with queries after connect. */
DbConnection::Resume();

ASSERT(m_Mysql->thread_safe());
}

void IdoMysqlConnection::Pause()
{
m_ReconnectTimer.reset();
Log(LogDebug, "IdoMysqlConnection")
<< "Attempting to pause '" << GetName() << "'.";

DbConnection::Pause();

m_ReconnectTimer.reset();

#ifdef I2_DEBUG /* I2_DEBUG */
Log(LogDebug, "IdoMysqlConnection")
<< "Rescheduling disconnect task.";
#endif /* I2_DEBUG */

m_QueryQueue.Enqueue(std::bind(&IdoMysqlConnection::Disconnect, this), PriorityLow);

/* Work on remaining tasks but never delete the threads, for HA resuming later. */
m_QueryQueue.Join();

Log(LogInformation, "IdoMysqlConnection")
@@ -137,6 +145,9 @@ void IdoMysqlConnection::Disconnect()
m_Mysql->close(&m_Connection);

SetConnected(false);

Log(LogInformation, "IdoMysqlConnection")
<< "Disconnected from '" << GetName() << "' database '" << GetDatabase() << "'.";
}

void IdoMysqlConnection::TxTimerHandler()
@@ -154,8 +165,8 @@ void IdoMysqlConnection::NewTransaction()
<< "Scheduling new transaction and finishing async queries.";
#endif /* I2_DEBUG */

m_QueryQueue.Enqueue(std::bind(&IdoMysqlConnection::InternalNewTransaction, this), PriorityHigh);
m_QueryQueue.Enqueue(std::bind(&IdoMysqlConnection::FinishAsyncQueries, this), PriorityHigh);
m_QueryQueue.Enqueue(std::bind(&IdoMysqlConnection::InternalNewTransaction, this), PriorityNormal);
m_QueryQueue.Enqueue(std::bind(&IdoMysqlConnection::FinishAsyncQueries, this), PriorityNormal);
}

void IdoMysqlConnection::InternalNewTransaction()
@@ -176,7 +187,8 @@ void IdoMysqlConnection::ReconnectTimerHandler()
<< "Scheduling reconnect task.";
#endif /* I2_DEBUG */

m_QueryQueue.Enqueue(std::bind(&IdoMysqlConnection::Reconnect, this), PriorityHigh);
/* Only allow Reconnect events with high priority. */
m_QueryQueue.Enqueue(std::bind(&IdoMysqlConnection::Reconnect, this), PriorityImmediate);
}

void IdoMysqlConnection::Reconnect()
@@ -194,6 +206,7 @@ void IdoMysqlConnection::Reconnect()

bool reconnect = false;

/* Ensure to close old connections first. */
if (GetConnected()) {
/* Check if we're really still connected */
if (m_Mysql->ping(&m_Connection) == 0)
@@ -204,6 +217,9 @@ void IdoMysqlConnection::Reconnect()
reconnect = true;
}

Log(LogDebug, "IdoMysqlConnection")
<< "Reconnect: Clearing ID cache.";

ClearIDCache();

String ihost, isocket_path, iuser, ipasswd, idb;
@@ -258,6 +274,9 @@ void IdoMysqlConnection::Reconnect()
BOOST_THROW_EXCEPTION(std::runtime_error(m_Mysql->error(&m_Connection)));
}

Log(LogNotice, "IdoMysqlConnection")
<< "Reconnect: '" << GetName() << "' is now connected to database '" << GetDatabase() << "'.";

SetConnected(true);

IdoMysqlResult result = Query("SELECT @@global.max_allowed_packet AS max_allowed_packet");
@@ -343,12 +362,16 @@ void IdoMysqlConnection::Reconnect()
else
status_update_time = 0;

double status_update_age = Utility::GetTime() - status_update_time;
double now = Utility::GetTime();

double status_update_age = now - status_update_time;
double failoverTimeout = GetFailoverTimeout();

Log(LogNotice, "IdoMysqlConnection")
<< "Last update by '" << endpoint_name << "' was " << status_update_age << "s ago.";
if (status_update_age < failoverTimeout) {
Log(LogInformation, "IdoMysqlConnection")
<< "Last update by endpoint '" << endpoint_name << "' was "
<< status_update_age << "s ago (< failover timeout of " << failoverTimeout << "s). Retrying.";

if (status_update_age < GetFailoverTimeout()) {
m_Mysql->close(&m_Connection);
SetConnected(false);
SetShouldConnect(false);
@@ -366,9 +389,15 @@ void IdoMysqlConnection::Reconnect()

return;
}

SetLastFailover(now);

Log(LogInformation, "IdoMysqlConnection")
<< "Last update by endpoint '" << endpoint_name << "' was "
<< status_update_age << "s ago. Taking over '" << GetName() << "' in HA zone '" << Zone::GetLocalZone()->GetName() << "'.";
}

Log(LogNotice, "IdoMysqlConnection", "Enabling IDO connection.");
Log(LogNotice, "IdoMysqlConnection", "Enabling IDO connection in HA zone.");
}

Log(LogInformation, "IdoMysqlConnection")
@@ -435,9 +464,9 @@ void IdoMysqlConnection::Reconnect()
<< "Scheduling session table clear and finish connect task.";
#endif /* I2_DEBUG */

m_QueryQueue.Enqueue(std::bind(&IdoMysqlConnection::ClearTablesBySession, this), PriorityHigh);
m_QueryQueue.Enqueue(std::bind(&IdoMysqlConnection::ClearTablesBySession, this), PriorityNormal);

m_QueryQueue.Enqueue(std::bind(&IdoMysqlConnection::FinishConnect, this, startTime), PriorityHigh);
m_QueryQueue.Enqueue(std::bind(&IdoMysqlConnection::FinishConnect, this, startTime), PriorityNormal);
}

void IdoMysqlConnection::FinishConnect(double startTime)
@@ -450,7 +479,8 @@ void IdoMysqlConnection::FinishConnect(double startTime)
FinishAsyncQueries();

Log(LogInformation, "IdoMysqlConnection")
<< "Finished reconnecting to MySQL IDO database in " << std::setw(2) << Utility::GetTime() - startTime << " second(s).";
<< "Finished reconnecting to '" << GetName() << "' database '" << GetDatabase() << "' in "
<< std::setw(2) << Utility::GetTime() - startTime << " second(s).";

Query("COMMIT");
Query("BEGIN");
@@ -710,7 +740,7 @@ void IdoMysqlConnection::ActivateObject(const DbObject::Ptr& dbobj)
<< "Scheduling object activation task for '" << dbobj->GetName1() << "!" << dbobj->GetName2() << "'.";
#endif /* I2_DEBUG */

m_QueryQueue.Enqueue(std::bind(&IdoMysqlConnection::InternalActivateObject, this, dbobj), PriorityHigh);
m_QueryQueue.Enqueue(std::bind(&IdoMysqlConnection::InternalActivateObject, this, dbobj), PriorityNormal);
}

void IdoMysqlConnection::InternalActivateObject(const DbObject::Ptr& dbobj)
@@ -755,7 +785,7 @@ void IdoMysqlConnection::DeactivateObject(const DbObject::Ptr& dbobj)
<< "Scheduling object deactivation task for '" << dbobj->GetName1() << "!" << dbobj->GetName2() << "'.";
#endif /* I2_DEBUG */

m_QueryQueue.Enqueue(std::bind(&IdoMysqlConnection::InternalDeactivateObject, this, dbobj), PriorityHigh);
m_QueryQueue.Enqueue(std::bind(&IdoMysqlConnection::InternalDeactivateObject, this, dbobj), PriorityNormal);
}

void IdoMysqlConnection::InternalDeactivateObject(const DbObject::Ptr& dbobj)