-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MDEV-17856 Port binlog_error_action from MySQL #3615
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a draft is a draft
#FIXME: hangs MTR | ||
#--echo Test case7 | ||
## Simulate diskfull during opening of binlog and check for diskfull error | ||
## behaviour where binlog error action is to continue the server | ||
## after the error. | ||
#SET SESSION debug="+d,simulate_file_write_error"; | ||
#--replace_regex /\.[\\\/]master/master/ /Errcode: 28 ".*"\)/Errcode: 28 "No space left on device")/ | ||
## error EE_WRITE with error code number 3 is expected. | ||
#--error 3 | ||
#flush logs; | ||
#SET SESSION debug="-d,simulate_file_write_error"; | ||
#--source include/restart_mysqld.inc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This hangs at flush logs;
on my machine.
It may or may not have something to od with the disabled suppressions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If MariaDB encounters a full disk error while trying to write to a binary log file, then it will keep retrying the write every 60 seconds. Log messages will get written to the error log every 600 seconds.
⸺ MariaDB Knowledge Base “Using and Maintaining the Binary Log” § Effects of Full Disk Errors on Binary Logging
Mentioned by MDEV-20796, This note reminded me that I saw this retry message in my logs; they confirm this “hanging” behavior.
Actually, this periodically-retry mechanism means that this case doesn’t actually IGNORE_ERROR
, since strictly speaking, it picks up the error and reponds with “retry later” (contrast with ABORT_SERVER
’s “abort immediately” response).
MDEV-20796(’s description) even wants the IGNORE_ERROR
to apply this periodically-retry behavior on all scenarios.
Should we review our terminologies?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, in general, we don't have to, but if we can conform to MySQL
options, it is preferred, as it makes migration more straightforward. Another option would be to include MDEV-20796 in the scope of this PR (perhaps a separate commit though) as a third option that allows for consistency between the option names and behaviors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option would be to include MDEV-20796 in the scope of this PR
Nay, the option MDEV-20796 wants would make other cases respond to errors by continuously retrying as well; it’s a distinct feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, but it would allow each option to be consistent to its meaning. That is, we could slightly change IGNORE_ERROR
behavior so it wouldn't "retry later" on full disk errors, but it would always turn off binlogging and move on. And we wouldn't lose the functionality because it would be nested in the RETRY
configuration.
#XXX: This was From their follow-up commit, but I don’t find anything resembling in our version of `log.cc` | ||
# https://github.com/mysql/mysql-server/commit/3b6b4bf8c5d1bfada58678acebafdf6f813c2dfe#diff-0464eed4eba118bfe876fa26af42bbf449724a332247e64685cd5b73f1922a51 | ||
#--echo Test case9 | ||
#CREATE TABLE t1(i INT); | ||
## Simulate error during flushing cache to file and test the behaviour | ||
## when binlog_error_action is set to ABORT_SERVER/IGNORE_ERROR. | ||
#SET SESSION debug="+d,simulate_error_during_flush_cache_to_file"; | ||
#SET GLOBAL binlog_error_action= ABORT_SERVER; | ||
#--exec echo "wait" > $MYSQLTEST_VARDIR/tmp/mysqld.1.expect | ||
#--error ER_BINLOG_LOGGING_IMPOSSIBLE | ||
#INSERT INTO t1 VALUES (1); | ||
#--exec echo "restart" > $MYSQLTEST_VARDIR/tmp/mysqld.1.expect | ||
#--enable_reconnect | ||
#--source include/wait_until_connected_again.inc | ||
# | ||
#--let $assert_cond= COUNT(*) = 0 FROM t1; | ||
#--let $assert_text= Count of elements in t1 should be 0. | ||
#--source include/assert.inc | ||
# | ||
#SET SESSION debug="+d,simulate_error_during_flush_cache_to_file"; | ||
#SET GLOBAL binlog_error_action= IGNORE_ERROR; | ||
#INSERT INTO t1 VALUES (2); | ||
# | ||
#--let $assert_cond= COUNT(*) = 1 FROM t1; | ||
#--let $assert_text= Count of elements in t1 should be 1. | ||
#--source include/assert.inc | ||
# | ||
#DROP table t1; | ||
#SET SESSION debug="-d,simulate_error_during_flush_cache_to_file"; | ||
#--source include/restart_mysqld.inc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mysql/mysql-server@3b6b4bf8c5 mentioned the MySQL ticket and looked like a follow-up bug-fix.
However, it looks like MySQL had a different caching model than MariaDB.
--echo Test case3 | ||
# Simulate diskfull during opening of binlog and check for diskfull error | ||
# behaviour where binlog error action is to abort the server. | ||
SET SESSION debug="+d,simulate_file_write_error"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
simulate_file_write_error
does simulate disk full.
############################################################################### | ||
# Bug#11758766:MYSQLD CONTINUES OPERATION WITHOUT LOGGING WHEN BINLOGS | ||
# CANNOT BE WRITTEN | ||
# | ||
# Problem: | ||
# ======== | ||
# If an error occurs that prevents mysqld writing to the binary logs (disk | ||
# full, readonly filesystem, etc) then the logs are disabled and operations | ||
# continue. This can lead to out of sync slaves and improper backups. | ||
# | ||
# Test: | ||
# ===== | ||
# A new option "binlog_error_action" has been introduced whose values | ||
# are "IGNORE" or "ABORT". When binlogging becomes impossible if user sets | ||
# the variable to "ABORT" server will stop if user sets it to "IGNORE" binlog | ||
# will be turned off and server will continue. 4 different test cases are | ||
# added to check both the behaviours. | ||
############################################################################### |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file was copied from MySQL.
I plan to reformat the file a bit – at least I’d --echo
the testcase descriptions in the .result
s.
Let me know what other conventions there are.
@@ -28,7 +28,6 @@ com_rpc_tx : Requires connection attributes and detached sessions | |||
mysqlbinlog_blind_replace: requires @@enable_blind_replace support | |||
optimize_myrocks_replace_into_base: requires @@enable_blind_replace support | |||
optimize_myrocks_replace_into_lock: requires @@enable_blind_replace support | |||
rocksdb.skip_core_dump_on_error: requires @@binlog_error_action support |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I optimistically reënabled this to see if it’s good now or still needs more support.
//XXX: MySQL needs both error calls. mysql/mysql-server@3b6b4bf8c5 | ||
my_error(ER_BINLOG_LOGGING_IMPOSSIBLE, MYF(0), abort_error); | ||
sql_print_error("Could not use %s for logging (error %d). %s", name, error_id, abort_error); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mysql/mysql-server@3b6b4bf8c5 amended the second error call.
I wonder, what’s the difference between my_error
and sql_print_error
, and why did they (if not us) need to error twice?
I know that my_error
receives its format template from errmsg-utf8.txt
while sql_print_error
uses a given template. (I know this from #3360
.) This is why they have different error messages here.
server/sql/share/errmsg-utf8.txt
Lines 8311 to 8317 in ebbbe9d
ER_BINLOG_LOGGING_IMPOSSIBLE | |
chi "二进制记录不可能。消息:%s" | |
eng "Binary logging not possible. Message: %s" | |
ger "Binärlogging nicht möglich. Meldung: %s" | |
geo "ბინარული ჟურნალი შეუძლებელია. შეტყობინება: %s" | |
spa "No es posible llevar historial (log) binario. Mensaje: %s" | |
sw "Uwekaji katika kumbukumbu ya mfumo wa jozi hauwezekani. Ujumbe: %s" |
(Yes,
ER_BINLOG_LOGGING_IMPOSSIBLE
already exists.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question, I'm actually not sure why they did that..In Maria, my_error()
will set an error context on current_thd
(see my_message_sql
) as well as print an error message, whereas sql_print_error()
just prints the error. I imagine we should only need my_error
, though we should extend abort_error
with the
Could not use %s ..
details.
@@ -938,6 +938,15 @@ extern bool max_user_connections_checking; | |||
extern ulong opt_binlog_dbug_fsync_sleep; | |||
static const int SERVER_UID_SIZE= 29; | |||
extern char server_uid[SERVER_UID_SIZE+1]; | |||
extern ulong binlog_error_action; | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I referred MySQL’s locations for changes in sys_vars.cc
and mysqld.cc
, but mysqld.h
… I don’t understand how the sections were divided.
sql/sys_vars.cc
Outdated
"error, the server can either ignore the error and let the primary " | ||
"continue, or abort", | ||
GLOBAL_VAR(binlog_error_action), CMD_LINE(REQUIRED_ARG), | ||
(const char *[]){"IGNORE_ERROR", "ABORT_SERVER", NullS}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I felt that it’s excessive to exile the enum list at a different place but still only use it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Darn. Compount literals is a GCC extension for C++, even though it’s standardized in C99, our regular-C version.
GLOBAL_VAR(binlog_error_action), CMD_LINE(REQUIRED_ARG), | ||
(const char *[]){"IGNORE_ERROR", "ABORT_SERVER", NullS}, | ||
DEFAULT(IGNORE_ERROR)); | ||
|
||
static Sys_var_on_access_global<Sys_var_mybool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What’s a Sys_var_on_access_global
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its basically an authentication wrapper to secure the variable s.t. its modification is protected by some role (see set_var::check()
and for more details on the variable checking process).
@@ -3048,7 +3069,7 @@ bool MYSQL_LOG::open( | |||
DBUG_RETURN(0); | |||
|
|||
err: | |||
sql_print_error(fatal_log_error, name, errno); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MySQL had an additional log_type == LOG_BIN
requirement for this condition:
if (log_type == LOG_BIN && binlogging_impossible_mode == ABORT_SERVER)
{
…
my_error(ER_BINLOG_LOGGING_IMPOSSIBLE …
Why is it there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It ensures we only apply the error logic to the binary log, and not other logs. That is, the binlog (of type MYSQL_BIN_LOG
) is derived from MYSQL_LOG
, and this is the general MYSQL_LOG::open
function.
sql/mysqld.cc
Outdated
@@ -454,6 +454,7 @@ uint opt_binlog_gtid_index_span_min= 65536; | |||
my_bool opt_master_verify_checksum= 0; | |||
my_bool opt_slave_sql_verify_checksum= 1; | |||
const char *binlog_format_names[]= {"MIXED", "STATEMENT", "ROW", NullS}; | |||
ulong binlog_error_action; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A passerby colleague questioned why a variable for a single file requires modifying 4 files to add.
I’ll move this line to log.cc
and then, yes, why is a variable split among variables.cc
, implementation.cc
and a bridging .h
🤷?
@@ -938,6 +938,15 @@ extern bool max_user_connections_checking; | |||
extern ulong opt_binlog_dbug_fsync_sleep; | |||
static const int SERVER_UID_SIZE= 29; | |||
extern char server_uid[SERVER_UID_SIZE+1]; | |||
extern ulong binlog_error_action; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An enum variable only has so many valid options, but they’re unsigned long
s in RAM. The old design left so much unused RAM space 🤷.
This commit creates the variable `binlog_error_action`. It currently does nothing, and will be implemented in the main commit of MDEV-17856 Port binlog_error_action from MySQL.
***This is a draft commit temporarily pushed for collaboration.***
3a8fa72
to
836a915
Compare
This is a work-in-progress draft pushed for feedback.
An amazing description should answer some questions like:
Description
TODO: fill description here
This is a port of MySQL 5.6.22’s
binlog_error_action
variable – specifically, mysql/mysql-server@477240cee6 and select follow-up changes.The
binlog_error_action=ABORT_SERVER
enables the server to fail logging fast (Don’t full disk cause downtime anyway eh?), which prevents the server from accepting (or rejecting) more modifications that it cannot log for recovery and/or replication.IGNORE_ERROR
(current default): https://mariadb.com/kb/en/using-and-maintaining-the-binary-log/#effects-of-full-disk-errors-on-binary-loggingRETRY_LATER
andDISABLE_LOGGING
ABORT_SERVER
.Release Notes
TODO: What should the release notes say about this change?
Include any changed system variables, status variables or behaviour. Optionally list any https://mariadb.com/kb/ pages that need changing.
How can this PR be tested?
TODO: modify the automated test suite to verify that the PR causes MariaDB to behave as intended.
Consult the documentation on "Writing good test cases".
If the changes are not amenable to automated testing, please explain why not and carefully describe how to test manually.
Basing the PR against the correct MariaDB version
main
branch.PR quality check