Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MDEV-17856 Port binlog_error_action from MySQL #3615

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from
Draft

Conversation

ParadoxV5
Copy link
Contributor

@ParadoxV5 ParadoxV5 commented Nov 7, 2024

This is a work-in-progress draft pushed for feedback.


  • The Jira issue number for this PR is: MDEV-17856

An amazing description should answer some questions like:

  1. What problem is the patch trying to solve?
  2. If some output changed that is not visible in a test case, what was it looking like before the change and how it's looking with this patch applied?
  3. Do you think this patch might introduce side-effects in other parts of the server?

Description

TODO: fill description here


This is a port of MySQL 5.6.22’s binlog_error_action variable – specifically, mysql/mysql-server@477240cee6 and select follow-up changes.

Controls what happens when the server cannot write to the binary log, which can cause the master's log to become inconsistent and replication slaves to lose synchronization. Previous releases used the name binlogging_impossible_mode.

In MySQL 5.6 [and current MariaDB], the default for binlog_error_action is IGNORE_ERROR, meaning the server logs the error, halts logging, and continues performing updates; this is to provide backward compatibility with older versions of the MySQL [and MariaDB] Server. Setting this variable to ABORT_SERVER makes the server halt logging and shut down whenever it cannot write to the binary log; this is the recommended setting, particularly in complex replication environments.

The binlog_error_action=ABORT_SERVER enables the server to fail logging fast (Don’t full disk cause downtime anyway eh?), which prevents the server from accepting (or rejecting) more modifications that it cannot log for recovery and/or replication.



Release Notes

TODO: What should the release notes say about this change?
Include any changed system variables, status variables or behaviour. Optionally list any https://mariadb.com/kb/ pages that need changing.

How can this PR be tested?

TODO: modify the automated test suite to verify that the PR causes MariaDB to behave as intended.
Consult the documentation on "Writing good test cases".

If the changes are not amenable to automated testing, please explain why not and carefully describe how to test manually.

Basing the PR against the correct MariaDB version

  • This is a new feature or a refactoring, and the PR is based against the main branch.
  • This is a bug fix, and the PR is based against the earliest maintained branch in which the bug can be reproduced.

PR quality check

  • I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
  • For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

@ParadoxV5 ParadoxV5 self-assigned this Nov 7, 2024
Copy link
Contributor Author

@ParadoxV5 ParadoxV5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a draft is a draft

Comment on lines +117 to +128
#FIXME: hangs MTR
#--echo Test case7
## Simulate diskfull during opening of binlog and check for diskfull error
## behaviour where binlog error action is to continue the server
## after the error.
#SET SESSION debug="+d,simulate_file_write_error";
#--replace_regex /\.[\\\/]master/master/ /Errcode: 28 ".*"\)/Errcode: 28 "No space left on device")/
## error EE_WRITE with error code number 3 is expected.
#--error 3
#flush logs;
#SET SESSION debug="-d,simulate_file_write_error";
#--source include/restart_mysqld.inc
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This hangs at flush logs; on my machine.
It may or may not have something to od with the disabled suppressions.

Copy link
Contributor Author

@ParadoxV5 ParadoxV5 Nov 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If MariaDB encounters a full disk error while trying to write to a binary log file, then it will keep retrying the write every 60 seconds. Log messages will get written to the error log every 600 seconds.
MariaDB Knowledge Base “Using and Maintaining the Binary Log” § Effects of Full Disk Errors on Binary Logging

Mentioned by MDEV-20796, This note reminded me that I saw this retry message in my logs; they confirm this “hanging” behavior.


Actually, this periodically-retry mechanism means that this case doesn’t actually IGNORE_ERROR, since strictly speaking, it picks up the error and reponds with “retry later” (contrast with ABORT_SERVER’s “abort immediately” response).
MDEV-20796(’s description) even wants the IGNORE_ERROR to apply this periodically-retry behavior on all scenarios.

Should we review our terminologies?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, in general, we don't have to, but if we can conform to MySQL options, it is preferred, as it makes migration more straightforward. Another option would be to include MDEV-20796 in the scope of this PR (perhaps a separate commit though) as a third option that allows for consistency between the option names and behaviors.

Copy link
Contributor Author

@ParadoxV5 ParadoxV5 Nov 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option would be to include MDEV-20796 in the scope of this PR

Nay, the option MDEV-20796 wants would make other cases respond to errors by continuously retrying as well; it’s a distinct feature.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but it would allow each option to be consistent to its meaning. That is, we could slightly change IGNORE_ERROR behavior so it wouldn't "retry later" on full disk errors, but it would always turn off binlogging and move on. And we wouldn't lose the functionality because it would be nested in the RETRY configuration.

Comment on lines +141 to +170
#XXX: This was From their follow-up commit, but I don’t find anything resembling in our version of `log.cc`
# https://github.com/mysql/mysql-server/commit/3b6b4bf8c5d1bfada58678acebafdf6f813c2dfe#diff-0464eed4eba118bfe876fa26af42bbf449724a332247e64685cd5b73f1922a51
#--echo Test case9
#CREATE TABLE t1(i INT);
## Simulate error during flushing cache to file and test the behaviour
## when binlog_error_action is set to ABORT_SERVER/IGNORE_ERROR.
#SET SESSION debug="+d,simulate_error_during_flush_cache_to_file";
#SET GLOBAL binlog_error_action= ABORT_SERVER;
#--exec echo "wait" > $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
#--error ER_BINLOG_LOGGING_IMPOSSIBLE
#INSERT INTO t1 VALUES (1);
#--exec echo "restart" > $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
#--enable_reconnect
#--source include/wait_until_connected_again.inc
#
#--let $assert_cond= COUNT(*) = 0 FROM t1;
#--let $assert_text= Count of elements in t1 should be 0.
#--source include/assert.inc
#
#SET SESSION debug="+d,simulate_error_during_flush_cache_to_file";
#SET GLOBAL binlog_error_action= IGNORE_ERROR;
#INSERT INTO t1 VALUES (2);
#
#--let $assert_cond= COUNT(*) = 1 FROM t1;
#--let $assert_text= Count of elements in t1 should be 1.
#--source include/assert.inc
#
#DROP table t1;
#SET SESSION debug="-d,simulate_error_during_flush_cache_to_file";
#--source include/restart_mysqld.inc
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mysql/mysql-server@3b6b4bf8c5 mentioned the MySQL ticket and looked like a follow-up bug-fix.
However, it looks like MySQL had a different caching model than MariaDB.

--echo Test case3
# Simulate diskfull during opening of binlog and check for diskfull error
# behaviour where binlog error action is to abort the server.
SET SESSION debug="+d,simulate_file_write_error";
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simulate_file_write_error does simulate disk full.

Comment on lines +1 to +18
###############################################################################
# Bug#11758766:MYSQLD CONTINUES OPERATION WITHOUT LOGGING WHEN BINLOGS
# CANNOT BE WRITTEN
#
# Problem:
# ========
# If an error occurs that prevents mysqld writing to the binary logs (disk
# full, readonly filesystem, etc) then the logs are disabled and operations
# continue. This can lead to out of sync slaves and improper backups.
#
# Test:
# =====
# A new option "binlog_error_action" has been introduced whose values
# are "IGNORE" or "ABORT". When binlogging becomes impossible if user sets
# the variable to "ABORT" server will stop if user sets it to "IGNORE" binlog
# will be turned off and server will continue. 4 different test cases are
# added to check both the behaviours.
###############################################################################
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file was copied from MySQL.

I plan to reformat the file a bit – at least I’d --echo the testcase descriptions in the .results.
Let me know what other conventions there are.

@@ -28,7 +28,6 @@ com_rpc_tx : Requires connection attributes and detached sessions
mysqlbinlog_blind_replace: requires @@enable_blind_replace support
optimize_myrocks_replace_into_base: requires @@enable_blind_replace support
optimize_myrocks_replace_into_lock: requires @@enable_blind_replace support
rocksdb.skip_core_dump_on_error: requires @@binlog_error_action support
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I optimistically reënabled this to see if it’s good now or still needs more support.

Comment on lines +2920 to +2924
//XXX: MySQL needs both error calls. mysql/mysql-server@3b6b4bf8c5
my_error(ER_BINLOG_LOGGING_IMPOSSIBLE, MYF(0), abort_error);
sql_print_error("Could not use %s for logging (error %d). %s", name, error_id, abort_error);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mysql/mysql-server@3b6b4bf8c5 amended the second error call.

I wonder, what’s the difference between my_error and sql_print_error, and why did they (if not us) need to error twice?

I know that my_error receives its format template from errmsg-utf8.txt while sql_print_error uses a given template. (I know this from #3360.) This is why they have different error messages here.

ER_BINLOG_LOGGING_IMPOSSIBLE
chi "二进制记录不可能。消息:%s"
eng "Binary logging not possible. Message: %s"
ger "Binärlogging nicht möglich. Meldung: %s"
geo "ბინარული ჟურნალი შეუძლებელია. შეტყობინება: %s"
spa "No es posible llevar historial (log) binario. Mensaje: %s"
sw "Uwekaji katika kumbukumbu ya mfumo wa jozi hauwezekani. Ujumbe: %s"

(Yes, ER_BINLOG_LOGGING_IMPOSSIBLE already exists.)

Copy link
Contributor

@bnestere bnestere Nov 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, I'm actually not sure why they did that..In Maria, my_error() will set an error context on current_thd (see my_message_sql) as well as print an error message, whereas sql_print_error() just prints the error. I imagine we should only need my_error , though we should extend abort_error with the

Could not use %s ..

details.

@@ -938,6 +938,15 @@ extern bool max_user_connections_checking;
extern ulong opt_binlog_dbug_fsync_sleep;
static const int SERVER_UID_SIZE= 29;
extern char server_uid[SERVER_UID_SIZE+1];
extern ulong binlog_error_action;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I referred MySQL’s locations for changes in sys_vars.cc and mysqld.cc, but mysqld.h… I don’t understand how the sections were divided.

sql/sys_vars.cc Outdated
"error, the server can either ignore the error and let the primary "
"continue, or abort",
GLOBAL_VAR(binlog_error_action), CMD_LINE(REQUIRED_ARG),
(const char *[]){"IGNORE_ERROR", "ABORT_SERVER", NullS},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I felt that it’s excessive to exile the enum list at a different place but still only use it here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Darn. Compount literals is a GCC extension for C++, even though it’s standardized in C99, our regular-C version.

GLOBAL_VAR(binlog_error_action), CMD_LINE(REQUIRED_ARG),
(const char *[]){"IGNORE_ERROR", "ABORT_SERVER", NullS},
DEFAULT(IGNORE_ERROR));

static Sys_var_on_access_global<Sys_var_mybool,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What’s a Sys_var_on_access_global?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its basically an authentication wrapper to secure the variable s.t. its modification is protected by some role (see set_var::check() and for more details on the variable checking process).

@@ -3048,7 +3069,7 @@ bool MYSQL_LOG::open(
DBUG_RETURN(0);

err:
sql_print_error(fatal_log_error, name, errno);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MySQL had an additional log_type == LOG_BIN requirement for this condition:

  if (log_type == LOG_BIN && binlogging_impossible_mode == ABORT_SERVER)
  {
    …
    my_error(ER_BINLOG_LOGGING_IMPOSSIBLE …

Why is it there?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It ensures we only apply the error logic to the binary log, and not other logs. That is, the binlog (of type MYSQL_BIN_LOG) is derived from MYSQL_LOG, and this is the general MYSQL_LOG::open function.

sql/mysqld.cc Outdated
@@ -454,6 +454,7 @@ uint opt_binlog_gtid_index_span_min= 65536;
my_bool opt_master_verify_checksum= 0;
my_bool opt_slave_sql_verify_checksum= 1;
const char *binlog_format_names[]= {"MIXED", "STATEMENT", "ROW", NullS};
ulong binlog_error_action;
Copy link
Contributor Author

@ParadoxV5 ParadoxV5 Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A passerby colleague questioned why a variable for a single file requires modifying 4 files to add.

I’ll move this line to log.cc and then, yes, why is a variable split among variables.cc, implementation.cc and a bridging .h 🤷?

@@ -938,6 +938,15 @@ extern bool max_user_connections_checking;
extern ulong opt_binlog_dbug_fsync_sleep;
static const int SERVER_UID_SIZE= 29;
extern char server_uid[SERVER_UID_SIZE+1];
extern ulong binlog_error_action;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An enum variable only has so many valid options, but they’re unsigned longs in RAM. The old design left so much unused RAM space 🤷.

This commit creates the variable `binlog_error_action`.
It currently does nothing, and will be implemented in the main commit of
MDEV-17856 Port binlog_error_action from MySQL.
***This is a draft commit temporarily pushed for collaboration.***
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants