
Proper handling of CHAR columns with binary collations #8730

Merged (5 commits, Oct 7, 2021)

Conversation

@mattlord (Contributor) commented Aug 30, 2021

Description

The work in #8137 seems to have introduced a bug in how we vreplicate CHAR columns that use a binary collation. If, for example, a column is defined as foo CHAR(3) COLLATE UTF8MB4_BIN, then the value gets padded to the maximum byte length / display width of 12 -- utf8mb4 uses 1-4 bytes per character -- which causes the SQL statement to fail because we then try to insert 12 characters into a 3-character column:

Error applying event: Data too long for column 'foo' at row 1 (errno 1406) (sqlstate 22001)
during query: update t1 set ... foo='TWD\0\0\0\0\0\0\0\0\0' ...

Test Case

git clone git@github.com:vitessio/vitess.git && cd vitess

git checkout main && git pull

make docker_local

./docker/local/run.sh

vtctlclient -server localhost:15999 ApplySchema -sql "ALTER TABLE customer ADD country_code CHAR(3) DEFAULT 'USA' COLLATE UTF8MB4_BIN" commerce

mysql -h 127.0.0.1 -P 15306 commerce -e "insert into customer values (1, '[email protected]', 'USA')"

./201_customer_tablets.sh

./202_move_tables.sh

mysql -h 127.0.0.1 -P 15306 commerce -e "insert into customer values (2, '[email protected]', 'CAN')"

vtctlclient -server localhost:15999 Workflow customer.commerce2customer show

Final results:

{
        "Workflow": "commerce2customer",
        "SourceLocation": {
                "Keyspace": "commerce",
                "Shards": [
                        "0"
                ]
        },
        "TargetLocation": {
                "Keyspace": "customer",
                "Shards": [
                        "0"
                ]
        },
        "MaxVReplicationLag": 5,
        "ShardStatuses": {
                "0/zone1-0000000200": {
                        "PrimaryReplicationStatuses": [
                                {
                                        "Shard": "0",
                                        "Tablet": "zone1-0000000200",
                                        "ID": 1,
                                        "Bls": {
                                                "keyspace": "commerce",
                                                "shard": "0",
                                                "filter": {
                                                        "rules": [
                                                                {
                                                                        "match": "customer",
                                                                        "filter": "select * from customer"
                                                                },
                                                                {
                                                                        "match": "corder",
                                                                        "filter": "select * from corder"
                                                                }
                                                        ]
                                                }
                                        },
                                        "Pos": "971ac708-09c6-11ec-9caa-0242ac110002:1-42",
                                        "StopPos": "",
                                        "State": "Running",
                                        "DBName": "vt_customer",
                                        "TransactionTimestamp": 0,
                                        "TimeUpdated": 1630351445,
                                        "Message": "Data too long for column 'country_code' at row 1 (errno 1406) (sqlstate 22001) during query: insert into customer(customer_id,email,country_code) values (2,'[email protected]','CAN\\0\\0\\0\\0\\0\\0\\0\\0\\0')",
                                        "Tags": "",
                                        "CopyState": null
                                }
                        ],
                        "TabletControls": null,
                        "PrimaryIsServing": true
                }
        }
}

Originating Binlog events:

mysqlbinlog -vvv --base64-output=decode-rows /vt/vtdataroot/vt_0000000100/bin-logs/vt-0000000100-bin.000001
...
# at 11487
#210830 19:43:25 server id 386181494  end_log_pos 11552 CRC32 0x5f219419 	GTID	last_committed=40	sequence_number=41	rbr_only=yes
/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/;
SET @@SESSION.GTID_NEXT= '717b62d4-09ca-11ec-acb3-0242ac110002:41'/*!*/;
# at 11552
#210830 19:43:25 server id 386181494  end_log_pos 11631 CRC32 0x357d0a9c 	Query	thread_id=77	exec_time=0	error_code=0
SET TIMESTAMP=1630352605/*!*/;
BEGIN
/*!*/;
# at 11631
#210830 19:43:25 server id 386181494  end_log_pos 11695 CRC32 0xcf4ba969 	Table_map: `vt_commerce`.`customer` mapped to number 143
# at 11695
#210830 19:43:25 server id 386181494  end_log_pos 11761 CRC32 0x0babe28e 	Write_rows: table id 143 flags: STMT_END_F
### INSERT INTO `vt_commerce`.`customer`
### SET
###   @1=2 /* LONGINT meta=0 nullable=0 is_null=0 */
###   @2='[email protected]' /* VARSTRING(128) meta=128 nullable=1 is_null=0 */
###   @3='CAN' /* STRING(12) meta=65036 nullable=1 is_null=0 */
# at 11761
#210830 19:43:25 server id 386181494  end_log_pos 11792 CRC32 0xbaed4e55 	Xid = 495
COMMIT/*!*/;
...

It's the max length of 12 bytes you see there that is the issue. We have to take the maximum bytes-per-character into account when padding the byte array with null bytes (\0s). In this case the maximum is 4 bytes per character with utf8mb4, so we should map the max bytes back to max chars when padding with null bytes: 12 / 4 = 3. In effect, we need to make the same kind of adjustment done here and here in MySQL. I don't yet see where we have character set info available here in Vitess, though...
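The adjustment described above can be sketched as follows. This is a minimal illustration of the calculation only; padFixedChar and its parameters are hypothetical names, not the actual Vitess functions:

```go
package main

import "fmt"

// padFixedChar pads a fixed-length CHAR value parsed from a row event.
// For a CHAR column with a binary collation, the raw value must be padded
// with null bytes -- but only up to the column's character width, which is
// the binlog byte width divided by the charset's max bytes per character.
func padFixedChar(val []byte, maxByteLen, maxBytesPerChar int) []byte {
	// e.g. CHAR(3) COLLATE utf8mb4_bin: maxByteLen=12, maxBytesPerChar=4,
	// so pad to 12/4 = 3 bytes, not 12.
	charLen := maxByteLen / maxBytesPerChar
	if len(val) >= charLen {
		return val
	}
	padded := make([]byte, charLen)
	copy(padded, val)
	return padded
}

func main() {
	fmt.Printf("%q\n", padFixedChar([]byte("TWD"), 12, 4)) // "TWD" -- no extra \0s
	fmt.Printf("%q\n", padFixedChar([]byte("A"), 12, 4))   // "A\x00\x00"
}
```

With the buggy behavior the value would instead be padded to all 12 bytes, producing the 'TWD\0\0\0\0\0\0\0\0\0' seen in the error above.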

Related Issue(s)

#8743

Checklist

  • Should this PR be backported? Yes
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

@shlomi-noach (Contributor):

I'm actually unsure in which endtoend test this test data lands. Right now we're seeing a flare of failing tests, but none of them seems to involve "USA" (the text used to validate behavior).
I can quickly look tomorrow, let's keep this open.

@deepthi deepthi removed the request for review from rohit-nayak-ps August 30, 2021 18:51
@mattlord mattlord self-assigned this Aug 30, 2021
@mattlord mattlord marked this pull request as draft August 30, 2021 19:28
@mattlord (Contributor, Author) commented Aug 31, 2021

Binlog events w/o the COLLATION clause from the test case (vreplication works fine):

# at 11467
#210831  1:46:23 server id 271990647  end_log_pos 11532 CRC32 0xb8057d26 	GTID	last_committed=40	sequence_number=41	rbr_only=yes
/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/;
SET @@SESSION.GTID_NEXT= '245cb216-09fd-11ec-a089-0242ac110002:41'/*!*/;
# at 11532
#210831  1:46:23 server id 271990647  end_log_pos 11611 CRC32 0xeb8c3e1a 	Query	thread_id=69	exec_time=0	error_code=0
SET TIMESTAMP=1630374383/*!*/;
BEGIN
/*!*/;
# at 11611
#210831  1:46:23 server id 271990647  end_log_pos 11675 CRC32 0x06e8a0e5 	Table_map: `vt_commerce`.`customer` mapped to number 143
# at 11675
#210831  1:46:23 server id 271990647  end_log_pos 11741 CRC32 0xebc7059f 	Write_rows: table id 143 flags: STMT_END_F
### INSERT INTO `vt_commerce`.`customer`
### SET
###   @1=2 /* LONGINT meta=0 nullable=0 is_null=0 */
###   @2='[email protected]' /* VARSTRING(128) meta=128 nullable=1 is_null=0 */
###   @3='CAN' /* STRING(9) meta=65033 nullable=1 is_null=0 */
# at 11741
#210831  1:46:23 server id 271990647  end_log_pos 11772 CRC32 0x46bb8653 	Xid = 471
COMMIT/*!*/;

Max storage/display width bytes went from 12 to 9 -- because this is MySQL 5.7 where utf8mb3 is the default -- and meta went from 65036 to 65033... but the key thing is that it's not being treated as a purely binary value (binary type and binary flag set) and thus does not get the new padding.

@mattlord mattlord changed the title Test proper handling of character columns with binary collations Proper handling of CHAR columns with binary collations Aug 31, 2021
@shlomi-noach (Contributor):

Awesome! I see the same error when I do manual tests, and we need to fix this. What concerns me is that I can't see which CI test fails due to the changes in this PR. We should find at least one failing test first (and if there isn't one, generate one), and then work to fix the problem and ensure the test passes.

It's easy for me to produce a test case in onlineddl_vrepl_suite, which tests OnlineDDL via VReplication. I actually created one locally and I can see the problem reproduce.

@shlomi-noach (Contributor):

@mattlord some character set info is found here:

// A few interesting character set values.
// See http://dev.mysql.com/doc/internals/en/character-set.html#packet-Protocol::CharacterSet
const (
	// CharacterSetUtf8 is for UTF8. We use this by default.
	CharacterSetUtf8 = 33
	// CharacterSetBinary is for binary. Use by integer fields for instance.
	CharacterSetBinary = 63
)

// CharacterSetMap maps the charset name (used in ConnParams) to the
// integer value. Interesting ones have their own constant above.
var CharacterSetMap = map[string]uint8{
	"big5":     1,
	"dec8":     3,
	"cp850":    4,
	"hp8":      6,
	"koi8r":    7,
	"latin1":   8,
	"latin2":   9,
	"swe7":     10,
	"ascii":    11,
	"ujis":     12,
	"sjis":     13,
	"hebrew":   16,
	"tis620":   18,
	"euckr":    19,
	"koi8u":    22,
	"gb2312":   24,
	"greek":    25,
	"cp1250":   26,
	"gbk":      28,
	"latin5":   30,
	"armscii8": 32,
	"utf8":     CharacterSetUtf8,
	"ucs2":     35,
	"cp866":    36,
	"keybcs2":  37,
	"macce":    38,
	"macroman": 39,
	"cp852":    40,
	"latin7":   41,
	"utf8mb4":  45,
	"cp1251":   51,
	"utf16":    54,
	"utf16le":  56,
	"cp1256":   57,
	"cp1257":   59,
	"utf32":    60,
	"binary":   CharacterSetBinary,
	"geostd8":  92,
	"cp932":    95,
	"eucjpms":  97,
}

// CharacterSetEncoding maps a charset name to a golang encoder.
// golang does not support encoders for all MySQL charsets.
// A charset not in this map is unsupported.
// A trivial encoding (e.g. utf8) has a `nil` encoder.
var CharacterSetEncoding = map[string]encoding.Encoding{
	"cp850":   charmap.CodePage850,
	"koi8r":   charmap.KOI8R,
	"latin1":  charmap.Windows1252,
	"latin2":  charmap.ISO8859_2,
	"ascii":   nil,
	"hebrew":  charmap.ISO8859_8,
	"greek":   charmap.ISO8859_7,
	"cp1250":  charmap.Windows1250,
	"gbk":     simplifiedchinese.GBK,
	"latin5":  charmap.ISO8859_9,
	"utf8":    nil,
	"cp866":   charmap.CodePage866,
	"cp852":   charmap.CodePage852,
	"latin7":  charmap.ISO8859_13,
	"utf8mb4": nil,
	"cp1251":  charmap.Windows1251,
	"cp1256":  charmap.Windows1256,
	"cp1257":  charmap.Windows1257,
	"binary":  nil,
}

// ReverseCharacterSetMap maps a charset integer code to charset name.
var ReverseCharacterSetMap = map[uint8]string{}

func init() {
	for c, i := range CharacterSetMap {
		ReverseCharacterSetMap[i] = c
	}
}

which is currently mostly used to map a character set to golang encoding.

Question: I don't really understand the code in https://github.com/mysql/mysql-server/blob/beb865a960b9a8a16cf999c323e46c5b0c67f21f/sql/sql_show.cc#L5468-L5482 and in https://github.com/mysql/mysql-server/blob/beb865a960b9a8a16cf999c323e46c5b0c67f21f/sql/field.cc#L6410-L6440.

I wonder: would it be correct to just trim right the \0 padding, instead of pre-calculating length based on actual character contents?

@shlomi-noach (Contributor):

I created #8749 as a draft PR, which introduces a test that is known to fail CI. I prefer the tests in this PR, but am unsure where they land and which test they fail.

@shlomi-noach (Contributor):

What confuses me is that char(3) collate utf8mb4_bin is considered to be a BINARY in the binlog, at least the way Vitess parses it. I tried identifying that this is really-a-test-with-binary-collation but haven't nailed it yet. I think if we can identify that, we can exempt the right padding.

@mattlord (Contributor, Author) commented Aug 31, 2021

So the write_rows binary log events DO NOT have the character set information in them. And they do in fact treat/handle/encode the char(3) collate utf8mb4_bin column values as a binary type with the binary flag set, so we cannot distinguish them from, e.g., a BINARY(12) column value when parsing the binary log events.

This means that we need to apply the padding, or trim the previously added padding, when forming the SQL statement on the vplayer side -- where we should have the character set information for the table and column(s).

@shlomi-noach make sense? I think that's what you were alluding to with:

I wonder: would it be correct to just trim right the \0 padding, instead of pre-calculating length based on actual character contents?

@shlomi-noach (Contributor):

make sense? I think that's what you were alluding to with:

@mattlord I have to say, so nice to have someone so authoritative! 😊
This is cool; on the VPlayer side we are able to identify the actual column type. If the information is not already there, then we can relatively easily add it. So let's do auto-padding on parsing row event, then possibly trimming it when applying in VPlayer.
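A minimal sketch of that trim-on-apply idea. This is illustrative only -- trimCharPadding is not an actual Vitess function, and (as settled later in this PR) the trim must only be applied when the column is really a CHAR; for a true BINARY column trailing null bytes are significant data:

```go
package main

import (
	"bytes"
	"fmt"
)

// trimCharPadding strips the trailing null-byte padding that row-event
// parsing adds to fixed-length STRING values, so the value fits back
// into its CHAR(n) column when the vplayer builds the SQL statement.
// Callers must first consult the column schema: for a true BINARY
// column the trailing \x00 bytes are real data and must be kept.
func trimCharPadding(val []byte) []byte {
	return bytes.TrimRight(val, "\x00")
}

func main() {
	padded := []byte("CAN\x00\x00\x00\x00\x00\x00\x00\x00\x00")
	fmt.Printf("%q\n", trimCharPadding(padded)) // "CAN"
}
```

The appeal of trimming on the vplayer side is that the column schema (and thus the character set) is available there, whereas the row event alone cannot tell CHAR-with-binary-collation apart from BINARY.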

@tokikanno (Contributor) commented Aug 31, 2021

I'm not familiar with the binlog parsing, but how does the mysqlbinlog tool generate the following info?

###   @1=2 /* LONGINT meta=0 nullable=0 is_null=0 */
###   @2='[email protected]' /* VARSTRING(128) meta=128 nullable=1 is_null=0 */
                               ^^^^^^^^^^^^^^ where does this VARSTRING(128) info come from? 🤔 
###   @3='CAN' /* STRING(12) meta=65036 nullable=1 is_null=0 */
                 ^^^^^^^^^^^ also this STRING(12)

Are these usable info for auto-padding?

@mattlord (Contributor, Author) commented Aug 31, 2021

Hi @tokikanno!

I'm not familiar with the binlog parsing, but how does mysqlbinlog tool generate following info?

###   @1=2 /* LONGINT meta=0 nullable=0 is_null=0 */
###   @2='[email protected]' /* VARSTRING(128) meta=128 nullable=1 is_null=0 */
                               ^^^^^^^^^^^^^^ where does this VARSTRING(128) info come from? 🤔 
###   @3='CAN' /* STRING(12) meta=65036 nullable=1 is_null=0 */
                 ^^^^^^^^^^^ also this STRING(12)

Are these usable info for auto-padding?

At the code level, mysqlbinlog parses the events in essentially the same manner as we do in Vitess: https://github.com/mysql/mysql-server/blob/5.7/client/mysqlbinlog.cc

At a logical level, you can see the docs here: https://dev.mysql.com/doc/refman/5.7/en/mysqlbinlog-row-events.html

Character set information is not available in the binary log, which affects string column display:

There is no distinction made between corresponding binary and nonbinary string types (BINARY and CHAR, VARBINARY and VARCHAR, BLOB and TEXT). The output uses a data type of STRING for fixed-length strings and VARSTRING for variable-length strings.

For multibyte character sets, the maximum number of bytes per character is not present in the binary log, so the length for string types is displayed in bytes rather than in characters. For example, STRING(4) is used as the data type for values from either of these column types:

  • CHAR(4) CHARACTER SET latin1
  • CHAR(2) CHARACTER SET ucs2
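The byte-vs-character accounting from the quoted docs can be illustrated like this. The maxlen values are the standard MySQL ones for these charsets; the helper names are hypothetical:

```go
package main

import "fmt"

// maxBytesPerChar is an illustrative subset of MySQL charset "maxlen"
// values (maximum bytes needed to encode one character).
var maxBytesPerChar = map[string]int{
	"latin1":  1,
	"ucs2":    2,
	"utf8":    3, // utf8mb3
	"utf8mb4": 4,
}

// binlogByteWidth returns the byte width recorded in the binlog for a
// CHAR(charLen) column -- the n that mysqlbinlog prints as STRING(n).
func binlogByteWidth(charset string, charLen int) int {
	return charLen * maxBytesPerChar[charset]
}

func main() {
	// Both of the doc's examples yield STRING(4):
	fmt.Println(binlogByteWidth("latin1", 4)) // 4
	fmt.Println(binlogByteWidth("ucs2", 2))   // 4
	// And the CHAR(3) utf8mb4 column from this bug yields STRING(12):
	fmt.Println(binlogByteWidth("utf8mb4", 3)) // 12
}
```

This is also why the 5.7 test without the COLLATE clause showed STRING(9): the default utf8mb3 charset has a maxlen of 3, and 3 chars × 3 bytes = 9.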

@tokikanno (Contributor):

There is no distinction made between corresponding binary and nonbinary string types (BINARY and CHAR, VARBINARY and VARCHAR, BLOB and TEXT). The output uses a data type of STRING for fixed-length strings and VARSTRING for variable-length strings.

thx, got it 🤨

@shlomi-noach (Contributor):

@mattlord as I've been away for a while: would you like to import the test case from #8749? If that passes, then we're all good.

@mattlord (Contributor, Author) commented Oct 4, 2021

@mattlord as I've been away for a while: would you like to import the test case from #8749? If that passes, then we're all good.

Will do, thanks! I'm going to work on this next.

@mattlord mattlord closed this Oct 4, 2021
@mattlord mattlord deleted the TestVReplCharColWBinaryCollation branch October 4, 2021 15:39
@mattlord mattlord restored the TestVReplCharColWBinaryCollation branch October 4, 2021 15:40
@mattlord mattlord reopened this Oct 4, 2021
@mattlord mattlord force-pushed the TestVReplCharColWBinaryCollation branch 4 times, most recently from 8a1da75 to 437f77a Compare October 5, 2021 20:13
@mattlord mattlord force-pushed the TestVReplCharColWBinaryCollation branch from e08c25f to 82e4788 Compare October 6, 2021 04:39
We need to take the max bytes per character into account
or we try to insert too many characters into the fixed
length column.

Signed-off-by: Matt Lord <[email protected]>
@mattlord mattlord force-pushed the TestVReplCharColWBinaryCollation branch from 82e4788 to fe9ed55 Compare October 6, 2021 04:53
@shlomi-noach (Contributor):

Looks like onlineddl_vrepl_suite passes now!

@shlomi-noach (Contributor) left a review comment:


I don't have insight into how binaries are encoded in MySQL; tests were added -- if they pass, then the PR is good to merge!

If we do the padding higher in the call stack then we have the
column schema, which includes the character set, that we can use
to do the padding correctly.

Signed-off-by: Matt Lord <[email protected]>
@mattlord mattlord force-pushed the TestVReplCharColWBinaryCollation branch 2 times, most recently from 5822063 to e7c8cfb Compare October 6, 2021 07:19
There are some test failures that I don't fully understand -- this
method is causing VDiffs to fail, along with the onlineddl_vrepl
varbinary test.

Reverting to the post-trim method as it now seems safer -- if less
elegant.

Signed-off-by: Matt Lord <[email protected]>
@mattlord mattlord force-pushed the TestVReplCharColWBinaryCollation branch from e7c8cfb to 639acae Compare October 6, 2021 07:21
And also ensure that both the value and the column type are binary

Signed-off-by: Matt Lord <[email protected]>
@mattlord mattlord requested review from deepthi and removed request for systay and harshit-gangal October 6, 2021 19:54
@vitessio vitessio deleted a comment from shlomi-noach Oct 6, 2021
@sougou sougou merged commit 6624c7c into main Oct 7, 2021
@sougou sougou deleted the TestVReplCharColWBinaryCollation branch October 7, 2021 19:38
mattlord pushed a commit to planetscale/vitess that referenced this pull request Oct 7, 2021
…yCollation

Proper handling of CHAR columns with binary collations

Signed-off-by: Matt Lord <[email protected]>
mattlord pushed a commit to planetscale/vitess that referenced this pull request Oct 7, 2021
…yCollation

Proper handling of CHAR columns with binary collations

Signed-off-by: Matt Lord <[email protected]>
deepthi added a commit that referenced this pull request Oct 8, 2021
Backport [#8730 to 12.0]: Proper handling of CHAR columns with binary collations
deepthi added a commit that referenced this pull request Oct 8, 2021
Backport [#8730 to 11.0]: Proper handling of CHAR columns with binary collations