Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

If MySQL two-phase commit is time-consuming, it will cause gh-ost to lose data. #1030

Closed
shaohk opened this issue Sep 28, 2021 · 2 comments
Closed

Comments

@shaohk
Copy link
Contributor

shaohk commented Sep 28, 2021

We encountered a case of the gh-ost lose data. The details of this case is as follows:

  1. start the gh-ost to alter online table t1.

  2. when the gh-ost is finish, we find there is lose a data in new table t1, and the lose data is in the old table _t1_del.

  3. we analyze the binlog and gh-ost log is as follows

  • lose data id is 17, binlog position is (mysql-bin.000009, 98240).
  • the gh-ost begin to sync binlog from position (mysql-bin.000009, 1780).
  • the gh-ost migration min and max value is (1, 16).
  • the gh-ost first status log is
    Copy: 0/16 0.0%; Applied: 0; Backlog: 0/1000; Time: 33s(total), 0s(copy); streamer: mysql-bin.000009:101907; Lag: 15.18s, HeartbeatLag: 15.08s, State: migrating; ETA: N/A

We suspect the case is because of MySQL two-phase commit.

The gh-ost command is

gh-ost \
--max-load=Threads_running=32 \
--critical-load=Threads_running=48 \
--chunk-size=500 \
--max-lag-millis=1000000 \
--initially-drop-old-table \
--initially-drop-socket-file \
--user="xx" \
--password="xxxx" \
--host="xxx.xxx.xxx.xxx" \
--port=xxxxx \
--database="shaohk" \
--table="t1" \
--alter="ADD INDEX idx_t1 (name);" \
--switch-to-rbr=false \
--cut-over=default \
--assume-rbr=true \
--exact-rowcount=false \
--default-retries=3 \
--critical-load-interval-millis=300000 \
--serve-socket-file=/tmp/ghost.socket.t1.sock \
--panic-flag-file=/tmp/ghost.panic.flag.shaohk.t1 \
--throttle-flag-file=/tmp/ghost.throttle.flag.shaohk.t1 \
--force-table-names=t1 \
--verbose \
--execute \
--postpone-cut-over-flag-file=/tmp/ghost.postpone.flag.shaohk.t1 \
--allow-on-master

The insert transaction is

mysql> begin;
Query OK, 0 rows affected (0.03 sec)

mysql> insert into t1 values (17,'test');
Query OK, 1 row affected (0.04 sec)

mysql> commit;
Query OK, 0 rows affected (30.3 sec)

The gh-ost log is

2021-09-26 23:11:50 INFO starting gh-ost
2021-09-26 23:11:50 INFO Migrating `shaohk`.`t1`
2021-09-26 23:11:50 INFO inspector connection validated on 127.0.0.1:3306
2021-09-26 23:11:50 INFO User has ALL privileges
2021-09-26 23:11:50 INFO binary logs validated on 127.0.0.1:3306
2021-09-26 23:11:50 INFO Inspector initiated on 127.0.0.1:3306, version 5.7.26-301001001-log
2021-09-26 23:11:50 INFO Table found. Engine=InnoDB
2021-09-26 23:11:50 INFO Estimated number of rows via EXPLAIN: 16
2021-09-26 23:11:51 INFO Recursively searching for replication master
2021-09-26 23:11:51 INFO Master found to be 127.0.0.1:3306
2021-09-26 23:11:51 INFO log_slave_updates validated on 127.0.0.1:3306
2021-09-26 23:11:51 INFO streamer connection validated on 127.0.0.1:3306
[2021/09/26 23:11:51] [info] binlogsyncer.go:148 create BinlogSyncer with config {99999 mysql 127.0.0.1 3306 xxx  false false <nil> false UTC true 0 0s 0s 0 false false 0 <nil>}
2021-09-26 23:11:51 INFO Connecting binlog streamer at mysql-bin.000009:1780
[2021/09/26 23:11:51] [info] binlogsyncer.go:374 begin to sync binlog from position (mysql-bin.000009, 1780)
[2021/09/26 23:11:51] [info] binlogsyncer.go:791 rotate to (mysql-bin.000009, 1780)
2021-09-26 23:11:51 INFO rotate to next log from mysql-bin.000009:0 to mysql-bin.000009
2021-09-26 23:11:51 INFO applier connection validated on 127.0.0.1:3306
2021-09-26 23:11:51 INFO applier connection validated on 127.0.0.1:3306
2021-09-26 23:11:52 INFO will use time_zone='SYSTEM' on applier
2021-09-26 23:11:52 INFO Examining table structure on applier
2021-09-26 23:11:52 INFO Applier initiated on 127.0.0.1:3306, version 5.7.26-301001001-log
2021-09-26 23:11:52 INFO Dropping table `shaohk`.`_t1_del`
2021-09-26 23:11:52 INFO Table dropped
2021-09-26 23:11:52 INFO Dropping table `shaohk`.`_t1_ghc`
2021-09-26 23:11:52 INFO Table dropped
2021-09-26 23:11:52 INFO Creating changelog table `shaohk`.`_t1_ghc`
2021-09-26 23:11:52 INFO Changelog table created
2021-09-26 23:11:52 INFO Creating ghost table `shaohk`.`_t1_gho`
2021-09-26 23:11:52 INFO Ghost table created
2021-09-26 23:11:52 INFO Altering ghost table `shaohk`.`_t1_gho`
2021-09-26 23:11:52 INFO Ghost table altered
2021-09-26 23:11:52 INFO Intercepted changelog state GhostTableMigrated
2021-09-26 23:11:52 INFO Waiting for ghost table to be migrated. Current lag is 0s
2021-09-26 23:11:52 INFO Handled changelog state GhostTableMigrated
2021-09-26 23:11:52 INFO Chosen shared unique key is PRIMARY
2021-09-26 23:11:52 INFO Shared columns are id,name
2021-09-26 23:11:52 INFO Listening on unix socket file: /tmp/ghost.socket.146155549.sock
2021-09-26 23:12:12 INFO Migration min values: [1]
2021-09-26 23:12:12 INFO Migration max values: [16]
2021-09-26 23:12:22 INFO Waiting for first throttle metrics to be collected
2021-09-26 23:12:23 INFO First throttle metrics collected
# Migrating `shaohk`.`t1`; Ghost table is `shaohk`.`_t1_gho`
# Migrating 127.0.0.1:3306; inspecting 127.0.0.1:3306; executing on shaohk.local
# Migration started at Sun Sep 26 23:11:50 +0800 2021
# chunk-size: 500; max-lag-millis: 1000000ms; dml-batch-size: 10; max-load: Threads_running=32; critical-load: Threads_running=48; nice-ratio: 0.000000
# throttle-flag-file: /tmp/ghost.throttle.flag.shaohk.t1.146155549
# throttle-additional-flag-file: /tmp/gh-ost.throttle
# postpone-cut-over-flag-file: /tmp/ghost.postpone.flag.shaohk.t1.146155549 [set]
# panic-flag-file: /tmp/ghost.panic.flag.shaohk.t1.146155549
# Serving on unix socket: /tmp/ghost.socket.146155549.sock
Copy: 0/16 0.0%; Applied: 0; Backlog: 0/1000; Time: 33s(total), 0s(copy); streamer: mysql-bin.000009:101907; Lag: 15.18s, HeartbeatLag: 15.08s, State: migrating; ETA: N/A
Copy: 16/16 100.0%; Applied: 0; Backlog: 0/1000; Time: 49s(total), 16s(copy); streamer: mysql-bin.000009:103548; Lag: 31.11s, HeartbeatLag: 31.13s, State: migrating; ETA: due
2021-09-26 23:12:40 INFO Row copy complete
Copy: 16/16 100.0%; Applied: 0; Backlog: 0/1000; Time: 50s(total), 17s(copy); streamer: mysql-bin.000009:106332; Lag: 0.38s, HeartbeatLag: 0.38s, State: migrating; ETA: due
Copy: 16/16 100.0%; Applied: 0; Backlog: 0/1000; Time: 50s(total), 17s(copy); streamer: mysql-bin.000009:106332; Lag: 0.34s, HeartbeatLag: 0.34s, State: migrating; ETA: due
Copy: 16/16 100.0%; Applied: 0; Backlog: 0/1000; Time: 51s(total), 17s(copy); streamer: mysql-bin.000009:114627; Lag: 0.08s, HeartbeatLag: 0.13s, State: postponing cut-over; ETA: due
Copy: 16/16 100.0%; Applied: 0; Backlog: 0/1000; Time: 52s(total), 17s(copy); streamer: mysql-bin.000009:121824; Lag: 0.08s, HeartbeatLag: 0.13s, State: postponing cut-over; ETA: due
2021-09-26 23:13:01 INFO Grabbing voluntary lock: gh-ost.66.lock
2021-09-26 23:13:01 INFO Setting LOCK timeout as 6 seconds
2021-09-26 23:13:01 INFO Looking for magic cut-over table
2021-09-26 23:13:01 INFO Creating magic cut-over table `shaohk`.`_t1_del`
2021-09-26 23:13:01 INFO Magic cut-over table created
2021-09-26 23:13:01 INFO Locking `shaohk`.`t1`, `shaohk`.`_t1_del`
2021-09-26 23:13:01 INFO Tables locked
2021-09-26 23:13:01 INFO Session locking original & magic tables is 66
2021-09-26 23:13:01 INFO Writing changelog state: AllEventsUpToLockProcessed:1632669181273438000
2021-09-26 23:13:01 INFO Intercepted changelog state AllEventsUpToLockProcessed
2021-09-26 23:13:01 INFO Handled changelog state AllEventsUpToLockProcessed
2021-09-26 23:13:01 INFO Waiting for events up to lock
2021-09-26 23:13:01 INFO Waiting for events up to lock: got AllEventsUpToLockProcessed:1632669181273438000
2021-09-26 23:13:01 INFO Done waiting for events up to lock; duration=269.19469ms
# Migrating `shaohk`.`t1`; Ghost table is `shaohk`.`_t1_gho`
# Migrating 127.0.0.1:3306; inspecting 127.0.0.1:3306; executing on shaohk.local
# Migration started at Sun Sep 26 23:11:50 +0800 2021
# chunk-size: 500; max-lag-millis: 1000000ms; dml-batch-size: 10; max-load: Threads_running=32; critical-load: Threads_running=48; nice-ratio: 0.000000
# throttle-flag-file: /tmp/ghost.throttle.flag.shaohk.t1.146155549
# throttle-additional-flag-file: /tmp/gh-ost.throttle
# postpone-cut-over-flag-file: /tmp/ghost.postpone.flag.shaohk.t1.146155549
# panic-flag-file: /tmp/ghost.panic.flag.shaohk.t1.146155549
# Serving on unix socket: /tmp/ghost.socket.146155549.sock
Copy: 16/16 100.0%; Applied: 0; Backlog: 0/1000; Time: 1m11s(total), 17s(copy); streamer: mysql-bin.000009:249730; Lag: 0.07s, HeartbeatLag: 0.11s, State: migrating; ETA: due
2021-09-26 23:13:01 INFO Setting RENAME timeout as 3 seconds
2021-09-26 23:13:01 INFO Session renaming tables is 72
2021-09-26 23:13:01 INFO Issuing and expecting this to block: rename /* gh-ost */ table `shaohk`.`t1` to `shaohk`.`_t1_del`, `shaohk`.`_t1_gho` to `shaohk`.`t1`
2021-09-26 23:13:01 INFO Found atomic RENAME to be blocking, as expected. Double checking the lock is still in place (though I don't strictly have to)
2021-09-26 23:13:01 INFO Checking session lock: gh-ost.66.lock
2021-09-26 23:13:01 INFO Connection holding lock on original table still exists
2021-09-26 23:13:01 INFO Will now proceed to drop magic table and unlock tables
2021-09-26 23:13:01 INFO Dropping magic cut-over table
2021-09-26 23:13:01 INFO Releasing lock from `shaohk`.`t1`, `shaohk`.`_t1_del`
2021-09-26 23:13:01 INFO Tables renamed
2021-09-26 23:13:01 INFO Tables unlocked
2021-09-26 23:13:01 INFO Lock & rename duration: 741.223847ms. During this time, queries on `t1` were blocked
[2021/09/26 23:13:01] [info] binlogsyncer.go:180 syncer is closing...
[2021/09/26 23:13:02] [info] binlogsyncer.go:864 kill last connection id 68
[2021/09/26 23:13:02] [info] binlogsyncer.go:210 syncer is closed
2021-09-26 23:13:02 INFO Closed streamer connection. err=<nil>
2021-09-26 23:13:02 INFO Dropping table `shaohk`.`_t1_ghc`
2021-09-26 23:13:02 INFO Table dropped
2021-09-26 23:13:02 INFO Am not dropping old table because I want this operation to be as live as possible. If you insist I should do it, please add `--ok-to-drop-table` next time. But I prefer you do not. To drop the old table, issue:
2021-09-26 23:13:02 INFO -- drop table `shaohk`.`_t1_del`
2021-09-26 23:13:02 INFO Done migrating `shaohk`.`t1`
2021-09-26 23:13:02 INFO Removing socket file: /tmp/ghost.socket.146155549.sock
2021-09-26 23:13:02 INFO Tearing down inspector
2021-09-26 23:13:02 INFO Tearing down applier
2021-09-26 23:13:02 INFO Tearing down streamer
2021-09-26 23:13:02 INFO Tearing down throttler

The alter result is

mysql> select * from t1;
+----+------+
| id | name |
+----+------+
|  1 | test |
|  2 | test |
|  3 | test |
|  4 | test |
|  5 | test |
|  6 | test |
|  7 | test |
|  8 | test |
|  9 | test |
| 10 | test |
| 11 | test |
| 12 | test |
| 13 | test |
| 14 | test |
| 15 | test |
| 16 | test |
+----+------+

mysql> select * from _t1_del;
+----+------+
| id | name |
+----+------+
|  1 | test |
|  2 | test |
|  3 | test |
|  4 | test |
|  5 | test |
|  6 | test |
|  7 | test |
|  8 | test |
|  9 | test |
| 10 | test |
| 11 | test |
| 12 | test |
| 13 | test |
| 14 | test |
| 15 | test |
| 16 | test |
| 17 | test |
+----+------+
17 rows in set (0.03 sec)
@shaohk
Copy link
Contributor Author

shaohk commented Nov 1, 2021

👍

@timvaillancourt
Copy link
Collaborator

This issue should mostly be resolved by the merge of #1141

That said, #1040 may further improve data integrity, so I'll leave this open for more consideration

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants
@timvaillancourt @shaohk and others