because lock is not released, dropping the cutover sentry table hangs #1171

Closed

lmtwga opened this issue Aug 26, 2022 · 7 comments

Comments

@lmtwga
Contributor

lmtwga commented Aug 26, 2022

reason

  • when atomicCutOver() fails because this.waitForEventsUpToLock() returns a non-nil error

    gh-ost/go/logic/migrator.go

    Lines 668 to 670 in 3f44e04

    if err := this.waitForEventsUpToLock(); err != nil {
        return this.migrationContext.Log.Errore(err)
    }
  • gh-ost then tries to drop the sentry table
  • There are two places that drop the sentry table:
    • First place: atomicCutOver() -> defer func() -> dropCutOverSentryTableOnce.Do -> this.applier.DropAtomicCutOverSentryTableIfExists()

      gh-ost/go/logic/migrator.go

      Lines 644 to 649 in 3f44e04

      defer func() {
          okToUnlockTable <- true
          dropCutOverSentryTableOnce.Do(func() {
              this.applier.DropAtomicCutOverSentryTableIfExists()
          })
      }()
    • Second place: AtomicCutOverMagicLock() -> dropCutOverSentryTableOnce.Do -> tx.Exec(query)

      gh-ost/go/logic/applier.go

      Lines 974 to 984 in 3f44e04

      query = fmt.Sprintf(`drop /* gh-ost */ table if exists %s.%s`,
          sql.EscapeName(this.migrationContext.DatabaseName),
          sql.EscapeName(this.migrationContext.GetOldTableName()),
      )
      dropCutOverSentryTableOnce.Do(func() {
          if _, err := tx.Exec(query); err != nil {
              this.migrationContext.Log.Errore(err)
              // We DO NOT return here because we must `UNLOCK TABLES`!
          }
      })
  • But both places guard the drop with the same Once.Do(). Which one executes first depends on luck!
  • Unfortunately, when the first place (atomicCutOver()'s defer, calling this.applier.DropAtomicCutOverSentryTableIfExists()) executes first, the drop hangs: the lock on the sentry table has not yet been released, and the connection pool is capped by db.SetMaxOpenConns(3)

    gh-ost/go/mysql/utils.go

    Lines 56 to 57 in 3f44e04

    db.SetMaxOpenConns(MaxDBPoolConnections)
    db.SetMaxIdleConns(MaxDBPoolConnections)

    so the drop issued from the first place runs on a different MySQL thread and hangs until the database connection times out (see the sketch after this list).
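
A minimal, self-contained Go sketch of the failure pattern (illustrative, not gh-ost code): two goroutines race on one sync.Once, a channel stands in for the MySQL table lock, and a short sleep forces the losing interleaving. Run it and the Go runtime reports the deadlock; remove the sleep and the locking side sometimes wins, matching "depends on luck" above.

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var dropSentryOnce sync.Once
	tableLock := make(chan struct{}) // stands in for the table lock held by the locking session
	okToUnlockTable := make(chan bool, 1)

	// Plays the role of atomicCutOver's deferred cleanup (the first place).
	go func() {
		okToUnlockTable <- true
		dropSentryOnce.Do(func() {
			fmt.Println("cutover side: DROP TABLE on another connection; blocks on the table lock")
			<-tableLock // never closed: the lock holder is stuck below
		})
	}()

	// Plays the role of AtomicCutOverMagicLock (the second place), which holds the lock.
	<-okToUnlockTable
	time.Sleep(100 * time.Millisecond) // force the losing interleaving for the demo
	dropSentryOnce.Do(func() {
		fmt.Println("locking side: DROP TABLE in the locking session") // never runs
	})
	close(tableLock) // unreachable: Do above waits for the cutover side's f, which waits for us
}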

behavior

  • dropping the _del table hangs until the database connection times out
    [screenshot]

  • Partial execution log

Copy: 54059/54059 100.0%; Applied: 0; Backlog: 0/1000; Time: 9s(total), 8s(copy); streamer: mysql-bin.037492:232432138; Lag: 0.30s, HeartbeatLag: 0.80s, State: migrating; ETA: due
2022-04-07 19:25:31 INFO Row copy complete
Copy: 54059/54059 100.0%; Applied: 0; Backlog: 0/1000; Time: 9s(total), 8s(copy); streamer: mysql-bin.037492:232432138; Lag: 0.30s, HeartbeatLag: 0.80s, State: migrating; ETA: due
2022-04-07 19:25:31 INFO Grabbing voluntary lock: gh-ost.84571662.lock
2022-04-07 19:25:31 INFO Setting LOCK timeout as 6 seconds
2022-04-07 19:25:31 INFO Looking for magic cut-over table
2022-04-07 19:25:31 INFO Creating magic cut-over table `gallery`.`_time_info__tablet_1923_20220407192522_del`
2022-04-07 19:25:31 INFO Magic cut-over table created
2022-04-07 19:25:31 INFO Locking `gallery`.`time_info__tablet_1923`, `gallery`.`_time_info__tablet_1923_20220407192522_del`
2022-04-07 19:25:31 INFO Tables locked
2022-04-07 19:25:31 INFO Session locking original & magic tables is 84571662
2022-04-07 19:25:31 INFO Writing changelog state: AllEventsUpToLockProcessed:1649330731511157653
2022-04-07 19:25:31 INFO Waiting for events up to lock
Copy: 54059/54059 100.0%; Applied: 0; Backlog: 0/1000; Time: 10s(total), 8s(copy); streamer: mysql-bin.037492:232432138; Lag: 1.30s, HeartbeatLag: 1.80s, State: migrating; ETA: due
Copy: 54059/54059 100.0%; Applied: 0; Backlog: 0/1000; Time: 11s(total), 8s(copy); streamer: mysql-bin.037492:232432138; Lag: 2.30s, HeartbeatLag: 2.80s, State: migrating; ETA: due
Copy: 54059/54059 100.0%; Applied: 0; Backlog: 0/1000; Time: 12s(total), 8s(copy); streamer: mysql-bin.037492:232432138; Lag: 3.30s, HeartbeatLag: 3.80s, State: migrating; ETA: due
2022-04-07 19:25:34 ERROR Timeout while waiting for events up to lock
2022-04-07 19:25:34 ERROR 2022-04-07 19:25:34 ERROR Timeout while waiting for events up to lock
2022-04-07 19:25:34 INFO Looking for magic cut-over table
2022-04-07 19:25:34 INFO Will now proceed to drop magic table and unlock tables
2022-04-07 19:25:34 INFO Dropping magic cut-over table
2022-04-07 19:25:34 INFO Dropping magic cut-over table
2022-04-07 19:25:34 INFO Dropping table `gallery`.`_time_info__tablet_1923_20220407192522_del`
Copy: 54059/54059 100.0%; Applied: 0; Backlog: 0/1000; Time: 13s(total), 8s(copy); streamer: mysql-bin.037492:232432138; Lag: 4.30s, HeartbeatLag: 4.80s, State: migrating; ETA: due
Copy: 54059/54059 100.0%; Applied: 0; Backlog: 0/1000; Time: 14s(total), 8s(copy); streamer: mysql-bin.037492:236853172; Lag: 5.30s, HeartbeatLag: 5.80s, State: migrating; ETA: due

solution

The function that creates the sentry table should also be the function that drops it. Is that right?
#1180
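
A minimal sketch of that idea (illustrative only; the actual change is in #1180): the session that created the sentry table, and that holds the table lock, is the only one that ever drops it, so the drop can never block on its own LOCK TABLES.

package cutover

import "log"

// magicLockSession sketches the shape of the fix, not the actual #1180 diff:
// the goroutine that created the sentry table and holds the table lock is the
// only one that drops it; the cutover goroutine only signals on okToUnlockTable.
func magicLockSession(okToUnlockTable <-chan bool, dropSentry, unlockTables func() error) error {
	<-okToUnlockTable // cutover finished (or failed); safe to clean up
	if err := dropSentry(); err != nil {
		log.Printf("dropping sentry table failed: %v", err) // do not return: we must still UNLOCK TABLES
	}
	return unlockTables()
}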

@lmtwga
Contributor Author

lmtwga commented Aug 26, 2022

@shlomi-noach Please review when you have time.
The PR: #1180

@lmtwga
Contributor Author

lmtwga commented Aug 29, 2022

@shlomi-noach Please review when you have time, thank you very much!

@lmtwga
Contributor Author

lmtwga commented Sep 1, 2022

@timvaillancourt @dm-2 @jsoref Please review when you have time, thank you very much!

@dm-2
Contributor

dm-2 commented Sep 6, 2022

👋 @lmtwga, thanks for the issue and PR!

Nice debugging! Just to summarise what's happening:

  1. If the atomicCutOver function exits because of an error (i.e. before true is sent to the okToUnlockTable channel), it executes its defer block whilst the AtomicCutOverMagicLock function still holds the table lock that includes the sentry table
  2. The defer block for atomicCutOver sends true to the okToUnlockTable channel, causing a race to the dropCutOverSentryTableOnce-protected drop of the sentry table in both the atomicCutOver and AtomicCutOverMagicLock functions
  3. If atomicCutOver wins the race, the drop table is blocked by the lock held by AtomicCutOverMagicLock, and AtomicCutOverMagicLock is blocked because sync.Once.Do(f) does not return until the first execution of f completes (which here is the blocked drop table), causing a deadlock until a database connection times out or is disconnected (see the sketch below)
  4. If AtomicCutOverMagicLock wins the race, the sentry table is dropped successfully and no deadlock occurs

The deadlock only happens after we've failed the cut-over due to an error, so it has no impact on successful gh-ost migrations.

The impact of this deadlock scenario is that it will delay the retry of failed cut-overs if atomicCutOver wins the race detailed in (2) above.
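
Step 3 hinges on a documented property of sync.Once: no call to Do returns until the one call to f has returned. A tiny self-contained demonstration:

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var once sync.Once
	go once.Do(func() {
		fmt.Println("first caller: f is running (simulating the blocked DROP TABLE)")
		time.Sleep(2 * time.Second)
	})
	time.Sleep(100 * time.Millisecond) // let the first caller enter Do
	start := time.Now()
	once.Do(func() { fmt.Println("never runs") })
	fmt.Printf("second caller was blocked for %v waiting on the first f\n", time.Since(start))
}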

@lmtwga
Contributor Author

lmtwga commented Sep 16, 2022

@dm-2
Thank you very much for your reply!
I totally agree with what you said above.
In fact, if this situation occurs, it may also affect the online service, because AtomicCutOverMagicLock locks two tables at once.

gh-ost/go/logic/applier.go

Lines 943 to 947 in 3f44e04

query = fmt.Sprintf(`lock /* gh-ost */ tables %s.%s write, %s.%s write`,
    sql.EscapeName(this.migrationContext.DatabaseName),
    sql.EscapeName(this.migrationContext.OriginalTableName),
    sql.EscapeName(this.migrationContext.DatabaseName),
    sql.EscapeName(this.migrationContext.GetOldTableName()),
The lock blocks all subsequent operations on these tables. For example, in the figure below, the operations in the green boxes are blocked:
[screenshot]
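
For illustration, a hedged Go sketch of that blocking behavior using database/sql with the go-sql-driver/mysql driver (the DSN and table names are placeholders, not gh-ost's): a second pooled connection blocks on a simple read until the locking session runs UNLOCK TABLES.

package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	ctx := context.Background()
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/gallery") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}

	// Session 1: take write locks on both tables, as AtomicCutOverMagicLock does.
	locker, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := locker.ExecContext(ctx, "lock tables time_info write, _time_info_del write"); err != nil {
		log.Fatal(err)
	}

	// Release the locks from session 1 after two seconds.
	go func() {
		time.Sleep(2 * time.Second)
		locker.ExecContext(ctx, "unlock tables")
	}()

	// Session 2 (a different pooled connection): even a simple read on a
	// locked table blocks until session 1 runs UNLOCK TABLES.
	start := time.Now()
	if _, err := db.ExecContext(ctx, "select 1 from time_info limit 1"); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("read waited %v for the table lock\n", time.Since(start))
}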

@dm-2
Contributor

dm-2 commented Oct 21, 2022

(quoting @lmtwga's previous comment in full)

Good point! I'll take a look at the PR again 👍

@meiji163
Contributor

Fixed by #1180

btruhand added a commit to Faire/gh-ost that referenced this issue Oct 20, 2023
…to lock

This PR contains a fix for the deadlock issue described in github#992 (also discussed in github#1171). The fix was introduced in the original repo in https://github.com/github/gh-ost/pull/1180/files; this PR cherry-picks that fix.