Lock the database while executing Migrate() #34115

AndriySvyryd · 2024-06-28T20:23:52Z

Fixes #33731

src/EFCore/Diagnostics/IDiagnosticsLogger.cs

AndriySvyryd · 2024-07-08T21:20:17Z

@davidfowl FYI

src/EFCore.Relational/Migrations/IMigrationDatabaseLock.cs

src/EFCore.Relational/Migrations/HistoryRepository.cs

roji · 2024-07-09T16:14:30Z

src/EFCore.Relational/Migrations/Internal/Migrator.cs

+    ///     any release. You should only use it directly in your code with extreme caution and knowing that
+    ///     doing so can result in application failures when updating to a new Entity Framework Core release.
+    /// </summary>
+    protected virtual TimeSpan LockTimeout { get; } = TimeSpan.FromMinutes(30);


30 minutes seems a bit much maybe? Also, do we plan to allow the user to tweak this via DbContext options or similar?

It's bad if the lock expires before the migration is finished (it would be worse than the current situation since now the app could be relying on this functionality). And in the vast majority of cases, it won't matter if this value is higher than it should be. So, overall, it's better to overshoot. Filed #34196

It's hard for me to imagine scenarios where a 30-minute timeout is actuaslly useful/meaningful... A 30 minute timeout implies that for e.g. 25 minutes there are still common/useful scenarios where waiting still makes sense, since the operation would complete successfully. There are of course some very specific migrations that could take longer, but those types of migrations could just as well take hours (and need to be managed a bit differently in any case).

Another way to look at it, is that it's hard for me to imagine a scenario where the user doesn't manually do a forceful ternmination before the 30 minutes are up. If that's true, that feels like a strong indication that the timeout should be shorter.

There are of course some very specific migrations that could take longer, but those types of migrations could just as well take hours (and need to be managed a bit differently in any case).

We've seen reports of applications with hundreds of migrations. If anything, you are convincing me to increase the timeout

Another way to look at it, is that it's hard for me to imagine a scenario where the user doesn't manually do a forceful termination before the 30 minutes are up.

Forceful termination during migration application is just asking for trouble, it's akin to rebooting while updating your BIOS, because it's taking too long.
The worst case for a longer then needed lock timeout is that the users needs to wait some time before retrying if a very rare catastrophic failure occurred. However, the worst case for a timeout that's too short is a corrupted database schema that needs to be fixed manually.

Made configurable in #34199

One main reason why a timeout might occur, is a network partition or another environmental problem, meaning that the operation will never ever complete. in fact, I think that's the only reason why we have a time out here in the first place, isn't it?

Yes, the timeout is for cases like such, for providers that need to release the lock on the client, like Sqlite, to be able to recover on the next run after the specified amount of time.

It's hard to imagine a provider where the operation would get stuck indefinitely, but where it's still possible to release the lock... Like if it's a network partition, the release will fail anyway; and for sqlite you can't have a network partition in the first place.

But if there's some reason this makes sense for sqlite specifically, we could also make the time out specific to that provider, rather than making it general.

Forceful termination during migration application is just asking for trouble, it's akin to rebooting while updating your BIOS, because it's taking too long.

If this is true, then its really bad. We know and all agree that applying migrations often take a long, long time. This means that people will forcefully terminate it. We can say not too all we want; people will still do it. I would if it had only been going 5 mins and I realized something was missing.

It feels to me like any pattern that is both very slow (we can't change that) and can't be interrupted without "asking for trouble" is going to be a pit of failure.

If this is true, then its really bad. We know and all agree that applying migrations often take a long, long time. This means that people will forcefully terminate it.

We do log when we start applying each migration, so you can see the progress. And this change isn't making this aspect worse.

Remember that in at least some databases, transactions work quite well with migrations, in which case there's no problem with early termination - the database should always be left in a consistent state (I think PG works like this). Of course, in SQL Server certain operations cannot be done within a transaction, and there we have a problem.

We do log when we start applying each migration, so you can see the progress. And this change isn't making this aspect worse.

This definitely helps, but keep in mind scenarios where for example an index is added without the online or concurrent option, to a huge table, and then that's a single DDL statement that could take a very long time. Of course, doing this in a huge production database is probably a very bad idea.

src/EFCore.Relational/Migrations/HistoryRepository.cs

src/EFCore.SqlServer/Migrations/Internal/SqlServerHistoryRepository.cs

src/EFCore.Sqlite.Core/Migrations/Internal/SqliteHistoryRepository.cs

src/Microsoft.Data.Sqlite.Core/SqliteConnection.cs

Fixes #33731

src/EFCore.SqlServer/Migrations/Internal/SqlServerHistoryRepository.cs

src/EFCore.SqlServer/Migrations/Internal/SqlServerMigrationDatabaseLock.cs

src/EFCore.Sqlite.Core/Migrations/Internal/SqliteHistoryRepository.cs

src/EFCore.Sqlite.Core/Migrations/Internal/SqliteMigrationDatabaseLock.cs

roji

LGTM, but I think we should at least bring up the question of the timeout in a design discussion, to see what everyone thinks. But I'm good with merging this now, and possibly revisiting it later and changing the behavior if we end up deciding that.

ajcvickers · 2024-07-11T18:25:55Z

Just FYI @AndriySvyryd @roji this doesn't look good to me, but if you guys are for it, then I won't push back.

AndriySvyryd · 2024-07-11T19:00:43Z

We'll discuss this in a design meeting. @ErikEJ if you have feedback, please speak up

roji · 2024-07-11T20:21:40Z

Just FYI @AndriySvyryd @roji this doesn't look good to me, but if you guys are for it, then I won't push back.

Specifically for the timeout questions, or other things? Just so that we get the discussion right.

ajcvickers · 2024-07-11T20:24:54Z

Mostly because termination of the process is "asking for trouble."

The timeout is also an issue, but secondary to leaving the universe in a bad state if something goes wrong.

AndriySvyryd · 2024-07-11T22:49:30Z

Mostly because termination of the process is "asking for trouble."

As @roji said this only applies to operations that cannot be performed in a transaction and I don't think there's anything meaningful that we can do to make it less risky.

ErikEJ · 2024-07-12T13:32:43Z

Is it possible to opt out of this new behaviour?

I think this goes somewhat against the docs advice of running only generated scripts in production !?

roji · 2024-07-12T13:59:17Z

@ErikEJ re opting out, do you see a possible negative consequence of having the locking behavior?

While we still advise against applying migrationas at program startup, we recognize that a lot of users do so, and with this we aim to make it less problematic - let us know your thoughts.

ErikEJ · 2024-07-12T14:07:44Z

What happens if a lot of web servers boot and wants to run migrations with the new behaviour? If that creates a lot of app locks = consumes server resources it could be a concern.

roji · 2024-07-12T14:16:29Z

Isn't it always bad to run migrations concurrently without a lock? The situation today (i.e. without any protections) seems much worse than the situation after this PR… Also, note that this doesn't create a lot of locks, but rather just one - the same lock is taken by all server instances coming up, in which attempt to apply the migrations. So I can't think of a scenario with this PR makes things worse than they already are - only better...

ErikEJ · 2024-07-12T14:30:56Z

Assume you mean server sessions?

Ok, so what happens for an app that can not get the lock after a timeout? Exception?

roji · 2024-07-12T15:13:15Z

Yes, see above for discussions around the timeout. it's a tricky question, but I think anything is better than just allowing concurrent service sessions to apply migrations and step over each other...

ajcvickers · 2024-07-13T10:24:51Z

The situation today (i.e. without any protections) seems much worse than the situation after this PR…

I disagree. It isn't worse today, because we have consistently and repeatedly told people not to do this because it isn't safe. But now we're building in recommended ways to do things...that are not safe. Not only that, but Aspire is going to force people into using unsafe mechanisms. To me, that's not making things better.

But I won't comment anymore (on the direction we're going), because the team want to go this way and Aspire is forcing it.

roji · 2024-07-13T11:16:51Z

@ajcvickers first, I remember a discussion where I think we all agreed that for providers that have no full transactionality protection for migrations (e.g. SQL Server), there's really nothing we can do to make things reliable - it's just not possible, etc.

That aspect of it also seems really orthogonal to whether people apply migrations at startup, whether locking is applied, etc. In other words, if you're applying migrations today via a migration bundle or SQL script, and there's a network partition, then you're left in an indeterminate, possibly inconsistent state; that's just the way things are (unless the database fully supports transactions). Nothing about Aspire, or this PR really changes that as far as I can tell...

I agree that the lock timeout raises some questions that I'd like us to discuss and find consensus on... But that's only a very specific aspect, whereas you seem to be saying that the whole approach is somehow problematic, and I think it's important for us to understand how/why.

But I won't comment anymore (on the direction we're going), because the team want to go this way and Aspire is forcing it.

I don't think anybody is forcing us to do anything - people have been running migrations on startup despite all the guidance we've tried to give, and my memory of the design meeting is that we agreed to at least improve that scenario. In any case, I do think we should discuss if you think this is wrong.

ajcvickers · 2024-07-13T11:34:16Z

@roji After reviewing the code here, I have reverted to my previous opinion. But it's all good; don't need to block on me. Please proceed.

ajcvickers · 2024-07-13T13:35:17Z

@roji and I had a conversation about this. My main concern here is the unreliability of applying Migrations. In particular, the times we make changes on SQL Server outside of a transaction. If we can stop doing this by default, then I would be much happier with promoting (proving guidance that recommends) a mechanism that executes at runtime in the background such that if it fails it cannot leave the database in a permanently bad state. If, as Shay asserts, these are edge cases only, then a breaking change to stop handling them without an explicit opt-in should not be impactful. On the other hand, if this break is impactful, then it indicates that the reliability is not just an edge case.

Note that this is orthogonal to the timeout, but I do think we need to solve that too.

AndriySvyryd · 2024-07-13T17:46:47Z

a breaking change to stop handling them without an explicit opt-in should not be impactful

If we do this it will fail on the first run that doesn't support transactions, the user will opt-in and then we are just back to the same behavior. Sure, the user might be more aware of the risks, but there's really nothing that they can do to mitigate them.

ErikEJ · 2024-07-14T06:08:34Z

Just to understand: what happens today is that all app instances attempt to apply the same migration at around the same time, either causing schema corruption or runtime errors ??

ajcvickers · 2024-07-14T09:16:19Z

@AndriySvyryd The fundamental point is that the opt-in should only be needed very rarely, otherwise Shay's argument falls down, and I revert to we shouldn't do this at all.

@ErikEJ This particular PR isn't about the reliability issue. The point is that if the reliability issue is so bad that the developer writing it says an abnormal termination is, "asking for trouble" then I don't think we should be going in the direction of providing guidance around this. But if we are going to provide this guidance, then the concurrency issue also needs to be solved, which is what this PR is about.

AndriySvyryd · 2024-07-15T01:15:00Z

otherwise Shay's argument falls down, and I revert to we shouldn't do this at all.

We shouldn't add locking? Even though it's orthogonal? Or we shouldn't allow migration operations without transactions at all? Users will still do this even if we don't provide guidance.

@AndriySvyryd The fundamental point is that the opt-in should only be needed very rarely

Ok, so should we add a warning as error for any non-transactional migration operation, including the ones added manually by the user, as long as it's not the only one in the migration, that says "The migration operation '{operation}' cannot be executed in a transaction. If the app is terminated or an unrecoverable error occurs while this operation is being executed then the migration will be left in a partially applied state and would need to be reverted manually before it can be applied again. Create a separate migration that contains just this operation."?

ErikEJ · 2024-07-15T04:47:58Z

@AndriySvyryd great idea!

roji · 2024-07-15T14:16:59Z

Hey all, I've got a few comments here too, will post them once I'm back at the end of the week...

AndriySvyryd · 2024-07-16T00:49:14Z

Added the warning in #34199

roji · 2024-07-19T14:14:36Z

First, IMHO it's important to clearly distinguish between reliability issues when applying migrations, and concurrency issues; this PR attempts only to address the concurrency issues that arise when multiple server instances start up and attempt to apply migrations at startup in parallel, stepping on each others' toes. It neither improves the reliability situation nor makes it worse; that is, if a non-transactional migration is applied and a network partition occurs, the results are undefined regardless of whether the migration is applied on startup, via a SQL script, a migration bundle, etc.

In my conversation with @ajcvickers he made the point that by improving the concurrency situation, we may be making the situation worse, because any reliability issue is easier to miss when applying the migration at startup (compared to e.g. via a bundle, in a special deployment-time step). I'm not sure to what extent this is true - an error on application startup should take the application down, which should be visible to e.g. devops in a similar way as the failure of a deployment step. And in any case, if a user applies migration without any sort of health checking or mechanism to ensure that the app started successfully, IMHO all bets are off and anything we do here is probably irrelevant - such users aren't going to be deterred by at-startup migrations being unsafe in any case. In other words, I don't think anything is gained by leaving at-startup migrations unsafe concurrency-wise - we're not helping guide users away from there by doing that.

Now, to bring a bit of concreteness to the reliability question (we sometimes fall into the trap of discussing things in the abstract)... Looking at SqlServerMigrationsSqlGenerator, I can see transactions being suppressed in the following cases:

Any DDL operation touching a memory-optimized tables (as documented)
Database create/drop (these aren't "real" migrations and transactionality makes no sense here anyway)
Alter database operations (i.e. change collation, edition). I'm not sure this is strictly necessary - there's some in-memory logic there, and our suppression may be for that, possibly being applied in other, unnecessary cases as well.

In other words, I think we can say that SQL Server supports transactional migrations in the vast majority of cases, and AFAIK so do the other databases; as far as I can tell, the reliability issues we've discussing can generally be considered an edge case. I'd love to be corrected if I've missed something here - I notably haven't done a thorough cross-database investigation; but if things are as they are, than I wouldn't want essentially the SQL Server in-memory table limitation to determine wide-reaching, architectural decisions in how migrations are applied across relational databases, etc.

Finally, I do agree that we need to discuss what exactly we want to achieve with the lock timeout added in this PR, as well as verify that the locks themselves won't create issues (if there's trouble during apply). For example, the SQL Server locking implementation uses sp_getapplock; the docs state that "Locks associated with the session are released when the session is logged out.", which I read to mean that if a network partition occurs, the lock automatically gets released and shouldn't cause any trouble. But we should indeed pay attention to this aspect.

roji · 2024-07-27T08:11:30Z

Just for completeness, in PG there's one DDL operation (AFAIK) that can't be done inside a migration - creation of an index concurrently (in SQL Server this is known as "online" creation (i.e. the table involved is available for regular operations while the index is being created).

AndriySvyryd requested a review from a team June 28, 2024 20:23

ranma42 reviewed Jun 28, 2024

View reviewed changes

src/EFCore/Diagnostics/IDiagnosticsLogger.cs Outdated Show resolved Hide resolved

AndriySvyryd force-pushed the Issue33731 branch 2 times, most recently from 6b4331d to 0560bef Compare June 28, 2024 20:55

AndriySvyryd requested a review from roji July 3, 2024 21:10

roji reviewed Jul 9, 2024

View reviewed changes

AndriySvyryd added the providers-beware label Jul 9, 2024

AndriySvyryd force-pushed the Issue33731 branch from 0560bef to 51f9b63 Compare July 9, 2024 19:18

Lock the database while executing Migrate()

b8f4cc8

Fixes #33731

roji reviewed Jul 10, 2024

View reviewed changes

Review feedback

e05b76c

AndriySvyryd force-pushed the Issue33731 branch from 51f9b63 to e05b76c Compare July 10, 2024 16:40

roji approved these changes Jul 11, 2024

View reviewed changes

AndriySvyryd merged commit 27ca389 into main Jul 11, 2024
7 checks passed

AndriySvyryd deleted the Issue33731 branch July 11, 2024 17:01

roji mentioned this pull request Jul 12, 2024

Implement migration locking npgsql/efcore.pg#3223

Closed

This was referenced Aug 19, 2024

Document migration locking dotnet/EntityFramework.Docs#4783

Open

[release/9.0] Acquire the migrations database lock in a transaction. #34489

Merged

roji mentioned this pull request Aug 29, 2024

Regression: database existence check in HistoryRepository no longer goes through DatabaseCreator #34569

Closed

Lock the database while executing Migrate() #34115

Lock the database while executing Migrate() #34115

Conversation

AndriySvyryd commented Jun 28, 2024

AndriySvyryd commented Jul 8, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

roji left a comment

Choose a reason for hiding this comment

ajcvickers commented Jul 11, 2024

AndriySvyryd commented Jul 11, 2024

roji commented Jul 11, 2024

ajcvickers commented Jul 11, 2024

AndriySvyryd commented Jul 11, 2024

ErikEJ commented Jul 12, 2024

roji commented Jul 12, 2024

ErikEJ commented Jul 12, 2024

roji commented Jul 12, 2024

ErikEJ commented Jul 12, 2024

roji commented Jul 12, 2024

ajcvickers commented Jul 13, 2024

roji commented Jul 13, 2024

ajcvickers commented Jul 13, 2024

ajcvickers commented Jul 13, 2024 • edited Loading

AndriySvyryd commented Jul 13, 2024

ErikEJ commented Jul 14, 2024

ajcvickers commented Jul 14, 2024

AndriySvyryd commented Jul 15, 2024 • edited Loading

ErikEJ commented Jul 15, 2024

roji commented Jul 15, 2024

AndriySvyryd commented Jul 16, 2024

roji commented Jul 19, 2024 • edited Loading

roji commented Jul 27, 2024

ajcvickers commented Jul 13, 2024 •

edited

Loading

AndriySvyryd commented Jul 15, 2024 •

edited

Loading

roji commented Jul 19, 2024 •

edited

Loading