Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lock the database while executing Migrate() #34115

Merged
merged 2 commits into from
Jul 11, 2024
Merged

Lock the database while executing Migrate() #34115

merged 2 commits into from
Jul 11, 2024

Conversation

AndriySvyryd
Copy link
Member

Fixes #33731

@AndriySvyryd AndriySvyryd requested a review from a team June 28, 2024 20:23
@AndriySvyryd AndriySvyryd force-pushed the Issue33731 branch 2 times, most recently from 6b4331d to 0560bef Compare June 28, 2024 20:55
@AndriySvyryd AndriySvyryd requested a review from roji July 3, 2024 21:10
@AndriySvyryd
Copy link
Member Author

@davidfowl FYI

src/EFCore.Relational/Migrations/IMigrationDatabaseLock.cs Outdated Show resolved Hide resolved
src/EFCore.Relational/Migrations/HistoryRepository.cs Outdated Show resolved Hide resolved
/// any release. You should only use it directly in your code with extreme caution and knowing that
/// doing so can result in application failures when updating to a new Entity Framework Core release.
/// </summary>
protected virtual TimeSpan LockTimeout { get; } = TimeSpan.FromMinutes(30);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

30 minutes seems a bit much maybe? Also, do we plan to allow the user to tweak this via DbContext options or similar?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's bad if the lock expires before the migration is finished (it would be worse than the current situation since now the app could be relying on this functionality). And in the vast majority of cases, it won't matter if this value is higher than it should be. So, overall, it's better to overshoot. Filed #34196

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's hard for me to imagine scenarios where a 30-minute timeout is actuaslly useful/meaningful... A 30 minute timeout implies that for e.g. 25 minutes there are still common/useful scenarios where waiting still makes sense, since the operation would complete successfully. There are of course some very specific migrations that could take longer, but those types of migrations could just as well take hours (and need to be managed a bit differently in any case).

Another way to look at it, is that it's hard for me to imagine a scenario where the user doesn't manually do a forceful ternmination before the 30 minutes are up. If that's true, that feels like a strong indication that the timeout should be shorter.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are of course some very specific migrations that could take longer, but those types of migrations could just as well take hours (and need to be managed a bit differently in any case).

We've seen reports of applications with hundreds of migrations. If anything, you are convincing me to increase the timeout

Another way to look at it, is that it's hard for me to imagine a scenario where the user doesn't manually do a forceful termination before the 30 minutes are up.

Forceful termination during migration application is just asking for trouble, it's akin to rebooting while updating your BIOS, because it's taking too long.
The worst case for a longer then needed lock timeout is that the users needs to wait some time before retrying if a very rare catastrophic failure occurred. However, the worst case for a timeout that's too short is a corrupted database schema that needs to be fixed manually.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made configurable in #34199

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One main reason why a timeout might occur, is a network partition or another environmental problem, meaning that the operation will never ever complete. in fact, I think that's the only reason why we have a time out here in the first place, isn't it?

Yes, the timeout is for cases like such, for providers that need to release the lock on the client, like Sqlite, to be able to recover on the next run after the specified amount of time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's hard to imagine a provider where the operation would get stuck indefinitely, but where it's still possible to release the lock... Like if it's a network partition, the release will fail anyway; and for sqlite you can't have a network partition in the first place.

But if there's some reason this makes sense for sqlite specifically, we could also make the time out specific to that provider, rather than making it general.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forceful termination during migration application is just asking for trouble, it's akin to rebooting while updating your BIOS, because it's taking too long.

If this is true, then its really bad. We know and all agree that applying migrations often take a long, long time. This means that people will forcefully terminate it. We can say not too all we want; people will still do it. I would if it had only been going 5 mins and I realized something was missing.

It feels to me like any pattern that is both very slow (we can't change that) and can't be interrupted without "asking for trouble" is going to be a pit of failure.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is true, then its really bad. We know and all agree that applying migrations often take a long, long time. This means that people will forcefully terminate it.

We do log when we start applying each migration, so you can see the progress. And this change isn't making this aspect worse.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remember that in at least some databases, transactions work quite well with migrations, in which case there's no problem with early termination - the database should always be left in a consistent state (I think PG works like this). Of course, in SQL Server certain operations cannot be done within a transaction, and there we have a problem.

We do log when we start applying each migration, so you can see the progress. And this change isn't making this aspect worse.

This definitely helps, but keep in mind scenarios where for example an index is added without the online or concurrent option, to a huge table, and then that's a single DDL statement that could take a very long time. Of course, doing this in a huge production database is probably a very bad idea.

src/EFCore.Relational/Migrations/HistoryRepository.cs Outdated Show resolved Hide resolved
Copy link
Member

@roji roji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I think we should at least bring up the question of the timeout in a design discussion, to see what everyone thinks. But I'm good with merging this now, and possibly revisiting it later and changing the behavior if we end up deciding that.

@AndriySvyryd AndriySvyryd merged commit 27ca389 into main Jul 11, 2024
7 checks passed
@AndriySvyryd AndriySvyryd deleted the Issue33731 branch July 11, 2024 17:01
@ajcvickers
Copy link
Member

Just FYI @AndriySvyryd @roji this doesn't look good to me, but if you guys are for it, then I won't push back.

@AndriySvyryd
Copy link
Member Author

We'll discuss this in a design meeting. @ErikEJ if you have feedback, please speak up

@roji
Copy link
Member

roji commented Jul 11, 2024

Just FYI @AndriySvyryd @roji this doesn't look good to me, but if you guys are for it, then I won't push back.

Specifically for the timeout questions, or other things? Just so that we get the discussion right.

@ajcvickers
Copy link
Member

Mostly because termination of the process is "asking for trouble."

The timeout is also an issue, but secondary to leaving the universe in a bad state if something goes wrong.

@AndriySvyryd
Copy link
Member Author

Mostly because termination of the process is "asking for trouble."

As @roji said this only applies to operations that cannot be performed in a transaction and I don't think there's anything meaningful that we can do to make it less risky.

@ErikEJ
Copy link
Contributor

ErikEJ commented Jul 12, 2024

Is it possible to opt out of this new behaviour?

I think this goes somewhat against the docs advice of running only generated scripts in production !?

@roji
Copy link
Member

roji commented Jul 12, 2024

@ErikEJ re opting out, do you see a possible negative consequence of having the locking behavior?

While we still advise against applying migrationas at program startup, we recognize that a lot of users do so, and with this we aim to make it less problematic - let us know your thoughts.

@ErikEJ
Copy link
Contributor

ErikEJ commented Jul 12, 2024

What happens if a lot of web servers boot and wants to run migrations with the new behaviour? If that creates a lot of app locks = consumes server resources it could be a concern.

@roji
Copy link
Member

roji commented Jul 12, 2024

Isn't it always bad to run migrations concurrently without a lock? The situation today (i.e. without any protections) seems much worse than the situation after this PR… Also, note that this doesn't create a lot of locks, but rather just one - the same lock is taken by all server instances coming up, in which attempt to apply the migrations. So I can't think of a scenario with this PR makes things worse than they already are - only better...

@ErikEJ
Copy link
Contributor

ErikEJ commented Jul 12, 2024

Assume you mean server sessions?

Ok, so what happens for an app that can not get the lock after a timeout? Exception?

@roji
Copy link
Member

roji commented Jul 12, 2024

Yes, see above for discussions around the timeout. it's a tricky question, but I think anything is better than just allowing concurrent service sessions to apply migrations and step over each other...

@ajcvickers
Copy link
Member

The situation today (i.e. without any protections) seems much worse than the situation after this PR…

I disagree. It isn't worse today, because we have consistently and repeatedly told people not to do this because it isn't safe. But now we're building in recommended ways to do things...that are not safe. Not only that, but Aspire is going to force people into using unsafe mechanisms. To me, that's not making things better.

But I won't comment anymore (on the direction we're going), because the team want to go this way and Aspire is forcing it.

@roji
Copy link
Member

roji commented Jul 13, 2024

@ajcvickers first, I remember a discussion where I think we all agreed that for providers that have no full transactionality protection for migrations (e.g. SQL Server), there's really nothing we can do to make things reliable - it's just not possible, etc.

That aspect of it also seems really orthogonal to whether people apply migrations at startup, whether locking is applied, etc. In other words, if you're applying migrations today via a migration bundle or SQL script, and there's a network partition, then you're left in an indeterminate, possibly inconsistent state; that's just the way things are (unless the database fully supports transactions). Nothing about Aspire, or this PR really changes that as far as I can tell...

I agree that the lock timeout raises some questions that I'd like us to discuss and find consensus on... But that's only a very specific aspect, whereas you seem to be saying that the whole approach is somehow problematic, and I think it's important for us to understand how/why.

But I won't comment anymore (on the direction we're going), because the team want to go this way and Aspire is forcing it.

I don't think anybody is forcing us to do anything - people have been running migrations on startup despite all the guidance we've tried to give, and my memory of the design meeting is that we agreed to at least improve that scenario. In any case, I do think we should discuss if you think this is wrong.

@ajcvickers
Copy link
Member

@roji After reviewing the code here, I have reverted to my previous opinion. But it's all good; don't need to block on me. Please proceed.

@ajcvickers
Copy link
Member

ajcvickers commented Jul 13, 2024

@roji and I had a conversation about this. My main concern here is the unreliability of applying Migrations. In particular, the times we make changes on SQL Server outside of a transaction. If we can stop doing this by default, then I would be much happier with promoting (proving guidance that recommends) a mechanism that executes at runtime in the background such that if it fails it cannot leave the database in a permanently bad state. If, as Shay asserts, these are edge cases only, then a breaking change to stop handling them without an explicit opt-in should not be impactful. On the other hand, if this break is impactful, then it indicates that the reliability is not just an edge case.

Note that this is orthogonal to the timeout, but I do think we need to solve that too.

@AndriySvyryd
Copy link
Member Author

a breaking change to stop handling them without an explicit opt-in should not be impactful

If we do this it will fail on the first run that doesn't support transactions, the user will opt-in and then we are just back to the same behavior. Sure, the user might be more aware of the risks, but there's really nothing that they can do to mitigate them.

@ErikEJ
Copy link
Contributor

ErikEJ commented Jul 14, 2024

Just to understand: what happens today is that all app instances attempt to apply the same migration at around the same time, either causing schema corruption or runtime errors ??

@ajcvickers
Copy link
Member

@AndriySvyryd The fundamental point is that the opt-in should only be needed very rarely, otherwise Shay's argument falls down, and I revert to we shouldn't do this at all.

@ErikEJ This particular PR isn't about the reliability issue. The point is that if the reliability issue is so bad that the developer writing it says an abnormal termination is, "asking for trouble" then I don't think we should be going in the direction of providing guidance around this. But if we are going to provide this guidance, then the concurrency issue also needs to be solved, which is what this PR is about.

@AndriySvyryd
Copy link
Member Author

AndriySvyryd commented Jul 15, 2024

otherwise Shay's argument falls down, and I revert to we shouldn't do this at all.

We shouldn't add locking? Even though it's orthogonal? Or we shouldn't allow migration operations without transactions at all? Users will still do this even if we don't provide guidance.

@AndriySvyryd The fundamental point is that the opt-in should only be needed very rarely

Ok, so should we add a warning as error for any non-transactional migration operation, including the ones added manually by the user, as long as it's not the only one in the migration, that says "The migration operation '{operation}' cannot be executed in a transaction. If the app is terminated or an unrecoverable error occurs while this operation is being executed then the migration will be left in a partially applied state and would need to be reverted manually before it can be applied again. Create a separate migration that contains just this operation."?

@ErikEJ
Copy link
Contributor

ErikEJ commented Jul 15, 2024

@AndriySvyryd great idea!

@roji
Copy link
Member

roji commented Jul 15, 2024

Hey all, I've got a few comments here too, will post them once I'm back at the end of the week...

@AndriySvyryd
Copy link
Member Author

Added the warning in #34199

@roji
Copy link
Member

roji commented Jul 19, 2024

First, IMHO it's important to clearly distinguish between reliability issues when applying migrations, and concurrency issues; this PR attempts only to address the concurrency issues that arise when multiple server instances start up and attempt to apply migrations at startup in parallel, stepping on each others' toes. It neither improves the reliability situation nor makes it worse; that is, if a non-transactional migration is applied and a network partition occurs, the results are undefined regardless of whether the migration is applied on startup, via a SQL script, a migration bundle, etc.

In my conversation with @ajcvickers he made the point that by improving the concurrency situation, we may be making the situation worse, because any reliability issue is easier to miss when applying the migration at startup (compared to e.g. via a bundle, in a special deployment-time step). I'm not sure to what extent this is true - an error on application startup should take the application down, which should be visible to e.g. devops in a similar way as the failure of a deployment step. And in any case, if a user applies migration without any sort of health checking or mechanism to ensure that the app started successfully, IMHO all bets are off and anything we do here is probably irrelevant - such users aren't going to be deterred by at-startup migrations being unsafe in any case. In other words, I don't think anything is gained by leaving at-startup migrations unsafe concurrency-wise - we're not helping guide users away from there by doing that.

Now, to bring a bit of concreteness to the reliability question (we sometimes fall into the trap of discussing things in the abstract)... Looking at SqlServerMigrationsSqlGenerator, I can see transactions being suppressed in the following cases:

  • Any DDL operation touching a memory-optimized tables (as documented)
  • Database create/drop (these aren't "real" migrations and transactionality makes no sense here anyway)
  • Alter database operations (i.e. change collation, edition). I'm not sure this is strictly necessary - there's some in-memory logic there, and our suppression may be for that, possibly being applied in other, unnecessary cases as well.

In other words, I think we can say that SQL Server supports transactional migrations in the vast majority of cases, and AFAIK so do the other databases; as far as I can tell, the reliability issues we've discussing can generally be considered an edge case. I'd love to be corrected if I've missed something here - I notably haven't done a thorough cross-database investigation; but if things are as they are, than I wouldn't want essentially the SQL Server in-memory table limitation to determine wide-reaching, architectural decisions in how migrations are applied across relational databases, etc.

Finally, I do agree that we need to discuss what exactly we want to achieve with the lock timeout added in this PR, as well as verify that the locks themselves won't create issues (if there's trouble during apply). For example, the SQL Server locking implementation uses sp_getapplock; the docs state that "Locks associated with the session are released when the session is logged out.", which I read to mean that if a network partition occurs, the lock automatically gets released and shouldn't cause any trouble. But we should indeed pay attention to this aspect.

@roji
Copy link
Member

roji commented Jul 27, 2024

Just for completeness, in PG there's one DDL operation (AFAIK) that can't be done inside a migration - creation of an index concurrently (in SQL Server this is known as "online" creation (i.e. the table involved is available for regular operations while the index is being created).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement database locking for migrations
5 participants