Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Upgrade from version 0.22.3 to 0.23-beta2 failed database migration #2074

Closed
2 of 4 tasks
mpoindexter opened this issue Aug 22, 2024 · 10 comments · Fixed by #2076
Closed
2 of 4 tasks

[Bug] Upgrade from version 0.22.3 to 0.23-beta2 failed database migration #2074

mpoindexter opened this issue Aug 22, 2024 · 10 comments · Fixed by #2076
Labels
bug Something isn't working

Comments

@mpoindexter
Copy link
Contributor

Is this a support request?

  • This is not a support request

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Upgrading from 0.22.3 to 0.23-beta2 failed on database migration with this error:

{"level":"fatal","error":"ERROR: insert or update on table \"routes\" violates foreign key constraint \"fk_nodes_routes\" (SQLSTATE 23503)","time":1724359938,"message":"Migration failed: ERROR: insert or update on table \"routes\" violates foreign key constraint \"fk_nodes_routes\" (SQLSTATE 23503)"}

Expected Behavior

Migration completes successfully

Steps To Reproduce

Have an existing database where there are rows in the routes table that reference nodes that no longer exist in the nodes table. Not sure how it happened, but seems like I had a number of these, seems like older versions of the code could cause this to happen.

Running delete from routes where node_id not in (select id from nodes); and then attempting to upgrade again solved the issue, but ideally this would be done as part of the migration scripts.

Environment

- OS: any
- Headscale version: 0.23-beta2
- Tailscale version: any

Runtime environment

  • Headscale is behind a (reverse) proxy
  • Headscale runs in a container

Anything else?

No response

@kradalby
Copy link
Collaborator

I think I have fixed this in #2076, do you have the ability to test it?

It is some tech debts from the early days of non structured migrations that keep biting us.

@mpoindexter
Copy link
Contributor Author

In some sense the change in #2076 works in that migration does not fail. It does mean that there is now a difference in the structure of the routes table between a fresh database and my upgraded one - specifically, the foreign key between routes and node does not get created. This doesn't seem ideal, IMO it is best to have the migration scripts remove any invalid data vs continuing to have data that doesn't conform to the intended schema still around and creating problems.

@mpoindexter
Copy link
Contributor Author

I tried as an alternate fix the following code vs only running automigrate if the table does not exist:

+                                       if tx.Migrator().HasTable(&types.Route{}) && tx.Migrator().HasTable(&types.Node{}) {
+                                               err := tx.Exec("delete from routes where node_id not in (select id from nodes)").Error
+                                               if err != nil {
+                                                       return err
+                                               }
+                                       }
+
                                        err = tx.AutoMigrate(&types.Route{})
                                        if err != nil {
                                                return err

and it solved this problem, and did not delete my existing routes (other than the ones that were linked to nodes that did not exist), and I ended up with the correct foreign keys created.

@kradalby
Copy link
Collaborator

kradalby commented Aug 24, 2024

apologise, when I posted my reply to this PR I thought I was posting on #2063, so indeed, this has not been fixed, my apologies. Too late on a friday...

Thank you for investigating, I'll take this and amend the PR.

@kradalby
Copy link
Collaborator

This fix breaks #2063, so I will have to investigate. The whole initial migration is incredibly fragile and horrible, but untangling it is next to impossible considering we dont have databases to verify with.

Do you have an unupgraded 0.22 database you could email me that I can test with?

@mpoindexter
Copy link
Contributor Author

Yes, I have a development one that I have a backup of. I'll update it to remove any sensitive information and send it over today. FWIW, I'm having trouble seeing how the AutoMigrate command would drop rows as described in #2063 - (https://github.com/go-gorm/gorm/blob/v1.25.10/migrator/migrator.go#L121) doesn't seem to have any logic that would drop rows

@kradalby
Copy link
Collaborator

Could you by any chance create the dev one as sqlite? it would make it infinitely easier to make it part of a unit test. And it is our recommended database.

I also do not understand why its dropped, but I'm gonna pay back the missing techdebt by adding tests for migrations, if not only to make sure we dont regress.

@kradalby
Copy link
Collaborator

I've converted it to SQLite and made this integration test:
ac045c3 (#2076)

I am not sure if it ended up correctly, but at least now I got two failing tests. I'll try to continue as I have time, but please feel free to have a look.

@kradalby
Copy link
Collaborator

specifically, the foreign key between routes and node does not get created. This doesn't seem ideal, IMO it is best to have the migration scripts remove any invalid data vs continuing to have data that doesn't conform to the intended schema still around and creating problems.

@mpoindexter I ended up with a combination, now the migration will use your code to clean up, and then only run the automigrate conditionally.

The foreign keys will not be "created" retroactively, and I am not sure if that is a common thing for the db engine to sort out. For now, I wont spent more time on this as its more of a inconvenience and the code does not rely on fk or cascading deletes so its a minor issue.

@mpoindexter
Copy link
Contributor Author

@kradalby I think there is still a problem with how it has been done. I spent a while tracing down why the routes table gets cleared on migration and found the following:

The implication of this, is that with the change done in #2076 the routes rows are not dropped for existing databases, since the automigration is not done that actually creates the constraints with the ON DELETE CASCADE. For new databases, the constraint will still be created, and the next time you drop a column from the nodes table, or add a constraint to that table, the routes table will get cleared. So there's a big trap lurking for you in the future for databases that were not migrated from 0.22.

There is a way to deal with this in sqlite: foreign key processing can be turned off temporarily using a pragma and then turned back on. The GORM migrator does this for alter column (https://github.com/glebarez/sqlite/blob/master/migrator.go#L79), but not for drop column/create constraint/drop constraint. I am not sure why that is and it seems like a bug.

sqlite provides guidance on how to perform these migrations here: https://www.sqlite.org/lang_altertable.html - see section 7. Based on that, I would propose that the right way to solve the problem is to:

  1. Disable foreign key processing prior to running the migrations when using sqlite
  2. Run the migrations as they used to be (i.e. not conditioning the AutoMigrate call on whether the table already exists)
  3. Use PRAGMA foreign_key_check to make sure all foreign key constraints are followed properly post-migration
  4. Turn foreign key processing back on

I have tested this locally, and both test cases you added pass with my changes. Let me know if you would like me to submit a PR for this. I think following this approach will make life a lot easier in the future since you won't have to worry about two different upgrade paths (from a DB that has the constraints created on routes and from a DB that doesn't have the constraints created on routes)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants