
update: Parallel DB locks watcher #38

Merged
merged 2 commits into from
Oct 1, 2019

Conversation

yajo
Contributor

@yajo yajo commented Feb 8, 2019

With this patch, if you update a database while another Odoo instance is running (such as a production instance), the production instance will not stay blocked just because there's a DB lock.

DB locks can happen, e.g., when two or more transactions run in parallel and one of them wants to modify data in a field while another one is modifying the field itself. For example, imagine that a user is modifying a `res.partner`'s name while another update process is adding a `UNIQUE` constraint on the `name` field of the `res_partner` table. This produces a deadlock where each transaction waits for the other one to finish, so both the production instance and the update instance stay locked indefinitely until one of them is aborted.
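To make the conflicting pair concrete, here is an illustrative sketch (not part of the patch; the partner id and constraint name are hypothetical) of the two SQL statements that can lock each other when issued from parallel transactions:

```python
# Illustration only: the two statements from the example above.
# The id (42) and the constraint name are hypothetical.

# Transaction 1: a production user renames a partner.
# This takes a row-level lock on that res_partner row.
USER_SQL = "UPDATE res_partner SET name = 'New Name' WHERE id = 42"

# Transaction 2: the update process alters the schema.
# ALTER TABLE needs an ACCESS EXCLUSIVE lock on the whole table,
# so each transaction can end up waiting on the other indefinitely.
UPDATE_SQL = (
    "ALTER TABLE res_partner "
    "ADD CONSTRAINT res_partner_name_unique UNIQUE (name)"
)
```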

You cannot detect such a problem with common tools such as timeouts, because a query can be slow without actually being locked: for example, when you update an addon whose pre-update migration script performs lots of work, or when your queries or server are not optimized and perform slowly.

So, the only way to detect deadlocks is to open a separate DB cursor, outside any transaction, that watches the other cursors' transactions and their locks.

With this change, this is what happens now behind the scenes:

  • The DB update process is spawned in background.
  • The foreground process uses a separate watcher cursor and watches for locks.
  • If a lock is detected, the update process is aborted, giving priority to the other cursors. This is by design: your production users always have priority, and all that happens is that the update transaction is rolled back, so you can simply retry the update later.
  • A couple of CLI parameters allow you to modify the watcher behavior.
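A minimal sketch of the watcher idea (my illustration, not the actual `update.py` code; the `pg_locks` query, polling interval, and function name are assumptions):

```python
import time

# Hypothetical query: is the update backend (identified by its
# PostgreSQL pid) currently waiting on a lock it has not been granted?
WATCH_QUERY = """
SELECT pid FROM pg_locks
WHERE NOT granted AND pid = %(update_pid)s
"""

def watch_for_locks(conn, update_pid, stop, poll_seconds=5):
    """Poll pg_locks from a separate, transactionless cursor; if the
    update backend is ever seen waiting, terminate it and report True."""
    conn.autocommit = True  # the watcher must not open a transaction
    with conn.cursor() as cr:
        while not stop.is_set():
            cr.execute(WATCH_QUERY, {"update_pid": update_pid})
            if cr.fetchall():
                # Production wins: kill the update backend so its
                # transaction rolls back and production proceeds.
                cr.execute("SELECT pg_terminate_backend(%s)", (update_pid,))
                return True
            time.sleep(poll_seconds)
    return False
```

Here `conn` would be a connection opened specifically for the watcher, and `stop` a `threading.Event` that is set once the update finishes normally.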

BTW, the changes in pre-commit are because I'm on Fedora, where the py3 version is 3.7. I guess this change will help other contributors avoid pre-commit failures too...

@Tecnativa

@yajo yajo force-pushed the update-parallel-db-lock-watcher branch 2 times, most recently from 6a7e4c7 to d463077 on February 8, 2019 10:54
@codecov

codecov bot commented Feb 8, 2019

Codecov Report

Merging #38 into master will decrease coverage by 1.02%.
The diff coverage is 76.31%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #38      +/-   ##
==========================================
- Coverage   92.91%   91.89%   -1.03%     
==========================================
  Files          13       13              
  Lines         706      740      +34     
  Branches      115      118       +3     
==========================================
+ Hits          656      680      +24     
- Misses         36       44       +8     
- Partials       14       16       +2
Impacted Files Coverage Δ
click_odoo_contrib/update.py 88.99% <76.31%> (-7.01%) ⬇️
click_odoo_contrib/backupdb.py 98.21% <0%> (-1.79%) ⬇️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2187bda...d463077. Read the comment docs.

@codecov

codecov bot commented Feb 8, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@99fb8d5).
The diff coverage is 64.58%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master      #38   +/-   ##
=========================================
  Coverage          ?   91.46%           
=========================================
  Files             ?       13           
  Lines             ?      750           
  Branches          ?      123           
=========================================
  Hits              ?      686           
  Misses            ?       51           
  Partials          ?       13
Impacted Files Coverage Δ
click_odoo_contrib/update.py 85.12% <64.58%> (ø)

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 99fb8d5...02b4ee0. Read the comment docs.

@yajo
Contributor Author

yajo commented Feb 8, 2019

Please set as WIP for now.

@lmignon lmignon added the WIP label Feb 8, 2019
@yajo yajo force-pushed the update-parallel-db-lock-watcher branch 3 times, most recently from f7ce758 to d2ebe51 on February 12, 2019 09:02
@yajo
Contributor Author

yajo commented Feb 12, 2019

This shouldn't be WIP anymore; it's ready for review.

I published https://youtu.be/20bUjtO_Mjk so you can see what this is doing in practice. Basically, it allows safely updating in parallel with a running production database, without disturbing other use cases.

I'd appreciate some help turning Travis ✔️, because the errors seem unrelated, or at least I don't understand the logs.

Thanks.

yajo added a commit to Tecnativa/doodba that referenced this pull request Feb 12, 2019
This is a preview feature.

Before #200, the only way to autoupdate addons was to use [OCA's `module_auto_update` module](https://www.odoo.com/apps/modules/11.0/module_auto_update/). As a preview feature, I'm pre-merging acsone/click-odoo-contrib#38 here in Doodba, to allow trusted parallel upgrades everywhere.

However, this should be considered a beta feature.

This commit should be reverted when the above PR is merged in upstream click-odoo-contrib package.
yajo added a commit to Tecnativa/doodba that referenced this pull request Feb 12, 2019
This is a preview feature.

Before #200, the only way to autoupdate addons was to use [OCA's `module_auto_update` module](https://www.odoo.com/apps/modules/11.0/module_auto_update/). As a preview feature, I'm pre-merging acsone/click-odoo-contrib#38 here in Doodba, to allow trusted parallel upgrades everywhere.

However, this should be considered a beta feature.

This commit should be reverted when the above PR is merged in upstream click-odoo-contrib package.

Closes #160.
@blaggacao
Contributor

blaggacao commented Feb 12, 2019

Firstly, here is what I understand of the spec:

  • Be able to update a production db while it is running in production.

I'm not the code owner, but I'm an interested third party. So, going over that spec:

  • Is running updates on live production database instances really a strategy you want to employ (as opposed to reusing a maintenance window, however short and ad hoc it might be)?
  • Note: there are rolling updates, so this maintenance window can be database-specific rather than cluster-specific (which is about the only reason I can imagine for formulating the purported spec this way in the first place).

Please correct me if I misunderstood anything.

@yajo
Contributor Author

yajo commented Feb 13, 2019

Yes, that's the purpose. I don't want to ask my customers whether I can update; I want to update at any time without them noticing at all. HA, basically. Of course, this feature must be combined with some kind of rolling-update system to be helpful, but that's outside this PR's scope.

@yajo yajo force-pushed the update-parallel-db-lock-watcher branch from d2ebe51 to ac74926 on February 13, 2019 08:26
@yajo
Contributor Author

yajo commented Feb 13, 2019

I can't understand Travis... Where's the failure? Thanks!

@sbidoul
Member

sbidoul commented Feb 13, 2019

I don't understand it either... no time to dig into this now unfortunately.

@lmignon
Member

lmignon commented Feb 13, 2019

Strange, but without more details in the log it's not possible to understand the problem. Since the error occurs in code I've written, I'll try to find some time to dig in... (but not today...)

@yajo
Contributor Author

yajo commented Feb 13, 2019

I guess we should increase logging in Travis. I can try, if you tell me how.

@blaggacao
Contributor

@yajo Got it. It looks like a double-edged sword to me, and I'm not sure it would convince me as the right approach. I repeat, I'm not the code owner, so I'm just a vocal spectator. You can just ignore me if you want. 😉

But it's still valuable to understand your thought process. For example, I cannot really figure out what would happen if an update transaction updated the schema while another user transaction still relied on a previous schema state. Which of the two transactions would fail? Is that possible at all?

Please share your insights. 😉

As far as my current knowledge goes, it seems to me that updates without schema changes could be executed safely. The problem I see is that this property of an update is opaque during the Odoo update process: we cannot scope such "hot" updates to non-schema changes.

I might just as likely be completely wrong in my analysis. Please let me know! 😄

@yajo
Contributor Author

yajo commented Feb 13, 2019

what would happen if an update transaction would update the schema while another user transaction still relies on a previous schema state

Well, that's precisely the case we're handling here.

What would happen is that both cursors would become locked, each one waiting forever for the other to finish. You can see a small example in the video above.

We have been updating in parallel for years now, and it works pretty well unless this precise situation happens, because most updates do not alter the schema, or at least not while a user is using the precise field or table that is being altered.

The solution we had developed for this problem was simply to keep our eyes peeled on the logs and, if the update got stuck, abort and restart the update process. Far from ideal or maintainable! This patch replaces our eyes with Python code that spawns a watcher for locks and aborts the update if they happen.

So, while you can still stop ➡️ update ➡️ start as before, you are now able to update (🔁 repeat if it failed) ➡️ restart.

The final update time can be higher, but the downtime will be almost nothing.


As a side note, to help you understand: the production Odoo process would be running on a different codebase than the update process. You can use separate installations or containers for that purpose. For instance, using containers:

  1. Production container `PC_A`, based on image `I_A`, is running Odoo in production.
  2. Here you'd trigger a backup, just in case.
  3. A new update container `UC_B`, based on image `I_B`, is spawned, and it runs `click-odoo-update` against the same database instance as `PC_A`.
  4. If `UC_B` exits with failure, the update has failed. Repeat step 3 as many times as needed.
  5. If `UC_B` exits normally, the update has finished properly. Replace `PC_A` with a new container `PC_B`, based on image `I_B`.

The only downtime users would experience is in the last step, and even that can be removed by running more replicas of `PC_A` and replacing them gradually. Of course this is not a one-size-fits-all solution, but at least it opens the door to real HA in Odoo, which has been closed for centuries.
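The repeat-until-success loop in step 4 above could be scripted along these lines (a sketch; the actual command is whatever runs `click-odoo-update` in your setup, and the retry counts are arbitrary):

```python
import subprocess
import time

def update_until_success(command, max_tries=5, wait_seconds=10):
    """Run the update command until it exits 0 (step 5); retry on
    failure (step 4), e.g. after the lock watcher aborted it."""
    for attempt in range(1, max_tries + 1):
        if subprocess.call(command) == 0:
            return attempt  # update finished cleanly
        time.sleep(wait_seconds)  # locks were hit; retry later
    raise RuntimeError("update failed %d times; aborting" % max_tries)
```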

@blaggacao
Contributor

blaggacao commented Feb 13, 2019

@yajo Understanding what you said, then, I guess this line is utterly confusing: https://github.com/acsone/click-odoo-contrib/pull/38/files#diff-b241a31b4530ee322670fff4974f8be1R33

"Watch DB while another process is updating Odoo"

Maybe you'd say: "Watch DB for existing locks while this process tries to update Odoo"

Would that be more assertive?


As for my motivation:

In Odoo Operator I'm implementing a slightly different workflow:

  1. Trigger "Stage Migration"
  2. Backup A
  3. Copy A -> B
  4. Migrate B.
  5. Expose B to the client for validation
  6. Trigger "Migration"
  7. Set A in maintenance (splash screen)
  8. Backup A
  9. Copy A -> B
  10. Migrate B
  11. Permute A / B wiring at all necessary levels

For sure, most tasks are executed as a separate job (or container, if you want).

If a version bump object is marked as `bugfix=true`, the workflow is shorter:

  1. Backup A
  2. Migrate A (which in most cases for bugfixes would be a no-op)
  3. Done

I think your idea is interesting for schema-altering "bugfixing" workflows, to cover those corner cases.

I'm still not fully sure how or whether this helps, since the migration is not split into several transactions; the transaction span is really the TTL of the migration, and for the migration to succeed it would need to block other users (which is downtime).

Or maybe I did not understand this part:

  • Is your strategy to retry migrations until there is a sufficiently large time span for the migration to succeed, always giving preference to user connections?

EDIT: I guess this line answers it: https://github.com/acsone/click-odoo-contrib/pull/38/files#diff-b241a31b4530ee322670fff4974f8be1R278

Maybe the other method doc should be updated, indeed. Or maybe it's just me being stupid 😉

EDIT2: Probably it's me being stupid 😉

@blaggacao
Contributor

blaggacao commented Feb 13, 2019

@yajo If I was more professional than I actually am, I'd probably just ask: would you mind sharing a spec? 😉

@yajo
Contributor Author

yajo commented Feb 14, 2019

About your 1st comment, I see you answered everything yourself. About the 2nd... 😁 If I were more professional, I'd understand what a spec is!

In any case, the point of this PR is just to open a new door for us. You are still able to work as before, changing nothing (because the watcher will never hit a lock anyway if you don't update in parallel), so whether you use Operator, Doodba, Ansible or Windows 98, this is just a new door you can choose to walk through if you want and know the implications.

I think we shouldn't clutter this PR any more with specific implementations. 🤔 That discussion fits better in downstream infrastructure projects, don't you think?


I see that I should update the README too. I'd like someone from @acsone to help me with that nasty ❌ first, though.

@lmignon
Member

lmignon commented Feb 14, 2019

@yajo For sure, I'll check what's wrong with Travis in the next few days... 😏 I'll keep you informed once I've fixed the problem.

@yajo yajo force-pushed the update-parallel-db-lock-watcher branch from ac74926 to 9cf0348 on February 14, 2019 11:49
@yajo
Contributor Author

yajo commented Mar 11, 2019

@blaggacao The scope of this improvement is to provide a tool, as stated before. I don't want to document best practices on how to use this in the wild here, if possible, because many different deployment scenarios can handle it differently. What it does is simple and explained; how you use it is up to you.

Also keep in mind that Odoo is anti-HA by design; I'm just trying to get as close as possible. The main point here is to reduce downtime to almost nothing.

As always, I'm having a difficult time understanding your English! 😅 Please write simpler, more direct questions if anything else is left.

Thanks!

@blaggacao
Contributor

@yajo Oh, I'm sorry for my English. Besides sharing my line of thought, the only actionable thing was the request to embed the use cases in the docs of this PR: for good code, I've read, documenting the why is more important than the what, isn't it?

Also, for others to be able to validate the why (the "specification"), it needs to be documented somehow. As you may see, I'm still struggling to get a constructively critical hold on this (in the sense Karl Popper stands for). That's probably the root cause of my English getting complicated. 😉

@blaggacao
Contributor

blaggacao commented Mar 11, 2019

If you feel I'm being too querulous here, please feel free to ignore. Just know that my true intention is Popperian. 😜

@yajo yajo force-pushed the update-parallel-db-lock-watcher branch from 9b25cfb to af4674f on March 12, 2019 09:10
@yajo
Contributor Author

yajo commented Mar 12, 2019

I refurbished the script's exegesis to keep it abreast of this new feature's natal motivation. I hope it appeases your Popperian querulousness.

@blaggacao
Contributor

blaggacao commented Mar 18, 2019

Exquisitely clear and tasty so! 😉

In general: `sed 's/lock/dead-lock/'`? At least that's my understanding of it now.

@yajo yajo force-pushed the update-parallel-db-lock-watcher branch 2 times, most recently from 876ac88 to 02b4ee0 on March 28, 2019 09:03
@yajo yajo force-pushed the update-parallel-db-lock-watcher branch from 02b4ee0 to 2ebbca9 on June 19, 2019 09:03
@yajo yajo force-pushed the update-parallel-db-lock-watcher branch from 2ebbca9 to b7b1bdc on July 11, 2019 08:39
@yajo
Contributor Author

yajo commented Jul 11, 2019

I recently faced a failure where the lock watcher aborted a query, but that allowed Odoo to continue updating. I changed it to use `pg_terminate_backend()` to close the cursor, instead of `pg_cancel_backend()`, which was just cancelling the query. I also added a system that detects whether the watcher aborted the main thread, and always returns an error in that case.
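For reference, the difference between the two PostgreSQL admin functions (shown here as an illustration, not the patch's exact code):

```python
# pg_cancel_backend() cancels only the backend's current query; the
# connection survives, so Odoo's update cursor could carry on.
CANCEL_SQL = "SELECT pg_cancel_backend(%(pid)s)"

# pg_terminate_backend() closes the whole backend connection, so the
# update transaction is rolled back and the update process errors out.
TERMINATE_SQL = "SELECT pg_terminate_backend(%(pid)s)"
```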

@yajo
Contributor Author

yajo commented Aug 20, 2019

It seems the failure is not related to this script.

This has been working fine in production for 5 months now. The latest patches work fine. Is there anything else left before it can be merged?

@sbidoul sbidoul removed the WIP label Aug 30, 2019
@sbidoul
Member

sbidoul commented Aug 30, 2019

@yajo yes, I'll need to fix the tests and then I'll merge. I removed the WIP label.

@yajo
Contributor Author

yajo commented Aug 30, 2019

❤️ Thanks!

@sbidoul sbidoul added the enhancement New feature or request label Aug 30, 2019
@sbidoul
Member

sbidoul commented Aug 31, 2019

@yajo could you rebase? master is green.

With this patch, if you update a database while another Odoo instance is running (such as a production instance), the production instance will not stay blocked just because there's a DB lock.

DB locks can happen, e.g., when two or more transactions run in parallel and one of them wants to modify data in a field while another one is modifying the field itself. For example, imagine that a user is modifying a `res.partner`'s name while another update process is adding a `UNIQUE` constraint on the `name` field of the `res_partner` table. This produces a deadlock where each transaction waits for the other one to finish, so both the production instance and the update instance stay locked indefinitely until one of them is aborted.

You cannot detect such a problem with common tools such as timeouts, because a query can be slow without actually being locked: for example, when you update an addon whose pre-update migration script performs lots of work, or when your queries or server are not optimized and perform slowly.

So, the only way to detect deadlocks is to open a separate DB cursor, outside any transaction, that watches the other cursors' transactions and their locks.

With this change, this is what happens now behind the scenes:

- The DB lock watcher process is spawned in the background, using a separate watcher cursor to watch for locks.
- The foreground process starts updating Odoo.
- If a lock is detected, the update process is aborted, giving priority to the other cursors. This is by design: your production users always have priority, and all that happens is that the update transaction is rolled back, so you can simply retry the update later.
- A couple of CLI parameters allow you to modify the watcher behavior, or completely disable it.

Keep in mind that an update in Odoo issues several commits, so before starting the parallel update you must make sure the production server is running in a mode that won't reload workers and, if using Odoo < 10, that won't launch cron jobs.
@yajo yajo force-pushed the update-parallel-db-lock-watcher branch from b7b1bdc to 5c385d2 on September 2, 2019 07:46
@codecov-io

codecov-io commented Sep 2, 2019

Codecov Report

Merging #38 into master will decrease coverage by 1.67%.
The diff coverage is 68.85%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #38      +/-   ##
==========================================
- Coverage   92.84%   91.16%   -1.68%     
==========================================
  Files          13       13              
  Lines         713      770      +57     
  Branches      117      126       +9     
==========================================
+ Hits          662      702      +40     
- Misses         37       53      +16     
- Partials       14       15       +1
Impacted Files Coverage Δ
click_odoo_contrib/update.py 85.07% <68.85%> (-11.03%) ⬇️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 784b89a...63df1c7. Read the comment docs.

@yajo
Contributor Author

yajo commented Sep 2, 2019

Done, thanks!

@yajo
Contributor Author

yajo commented Sep 9, 2019

So? It's green now 😊 💚

@sbidoul
Member

sbidoul commented Sep 23, 2019

@yajo I'm just back from holidays. What kept me from clicking merge before leaving was that I noticed the global variables, which I hadn't seen before (sorry about that). Do you think you could easily get rid of them, perhaps by implementing the watcher as a Thread subclass overriding run()?

@yajo
Contributor Author

yajo commented Sep 24, 2019

I hope you enjoyed your holidays! 🏖️ 😎

The problem is that the watcher thread needs to know whether the main thread wants it to continue watching, and the main thread needs to know whether the watcher thread has aborted it; both react to conditions in the other, so they need to share these two flags, and the easiest way I found was a couple of global variables.

How would a Thread subclass save us from that need? And if that solution is more complicated, is it worth it? 🤔 Simpler is better, IMHO...
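For what it's worth, one way a `Thread` subclass could carry the two flags without module globals (a sketch under my own naming, not the code that was merged):

```python
import threading

class LockWatcher(threading.Thread):
    """Carry both shared flags as threading.Event attributes:
    _stop_requested flows main -> watcher, aborted flows watcher -> main."""

    def __init__(self, check_locks, poll_seconds=1.0):
        super().__init__(daemon=True)
        self._check_locks = check_locks  # callable: True when a lock is seen
        self._poll_seconds = poll_seconds
        self._stop_requested = threading.Event()
        self.aborted = threading.Event()

    def run(self):
        # wait() doubles as the poll sleep; it returns True once stop() is called
        while not self._stop_requested.wait(self._poll_seconds):
            if self._check_locks():
                self.aborted.set()  # tell the main thread the update was killed
                return

    def stop(self):
        self._stop_requested.set()
```

The main thread would `start()` the watcher, run the update, then call `stop()` and check `watcher.aborted.is_set()` to decide the exit code.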

@sbidoul sbidoul merged commit bbe5094 into acsone:master Oct 1, 2019
@yajo yajo deleted the update-parallel-db-lock-watcher branch October 1, 2019 12:29
@yajo
Contributor Author

yajo commented Oct 1, 2019

Thanks! Finally! 🎉 Would you mind releasing, please?

@yajo yajo restored the update-parallel-db-lock-watcher branch October 2, 2019 08:51
@yajo yajo deleted the update-parallel-db-lock-watcher branch October 2, 2019 09:02
Labels
enhancement New feature or request