
update: Parallel DB locks watcher #38

Merged
merged 2 commits into from
Oct 1, 2019

Conversation

yajo
Contributor

@yajo yajo commented Feb 8, 2019

With this patch, if you update a database while another Odoo instance is running (such as a production instance), the production instance will not stay blocked just because there's a DB lock.

DB locks can happen, e.g., when two or more transactions run in parallel and one of them wants to modify data in a field while another one is modifying the field itself. For example, imagine that a user is modifying a `res.partner`'s name while another update process is adding a `UNIQUE` constraint on the `name` field of the `res_partner` table. This produces a deadlock where each transaction waits for the other one to finish, so both the production instance and the update instance stay locked indefinitely until one of them is aborted.
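To make the conflicting pair concrete, here is an illustrative sketch (not part of the patch; the partner id and constraint name are hypothetical) of the two SQL statements that can lock each other when issued from parallel transactions:

```python
# Illustration only: the two statements from the example above.
# The id (42) and the constraint name are hypothetical.

# Transaction 1: a production user renames a partner.
# This takes a row-level lock on that res_partner row.
USER_SQL = "UPDATE res_partner SET name = 'New Name' WHERE id = 42"

# Transaction 2: the update process alters the schema.
# ALTER TABLE needs an ACCESS EXCLUSIVE lock on the whole table,
# so each transaction can end up waiting on the other indefinitely.
UPDATE_SQL = (
    "ALTER TABLE res_partner "
    "ADD CONSTRAINT res_partner_name_unique UNIQUE (name)"
)
```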

You cannot detect such a problem with common tools such as timeouts, because a query can be slow without actually being locked: for example, when you update an addon whose pre-update migration script performs lots of work, or when your queries or server are not optimized and perform slowly.

So, the only way to detect deadlocks is to open a separate DB cursor, outside any transaction, that watches the other cursors' transactions and their locks.

With this change, this is what happens now behind the scenes:

  • The DB update process is spawned in background.
  • The foreground process uses a separate watcher cursor and watches for locks.
  • If a lock is detected, the update process is aborted, giving priority to the other cursors. This is by design: your production users always have priority, and all that happens is that the update transaction is rolled back, so you can simply retry the update later.
  • A couple of CLI parameters allow you to modify the watcher behavior.
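A minimal sketch of the watcher idea (my illustration, not the actual `update.py` code; the `pg_locks` query, polling interval, and function name are assumptions):

```python
import time

# Hypothetical query: is the update backend (identified by its
# PostgreSQL pid) currently waiting on a lock it has not been granted?
WATCH_QUERY = """
SELECT pid FROM pg_locks
WHERE NOT granted AND pid = %(update_pid)s
"""

def watch_for_locks(conn, update_pid, stop, poll_seconds=5):
    """Poll pg_locks from a separate, transactionless cursor; if the
    update backend is ever seen waiting, terminate it and report True."""
    conn.autocommit = True  # the watcher must not open a transaction
    with conn.cursor() as cr:
        while not stop.is_set():
            cr.execute(WATCH_QUERY, {"update_pid": update_pid})
            if cr.fetchall():
                # Production wins: kill the update backend so its
                # transaction rolls back and production proceeds.
                cr.execute("SELECT pg_terminate_backend(%s)", (update_pid,))
                return True
            time.sleep(poll_seconds)
    return False
```

Here `conn` would be a connection opened specifically for the watcher, and `stop` a `threading.Event` that is set once the update finishes normally.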

BTW, the changes in pre-commit are because I'm on Fedora, where the py3 version is 3.7. I guess this change will help other contributors avoid pre-commit failures too...

@Tecnativa

@yajo yajo force-pushed the update-parallel-db-lock-watcher branch 2 times, most recently from 6a7e4c7 to d463077 on February 8, 2019 10:54
@codecov

codecov bot commented Feb 8, 2019

Codecov Report

Merging #38 into master will decrease coverage by 1.02%.
The diff coverage is 76.31%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #38      +/-   ##
==========================================
- Coverage   92.91%   91.89%   -1.03%     
==========================================
  Files          13       13              
  Lines         706      740      +34     
  Branches      115      118       +3     
==========================================
+ Hits          656      680      +24     
- Misses         36       44       +8     
- Partials       14       16       +2
Impacted Files Coverage Δ
click_odoo_contrib/update.py 88.99% <76.31%> (-7.01%) ⬇️
click_odoo_contrib/backupdb.py 98.21% <0%> (-1.79%) ⬇️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2187bda...d463077. Read the comment docs.

@codecov

codecov bot commented Feb 8, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@99fb8d5).
The diff coverage is 64.58%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master      #38   +/-   ##
=========================================
  Coverage          ?   91.46%           
=========================================
  Files             ?       13           
  Lines             ?      750           
  Branches          ?      123           
=========================================
  Hits              ?      686           
  Misses            ?       51           
  Partials          ?       13
Impacted Files Coverage Δ
click_odoo_contrib/update.py 85.12% <64.58%> (ø)

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 99fb8d5...02b4ee0. Read the comment docs.

@yajo
Contributor Author

yajo commented Feb 8, 2019

Please set as WIP for now.

@lmignon lmignon added the WIP label Feb 8, 2019
@yajo yajo force-pushed the update-parallel-db-lock-watcher branch 3 times, most recently from f7ce758 to d2ebe51 on February 12, 2019 09:02
@yajo
Contributor Author

yajo commented Feb 12, 2019

This shouldn't be WIP anymore; it's ready for review.

I published https://youtu.be/20bUjtO_Mjk so you can see what this is doing in practice. Basically, it allows safely updating in parallel with a running production database, without disturbing other use cases.

I'd appreciate some help turning Travis ✔️, because the errors seem unrelated, or at least I don't understand the logs.

Thanks.

yajo added a commit to Tecnativa/doodba that referenced this pull request Feb 12, 2019
This is a preview feature.

Before #200, the only way to autoupdate addons was to use [OCA's `module_auto_update` module](https://www.odoo.com/apps/modules/11.0/module_auto_update/). As a preview feature, I'm pre-merging acsone/click-odoo-contrib#38 here in Doodba, to allow trusted parallel upgrades everywhere.

However, this should be considered a beta feature.

This commit should be reverted when the above PR is merged in upstream click-odoo-contrib package.
yajo added a commit to Tecnativa/doodba that referenced this pull request Feb 12, 2019
This is a preview feature.

Before #200, the only way to autoupdate addons was to use [OCA's `module_auto_update` module](https://www.odoo.com/apps/modules/11.0/module_auto_update/). As a preview feature, I'm pre-merging acsone/click-odoo-contrib#38 here in Doodba, to allow trusted parallel upgrades everywhere.

However, this should be considered a beta feature.

This commit should be reverted when the above PR is merged in upstream click-odoo-contrib package.

Closes #160.
@blaggacao
Contributor

blaggacao commented Feb 12, 2019

Firstly, here is what I understand of the spec:

  • Be able to update a production db while it is running in production.

I'm not the code owner, but I'm an interested third party. So, going over that spec:

  • Is running updates on live production database instances really a strategy you want to employ (as opposed to reusing a maintenance window, however short and ad hoc it might be)?
  • Note: there are rolling updates, so this maintenance window can be database-specific rather than cluster-specific (which is about the only reason I can imagine for formulating the purported spec this way in the first place).

Please correct me if I misunderstood anything.

@yajo
Contributor Author

yajo commented Feb 13, 2019

Yes, that's the purpose. I don't want to ask my customers whether I can update; I want to update at any time without them noticing at all. HA, basically. Of course, this feature must be combined with some kind of rolling-update system to be helpful, but that's outside this PR's scope.

@yajo yajo force-pushed the update-parallel-db-lock-watcher branch from d2ebe51 to ac74926 on February 13, 2019 08:26
@yajo
Contributor Author

yajo commented Feb 13, 2019

I can't understand Travis... Where's the failure? Thanks!

@sbidoul
Member

sbidoul commented Feb 13, 2019

I don't understand it either... no time to dig into this now unfortunately.

@lmignon
Member

lmignon commented Feb 13, 2019

Strange, but without more details in the log it's not possible to understand the problem. Since the error occurs in code I've written, I'll try to find some time to dig in... (but not today...)

@yajo
Contributor Author

yajo commented Feb 13, 2019

I guess we should increase logging in Travis. I can try, if you tell me how.

@blaggacao
Contributor

@yajo Got it. It looks like a double-edged sword to me, and I'm not sure it would convince me as the right approach. I repeat, I'm not the code owner, so I'm just a vocal spectator. You can just ignore me if you want. 😉

But it's still valuable to understand your thought process. For example, I cannot really figure out what would happen if an update transaction updated the schema while another user transaction still relied on a previous schema state. Which of the two transactions would fail? Is that possible at all?

Please share your insights. 😉

As far as my current knowledge goes, it seems to me that updates without schema changes could be executed safely. The problem I see is that this property of an update is opaque during the Odoo update process: we cannot scope such "hot" updates to non-schema changes.

I might just as likely be completely wrong in my analysis. Please let me know! 😄

@yajo
Contributor Author

yajo commented Feb 13, 2019

what would happen if an update transaction would update the schema while another user transaction still relies on a previous schema state

Well, that's precisely the case we're handling here.

What would happen is that both cursors would become locked, each one waiting forever for the other to finish. You can see a small example in the video above.

We have been updating in parallel for years now, and it works pretty well unless this precise situation happens, because most updates do not alter the schema, or at least not while a user is using the precise field or table that is being altered.

The solution we had developed for this problem was simply to keep our eyes peeled on the logs and, if the update got stuck, abort and restart the update process. Far from ideal or maintainable! This patch replaces our eyes with Python code that spawns a watcher for locks and aborts the update if they happen.

So, while you can still stop ➡️ update ➡️ start as before, you are now able to update (🔁 repeat if it failed) ➡️ restart.

The final update time can be higher, but the downtime will be almost nothing.


As a side note, to help you understand: the production Odoo process would be running on a different codebase than the update process. You can use separate installations or containers for that purpose. For instance, using containers:

  1. Production container `PC_A`, based on image `I_A`, is running Odoo in production.
  2. Here you'd trigger a backup, just in case.
  3. A new update container `UC_B`, based on image `I_B`, is spawned, and it runs `click-odoo-update` against the same database instance as `PC_A`.
  4. If `UC_B` exits with failure, the update has failed. Repeat step 3 as many times as needed.
  5. If `UC_B` exits normally, the update has finished properly. Replace `PC_A` with a new container `PC_B`, based on image `I_B`.

The only downtime users would experience is in the last step, and even that can be removed by running more replicas of `PC_A` and replacing them gradually. Of course this is not a one-size-fits-all solution, but at least it opens the door to real HA in Odoo, which has been closed for centuries.
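The repeat-until-success loop in step 4 above could be scripted along these lines (a sketch; the actual command is whatever runs `click-odoo-update` in your setup, and the retry counts are arbitrary):

```python
import subprocess
import time

def update_until_success(command, max_tries=5, wait_seconds=10):
    """Run the update command until it exits 0 (step 5); retry on
    failure (step 4), e.g. after the lock watcher aborted it."""
    for attempt in range(1, max_tries + 1):
        if subprocess.call(command) == 0:
            return attempt  # update finished cleanly
        time.sleep(wait_seconds)  # locks were hit; retry later
    raise RuntimeError("update failed %d times; aborting" % max_tries)
```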

@blaggacao
Contributor

blaggacao commented Feb 13, 2019

@yajo Understanding what you said, then, I guess this line is utterly confusing: https://github.com/acsone/click-odoo-contrib/pull/38/files#diff-b241a31b4530ee322670fff4974f8be1R33

"Watch DB while another process is updating Odoo"

Maybe you'd say: "Watch DB for existing locks while this process tries to update Odoo"

Would that be more assertive?


As for my motivation:

In Odoo Operator I'm implementing a slightly different workflow:

  1. Trigger "Stage Migration"
  2. Backup A
  3. Copy A -> B
  4. Migrate B.
  5. Expose B to the client for validation
  6. Trigger "Migration"
  7. Set A in maintenance (splash screen)
  8. Backup A
  9. Copy A -> B
  10. Migrate B
  11. Permute A / B wiring at all necessary levels

For sure, most tasks are executed as a separate job (or container, if you want).

If a version bump object is marked as `bugfix=true`, the workflow is shorter:

  1. Backup A
  2. Migrate A (which in most cases for bugfixes would be a no-op)
  3. Done

I think your idea is interesting for schema-altering "bugfixing" workflows, to cover those corner cases.

I'm still not fully sure how or whether this helps, since the migration is not split into several transactions; the transaction span is really the TTL of the migration, and for the migration to succeed it would need to block other users (which is downtime).

Or maybe I did not understand this part:

  • Is your strategy to retry migrations until there is a sufficiently large time span for the migration to succeed, always giving preference to user connections?

EDIT: I guess this line answers it: https://github.com/acsone/click-odoo-contrib/pull/38/files#diff-b241a31b4530ee322670fff4974f8be1R278

Maybe the other method doc should be updated, indeed. Or maybe it's just me being stupid 😉

EDIT2: Probably it's me being stupid 😉

@blaggacao
Contributor

blaggacao commented Feb 13, 2019

@yajo If I was more professional than I actually am, I'd probably just ask: would you mind sharing a spec? 😉

@yajo
Contributor Author

yajo commented Feb 14, 2019

About your 1st comment, I see you answered everything yourself. About the 2nd... 😁 If I were more professional, I'd understand what a spec is!

In any case, the point of this PR is just to open a new door for us. You are still able to work as before, changing nothing (because the watcher will never hit a lock anyway if you don't update in parallel), so whether you use Operator, Doodba, Ansible or Windows 98, this is just a new door you can choose to walk through if you want and know the implications.

I think we shouldn't clutter this PR any more with specific implementations. 🤔 That discussion fits better in downstream infrastructure projects, don't you think?


I see that I should update the README too. I'd like someone from @acsone to help me with that nasty ❌ first, though.

@lmignon
Member

lmignon commented Feb 14, 2019

@yajo For sure, I'll check what's wrong with Travis in the next few days... 😏 I'll keep you informed once I've fixed the problem.

@yajo yajo force-pushed the update-parallel-db-lock-watcher branch from ac74926 to 9cf0348 on February 14, 2019 11:49
@yajo
Contributor Author

yajo commented Mar 11, 2019

@blaggacao The scope of this improvement is to provide a tool, as stated before. I don't want to document best practices on how to use this in the wild here, if possible, because many different deployment scenarios can handle it differently. What it does is simple and explained; how you use it is up to you.

Also keep in mind that Odoo is anti-HA by design; I'm just trying to get as close as possible. The main point here is to reduce downtime to almost nothing.

As always, I'm having a difficult time understanding your English! 😅 Please write simpler, more direct questions if anything else is left.

Thanks!

@blaggacao
Contributor

@yajo Oh, I'm sorry for my English. Besides sharing my line of thought, the only actionable thing was the request to embed the use cases in the docs of this PR: for good code, I've read, documenting the why is more important than the what, isn't it?

Also, for others to be able to validate the why (the "specification"), it needs to be documented somehow. As you may see, I'm still struggling to get a constructively critical hold on this (in the sense Karl Popper stands for). That's probably the root cause of my English getting complicated. 😉

@blaggacao
Contributor

blaggacao commented Mar 11, 2019

If you feel I'm being too querulous here, please feel free to ignore. Just know that my true intention is Popperian. 😜

@yajo yajo force-pushed the update-parallel-db-lock-watcher branch from 9b25cfb to af4674f on March 12, 2019 09:10
@yajo
Contributor Author

yajo commented Mar 12, 2019

I refurbished the script's exegesis to keep it abreast of this new feature's natal motivation. I hope it appeases your Popperian querulousness.

@blaggacao
Contributor

blaggacao commented Mar 18, 2019

Exquisitely clear and tasty so! 😉

In general: `sed 's/lock/dead-lock/'`? At least that's my understanding of it now.

@yajo yajo force-pushed the update-parallel-db-lock-watcher branch 2 times, most recently from 876ac88 to 02b4ee0 on March 28, 2019 09:03
@yajo yajo force-pushed the update-parallel-db-lock-watcher branch from 02b4ee0 to 2ebbca9 on June 19, 2019 09:03
@yajo yajo force-pushed the update-parallel-db-lock-watcher branch from 2ebbca9 to b7b1bdc on July 11, 2019 08:39
@yajo
Contributor Author

yajo commented Jul 11, 2019

I recently faced a failure where the lock watcher aborted a query, but that allowed Odoo to continue updating. I changed it to use `pg_terminate_backend()` to close the cursor, instead of `pg_cancel_backend()`, which was just cancelling the query. I also added a system that detects whether the watcher aborted the main thread, and always returns an error in that case.
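For reference, the difference between the two PostgreSQL admin functions (shown here as an illustration, not the patch's exact code):

```python
# pg_cancel_backend() cancels only the backend's current query; the
# connection survives, so Odoo's update cursor could carry on.
CANCEL_SQL = "SELECT pg_cancel_backend(%(pid)s)"

# pg_terminate_backend() closes the whole backend connection, so the
# update transaction is rolled back and the update process errors out.
TERMINATE_SQL = "SELECT pg_terminate_backend(%(pid)s)"
```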

@yajo
Contributor Author

yajo commented Aug 20, 2019

It seems the failure is not related to this script.

This has been working fine in production for 5 months now. The latest patches work fine. Is there anything else left before it can be merged?

@sbidoul sbidoul removed the WIP label Aug 30, 2019
@sbidoul
Member

sbidoul commented Aug 30, 2019

@yajo yes, I'll need to fix the tests and then I'll merge. I removed the WIP label.

@yajo
Contributor Author

yajo commented Aug 30, 2019

❤️ Thanks!

@sbidoul sbidoul added the enhancement New feature or request label Aug 30, 2019
@sbidoul
Member

sbidoul commented Aug 31, 2019

@yajo could you rebase? master is green.

With this patch, if you update a database while another Odoo instance is running (such as a production instance), the production instance will not stay blocked just because there's a DB lock.

DB locks can happen, e.g., when two or more transactions run in parallel and one of them wants to modify data in a field while another one is modifying the field itself. For example, imagine that a user is modifying a `res.partner`'s name while another update process is adding a `UNIQUE` constraint on the `name` field of the `res_partner` table. This produces a deadlock where each transaction waits for the other one to finish, so both the production instance and the update instance stay locked indefinitely until one of them is aborted.

You cannot detect such a problem with common tools such as timeouts, because a query can be slow without actually being locked: for example, when you update an addon whose pre-update migration script performs lots of work, or when your queries or server are not optimized and perform slowly.

So, the only way to detect deadlocks is to open a separate DB cursor, outside any transaction, that watches the other cursors' transactions and their locks.

With this change, this is what happens now behind the scenes:

- The DB lock watcher process is spawned in the background, using a separate watcher cursor to watch for locks.
- The foreground process starts updating Odoo.
- If a lock is detected, the update process is aborted, giving priority to the other cursors. This is by design: your production users always have priority, and all that happens is that the update transaction is rolled back, so you can simply retry the update later.
- A couple of CLI parameters allow you to modify the watcher behavior, or completely disable it.

Keep in mind that an update in Odoo issues several commits, so before starting the parallel update you must make sure the production server is running in a mode that won't reload workers and, if using Odoo < 10, that won't launch cron jobs.
@yajo yajo force-pushed the update-parallel-db-lock-watcher branch from b7b1bdc to 5c385d2 on September 2, 2019 07:46
@codecov-io

codecov-io commented Sep 2, 2019

Codecov Report

Merging #38 into master will decrease coverage by 1.67%.
The diff coverage is 68.85%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #38      +/-   ##
==========================================
- Coverage   92.84%   91.16%   -1.68%     
==========================================
  Files          13       13              
  Lines         713      770      +57     
  Branches      117      126       +9     
==========================================
+ Hits          662      702      +40     
- Misses         37       53      +16     
- Partials       14       15       +1
Impacted Files Coverage Δ
click_odoo_contrib/update.py 85.07% <68.85%> (-11.03%) ⬇️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 784b89a...63df1c7. Read the comment docs.

@yajo
Contributor Author

yajo commented Sep 2, 2019

Done, thanks!

@yajo
Contributor Author

yajo commented Sep 9, 2019

So? It's green now 😊 💚

@sbidoul
Member

sbidoul commented Sep 23, 2019

@yajo I'm just back from holidays. What kept me from clicking merge before leaving was that I noticed the global variables, which I hadn't seen before (sorry about that). Do you think you could easily get rid of them, perhaps by implementing the watcher as a Thread subclass overriding run()?

@yajo
Contributor Author

yajo commented Sep 24, 2019

I hope you enjoyed your holidays! 🏖️ 😎

The problem is that the watcher thread needs to know whether the main thread wants it to continue watching, and the main thread needs to know whether the watcher thread has aborted it; both react to conditions in the other, so they need to share these two flags, and the easiest way I found was a couple of global variables.

How would a Thread subclass save us from that need? And if that solution is more complicated, is it worth it? 🤔 Simpler is better, IMHO...
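For what it's worth, one way a `Thread` subclass could carry the two flags without module globals (a sketch under my own naming, not the code that was merged):

```python
import threading

class LockWatcher(threading.Thread):
    """Carry both shared flags as threading.Event attributes:
    _stop_requested flows main -> watcher, aborted flows watcher -> main."""

    def __init__(self, check_locks, poll_seconds=1.0):
        super().__init__(daemon=True)
        self._check_locks = check_locks  # callable: True when a lock is seen
        self._poll_seconds = poll_seconds
        self._stop_requested = threading.Event()
        self.aborted = threading.Event()

    def run(self):
        # wait() doubles as the poll sleep; it returns True once stop() is called
        while not self._stop_requested.wait(self._poll_seconds):
            if self._check_locks():
                self.aborted.set()  # tell the main thread the update was killed
                return

    def stop(self):
        self._stop_requested.set()
```

The main thread would `start()` the watcher, run the update, then call `stop()` and check `watcher.aborted.is_set()` to decide the exit code.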

@sbidoul sbidoul merged commit bbe5094 into acsone:master Oct 1, 2019
@yajo yajo deleted the update-parallel-db-lock-watcher branch October 1, 2019 12:29
@yajo
Contributor Author

yajo commented Oct 1, 2019

Thanks! Finally! 🎉 Would you mind releasing, please?

@yajo yajo restored the update-parallel-db-lock-watcher branch October 2, 2019 08:51
@yajo yajo deleted the update-parallel-db-lock-watcher branch October 2, 2019 09:02
Labels
enhancement New feature or request