update: Parallel DB locks watcher #38
Conversation
Force-pushed from 6a7e4c7 to d463077
Codecov Report
@@ Coverage Diff @@
## master #38 +/- ##
==========================================
- Coverage 92.91% 91.89% -1.03%
==========================================
Files 13 13
Lines 706 740 +34
Branches 115 118 +3
==========================================
+ Hits 656 680 +24
- Misses 36 44 +8
- Partials 14 16 +2
Continue to review full report at Codecov.
Codecov Report
@@ Coverage Diff @@
## master #38 +/- ##
=========================================
Coverage ? 91.46%
=========================================
Files ? 13
Lines ? 750
Branches ? 123
=========================================
Hits ? 686
Misses ? 51
Partials ? 13
Continue to review full report at Codecov.
Please set as WIP for now.
Force-pushed from f7ce758 to d2ebe51
This shouldn't be WIP now; it's ready for review. I published https://youtu.be/20bUjtO_Mjk so you can see what this is doing in practice. Basically, it allows safely updating in parallel with a running production database, without disturbing other use cases. I'd appreciate some help getting Travis to ✔️, because the errors seem unrelated, or at least I don't understand the logs. Thanks.
This is a preview feature. Before #200, the only way to autoupdate addons was to use [OCA's `module_auto_update` module](https://www.odoo.com/apps/modules/11.0/module_auto_update/). As a preview feature, I'm pre-merging acsone/click-odoo-contrib#38 here in Doodba, to allow trusted parallel upgrades everywhere. However, this should be considered a beta feature. This commit should be reverted when the above PR is merged into the upstream click-odoo-contrib package.
This is a preview feature. Before #200, the only way to autoupdate addons was to use [OCA's `module_auto_update` module](https://www.odoo.com/apps/modules/11.0/module_auto_update/). As a preview feature, I'm pre-merging acsone/click-odoo-contrib#38 here in Doodba, to allow trusted parallel upgrades everywhere. However, this should be considered a beta feature. This commit should be reverted when the above PR is merged into the upstream click-odoo-contrib package. Closes #160.
Firstly, here is what I understand of the spec:
I'm not the code owner, but I'm an interested third party. Therefore, going over that spec, first:
Please correct me if I misunderstood anything.
Yes, that's the purpose. I don't want to ask my customers whether I can update; I want to update at any time without them noticing at all. Basically, HA. Of course, this feature must be combined with some kind of rolling-update system to be helpful, but that's outside this PR's scope.
Force-pushed from d2ebe51 to ac74926
I can't understand Travis... Where's the failure? Thanks!
I don't understand it either... no time to dig into this now, unfortunately.
Strange, but without more details in the log, it's not possible to understand the problem. Since the error occurs in the code I've written, I'll try to find some time to dig in... (but not today...)
I guess we should increase logging in Travis. I can try if you tell me how.
@yajo Got it. It looks like a double-edged sword to me, and I'm not sure it would convince me of being the right approach. I repeat, I'm not the code owner, so I'm just a vocal spectator; you can just ignore me if you want. 😉 But it's still valuable to understand your thought process. For example, I cannot really figure out what would happen if an update transaction changed the schema while another user transaction still relied on a previous schema state. Which of the two transactions would be corrupted and fail? Is that possible at all? Please share your insights. 😉 As of now, with my current state of knowledge, it seems to me that updates without schema changes could be executed safely. The problem I see is that this property of an update is opaque during the Odoo update process: we cannot scope such "hot" updates to non-schema changes. I might just as likely be completely wrong in my analysis. Please let me know! 😄
Well, that's precisely the case we're handling here. What would happen is that both cursors would become locked, in an endless wait where each one is waiting for the other one to finish. You can see a little example in the video above. We have been updating in parallel for years now, and it works pretty well unless this precise situation happens, because most updates do not alter the schema, or at least not while a user is specifically using a previous schema in the precise field or table that is being altered. The solution we had developed for this problem was to just keep our eyes peeled on the logs and, if they became stuck, abort and restart the update process. It is far from ideal or maintainable! This patch replaces our eyes with Python code that spawns a watcher for locks and aborts the update if they happen (see the sketch below). So, while you'd still be able to stop ➡️ update ➡️ start as before, now you are able to update (🔁 repeat if failed) ➡️ restart. The final update time can be higher, but the downtime will be almost nothing. As a side note, to help you understand better: the production Odoo process would be running on a different codebase than the update process. You can use separate installations or containers for that purpose. For instance, using containers:
The only downtime users would experience is in the last step, but even that can be removed by using more replicas for PCA and replacing them gradually. Of course this is not a one-size-fits-all solution, but at least it opens the door for real HA in Odoo, which has been closed for the last centuries.
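To make that watcher idea more concrete, here is a minimal, hypothetical sketch (not the code in this PR): a separate autocommit connection polls `pg_locks` for ungranted locks on the database being updated and bails out as soon as one appears. The function name, connection parameters and abort mechanism are placeholders, assuming `psycopg2`.

```python
# Hypothetical sketch of a DB lock watcher; not the PR's actual implementation.
import time

import psycopg2


def watch_locks(dsn, db_name, poll_seconds=5):
    """Poll for ungranted locks on ``db_name`` and abort if one appears."""
    conn = psycopg2.connect(dsn)
    conn.autocommit = True  # The watcher must not hold a transaction (or locks) itself
    try:
        with conn.cursor() as cr:
            while True:
                cr.execute(
                    """
                    SELECT l.pid
                    FROM pg_locks l
                    JOIN pg_stat_activity a ON a.pid = l.pid
                    WHERE NOT l.granted AND a.datname = %s
                    """,
                    (db_name,),
                )
                if cr.fetchall():
                    # Someone is waiting on a lock: abort the parallel update
                    # so that production transactions always win.
                    raise RuntimeError("DB lock detected; aborting parallel update")
                time.sleep(poll_seconds)
    finally:
        conn.close()
```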
@yajo Given what you said, then, I guess this line is utterly confusing: https://github.com/acsone/click-odoo-contrib/pull/38/files#diff-b241a31b4530ee322670fff4974f8be1R33
Maybe you'd say: "Watch the DB for existing locks while this process tries to update Odoo". Would that be more accurate? As for my motivation: in Odoo Operator I'm implementing a slightly different workflow:
For sure, most tasks are executed as a separate job (or container, if you want). If a version bump object is marked as
I think your idea is interesting for schema-altering "bugfixing" workflows, to cover those corner cases. I'm still not fully understanding how/if that helps, since the migration is not split up into several transactions, so the transaction span is really the TTL of the migration, and for the migration to succeed it would need to block other users (which is downtime). Or maybe I did not understand this part:
EDIT: I guess this line answers it: https://github.com/acsone/click-odoo-contrib/pull/38/files#diff-b241a31b4530ee322670fff4974f8be1R278 Maybe the other method's doc should be updated, indeed. Or maybe it's just me being stupid 😉 EDIT2: Probably it's me being stupid 😉
@yajo If I were more professional than I actually am, I'd probably just ask: would you mind sharing a spec? 😉
About your 1st comment, I see you answered everything yourself. About the 2nd... 😁 If I were more professional, I'd understand what a spec is! In any case, the point of this PR is just opening a new door for us. You are still able to work as before, changing nothing (because the watcher would never hit a lock anyway if you don't update in parallel), so no matter whether you use Operator, Doodba, Ansible or Windows 98, this is just a new door you can choose to use if you want and know the implications. I think we shouldn't clutter this PR any more with specific implementations. 🤔 That discussion fits better in downstream infrastructure projects, don't you think? I see that I should update the README too. I'd like someone from @acsone to help me with that nasty ❌ first, though.
@yajo For sure, I'll check what's wrong with Travis in the next few days... 😏 I'll keep you informed once I've fixed the problem.
Force-pushed from ac74926 to 9cf0348
@blaggacao The scope of this improvement is to provide a tool, as stated before. I don't want to document best practices on how to use this in the wild here if possible, because there are many different deployment scenarios that can handle this differently. What it does is simple and explained; how you use it is up to you. Also keep in mind that Odoo is anti-HA by design; I'm just trying to get as close as possible. The main point here is to reduce downtime to almost nothing. As always, I'm having a difficult time trying to understand your English! 😅 Please write simpler, more direct questions instead, if anything else is left. Thanks!
@yajo Oh, I'm sorry for my English. Besides sharing my line of thought, the only actionable thing was the request to embed the use cases in the docs of this PR: for good code, I've read, documenting the why is more important than the what, isn't it? Also, in order for others to be able to validate the why (the "specification"), it needs to be documented somehow. As you might see, I'm still struggling to get a constructive critical hold (in the sense of what Karl Popper stands for) on this. That's probably the root cause of my English getting complicated. 😉
If you feel I'm being too querulous here, please feel free to ignore me. Just know that my true intention is Popperian. 😜
Force-pushed from 9b25cfb to af4674f
I refurbished the script's exegenesis to keep it abreast of this new feature's natality motivation. I hope it appeases your Popperian querulousity.
Exquisitely clear and tasty so! 😉 In general:
Force-pushed from 876ac88 to 02b4ee0
Force-pushed from 02b4ee0 to 2ebbca9
Force-pushed from 2ebbca9 to b7b1bdc
I recently faced a failure where the lock watcher aborted a query, but that still allowed Odoo to continue updating. I changed to use
It seems the failure is not related to this script. This has been working fine in production for 5 months now. The last patches work pretty well. Is there anything else left for it to be merged?
@yajo Yes, I'll need to fix the tests and then I'll merge. I removed the WIP label.
❤️ Thanks!
@yajo Could you rebase? master is green.
With this patch, if you update a database while another Odoo instance is running (such as a production instance), the production instance will not be locked just because there's a DB lock.

DB locks can happen e.g. when 2 or more transactions are running in parallel and one of them wants to modify data in a field while another one is modifying the field itself. For example, imagine that a user is modifying a `res.partner`'s name while another update process is adding a `UNIQUE` constraint on the `name` field of the `res_partner` table. This would produce a deadlock where each transaction is waiting for the other one to finish, and thus both the production instance and the update instance would be locked indefinitely until one of them is aborted.

You cannot detect such a problem with common tools such as timeouts, because a query can still be slow without actually being locked, e.g. when you update an addon with a pre-update migration script that performs lots of work, or when your queries or server are not optimized and perform slowly.

So, the only way to detect deadlocks is to issue a separate DB cursor that is not protected by a transaction and that watches other cursors' transactions and their locks.

With this change, this is what happens behind the scenes:

- The DB lock watcher process is spawned in the background using a separate watcher cursor and watches for locks.
- The foreground process starts updating Odoo.
- If a lock is detected, the update process is aborted, giving priority to the other cursors. This is by design: your production users always have priority, and all that happens is that the update transaction gets rolled back, so you can just try updating again later.
- A couple of CLI parameters allow you to modify the watcher behavior, or disable it completely.

Keep in mind that an update in Odoo issues several commits, so before starting the parallel update you must make sure the production server is running in a mode that won't reload workers and, if using Odoo < 10, that won't launch cron jobs.
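To illustrate the scenario described above, the two conflicting transactions could be reproduced with two separate connections, roughly like this (a hypothetical sketch assuming `psycopg2`; the constraint name, database name and record id are made up, and this is not code from the PR):

```python
# Hypothetical reproduction of the blocking scenario between a user and an update.
import psycopg2

user_conn = psycopg2.connect(dbname="prod")    # simulates a production user session
update_conn = psycopg2.connect(dbname="prod")  # simulates the parallel update process

user_cr = user_conn.cursor()
# The user's transaction modifies a partner's name and stays open for a while
# (Odoo transactions can be long-lived).
user_cr.execute("UPDATE res_partner SET name = 'Foo' WHERE id = 1")

update_cr = update_conn.cursor()
# Adding a UNIQUE constraint needs an ACCESS EXCLUSIVE lock on res_partner, so this
# statement blocks until the user's transaction commits or rolls back, while any
# further writes from the user on res_partner queue up behind it in turn.
update_cr.execute(
    "ALTER TABLE res_partner "
    "ADD CONSTRAINT res_partner_name_uniq UNIQUE (name)"
)
```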
Force-pushed from b7b1bdc to 5c385d2
Codecov Report
@@ Coverage Diff @@
## master #38 +/- ##
==========================================
- Coverage 92.84% 91.16% -1.68%
==========================================
Files 13 13
Lines 713 770 +57
Branches 117 126 +9
==========================================
+ Hits 662 702 +40
- Misses 37 53 +16
- Partials 14 15 +1
Continue to review full report at Codecov.
Done, thanks!
So? It's green now 😊 💚
@yajo I'm just back from holidays. What kept me from clicking merge before leaving was that I noticed the global variables, which I had not seen before (sorry about that). Do you think you could easily get rid of them, perhaps by implementing the watcher by subclassing?
I hope you enjoyed your holidays! 🏖️ 😎 The problem is that the watcher thread needs to know whether the main thread wants it to keep watching, the main thread needs to know whether the watcher thread has aborted, and both react to conditions in the other one, so they need to share these 2 flags, and the easiest way I found was using a couple of global variables. How would subclassing help with that?
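For what it's worth, here is a minimal sketch of how those two flags could live on the thread object instead of module-level globals, assuming the suggestion was to subclass `threading.Thread` (this is not the PR's actual implementation; `check_locks` is a placeholder for whatever probes `pg_locks`):

```python
# Hypothetical sketch: shared flags as threading.Event attributes on a Thread subclass.
import threading


class LockWatcher(threading.Thread):
    def __init__(self, check_locks, interval=5):
        super().__init__(daemon=True)
        self._check_locks = check_locks          # callable returning True if a lock is waiting
        self._interval = interval
        self.stop_requested = threading.Event()  # main thread -> watcher: stop watching
        self.aborted = threading.Event()         # watcher -> main thread: a lock was hit

    def run(self):
        while not self.stop_requested.is_set():
            if self._check_locks():
                self.aborted.set()
                return
            self.stop_requested.wait(self._interval)


# Rough usage from the main (update) thread:
# watcher = LockWatcher(check_locks=probe_pg_locks)
# watcher.start()
# ...run the update...
# watcher.stop_requested.set()
# watcher.join()
# if watcher.aborted.is_set():
#     ...roll back / retry later...
```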
Thanks! Finally! 🎉 Would you mind releasing, please?
BTW, the changes in pre-commit are because I have Fedora, where the py3 version is 3.7. I guess this change will help other contributors too, to avoid pre-commit failures...
@Tecnativa