This repository has been archived by the owner on Jul 16, 2024. It is now read-only.

Add lifetime to the distributed locks #136

Merged: 14 commits merged on Apr 12, 2022

achimnol
Member

@achimnol achimnol commented Apr 6, 2022

This will auto-release the locks if a manager process holding the lock hangs or is abruptly killed in an HA setup.

FileLock and PgAdvisoryLock (in the manager) auto-release the lock when the manager process is terminated,
because the OS closes the relevant file descriptors.
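
As a rough illustration of that OS-level release (this is not the project's FileLock code), the sketch below uses `fcntl.flock` on a POSIX system. A child process takes an exclusive lock and is killed without unlocking; the parent can then acquire the same lock. The helper names `try_lock` and `demo` and the temporary file are hypothetical.

```python
import fcntl
import os
import subprocess
import sys
import tempfile

# Child program: take an exclusive lock, report it, then "hang".
CHILD_CODE = """
import fcntl, sys, time
f = open(sys.argv[1], "w")
fcntl.flock(f, fcntl.LOCK_EX)
print("locked", flush=True)
time.sleep(60)
"""


def try_lock(path):
    """Return an open, locked file object, or None if someone else holds the lock."""
    f = open(path, "w")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        # The non-blocking attempt failed because the lock is held elsewhere.
        f.close()
        return None
    return f


def demo():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE, path],
        stdout=subprocess.PIPE,
        text=True,
    )
    child.stdout.readline()  # wait until the child reports that it holds the lock
    held_before = try_lock(path) is None

    child.kill()  # abrupt termination, no explicit unlock
    child.wait()
    child.stdout.close()

    f = try_lock(path)  # the OS closed the child's descriptor, releasing the lock
    acquired_after = f is not None
    if f is not None:
        f.close()
    os.unlink(path)
    return held_before, acquired_after
```

Calling `demo()` returns a pair of booleans: whether the lock was unavailable while the child was alive, and whether it could be acquired after the child was killed.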

EtcdLock requires an explicit unlock, so the lock may not be released in such cases.
We could avoid this problem by adding an explicit lifetime, so that the lock is released automatically on the server side.
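
To make the server-side lifetime idea concrete, here is a toy in-memory model. It is neither etcd's API nor etcetra's; `ToyLeaseServer` and its methods are made up for illustration. The "server" stores an expiry time with each lock, so a holder that dies without unlocking only blocks others until the lifetime runs out.

```python
import time
from typing import Callable, Dict, Tuple


class ToyLeaseServer:
    """Hypothetical stand-in for a lock server that attaches a TTL to each lock."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # lock name -> (owner, expiry timestamp)
        self._locks: Dict[str, Tuple[str, float]] = {}

    def acquire(self, name: str, owner: str, ttl: float) -> bool:
        now = self._clock()
        holder = self._locks.get(name)
        if holder is not None and holder[1] > now and holder[0] != owner:
            return False  # held by someone else and not yet expired
        # Free, expired, or re-acquired by the same owner (which renews it).
        self._locks[name] = (owner, now + ttl)
        return True

    def release(self, name: str, owner: str) -> bool:
        holder = self._locks.get(name)
        if holder is None or holder[0] != owner:
            return False  # nothing to release, or not the holder
        del self._locks[name]
        return True


# Usage with a fake clock so the expiry is deterministic.
fake_now = [0.0]
server = ToyLeaseServer(clock=lambda: fake_now[0])
server.acquire("scheduler", "manager-a", ttl=10)  # manager-a takes the lock, then hangs
server.acquire("scheduler", "manager-b", ttl=10)  # refused while the lifetime is running
fake_now[0] = 10.5
server.acquire("scheduler", "manager-b", ttl=10)  # succeeds after the lifetime has passed
```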

Let's set the default lifetime to be min(interval + 30, interval * 2) and implement:

  • EtcdLock: lease-based lock lifetime
  • FileLock: add a watchdog task to auto-release the lock in case of hang
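
The proposed default can be written as a one-line helper. The function name is hypothetical, and treating `interval` as seconds is an assumption based on the `+ 30` term.

```python
def default_lock_lifetime(interval: float) -> float:
    """Default lifetime proposed above: min(interval + 30, interval * 2)."""
    return min(interval + 30, interval * 2)


# The two terms cross at interval == 30:
# shorter intervals get twice the interval, longer ones get interval + 30.
for interval in (10, 30, 60):
    print(interval, default_lock_lifetime(interval))
```

With this rule the lifetime always exceeds the interval, by the full interval for short intervals and by a fixed 30 for long ones.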

@achimnol achimnol added this to the 22.03 milestone Apr 6, 2022
@codecov

codecov bot commented Apr 6, 2022

Codecov Report

Merging #136 (a1f4ce9) into main (9314476) will increase coverage by 0.11%.
The diff coverage is 80.48%.

@@            Coverage Diff             @@
##             main     #136      +/-   ##
==========================================
+ Coverage   76.82%   76.93%   +0.11%     
==========================================
  Files          26       26              
  Lines        3301     3317      +16     
==========================================
+ Hits         2536     2552      +16     
  Misses        765      765              

Impacted Files                          Coverage Δ
src/ai/backend/common/redis.py          47.66% <50.00%> (+1.43%) ⬆️
src/ai/backend/common/distributed.py    81.81% <63.63%> (-3.60%) ⬇️
src/ai/backend/common/lock.py           95.78% <89.28%> (-0.32%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9314476...a1f4ce9.

Member Author

@achimnol achimnol left a comment

  • etcetra: Please revoke the granted lease when unlocking.

@achimnol
Member Author

Let's skip the watchdog for FileLock this time, as FileLock is for single-node non-HA setups. We could just leave it as a good first issue.
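
For whoever picks that issue up, one possible shape for such a watchdog is sketched below with plain asyncio. This is only an assumption about how it could look, not a design agreed in this PR; `WatchdogLock` is a hypothetical name and it wraps an `asyncio.Lock` rather than the real FileLock.

```python
import asyncio
from typing import Optional


class WatchdogLock:
    """Toy lock that releases itself if it is held longer than `lifetime` seconds."""

    def __init__(self, lifetime: float) -> None:
        self._lock = asyncio.Lock()
        self._lifetime = lifetime
        self._watchdog: Optional[asyncio.Task] = None

    @property
    def locked(self) -> bool:
        return self._lock.locked()

    async def acquire(self) -> None:
        await self._lock.acquire()
        # Start the watchdog only once the lock is actually held.
        self._watchdog = asyncio.create_task(self._expire())

    async def _expire(self) -> None:
        await asyncio.sleep(self._lifetime)
        if self._lock.locked():
            self._lock.release()  # the holder overran its lifetime

    def release(self) -> None:
        if self._watchdog is not None:
            self._watchdog.cancel()  # no-op if the watchdog already finished
            self._watchdog = None
        if self._lock.locked():
            self._lock.release()


async def main() -> bool:
    lock = WatchdogLock(lifetime=0.05)
    await lock.acquire()
    await asyncio.sleep(0.2)  # simulate a holder that never calls release()
    return lock.locked


if __name__ == "__main__":
    print(asyncio.run(main()))
```

An in-process task like this only helps when the holding coroutine is stuck while the event loop keeps running. It cannot fire if the whole process or loop is blocked, and a real implementation would also need to stop a late `release()` from freeing a lock that another holder has since acquired.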

super().__init__(lifetime=lifetime)
self.lock_name = lock_name
self.etcd = etcd
self.lifetime = lifetime
Member Author

self._lifetime is already inherited from the base class. Please use it!
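
A minimal sketch of what this suggestion amounts to, using hypothetical class names (`BaseLockSketch`, `EtcdLockSketch`) in place of the real ones: the base class already stores the value as `self._lifetime`, so the subclass passes it up and reads it from there instead of keeping a second copy.

```python
from typing import Optional


class BaseLockSketch:
    """Hypothetical base class that owns the lifetime attribute."""

    def __init__(self, *, lifetime: Optional[float] = None) -> None:
        self._lifetime = lifetime


class EtcdLockSketch(BaseLockSketch):
    def __init__(
        self,
        lock_name: str,
        etcd: object,
        *,
        lifetime: Optional[float] = None,
    ) -> None:
        super().__init__(lifetime=lifetime)
        self.lock_name = lock_name
        self.etcd = etcd
        # No `self.lifetime = lifetime` here: the inherited `self._lifetime`
        # is the single source of truth.

    def describe(self) -> str:
        return f"{self.lock_name} (lifetime={self._lifetime})"
```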

@achimnol achimnol marked this pull request as ready for review April 12, 2022 01:35
@achimnol achimnol merged commit 8aafb0d into main Apr 12, 2022
@achimnol achimnol deleted the feature/add-lock-timeout-to-distributed-locks branch April 12, 2022 09:52