curvefs: add a new distributed transaction model to improve rename performance #2884
How are you going to solve the compatibility issues introduced?
@@ -1780,6 +1785,12 @@ TopologyImpl::AllocOrGetMemcacheCluster(FsIdType fsId,
    return ret;
}

bool TopologyImpl::Tso(uint64_t* ts, uint64_t* timestamp) {
    *timestamp = curve::common::TimeUtility::GetTimeofDayMs();
According to the man page https://man7.org/linux/man-pages/man2/gettimeofday.2.html, gettimeofday is not monotonic, so you may get a lower timestamp for a later request. It may also cause problems when the MDS leader changes.
This is indeed a problem. But normally the servers use NTP to synchronize time. If the clock difference is not large enough to affect the transaction timeout check (default 5s), there should be no impact; otherwise it may roll back an ongoing transaction in a transaction conflict scenario.
Do you have any feasible suggestions for getting monotonically increasing physical time?
If ts is still persisted each time, you could consider persisting the timestamp with it. And if GetTimeofDay returns a value that is less than the previous timestamp, sleep until GetTimeofDay returns a bigger value.
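The suggestion above could look roughly like the following. This is only a sketch under assumptions: `NonDecreasingClock`, `NextMs` and `lastPersistedMs` are hypothetical names, the clock source is injected so that something like `GetTimeofDayMs` could be plugged in, and persisting the last value is left to the caller.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical helper, not part of the PR: hands out wall-clock
// timestamps that never go backwards on this process.
class NonDecreasingClock {
 public:
    using NowFn = std::function<uint64_t()>;

    // lastPersistedMs is assumed to be the last timestamp that was
    // persisted (for example by a previous leader); 0 if there is none.
    explicit NonDecreasingClock(NowFn now, uint64_t lastPersistedMs = 0)
        : now_(std::move(now)), lastMs_(lastPersistedMs) {
        assert(now_ && "clock source must be callable");
    }

    // Returns the current time in ms. If the clock source reports a
    // value lower than the previous one, sleep and retry until it
    // has caught up.
    uint64_t NextMs() {
        std::lock_guard<std::mutex> guard(mutex_);
        uint64_t now = now_();
        while (now < lastMs_) {
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
            now = now_();
        }
        lastMs_ = now;
        return now;
    }

 private:
    NowFn now_;
    uint64_t lastMs_;
    std::mutex mutex_;
};
```

The trade-off is that a large backwards clock step blocks every caller for as long as the step, so a real implementation would probably want an upper bound on the wait.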
Could you try clock_gettime? The CLOCK_MONOTONIC argument seems to do this; see man clock_gettime for details.
clock_gettime with CLOCK_REALTIME is better than gettimeofday here, but the clocks on different MDS machines may still have discrepancies.
What I mean is that this function yields continuously increasing timestamps on a single machine and is not affected by jumps in the system time, such as NTP step adjustments. Getting the same guarantee across multiple machines is a big challenge; most storage or data services use a separate timing service for that.
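For reference, reading both clocks through clock_gettime might look like this. It is a minimal POSIX-only sketch; `ClockNowMs` is a hypothetical helper name, not something in the PR.

```cpp
#include <time.h>

#include <cassert>
#include <cstdint>

// Hypothetical helper: read a POSIX clock and convert it to milliseconds.
inline uint64_t ClockNowMs(clockid_t id) {
    struct timespec tp;
    int rc = clock_gettime(id, &tp);
    assert(rc == 0);
    (void)rc;  // silence the unused-variable warning when NDEBUG is set
    return static_cast<uint64_t>(tp.tv_sec) * 1000ULL +
           static_cast<uint64_t>(tp.tv_nsec) / 1000000ULL;
}

// CLOCK_MONOTONIC never goes backwards on one machine, but it counts
// from an unspecified starting point, so its values cannot be compared
// across machines or across reboots.
inline uint64_t MonotonicNowMs() { return ClockNowMs(CLOCK_MONOTONIC); }

// CLOCK_REALTIME is wall-clock time since the Unix epoch; it is
// comparable across machines but can jump when the clock is adjusted.
inline uint64_t RealtimeNowMs() { return ClockNowMs(CLOCK_REALTIME); }
```

Because the monotonic value is meaningless on another host, it would suit measuring elapsed time locally rather than being stored as the lock timestamp that other nodes compare against.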
@@ -107,6 +114,7 @@ class RenameOperator {
    // if dest exist, record the size and type of file or empty dir
    int64_t oldInodeSize_;
Does oldInode mean the new parent inode?
oldInode means srcInode: when renaming A to B, it is A.
It seems oldInodeId_ is only assigned with dstDentry_.inodeid() in https://github.com/opencurve/curve/blob/bfd5acbf1bfc24047af857be7218c76cfe27c8e2/curvefs/src/client/client_operator.cpp#L176C44-L176C44
@@ -83,6 +90,10 @@ class DentryCacheManagerImpl : public DentryCacheManager {
        const std::shared_ptr<MetaServerClient> &metaClient)
        : metaClient_(metaClient) {}

    void Init(std::shared_ptr<MdsClient> mdsClient) override {
It looks like DentryCacheManager no longer has cache functionality.
Yes, the previous dentry and inode cache has been optimized in v2.6.
It's better to remove Cache from its name.
fixed
VLOG(3) << "FuseOpRename [start]: " << renameOp.DebugString();
RETURN_IF_UNSUCCESS(Precheck);
RETURN_IF_UNSUCCESS(RecordOldInodeInfo);
// Do not move LinkDestParentInode behind CommitTx.
// If so, the nlink will be lost when the machine goes down
RETURN_IF_UNSUCCESS(LinkDestParentInode);
RETURN_IF_UNSUCCESS(PrewriteTx);
RETURN_IF_UNSUCCESS(CommitTxV2);
VLOG(3) << "FuseOpRename [success]: " << renameOp.DebugString();
// Do not check UnlinkSrcParentInode, because rename is already success
renameOp.UnlinkSrcParentInode();
renameOp.UnlinkOldInode();
if (parent != newparent) {
    renameOp.UpdateInodeParent();
}
renameOp.UpdateInodeCtime();

if (enableSumInDir_.load()) {
    xattrManager_->UpdateParentXattrAfterRename(
        parent, newparent, newname, &renameOp);
}
It doesn’t look very different from v1
The overall process is the same; the difference lies in how dentries are handled (PrewriteTx, CommitTx). In v1, the rename process must be protected by a lock and cannot run concurrently.
The two transaction models are currently not compatible. The outcome of the previous discussion is that existing clusters continue to use the old version, and newly deployed clusters enable the v2 transaction model.
rt = PrewriteRenameTx(dentrys, txLockIn);
if (rt == CURVEFS_ERROR::OK) {
    dentrys[0] = newDentry_;
    rt = PrewriteRenameTx(dentrys, txLockIn);
Should the client clean up the lock on the primary in this case? Otherwise, subsequent requests must wait until the primary's lock times out.
On the one hand, the current transaction model defers error handling to the next transaction that involves the same dentry. On the other hand, cleaning up the lock is likely to fail as well when PrewriteRenameTx has failed.
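To illustrate what deferring error handling could mean for the next transaction, here is a rough sketch of a lock timeout check. Everything in it is an assumption for illustration: `TxLockInfo`, `LockAction` and `ResolveLeftoverLock` are hypothetical names, and the only details taken from this thread are that the TSO timestamp is used to judge lock timeout and that the default transaction timeout is 5s.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical record left behind on a dentry by an unfinished tx.
struct TxLockInfo {
    uint64_t startTs;      // sequence number of the tx that took the lock
    uint64_t timestampMs;  // TSO physical time when that tx started
};

enum class LockAction {
    kWait,           // the owning tx may still be running
    kRollbackStale,  // the lock has outlived the tx timeout
};

// Decide what a later tx should do when it finds a leftover lock.
inline LockAction ResolveLeftoverLock(const TxLockInfo& lock,
                                      uint64_t nowMs,
                                      uint64_t timeoutMs = 5000) {
    assert(timeoutMs > 0);
    // If the clock appears to have gone backwards, treat the lock as
    // fresh instead of letting the subtraction underflow.
    uint64_t elapsedMs =
        nowMs > lock.timestampMs ? nowMs - lock.timestampMs : 0;
    return elapsedMs > timeoutMs ? LockAction::kRollbackStale
                                 : LockAction::kWait;
}
```

This also shows why the clock discussion matters: if `nowMs` comes from a clock that has jumped forward, a live transaction's lock could be judged stale and rolled back.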
        metaClient_->CommitTx(dentrys, startTs_, commitTs);
    }
}
if (rt != MetaStatusCode::OK) {
Ditto: clean up the locks taken in the prewrite phase?
@@ -760,53 +734,31 @@ MetaStatusCode DentryStorage::CommitTx(const std::vector<Dentry>& dentrys,
    }
    WriteLockGuard lg(rwLock_);
Since most dentry operations are protected by this lock, it seems you could pack almost all of the client-side Precheck/Prewrite/Commit logic into a single RPC when renaming files within the same directory.
Agreed. Both txV1 and txV2 need this improvement; a new PR will do this later.
curvefs/src/mds/fs_storage.cpp
Outdated
@@ -570,5 +620,35 @@ FSStatusCode PersisKVStorage::DeleteFsUsage(const std::string& fsName) {
    return FSStatusCode::OK;
}

FSStatusCode PersisKVStorage::Tso(uint32_t fsId, uint64_t* ts,
                                  uint64_t* timestamp) {
    WriteLockGuard lock(tsLock_);
This lock also causes requests from different file systems to be processed serially.
Fixed: changed to a single tsIdGenerator for the whole cluster.
curvefs/src/mds/fs_storage.cpp
Outdated
TS tsInfo;
tsInfo.set_ts(*ts);
// persist to storage
std::string key = codec::EncodeTsKey(fsId);
Can you allocate ts the same way chunk IDs are allocated, so that you don't have to persist to etcd for each ts?
fixed
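One common reading of this suggestion is batch allocation: persist only the upper bound of a range, then hand out values from memory until the range is used up. The sketch below shows that idea under assumptions; it is not the code that landed in the PR, nor Curve's actual chunk ID allocator, and `BatchedTsAllocator`, `PersistFn` and the batch size are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical allocator: one storage write per batch instead of per ts.
class BatchedTsAllocator {
 public:
    // persist(upper) must durably record that every ts below `upper`
    // may already have been handed out; returns false on failure.
    using PersistFn = std::function<bool(uint64_t)>;

    // persistedUpper is the bound read back from storage at startup.
    BatchedTsAllocator(PersistFn persist, uint64_t persistedUpper,
                       uint64_t batchSize)
        : persist_(std::move(persist)),
          next_(persistedUpper),
          upper_(persistedUpper),
          batch_(batchSize) {
        assert(persist_ && "persist callback must be callable");
        assert(batch_ > 0);
    }

    // Writes the next ts into *ts. Fails only if a new batch is needed
    // and its bound cannot be persisted.
    bool Next(uint64_t* ts) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (next_ == upper_) {
            uint64_t newUpper = upper_ + batch_;
            if (!persist_(newUpper)) {
                return false;
            }
            upper_ = newUpper;
        }
        *ts = next_++;
        return true;
    }

 private:
    PersistFn persist_;
    uint64_t next_;   // next value to hand out
    uint64_t upper_;  // exclusive bound already persisted
    uint64_t batch_;
    std::mutex mutex_;
};
```

After a restart or leader change the allocator resumes from the persisted bound, so some values in the last batch are skipped but none is ever reused, which keeps ts monotonically increasing.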
Signed-off-by: wanghai01 <[email protected]>
@@ -186,14 +188,21 @@ FSStatusCode MemoryFsStorage::DeleteFsUsage(const std::string& fsName) {
    return FSStatusCode::OK;
}

FSStatusCode MemoryFsStorage::Tso(uint64_t* ts, uint64_t* timestamp) {
    *timestamp = curve::common::TimeUtility::GetTimeofDayMs();
    *ts = tsId_.fetch_add(1, std::memory_order_relaxed);
Do all tso values share one increasing sequence number, rather than a separate sequence per timestamp?
Yes, tso has a monotonically increasing sequence number, plus a timestamp that is used to determine tx lock timeout.
What problem does this PR solve?
first commit related pr: #2748