curvefs: add a new distributed transaction model to improve rename performance #2884

SeanHai · 2023-11-10T01:57:05Z

What problem does this PR solve?

first commit related pr: #2748

Issue Number: #xxx

Problem Summary:

What is changed and how it works?

What's Changed:

How it Works:

Side effects(Breaking backward compatibility? Performance regression?):

Check List

Relevant documentation/comments is changed or added
I acknowledge that all my contributions will be made under the project's license

SeanHai · 2023-11-20T08:42:03Z

cicheck

SeanHai · 2023-11-20T08:55:08Z

cicheck

SeanHai · 2023-11-24T02:44:13Z

cicheck

SeanHai · 2023-11-30T01:36:52Z

cicheck

wu-hanqing

How are you going to solve the compatibility issues introduced?

curvefs/src/mds/topology/topology.cpp

wu-hanqing · 2023-12-01T03:15:57Z

curvefs/src/mds/topology/topology.cpp

@@ -1780,6 +1785,12 @@ TopologyImpl::AllocOrGetMemcacheCluster(FsIdType fsId,
    return ret;
 }

+bool TopologyImpl::Tso(uint64_t* ts, uint64_t* timestamp) {
+    *timestamp = curve::common::TimeUtility::GetTimeofDayMs();


According to the man page https://man7.org/linux/man-pages/man2/gettimeofday.2.html, gettimeofday is not monotonic, so you may get a lower timestamp for a later request. It may also cause problems when the MDS leader changes.

This is indeed a problem. Normally, though, the servers use NTP to synchronize time. If the clock difference is too small to affect the transaction timeout check (default 5s), there should be no impact; otherwise an ongoing transaction may be rolled back in a transaction conflict scenario.
Do you have any feasible suggestions for getting a monotonically increasing physical time?

If ts is still persisted on every call, you could persist the timestamp along with it. Then, if GetTimeofDay returns a value lower than the previous timestamp, sleep until GetTimeofDay returns a bigger value.
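The suggestion above can be sketched roughly as follows. This is a minimal illustration, not code from the PR: the class name MonotonicTimestamp, the injected clock function, and the in-memory lastMs_ field (standing in for the persisted timestamp) are all assumptions. The sketch accepts a reading equal to the last value, and it is not thread-safe on its own, so a caller would need to hold a lock around Next().

```cpp
#include <chrono>
#include <cstdint>
#include <functional>
#include <thread>
#include <utility>

// Hypothetical helper, not part of the PR: returns a timestamp that never
// goes backwards relative to the last value handed out.
class MonotonicTimestamp {
 public:
    // nowMs:  clock source in milliseconds (the wall clock in production).
    // lastMs: last timestamp recovered from persistent storage.
    MonotonicTimestamp(std::function<uint64_t()> nowMs, uint64_t lastMs)
        : nowMs_(std::move(nowMs)), lastMs_(lastMs) {}

    uint64_t Next() {
        uint64_t now = nowMs_();
        // The clock is behind the last issued value (for example after an
        // NTP correction or a leader change): wait until it catches up.
        while (now < lastMs_) {
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
            now = nowMs_();
        }
        lastMs_ = now;  // a real caller would persist this together with ts
        return now;
    }

 private:
    std::function<uint64_t()> nowMs_;
    uint64_t lastMs_;
};
```

Injecting the clock keeps the sketch testable; the cost of this approach is that a large backwards step stalls every caller until the wall clock catches up.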

Could you try clock_gettime? The CLOCK_MONOTONIC argument seems to provide this. See man clock_gettime for details.

clock_gettime with CLOCK_REALTIME is better than gettimeofday here, but the clocks on different MDS machines may still disagree.

What I mean is that this function returns continuous, ever-increasing timestamps on a single machine and is not affected by discontinuous jumps in the system time, such as NTP step corrections. Getting the same guarantee across multiple machines is a big challenge; most storage or data services use a separate timing service for that.
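For reference, reading the monotonic clock looks roughly like the sketch below. The helper name MonotonicNowMs is an assumption for illustration; clock_gettime and CLOCK_MONOTONIC are the POSIX interfaces mentioned above.

```cpp
#include <time.h>

#include <cstdint>

// Milliseconds from CLOCK_MONOTONIC. The value counts from an unspecified
// starting point, so it is only comparable with other readings taken on the
// same machine and cannot be compared with wall-clock time.
inline uint64_t MonotonicNowMs() {
    struct timespec tp;
    clock_gettime(CLOCK_MONOTONIC, &tp);
    return static_cast<uint64_t>(tp.tv_sec) * 1000 +
           static_cast<uint64_t>(tp.tv_nsec) / 1000000;
}
```

Because the starting point is unspecified, a value stored by one MDS would not be meaningful to another MDS after a leader change, which is the cross-machine concern raised in this thread.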

curvefs/src/mds/topology/topology.cpp

curvefs/src/mds/topology/topology.h

curvefs/src/metaserver/dentry_storage.cpp

curvefs/src/client/dentry_cache_manager.cpp

curvefs/src/client/fuse_client.cpp

wu-hanqing · 2023-12-01T08:23:26Z

curvefs/src/client/client_operator.h

@@ -107,6 +114,7 @@ class RenameOperator {
    // if dest exist, record the size and type of file or empty dir
    int64_t oldInodeSize_;


oldInode means new parent inode?

oldInode means srcInode: when renaming A to B, it is A.

It seems oldInodeId_ is only assigned with dstDentry_.inodeid() in https://github.com/opencurve/curve/blob/bfd5acbf1bfc24047af857be7218c76cfe27c8e2/curvefs/src/client/client_operator.cpp#L176C44-L176C44

wu-hanqing · 2023-12-01T08:32:49Z

curvefs/src/client/dentry_cache_manager.h

@@ -83,6 +90,10 @@ class DentryCacheManagerImpl : public DentryCacheManager {
        const std::shared_ptr<MetaServerClient> &metaClient)
      : metaClient_(metaClient) {}

+    void Init(std::shared_ptr<MdsClient> mdsClient) override {


It looks like DentryCacheManager no longer has cache functionality

Yes, the previous dentry and inode cache has been optimized in v2.6.

It's better to remove Cache from its name

fixed

wu-hanqing · 2023-12-01T08:35:46Z

curvefs/src/client/fuse_client.cpp

+        VLOG(3) << "FuseOpRename [start]: " << renameOp.DebugString();
+        RETURN_IF_UNSUCCESS(Precheck);
+        RETURN_IF_UNSUCCESS(RecordOldInodeInfo);
+        // Do not move LinkDestParentInode behind CommitTx.
+        // If so, the nlink will be lost when the machine goes down
+        RETURN_IF_UNSUCCESS(LinkDestParentInode);
+        RETURN_IF_UNSUCCESS(PrewriteTx);
+        RETURN_IF_UNSUCCESS(CommitTxV2);
+        VLOG(3) << "FuseOpRename [success]: " << renameOp.DebugString();
+        // Do not check UnlinkSrcParentInode, beause rename is already success
+        renameOp.UnlinkSrcParentInode();
+        renameOp.UnlinkOldInode();
+        if (parent != newparent) {
+            renameOp.UpdateInodeParent();
+        }
+        renameOp.UpdateInodeCtime();

+        if (enableSumInDir_.load()) {
+            xattrManager_->UpdateParentXattrAfterRename(
+                parent, newparent, newname, &renameOp);
+        }


It doesn’t look very different from v1

The overall process is the same; the difference lies in how the dentry is handled (PrewriteTx, CommitTx). In v1, the rename process had to be protected by a lock and could not run concurrently.
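To illustrate why a prewrite/commit split allows renames on unrelated dentries to proceed concurrently, here is a toy sketch. It assumes a per-dentry transaction lock keyed by the transaction's start timestamp; ToyDentryStore and its methods are hypothetical and do not reproduce the metaserver's DentryStorage or its actual conflict handling.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical in-memory model, not the metaserver's DentryStorage.
class ToyDentryStore {
 public:
    // Phase 1: lock every key for the transaction identified by startTs.
    // Fails without side effects if another transaction holds any key.
    bool Prewrite(const std::vector<std::string>& keys, uint64_t startTs) {
        for (const auto& k : keys) {
            auto it = locks_.find(k);
            if (it != locks_.end() && it->second != startTs) return false;
        }
        for (const auto& k : keys) locks_[k] = startTs;
        return true;
    }

    // Phase 2: release the locks owned by startTs. A real store would also
    // make the new dentry versions visible at this point.
    bool Commit(const std::vector<std::string>& keys, uint64_t startTs) {
        for (const auto& k : keys) {
            auto it = locks_.find(k);
            if (it == locks_.end() || it->second != startTs) return false;
        }
        for (const auto& k : keys) locks_.erase(k);
        return true;
    }

 private:
    std::map<std::string, uint64_t> locks_;  // dentry key -> owner startTs
};
```

In this model only transactions that touch the same dentry conflict, so no lock spanning the whole rename is needed.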

SeanHai · 2023-12-01T09:30:48Z

How are you going to solve the compatibility issues introduced?

The two transaction models are currently not compatible. The outcome of the previous discussion is that existing clusters continue to use the old version, and newly deployed clusters enable the v2 transaction model.

curvefs/src/client/client_operator.cpp

curvefs/src/metaserver/dentry_storage.cpp

wu-hanqing · 2023-12-02T16:56:30Z

curvefs/src/client/client_operator.cpp

+        rt = PrewriteRenameTx(dentrys, txLockIn);
+        if (rt == CURVEFS_ERROR::OK) {
+            dentrys[0] = newDentry_;
+            rt = PrewriteRenameTx(dentrys, txLockIn);


Should the client clean up the lock on the primary in this case? Otherwise, subsequent requests must wait until the primary lock times out.

On the one hand, the current transaction model defers error handling to the next transaction that involves the same dentry. On the other hand, cleaning up the lock is likely to fail as well when PrewriteRenameTx has failed.
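The deferred handling described above could look something like the following sketch, in which a later transaction that meets a leftover lock decides whether to wait or roll it back based on the lock's age. The struct, enum, and function names are hypothetical; the 5000 ms default mirrors the 5s transaction timeout mentioned earlier in this conversation, and the real protocol may take additional steps before rolling back.

```cpp
#include <cstdint>

// Hypothetical lock record; field names are illustrative only.
struct ToyTxLock {
    uint64_t startTs;      // sequence number of the owning transaction
    uint64_t timestampMs;  // physical time when the lock was written
};

enum class LockAction { kWait, kRollback };

// Decide what a later transaction does when it meets a leftover lock.
inline LockAction ResolveLock(const ToyTxLock& lock, uint64_t nowMs,
                              uint64_t timeoutMs = 5000) {
    // If the clock appears to have gone backwards, treat the lock as fresh
    // rather than rolling back a transaction that may still be running.
    if (nowMs < lock.timestampMs) return LockAction::kWait;
    return (nowMs - lock.timestampMs >= timeoutMs) ? LockAction::kRollback
                                                   : LockAction::kWait;
}
```

This is also where the timestamp's quality matters: a clock that jumps forward by more than the timeout would make a live lock look expired.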

wu-hanqing · 2023-12-02T16:58:24Z

curvefs/src/client/client_operator.cpp

+            metaClient_->CommitTx(dentrys, startTs_, commitTs);
+        }
+    }
+    if (rt != MetaStatusCode::OK) {


ditto, clean up locks in prewrite phase?

wu-hanqing · 2023-12-03T03:24:48Z

curvefs/src/metaserver/dentry_storage.cpp

@@ -760,53 +734,31 @@ MetaStatusCode DentryStorage::CommitTx(const std::vector<Dentry>& dentrys,
    }
    WriteLockGuard lg(rwLock_);


Since most dentry operations are protected by this lock, it seems you could pack almost all of the client-side Precheck/Prewrite/Commit logic into a single RPC when renaming files under the same directory.

Agreed. Both txV1 and txV2 need this improvement; a new PR will do this later.

SeanHai · 2023-12-05T07:00:47Z

cicheck

SeanHai · 2023-12-05T09:09:08Z

cicheck

SeanHai · 2023-12-05T09:38:12Z

cicheck

SeanHai · 2023-12-05T10:55:40Z

cicheck

wu-hanqing · 2023-12-07T02:32:14Z

curvefs/src/mds/fs_storage.cpp

@@ -570,5 +620,35 @@ FSStatusCode PersisKVStorage::DeleteFsUsage(const std::string& fsName) {
    return FSStatusCode::OK;
 }

+FSStatusCode PersisKVStorage::Tso(uint32_t fsId, uint64_t* ts,
+                                  uint64_t* timestamp) {
+    WriteLockGuard lock(tsLock_);


This lock also causes requests from different file systems to be processed serially.

Fixed: changed to a single tsIdGenerator for the whole cluster.

wu-hanqing · 2023-12-07T02:35:05Z

curvefs/src/mds/fs_storage.cpp

+    TS tsInfo;
+    tsInfo.set_ts(*ts);
+    // persist to storage
+    std::string key = codec::EncodeTsKey(fsId);


Can you allocate ts the same way chunkid is allocated, so that you don't have to persist to etcd for each ts?
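A common way to avoid one storage write per id is to persist only the upper bound of a range and serve ids from memory until the range runs out. The sketch below illustrates that idea under stated assumptions: BatchTsAllocator, the persist callback (a stand-in for the etcd write), and the batch size are hypothetical and are not taken from the chunkid allocator's actual code.

```cpp
#include <cstdint>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical batch allocator, for illustration only.
class BatchTsAllocator {
 public:
    // persist:        durably stores the new upper bound, false on error.
    // persistedBound: upper bound read back from storage at startup.
    // batch:          how many ids one storage write reserves.
    BatchTsAllocator(std::function<bool(uint64_t)> persist,
                     uint64_t persistedBound, uint64_t batch)
        : persist_(std::move(persist)),
          next_(persistedBound),
          bound_(persistedBound),
          batch_(batch) {}

    bool Alloc(uint64_t* ts) {
        std::lock_guard<std::mutex> guard(mtx_);
        if (next_ >= bound_) {
            // Range exhausted: reserve the next one before handing out ids.
            if (!persist_(bound_ + batch_)) return false;
            bound_ += batch_;
        }
        *ts = next_++;
        return true;
    }

 private:
    std::function<bool(uint64_t)> persist_;
    std::mutex mtx_;
    uint64_t next_;
    uint64_t bound_;
    uint64_t batch_;
};
```

After a restart the allocator resumes from the persisted bound, so ids left unused in the previous range are skipped but the sequence still only increases.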

SeanHai · 2023-12-10T06:02:03Z

cicheck

SeanHai · 2023-12-11T11:59:23Z

cicheck

SeanHai · 2023-12-12T02:00:50Z

cicheck

Wine93 · 2023-12-12T06:13:31Z

curvefs/src/mds/fs_storage.cpp

@@ -186,14 +188,21 @@ FSStatusCode MemoryFsStorage::DeleteFsUsage(const std::string& fsName) {
    return FSStatusCode::OK;
 }

+FSStatusCode MemoryFsStorage::Tso(uint64_t* ts, uint64_t* timestamp) {
+    *timestamp = curve::common::TimeUtility::GetTimeofDayMs();
+    *ts = tsId_.fetch_add(1, std::memory_order_relaxed);


Do all TSO results share one increasing sequence number, rather than a separate sequence for each timestamp?

Yes, the TSO returns a monotonically increasing sequence number plus a timestamp, which is used to determine tx lock timeout.
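The relationship between the two values can be sketched as below. ToyTso is a hypothetical stand-in modeled on the in-memory Tso in the diff above; it uses std::chrono in place of the project's TimeUtility so that the example is self-contained.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

// Hypothetical stand-in for the in-memory Tso shown in the diff above.
class ToyTso {
 public:
    void Get(uint64_t* ts, uint64_t* timestampMs) {
        // Wall-clock milliseconds: may repeat across calls or step back.
        // It is only used to judge lock timeout, never to order requests.
        auto sinceEpoch = std::chrono::system_clock::now().time_since_epoch();
        *timestampMs = static_cast<uint64_t>(
            std::chrono::duration_cast<std::chrono::milliseconds>(sinceEpoch)
                .count());
        // One global sequence: unique and increasing across all calls,
        // regardless of what the timestamp does.
        *ts = seq_.fetch_add(1, std::memory_order_relaxed);
    }

 private:
    std::atomic<uint64_t> seq_{1};
};
```

Ordering therefore relies only on the sequence number, while the timestamp is a coarse physical-time hint.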

SeanHai force-pushed the improve_rename branch from ce1213c to 07efa1a Compare November 20, 2023 08:40

SeanHai changed the title ~~[draft] curvefs: replace distributed transaction model to improve rename performance~~ curvefs: add a new distributed transaction model to improve rename performance Nov 20, 2023

SeanHai force-pushed the improve_rename branch from 07efa1a to c4f665c Compare November 20, 2023 08:54

SeanHai force-pushed the improve_rename branch 3 times, most recently from 16914e5 to bdbcf7f Compare November 24, 2023 02:43

SeanHai force-pushed the improve_rename branch from bdbcf7f to a920cd9 Compare November 30, 2023 01:36

wu-hanqing reviewed Dec 1, 2023

wu-hanqing reviewed Dec 3, 2023

SeanHai force-pushed the improve_rename branch from a3804e1 to 59aec64 Compare December 5, 2023 07:00

SeanHai force-pushed the improve_rename branch 2 times, most recently from b7d6ff1 to 57fbcba Compare December 5, 2023 09:08

wu-hanqing reviewed Dec 7, 2023

SeanHai force-pushed the improve_rename branch 2 times, most recently from 670ddfe to 428eb0e Compare December 8, 2023 08:34

SeanHai force-pushed the improve_rename branch 2 times, most recently from da75c9e to e1c15c8 Compare December 11, 2023 11:47

fix curve rpc infinite retry logic to mds

1e4451d

Signed-off-by: wanghai01 <[email protected]>

SeanHai force-pushed the improve_rename branch from e1c15c8 to 01ea0b3 Compare December 11, 2023 11:58

curvefs: add a new distributed transaction model to improve rename pe…

2a69743

…rformance Signed-off-by: wanghai01 <[email protected]>

SeanHai force-pushed the improve_rename branch from 01ea0b3 to 2a69743 Compare December 12, 2023 01:59

wu-hanqing approved these changes Dec 12, 2023

Wine93 reviewed Dec 12, 2023

Wine93 approved these changes Dec 12, 2023

SeanHai merged commit 0c722ae into opencurve:master Dec 12, 2023
4 checks passed
