feat(dup): implement server handling of duplicate rpc (part 1) #456

Merged
merged 5 commits into from
Jan 13, 2020

@neverchanje neverchanje commented Jan 8, 2020

What problem does this PR solve?

This PR implements server-side handling of duplicate_rpc. There is a design document in Chinese that may help in understanding the changes: https://pegasus-kv.github.io/2019/06/09/duplication-design.html#%E9%9B%86%E7%BE%A4%E9%97%B4%E5%86%99%E5%86%B2%E7%AA%81

What is changed and how it works?

After a write is replicated via 2PC, it is applied through pegasus_server_impl::on_batched_write_requests.

If pegasus_server_write::on_batched_write_requests then decodes this write as a duplicate_rpc, it is passed down to pegasus_write_service::duplicate.

struct duplicate_request
{
    // The timestamp of this write.
    1: optional i64 timestamp

    // The code to identify this write.
    2: optional dsn.task_code task_code

    // The binary form of the write.
    3: optional dsn.blob raw_message

    // ID of the cluster where this write comes from.
    4: optional byte cluster_id

    // Whether to compare the timetag of old value with the new write's.
    5: optional bool verify_timetag
}

pegasus_write_service::duplicate unwraps the real content of the mutation (carried in raw_message), e.g. a MULTI_PUT, and then applies the write to RocksDB.

To resolve collisions (two clusters writing to the same key), we introduce a timetag for each write:

/// Generates timetag in host endian.
inline uint64_t generate_timetag(uint64_t timestamp, uint8_t cluster_id, bool delete_tag)
{
    return timestamp << 8u | cluster_id << 1u | delete_tag;
}

Normally the write with the newer timestamp replaces the older one. When the timestamps are identical (a rare case, where the same key is written in both cluster A and cluster B), the write with the larger cluster_id wins. To express this, we define the timetag, a composition of timestamp and cluster_id.
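The ordering described above can be checked with a small sketch. It reuses the bit layout of generate_timetag from this PR (with explicit widening to 64 bits); the three accessor functions are hypothetical helpers added only for illustration, and the sketch assumes cluster_id fits in 7 bits, as the layout implies.

```cpp
#include <cassert>
#include <cstdint>

// Same layout as generate_timetag in this PR:
// bits 8..63 = timestamp, bits 1..7 = cluster_id, bit 0 = delete_tag.
// Assumption: cluster_id <= 127, otherwise it would spill into the timestamp bits.
inline uint64_t generate_timetag(uint64_t timestamp, uint8_t cluster_id, bool delete_tag)
{
    return timestamp << 8u | static_cast<uint64_t>(cluster_id) << 1u |
           static_cast<uint64_t>(delete_tag);
}

// Hypothetical accessors (not part of this PR) that split a timetag
// back into its fields.
inline uint64_t timetag_timestamp(uint64_t timetag) { return timetag >> 8u; }

inline uint8_t timetag_cluster_id(uint64_t timetag)
{
    return static_cast<uint8_t>((timetag >> 1u) & 0x7Fu);
}

inline bool timetag_delete_tag(uint64_t timetag) { return (timetag & 1u) != 0; }
```

Because the timestamp occupies the most significant bits, a plain integer comparison of two timetags orders writes by timestamp first and falls back to cluster_id only when the timestamps are equal.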

struct db_write_context
{
    // the mutation decree
    int64_t decree{0};

    // The time when this mutation is generated.
    // This is used to calculate the new timetag.
    uint64_t timestamp{0};

    // timetag of the remote write, 0 if it's not from remote.
    uint64_t remote_timetag{0};

    // Whether to compare the timetag of old value with the new write's.
    // - If true, it requires a read to the DB before write. If the old record has a larger timetag
    // than the `remote_timetag`, the write will be ignored, otherwise it will be applied using
    // the new timetag generated by local cluster.
    // - If false, no overhead for the write but the eventual consistency on duplication
    // is not guaranteed.
    bool verfiy_timetag{false};
};
  • A duplicated write compares its remote_timetag with the "local timetag" (the timetag stored in the RocksDB value).
  • A normal write calculates its timetag from db_write_context::timestamp and compares it with the local timetag.

Either type of write is applied to RocksDB only when its timetag is larger than the local timetag.

Comparing timetags requires a DB read before each write, because the old value is needed. If the user can accept weaker consistency, this overhead can be eliminated by setting verfiy_timetag=false.
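A minimal sketch of the decision described above, not the code in this PR. write_context_sketch is a hypothetical, reduced stand-in for db_write_context, and should_apply_write, has_local_record, local_timetag and local_cluster_id are names invented for illustration. The handling of equal timetags (the write is skipped) and of a missing local record (the write is applied) are assumptions, since the description does not spell them out.

```cpp
#include <cstdint>

// Same layout as generate_timetag in this PR, with explicit widening.
inline uint64_t generate_timetag(uint64_t timestamp, uint8_t cluster_id, bool delete_tag)
{
    return timestamp << 8u | static_cast<uint64_t>(cluster_id) << 1u |
           static_cast<uint64_t>(delete_tag);
}

// Hypothetical subset of db_write_context, reduced to the fields used here.
struct write_context_sketch
{
    uint64_t timestamp{0};      // when the mutation was generated
    uint64_t remote_timetag{0}; // 0 if the write is not from a remote cluster
    bool verify_timetag{false}; // whether to compare against the stored timetag
};

// Returns true if the write should be applied.
// `local_timetag` stands for the timetag read from the existing record;
// it is only meaningful when `has_local_record` is true.
inline bool should_apply_write(const write_context_sketch &ctx,
                               bool has_local_record,
                               uint64_t local_timetag,
                               uint8_t local_cluster_id)
{
    // Without verification there is no read and no comparison.
    if (!ctx.verify_timetag || !has_local_record) {
        return true;
    }
    // A duplicated write compares its remote timetag; a normal write
    // derives one from its own timestamp and the local cluster id.
    const uint64_t new_timetag =
        ctx.remote_timetag != 0
            ? ctx.remote_timetag
            : generate_timetag(ctx.timestamp, local_cluster_id, false);
    // Assumption: ties are treated as stale and skipped.
    return new_timetag > local_timetag;
}
```

With verify_timetag set to false the function returns before looking at the stored value, which corresponds to skipping the read and giving up the consistency guarantee.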

Check List

Tests

  • Unit test (will be added in next part)

/// Generates timetag in host endian.
inline uint64_t generate_timetag(uint64_t timestamp, uint8_t cluster_id, bool delete_tag)
{
return timestamp << 8u | cluster_id << 1u | delete_tag;

What is this delete_tag for? I remember seeing it in the documentation, but I can't find it now.
Should we add an explanation to this code?

@neverchanje neverchanje Jan 9, 2020

The comments for this part will be added in the next PR.

@neverchanje neverchanje added the component/duplication cluster duplication label Jan 9, 2020
@hycdong hycdong merged commit c5dec62 into apache:master Jan 13, 2020
@neverchanje neverchanje deleted the dup-part branch January 13, 2020 08:04
@neverchanje neverchanje mentioned this pull request Mar 31, 2020
acelyc111 pushed a commit that referenced this pull request Jun 23, 2022