Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add missing MongoDB sharding configuration for version vectors #1097

Merged
merged 1 commit into from
Dec 11, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions build/charts/yorkie-cluster/charts/yorkie-mongodb/values.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -57,6 +57,11 @@ sharded:
- name: doc_id
method: "1"
unique: false
- collectionName: versionvectors
shardKeys:
- name: doc_id
method: "1"
unique: false

# Configuration for manual dmongodb sharded stack
mongodb-sharded:
Expand Down
59 changes: 33 additions & 26 deletions build/docker/sharding/scripts/init-mongos1.js
Original file line number Diff line number Diff line change
@@ -1,39 +1,46 @@
sh.addShard("shard-rs-1/shard1-1:27017")
sh.addShard("shard-rs-2/shard2-1:27017")
sh.addShard("shard-rs-1/shard1-1:27017");
sh.addShard("shard-rs-2/shard2-1:27017");

function findAnotherShard(shard) {
if (shard == "shard-rs-1") {
return "shard-rs-2"
} else {
return "shard-rs-1"
}
if (shard == "shard-rs-1") {
return "shard-rs-2";
} else {
return "shard-rs-1";
}
}

function shardOfChunk(minKeyOfChunk) {
return db.getSiblingDB("config").chunks.findOne({ min: { project_id: minKeyOfChunk } }).shard
return db
.getSiblingDB("config")
.chunks.findOne({ min: { project_id: minKeyOfChunk } }).shard;
}

// Shard the database for the mongo client test
const mongoClientDB = "test-yorkie-meta-mongo-client"
sh.enableSharding(mongoClientDB)
sh.shardCollection(mongoClientDB + ".clients", { project_id: 1 })
sh.shardCollection(mongoClientDB + ".documents", { project_id: 1 })
sh.shardCollection(mongoClientDB + ".changes", { doc_id: 1 })
sh.shardCollection(mongoClientDB + ".snapshots", { doc_id: 1 })
sh.shardCollection(mongoClientDB + ".syncedseqs", { doc_id: 1 })
const mongoClientDB = "test-yorkie-meta-mongo-client";
sh.enableSharding(mongoClientDB);
sh.shardCollection(mongoClientDB + ".clients", { project_id: 1 });
sh.shardCollection(mongoClientDB + ".documents", { project_id: 1 });
sh.shardCollection(mongoClientDB + ".changes", { doc_id: 1 });
sh.shardCollection(mongoClientDB + ".snapshots", { doc_id: 1 });
sh.shardCollection(mongoClientDB + ".syncedseqs", { doc_id: 1 });
sh.shardCollection(mongoClientDB + ".versionvectors", { doc_id: 1 });

// Split the inital range at `splitPoint` to allow doc_ids duplicate in different shards.
const splitPoint = ObjectId("500000000000000000000000")
sh.splitAt(mongoClientDB + ".documents", { project_id: splitPoint })
const splitPoint = ObjectId("500000000000000000000000");
sh.splitAt(mongoClientDB + ".documents", { project_id: splitPoint });
// Move the chunk to another shard.
db.adminCommand({ moveChunk: mongoClientDB + ".documents", find: { project_id: splitPoint }, to: findAnotherShard(shardOfChunk(splitPoint)) })
db.adminCommand({
moveChunk: mongoClientDB + ".documents",
find: { project_id: splitPoint },
to: findAnotherShard(shardOfChunk(splitPoint)),
});

// Shard the database for the server test
const serverDB = "test-yorkie-meta-server"
sh.enableSharding(serverDB)
sh.shardCollection(serverDB + ".clients", { project_id: 1 })
sh.shardCollection(serverDB + ".documents", { project_id: 1 })
sh.shardCollection(serverDB + ".changes", { doc_id: 1 })
sh.shardCollection(serverDB + ".snapshots", { doc_id: 1 })
sh.shardCollection(serverDB + ".syncedseqs", { doc_id: 1 })

const serverDB = "test-yorkie-meta-server";
sh.enableSharding(serverDB);
sh.shardCollection(serverDB + ".clients", { project_id: 1 });
sh.shardCollection(serverDB + ".documents", { project_id: 1 });
sh.shardCollection(serverDB + ".changes", { doc_id: 1 });
sh.shardCollection(serverDB + ".snapshots", { doc_id: 1 });
sh.shardCollection(serverDB + ".syncedseqs", { doc_id: 1 });
sh.shardCollection(serverDB + ".versionvectors", { doc_id: 1 });
36 changes: 22 additions & 14 deletions design/mongodb-sharding.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
---
title: mongodb-sharding
target-version: 0.4.14
target-version: 0.5.7
---

# MongoDB Sharding
Expand Down Expand Up @@ -29,11 +29,10 @@ This document will only explain the core concepts of the selected sharding strat

1. Cluster-wide: `users`, `projects`
2. Project-wide: `documents`, `clients`
3. Document-wide: `changes`, `snapshots`, `syncedseqs`
3. Document-wide: `changes`, `snapshots`, `syncedseqs`, `versionvectors`

<img src="media/mongodb-sharding-prev-relation.png">


**Sharding Goals**

Shard Project-wide and Document-wide collections due to the large number of data count in each collection
Expand All @@ -49,6 +48,7 @@ Shard Project-wide and Document-wide collections due to the large number of data
3. `Changes`: `(doc_id, server_seq)`
4. `Snapshots`: `(doc_id, server_seq)`
5. `Syncedseqs`: `(doc_id, client_id)`
6. `Versionvectors`: `(doc_id, client_id)`

**Main Query Patterns**

Expand All @@ -64,6 +64,7 @@ cursor, err := c.collection(ColClients).Find(ctx, bson.M{
},
}, options.Find().SetLimit(int64(candidatesLimit)))
```

```go
// Documents
filter := bson.M{
Expand Down Expand Up @@ -106,6 +107,7 @@ cursor, err := c.collection(colChanges).Find(ctx, bson.M{
},
}, options.Find())
```

```go
// Snapshots
result := c.collection(colSnapshots).FindOne(ctx, bson.M{
Expand All @@ -132,6 +134,7 @@ Every unique constraint can be satisfied because each has the shard key as a pre
3. `Changes`: `(doc_id, server_seq)`
4. `Snapshots`: `(doc_id, server_seq)`
5. `Syncedseqs`: `(doc_id, client_id)`
6. `Versionvectors`: `(doc_id, client_id)`

**Changes of Reference Keys**

Expand All @@ -142,6 +145,7 @@ Since the uniqueness of `_id` isn't guaranteed across shards, reference keys to
3. `Changes`: `_id` -> `(project_id, doc_id, server_seq)`
4. `Snapshots`: `_id` -> `(project_id, doc_id, server_seq)`
5. `Syncedseqs`: `_id` -> `(project_id, doc_id, client_id)`
6. `Versionvectors`: `_id` -> `(project_id, doc_id, client_id)`

Considering that MongoDB ensures the uniqueness of `_id` per shard, `Documents` and `Clients` can be identified with the combination of `project_id` and `_id`. Note that the reference keys of document-wide collections are also subsequently changed.

Expand All @@ -155,12 +159,12 @@ Considering that MongoDB ensures the uniqueness of `_id` per shard, `Documents`

For a production deployment, consider the following to ensure data redundancy and system availability.

* Config Server (3 member replica set): `config1`,`config2`,`config3`
* 3 Shards (each a 3 member replica set):
* `shard1-1`,`shard1-2`, `shard1-3`
* `shard2-1`,`shard2-2`, `shard2-3`
* `shard3-1`,`shard3-2`, `shard3-3`
* 2 Mongos: `mongos1`, `mongos2`
- Config Server (3 member replica set): `config1`,`config2`,`config3`
- 3 Shards (each a 3 member replica set):
- `shard1-1`,`shard1-2`, `shard1-3`
- `shard2-1`,`shard2-2`, `shard2-3`
- `shard3-1`,`shard3-2`, `shard3-3`
- 2 Mongos: `mongos1`, `mongos2`

![Cluster architecture](media/mongodb-sharding-cluster-arch.png)

Expand All @@ -178,11 +182,12 @@ Using a composite shard key like `(project_id, key)` can resolve this issue. Aft
However, this change of shard keys can lead to the value duplication of either `actor_id` or `owner`, which uses `client_id` as a value. Now the values of `client_id` can be duplicated, contrary to the current sharding strategy where locating every client in the same project into the same shard prevents this kind of duplications.

The duplication of `client_id` values can devastate the consistency of documents. There are three expected approaches to resolve this issue:

1. Use `client_key + client_id` as a value.
* this may increase the size of CRDT metadata and the size of document snapshots.
- this may increase the size of CRDT metadata and the size of document snapshots.
2. Introduce a cluster-level GUID generator.
3. Depend on the low possibility of duplication of MongoDB ObjectID.
* see details in the following contents.
- see details in the following contents.

**Duplicate MongoDB ObjectID**

Expand All @@ -192,17 +197,20 @@ Both `client_id` and `doc_id` are currently using MongoDB ObjectID as a value. W

However, the possibility of duplicate ObjectIDs is **extremely low in practical use cases** due to its mechanism.
ObjectID uses the following format:

```
TimeStamp(4 bytes) + MachineId(3 bytes) + ProcessId(2 bytes) + Counter(3 bytes)
```

The condition for duplicate ObjectIDs is that more than `16,777,216` documents/clients are created every single second in a single machine and process. Considering Google processes over `99,000` searches every single second, it is unlikely to occur.

When we have to meet that amount of traffic in the future, consider the following options:

1. Introduce a cluster-level GUID generator.
2. Disable auto-balancing chunks of documents and clients.
* Just isolate each shard for a single project.
- Just isolate each shard for a single project.

## References
* [Implementation of ObjectID generator in golang driver](https://github.com/mongodb/mongo-go-driver/blob/v0.0.18/bson/objectid/objectid.go#L40)
* [Generating globally unique identifiers for use with MongoDB](https://www.mongodb.com/blog/post/generating-globally-unique-identifiers-for-use-with-mongodb)

- [Implementation of ObjectID generator in golang driver](https://github.com/mongodb/mongo-go-driver/blob/v0.0.18/bson/objectid/objectid.go#L40)
- [Generating globally unique identifiers for use with MongoDB](https://www.mongodb.com/blog/post/generating-globally-unique-identifiers-for-use-with-mongodb)
2 changes: 1 addition & 1 deletion server/backend/database/mongo/indexes.go
Original file line number Diff line number Diff line change
Expand Up @@ -174,8 +174,8 @@ var collectionInfos = []collectionInfo{
name: ColVersionVectors,
indexes: []mongo.IndexModel{{
Keys: bsonx.Doc{
{Key: "doc_id", Value: bsonx.Int32(1)}, // shard key
{Key: "project_id", Value: bsonx.Int32(1)},
{Key: "doc_id", Value: bsonx.Int32(1)},
{Key: "client_id", Value: bsonx.Int32(1)},
},
Options: options.Index().SetUnique(true),
Expand Down
Loading