Add missing MongoDB sharding configuration for version vectors #1097

hackerwins · 2024-12-11T02:54:16Z

What this PR does / why we need it:

This PR adds the missing MongoDB sharding configuration for the version vectors collection introduced in PR #1047 and updates the related documentation to reflect these changes.

Which issue(s) this PR fixes:

Addresses #723

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Additional documentation:

Checklist:

Added relevant tests or not required
Addressed and resolved all CodeRabbit review comments
Didn't break anything

Summary by CodeRabbit

New Features
- Introduced a new collection named versionvectors with sharding capabilities in the MongoDB configuration.
- Added sharding operations for the versionvectors collection in the initialization script.
Documentation
- Updated the MongoDB sharding documentation to include details about the new versionvectors collection and its sharding strategy.
Bug Fixes
- Improved readability and consistency in the sharding initialization script through syntax enhancements.
Refactor
- Adjusted the indexing strategy for the ColVersionVectors collection to prioritize the doc_id key.

coderabbitai · 2024-12-11T02:54:24Z

Walkthrough

The changes in this pull request introduce a new collection configuration for sharding in the MongoDB setup, specifically adding a collection named versionvectors with a defined shard key. Additionally, modifications to the init-mongos1.js script enhance sharding functionality for this new collection, while the mongodb-sharding.md document is updated to reflect sharding strategies and constraints. The indexing strategy for the ColVersionVectors collection is also adjusted to prioritize the doc_id key.

Changes

| File Path | Change Summary |
|---|---|
| build/charts/yorkie-cluster/charts/yorkie-mongodb/values.yaml | Added new collection configuration for `versionvectors` with `doc_id` as the shard key, non-unique. |
| build/docker/sharding/scripts/init-mongos1.js | Updated script for sharding operations for `versionvectors`, added semicolons for consistency. |
| design/mongodb-sharding.md | Updated document for sharding strategy, added `versionvectors`, and detailed unique constraints. |
| server/backend/database/mongo/indexes.go | Modified index definition for `ColVersionVectors`, moving `doc_id` to the first position in the index. |

Possibly related PRs

Fix Sharding Initialization with Dynamic Replica Set Configuration #1087: This PR modifies the MongoDB sharding configuration, which is directly related to the new collection configuration for sharding introduced in the main PR.
Improve Version Vector Handling for Legacy SDK and Snapshots #1096: This PR enhances the handling of version vectors, which is relevant to the new versionvectors collection added in the main PR.

Suggested reviewers

sejongk
JOOHOJANG

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (3)

build/charts/yorkie-cluster/charts/yorkie-mongodb/values.yaml (1)

60-64: Sharding configuration follows established patterns

The configuration for versionvectors collection follows the same pattern as other document-related collections, which is good for consistency. The unique: false setting is correct as uniqueness is handled by the compound index in indexes.go.

Consider documenting the following aspects:

- Expected data volume and growth patterns
- Rationale for choosing doc_id as the shard key
- Impact on query patterns and data distribution

design/mongodb-sharding.md (2)

200-200: Add language specification to code block.

The code block showing ObjectID format should specify the language for better syntax highlighting.
-```
+```text
 TimeStamp(4 bytes) + MachineId(3 bytes) + ProcessId(2 bytes) + Counter(3 bytes)
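
To make the quoted layout concrete, here is a minimal Go sketch that splits a 12-byte ObjectID into the four fields named in the design document. The type `objectIDParts`, the function `splitObjectID`, and the sample ID are hypothetical and exist only for illustration. Big-endian byte order for the process ID is an assumption, and newer MongoDB drivers treat the middle five bytes as a single random value rather than machine and process IDs.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"errors"
	"fmt"
	"time"
)

// objectIDParts mirrors the four fields named in the design doc's layout.
type objectIDParts struct {
	Timestamp time.Time // bytes 0-3: seconds since the Unix epoch
	MachineID [3]byte   // bytes 4-6
	ProcessID uint16    // bytes 7-8 (big-endian order is an assumption)
	Counter   uint32    // bytes 9-11: 24-bit big-endian counter
}

// splitObjectID decodes a 24-character hex ObjectID into its fields.
func splitObjectID(hexID string) (objectIDParts, error) {
	raw, err := hex.DecodeString(hexID)
	if err != nil {
		return objectIDParts{}, err
	}
	if len(raw) != 12 {
		return objectIDParts{}, errors.New("object id must be exactly 12 bytes")
	}
	var p objectIDParts
	secs := binary.BigEndian.Uint32(raw[0:4])
	p.Timestamp = time.Unix(int64(secs), 0).UTC()
	copy(p.MachineID[:], raw[4:7])
	p.ProcessID = binary.BigEndian.Uint16(raw[7:9])
	p.Counter = uint32(raw[9])<<16 | uint32(raw[10])<<8 | uint32(raw[11])
	return p, nil
}

func main() {
	// Hand-made ID for illustration, not taken from a real deployment.
	p, err := splitObjectID("000000010a0b0c0002000003")
	if err != nil {
		panic(err)
	}
	fmt.Println(p.Timestamp.Unix(), p.MachineID, p.ProcessID, p.Counter)
}
```

Because the timestamp occupies the leading bytes, IDs created close together in time sort next to each other, which matters when such a value is used as a ranged shard key.
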
185-190: Consider prioritizing risk mitigation approaches.

The document lists multiple approaches for handling client_id duplication and future scalability. Consider adding a recommendation section that:

- Evaluates trade-offs for each approach
- Suggests a preferred approach based on current scale
- Defines triggers for when to switch approaches

For example, using client_key + client_id might be the simplest to implement initially, while the cluster-level GUID generator could be the long-term solution when scale demands it.

Also applies to: 208-211

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between e3045dc and 7db03b2.

📒 Files selected for processing (4)

build/charts/yorkie-cluster/charts/yorkie-mongodb/values.yaml (1 hunks)
build/docker/sharding/scripts/init-mongos1.js (1 hunks)
design/mongodb-sharding.md (10 hunks)
server/backend/database/mongo/indexes.go (1 hunks)

🧰 Additional context used

🪛 LanguageTool

design/mongodb-sharding.md

[typographical] ~164-~164: Two consecutive commas
Context: ...eplica set): - shard1-1,shard1-2, shard1-3 - shard2-1,shard2-2, shard2-3 - shard3-1,`s...

(DOUBLE_PUNCTUATION)

[typographical] ~165-~165: Two consecutive commas
Context: ..., shard1-3 - shard2-1,shard2-2, shard2-3 - shard3-1,shard3-2, shard3-3 - 2 Mongos: `mong...

(DOUBLE_PUNCTUATION)

🪛 Markdownlint (0.35.0)

design/mongodb-sharding.md

201-201: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (6)

build/docker/sharding/scripts/init-mongos1.js (1)

26-26: Verify shard key selection for version vectors collection

The doc_id shard key aligns with other document-related collections, but we should ensure this provides optimal distribution for version vectors data.

Consider monitoring shard distribution patterns after deployment to ensure even data distribution.

Also applies to: 46-46

✅ Verification successful

Shard key selection for version vectors collection is appropriate

The doc_id shard key for version vectors aligns with the documented sharding strategy in design/mongodb-sharding.md. The design explicitly states that document-wide operations use doc_id as a ranged shard key, which is consistent with this implementation. This choice ensures version vectors data follows the same distribution pattern as other document-related collections, maintaining data locality and efficient querying patterns.
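
The data-locality point can be illustrated with a small Go sketch of ranged routing. This is a simplified model, not MongoDB's actual router: the `chunk` type, the `route` function, the chunk boundaries, and the shard names are all hypothetical, and plain string comparison stands in for BSON ordering of `doc_id` values.

```go
package main

import (
	"fmt"
	"sort"
)

// chunk is a hypothetical stand-in for a ranged chunk. It owns every key
// from Min (inclusive) up to the next chunk's Min (exclusive).
type chunk struct {
	Min   string
	Shard string
}

// route returns the shard that owns docID. The chunks slice must be sorted
// by Min, and the first Min should be "" (standing in for MinKey).
func route(chunks []chunk, docID string) string {
	// Find the first chunk whose lower bound is above docID; the owner is
	// the chunk just before it.
	i := sort.Search(len(chunks), func(i int) bool { return chunks[i].Min > docID })
	if i == 0 {
		return "" // no chunk covers docID (empty or malformed chunk table)
	}
	return chunks[i-1].Shard
}

func main() {
	chunks := []chunk{
		{Min: "", Shard: "shard1"},
		{Min: "5", Shard: "shard2"},
		{Min: "a", Shard: "shard3"},
	}
	for _, id := range []string{"3f00", "7c21", "b1e9"} {
		fmt.Println(id, "->", route(chunks, id))
	}
}
```

Under this model, every record that carries the same `doc_id` resolves to the same chunk regardless of which collection it belongs to, which is the locality property the comment above describes.
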

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Analyze shard key distribution pattern
# Check for any existing patterns in doc_id distribution that might affect sharding

# Look for any comments or documentation about sharding strategy
rg -A 3 "shard.*strategy|shard.*pattern" 

# Check for any existing version vectors implementations
ast-grep --pattern 'class $_ {
  $$$
  versionVector($$$) {
    $$$
  }
  $$$
}'

Length of output: 2328

server/backend/database/mongo/indexes.go (1)

Line range hint 177-182: Index optimization aligns with sharding strategy

The compound index with doc_id as the leading field supports the sharding configuration while maintaining uniqueness constraints. However, consider the following:

- The index order supports sharding, but verify that it also covers the most common query patterns.
- The unique constraint on all three fields matches the data integrity requirements.

✅ Verification successful

Index optimization aligns with sharding strategy and query patterns

Based on the codebase analysis, the compound index with doc_id as the leading field is well-aligned with both the sharding strategy and the main query patterns:

- The index supports the sharding configuration as evidenced in build/docker/sharding/scripts/init-mongos1.js where versionvectors collection is sharded on doc_id
- The query patterns from server/backend/database/mongo/client.go show that lookups are primarily done using project_id and doc_id, which the index covers efficiently
- The unique constraint on (doc_id, project_id, client_id) matches the data model requirements as shown in design/mongodb-sharding.md where version vectors are identified by this composite key
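
One likely reason the field order matters: MongoDB accepts a unique index on a sharded collection only when the shard key is a prefix of the index keys (the `_id` index aside). The Go sketch below checks that prefix rule for plain field-name lists. The helper `hasShardKeyPrefix` is hypothetical, and the "before" ordering is an assumed example of an index where `doc_id` is not first, not a quote from the pre-PR code.

```go
package main

import "fmt"

// hasShardKeyPrefix reports whether indexKeys begins with every field of
// shardKey, in the same order.
func hasShardKeyPrefix(shardKey, indexKeys []string) bool {
	if len(shardKey) == 0 || len(shardKey) > len(indexKeys) {
		return false
	}
	for i, field := range shardKey {
		if indexKeys[i] != field {
			return false
		}
	}
	return true
}

func main() {
	shardKey := []string{"doc_id"}

	// Assumed ordering with doc_id in a non-leading position.
	before := []string{"project_id", "doc_id", "client_id"}
	// Ordering described in this review: doc_id first.
	after := []string{"doc_id", "project_id", "client_id"}

	fmt.Println(hasShardKeyPrefix(shardKey, before)) // false
	fmt.Println(hasShardKeyPrefix(shardKey, after))  // true
}
```
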

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Analyze query patterns and index usage
# Look for query patterns in the codebase

# Search for queries on version vectors
rg -A 3 "find.*versionvectors|aggregate.*versionvectors"

# Look for field access patterns
ast-grep --pattern 'db.$$$versionvectors$$$find($$$)'

Length of output: 114

Script:

#!/bin/bash
# Let's try different search patterns to find version vector related queries

# Search for collection name references
rg -i "versionvectors|version_vectors" -A 3

# Search for specific mongo operations on collections
rg "Find\(|Aggregate\(" -A 3 "server/backend/database/mongo"

# Look for struct definitions and methods related to version vectors
ast-grep --pattern 'type $_ struct {
  $$$
  VersionVector $$$
  $$$
}'

# Search for mongo operation methods
ast-grep --pattern 'func ($_ *$_) Find($$$) {
  $$$
}'

Length of output: 20152

build/charts/yorkie-cluster/charts/yorkie-mongodb/values.yaml (1)

Line range hint 26-26: Cross-file consistency verification

The sharding implementation is consistent across all files:

- init-mongos1.js: Adds sharding operations
- indexes.go: Optimizes index for sharding
- values.yaml: Configures sharding rules

Also applies to: 46-46, 177-182, 60-64

✅ Verification successful

Sharding configuration is consistent across all components

The verification confirms that the sharding implementation is fully consistent:

- The shard key configuration in values.yaml matches the implementation in init-mongos1.js where all collections (changes, snapshots, syncedseqs, versionvectors) use doc_id as the shard key
- The indexes in indexes.go correctly define doc_id as the shard key with the same configuration
- The mongos replica count is consistently set to 1 and properly referenced across the configuration
- The sharding rules in values.yaml align with the actual sharding implementation in the MongoDB initialization scripts

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify consistency across files
# Check for any inconsistencies in configuration

# Look for any conflicting shard key definitions
rg -A 2 "doc_id.*shard|shard.*doc_id"

# Check for any existing version vector related tests
fd -e test.go -e spec.go -e _test.go --exec rg -l "versionvectors|VersionVectors"

Length of output: 3074

Script:

#!/bin/bash
# Check mongos replica count configuration
rg -A 2 "mongos.*ReplicaCount|replicaCount.*mongos" build/charts/yorkie-cluster/

# Check sharding configuration in values.yaml
cat build/charts/yorkie-cluster/charts/yorkie-mongodb/values.yaml

# Check for any other sharding related configurations
rg "sharding:|shards:|mongos:" build/charts/yorkie-cluster/

Length of output: 7038

design/mongodb-sharding.md (3)

3-3: LGTM: Version update is appropriate for sharding support.

The version bump to 0.5.7 aligns with the significant architectural changes introduced by sharding support.

162-167: LGTM: Well-defined cluster architecture.

The cluster architecture with 3 config servers, 3 shards (each with 3 replicas), and 2 mongos routers provides good redundancy and availability.

32-32: Verify versionvectors integration with existing collections.

The addition of versionvectors to document-wide relations and its unique constraint (doc_id, client_id) looks correct. However, we should verify that all existing queries and indexes support this new collection.

Also applies to: 51-51

✅ Verification successful

Verified versionvectors collection integration is properly implemented

The versionvectors collection is well integrated into the system with:

- Proper MongoDB schema definition and sharding configuration in build/docker/sharding/scripts/init-mongos1.js
- Complete database client implementation in server/backend/database/mongo/client.go with CRUD operations
- Consistent usage across the codebase for version control and garbage collection
- Proper indexing with doc_id as shard key and unique constraint on (doc_id, client_id)

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for existing queries and indexes related to versionvectors
rg -A 5 "versionvectors\|ColVersionVectors" --type go

# Search for any potential conflicts with the new unique constraint
ast-grep --pattern 'collection($_, "versionvectors").$$$'

Length of output: 115

Script:

#!/bin/bash
# Search for versionvectors in MongoDB schema or collection definitions
rg -i "versionvectors" --type-add 'schema:*.{js,ts,go,java,py,rb}' --type schema -A 5

# Look for version vector related code patterns
rg -i "version.?vector" --type-add 'code:*.{js,ts,go,java,py,rb}' --type code -A 5

# Search for collection definitions or indexes
fd -e js -e ts -e go -e java -e py -e rb -x grep -l -i "collection.*versionvectors\|index.*versionvectors" {}

Length of output: 82687

codecov · 2024-12-11T03:01:22Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 46.84%. Comparing base (e3045dc) to head (7db03b2).
Report is 1 commit behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1097   +/-   ##
=======================================
  Coverage   46.84%   46.84%           
=======================================
  Files          84       84           
  Lines       12256    12256           
=======================================
  Hits         5741     5741           
  Misses       5939     5939           
  Partials      576      576

Add sharding support for version vectors

7db03b2

coderabbitai bot reviewed Dec 11, 2024

hackerwins changed the title from "Add sharding support for version vectors" to "Add missing MongoDB sharding configuration for version vectors" on Dec 11, 2024

hackerwins merged commit c9a86db into main Dec 11, 2024
5 checks passed

hackerwins deleted the shard-versionvectors branch December 11, 2024 04:57
