feat(package): Enable replica set for the MongoDB results cache and configure it when starting the package. #632

Open
wants to merge 4 commits into
base: main
@junhaoliao junhaoliao commented Dec 11, 2024

Description

  1. Enable replica set for the MongoDB results cache.
  2. Configure replica set in the indices building script.
  3. Change the results cache's Docker network mode to host and specify --bind-ip and --port when starting the mongo image.

Validation performed

Search query performance comparison

With and without the replica set enabled, I ran a search query in the WebUI while recording the screen at 30 fps, and extracted the times below from the recording:

  1. T0 (relative 0 second): the query was started.
  2. T1: the first batch of results was available and displayed in the browser.
  3. T2: the total number of results was available (either from the results-metadata or summed from the aggregated time buckets) and displayed in the browser.
  4. T3: the job was marked as done and displayed in the browser.

| Trial | T0 (s) | T1 (s) | T2 (s) | T3 (s) |
|---|---|---|---|---|
| Trial 1 (with replica-set) | 0 | 0.7 | 1.266667 | 2.1 |
| Trial 2 (with replica-set) | 0 | 0.633333 | 1.433333 | 2.166667 |
| Trial 3 (with replica-set) | 0 | 0.433333 | 1.4 | 2.1 |
| Trial 4 (w/o replica-set) | 0 | 0.466667 | 1.433333 | 2.1 |
| Trial 5 (w/o replica-set) | 0 | 0.5 | 1.433333 | 2.066667 |
| Trial 6 (w/o replica-set) | 0 | 0.6 | 1.433333 | 2.2 |
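
Every timing above is a multiple of 1/30 s because it was read off a 30 fps screen recording. A minimal sketch of that conversion, assuming the times were obtained by counting frames from the one where the query started (the function name and the frame numbers are illustrative and do not come from the PR):

```python
FPS = 30  # frame rate of the screen recording


def frame_to_seconds(frame: int, start_frame: int = 0, fps: int = FPS) -> float:
    """Return the time of `frame` in seconds, relative to `start_frame`."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return round((frame - start_frame) / fps, 6)


# 38 frames after the start corresponds to the 1.266667 s listed as T2 in Trial 1.
t2_trial_1 = frame_to_seconds(38)
```

One consequence of this method is that differences smaller than one frame (about 0.033 s) cannot be resolved.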

Query latency does not seem to differ significantly with and without the replica set enabled: T3 falls between 2.07 s and 2.2 s in all six trials.
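
To make the comparison a little more concrete, the six trials can be averaged per group. This is only a sketch over the numbers reported above; with three trials per group it is a sanity check, not a statistical test, and the helper name `column_means` is made up for illustration.

```python
from statistics import mean

# (T1, T2, T3) in seconds, copied from the trials above; T0 is always 0.
WITH_REPLICA_SET = [
    (0.7, 1.266667, 2.1),
    (0.633333, 1.433333, 2.166667),
    (0.433333, 1.4, 2.1),
]
WITHOUT_REPLICA_SET = [
    (0.466667, 1.433333, 2.1),
    (0.5, 1.433333, 2.066667),
    (0.6, 1.433333, 2.2),
]


def column_means(trials):
    """Return the per-column mean of a list of equal-length tuples."""
    return tuple(mean(column) for column in zip(*trials))


means_with = column_means(WITH_REPLICA_SET)
means_without = column_means(WITHOUT_REPLICA_SET)

for label, with_rs, without_rs in zip(("T1", "T2", "T3"), means_with, means_without):
    diff = with_rs - without_rs
    print(f"{label}: with={with_rs:.3f}s without={without_rs:.3f}s diff={diff:+.3f}s")
```

The mean T3 comes out at about 2.12 s for both groups, and the T1 and T2 means differ by roughly two frames (0.067 s) in opposite directions.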

Result cache host switching

Previous attempts to enable the replica set failed because of host reachability issues. The validations below ensure that the results_cache host and port in clp-config.yml are configurable without affecting the results cache's availability.

 build/clp-package/lib/python3/site-packages/clp_py_utils/create-results-cache-indices.py

  1. Built the clp-package.

  2. Ran clp-package/sbin/start-clp.sh with the default config and observed the logs below:

    2024-12-25T22:45:58.892 INFO [start_clp] Creating results_cache indices...
    2024-12-26 03:45:59,548 [DEBUG] Replica set initialization requested for localhost:27017
    2024-12-26 03:45:59,551 [DEBUG] Replica set has not been previously initialized.
    2024-12-26 03:45:59,551 [DEBUG] Initializing replica set at localhost:27017
    2024-12-26 03:45:59,693 [DEBUG] Replica set initialized successfully.
    2024-12-25T22:46:00.517 INFO [start_clp] Created results_cache indices.
    
  3. Changed the results_cache host and port config in clp-package/etc/clp-config.yml to localhost:27018, e.g.,

    results_cache:
      host: "localhost"
      port: 27018
      db_name: "clp-query-results"
      stream_collection_name: "stream-files"

    Ran clp-package/sbin/start-clp.sh and observed the logs below:

    2024-12-25T22:50:29.309 INFO [start_clp] Creating results_cache indices...
    2024-12-26 03:50:29,761 [DEBUG] Replica set initialization requested for localhost:27018
    2024-12-26 03:50:30,266 [DEBUG] Replica set is already initialized but reports invalid config.
    2024-12-26 03:50:30,266 [DEBUG] Initializing replica set at localhost:27018
    2024-12-26 03:50:30,269 [DEBUG] Replica set initialized successfully.
    2024-12-25T22:50:31.042 INFO [start_clp] Created results_cache indices.
    
  4. Used ip addr to find another address that belongs to the clp-package host (I got 172.22.78.194). Changed the results_cache host and port config in clp-package/etc/clp-config.yml to 172.22.78.194:27018. Ran clp-package/sbin/start-clp.sh and observed the logs below:

    2024-12-25T22:54:43.400 INFO [start_clp] Creating results_cache indices...
    2024-12-26 03:54:43,919 [DEBUG] Replica set initialization requested for 172.22.78.194:27018
    2024-12-26 03:54:44,422 [DEBUG] Replica set is already initialized but reports invalid config.
    2024-12-26 03:54:44,422 [DEBUG] Initializing replica set at 172.22.78.194:27018
    2024-12-26 03:54:44,427 [DEBUG] Replica set initialized successfully.
    2024-12-25T22:54:45.239 INFO [start_clp] Created results_cache indices.
    
  5. Changed the results_cache host and port config in clp-package/etc/clp-config.yml to localhost:27017. Ran clp-package/sbin/start-clp.sh and observed the logs below:

    2024-12-25T22:57:01.151 INFO [start_clp] Creating results_cache indices...
    2024-12-26 03:57:01,619 [DEBUG] Replica set initialization requested for localhost:27017
    2024-12-26 03:57:02,122 [DEBUG] Replica set is already initialized but reports invalid config.
    2024-12-26 03:57:02,122 [DEBUG] Initializing replica set at localhost:27017
    2024-12-26 03:57:02,125 [DEBUG] Replica set initialized successfully.
    2024-12-25T22:57:02.894 INFO [start_clp] Created results_cache indices.
    

coderabbitai bot commented Dec 11, 2024

📥 Commits

Reviewing files that changed from the base of the PR and between fb4516e and 7298026.

📒 Files selected for processing (3)
  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py (2 hunks)
  • components/clp-py-utils/clp_py_utils/create-results-cache-indices.py (3 hunks)
  • components/package-template/src/etc/mongo/mongod.conf (1 hunks)

Walkthrough

The pull request introduces modifications to support MongoDB replica set initialization in a Python utility script and updates the MongoDB configuration file. The changes enable automatic replica set configuration during the index creation process for a stream collection. The script now checks the replica set status and initializes it if not already configured, ensuring proper MongoDB replication setup.

Changes

File Change Summary
components/clp-py-utils/clp_py_utils/create-results-cache-indices.py Added initialize_replica_set() function to check and initialize MongoDB replica set. Imported OperationFailure from pymongo.errors. Modified main() to call replica set initialization.
components/package-template/src/etc/mongo/mongod.conf Added replication configuration block with replSetName set to "rs0" to enable replica set support.
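
For readers without the diff at hand, the check-then-initialize flow summarized above could look roughly like the sketch below. It is not the PR's initialize_replica_set(): the function names, the return values, and the choice to force a reconfiguration when the member host has changed are assumptions made for illustration. The commands replSetGetStatus, replSetInitiate, replSetGetConfig, and replSetReconfig are standard MongoDB admin commands, and 93 / 94 are the server's InvalidReplicaSetConfig / NotYetInitialized error codes.

```python
try:
    from pymongo.errors import OperationFailure
except ImportError:
    # Stand-in so the sketch can be exercised where pymongo is not installed.
    class OperationFailure(Exception):
        def __init__(self, error, code=None):
            super().__init__(error)
            self.code = code

# MongoDB server error codes.
INVALID_REPLICA_SET_CONFIG = 93
NOT_YET_INITIALIZED = 94


def single_member_config(host, port, set_name="rs0", version=1):
    """Build a replica set config document with one member at host:port."""
    return {
        "_id": set_name,
        "version": version,
        "members": [{"_id": 0, "host": f"{host}:{port}"}],
    }


def ensure_replica_set(client, host, port, set_name="rs0"):
    """
    Make the node behind `client` the only member of `set_name` at host:port.

    Returns "initiated", "reconfigured", or "unchanged" (hypothetical labels).
    """
    admin = client.admin
    try:
        admin.command("replSetGetStatus")
    except OperationFailure as e:
        if e.code == NOT_YET_INITIALIZED:
            admin.command("replSetInitiate", single_member_config(host, port, set_name))
            return "initiated"
        if e.code != INVALID_REPLICA_SET_CONFIG:
            raise
        # The set exists, but its config no longer maps to this node; fall through.

    current = admin.command("replSetGetConfig")["config"]
    if [member["host"] for member in current["members"]] == [f"{host}:{port}"]:
        return "unchanged"

    new_config = single_member_config(host, port, set_name, current["version"] + 1)
    # Assumption: a forced reconfig is acceptable because the set has one member.
    admin.command("replSetReconfig", new_config, force=True)
    return "reconfigured"
```

With pymongo, `client` would typically be created as `MongoClient(host, port, directConnection=True)`, since a node whose replica set has not been initiated yet cannot be discovered through replica set monitoring.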

Sequence Diagram

sequenceDiagram
    participant Script as Create Results Cache Indices Script
    participant MongoDB as MongoDB Client
    
    Script->>MongoDB: Establish Connection
    Script->>MongoDB: Check Replica Set Status
    alt Replica Set Not Initialized
        Script->>MongoDB: Initialize Replica Set
        MongoDB-->>Script: Replica Set Configured
    else Replica Set Already Initialized
        Script-->>MongoDB: Continue Processing
    end
    Script->>MongoDB: Create Indexes

junhaoliao commented Dec 26, 2024

I found an issue when clp-config.yml configures a non-27017 port on host localhost for the replica set:

pymongo.errors.OperationFailure: No host described in new configuration with {version: 36921, term: -1} for replica set rs0 maps to this node, full error: {'ok': 0.0, 'errmsg': 'No host described in new configuration with {version: 36921, term: -1} for replica set rs0 maps to this node', 'code': 74, 'codeName': 'NodeNotFound'}

However, the issue is only observed with localhost hosts. If the host is not localhost / 127.0.0.1 (e.g., my WSL's virtual eth0 interface is assigned 172.22.78.194), replica set initialization succeeds with any port and there is no issue connecting to the Mongo address configured in clp-config.yml.

Looking into the MongoDB Community Server source code, I found that the server has two ways to check whether a host in the replica set config refers to itself before it configures itself as a replica set node: isSelfFastPath and isSelfSlowPath.

  • isSelfFastPath is purely logic-based: it compares the server's own address with the host and port in a given config. It checks whether the port matches first, then checks the host.
  • isSelfSlowPath handles the case where the fast path cannot match the server's address against any of the configs. It connects to the given config host and port and issues a command to check whether the other end is the server itself.

Why localhost: any non-27017 port fails

Our Mongo server instance runs within a Docker container, with only the container's port 27017 mapped to the Docker host. If we request the replica set to be configured on localhost:27017, the fast path check should succeed. If we request localhost with any non-27017 port, the fast path doesn't find a match because the server is started on 27017, and the slow path doesn't find a match either because, within the container's own network, the Docker host's port (configured in clp-config.yml and used in the replica set config) is not reachable.

Why any non-localhost host: any port works (e.g., 172.22.78.194:27018)

The fast path fails, but the slow path succeeds: Docker knows that a non-localhost address is outside the container's network and routes the connection to the Docker host, where the mapped port leads back to the Mongo server.
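
The two cases can be reproduced with a toy model of the check. This is not MongoDB's implementation: the function names mirror isSelfFastPath and isSelfSlowPath only for readability, the container address 172.17.0.2 is made up, and the network round trip of the slow path is replaced by a `can_reach_self` callback that encodes the reachability rules described above for a container with one published port.

```python
LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def is_self_fast_path(member_host, member_port, local_addresses, listen_port):
    """Pure comparison: the port has to match first, then the host."""
    if member_port != listen_port:
        return False
    return member_host in LOOPBACK_HOSTS or member_host in local_addresses


def is_self_slow_path(member_host, member_port, can_reach_self):
    """Fallback: 'connect' to member_host:member_port and ask if it is this server."""
    return can_reach_self(member_host, member_port)


def is_self(member_host, member_port, local_addresses, listen_port, can_reach_self):
    return is_self_fast_path(
        member_host, member_port, local_addresses, listen_port
    ) or is_self_slow_path(member_host, member_port, can_reach_self)


def bridge_container_reachability(published_port, listen_port=27017):
    """Model a container whose `listen_port` is published as `published_port` on the host."""

    def can_reach_self(host, port):
        if host in LOOPBACK_HOSTS:
            # Loopback traffic never leaves the container.
            return port == listen_port
        # Other addresses are routed to the Docker host and come back
        # through the published port.
        return port == published_port

    return can_reach_self


container_addresses = {"172.17.0.2"}  # hypothetical address inside the container
reach = bridge_container_reachability(published_port=27018)

localhost_27018 = is_self("localhost", 27018, container_addresses, 27017, reach)
host_ip_27018 = is_self("172.22.78.194", 27018, container_addresses, 27017, reach)
```

In this model `localhost_27018` is False (neither path matches, which corresponds to the NodeNotFound error) and `host_ip_27018` is True (matched by the slow path only).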

Proposed solutions

  1. Create a Docker network and put all services (Docker containers) that use Mongo into the same network.
  2. docker run the results cache with --network "host" so that the Mongo Docker container sees / exposes everything on the Docker host.

Considering existing CLP utilities, option 2 seems the quickest way. @haiqi96 may comment on the impact on CLP cloud.
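
A rough sketch of what option 2 could look like when the command is assembled in Python, in the style of the package's start script. The function name, container name, and image tag are hypothetical, and the real start_results_cache passes more arguments than shown here. The bind address flag is spelled `--bind_ip` below, which is how the mongod documentation lists it.

```python
def results_cache_docker_cmd(container_name, image, host, port):
    """Build a `docker run` command that starts mongod on the host network."""
    return [
        "docker", "run",
        "-d",
        "--name", container_name,
        # Share the Docker host's network stack instead of publishing a port,
        # so host:port means the same thing inside and outside the container.
        "--network", "host",
        image,
        "mongod",
        "--replSet", "rs0",
        "--bind_ip", host,
        "--port", str(port),
    ]


# Hypothetical container name and image tag, for illustration only.
cmd = results_cache_docker_cmd("clp-results-cache", "mongo:7.0", "localhost", 27018)
print(" ".join(cmd))
```

Because the server then listens on the configured port itself, the fast path check can match the configured host and port directly.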

@junhaoliao

You may also see from the {version: 36921, term: -1} in the error message that they may have some uninitialized memory issue, lol

@junhaoliao junhaoliao changed the title WIP - Enable MongoDB replica set for the results cache and initialize it when starting the package. feat(package): Enable MongoDB replica set for the results cache and initialize it when starting the package. Dec 26, 2024
@junhaoliao junhaoliao changed the title feat(package): Enable MongoDB replica set for the results cache and initialize it when starting the package. feat(package): Enable MongoDB replica set for the results cache and configure it when starting the package. Dec 26, 2024
@junhaoliao junhaoliao marked this pull request as ready for review December 26, 2024 03:59
@junhaoliao junhaoliao requested a review from haiqi96 December 26, 2024 03:59
@junhaoliao junhaoliao changed the title feat(package): Enable MongoDB replica set for the results cache and configure it when starting the package. feat(package): Enable replica set for the MongoDB results cache and configure it when starting the package. Dec 26, 2024
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
components/clp-py-utils/clp_py_utils/create-results-cache-indices.py (2)

19-33: Consider parametrising the replica set host and port.

Although this initialisation flow works well for the default port (27017) on localhost, it may not be sufficient if the user wants to set up a replica set on a non-standard port or host. Passing in the host and port as parameters will improve flexibility and will align with the PR’s objective of making replica set configuration more adaptable.


47-49: Avoid creating two separate MongoClient connections.

Re-using the same MongoClient after invoking initialize_replica_set can be more efficient and simpler. If you do require a new client connection for index creation, consider adding a short comment explaining why a fresh connection is needed, so it is clear to maintainers.

📥 Commits

Reviewing files that changed from the base of the PR and between 6dd8fc1 and fb4516e.

📒 Files selected for processing (2)
  • components/clp-py-utils/clp_py_utils/create-results-cache-indices.py (3 hunks)
  • components/package-template/src/etc/mongo/mongod.conf (1 hunks)
🔇 Additional comments (2)
components/clp-py-utils/clp_py_utils/create-results-cache-indices.py (1)

6-6: Good use of the OperationFailure import.

This import from pymongo.errors is essential for handling replica set status checks and ensures that any non-initialised replica set is properly caught and configured.

components/package-template/src/etc/mongo/mongod.conf (1)

3-4: Replica set configuration approved.

Enabling the replica set by specifying replSetName is a straightforward and effective strategy. If you later find a need to run multiple replica sets on the same machine, you may consider making rs0 a configurable parameter in the environment or in a generated config file.

haiqi96 commented Jan 9, 2025

Quick question: for the "Result cache host switching" validation, I assume you also ensured that search in the WebUI still works with the different IPs and ports?

haiqi96 commented Jan 9, 2025

Discussed with Junhao offline and clarified two points:

  1. The current replica set only has one member, so it doesn't actually provide redundancy.
  2. The purpose of the replica set is not to support redundancy, but to enable the oplog, which supports watch(); watch() replaces polling as the way to check for updates in the collection.

Imo, we should make these two points clear in the code. We have agreed on renaming some replica-related functions and adding docstrings so that developers who only read the code also have the context above.
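
As context for point 2, here is a minimal sketch of consuming updates with watch() instead of polling. It assumes a pymongo-style collection object; the function name follow_changes and the max_events parameter are made up for illustration. Change streams are only available on replica sets (or sharded clusters), which is why even a single-member set is needed.

```python
def follow_changes(collection, on_change, max_events=None):
    """
    Call `on_change` for every change event the server pushes for `collection`.

    Returns the number of events handled. With `max_events=None` the loop
    runs until the stream is closed.
    """
    handled = 0
    # watch() opens a change stream backed by the oplog; iterating it blocks
    # until the next event arrives, so no periodic query is needed.
    with collection.watch() as stream:
        for change in stream:
            on_change(change)
            handled += 1
            if max_events is not None and handled >= max_events:
                break
    return handled
```

With pymongo this could be called as `follow_changes(client["clp-query-results"][collection_name], print)`, where `collection_name` is whichever results collection is being observed.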

@@ -491,6 +491,7 @@ def start_results_cache(instance_id: str, clp_config: CLPConfig, conf_dir: pathl
     cmd = [
         "docker", "run",
         "-d",
+        "--network", "host",

Contributor

Just curious, do you know exactly why we need --network host? is there any IP other than results_cache.port that will be accessed?

Member Author

Using --network host with MongoDB in Docker ensures proper replica set configuration: Docker's network isolation interferes with MongoDB's isSelf checks when the replica set is configured on localhost with a port other than 27017. By exposing the host network to the MongoDB container, we avoid this issue.

More details of the investigation can be found at #632 (comment). Let me know if I can elaborate on any part.
