Enhance GetDocuments API by adding bulk retrieval #931

kokodak · 2024-07-16T17:27:02Z

What this PR does / why we need it:

This PR implements a bulk retrieval operation for the GetDocuments API to enhance performance.

The specific tasks accomplished include:

- Addition of the FindDocInfosByKeys() method to the Database interface
- Implementation of the FindDocInfosByKeys() logic in mongo.Client and memory.DB respectively
- Addition of test code for mongo and in-memory implementations
- Replacement of DB queries used in GetDocument API and GetDocuments API

While the query to retrieve DocInfos has been reduced from N times to once when calling the GetDocuments API, there still remains an issue where packs.BuildDocumentForServerSeq() is called N times.

However, this logic seems to be related to CRDT or logical clock functionalities, which I do not fully understand yet, so I could not work on it. Therefore, I did not remove the TODO comment regarding the N+1 issue.
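
To illustrate the difference between the per-key lookups and the bulk lookup described above, here is a minimal, self-contained sketch. It is not the code from this PR: `memDB`, `DocInfo`, and the query counter are hypothetical simplifications, and skipping unknown keys is an assumption made for the example.

```go
package main

import "fmt"

// DocInfo is a simplified, hypothetical stand-in for the server's document metadata.
type DocInfo struct {
	Key string
}

// memDB is a toy in-memory store that counts how many queries it serves.
type memDB struct {
	docs    map[string]DocInfo
	queries int
}

// FindDocInfoByKey looks up a single document; each call counts as one query.
func (db *memDB) FindDocInfoByKey(key string) (DocInfo, bool) {
	db.queries++
	info, ok := db.docs[key]
	return info, ok
}

// FindDocInfosByKeys looks up many documents in one query.
// Unknown keys are skipped (an assumption for this sketch).
func (db *memDB) FindDocInfosByKeys(keys []string) []DocInfo {
	db.queries++
	infos := make([]DocInfo, 0, len(keys))
	for _, key := range keys {
		if info, ok := db.docs[key]; ok {
			infos = append(infos, info)
		}
	}
	return infos
}

// newMemDB returns a store pre-populated with three documents.
func newMemDB() *memDB {
	return &memDB{docs: map[string]DocInfo{
		"doc-a": {Key: "doc-a"},
		"doc-b": {Key: "doc-b"},
		"doc-c": {Key: "doc-c"},
	}}
}

func main() {
	keys := []string{"doc-a", "doc-b", "doc-c"}

	// One query per key: the query count grows with len(keys).
	perKey := newMemDB()
	for _, key := range keys {
		perKey.FindDocInfoByKey(key)
	}

	// One query for all keys.
	bulk := newMemDB()
	infos := bulk.FindDocInfosByKeys(keys)

	fmt.Println("per-key queries:", perKey.queries)
	fmt.Println("bulk queries:", bulk.queries, "docs:", len(infos))
}
```

The per-document call to packs.BuildDocumentForServerSeq() mentioned above is outside the scope of this sketch.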

Which issue(s) this PR fixes:

Fixes #921

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Additional documentation:

Checklist:

Added relevant tests or not required
Didn't break anything

Summary by CodeRabbit

New Features
- Enhanced document retrieval by adding a method to find document information based on given keys.
- Introduced a new field to specify whether to include snapshots in document requests.
Bug Fixes
- Improved efficiency of document information retrieval, addressing the N+1 query problem.
Documentation
- Updated OpenAPI specifications for improved readability and consistency.

coderabbitai · 2024-07-16T17:27:11Z

Walkthrough

The new method FindDocInfosByKeys was introduced to the DB struct in database.go, enabling the retrieval of multiple documents based on given keys. A corresponding test function, RunFindDocInfosByKeysTest, was also added to verify the functionality. These enhancements aim to improve the performance of the GetDocuments API by facilitating efficient bulk data queries.

Changes

| File | Change Summary |
| --- | --- |
| `server/backend/database/memory/database.go` | Added `FindDocInfosByKeys` method to retrieve documents based on given keys. |
| `server/backend/database/testcases/testcases.go` | Added `RunFindDocInfosByKeysTest` to test the `FindDocInfosByKeys` method by creating documents with specified keys and verifying the retrieval process. |
| `server/documents/documents.go` | Revised `GetDocumentSummary` and `GetDocumentSummaries` to use the new `FindDocInfosByKeys` method for improved bulk retrieval efficiency. |
| `server/rpc/admin_server.go` | Updated `GetDocuments` function to include a new parameter for the `include_snapshot` flag to enhance data retrieval options. |
| `api/yorkie/v1/admin.proto` | Added `include_snapshot` field to `GetDocumentsRequest` for optional snapshot inclusion in responses. |

Assessment against linked issues

| Objective | Addressed | Explanation |
| --- | --- | --- |
| Implement DB Query for GetDocuments API to improve performance (#921) | ✅ | |

In the realm of keys and docs they spin,
Where queries dance and tests begin,
Performance soared, the code refined,
In FindDocInfosByKeys, success we find.
🐇🚀

coderabbitai

Actionable comments posted: 1

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between b468f8b and eb1425d.

Files selected for processing (8)

server/backend/database/database.go (1 hunks)
server/backend/database/memory/database.go (1 hunks)
server/backend/database/memory/database_test.go (1 hunks)
server/backend/database/mongo/client.go (1 hunks)
server/backend/database/mongo/client_test.go (1 hunks)
server/backend/database/testcases/testcases.go (1 hunks)
server/documents/documents.go (2 hunks)
test/sharding/mongo_client_test.go (1 hunks)

Additional comments not posted (8)

server/backend/database/memory/database_test.go (1)

51-53: Approval of new test case addition.

The addition of the RunFindDocInfosByKeys test is aligned with the PR's objectives to enhance the GetDocuments API performance. This test ensures the new bulk retrieval method works as expected in the memory database implementation.

server/backend/database/mongo/client_test.go (1)

66-68: Approval of new test case addition.

The addition of the RunFindDocInfosByKeys test is aligned with the PR's objectives to enhance the GetDocuments API performance. This test ensures the new bulk retrieval method works as expected in the MongoDB database implementation.

test/sharding/mongo_client_test.go (1)

75-77: Approval of new test case addition.

The addition of the RunFindDocInfosByKeys test is aligned with the PR's objectives to enhance the GetDocuments API performance. This test ensures the new bulk retrieval method works as expected in the MongoDB client with sharded database configuration.

server/documents/documents.go (2)

100-100: Approval of updated GetDocumentSummary function.

The simplification of the document retrieval process in GetDocumentSummary by using FindDocInfoByKey is a positive change, enhancing the efficiency and maintainability of the code.

123-148: Approval of updated GetDocumentSummaries function.

The update to GetDocumentSummaries to use FindDocInfosByKeys for bulk document retrieval is a significant improvement. This change effectively addresses the N+1 problem and enhances the performance of the API.

server/backend/database/database.go (1)

167-172: New method FindDocInfosByKeys added to Database interface

The addition of FindDocInfosByKeys to the Database interface is a key enhancement for supporting bulk document retrieval. The method signature correctly takes a context, a project ID, and a slice of document keys, which is consistent with the interface's pattern for similar methods.

Correctness: The method signature is correct and aligns with Go's conventions for interfaces.

Performance: This method supports bulk operations, which should improve performance as noted in the PR objectives.

Maintainability: The method is clearly defined and fits well with the existing structure of the interface.

server/backend/database/mongo/client.go (1)

766-793: Review of the new method FindDocInfosByKeys.

This method aims to fetch multiple documents based on their keys, which aligns with the PR's objective to enhance performance by reducing the number of database queries. Here are a few observations and suggestions:

Error Handling: The method correctly handles potential errors from the MongoDB operations, which is crucial for robustness.

Efficiency: Using the $in operator with the keys array is efficient for fetching multiple documents in a single query.

Filter Construction: The method constructs a filter to exclude documents marked as removed, which is a good practice for data integrity.

However, consider the following improvements:

Logging: Adding logging before and after the MongoDB operations could help in debugging and monitoring the performance of this method.

Testing: Ensure that there are comprehensive tests covering various scenarios, including cases with large numbers of keys, no keys, and keys that do not match any documents.

Overall, the implementation looks solid and should contribute positively to the system's performance.
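
As a rough illustration of the filter shape discussed in this review, the sketch below builds a MongoDB-style query document using plain Go maps. It is not the code from `client.go`: the field names `project_id`, `key`, and `removed_at`, and the use of `$exists` to exclude removed documents, are assumptions made for the example. Plain maps are used so the example runs without the MongoDB driver.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildDocInfosFilter returns a MongoDB-style filter that matches every
// document of one project whose key is in keys and that is not marked removed.
// All field names here are assumed for illustration.
func buildDocInfosFilter(projectID string, keys []string) map[string]any {
	return map[string]any{
		"project_id": projectID,
		// $in matches any document whose key equals one of the listed values.
		"key": map[string]any{"$in": keys},
		// Treat a missing removed_at field as "not removed" (assumption).
		"removed_at": map[string]any{"$exists": false},
	}
}

func main() {
	filter := buildDocInfosFilter("project-1", []string{"doc-a", "doc-b"})

	// Print the filter as JSON to show the query document that would be sent.
	out, err := json.MarshalIndent(filter, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

A single query with this shape replaces one lookup per key, which is the efficiency point made above.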

server/backend/database/testcases/testcases.go (1)

96-129: Review of RunFindDocInfosByKeysTest Function

Context Setup: The function correctly sets up the test context and activates a client. This is a standard setup for database-related tests.

Document Creation Simulation: The function simulates the creation of documents by attempting to find document information for a set of keys. However, it does not actually create any documents but only checks if they can be retrieved, assuming they exist. This might be misleading as the name suggests creation but it only checks existence.

Bulk Retrieval and Validation: The bulk retrieval using FindDocInfosByKeys is correctly implemented. The function checks if the keys of the retrieved documents match the expected keys using assert.ElementsMatch, which is appropriate for unordered comparisons.

Length Check: The function also checks if the number of retrieved documents matches the number of requested keys using assert.Len. This is a good practice to ensure that no documents are missing or unexpectedly added.

Error Handling: The function properly checks for errors after each database operation, which is crucial for identifying issues early in the test.

Test Isolation: Each test run is isolated using t.Run, which is good for separating test cases and identifying which specific test fails if there are multiple failures.

Suggestions:

Consider actually creating the documents in the database before trying to retrieve them. This would make the test more comprehensive and realistic.

Add more detailed assertions to check the contents of the retrieved documents, not just their keys.

server/backend/database/memory/database.go

sejongk · 2024-07-17T04:01:22Z

Although the logic you mentioned is related to CRDT, it seems okay to understand it roughly and focus on query optimization.
Do you think it is possible to implement the bulk operation of BuildDocumentForServerSeq as well, perhaps using $or?

server/backend/database/testcases/testcases.go

kokodak · 2024-07-17T10:01:48Z

> Although the logic you mentioned is related to CRDT, it seems okay to understand it roughly and focus on query optimization. Do you think it is possible to implement the bulk operation of BuildDocumentForServerSeq as well, perhaps using $or?

@sejongk
I'll give it a try. Is it okay if I ask questions if I encounter any issues during the implementation?

sejongk · 2024-07-17T11:15:46Z

> @sejongk I'll give it a try. Is it okay if I ask questions if I encounter any issues during the implementation?

Sure. If you have any suggestions about this, please let me know.

coderabbitai

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between eb1425d and d847567.

Files selected for processing (2)

server/backend/database/memory/database.go (1 hunks)
server/backend/database/testcases/testcases.go (1 hunks)

Files skipped from review as they are similar to previous changes (2)

server/backend/database/memory/database.go
server/backend/database/testcases/testcases.go

hackerwins · 2024-07-19T02:09:37Z

> > Although the logic you mentioned is related to CRDT, it seems okay to understand it roughly and focus on query optimization. Do you think it is possible to implement the bulk operation of BuildDocumentForServerSeq as well, perhaps using $or?
>
> @sejongk
> I'll give it a try. Is it okay if I ask questions if I encounter any issues during the implementation?

@kokodak @sejongk

DocumentSummaries, which is in the response from the GetDocuments API, contains both time-related metadata about the document and its content (the snapshot). Unlike the metadata, retrieving the snapshot requires loading the document into memory, which can be relatively resource-intensive.

Document List Page in CodePair, which uses this API, only uses the time-related metadata and not snapshot.

https://www.figma.com/design/OYc1Cr0nvFuBnWZxhscfDk/Code-Pair?node-id=42-101&t=lCXENp1HuDnFAkwq-0

Therefore, how about adding an option (include snapshot) to the API request to specify whether the snapshot should be included?

sejongk · 2024-07-19T03:43:06Z

> > > Although the logic you mentioned is related to CRDT, it seems okay to understand it roughly and focus on query optimization. Do you think it is possible to implement the bulk operation of BuildDocumentForServerSeq as well, perhaps using $or?
> >
> > @sejongk
> > I'll give it a try. Is it okay if I ask questions if I encounter any issues during the implementation?
>
> @kokodak @sejongk
>
> DocumentSummaries, which is in the response from GetDocuments API, contains both time-related metadata about Document and its content, snapshot. Unlike metadata, retrieving snapshot requires loading the document into memory, which can be relatively resource-intensive.
>
> Document List Page in CodePair, which uses this API, only uses the time-related metadata and not snapshot.
>
> https://www.figma.com/design/OYc1Cr0nvFuBnWZxhscfDk/Code-Pair?node-id=42-101&t=lCXENp1HuDnFAkwq-0
>
> Therefore, how about adding an option(include snapshot) in the API request to specify whether snapshot should be included.

Thanks for your suggestion. I believe this suggested method is somewhat related to #597.

kokodak · 2024-07-19T05:09:32Z

I have reviewed all the comments provided.

Currently, I have completed the implementation of bulk query methods for DB.FindClosestSnapshotInfo() and DB.FindChangesBetweenServerSeqs(), which are used in BuildDocumentForServerSeq().

However, I am facing some issues and need help with the following:

1. Although the bulk query operations are implemented, I am having difficulty writing test cases. Creating good test scenarios is challenging. Could I get some help with this?

2. I generally understand the context of @hackerwins's comment, but I am a bit unclear about the exact meaning of "snapshot" since the term is used in several places in the code.
If the request value for include snapshot is false, does it mean that DB.FindClosestSnapshotInfo() should be called with includeSnapshot set to false, or does it mean that packs.BuildDocumentForServerSeq() should not be executed at all? (I am inclined to believe it's the latter.)

2-a. If the latter is correct, should we still keep the bulk query code mentioned in point 1, to handle cases where include snapshot is true?

3. I also agree that passing only the minimal information needed to render the screen is a good idea. However, if it turns out that snapshots will never be used in the GetDocuments API, we might consider configuring the code to exclude snapshots without adding an option to the API request. What are your thoughts on this? Should we still include the option in the request for flexibility?

3-a. If we decide to include the option, we will need to coordinate with the front-end regarding the changes in the API structure. How should we approach this discussion?

coderabbitai

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between d847567 and 327e91d.

Files ignored due to path filters (1)

api/yorkie/v1/admin.pb.go is excluded by !**/*.pb.go

Files selected for processing (6)

api/docs/yorkie/v1/admin.openapi.yaml (56 hunks)
api/docs/yorkie/v1/resources.openapi.yaml (64 hunks)
api/docs/yorkie/v1/yorkie.openapi.yaml (49 hunks)
api/yorkie/v1/admin.proto (1 hunks)
server/documents/documents.go (2 hunks)
server/rpc/admin_server.go (1 hunks)

Files skipped from review due to trivial changes (3)

api/docs/yorkie/v1/admin.openapi.yaml
api/docs/yorkie/v1/resources.openapi.yaml
api/docs/yorkie/v1/yorkie.openapi.yaml

Files skipped from review as they are similar to previous changes (1)

server/documents/documents.go

Additional comments not posted (2)

api/yorkie/v1/admin.proto (1)

120-120: LGTM! The include_snapshot field is correctly added.

The new field include_snapshot is correctly defined and integrated within the GetDocumentsRequest message.

server/rpc/admin_server.go (1)

228-228: LGTM! But verify the function usage in the codebase.

The code changes are approved.

However, ensure that all function calls to GetDocuments match the new signature.

hackerwins

Thanks for your contribution. 👍

kokodak · 2024-07-19T16:57:10Z

> DocumentSummaries, which is in the response from GetDocuments API, contains both time-related metadata about Document and its content, snapshot. Unlike metadata, retrieving snapshot requires loading the document into memory, which can be relatively resource-intensive.
>
> Document List Page in CodePair, which uses this API, only uses the time-related metadata and not snapshot.
>
> https://www.figma.com/design/OYc1Cr0nvFuBnWZxhscfDk/Code-Pair?node-id=42-101&t=lCXENp1HuDnFAkwq-0
>
> Therefore, how about adding an option(include snapshot) in the API request to specify whether snapshot should be included.

Based on the discussions with @hackerwins and @sejongk regarding the comment ideas above, we have decided to implement the option to include or exclude snapshots in the API request.

As a result, the GetDocuments API request specification has changed, which can be reviewed in this commit.

Consequently, by adding the include_snapshot field with a value of false in the CodePair code, we can expect performance improvements in the GetDocuments API.
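
The effect of the option can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from `server/documents/documents.go`: `DocInfo`, `DocumentSummary`, `buildSummaries`, and the `build` callback are hypothetical names, and the callback merely stands in for the expensive step of loading a document into memory.

```go
package main

import "fmt"

// DocInfo is a hypothetical, simplified metadata record for a document.
type DocInfo struct {
	Key       string
	UpdatedAt string
}

// DocumentSummary mirrors the idea of a summary: metadata plus an optional snapshot.
type DocumentSummary struct {
	Key       string
	UpdatedAt string
	Snapshot  string // left empty when the caller did not ask for it
}

// buildSummaries converts metadata into summaries. The build callback stands in
// for the expensive step and is only invoked when includeSnapshot is true.
func buildSummaries(infos []DocInfo, includeSnapshot bool, build func(DocInfo) string) []DocumentSummary {
	summaries := make([]DocumentSummary, 0, len(infos))
	for _, info := range infos {
		summary := DocumentSummary{Key: info.Key, UpdatedAt: info.UpdatedAt}
		if includeSnapshot {
			summary.Snapshot = build(info)
		}
		summaries = append(summaries, summary)
	}
	return summaries
}

func main() {
	infos := []DocInfo{
		{Key: "doc-a", UpdatedAt: "2024-07-19T02:09:37Z"},
		{Key: "doc-b", UpdatedAt: "2024-07-19T05:09:32Z"},
	}

	// Count how often the expensive step runs.
	builds := 0
	build := func(info DocInfo) string {
		builds++
		return "snapshot of " + info.Key
	}

	// A list page that only needs metadata skips the expensive step entirely.
	listView := buildSummaries(infos, false, build)
	fmt.Println("summaries:", len(listView), "snapshot builds:", builds)

	// A caller that needs content pays for one build per document.
	detailView := buildSummaries(infos, true, build)
	fmt.Println("summaries:", len(detailView), "snapshot builds:", builds)
}
```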

There was an issue with the updatedAt of a document showing another document's updatedAt. Specifically, the updatedAt in the document list was being reversed and reflecting a different document's value. This issue occurred during the bulk retrieval of document lists using yorkie-team/yorkie#931. In this process, there was no guarantee that the order of the keys passed to the DB query matched the order of the documents in the query result.
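
One way to guard against this kind of ordering mismatch is to re-align the query result with the requested keys before pairing them up. The sketch below is only an illustration of that idea, not the fix that was actually applied; `DocInfo` and `orderByKeys` are hypothetical names.

```go
package main

import "fmt"

// DocInfo is a hypothetical, simplified metadata record for a document.
type DocInfo struct {
	Key       string
	UpdatedAt string
}

// orderByKeys returns the infos rearranged to follow the order of keys.
// Keys without a matching info are skipped.
func orderByKeys(keys []string, infos []DocInfo) []DocInfo {
	// Index the unordered result by key for constant-time lookups.
	byKey := make(map[string]DocInfo, len(infos))
	for _, info := range infos {
		byKey[info.Key] = info
	}

	ordered := make([]DocInfo, 0, len(infos))
	for _, key := range keys {
		if info, ok := byKey[key]; ok {
			ordered = append(ordered, info)
		}
	}
	return ordered
}

func main() {
	keys := []string{"doc-b", "doc-a"}

	// Pretend the database returned the documents in a different order.
	result := []DocInfo{
		{Key: "doc-a", UpdatedAt: "2024-07-16T17:27:02Z"},
		{Key: "doc-b", UpdatedAt: "2024-07-19T16:57:10Z"},
	}

	for i, info := range orderByKeys(keys, result) {
		fmt.Println(keys[i], "->", info.Key, info.UpdatedAt)
	}
}
```

Matching by key rather than by position means a caller never relies on the order in which the database happens to return documents.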

kokodak added 2 commits July 17, 2024 00:51

Implement bulk version of read query

c806837

Replace DB query used in GetDocumentSummary and GetDocumentSummaries

eb1425d

coderabbitai bot reviewed Jul 16, 2024

krapie requested review from sejongk and devleejb July 17, 2024 04:02

krapie assigned kokodak Jul 17, 2024

krapie added the enhancement 🌟 New feature or request label Jul 17, 2024

hackerwins requested changes Jul 17, 2024

Modify logic so that MemDB and MongoDB behave same for non-existent keys

d847567

coderabbitai bot reviewed Jul 18, 2024

Add snapshot inclusion option to GetDocuments API request spec

327e91d

coderabbitai bot reviewed Jul 19, 2024

hackerwins self-requested a review July 19, 2024 16:48

hackerwins approved these changes Jul 19, 2024

hackerwins changed the title ~~Implement bulk retrieval operation for GetDocuments API to enhance performance~~ Enhance GetDocuments API by adding bulk retrieval Jul 19, 2024

hackerwins merged commit a4ce314 into yorkie-team:main Jul 19, 2024
4 checks passed

This was referenced Jul 20, 2024

Update GetDocuments API request specification in CodePair to match recent changes in Admin Server yorkie-team/codepair#240

Closed

Update Code to Match GetDocuments API Request Specifications yorkie-team/codepair#242

Merged

BrewTestBot mentioned this pull request Jul 25, 2024

yorkie 0.4.28 Homebrew/homebrew-core#178403

Merged

blurfx mentioned this pull request Jul 25, 2024

Issue with updatedAt field showing incorrect values in document list yorkie-team/codepair#253

Closed

coderabbitai bot mentioned this pull request Sep 12, 2024

Add consistency test for ClientInfo update failure in PushPull #1000

Merged

2 tasks

This was referenced Oct 15, 2024

Update CHANGELOG.md for v0.5.1 #1034

Merged

Detach documents when client is deactivated #1036

Merged

coderabbitai bot mentioned this pull request Oct 27, 2024

Introduce cmap for distributing mutexes per documents #1051

Merged

2 tasks

kokodak mentioned this pull request Nov 1, 2024

Add kokodak to members yorkie-team/community#1

Merged

3 tasks

This was referenced Nov 1, 2024

Introduce dedicated event publisher per document #1052

Merged

Optimize document detachment in Cluster Server #1055

Merged

coderabbitai bot mentioned this pull request Nov 14, 2024

Convert presence change from string to binary #1069

Merged

2 tasks
