feat(clp-s): Add support for reading and searching single file archives. #656

Merged: 13 commits, Jan 15, 2025

@gibber9809 gibber9809 commented Jan 8, 2025

Description

This PR adds support for reading and searching clp-s single file archives. The general approach is to introduce a new "adaptor" class that allows most of the code to be oblivious as to whether the underlying archive is a single or multi-file archive.

This class is called "ArchiveReaderAdaptor" and makes up the bulk of this change.

Other notable changes include:

  • Modifying the multi-file-archive format to:
    • Remove the dedicated timestamp dict file
    • Add a new "header" file which includes header and metadata identical to the single-file archive
  • Allowing ZstdDecompressor to accept a clp::ReaderInterface
  • Adapting read-side code to invoke ArchiveReaderAdaptor instead of opening files directly
  • Changing when different sections of the archive get opened during the read-side flow to ensure that single-file-archives can be read without seeks
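The "read without seeks" constraint in the last bullet can be pictured as a forward-only cursor over the sections of a single-file archive. The following is a minimal sketch under stated assumptions, not the code in this PR: `Section`, `SectionCursor`, and their members are hypothetical names, and the real ArchiveReaderAdaptor hands out reader objects rather than section descriptors.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical description of one section inside a single-file archive.
struct Section {
    std::string name;
    uint64_t offset;  // Start of the section, relative to the first section
    uint64_t size;    // Size of the section in bytes
};

// Hypothetical forward-only cursor: only one section may be checked out at a
// time, and a section that starts before the current position is rejected
// because reading it would require a backwards seek.
class SectionCursor {
public:
    explicit SectionCursor(std::vector<Section> sections) : m_sections(std::move(sections)) {}

    Section const& checkout(std::string const& name) {
        if (m_checked_out) {
            throw std::logic_error("another section is already checked out");
        }
        for (std::size_t i = 0; i < m_sections.size(); ++i) {
            auto const& section = m_sections[i];
            if (section.name != name) {
                continue;
            }
            if (section.offset < m_position) {
                throw std::logic_error("reading '" + name + "' would require a backwards seek");
            }
            m_current = i;
            m_checked_out = true;
            return section;
        }
        throw std::out_of_range("unknown section: " + name);
    }

    void checkin() {
        if (false == m_checked_out) {
            throw std::logic_error("no section is checked out");
        }
        // Advance past the section that was just consumed.
        auto const& section = m_sections[m_current];
        m_position = section.offset + section.size;
        m_checked_out = false;
    }

    [[nodiscard]] uint64_t position() const { return m_position; }

private:
    std::vector<Section> m_sections;
    std::size_t m_current{0};
    uint64_t m_position{0};
    bool m_checked_out{false};
};
```

With sections laid out in the order they are consumed, a caller checks out and checks in each section in turn; asking for an earlier section afterwards fails instead of silently seeking backwards.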

Benchmarks

Some benchmarking was performed to get an idea of compression, decompression, and search performance with single file archives. All of the following results are averaged over 3-5 runs on the local filesystem, with the cache cleared between runs.

Compression is slightly slower (~1.2%) when compressing single-file archives (compared to multi-file archives).

| dataset | sfa compression time / mfa compression time |
| --- | --- |
| cockroach | 0.996 |
| mongodb | 1.040 |
| elasticsearch | 1.013 |
| spark | 0.998 |
| postgresql | 1.011 |
| average | 1.012 |

Surprisingly, decompression is also slightly slower (~1%). This should be acceptable for now, and we can work on optimizing bottlenecks later.

| dataset | sfa decompression time / mfa decompression time |
| --- | --- |
| cockroach | 0.996 |
| mongodb | 1.006 |
| elasticsearch | 1.010 |
| spark | 1.038 |
| postgresql | 1.001 |
| average | 1.010 |

Likewise, search appears to be slower (~3.7%). Again, this is worth optimizing later, but should be sufficient for now.

| query (from our paper) | sfa search time / mfa search time |
| --- | --- |
| J | 0.963 |
| M | 1.090 |
| K | 1.060 |
| average | 1.037 |

Validation performed

  • Validated that multi-file-archives on the local filesystem can be decompressed and searched successfully on the command line
  • Validated that single-file-archives on s3 can be decompressed and searched successfully on the command line
  • Validated that single-file-archives on the local filesystem can be decompressed and searched successfully on the command line

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced ArchiveReaderAdaptor to enhance archive reading capabilities.
    • Added support for opening packed streams in the archive reader.
    • New methods for managing archive metadata and file information.
  • Improvements

    • Streamlined dictionary and stream reading processes.
    • Enhanced error handling for archive operations.
    • Simplified interactions between archive readers and writers.
  • Technical Updates

    • Updated methods for reading and managing archive metadata.
    • Modified decompression utilities to support new input types.
    • Refactored archive reading and processing workflows.
    • Updated method signatures to improve modularity and clarity.
    • Removed deprecated methods related to timestamp dictionary handling.

These changes improve the overall robustness and flexibility of the archive reading and processing infrastructure.

coderabbitai bot commented Jan 8, 2025

Walkthrough

This pull request introduces a comprehensive refactoring of the archive reading and processing infrastructure in the clp_s namespace. The changes center around the introduction of a new ArchiveReaderAdaptor class, which provides a more flexible and modular approach to reading archives. The modifications span multiple components, including dictionary readers, decompressors, and utility classes, with a focus on improving how archives are opened, read, and managed across different input sources.

Changes

| File | Change Summary |
| --- | --- |
| components/core/CMakeLists.txt | Added ArchiveReaderAdaptor.cpp and ArchiveReaderAdaptor.hpp to source files |
| components/core/src/clp_s/CMakeLists.txt | Added ArchiveReaderAdaptor.cpp and ArchiveReaderAdaptor.hpp to CLP_S_SOURCES |
| components/core/src/clp_s/ArchiveReader.cpp/hpp | Integrated ArchiveReaderAdaptor, added open_packed_streams() method, updated dictionary and metadata reading |
| components/core/src/clp_s/DictionaryReader.hpp | Updated to use ArchiveReaderAdaptor, renamed read_new_entries() to read_entries() |
| components/core/src/clp_s/ReaderUtils.cpp/hpp | Modified methods to accept ArchiveReaderAdaptor instead of directory paths |
| components/core/src/clp_s/ZstdDecompressor.cpp/hpp | Added support for clp::ReaderInterface input type |
| components/core/src/clp_s/TimestampDictionaryReader.cpp/hpp | Removed file management methods, added read() method with decompressor |
| components/core/src/clp_s/archive_constants.hpp | Added cArchiveHeaderFile, removed cArchiveTimestampDictFile |

Sequence Diagram

sequenceDiagram
    participant AR as ArchiveReader
    participant ARA as ArchiveReaderAdaptor
    participant DR as DictionaryReader
    participant ZD as ZstdDecompressor

    AR->>ARA: load_archive_metadata()
    ARA-->>AR: Metadata loaded
    AR->>ARA: checkout_reader_for_section()
    ARA-->>AR: Reader for section
    AR->>DR: read_entries()
    DR->>ARA: Get reader
    ARA-->>DR: Reader interface
    DR->>ZD: read with decompressor
    ZD-->>DR: Entries read

The sequence diagram illustrates the new workflow for reading archives, highlighting the central role of the ArchiveReaderAdaptor in managing reader interfaces, sections, and decompression across different components.

Suggested reviewers

  • kirkrodrigues

@gibber9809 gibber9809 requested a review from wraymo January 8, 2025 21:38

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 8

🔭 Outside diff range comments (3)
components/core/src/clp_s/ReaderUtils.cpp (2)

Line range hint 17-61: Ensure schema_tree_reader is always checked back in

To prevent potential resource leaks, consider using RAII (Resource Acquisition Is Initialization) or a scope guard to ensure that adaptor.checkin_reader_for_section is called even if an exception is thrown before it. This will guarantee that the reader is always checked back in, maintaining resource integrity.


Line range hint 90-141: Ensure schema_id_reader is always checked back in

In the read_schemas method, to prevent potential resource leaks, consider implementing RAII or a scope guard for schema_id_reader. This will ensure that adaptor.checkin_reader_for_section is called even if exceptions occur, maintaining resource integrity.
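
The RAII suggestion in the two comments above can be illustrated with a small scope guard. This is a sketch under stated assumptions: `CheckinGuard` and `read_section` are hypothetical names, and the lambda stands in for a call such as adaptor.checkin_reader_for_section, whose exact signature is not shown in this thread.

```cpp
#include <functional>
#include <stdexcept>
#include <utility>

// Hypothetical scope guard: runs the supplied callback when the guard goes out
// of scope, including during stack unwinding after an exception. The callback
// itself should not throw, since it runs inside a destructor.
class CheckinGuard {
public:
    explicit CheckinGuard(std::function<void()> checkin) : m_checkin(std::move(checkin)) {}

    ~CheckinGuard() {
        if (m_checkin) {
            m_checkin();
        }
    }

    // A guard owns exactly one pending check-in, so copying is disallowed.
    CheckinGuard(CheckinGuard const&) = delete;
    CheckinGuard& operator=(CheckinGuard const&) = delete;

private:
    std::function<void()> m_checkin;
};

// Stand-in for a read routine that may throw part-way through. `checked_in`
// plays the role of the adaptor's bookkeeping.
void read_section(bool should_throw, bool& checked_in) {
    checked_in = false;
    CheckinGuard guard{[&checked_in] { checked_in = true; }};
    if (should_throw) {
        throw std::runtime_error("corrupt section");
    }
    // Normal path: the guard still performs the check-in on return.
}
```

Whether the read succeeds or throws, the check-in runs exactly once, so no explicit cleanup call is needed on each exit path.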

components/core/src/clp_s/PackedStreamReader.cpp (1)

Line range hint 121-126: Fix incorrect assignment from std::unique_ptr<char[]> to std::shared_ptr<char[]>

The code assigns the result of std::make_unique<char[]> to a std::shared_ptr<char[]>. This relies on the converting assignment from an rvalue std::unique_ptr, which std::shared_ptr supports for array types only from C++17 onward.

The following change constructs the std::shared_ptr<char[]> directly:

-    buf = std::make_unique<char[]>(uncompressed_size);
+    buf = std::shared_ptr<char[]>(new char[uncompressed_size]);

Alternatively, consider using std::vector<char> for buffer management if appropriate.
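
For comparison, the buffer-ownership options mentioned in this comment can be sketched side by side. This assumes a C++17 compiler, where std::shared_ptr<char[]> can also take ownership from an rvalue std::unique_ptr<char[]>; the function names are hypothetical and do not appear in PackedStreamReader.cpp.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// C++17: shared_ptr<char[]> takes ownership from an rvalue unique_ptr<char[]>
// and keeps the array deleter. The buffer is value-initialized (zeroed).
std::shared_ptr<char[]> make_buffer_from_unique(std::size_t size) {
    std::shared_ptr<char[]> buf;
    buf = std::make_unique<char[]>(size);
    return buf;
}

// The form proposed in the review: construct the shared_ptr directly.
// Note that `new char[size]` leaves the contents uninitialized.
std::shared_ptr<char[]> make_buffer_from_new(std::size_t size) {
    return std::shared_ptr<char[]>(new char[size]);
}

// The std::vector alternative: owns its storage and tracks its own size.
std::vector<char> make_vector_buffer(std::size_t size) {
    return std::vector<char>(size);
}
```

The std::vector form is only a drop-in replacement if the buffer does not need shared ownership.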

🧹 Nitpick comments (12)
components/core/src/clp_s/PackedStreamReader.hpp (1)

47-49: Consider passing adaptor by const reference to improve efficiency

In the open_packed_streams method, consider passing the adaptor parameter as a const std::shared_ptr<ArchiveReaderAdaptor>& instead of by value. This avoids unnecessary copying and reference count modifications, which can improve performance.

Suggested change:

-void open_packed_streams(std::shared_ptr<ArchiveReaderAdaptor> adaptor);
+void open_packed_streams(const std::shared_ptr<ArchiveReaderAdaptor>& adaptor);
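
The effect on the reference count can be observed directly. The sketch below uses a hypothetical empty `Adaptor` struct as a stand-in for ArchiveReaderAdaptor, and the two free functions are illustrative only.

```cpp
#include <memory>

struct Adaptor {};  // Hypothetical stand-in for ArchiveReaderAdaptor

// By value: the parameter is a copy, so the reference count is incremented
// for the duration of the call.
long use_count_by_value(std::shared_ptr<Adaptor> adaptor) {
    return adaptor.use_count();
}

// By const reference: no copy is made, so the count is unchanged.
long use_count_by_const_ref(std::shared_ptr<Adaptor> const& adaptor) {
    return adaptor.use_count();
}
```

When the callee stores the pointer in a member, as open_packed_streams appears to do with m_adaptor, taking the parameter by value and moving it into the member is an equally common idiom, so the saving here is likely to be small.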
components/core/src/clp_s/ArchiveWriter.cpp (2)

59-73: Improve clarity of offset calculation

The current implementation temporarily uses the o field to store the size before calculating offsets. This dual use of the field is not immediately clear and could confuse future maintainers.

Consider using a more explicit approach:

     std::vector<ArchiveFileInfo> files{
             {constants::cArchiveSchemaTreeFile, schema_tree_compressed_size},
             {constants::cArchiveSchemaMapFile, schema_map_compressed_size},
             {constants::cArchiveTableMetadataFile, table_metadata_compressed_size},
             {constants::cArchiveVarDictFile, var_dict_compressed_size},
             {constants::cArchiveLogDictFile, log_dict_compressed_size},
             {constants::cArchiveArrayDictFile, array_dict_compressed_size},
             {constants::cArchiveTablesFile, table_compressed_size}
     };
     uint64_t offset = 0;
     for (auto& file : files) {
-        uint64_t original_size = file.o;
-        file.o = offset;
-        offset += original_size;
+        uint64_t const file_size = file.o;  // Store the size from the initialization
+        file.o = offset;                    // Set the file's offset
+        offset += file_size;                // Increment offset by file size
     }
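
An alternative that avoids reusing one field for two meanings is to keep size and offset as separate members and compute offsets as a running sum. This is a sketch only: `FileEntry` and `assign_offsets` are hypothetical and do not reflect the actual layout of ArchiveFileInfo.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical entry with distinct size and offset fields.
struct FileEntry {
    std::string name;
    uint64_t size;
    uint64_t offset{0};
};

// Assigns each entry the sum of the sizes of all entries before it and
// returns the total size of all entries.
uint64_t assign_offsets(std::vector<FileEntry>& files) {
    uint64_t offset{0};
    for (auto& file : files) {
        file.offset = offset;
        offset += file.size;
    }
    return offset;
}
```

For entries of sizes 5, 7, and 1, the assigned offsets are 0, 5, and 12, and the returned total is 13.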

86-89: Improve readability of compressed size calculation

The calculation of m_compressed_size spans multiple lines and includes many components, making it difficult to verify that all necessary components are included.

Consider using a more structured approach:

-        m_compressed_size
-                = var_dict_compressed_size + log_dict_compressed_size + array_dict_compressed_size
-                  + metadata_size + schema_tree_compressed_size + schema_map_compressed_size
-                  + table_metadata_compressed_size + table_compressed_size + sizeof(ArchiveHeader);
+        // Calculate total compressed size from all components
+        m_compressed_size = sizeof(ArchiveHeader)  // Header size
+                           // Dictionary sizes
+                           + var_dict_compressed_size
+                           + log_dict_compressed_size
+                           + array_dict_compressed_size
+                           // Schema sizes
+                           + schema_tree_compressed_size
+                           + schema_map_compressed_size
+                           // Table sizes
+                           + table_metadata_compressed_size
+                           + table_compressed_size
+                           // Metadata size
+                           + metadata_size;
components/core/src/clp_s/TimestampDictionaryReader.cpp (1)

41-41: Use consistent data types: change loop index to uint64_t

The loop index i in for (int i = 0; i < num_patterns; ++i) is declared as an int, while num_patterns is of type uint64_t. To avoid potential issues with signed-unsigned comparisons and to maintain consistency, consider declaring i as a uint64_t.

components/core/src/clp_s/TimestampDictionaryReader.hpp (1)

26-30: Enhance method documentation for read

To improve clarity and maintainability, consider expanding the documentation for the read method. Including details such as any exceptions that may be thrown and elaborating on the method's behaviour will be beneficial.

components/core/src/clp_s/DictionaryReader.hpp (1)

42-44: Enhance method documentation for read_entries

Consider providing detailed documentation for the read_entries method, including descriptions of the parameters like lazy, expected behaviour, and any exceptions that might be thrown. This will enhance code readability and ease future maintenance.

components/core/src/clp_s/ReaderUtils.cpp (2)

67-70: Consistent use of smart pointers for resource management

In the get_variable_dictionary_reader method, ensure consistent and safe resource management by verifying that the VariableDictionaryReader object correctly handles resource acquisition and release, especially in exceptional circumstances.


75-78: Consistent use of smart pointers for resource management

Similarly, in the get_log_type_dictionary_reader method, confirm that the LogTypeDictionaryReader object properly manages resources to prevent leaks or dangling pointers.

components/core/src/clp_s/ArchiveReaderAdaptor.hpp (1)

31-33: Add move constructor/assignment operator declarations.

The class manages resources (shared_ptr, unique_ptr) but doesn't declare move operations. Consider adding them to support efficient resource transfer.

Add the following declarations:

     explicit ArchiveReaderAdaptor(Path const& archive_path, NetworkAuthOption const& network_auth);
 
     ~ArchiveReaderAdaptor();
+
+    // Move constructor/assignment
+    ArchiveReaderAdaptor(ArchiveReaderAdaptor&&) = default;
+    ArchiveReaderAdaptor& operator=(ArchiveReaderAdaptor&&) = default;
+
+    // Explicitly delete copy operations
+    ArchiveReaderAdaptor(const ArchiveReaderAdaptor&) = delete;
+    ArchiveReaderAdaptor& operator=(const ArchiveReaderAdaptor&) = delete;
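
The reason the declarations matter is that a user-declared destructor suppresses the implicitly generated move operations. The sketch below checks this at compile time with two hypothetical classes, `WithoutMoves` and `WithMoves`; neither is the real ArchiveReaderAdaptor.

```cpp
#include <memory>
#include <type_traits>

// User-declared destructor and no move declarations: the implicit move
// operations are not generated, and the implicit copy operations are deleted
// because of the unique_ptr member, so the class cannot be moved at all.
class WithoutMoves {
public:
    WithoutMoves() = default;
    ~WithoutMoves() {}

private:
    std::unique_ptr<int> m_reader;
};

// Same class with the move operations defaulted and copies deleted explicitly.
class WithMoves {
public:
    WithMoves() = default;
    ~WithMoves() {}

    WithMoves(WithMoves&&) = default;
    WithMoves& operator=(WithMoves&&) = default;

    WithMoves(WithMoves const&) = delete;
    WithMoves& operator=(WithMoves const&) = delete;

private:
    std::unique_ptr<int> m_reader;
};

static_assert(false == std::is_move_constructible<WithoutMoves>::value, "no implicit move");
static_assert(std::is_move_constructible<WithMoves>::value, "explicitly defaulted move");
static_assert(false == std::is_copy_constructible<WithMoves>::value, "copy is deleted");
```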
components/core/src/clp_s/ZstdDecompressor.hpp (1)

46-46: Rename parameter for clarity.

The parameter name file_read_buffer_capacity suggests file-specific usage, but this method handles generic reader interfaces.

Apply this diff to improve clarity:

-    void open(clp::ReaderInterface& reader, size_t file_read_buffer_capacity) override;
+    void open(clp::ReaderInterface& reader, size_t read_buffer_capacity) override;
components/core/src/clp_s/JsonConstructor.cpp (1)

46-46: LGTM! Consider adding documentation.

The addition of open_packed_streams() follows a logical sequence in the initialization process. Consider adding a comment explaining the initialization order requirements.

+    // Open packed streams after reading dictionaries and metadata
     m_archive_reader->open_packed_streams();
components/core/src/clp_s/search/Output.cpp (1)

68-69: LGTM! Consider adding error handling.

The addition of open_packed_streams() is well-placed in the execution flow. Consider adding error handling for potential stream opening failures.

-    m_archive_reader->open_packed_streams();
+    try {
+        m_archive_reader->open_packed_streams();
+    } catch (const std::exception& e) {
+        SPDLOG_ERROR("Failed to open packed streams: {}", e.what());
+        return false;
+    }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5d3b671 and ae419b0.

📒 Files selected for processing (23)
  • components/core/CMakeLists.txt (1 hunks)
  • components/core/src/clp_s/ArchiveReader.cpp (6 hunks)
  • components/core/src/clp_s/ArchiveReader.hpp (6 hunks)
  • components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1 hunks)
  • components/core/src/clp_s/ArchiveReaderAdaptor.hpp (1 hunks)
  • components/core/src/clp_s/ArchiveWriter.cpp (2 hunks)
  • components/core/src/clp_s/ArchiveWriter.hpp (1 hunks)
  • components/core/src/clp_s/CMakeLists.txt (1 hunks)
  • components/core/src/clp_s/Decompressor.hpp (2 hunks)
  • components/core/src/clp_s/DictionaryReader.hpp (5 hunks)
  • components/core/src/clp_s/JsonConstructor.cpp (1 hunks)
  • components/core/src/clp_s/PackedStreamReader.cpp (4 hunks)
  • components/core/src/clp_s/PackedStreamReader.hpp (3 hunks)
  • components/core/src/clp_s/ReaderUtils.cpp (4 hunks)
  • components/core/src/clp_s/ReaderUtils.hpp (2 hunks)
  • components/core/src/clp_s/TimestampDictionaryReader.cpp (2 hunks)
  • components/core/src/clp_s/TimestampDictionaryReader.hpp (1 hunks)
  • components/core/src/clp_s/Utils.cpp (1 hunks)
  • components/core/src/clp_s/ZstdDecompressor.cpp (5 hunks)
  • components/core/src/clp_s/ZstdDecompressor.hpp (4 hunks)
  • components/core/src/clp_s/archive_constants.hpp (1 hunks)
  • components/core/src/clp_s/clp-s.cpp (1 hunks)
  • components/core/src/clp_s/search/Output.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (21)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

Applied to each of the following files (1 instruction each):

  • components/core/src/clp_s/Decompressor.hpp
  • components/core/src/clp_s/search/Output.cpp
  • components/core/src/clp_s/archive_constants.hpp
  • components/core/src/clp_s/TimestampDictionaryReader.hpp
  • components/core/src/clp_s/Utils.cpp
  • components/core/src/clp_s/JsonConstructor.cpp
  • components/core/src/clp_s/ZstdDecompressor.hpp
  • components/core/src/clp_s/clp-s.cpp
  • components/core/src/clp_s/PackedStreamReader.cpp
  • components/core/src/clp_s/ArchiveReader.cpp
  • components/core/src/clp_s/ArchiveWriter.hpp
  • components/core/src/clp_s/ArchiveWriter.cpp
  • components/core/src/clp_s/TimestampDictionaryReader.cpp
  • components/core/src/clp_s/ZstdDecompressor.cpp
  • components/core/src/clp_s/ArchiveReader.hpp
  • components/core/src/clp_s/PackedStreamReader.hpp
  • components/core/src/clp_s/ReaderUtils.cpp
  • components/core/src/clp_s/DictionaryReader.hpp
  • components/core/src/clp_s/ReaderUtils.hpp
  • components/core/src/clp_s/ArchiveReaderAdaptor.hpp
  • components/core/src/clp_s/ArchiveReaderAdaptor.cpp
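
A short illustration of the false == <expression> style named in the pattern, using hypothetical helper functions that are not part of the codebase:

```cpp
#include <string>

// True when the string contains only spaces, tabs, or newlines (or is empty).
bool is_blank(std::string const& text) {
    return std::string::npos == text.find_first_not_of(" \t\n");
}

// Negation written as `false == <expression>` rather than `!<expression>`.
std::string describe(std::string const& text) {
    if (false == is_blank(text)) {
        return "has content";
    }
    return "blank";
}
```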

⏰ Context from checks skipped due to timeout of 90000ms (12)
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: ubuntu-focal-static-linked-bins
  • GitHub Check: ubuntu-focal-dynamic-linked-bins
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: build-macos (macos-13, false)
  • GitHub Check: lint-check (ubuntu-latest)
  • GitHub Check: build-macos (macos-13, true)
  • GitHub Check: build (macos-latest)
  • GitHub Check: lint-check (macos-latest)
🔇 Additional comments (25)
components/core/src/clp_s/clp-s.cpp (1)

147-147: LGTM! Method name change aligns with the new architecture.

The change from read_timestamp_dictionary() to get_timestamp_dictionary() reflects the architectural shift to use the ArchiveReaderAdaptor for handling both single and multi-file archives.

Let's verify the implementation of the new method:

✅ Verification successful

Implementation verified: get_timestamp_dictionary() correctly supports the new architecture

The method is properly implemented in ArchiveReaderAdaptor and correctly delegated through ArchiveReader, with consistent usage across the codebase for timestamp operations.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the implementation of get_timestamp_dictionary()
# Expected: The method should be defined in ArchiveReaderAdaptor and should handle both single and multi-file archives

# Search for the method definition
ast-grep --pattern 'get_timestamp_dictionary() {
  $$$
}'

# Search for references to ensure consistent usage
rg -A 2 'get_timestamp_dictionary'

Length of output: 2331

components/core/src/clp_s/PackedStreamReader.hpp (3)

9-10: Includes necessary headers for new functionality

The addition of ReaderInterface.hpp and ArchiveReaderAdaptor.hpp is appropriate and ensures that the necessary interfaces are available for the updated implementation.


89-90: Proper use of smart pointers for resource management

The introduction of m_adaptor as a std::shared_ptr and updating m_packed_stream_reader to a std::unique_ptr align with best practices for smart pointer usage, ensuring appropriate ownership and lifetime management of resources.


93-93: Initialization of m_begin_offset ensures defined behaviour

Initializing m_begin_offset to zero ensures that it has a well-defined starting value, preventing potential issues with uninitialized variables.

components/core/src/clp_s/archive_constants.hpp (1)

10-12: LGTM! The new header constant follows consistent naming and organization.

The addition of cArchiveHeaderFile and its placement in a dedicated "Header and metadata section" maintains good code organization. The constant's value follows the established pattern of other archive path constants.

components/core/src/clp_s/Utils.cpp (2)

Line range hint 79-87: LGTM! The updated file check aligns with the new archive format.

The replacement of cArchiveTimestampDictFile with cArchiveHeaderFile in the condition maintains the correct detection of multi-file archives while adapting to the new archive format. The condition follows the coding guideline of placing the constant on the left side of the equality operator.


Line range hint 79-87: Verify comprehensive testing of archive operations.

Since this change affects archive format detection, please ensure:

  1. All archive operations (read, write, search) are tested with both single and multi-file archives
  2. Migration paths exist for existing archives using the old timestamp dictionary format

Let's verify the usage of the old timestamp dictionary format:

components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1)

31-35: Verify the conditional logic for setting m_single_file_archive

The condition if (InputSource::Filesystem != archive_path.source || std::filesystem::is_regular_file(archive_path.path)) will set m_single_file_archive to true if either the input source is not the filesystem or the path is a regular file. Please verify that this logic correctly identifies single-file archives in all intended scenarios.
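
To make the two branches of that condition explicit, the check can be written as a standalone predicate. This is a sketch with assumptions: the `InputSource` enum and `Path` struct below are simplified stand-ins (the `Network` enumerator in particular is hypothetical), and only the logic quoted above is reproduced.

```cpp
#include <filesystem>
#include <string>
#include <system_error>

// Simplified stand-ins for the types referenced in the condition.
enum class InputSource { Filesystem, Network };

struct Path {
    InputSource source;
    std::string path;
};

// Any non-filesystem source is treated as a single-file archive; a filesystem
// path counts as one only if it names a regular file.
bool is_single_file_archive(Path const& archive_path) {
    if (InputSource::Filesystem != archive_path.source) {
        return true;
    }
    std::error_code ec;  // Non-throwing overload; errors yield false
    return std::filesystem::is_regular_file(archive_path.path, ec);
}
```

Under this reading, a local directory is classified as a multi-file archive, and so is a local path that does not exist, which may be worth handling separately.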

components/core/src/clp_s/ArchiveReader.cpp (1)

Line range hint 1-360: Code changes look good

The integration of ArchiveReaderAdaptor enhances the modularity and maintainability of the archive reading process. The updates are well-structured and align with the project architecture.

components/core/src/clp_s/Decompressor.hpp (1)

54-59: LGTM! Well-documented interface extension.

The new virtual method enhances the Decompressor interface to support reading from a clp::ReaderInterface, which aligns well with the PR's objective of supporting single file archives.

components/core/src/clp_s/ReaderUtils.hpp (1)

30-34: LGTM! Documentation updates are clear and consistent.

The documentation has been updated to accurately reflect the switch from string paths to ArchiveReaderAdaptor references.

Also applies to: 37-41, 44-49, 53-58, 62-67

components/core/src/clp_s/ArchiveReaderAdaptor.hpp (2)

17-30: LGTM! Well-designed class with proper error handling.

The class follows SOLID principles and includes thorough error handling through a custom exception class.


43-59: LGTM! Robust checkout/checkin pattern.

The checkout/checkin pattern effectively prevents concurrent access to sections and enforces sequential access to prevent backwards seeks.

components/core/src/clp_s/ZstdDecompressor.hpp (1)

115-116: LGTM! Consistent extension for ClpReader support.

The InputType enum and member variable additions properly support the new reader interface capability.

Also applies to: 134-134

components/core/src/clp_s/ArchiveReader.hpp (4)

45-49: LGTM! Documentation follows standard format.

The new method declaration is well-documented and follows the established documentation pattern in the codebase.


113-113: LGTM! Delegation to adaptor improves modularity.

The change to delegate timestamp dictionary access to the adaptor improves separation of concerns and aligns with the adaptor pattern.


201-201: LGTM! Member variable follows naming convention.

The new member variable follows the established m_ prefix naming convention.


56-56: Verify the impact of changing dictionary read behaviour.

The change from read_new_entries to read_entries could affect the dictionary loading behaviour. Please ensure this change is intentional and doesn't impact existing functionality.

Also applies to: 66-66, 76-76

✅ Verification successful

The change is intentional and correct as part of the CLP_S implementation

This change is part of a new implementation in the clp_s directory, which is separate from the existing CLP/GLT implementations. The new read_entries method includes enhanced functionality with optional lazy loading support. The existing implementations using read_new_entries remain unchanged and unaffected.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any other uses of read_new_entries to ensure all occurrences are updated
rg "read_new_entries" -A 2

Length of output: 2561


Script:

#!/bin/bash
# Search for read_entries implementation and usage
ast-grep --pattern 'read_entries($_) { $$$ }'

# Also search with ripgrep for any other occurrences
rg "read_entries" -A 2

Length of output: 1539

components/core/src/clp_s/ArchiveWriter.hpp (1)

183-183: LGTM! Documentation update improves clarity.

The documentation change appropriately generalizes the method's purpose to support both single and multi-file archives.

components/core/src/clp_s/CMakeLists.txt (1)

91-92: LGTM! New source files are properly integrated.

The new ArchiveReaderAdaptor files are correctly added to the CLP_S_SOURCES set, maintaining alphabetical order.

components/core/CMakeLists.txt (1)

264-265: LGTM! New source files are properly integrated into the test suite.

The ArchiveReaderAdaptor files are correctly added to SOURCE_FILES_clp_s_unitTest, ensuring they'll be included in the unit test build.

Let's verify the test coverage for the new ArchiveReaderAdaptor:

components/core/src/clp_s/PackedStreamReader.cpp (4)

3-5: Appropriate inclusion of new headers

The added header files BoundedReader.hpp, archive_constants.hpp, and ArchiveReaderAdaptor.hpp are necessary for the new functionality and are correctly included.


44-59: open_packed_streams method updated appropriately

The open_packed_streams method correctly updates the state, initializes m_adaptor, and handles errors appropriately when checking the position. The use of ArchiveReaderAdaptor enhances the modularity and maintainability of the code.


63-78: Proper resource management in close method

The close method effectively uses the needs_checkin flag to determine if resources need to be checked in. The state transitions and resetting of member variables are correctly handled to ensure proper cleanup.


Line range hint 94-114: Correct state management and error handling in read_stream

The read_stream method appropriately manages state transitions, validates stream IDs, calculates adjusted file offsets, and handles errors properly. The integration with BoundedReader and usage of the decompressor are correctly implemented.

components/core/src/clp_s/ArchiveWriter.cpp
Comment on lines 19 to 22
entry.try_read_from_file(decompressor);
if (false == StringUtils::tokenize_column_descriptor(entry.get_key_name(), tokens)) {
throw OperationFailed(ErrorCodeCorrupt, __FILENAME__, __LINE__);
}
Contributor

🛠️ Refactor suggestion

Ensure consistent error handling: replace exception with error code return

In the read method, error handling has shifted to returning ErrorCode instead of throwing exceptions. However, the call to StringUtils::tokenize_column_descriptor still throws an exception upon failure. For consistency and to maintain uniform error handling, consider returning an appropriate ErrorCode instead of throwing an exception.

components/core/src/clp_s/TimestampDictionaryReader.cpp Outdated
Comment on lines +101 to +104
// TODO: attempt to translate clp error codes
return ErrorCodeFailure;
}
}
Contributor

🛠️ Refactor suggestion

Implement error code translation for CLP errors

The TODO comment indicates that CLP error codes are not currently translated. Instead of returning a generic ErrorCodeFailure, consider mapping the CLP error codes to the appropriate internal error codes to provide more informative error handling.

Do you want assistance in implementing the error code translation or should I open a GitHub issue to track this task?
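
A minimal sketch of what such a translation could look like. The two enums below are stand-ins that list only the codes mentioned in this thread, plus one placeholder (`ErrorCode_Other`) for "anything else"; the real `clp::ErrorCode` and clp-s `ErrorCode` enumerations are larger, and `translate_clp_error_code` is a hypothetical name, not a function from the PR.

```cpp
#include <cassert>

// Stand-in for clp::ErrorCode. Only the members named in this review are
// listed; ErrorCode_Other is a placeholder for every code not handled below.
namespace sketch_clp {
enum ErrorCode {
    ErrorCode_Success = 0,
    ErrorCode_EndOfFile,
    ErrorCode_errno,
    ErrorCode_Other
};
}  // namespace sketch_clp

// Stand-in for the clp-s ErrorCode enum (again, only codes seen in this review).
namespace sketch_clp_s {
enum ErrorCode {
    ErrorCodeSuccess = 0,
    ErrorCodeEndOfFile,
    ErrorCodeErrno,
    ErrorCodeFailure
};

// Hypothetical helper: maps each known CLP code to its clp-s counterpart and
// falls back to the generic failure code for anything unrecognized.
inline ErrorCode translate_clp_error_code(sketch_clp::ErrorCode rc) {
    switch (rc) {
        case sketch_clp::ErrorCode_Success:
            return ErrorCodeSuccess;
        case sketch_clp::ErrorCode_EndOfFile:
            return ErrorCodeEndOfFile;
        case sketch_clp::ErrorCode_errno:
            return ErrorCodeErrno;
        default:
            return ErrorCodeFailure;
    }
}
}  // namespace sketch_clp_s
```

An explicit switch avoids relying on the two enums sharing numeric values, which a `static_cast` between them would silently assume.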

components/core/src/clp_s/ReaderUtils.hpp
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
components/core/src/clp_s/TimestampDictionaryReader.cpp (2)

19-21: ⚠️ Potential issue

Ensure consistent error handling approach.

The method mixes error handling approaches by throwing exceptions in these sections while using error codes elsewhere. Consider maintaining consistency by returning error codes instead of throwing exceptions.

-        if (auto rc = entry.try_read_from_file(decompressor); ErrorCodeSuccess != rc) {
-            throw OperationFailed(rc, __FILENAME__, __LINE__);
+        if (auto rc = entry.try_read_from_file(decompressor); ErrorCodeSuccess != rc) {
+            return rc;
         }

-        if (false == StringUtils::tokenize_column_descriptor(entry.get_key_name(), tokens)) {
-            throw OperationFailed(ErrorCodeCorrupt, __FILENAME__, __LINE__);
+        if (false == StringUtils::tokenize_column_descriptor(entry.get_key_name(), tokens)) {
+            return ErrorCodeCorrupt;
         }

Also applies to: 22-25


41-58: ⚠️ Potential issue

Add bounds checking for numeric inputs.

The pattern reading section should validate num_patterns and pattern_len to prevent potential memory exhaustion or buffer overflow attacks from malformed input.

     error = decompressor.try_read_numeric_value<uint64_t>(num_patterns);
     if (ErrorCodeSuccess != error) {
         return error;
     }
+    if (num_patterns > MAX_PATTERNS) {
+        return ErrorCodeInvalidInput;
+    }
     for (int i = 0; i < num_patterns; ++i) {
         uint64_t id, pattern_len;
         std::string pattern;
         error = decompressor.try_read_numeric_value<uint64_t>(id);
         if (ErrorCodeSuccess != error) {
             return error;
         }
         error = decompressor.try_read_numeric_value<uint64_t>(pattern_len);
         if (ErrorCodeSuccess != error) {
             return error;
         }
+        if (pattern_len > MAX_PATTERN_LENGTH) {
+            return ErrorCodeInvalidInput;
+        }
         error = decompressor.try_read_string(pattern_len, pattern);
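
The same idea, sketched outside the decompressor: validate a length prefix before allocating or reading the payload. Everything here is illustrative. `cMaxPatternLength` is an arbitrary cap chosen for the example and `read_length_prefixed` is a hypothetical helper working on an in-memory buffer; neither exists in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical upper bound; the PR does not define such a constant.
constexpr uint64_t cMaxPatternLength = 4096;

// Reads a native-endian uint64_t length prefix at `pos`, then that many bytes.
// Returns std::nullopt (leaving `pos` untouched) if the prefix is truncated,
// the length exceeds the cap, or the payload runs past the end of the buffer.
std::optional<std::string> read_length_prefixed(std::string_view buf, size_t& pos) {
    uint64_t len = 0;
    if (pos > buf.size() || buf.size() - pos < sizeof(len)) {
        return std::nullopt;
    }
    std::memcpy(&len, buf.data() + pos, sizeof(len));

    // Reject oversized lengths before any allocation happens.
    if (len > cMaxPatternLength) {
        return std::nullopt;
    }

    size_t const payload_pos = pos + sizeof(len);
    if (buf.size() - payload_pos < len) {
        return std::nullopt;
    }

    pos = payload_pos + static_cast<size_t>(len);
    return std::string{buf.substr(payload_pos, static_cast<size_t>(len))};
}
```

Checking the length before the read bounds the allocation that a corrupt or hostile archive can trigger.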
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bc834a6 and af0b784.

📒 Files selected for processing (1)
  • components/core/src/clp_s/TimestampDictionaryReader.cpp (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
components/core/src/clp_s/TimestampDictionaryReader.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

⏰ Context from checks skipped due to timeout of 90000ms (12)
  • GitHub Check: ubuntu-focal-static-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-focal-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: lint-check (ubuntu-latest)
  • GitHub Check: build-macos (macos-13, false)
  • GitHub Check: lint-check (macos-latest)
  • GitHub Check: build-macos (macos-13, true)
  • GitHub Check: build (macos-latest)
🔇 Additional comments (1)
components/core/src/clp_s/TimestampDictionaryReader.cpp (1)

8-14: LGTM! Well-structured error handling pattern.

The new method signature and initial read operation demonstrate good practices with consistent error handling and clear variable naming.

     if (ErrorCodeSuccess != error) {
-        throw OperationFailed(error, __FILENAME__, __LINE__);
+        return error;
     }
for (int i = 0; i < num_patterns; ++i) {
Contributor

⚠️ Potential issue

Fix type mismatch in loop counter.

The loop counter 'i' is declared as int while num_patterns is uint64_t. This could cause overflow issues if num_patterns exceeds INT_MAX.

-    for (int i = 0; i < num_patterns; ++i) {
+    for (uint64_t i = 0; i < num_patterns; ++i) {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    for (int i = 0; i < num_patterns; ++i) {
+    for (uint64_t i = 0; i < num_patterns; ++i) {

Contributor

@wraymo wraymo left a comment

Thanks for the PR! Here are some initial comments.

components/core/src/clp_s/ArchiveReaderAdaptor.hpp Outdated
Comment on lines 68 to 80
ErrorCode try_read_archive_file_info(ZstdDecompressor& decompressor, size_t size);

ErrorCode try_read_timestamp_dictionary(ZstdDecompressor& decompressor, size_t size);

ErrorCode try_read_archive_info(ZstdDecompressor& decompressor, size_t size);

std::shared_ptr<clp::ReaderInterface> try_create_reader_at_header();

std::unique_ptr<clp::ReaderInterface> checkout_reader_for_sfa_section(std::string_view section);

ErrorCode try_read_header(clp::ReaderInterface& reader);

ErrorCode try_read_archive_metadata(ZstdDecompressor& reader);
Contributor

Should we add descriptions for these methods?

components/core/src/clp_s/ArchiveReaderAdaptor.hpp Outdated
components/core/src/clp_s/ArchiveReaderAdaptor.hpp Outdated
m_current_reader_holder.reset();
}

std::shared_ptr<clp::ReaderInterface> ArchiveReaderAdaptor::try_create_reader_at_header() {
Contributor

Can we move this method before ArchiveReaderAdaptor::checkout...

@@ -162,6 +186,28 @@ void ZstdDecompressor::open(FileReader& file_reader, size_t file_read_buffer_cap
reset_stream();
}

void ZstdDecompressor::open(clp::ReaderInterface& reader, size_t file_read_buffer_capacity) {
Contributor

Maybe in the future FileReader could inherit from clp::ReaderInterface, so that we wouldn't need two methods and two types.

Contributor Author

Yeah, will probably want to refactor this in the future + converge on one set of readers for clp and clp-s. I could put up a github issue to delete clp_s::FileReader as a near-term step if you want? I don't think we have too many uses of it left in our codebase, and it should be possible to switch the ones we do have over to clp::FileReader.

@@ -239,9 +287,22 @@ ErrorCode ZstdDecompressor::open(std::string const& compressed_file_path) {

 void ZstdDecompressor::reset_stream() {
     if (InputType::File == m_input_type) {
-        m_file_reader->seek_from_begin(m_file_reader_initial_pos);
+        auto rc = m_file_reader->try_seek_from_begin(m_file_reader_initial_pos);
Contributor

Can we do something like

Suggested change
-auto rc = m_file_reader->try_seek_from_begin(m_file_reader_initial_pos);
+if (auto rc = m_file_reader->try_seek_from_begin(m_file_reader_initial_pos); ErrorCodeSuccess != rc && ErrorCodeEndOfFile != rc) {
+    throw OperationFailed(rc, __FILENAME__, __LINE__);
+}

@@ -1,5 +1,7 @@
#include "Utils.hpp"

#include <charconv>
#include <cstdint>
Contributor

Do we need this header file?

Contributor Author

Yes -- std::from_chars comes from charconv, and we need cstdint for the (u)int(X)_t types.
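
For reference, a small self-contained sketch of why both headers are needed: `std::from_chars` is declared in `<charconv>`, and the fixed-width `int64_t` comes from `<cstdint>`. The helper name `parse_int64` is made up for this example and is not the function in Utils.cpp.

```cpp
#include <cassert>
#include <charconv>      // std::from_chars
#include <cstdint>       // int64_t
#include <optional>
#include <string_view>
#include <system_error>  // std::errc

// Hypothetical helper: parses the whole of `text` as a base-10 int64_t.
// Returns std::nullopt on invalid input, trailing characters, or overflow.
std::optional<int64_t> parse_int64(std::string_view text) {
    int64_t value = 0;
    char const* const begin = text.data();
    char const* const end = begin + text.size();
    auto const [ptr, ec] = std::from_chars(begin, end, value);
    if (std::errc{} != ec || end != ptr) {
        return std::nullopt;
    }
    return value;
}
```

Requiring `ptr == end` rejects inputs such as `12x`, which `std::from_chars` would otherwise accept as `12` with unparsed trailing characters.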

* Load metadata for an archive including the header and metadata section. This method must be
* invoked before checking out any section of an archive, or calling `get_timestamp_dictionary`.
* @return ErrorCodeSuccess on success
* @return ErrorCode_errno on failure
Contributor

ErrorCodeErrno?

Contributor Author

Meant to imply that it can return any error. Maybe I can rephrase as "relevant error code on failure" or something?

Contributor Author

Rephrased as "relevant ErrorCode on failure."

Contributor

I think ErrorCodeSuccess is in clp-s, while ErrorCode_errno is in clp. "Relevant ErrorCode on failure" also seems good.

}

ArchiveReaderAdaptor::~ArchiveReaderAdaptor() {
m_reader.reset();
Contributor

Do we need this?

Contributor Author

We do not. I will get rid of it.

@gibber9809 gibber9809 requested a review from wraymo January 14, 2025 17:52
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (12)
components/core/src/clp_s/ArchiveReaderAdaptor.cpp (3)

31-35: Improve readability of single file archive condition.

The condition for determining if an archive is single file could be more explicit.

-    if (InputSource::Filesystem != archive_path.source
-        || std::filesystem::is_regular_file(archive_path.path))
-    {
-        m_single_file_archive = true;
-    }
+    m_single_file_archive = (InputSource::Filesystem != archive_path.source)
+                           || std::filesystem::is_regular_file(archive_path.path);

95-97: Define a constant for the expected segment count.

The magic number 1 for the expected number of segments should be defined as a named constant to improve code maintainability and readability.

+    static constexpr uint32_t cExpectedNumSegments = 1;
-    if (1 != m_archive_info.num_segments) {
+    if (cExpectedNumSegments != m_archive_info.num_segments) {
         return ErrorCodeUnsupported;
     }

222-226: Optimize lambda capture in find_if.

The lambda captures section by reference unnecessarily. Since section is a std::string_view, which is cheap to copy, it can be captured by value instead.

     auto it = std::find_if(
             m_archive_file_info.files.begin(),
             m_archive_file_info.files.end(),
-            [&](ArchiveFileInfo& info) { return info.n == section; }
+            [section](ArchiveFileInfo& info) { return info.n == section; }
     );
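
A standalone sketch of the by-value capture, assuming a simplified stand-in for `ArchiveFileInfo`. Only the name field `n` appears in the diff above; the offset field `o` and the helper `find_section` are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Simplified stand-in for ArchiveFileInfo: `n` is the section name (as in the
// diff above); `o` is an assumed offset field used only for illustration.
struct FileInfo {
    std::string n;
    uint64_t o;
};

// Hypothetical lookup: returns a pointer to the matching entry, or nullptr if
// no entry is named `section`.
FileInfo const* find_section(std::vector<FileInfo> const& files, std::string_view section) {
    // `section` is a cheap-to-copy view, so capture it by value.
    auto const it = std::find_if(
            files.begin(),
            files.end(),
            [section](FileInfo const& info) { return info.n == section; }
    );
    return files.end() == it ? nullptr : &*it;
}
```

Since the lambda never outlives the call, both capture modes behave the same here; capturing by value mainly makes the lambda's dependencies explicit.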
components/core/src/clp_s/ArchiveReaderAdaptor.hpp (7)

1-16: Consider adding file-level documentation.

Add a file-level documentation block explaining the purpose of this header file, its role in the archive reading infrastructure, and any important implementation notes.

+/**
+ * @file ArchiveReaderAdaptor.hpp
+ * @brief Adaptor class for reading single and multi-file archives from various sources
+ *
+ * This header defines the ArchiveReaderAdaptor class which provides a unified interface
+ * for reading archives regardless of their storage type (single/multi-file) or location
+ * (S3/local filesystem).
+ */
 #ifndef CLP_S_ARCHIVEREADERADAPTOR_HPP

18-22: Enhance class documentation.

While the current documentation is good, consider adding:

  • @brief tag for quick reference
  • @details section explaining the adaptor pattern usage
  • @note section about thread safety guarantees
 /**
+ * @brief Adaptor for unified access to different types of archives
+ *
  * ArchiveReaderAdaptor is an adaptor class which helps with reading single and multi-file archives
  * which exist on either S3 or a locally mounted file system.
+ *
+ * @details This class implements the adaptor pattern to provide a unified interface for reading
+ * archives, abstracting away the differences between single and multi-file archives, as well as
+ * their storage locations (S3 or local filesystem).
+ *
+ * @note This class is not thread-safe and should not be accessed concurrently from multiple threads.
  */

24-29: Document the exception class.

Add documentation for the OperationFailed exception class explaining when it's thrown and what the error codes mean.

+    /**
+     * @brief Exception thrown when archive operations fail
+     *
+     * This exception is thrown when operations such as checking out/in readers
+     * or accessing sections fail due to invalid state or I/O errors.
+     */
     class OperationFailed : public TraceableException {

31-31: Document the constructor.

Add documentation for the constructor explaining the parameters and any preconditions.

+    /**
+     * @brief Constructs an ArchiveReaderAdaptor
+     * @param archive_path Path to the archive file
+     * @param network_auth Network authentication options for remote archives
+     */
     explicit ArchiveReaderAdaptor(Path const& archive_path, NetworkAuthOption const& network_auth);

59-64: Document getter methods.

Add documentation for get_timestamp_dictionary() and get_header() methods explaining their purpose and any preconditions.

+    /**
+     * @brief Gets the timestamp dictionary reader
+     * @return Shared pointer to the timestamp dictionary reader
+     * @note load_archive_metadata() must be called before using this method
+     */
     std::shared_ptr<TimestampDictionaryReader> get_timestamp_dictionary() {
         return m_timestamp_dictionary;
     }

+    /**
+     * @brief Gets the archive header
+     * @return Constant reference to the archive header
+     * @note load_archive_metadata() must be called before using this method
+     */
     ArchiveHeader const& get_header() const { return m_archive_header; }

125-135: Document member variables and consider using polymorphism.

  1. Add documentation for member variables to improve maintainability.
  2. Consider using polymorphism instead of the boolean flag for handling different archive types.
+    /// Path to the archive file
     Path m_archive_path{};
+    /// Network authentication options for remote archives
     NetworkAuthOption m_network_auth{};
+    /// Flag indicating whether this is a single file archive
+    /// @note Consider using polymorphism instead of this flag
     bool m_single_file_archive{false};
+    /// Information about files in the archive
     ArchiveFileInfoPacket m_archive_file_info{};
+    /// Archive header information
     ArchiveHeader m_archive_header{};
+    /// General archive information
     ArchiveInfoPacket m_archive_info{};
+    /// Offset to the files section in the archive
     size_t m_files_section_offset{};
+    /// Name of the section currently being read, if any
     std::optional<std::string> m_current_reader_holder;
+    /// Reader for accessing timestamp information
     std::shared_ptr<TimestampDictionaryReader> m_timestamp_dictionary;
+    /// Current reader interface
     std::shared_ptr<clp::ReaderInterface> m_reader;

Consider refactoring to use polymorphism:

class ArchiveReaderInterface {
public:
    virtual ~ArchiveReaderInterface() = default;
    virtual ErrorCode load_archive_metadata() = 0;
    // ... other methods
};

class SingleFileArchiveReader : public ArchiveReaderInterface {
    // Implementation for single file archives
};

class MultiFileArchiveReader : public ArchiveReaderInterface {
    // Implementation for multi-file archives
};

// Factory function
std::unique_ptr<ArchiveReaderInterface> create_archive_reader(
    Path const& archive_path,
    NetworkAuthOption const& network_auth);

22-22: Consider adding move semantics support.

The class manages resources (readers, paths) but lacks move constructor and assignment operator. Consider adding them to support efficient resource transfer.

 class ArchiveReaderAdaptor {
 public:
+    // Rule of 5
+    ArchiveReaderAdaptor(ArchiveReaderAdaptor&&) noexcept = default;
+    ArchiveReaderAdaptor& operator=(ArchiveReaderAdaptor&&) noexcept = default;
+    ArchiveReaderAdaptor(const ArchiveReaderAdaptor&) = delete;
+    ArchiveReaderAdaptor& operator=(const ArchiveReaderAdaptor&) = delete;
components/core/src/clp_s/ZstdDecompressor.cpp (2)

189-209: Consider refactoring to reduce code duplication with FileReader open method.

The implementation is nearly identical to the open(FileReader&, size_t) method. Consider extracting the common logic into a private helper method to improve maintainability and reduce duplication.

-void ZstdDecompressor::open(clp::ReaderInterface& reader, size_t file_read_buffer_capacity) {
+template<typename ReaderType>
+void ZstdDecompressor::open_reader(ReaderType& reader, InputType type, size_t file_read_buffer_capacity) {
     if (InputType::NotInitialized != m_input_type) {
         throw OperationFailed(ErrorCodeNotReady, __FILENAME__, __LINE__);
     }
-    m_input_type = InputType::ClpReader;
+    m_input_type = type;

     m_reader = &reader;
     m_file_reader_initial_pos = m_reader->get_pos();

     // Avoid reallocating the internal buffer if this instance is being re-used with an
     // unchanged buffer size.
     if (file_read_buffer_capacity != m_file_read_buffer_capacity) {
         m_file_read_buffer_capacity = file_read_buffer_capacity;
         m_file_read_buffer = std::make_unique<char[]>(m_file_read_buffer_capacity);
     }
     m_file_read_buffer_length = 0;

     m_compressed_stream_block = {m_file_read_buffer.get(), m_file_read_buffer_length, 0};

     reset_stream();
 }
+
+void ZstdDecompressor::open(clp::ReaderInterface& reader, size_t file_read_buffer_capacity) {
+    open_reader(reader, InputType::ClpReader, file_read_buffer_capacity);
+}

290-306: Improve error handling consistency and code structure.

  1. Error handling differs between FileReader and ClpReader cases. Consider using the same pattern for both.
  2. Based on previous review comments, the seek condition could be simplified.
-    if (InputType::File == m_input_type) {
-        if (auto rc = m_file_reader->try_seek_from_begin(m_file_reader_initial_pos);
-            ErrorCodeSuccess != rc && ErrorCodeEndOfFile != rc)
-        {
-            throw OperationFailed(rc, __FILENAME__, __LINE__);
-        }
-        m_file_read_buffer_length = 0;
-        m_compressed_stream_block.size = m_file_read_buffer_length;
-    } else if (InputType::ClpReader == m_input_type) {
-        auto rc = m_reader->try_seek_from_begin(m_file_reader_initial_pos);
-        m_file_read_buffer_length = 0;
-        m_compressed_stream_block.size = m_file_read_buffer_length;
-        if (false
-            == (clp::ErrorCode::ErrorCode_Success == rc || clp::ErrorCode::ErrorCode_EndOfFile == rc
-            ))
-        {
-            throw OperationFailed(static_cast<ErrorCode>(rc), __FILENAME__, __LINE__);
-        }
+    if (InputType::File == m_input_type || InputType::ClpReader == m_input_type) {
+        auto* reader = (InputType::File == m_input_type) ? m_file_reader : m_reader;
+        auto rc = reader->try_seek_from_begin(m_file_reader_initial_pos);
+        if (auto success = (InputType::File == m_input_type) 
+                ? (ErrorCodeSuccess == rc || ErrorCodeEndOfFile == rc)
+                : (clp::ErrorCode::ErrorCode_Success == rc || clp::ErrorCode::ErrorCode_EndOfFile == rc);
+            !success)
+        {
+            throw OperationFailed(static_cast<ErrorCode>(rc), __FILENAME__, __LINE__);
+        }
+        m_file_read_buffer_length = 0;
+        m_compressed_stream_block.size = m_file_read_buffer_length;
     }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 85f4b5e and de9fb38.

📒 Files selected for processing (4)
  • components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1 hunks)
  • components/core/src/clp_s/ArchiveReaderAdaptor.hpp (1 hunks)
  • components/core/src/clp_s/TimestampDictionaryReader.cpp (2 hunks)
  • components/core/src/clp_s/ZstdDecompressor.cpp (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • components/core/src/clp_s/TimestampDictionaryReader.cpp
🧰 Additional context used
📓 Path-based instructions (3)
components/core/src/clp_s/ZstdDecompressor.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/ArchiveReaderAdaptor.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

⏰ Context from checks skipped due to timeout of 90000ms (12)
  • GitHub Check: ubuntu-focal-static-linked-bins
  • GitHub Check: ubuntu-focal-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: build-macos (macos-13, false)
  • GitHub Check: build-macos (macos-13, true)
  • GitHub Check: lint-check (ubuntu-latest)
  • GitHub Check: lint-check (macos-latest)
  • GitHub Check: build (macos-latest)
🔇 Additional comments (4)
components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1)

49-72: Resolve the FIXME and simplify deserialization.

The current implementation manually parses the MessagePack object due to the commented-out line m_archive_file_info = obj.as<clp_s::ArchiveFileInfoPacket>();. Addressing the FIXME by fixing the issue with obj.as<clp_s::ArchiveFileInfoPacket>() would simplify the code and improve maintainability.

components/core/src/clp_s/ZstdDecompressor.cpp (3)

86-109: Implement error code translation for CLP errors.

The TODO comment indicates that CLP error codes are not currently translated. Instead of returning a generic ErrorCodeFailure, consider mapping the CLP error codes to the appropriate internal error codes to provide more informative error handling.

Do you want assistance in implementing the error code translation or should I open a GitHub issue to track this task?


220-225: LGTM! Clean and consistent implementation.

The ClpReader case is handled appropriately, maintaining consistency with the existing pattern.


238-238: Consider future refactoring opportunity.

The condition has been correctly updated to include ClpReader. However, as discussed in previous comments, if FileReader becomes an inherited class of ReaderInterface in the future, this check could be simplified.

Let's check for any remaining uses of FileReader that could be migrated:

✅ Verification successful

Refactoring already implemented in most components

The condition in ZstdDecompressor.cpp is correct for the current architecture. Our verification shows that FileReader already inherits from ReaderInterface in both clp and glt namespaces. Only the clp_s::FileReader implementation remains to be refactored to follow this pattern.

  • components/core/src/clp_s/FileReader.hpp: Consider refactoring this implementation to inherit from ReaderInterface, matching the pattern used in other components.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for FileReader usage
rg -l "FileReader" | grep -v "ZstdDecompressor"

Length of output: 4204


Script:

#!/bin/bash
# Check FileReader class definitions and ReaderInterface
echo "=== FileReader class definitions ==="
ast-grep --pattern 'class FileReader$$$'

echo -e "\n=== ReaderInterface definitions ==="
ast-grep --pattern 'class ReaderInterface$$$'

Length of output: 260


Script:

#!/bin/bash
# Examine FileReader and ReaderInterface class definitions with context
echo "=== FileReader Headers ==="
rg -U "class FileReader" --type hpp -A 5

echo -e "\n=== ReaderInterface Headers ==="
rg -U "class ReaderInterface" --type hpp -A 5

Length of output: 288


Script:

#!/bin/bash
echo "=== FileReader in clp ==="
cat "components/core/src/clp/FileReader.hpp"

echo -e "\n=== FileReader in clp_s ==="
cat "components/core/src/clp_s/FileReader.hpp"

echo -e "\n=== FileReader in glt ==="
cat "components/core/src/glt/FileReader.hpp"

echo -e "\n=== ReaderInterface definitions ==="
cat "components/core/src/clp/ReaderInterface.hpp" "components/core/src/glt/ReaderInterface.hpp"

Length of output: 22712

components/core/src/clp_s/ArchiveReaderAdaptor.cpp Outdated
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (4)
components/core/src/clp_s/ArchiveReaderAdaptor.cpp (4)

70-72: Log exceptions when catching them to aid in debugging

In the catch block, the exception is caught but not logged. Logging the exception message can help with debugging and provide more context when errors occur.

Apply this diff to add logging:

        } catch (std::exception const& e) {
+            SPDLOG_ERROR("Exception while reading archive file info: {}", e.what());
            return ErrorCodeCorrupt;
        }

91-93: Log exceptions when deserializing archive info

The exception caught during deserialization of archive info is not logged. Logging the exception can assist in diagnosing issues during runtime.

Apply this diff to add logging:

    } catch (std::exception const& e) {
+        SPDLOG_ERROR("Exception while reading archive info: {}", e.what());
        return ErrorCodeCorrupt;
    }

191-193: Use std::filesystem::path for path concatenation in try_create_reader_at_header

Concatenating paths using string operations may result in invalid paths if path separators are not properly handled. It is recommended to use std::filesystem::path to safely concatenate paths.

Apply this diff to use std::filesystem::path:

            return std::make_shared<clp::FileReader>(
-                m_archive_path.path + constants::cArchiveHeaderFile
+                (std::filesystem::path(m_archive_path.path) / constants::cArchiveHeaderFile).string()
            );

214-215: Use std::filesystem::path for path concatenation to handle path separators correctly

Concatenating paths using string concatenation may result in invalid file paths if path separators are not correctly managed. It is recommended to use std::filesystem::path for path concatenation to ensure correctness and portability.

Apply this diff to use std::filesystem::path:

        return std::make_unique<clp::FileReader>(
-            m_archive_path.path + std::string{section}
+            (std::filesystem::path(m_archive_path.path) / section).string()
        );
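
One caveat worth checking before applying either path suggestion, sketched below: `operator/` and string concatenation differ in both directions. If the appended name has no leading separator, concatenation fuses the two components. If it does have one, `operator/` treats it as rooted and discards the left-hand side. This excerpt does not show whether `constants::cArchiveHeaderFile` or the section names begin with a separator, so the inputs below are made-up examples.

```cpp
#include <cassert>
#include <filesystem>
#include <string>

// Joins with std::filesystem::path::operator/ (the suggested approach).
std::string join_with_path(std::string const& dir, std::string const& name) {
    return (std::filesystem::path{dir} / name).generic_string();
}

// Joins with plain string concatenation (the approach in the lines being replaced).
std::string join_with_concat(std::string const& dir, std::string const& name) {
    return dir + name;
}
```

For a directory `/tmp/archive`, appending `header` gives `/tmp/archive/header` with `operator/` but `/tmp/archiveheader` with concatenation. Appending `/header` gives `/tmp/archive/header` with concatenation but just `/header` with `operator/`, because a right-hand operand with a root directory replaces the left-hand path.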
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between de9fb38 and eff2d33.

📒 Files selected for processing (1)
  • components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: ubuntu-focal-static-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: ubuntu-focal-dynamic-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: build-macos (macos-13, false)
  • GitHub Check: build-macos (macos-13, true)
🔇 Additional comments (2)
components/core/src/clp_s/ArchiveReaderAdaptor.cpp (2)

49-72: Address the FIXME and resolve the deserialization issue

The code is currently manually parsing the MessagePack object due to the commented-out line m_archive_file_info = obj.as<clp_s::ArchiveFileInfoPacket>();. Resolving the FIXME by fixing the deserialization issue would simplify the code and improve maintainability.


178-180: Handle unknown metadata packet types appropriately

The default case in the switch statement silently ignores unknown metadata packet types, which could lead to unnoticed issues. Consider logging a warning or returning an error when an unknown packet type is encountered.

Apply this diff to handle unknown packet types:

            default:
-                break;
+                SPDLOG_WARN("Unknown metadata packet type: {}", static_cast<int>(packet_type));
+                return ErrorCodeUnsupported;
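
A self-contained sketch of the same idea follows. The `PacketType` and `Status` enums and the `handle_packet` function are hypothetical stand-ins, not the PR's actual types, and `std::cerr` replaces the logging macro so the example builds without extra dependencies.

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical stand-ins for the PR's types; names and values are assumptions.
enum class PacketType : std::uint8_t {
    ArchiveInfo = 0,
    ArchiveFileInfo = 1
};

enum class Status {
    Success,
    Unsupported
};

// Rejects packet types the reader does not recognise instead of silently
// skipping them.
Status handle_packet(std::uint8_t raw_type) {
    switch (static_cast<PacketType>(raw_type)) {
        case PacketType::ArchiveInfo:
        case PacketType::ArchiveFileInfo:
            return Status::Success;
        default:
            std::cerr << "Unknown metadata packet type: " << static_cast<int>(raw_type)
                      << '\n';
            return Status::Unsupported;
    }
}
```

Whether an unknown packet should be rejected or skipped is a design decision: rejecting surfaces corruption early, while skipping lets older readers tolerate packets added by newer writers. The sketch shows only the rejecting variant.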

Comment on lines 122 to 128
    auto const clp_rc = m_reader->try_read_exact_length(
            reinterpret_cast<char*>(&m_archive_header),
            sizeof(m_archive_header)
    );
    if (clp::ErrorCode::ErrorCode_Success != clp_rc) {
        return ErrorCodeErrno;
    }
Contributor

⚠️ Potential issue

Use the passed reader object instead of m_reader in try_read_header

The function try_read_header should use the reader parameter provided to it rather than m_reader. Using m_reader may lead to unexpected behaviour if it is not properly initialised or if a different reader is intended.

Apply this diff to fix the issue:

-    auto const clp_rc = m_reader->try_read_exact_length(
+    auto const clp_rc = reader.try_read_exact_length(
            reinterpret_cast<char*>(&m_archive_header),
            sizeof(m_archive_header)
    );

Committable suggestion skipped: line range outside the PR's diff.

wraymo
wraymo previously approved these changes Jan 15, 2025
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (4)
components/core/src/clp_s/ArchiveReaderAdaptor.cpp (4)

95-98: Enhance error message for unsupported segment count.

When multiple segments are detected, provide more context in the error message to aid debugging.

     if (1 != m_archive_info.num_segments) {
+        SPDLOG_ERROR("Unsupported number of segments: {}", m_archive_info.num_segments);
         return ErrorCodeUnsupported;
     }

102-102: Improve naming of the decompressor buffer capacity constant.

The constant cDecompressorFileReadBufferCapacity could be more descriptive. Consider a name that indicates its purpose and unit, e.g., cMetadataDecompressorBufferSizeBytes.


130-138: Use constant-time comparison for magic number.

When comparing magic numbers or other security-sensitive data, use a constant-time comparison to prevent timing attacks.

-    if (0
-        != std::memcmp(
-                m_archive_header.magic_number,
-                cStructuredSFAMagicNumber,
-                sizeof(cStructuredSFAMagicNumber)
-        ))
+    if (false == constant_time_compare(
+            m_archive_header.magic_number,
+            cStructuredSFAMagicNumber,
+            sizeof(cStructuredSFAMagicNumber)))
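
The suggestion above calls `constant_time_compare` without defining it, and no such function is confirmed to exist in the code base. Below is a minimal sketch of what such a helper could look like, assuming a signature that mirrors `std::memcmp`. It is an illustration only, not a vetted cryptographic primitive, and an optimising compiler is not obliged to preserve the timing behaviour.

```cpp
#include <cstddef>

// Hypothetical helper: returns true when the two buffers hold the same bytes.
// Every byte is examined regardless of where the first mismatch occurs, so
// the loop does not exit early the way std::memcmp may.
bool constant_time_compare(void const* lhs, void const* rhs, std::size_t length) {
    auto const* a = static_cast<unsigned char const*>(lhs);
    auto const* b = static_cast<unsigned char const*>(rhs);
    unsigned char diff = 0;
    for (std::size_t i = 0; i < length; ++i) {
        // Accumulate any differing bits; diff stays 0 only if all bytes match.
        diff |= static_cast<unsigned char>(a[i] ^ b[i]);
    }
    return 0 == diff;
}
```

Since a file-format magic number is a published constant rather than a secret, `std::memcmp` may well be adequate here; the sketch only shows what the suggested call would need in order to compile.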

203-216: Use RAII pattern for reader checkout/checkin.

The current implementation relies on manual checkin. Consider using RAII to automatically checkin readers when they go out of scope.

Example implementation:

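// Checks the section's reader back in when the guard goes out of scope.
class SectionReaderGuard {
    ArchiveReaderAdaptor& adaptor;
    std::string section;
public:
    SectionReaderGuard(ArchiveReaderAdaptor& a, std::string_view s)
        : adaptor(a), section(s) {}
    // Non-copyable so that a section is never checked in twice.
    SectionReaderGuard(SectionReaderGuard const&) = delete;
    SectionReaderGuard& operator=(SectionReaderGuard const&) = delete;
    ~SectionReaderGuard() {
        adaptor.checkin_reader_for_section(section);
    }
};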
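
To show how such a guard behaves on an early return, here is a self-contained sketch. `MockAdaptor` is a hypothetical stand-in for `ArchiveReaderAdaptor` that only records check-ins, and `read_section` is an invented caller; neither reflects the PR's actual code.

```cpp
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for ArchiveReaderAdaptor that records check-ins.
struct MockAdaptor {
    std::vector<std::string> checked_in;

    void checkin_reader_for_section(std::string_view section) {
        checked_in.emplace_back(section);
    }
};

// Checks the section back in when the guard is destroyed.
class SectionReaderGuard {
public:
    SectionReaderGuard(MockAdaptor& adaptor, std::string_view section)
            : m_adaptor{adaptor},
              m_section{section} {}

    // Non-copyable so that a section is never checked in twice.
    SectionReaderGuard(SectionReaderGuard const&) = delete;
    SectionReaderGuard& operator=(SectionReaderGuard const&) = delete;

    ~SectionReaderGuard() { m_adaptor.checkin_reader_for_section(m_section); }

private:
    MockAdaptor& m_adaptor;
    std::string m_section;
};

// Invented caller: returns early on a simulated error, yet the guard still
// checks the section in because its destructor runs on every exit path.
bool read_section(MockAdaptor& adaptor, bool simulate_error) {
    SectionReaderGuard guard{adaptor, "header"};
    if (simulate_error) {
        return false;
    }
    return true;
}
```

Calling `read_section` once with `simulate_error` set and once without leaves two `"header"` entries in `checked_in`, one per call, which is the property that manual check-in can miss on error paths.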
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eff2d33 and fe63954.

📒 Files selected for processing (1)
  • components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: ubuntu-focal-static-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: ubuntu-focal-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: build-macos (macos-13, false)
  • GitHub Check: build-macos (macos-13, true)
🔇 Additional comments (1)
components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1)

49-50: Resolve the FIXME and simplify deserialization.

The current implementation manually parses the MessagePack object due to the commented-out line. Addressing the FIXME would simplify the code and improve maintainability.

@@ -119,7 +119,7 @@ ErrorCode ArchiveReaderAdaptor::load_archive_metadata() {
}

ErrorCode ArchiveReaderAdaptor::try_read_header(clp::ReaderInterface& reader) {
    auto const clp_rc = m_reader->try_read_exact_length(
Contributor

Seems that we don't need to pass reader in this case?

Contributor Author

I'm passing reader to keep the function generic and consistent with the other helpers that read part of the metadata. I prefer it this way because I don't want to embed the assumption that the MFA flow initializes m_reader in a particular way.

I was erroneously using m_reader in this function before the most recent commit.
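
The design choice discussed in this thread can be sketched as follows. Everything here is a simplified, hypothetical stand-in: `ReaderInterface`, `BufferReader`, `Header`, and the free function `try_read_header` are not the PR's classes, and the boolean return replaces the real error codes.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <utility>

// Hypothetical minimal reader interface.
class ReaderInterface {
public:
    virtual ~ReaderInterface() = default;
    virtual bool try_read_exact_length(char* buf, std::size_t length) = 0;
};

// Hypothetical in-memory reader used to exercise the helper.
class BufferReader : public ReaderInterface {
public:
    explicit BufferReader(std::string data) : m_data{std::move(data)} {}

    bool try_read_exact_length(char* buf, std::size_t length) override {
        if (m_data.size() - m_pos < length) {
            return false;
        }
        std::memcpy(buf, m_data.data() + m_pos, length);
        m_pos += length;
        return true;
    }

private:
    std::string m_data;
    std::size_t m_pos{0};
};

// Hypothetical fixed-size header layout.
struct Header {
    char magic_number[4];
    std::uint32_t version;
};

// Reads the header from whichever reader the caller supplies, so the helper
// makes no assumption about how any member reader was initialised.
bool try_read_header(ReaderInterface& reader, Header& header) {
    return reader.try_read_exact_length(reinterpret_cast<char*>(&header), sizeof(header));
}
```

Because the helper depends only on its argument, the same function can serve any flow that has a reader positioned at the header.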

@gibber9809 gibber9809 requested a review from wraymo January 15, 2025 20:54
@gibber9809 gibber9809 merged commit 8f00463 into y-scope:main Jan 15, 2025
20 checks passed
2 participants