feat(clp-s): Add support for reading and searching single file archives. #656

gibber9809 · 2025-01-08T21:38:13Z

Description

This PR adds support for reading and searching clp-s single file archives. The general approach is to introduce a new "adaptor" class that allows most of the code to be oblivious as to whether the underlying archive is a single or multi-file archive.

This class is called "ArchiveReaderAdaptor" and makes up the bulk of this change.

Other notable changes include:

Modifying the multi-file-archive format to:
- Remove the dedicated timestamp dict file
- Add a new "header" file which includes header and metadata identical to the multi-file archive
Allowing ZstdDecompressor to accept a clp::ReaderInterface
Adapting read-side code to invoke ArchiveReaderAdaptor instead of opening files directly
Changing when different sections of the archive get opened during the read-side flow to ensure that single-file-archives can be read without seeks

Benchmarks

Some benchmarking was performed to get an idea of compression, decompression, and search performance with single file archives. All of the following benchmarking results are from averaging over 3-5 tests on the local filesystem clearing the cache between runs.

Compression is slightly slower (~1.2%) when compressing single-file archives (compared to multi-file archives).

dataset	sfa compression time / mfa compression time
cockroach	0.996
mongodb	1.040
elasticsearch	1.013
spark	0.998
postgresql	1.011
average	1.012

Surprisingly decompression is also slightly slower (~1%). This should be acceptable for now though, and we can work on optimizing bottlenecks later.

dataset	sfa decompression time / mfa decompression time
cockroach	0.996
mongodb	1.006
elasticsearch	1.010
spark	1.038
postgresql	1.001
average	1.010

Likewise search seems to be slower (~3.7%). Again this is worth optimizing for later, but should be sufficient for now.

query (from our paper)	sfa search time / mfa search time
J	0.963
M	1.090
K	1.060
average	1.037

Validation performed

Validated that multi-file-archives on the local filesystem can be decompressed and searched successfully on the command line
Validated that single-file-archives on s3 can be decompressed and searched succesfully on the command line
Validated that single-file-archives on the local filesystem can be decompressed and searched successfully on the command line

Summary by CodeRabbit

Release Notes

New Features
- Introduced ArchiveReaderAdaptor to enhance archive reading capabilities.
- Added support for opening packed streams in the archive reader.
- New methods for managing archive metadata and file information.
Improvements
- Streamlined dictionary and stream reading processes.
- Enhanced error handling for archive operations.
- Simplified interactions between archive readers and writers.
Technical Updates
- Updated methods for reading and managing archive metadata.
- Modified decompression utilities to support new input types.
- Refactored archive reading and processing workflows.
- Updated method signatures to improve modularity and clarity.
- Removed deprecated methods related to timestamp dictionary handling.

These changes improve the overall robustness and flexibility of the archive reading and processing infrastructure.

…gle and multi-file archives

coderabbitai · 2025-01-08T21:38:21Z

Walkthrough

This pull request introduces a comprehensive refactoring of the archive reading and processing infrastructure in the clp_s namespace. The changes center around the introduction of a new ArchiveReaderAdaptor class, which provides a more flexible and modular approach to reading archives. The modifications span multiple components, including dictionary readers, decompressors, and utility classes, with a focus on improving how archives are opened, read, and managed across different input sources.

Changes

File	Change Summary
`components/core/CMakeLists.txt`	Added `ArchiveReaderAdaptor.cpp` and `ArchiveReaderAdaptor.hpp` to source files
`components/core/src/clp_s/CMakeLists.txt`	Added `ArchiveReaderAdaptor.cpp` and `ArchiveReaderAdaptor.hpp` to `CLP_S_SOURCES`
`components/core/src/clp_s/ArchiveReader.cpp/hpp`	Integrated `ArchiveReaderAdaptor`, added `open_packed_streams()` method, updated dictionary and metadata reading
`components/core/src/clp_s/DictionaryReader.hpp`	Updated to use `ArchiveReaderAdaptor`, renamed `read_new_entries()` to `read_entries()`
`components/core/src/clp_s/ReaderUtils.cpp/hpp`	Modified methods to accept `ArchiveReaderAdaptor` instead of directory paths
`components/core/src/clp_s/ZstdDecompressor.cpp/hpp`	Added support for `clp::ReaderInterface` input type
`components/core/src/clp_s/TimestampDictionaryReader.cpp/hpp`	Removed file management methods, added `read()` method with decompressor
`components/core/src/clp_s/archive_constants.hpp`	Added `cArchiveHeaderFile`, removed `cArchiveTimestampDictFile`

Sequence Diagram

sequenceDiagram
    participant AR as ArchiveReader
    participant ARA as ArchiveReaderAdaptor
    participant DR as DictionaryReader
    participant ZD as ZstdDecompressor

    AR->>ARA: load_archive_metadata()
    ARA-->>AR: Metadata loaded
    AR->>ARA: checkout_reader_for_section()
    ARA-->>AR: Reader for section
    AR->>DR: read_entries()
    DR->>ARA: Get reader
    ARA-->>DR: Reader interface
    DR->>ZD: read with decompressor
    ZD-->>DR: Entries read

The sequence diagram illustrates the new workflow for reading archives, highlighting the central role of the ArchiveReaderAdaptor in managing reader interfaces, sections, and decompression across different components.

Possibly related PRs

fix(core-clp): Advance to the next message when a message has an out-of-range timestamp when searching archives (fixes #659). #660: The changes in this PR involve modifications to the handling of message indices in the context of searching archives, which may relate to the overall functionality of archive reading and processing introduced in the main PR. However, the specific changes do not directly connect to the new files or methods added in the main PR.

Suggested reviewers

kirkrodrigues

Finishing Touches

📝 Generate Docstrings (Beta)

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 8

🔭 Outside diff range comments (3)

components/core/src/clp_s/ReaderUtils.cpp (2)

Line range hint 17-61: Ensure schema_tree_reader is always checked back in

To prevent potential resource leaks, consider using RAII (Resource Acquisition Is Initialization) or a scope guard to ensure that adaptor.checkin_reader_for_section is called even if an exception is thrown before it. This will guarantee that the reader is always checked back in, maintaining resource integrity.

Line range hint 90-141: Ensure schema_id_reader is always checked back in

In the read_schemas method, to prevent potential resource leaks, consider implementing RAII or a scope guard for schema_id_reader. This will ensure that adaptor.checkin_reader_for_section is called even if exceptions occur, maintaining resource integrity.
components/core/src/clp_s/PackedStreamReader.cpp (1)
Line range hint 121-126: Fix incorrect assignment from std::unique_ptr<char[]> to std::shared_ptr<char[]>

The code incorrectly assigns a std::unique_ptr<char[]> to a std::shared_ptr<char[]>, which is invalid as there is no implicit conversion between them for array types. This can lead to undefined behaviour.

Apply the following fix to correctly create a std::shared_ptr<char[]>:
-    buf = std::make_unique<char[]>(uncompressed_size);
+    buf = std::shared_ptr<char[]>(new char[uncompressed_size]);
Alternatively, consider using std::vector<char> for buffer management if appropriate.

🧹 Nitpick comments (12)

components/core/src/clp_s/PackedStreamReader.hpp (1)
47-49: Consider passing adaptor by const reference to improve efficiency

In the open_packed_streams method, consider passing the adaptor parameter as a const std::shared_ptr<ArchiveReaderAdaptor>& instead of by value. This avoids unnecessary copying and reference count modifications, which can improve performance.

Suggested change:
-void open_packed_streams(std::shared_ptr<ArchiveReaderAdaptor> adaptor);
+void open_packed_streams(const std::shared_ptr<ArchiveReaderAdaptor>& adaptor);
components/core/src/clp_s/ArchiveWriter.cpp (2)
59-73: Improve clarity of offset calculation

The current implementation temporarily uses the o field to store the size before calculating offsets. This dual use of the field is not immediately clear and could confuse future maintainers.

Consider using a more explicit approach:
     std::vector<ArchiveFileInfo> files{
             {constants::cArchiveSchemaTreeFile, schema_tree_compressed_size},
             {constants::cArchiveSchemaMapFile, schema_map_compressed_size},
             {constants::cArchiveTableMetadataFile, table_metadata_compressed_size},
             {constants::cArchiveVarDictFile, var_dict_compressed_size},
             {constants::cArchiveLogDictFile, log_dict_compressed_size},
             {constants::cArchiveArrayDictFile, array_dict_compressed_size},
             {constants::cArchiveTablesFile, table_compressed_size}
     };
     uint64_t offset = 0;
     for (auto& file : files) {
-        uint64_t original_size = file.o;
-        file.o = offset;
-        offset += original_size;
+        uint64_t const file_size = file.o;  // Store the size from the initialization
+        file.o = offset;                    // Set the file's offset
+        offset += file_size;                // Increment offset by file size
     }
86-89: Improve readability of compressed size calculation

The calculation of m_compressed_size spans multiple lines and includes many components, making it difficult to verify that all necessary components are included.

Consider using a more structured approach:
-        m_compressed_size
-                = var_dict_compressed_size + log_dict_compressed_size + array_dict_compressed_size
-                  + metadata_size + schema_tree_compressed_size + schema_map_compressed_size
-                  + table_metadata_compressed_size + table_compressed_size + sizeof(ArchiveHeader);
+        // Calculate total compressed size from all components
+        m_compressed_size = sizeof(ArchiveHeader)  // Header size
+                           // Dictionary sizes
+                           + var_dict_compressed_size
+                           + log_dict_compressed_size
+                           + array_dict_compressed_size
+                           // Schema sizes
+                           + schema_tree_compressed_size
+                           + schema_map_compressed_size
+                           // Table sizes
+                           + table_metadata_compressed_size
+                           + table_compressed_size
+                           // Metadata size
+                           + metadata_size;
components/core/src/clp_s/TimestampDictionaryReader.cpp (1)

41-41: Use consistent data types: change loop index to uint64_t

The loop index i in for (int i = 0; i < num_patterns; ++i) is declared as an int, while num_patterns is of type uint64_t. To avoid potential issues with signed-unsigned comparisons and to maintain consistency, consider declaring i as a uint64_t.

components/core/src/clp_s/TimestampDictionaryReader.hpp (1)

26-30: Enhance method documentation for read

To improve clarity and maintainability, consider expanding the documentation for the read method. Including details such as any exceptions that may be thrown and elaborating on the method's behaviour will be beneficial.

components/core/src/clp_s/DictionaryReader.hpp (1)

42-44: Enhance method documentation for read_entries

Consider providing detailed documentation for the read_entries method, including descriptions of the parameters like lazy, expected behaviour, and any exceptions that might be thrown. This will enhance code readability and ease future maintenance.

components/core/src/clp_s/ReaderUtils.cpp (2)

67-70: Consistent use of smart pointers for resource management

In the get_variable_dictionary_reader method, ensure consistent and safe resource management by verifying that the VariableDictionaryReader object correctly handles resource acquisition and release, especially in exceptional circumstances.

75-78: Consistent use of smart pointers for resource management

Similarly, in the get_log_type_dictionary_reader method, confirm that the LogTypeDictionaryReader object properly manages resources to prevent leaks or dangling pointers.
components/core/src/clp_s/ArchiveReaderAdaptor.hpp (1)
31-33: Add move constructor/assignment operator declarations.

The class manages resources (shared_ptr, unique_ptr) but doesn't declare move operations. Consider adding them to support efficient resource transfer.

Add the following declarations:
     explicit ArchiveReaderAdaptor(Path const& archive_path, NetworkAuthOption const& network_auth);
 
     ~ArchiveReaderAdaptor();
+
+    // Move constructor/assignment
+    ArchiveReaderAdaptor(ArchiveReaderAdaptor&&) = default;
+    ArchiveReaderAdaptor& operator=(ArchiveReaderAdaptor&&) = default;
+
+    // Explicitly delete copy operations
+    ArchiveReaderAdaptor(const ArchiveReaderAdaptor&) = delete;
+    ArchiveReaderAdaptor& operator=(const ArchiveReaderAdaptor&) = delete;
components/core/src/clp_s/ZstdDecompressor.hpp (1)
46-46: Rename parameter for clarity.

The parameter name file_read_buffer_capacity suggests file-specific usage, but this method handles generic reader interfaces.

Apply this diff to improve clarity:
-    void open(clp::ReaderInterface& reader, size_t file_read_buffer_capacity) override;
+    void open(clp::ReaderInterface& reader, size_t read_buffer_capacity) override;
components/core/src/clp_s/JsonConstructor.cpp (1)
46-46: LGTM! Consider adding documentation.

The addition of open_packed_streams() follows a logical sequence in the initialization process. Consider adding a comment explaining the initialization order requirements.
+    // Open packed streams after reading dictionaries and metadata
     m_archive_reader->open_packed_streams();
components/core/src/clp_s/search/Output.cpp (1)
68-69: LGTM! Consider adding error handling.

The addition of open_packed_streams() is well-placed in the execution flow. Consider adding error handling for potential stream opening failures.
-    m_archive_reader->open_packed_streams();
+    try {
+        m_archive_reader->open_packed_streams();
+    } catch (const std::exception& e) {
+        SPDLOG_ERROR("Failed to open packed streams: {}", e.what());
+        return false;
+    }

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5d3b671 and ae419b0.

📒 Files selected for processing (23)

components/core/CMakeLists.txt (1 hunks)
components/core/src/clp_s/ArchiveReader.cpp (6 hunks)
components/core/src/clp_s/ArchiveReader.hpp (6 hunks)
components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1 hunks)
components/core/src/clp_s/ArchiveReaderAdaptor.hpp (1 hunks)
components/core/src/clp_s/ArchiveWriter.cpp (2 hunks)
components/core/src/clp_s/ArchiveWriter.hpp (1 hunks)
components/core/src/clp_s/CMakeLists.txt (1 hunks)
components/core/src/clp_s/Decompressor.hpp (2 hunks)
components/core/src/clp_s/DictionaryReader.hpp (5 hunks)
components/core/src/clp_s/JsonConstructor.cpp (1 hunks)
components/core/src/clp_s/PackedStreamReader.cpp (4 hunks)
components/core/src/clp_s/PackedStreamReader.hpp (3 hunks)
components/core/src/clp_s/ReaderUtils.cpp (4 hunks)
components/core/src/clp_s/ReaderUtils.hpp (2 hunks)
components/core/src/clp_s/TimestampDictionaryReader.cpp (2 hunks)
components/core/src/clp_s/TimestampDictionaryReader.hpp (1 hunks)
components/core/src/clp_s/Utils.cpp (1 hunks)
components/core/src/clp_s/ZstdDecompressor.cpp (5 hunks)
components/core/src/clp_s/ZstdDecompressor.hpp (4 hunks)
components/core/src/clp_s/archive_constants.hpp (1 hunks)
components/core/src/clp_s/clp-s.cpp (1 hunks)
components/core/src/clp_s/search/Output.cpp (1 hunks)

🧰 Additional context used

📓 Path-based instructions (21)

components/core/src/clp_s/Decompressor.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/search/Output.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/archive_constants.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/TimestampDictionaryReader.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/Utils.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/JsonConstructor.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/ZstdDecompressor.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/clp-s.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/PackedStreamReader.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/ArchiveReader.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/ArchiveWriter.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/ArchiveWriter.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/TimestampDictionaryReader.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/ZstdDecompressor.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/ArchiveReader.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/PackedStreamReader.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/ReaderUtils.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/DictionaryReader.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/ReaderUtils.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/ArchiveReaderAdaptor.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

⏰ Context from checks skipped due to timeout of 90000ms (12)

GitHub Check: ubuntu-jammy-static-linked-bins
GitHub Check: ubuntu-jammy-dynamic-linked-bins
GitHub Check: centos-stream-9-static-linked-bins
GitHub Check: centos-stream-9-dynamic-linked-bins
GitHub Check: ubuntu-focal-static-linked-bins
GitHub Check: ubuntu-focal-dynamic-linked-bins
GitHub Check: build-macos (macos-14, true)
GitHub Check: build-macos (macos-13, false)
GitHub Check: lint-check (ubuntu-latest)
GitHub Check: build-macos (macos-13, true)
GitHub Check: build (macos-latest)
GitHub Check: lint-check (macos-latest)

🔇 Additional comments (25)

components/core/src/clp_s/clp-s.cpp (1)

147-147: LGTM! Method name change aligns with the new architecture.

The change from read_timestamp_dictionary() to get_timestamp_dictionary() reflects the architectural shift to use the ArchiveReaderAdaptor for handling both single and multi-file archives.

Let's verify the implementation of the new method:

✅ Verification successful

Implementation verified: get_timestamp_dictionary() correctly supports the new architecture

The method is properly implemented in ArchiveReaderAdaptor and correctly delegated through ArchiveReader, with consistent usage across the codebase for timestamp operations.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the implementation of get_timestamp_dictionary()
# Expected: The method should be defined in ArchiveReaderAdaptor and should handle both single and multi-file archives

# Search for the method definition
ast-grep --pattern 'get_timestamp_dictionary() {
  $$$
}'

# Search for references to ensure consistent usage
rg -A 2 'get_timestamp_dictionary'
Length of output: 2331
components/core/src/clp_s/PackedStreamReader.hpp (3)

9-10: Includes necessary headers for new functionality

The addition of ReaderInterface.hpp and ArchiveReaderAdaptor.hpp is appropriate and ensures that the necessary interfaces are available for the updated implementation.

89-90: Proper use of smart pointers for resource management

The introduction of m_adaptor as a std::shared_ptr and updating m_packed_stream_reader to a std::unique_ptr align with best practices for smart pointer usage, ensuring appropriate ownership and lifetime management of resources.

93-93: Initialization of m_begin_offset ensures defined behaviour

Initializing m_begin_offset to zero ensures that it has a well-defined starting value, preventing potential issues with uninitialized variables.

components/core/src/clp_s/archive_constants.hpp (1)

10-12: LGTM! The new header constant follows consistent naming and organization.

The addition of cArchiveHeaderFile and its placement in a dedicated "Header and metadata section" maintains good code organization. The constant's value follows the established pattern of other archive path constants.

components/core/src/clp_s/Utils.cpp (2)

Line range hint 79-87: LGTM! The updated file check aligns with the new archive format.

The replacement of cArchiveTimestampDictFile with cArchiveHeaderFile in the condition maintains the correct detection of multi-file archives while adapting to the new archive format. The condition follows the coding guideline of placing the constant on the left side of the equality operator.

Line range hint 79-87: Verify comprehensive testing of archive operations.

Since this change affects archive format detection, please ensure:

All archive operations (read, write, search) are tested with both single and multi-file archives

Migration paths exist for existing archives using the old timestamp dictionary format

Let's verify the usage of the old timestamp dictionary format:

components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1)

31-35: Verify the conditional logic for setting m_single_file_archive

The condition if (InputSource::Filesystem != archive_path.source || std::filesystem::is_regular_file(archive_path.path)) will set m_single_file_archive to true if either the input source is not the filesystem or the path is a regular file. Please verify that this logic correctly identifies single-file archives in all intended scenarios.

components/core/src/clp_s/ArchiveReader.cpp (1)

Line range hint 1-360: Code changes look good

The integration of ArchiveReaderAdaptor enhances the modularity and maintainability of the archive reading process. The updates are well-structured and align with the project architecture.

components/core/src/clp_s/Decompressor.hpp (1)

54-59: LGTM! Well-documented interface extension.

The new virtual method enhances the Decompressor interface to support reading from a clp::ReaderInterface, which aligns well with the PR's objective of supporting single file archives.

components/core/src/clp_s/ReaderUtils.hpp (1)

30-34: LGTM! Documentation updates are clear and consistent.

The documentation has been updated to accurately reflect the switch from string paths to ArchiveReaderAdaptor references.

Also applies to: 37-41, 44-49, 53-58, 62-67

components/core/src/clp_s/ArchiveReaderAdaptor.hpp (2)

17-30: LGTM! Well-designed class with proper error handling.

The class follows SOLID principles and includes thorough error handling through a custom exception class.

43-59: LGTM! Robust checkout/checkin pattern.

The checkout/checkin pattern effectively prevents concurrent access to sections and enforces sequential access to prevent backwards seeks.

components/core/src/clp_s/ZstdDecompressor.hpp (1)

115-116: LGTM! Consistent extension for ClpReader support.

The InputType enum and member variable additions properly support the new reader interface capability.

Also applies to: 134-134
components/core/src/clp_s/ArchiveReader.hpp (4)

45-49: LGTM! Documentation follows standard format.

The new method declaration is well-documented and follows the established documentation pattern in the codebase.

113-113: LGTM! Delegation to adaptor improves modularity.

The change to delegate timestamp dictionary access to the adaptor improves separation of concerns and aligns with the adaptor pattern.

201-201: LGTM! Member variable follows naming convention.

The new member variable follows the established m_ prefix naming convention.

56-56: Verify the impact of changing dictionary read behaviour.

The change from read_new_entries to read_entries could affect the dictionary loading behaviour. Please ensure this change is intentional and doesn't impact existing functionality.

Also applies to: 66-66, 76-76

✅ Verification successful

The change is intentional and correct as part of the CLP_S implementation

This change is part of a new implementation in the clp_s directory, which is separate from the existing CLP/GLT implementations. The new read_entries method includes enhanced functionality with optional lazy loading support. The existing implementations using read_new_entries remain unchanged and unaffected.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Search for any other uses of read_new_entries to ensure all occurrences are updated
rg "read_new_entries" -A 2
Length of output: 2561

Script:
#!/bin/bash
# Search for read_entries implementation and usage
ast-grep --pattern 'read_entries($_) { $$$ }'

# Also search with ripgrep for any other occurrences
rg "read_entries" -A 2
Length of output: 1539
components/core/src/clp_s/ArchiveWriter.hpp (1)

183-183: LGTM! Documentation update improves clarity.

The documentation change appropriately generalizes the method's purpose to support both single and multi-file archives.

components/core/src/clp_s/CMakeLists.txt (1)

91-92: LGTM! New source files are properly integrated.

The new ArchiveReaderAdaptor files are correctly added to the CLP_S_SOURCES set, maintaining alphabetical order.

components/core/CMakeLists.txt (1)

264-265: LGTM! New source files are properly integrated into the test suite.

The ArchiveReaderAdaptor files are correctly added to SOURCE_FILES_clp_s_unitTest, ensuring they'll be included in the unit test build.

Let's verify the test coverage for the new ArchiveReaderAdaptor:

components/core/src/clp_s/PackedStreamReader.cpp (4)

3-5: Appropriate inclusion of new headers

The added header files BoundedReader.hpp, archive_constants.hpp, and ArchiveReaderAdaptor.hpp are necessary for the new functionality and are correctly included.

44-59: open_packed_streams method updated appropriately

The open_packed_streams method correctly updates the state, initializes m_adaptor, and handles errors appropriately when checking the position. The use of ArchiveReaderAdaptor enhances the modularity and maintainability of the code.

63-78: Proper resource management in close method

The close method effectively uses the needs_checkin flag to determine if resources need to be checked in. The state transitions and resetting of member variables are correctly handled to ensure proper cleanup.

Line range hint 94-114: Correct state management and error handling in read_stream

The read_stream method appropriately manages state transitions, validates stream IDs, calculates adjusted file offsets, and handles errors properly. The integration with BoundedReader and usage of the decompressor are correctly implemented.

components/core/src/clp_s/ArchiveWriter.cpp

coderabbitai · 2025-01-08T21:42:29Z

components/core/src/clp_s/TimestampDictionaryReader.cpp

+        entry.try_read_from_file(decompressor);
        if (false == StringUtils::tokenize_column_descriptor(entry.get_key_name(), tokens)) {
            throw OperationFailed(ErrorCodeCorrupt, __FILENAME__, __LINE__);
        }


🛠️ Refactor suggestion

Ensure consistent error handling: replace exception with error code return

In the read method, error handling has shifted to returning ErrorCode instead of throwing exceptions. However, the call to StringUtils::tokenize_column_descriptor still throws an exception upon failure. For consistency and to maintain uniform error handling, consider returning an appropriate ErrorCode instead of throwing an exception.

components/core/src/clp_s/TimestampDictionaryReader.cpp

components/core/src/clp_s/DictionaryReader.hpp

components/core/src/clp_s/ArchiveReaderAdaptor.cpp

coderabbitai · 2025-01-08T21:42:29Z

components/core/src/clp_s/ZstdDecompressor.cpp

+                            // TODO: attempt to translate clp error codes
+                            return ErrorCodeFailure;
+                        }
+                    }


🛠️ Refactor suggestion

Implement error code translation for CLP errors

The TODO comment indicates that CLP error codes are not currently translated. Instead of returning a generic ErrorCodeFailure, consider mapping the CLP error codes to the appropriate internal error codes to provide more informative error handling.

Do you want assistance in implementing the error code translation or should I open a GitHub issue to track this task?

components/core/src/clp_s/ReaderUtils.hpp

…nt number

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (2)

components/core/src/clp_s/TimestampDictionaryReader.cpp (2)

19-21: ⚠️ Potential issue

Ensure consistent error handling approach.

The method mixes error handling approaches by throwing exceptions in these sections while using error codes elsewhere. Consider maintaining consistency by returning error codes instead of throwing exceptions.

-        if (auto rc = entry.try_read_from_file(decompressor); ErrorCodeSuccess != rc) {
-            throw OperationFailed(rc, __FILENAME__, __LINE__);
+        if (auto rc = entry.try_read_from_file(decompressor); ErrorCodeSuccess != rc) {
+            return rc;
         }

-        if (false == StringUtils::tokenize_column_descriptor(entry.get_key_name(), tokens)) {
-            throw OperationFailed(ErrorCodeCorrupt, __FILENAME__, __LINE__);
+        if (false == StringUtils::tokenize_column_descriptor(entry.get_key_name(), tokens)) {
+            return ErrorCodeCorrupt;
         }

Also applies to: 22-25

41-58: ⚠️ Potential issue

Add bounds checking for numeric inputs.

The pattern reading section should validate num_patterns and pattern_len to prevent potential memory exhaustion or buffer overflow attacks from malformed input.

     error = decompressor.try_read_numeric_value<uint64_t>(num_patterns);
     if (ErrorCodeSuccess != error) {
         return error;
     }
+    if (num_patterns > MAX_PATTERNS) {
+        return ErrorCodeInvalidInput;
+    }
     for (int i = 0; i < num_patterns; ++i) {
         uint64_t id, pattern_len;
         std::string pattern;
         error = decompressor.try_read_numeric_value<uint64_t>(id);
         if (ErrorCodeSuccess != error) {
             return error;
         }
         error = decompressor.try_read_numeric_value<uint64_t>(pattern_len);
         if (ErrorCodeSuccess != error) {
             return error;
         }
+        if (pattern_len > MAX_PATTERN_LENGTH) {
+            return ErrorCodeInvalidInput;
+        }
         error = decompressor.try_read_string(pattern_len, pattern);

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bc834a6 and af0b784.

📒 Files selected for processing (1)

components/core/src/clp_s/TimestampDictionaryReader.cpp (2 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

components/core/src/clp_s/TimestampDictionaryReader.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

⏰ Context from checks skipped due to timeout of 90000ms (12)

GitHub Check: ubuntu-focal-static-linked-bins
GitHub Check: ubuntu-jammy-static-linked-bins
GitHub Check: ubuntu-focal-dynamic-linked-bins
GitHub Check: centos-stream-9-static-linked-bins
GitHub Check: ubuntu-jammy-dynamic-linked-bins
GitHub Check: centos-stream-9-dynamic-linked-bins
GitHub Check: build-macos (macos-14, false)
GitHub Check: lint-check (ubuntu-latest)
GitHub Check: build-macos (macos-13, false)
GitHub Check: lint-check (macos-latest)
GitHub Check: build-macos (macos-13, true)
GitHub Check: build (macos-latest)

🔇 Additional comments (1)

components/core/src/clp_s/TimestampDictionaryReader.cpp (1)

8-14: LGTM! Well-structured error handling pattern.

The new method signature and initial read operation demonstrate good practices with consistent error handling and clear variable naming.

coderabbitai · 2025-01-10T19:31:36Z

components/core/src/clp_s/TimestampDictionaryReader.cpp

    if (ErrorCodeSuccess != error) {
-        throw OperationFailed(error, __FILENAME__, __LINE__);
+        return error;
    }
    for (int i = 0; i < num_patterns; ++i) {


⚠️ Potential issue

Fix type mismatch in loop counter.

The loop counter 'i' is declared as int while num_patterns is uint64_t. This could cause overflow issues if num_patterns exceeds INT_MAX.

- for (int i = 0; i < num_patterns; ++i) { + for (uint64_t i = 0; i < num_patterns; ++i) {

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

for (int i = 0; i < num_patterns; ++i) {

for (uint64_t i = 0; i < num_patterns; ++i) {

wraymo

Thanks for the PR! Here are some initial comments.

components/core/src/clp_s/ArchiveReaderAdaptor.hpp

wraymo · 2025-01-14T14:01:33Z

components/core/src/clp_s/ArchiveReaderAdaptor.hpp

+    ErrorCode try_read_archive_file_info(ZstdDecompressor& decompressor, size_t size);
+
+    ErrorCode try_read_timestamp_dictionary(ZstdDecompressor& decompressor, size_t size);
+
+    ErrorCode try_read_archive_info(ZstdDecompressor& decompressor, size_t size);
+
+    std::shared_ptr<clp::ReaderInterface> try_create_reader_at_header();
+
+    std::unique_ptr<clp::ReaderInterface> checkout_reader_for_sfa_section(std::string_view section);
+
+    ErrorCode try_read_header(clp::ReaderInterface& reader);
+
+    ErrorCode try_read_archive_metadata(ZstdDecompressor& reader);


Should we add descriptions for these methods?

components/core/src/clp_s/ArchiveReaderAdaptor.hpp

wraymo · 2025-01-14T14:17:10Z

components/core/src/clp_s/ArchiveReaderAdaptor.cpp

+    m_current_reader_holder.reset();
+}
+
+std::shared_ptr<clp::ReaderInterface> ArchiveReaderAdaptor::try_create_reader_at_header() {


Can we move this method before ArchiveReaderAdaptor::checkout...

wraymo · 2025-01-14T14:35:51Z

components/core/src/clp_s/ZstdDecompressor.cpp

@@ -162,6 +186,28 @@ void ZstdDecompressor::open(FileReader& file_reader, size_t file_read_buffer_cap
    reset_stream();
 }

+void ZstdDecompressor::open(clp::ReaderInterface& reader, size_t file_read_buffer_capacity) {


Maybe in the future FileReader is an inherited class of clp::ReaderInterface, so we don't need to have two methods and two types.

Yeah, will probably want to refactor this in the future + converge on one set of readers for clp and clp-s. I could put up a github issue to delete clp_s::FileReader as a near-term step if you want? I don't think we have too many uses of it left in our codebase, and it should be possible to switch the ones we do have over to clp::FileReader.

wraymo · 2025-01-14T14:38:35Z

components/core/src/clp_s/ZstdDecompressor.cpp

@@ -239,9 +287,22 @@ ErrorCode ZstdDecompressor::open(std::string const& compressed_file_path) {

 void ZstdDecompressor::reset_stream() {
    if (InputType::File == m_input_type) {
-        m_file_reader->seek_from_begin(m_file_reader_initial_pos);
+        auto rc = m_file_reader->try_seek_from_begin(m_file_reader_initial_pos);


Can we do something like

Suggested change

auto rc = m_file_reader->try_seek_from_begin(m_file_reader_initial_pos);

if (auto rc = m_file_reader->try_seek_from_begin(m_file_reader_initial_pos); ErrorCodeSuccess != rc && ErrorCodeEndOfFile != rc) {

throw OperationFailed(rc, __FILENAME__, __LINE__);

}

wraymo · 2025-01-14T14:43:13Z

components/core/src/clp_s/Utils.cpp

@@ -1,5 +1,7 @@
 #include "Utils.hpp"

+#include <charconv>
+#include <cstdint>


Do we need this header file?

Yes -- std::from_chars comes from charconv, and we need cstdint for the (u)int(X)_t types.

wraymo · 2025-01-14T14:48:03Z

components/core/src/clp_s/ArchiveReaderAdaptor.hpp

+     * Load metadata for an archive including the header and metadata section. This method must be
+     * invoked before checking out any section of an archive, or calling `get_timestamp_dictionary`.
+     * @return ErrorCodeSuccess on success
+     * @return ErrorCode_errno on failure


ErrorCodeErrno?

Meant to imply that it can return any error. Maybe I can rephrase as "relevant error code on failure" or something?

Rephrased as "relevant ErrorCode on failure."

I think ErrorCodeSuccess is in clp-s, while ErrorCode_error is in clp. "Relevant ErrorCode on failure' also seems good.

wraymo · 2025-01-14T14:50:56Z

components/core/src/clp_s/ArchiveReaderAdaptor.cpp

+}
+
+ArchiveReaderAdaptor::~ArchiveReaderAdaptor() {
+    m_reader.reset();


Do we need this?

We do not. I will get rid of it.

Co-authored-by: wraymo <[email protected]>

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (12)

components/core/src/clp_s/ArchiveReaderAdaptor.cpp (3)

31-35: Improve readability of single file archive condition.

The condition for determining if an archive is single file could be more explicit.
-    if (InputSource::Filesystem != archive_path.source
-        || std::filesystem::is_regular_file(archive_path.path))
-    {
-        m_single_file_archive = true;
-    }
+    m_single_file_archive = (InputSource::Filesystem != archive_path.source)
+                           || std::filesystem::is_regular_file(archive_path.path);
95-97: Define a constant for the expected segment count.

The magic number 1 for the expected number of segments should be defined as a named constant to improve code maintainability and readability.
+    static constexpr uint32_t cExpectedNumSegments = 1;
-    if (1 != m_archive_info.num_segments) {
+    if (cExpectedNumSegments != m_archive_info.num_segments) {
         return ErrorCodeUnsupported;
     }
222-226: Optimize lambda capture in find_if.

The lambda captures section by reference unnecessarily, which could be captured by value since it's a string_view.
     auto it = std::find_if(
             m_archive_file_info.files.begin(),
             m_archive_file_info.files.end(),
-            [&](ArchiveFileInfo& info) { return info.n == section; }
+            [section](ArchiveFileInfo& info) { return info.n == section; }
     );

components/core/src/clp_s/ArchiveReaderAdaptor.hpp (7)

1-16: Consider adding file-level documentation.

Add a file-level documentation block explaining the purpose of this header file, its role in the archive reading infrastructure, and any important implementation notes.

+/**
+ * @file ArchiveReaderAdaptor.hpp
+ * @brief Adaptor class for reading single and multi-file archives from various sources
+ *
+ * This header defines the ArchiveReaderAdaptor class which provides a unified interface
+ * for reading archives regardless of their storage type (single/multi-file) or location
+ * (S3/local filesystem).
+ */
 #ifndef CLP_S_ARCHIVEREADERADAPTOR_HPP

18-22: Enhance class documentation.

While the current documentation is good, consider adding:

@brief tag for quick reference
@details section explaining the adaptor pattern usage
@note section about thread safety guarantees

 /**
+ * @brief Adaptor for unified access to different types of archives
+ *
  * ArchiveReaderAdaptor is an adaptor class which helps with reading single and multi-file archives
  * which exist on either S3 or a locally mounted file system.
+ *
+ * @details This class implements the adaptor pattern to provide a unified interface for reading
+ * archives, abstracting away the differences between single and multi-file archives, as well as
+ * their storage locations (S3 or local filesystem).
+ *
+ * @note This class is not thread-safe and should not be accessed concurrently from multiple threads.
  */

24-29: Document the exception class.

Add documentation for the OperationFailed exception class explaining when it's thrown and what the error codes mean.

+    /**
+     * @brief Exception thrown when archive operations fail
+     *
+     * This exception is thrown when operations such as checking out/in readers
+     * or accessing sections fail due to invalid state or I/O errors.
+     */
     class OperationFailed : public TraceableException {

31-31: Document the constructor.

Add documentation for the constructor explaining the parameters and any preconditions.

+    /**
+     * @brief Constructs an ArchiveReaderAdaptor
+     * @param archive_path Path to the archive file
+     * @param network_auth Network authentication options for remote archives
+     */
     explicit ArchiveReaderAdaptor(Path const& archive_path, NetworkAuthOption const& network_auth);

59-64: Document getter methods.

Add documentation for get_timestamp_dictionary() and get_header() methods explaining their purpose and any preconditions.

+    /**
+     * @brief Gets the timestamp dictionary reader
+     * @return Shared pointer to the timestamp dictionary reader
+     * @note load_archive_metadata() must be called before using this method
+     */
     std::shared_ptr<TimestampDictionaryReader> get_timestamp_dictionary() {
         return m_timestamp_dictionary;
     }

+    /**
+     * @brief Gets the archive header
+     * @return Constant reference to the archive header
+     * @note load_archive_metadata() must be called before using this method
+     */
     ArchiveHeader const& get_header() const { return m_archive_header; }

125-135: Document member variables and consider using polymorphism.

Add documentation for member variables to improve maintainability.
Consider using polymorphism instead of the boolean flag for handling different archive types.

+    /// Path to the archive file
     Path m_archive_path{};
+    /// Network authentication options for remote archives
     NetworkAuthOption m_network_auth{};
+    /// Flag indicating whether this is a single file archive
+    /// @note Consider using polymorphism instead of this flag
     bool m_single_file_archive{false};
+    /// Information about files in the archive
     ArchiveFileInfoPacket m_archive_file_info{};
+    /// Archive header information
     ArchiveHeader m_archive_header{};
+    /// General archive information
     ArchiveInfoPacket m_archive_info{};
+    /// Offset to the files section in the archive
     size_t m_files_section_offset{};
+    /// Name of the section currently being read, if any
     std::optional<std::string> m_current_reader_holder;
+    /// Reader for accessing timestamp information
     std::shared_ptr<TimestampDictionaryReader> m_timestamp_dictionary;
+    /// Current reader interface
     std::shared_ptr<clp::ReaderInterface> m_reader;

Consider refactoring to use polymorphism:

class ArchiveReaderInterface {
public:
    virtual ~ArchiveReaderInterface() = default;
    virtual ErrorCode load_archive_metadata() = 0;
    // ... other methods
};

class SingleFileArchiveReader : public ArchiveReaderInterface {
    // Implementation for single file archives
};

class MultiFileArchiveReader : public ArchiveReaderInterface {
    // Implementation for multi-file archives
};

// Factory function
std::unique_ptr<ArchiveReaderInterface> create_archive_reader(
    Path const& archive_path,
    NetworkAuthOption const& network_auth);

22-22: Consider adding move semantics support.

The class manages resources (readers, paths) but lacks move constructor and assignment operator. Consider adding them to support efficient resource transfer.

 class ArchiveReaderAdaptor {
 public:
+    // Rule of 5
+    ArchiveReaderAdaptor(ArchiveReaderAdaptor&&) noexcept = default;
+    ArchiveReaderAdaptor& operator=(ArchiveReaderAdaptor&&) noexcept = default;
+    ArchiveReaderAdaptor(const ArchiveReaderAdaptor&) = delete;
+    ArchiveReaderAdaptor& operator=(const ArchiveReaderAdaptor&) = delete;

components/core/src/clp_s/ZstdDecompressor.cpp (2)

189-209: Consider refactoring to reduce code duplication with FileReader open method.

The implementation is nearly identical to the open(FileReader&, size_t) method. Consider extracting the common logic into a private helper method to improve maintainability and reduce duplication.

-void ZstdDecompressor::open(clp::ReaderInterface& reader, size_t file_read_buffer_capacity) {
+template<typename ReaderType>
+void ZstdDecompressor::open_reader(ReaderType& reader, InputType type, size_t file_read_buffer_capacity) {
     if (InputType::NotInitialized != m_input_type) {
         throw OperationFailed(ErrorCodeNotReady, __FILENAME__, __LINE__);
     }
-    m_input_type = InputType::ClpReader;
+    m_input_type = type;

     m_reader = &reader;
     m_file_reader_initial_pos = m_reader->get_pos();

     // Avoid reallocating the internal buffer if this instance is being re-used with an
     // unchanged buffer size.
     if (file_read_buffer_capacity != m_file_read_buffer_capacity) {
         m_file_read_buffer_capacity = file_read_buffer_capacity;
         m_file_read_buffer = std::make_unique<char[]>(m_file_read_buffer_capacity);
     }
     m_file_read_buffer_length = 0;

     m_compressed_stream_block = {m_file_read_buffer.get(), m_file_read_buffer_length, 0};

     reset_stream();
 }
+
+void ZstdDecompressor::open(clp::ReaderInterface& reader, size_t file_read_buffer_capacity) {
+    open_reader(reader, InputType::ClpReader, file_read_buffer_capacity);
+}

290-306: Improve error handling consistency and code structure.

Error handling differs between FileReader and ClpReader cases. Consider using the same pattern for both.
Based on previous review comments, the seek condition could be simplified.

-    if (InputType::File == m_input_type) {
-        if (auto rc = m_file_reader->try_seek_from_begin(m_file_reader_initial_pos);
-            ErrorCodeSuccess != rc && ErrorCodeEndOfFile != rc)
-        {
-            throw OperationFailed(rc, __FILENAME__, __LINE__);
-        }
-        m_file_read_buffer_length = 0;
-        m_compressed_stream_block.size = m_file_read_buffer_length;
-    } else if (InputType::ClpReader == m_input_type) {
-        auto rc = m_reader->try_seek_from_begin(m_file_reader_initial_pos);
-        m_file_read_buffer_length = 0;
-        m_compressed_stream_block.size = m_file_read_buffer_length;
-        if (false
-            == (clp::ErrorCode::ErrorCode_Success == rc || clp::ErrorCode::ErrorCode_EndOfFile == rc
-            ))
-        {
-            throw OperationFailed(static_cast<ErrorCode>(rc), __FILENAME__, __LINE__);
-        }
+    if (InputType::File == m_input_type || InputType::ClpReader == m_input_type) {
+        auto* reader = (InputType::File == m_input_type) ? m_file_reader : m_reader;
+        auto rc = reader->try_seek_from_begin(m_file_reader_initial_pos);
+        if (auto success = (InputType::File == m_input_type) 
+                ? (ErrorCodeSuccess == rc || ErrorCodeEndOfFile == rc)
+                : (clp::ErrorCode::ErrorCode_Success == rc || clp::ErrorCode::ErrorCode_EndOfFile == rc);
+            !success)
+        {
+            throw OperationFailed(static_cast<ErrorCode>(rc), __FILENAME__, __LINE__);
+        }
+        m_file_read_buffer_length = 0;
+        m_compressed_stream_block.size = m_file_read_buffer_length;
     }

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 85f4b5e and de9fb38.

📒 Files selected for processing (4)

components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1 hunks)
components/core/src/clp_s/ArchiveReaderAdaptor.hpp (1 hunks)
components/core/src/clp_s/TimestampDictionaryReader.cpp (2 hunks)
components/core/src/clp_s/ZstdDecompressor.cpp (5 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

components/core/src/clp_s/TimestampDictionaryReader.cpp

🧰 Additional context used

📓 Path-based instructions (3)

components/core/src/clp_s/ZstdDecompressor.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/ArchiveReaderAdaptor.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

⏰ Context from checks skipped due to timeout of 90000ms (12)

GitHub Check: ubuntu-focal-static-linked-bins
GitHub Check: ubuntu-focal-dynamic-linked-bins
GitHub Check: ubuntu-jammy-static-linked-bins
GitHub Check: ubuntu-jammy-dynamic-linked-bins
GitHub Check: centos-stream-9-static-linked-bins
GitHub Check: centos-stream-9-dynamic-linked-bins
GitHub Check: build-macos (macos-14, false)
GitHub Check: build-macos (macos-13, false)
GitHub Check: build-macos (macos-13, true)
GitHub Check: lint-check (ubuntu-latest)
GitHub Check: lint-check (macos-latest)
GitHub Check: build (macos-latest)

🔇 Additional comments (4)

components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1)

49-72: Resolve the FIXME and simplify deserialization.

The current implementation manually parses the MessagePack object due to the commented-out line m_archive_file_info = obj.as<clp_s::ArchiveFileInfoPacket>();. Addressing the FIXME by fixing the issue with obj.as<clp_s::ArchiveFileInfoPacket>() would simplify the code and improve maintainability.
components/core/src/clp_s/ZstdDecompressor.cpp (3)

86-109: Implement error code translation for CLP errors.

The TODO comment indicates that CLP error codes are not currently translated. Instead of returning a generic ErrorCodeFailure, consider mapping the CLP error codes to the appropriate internal error codes to provide more informative error handling.

Do you want assistance in implementing the error code translation or should I open a GitHub issue to track this task?

220-225: LGTM! Clean and consistent implementation.

The ClpReader case is handled appropriately, maintaining consistency with the existing pattern.

238-238: Consider future refactoring opportunity.

The condition has been correctly updated to include ClpReader. However, as discussed in previous comments, if FileReader becomes an inherited class of ReaderInterface in the future, this check could be simplified.

Let's check for any remaining uses of FileReader that could be migrated:

✅ Verification successful

Refactoring already implemented in most components

The condition in ZstdDecompressor.cpp is correct for the current architecture. Our verification shows that FileReader already inherits from ReaderInterface in both clp and glt namespaces. Only the clp_s::FileReader implementation remains to be refactored to follow this pattern.

components/core/src/clp_s/FileReader.hpp: Consider refactoring this implementation to inherit from ReaderInterface, matching the pattern used in other components.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Search for FileReader usage
rg -l "FileReader" | grep -v "ZstdDecompressor"
Length of output: 4204

Script:
#!/bin/bash
# Check FileReader class definitions and ReaderInterface
echo "=== FileReader class definitions ==="
ast-grep --pattern 'class FileReader$$$'

echo -e "\n=== ReaderInterface definitions ==="
ast-grep --pattern 'class ReaderInterface$$$'
Length of output: 260

Script:
#!/bin/bash
# Examine FileReader and ReaderInterface class definitions with context
echo "=== FileReader Headers ==="
rg -U "class FileReader" --type hpp -A 5

echo -e "\n=== ReaderInterface Headers ==="
rg -U "class ReaderInterface" --type hpp -A 5
Length of output: 288

Script:
#!/bin/bash
echo "=== FileReader in clp ==="
cat "components/core/src/clp/FileReader.hpp"

echo -e "\n=== FileReader in clp_s ==="
cat "components/core/src/clp_s/FileReader.hpp"

echo -e "\n=== FileReader in glt ==="
cat "components/core/src/glt/FileReader.hpp"

echo -e "\n=== ReaderInterface definitions ==="
cat "components/core/src/clp/ReaderInterface.hpp" "components/core/src/glt/ReaderInterface.hpp"
Length of output: 22712

components/core/src/clp_s/ArchiveReaderAdaptor.cpp

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (4)

components/core/src/clp_s/ArchiveReaderAdaptor.cpp (4)
70-72: Log exceptions when catching them to aid in debugging

In the catch block, the exception is caught but not logged. Logging the exception message can help with debugging and provide more context when errors occur.

Apply this diff to add logging:
        } catch (std::exception const& e) {
+            SPDLOG_ERROR("Exception while reading archive file info: {}", e.what());
            return ErrorCodeCorrupt;
        }
91-93: Log exceptions when deserializing archive info

The exception caught during deserialization of archive info is not logged. Logging the exception can assist in diagnosing issues during runtime.

Apply this diff to add logging:
    } catch (std::exception const& e) {
+        SPDLOG_ERROR("Exception while reading archive info: {}", e.what());
        return ErrorCodeCorrupt;
    }
191-193: Use std::filesystem::path for path concatenation in try_create_reader_at_header

Concatenating paths using string operations may result in invalid paths if path separators are not properly handled. It is recommended to use std::filesystem::path to safely concatenate paths.

Apply this diff to use std::filesystem::path:
            return std::make_shared<clp::FileReader>(
-                m_archive_path.path + constants::cArchiveHeaderFile
+                (std::filesystem::path(m_archive_path.path) / constants::cArchiveHeaderFile).string()
            );
214-215: Use std::filesystem::path for path concatenation to handle path separators correctly

Concatenating paths using string concatenation may result in invalid file paths if path separators are not correctly managed. It is recommended to use std::filesystem::path for path concatenation to ensure correctness and portability.

Apply this diff to use std::filesystem::path:
        return std::make_unique<clp::FileReader>(
-            m_archive_path.path + std::string{section}
+            (std::filesystem::path(m_archive_path.path) / section).string()
        );

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between de9fb38 and eff2d33.

📒 Files selected for processing (1)

components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

⏰ Context from checks skipped due to timeout of 90000ms (10)

GitHub Check: ubuntu-focal-static-linked-bins
GitHub Check: centos-stream-9-static-linked-bins
GitHub Check: ubuntu-focal-dynamic-linked-bins
GitHub Check: centos-stream-9-dynamic-linked-bins
GitHub Check: ubuntu-jammy-static-linked-bins
GitHub Check: ubuntu-jammy-dynamic-linked-bins
GitHub Check: build-macos (macos-14, false)
GitHub Check: build-macos (macos-14, true)
GitHub Check: build-macos (macos-13, false)
GitHub Check: build-macos (macos-13, true)

🔇 Additional comments (2)

components/core/src/clp_s/ArchiveReaderAdaptor.cpp (2)

49-72: Address the FIXME and resolve the deserialization issue

The code is currently manually parsing the MessagePack object due to the commented-out line m_archive_file_info = obj.as<clp_s::ArchiveFileInfoPacket>();. Resolving the FIXME by fixing the deserialization issue would simplify the code and improve maintainability.

178-180: Handle unknown metadata packet types appropriately

The default case in the switch statement silently ignores unknown metadata packet types, which could lead to unnoticed issues. Consider logging a warning or returning an error when an unknown packet type is encountered.

Apply this diff to handle unknown packet types:
            default:
+                SPDLOG_WARN("Unknown metadata packet type: {}", static_cast<int>(packet_type));
+                return ErrorCodeUnsupported;
-                break;

coderabbitai · 2025-01-14T20:40:59Z

components/core/src/clp_s/ArchiveReaderAdaptor.cpp

+    auto const clp_rc = m_reader->try_read_exact_length(
+            reinterpret_cast<char*>(&m_archive_header),
+            sizeof(m_archive_header)
+    );
+    if (clp::ErrorCode::ErrorCode_Success != clp_rc) {
+        return ErrorCodeErrno;
+    }


⚠️ Potential issue

Use the passed reader object instead of m_reader in try_read_header

The function try_read_header should use the reader parameter provided to it rather than m_reader. Using m_reader may lead to unexpected behaviour if it is not properly initialised or if a different reader is intended.

Apply this diff to fix the issue:

- auto const clp_rc = m_reader->try_read_exact_length( + auto const clp_rc = reader.try_read_exact_length( reinterpret_cast<char*>(&m_archive_header), sizeof(m_archive_header) );

Committable suggestion skipped: line range outside the PR's diff.

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (4)

components/core/src/clp_s/ArchiveReaderAdaptor.cpp (4)
95-98: Enhance error message for unsupported segment count.

When multiple segments are detected, provide more context in the error message to aid debugging.
     if (1 != m_archive_info.num_segments) {
+        SPDLOG_ERROR("Unsupported number of segments: {}", m_archive_info.num_segments);
         return ErrorCodeUnsupported;
     }
102-102: Improve naming of the decompressor buffer capacity constant.

The constant cDecompressorFileReadBufferCapacity could be more descriptive. Consider a name that indicates its purpose and unit, e.g., cMetadataDecompressorBufferSizeBytes.

130-138: Use constant-time comparison for magic number.

When comparing magic numbers or other security-sensitive data, use a constant-time comparison to prevent timing attacks.
-    if (0
-        != std::memcmp(
-                m_archive_header.magic_number,
-                cStructuredSFAMagicNumber,
-                sizeof(cStructuredSFAMagicNumber)
-        ))
+    if (!constant_time_compare(
+            m_archive_header.magic_number,
+            cStructuredSFAMagicNumber,
+            sizeof(cStructuredSFAMagicNumber)))
203-216: Use RAII pattern for reader checkout/checkin.

The current implementation relies on manual checkin. Consider using RAII to automatically checkin readers when they go out of scope.

Example implementation:
class SectionReaderGuard {
    ArchiveReaderAdaptor& adaptor;
    std::string section;
public:
    SectionReaderGuard(ArchiveReaderAdaptor& a, std::string_view s)
        : adaptor(a), section(s) {}
    ~SectionReaderGuard() {
        adaptor.checkin_reader_for_section(section);
    }
};

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eff2d33 and fe63954.

📒 Files selected for processing (1)

components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

⏰ Context from checks skipped due to timeout of 90000ms (10)

GitHub Check: centos-stream-9-static-linked-bins
GitHub Check: ubuntu-focal-static-linked-bins
GitHub Check: ubuntu-jammy-static-linked-bins
GitHub Check: centos-stream-9-dynamic-linked-bins
GitHub Check: ubuntu-focal-dynamic-linked-bins
GitHub Check: ubuntu-jammy-dynamic-linked-bins
GitHub Check: build-macos (macos-14, false)
GitHub Check: build-macos (macos-14, true)
GitHub Check: build-macos (macos-13, false)
GitHub Check: build-macos (macos-13, true)

🔇 Additional comments (1)

components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1)

49-50: Resolve the FIXME and simplify deserialization.

The current implementation manually parses the MessagePack object due to the commented-out line. Addressing the FIXME would simplify the code and improve maintainability.

components/core/src/clp_s/ArchiveReaderAdaptor.cpp

wraymo · 2025-01-15T20:32:10Z

components/core/src/clp_s/ArchiveReaderAdaptor.cpp

@@ -119,7 +119,7 @@ ErrorCode ArchiveReaderAdaptor::load_archive_metadata() {
 }

 ErrorCode ArchiveReaderAdaptor::try_read_header(clp::ReaderInterface& reader) {
-    auto const clp_rc = m_reader->try_read_exact_length(


Seems that we don't need to pass reader in this case?

Passing reader to make it more generic and follow the convention of the other helpers that read some part of the metadata. I prefer doing it this way because I don't want to embed the assumption that the MFA flow initializes m_reader in a particular way.

I was erroneously using m_reader in this function before the most recent commit.

gibber9809 added 7 commits January 8, 2025 21:21

Write header for multi-file archives

cac0298

Add ArchiveReaderAdaptor to provide generic interface for reading sin…

db97a3b

…gle and multi-file archives

Update set of files present in multi-file archives

2c9dc60

Update ZstdDecompressor to accept clp::ReaderInterface

345b771

Simplify TimestampDictionaryReader implementation

686c028

Update CMakeLists

b1af99c

Update several components to use ArchiveReaderAdaptor

ae419b0

gibber9809 requested a review from wraymo January 8, 2025 21:38

coderabbitai bot reviewed Jan 8, 2025

View reviewed changes

gibber9809 added 2 commits January 9, 2025 16:02

Fix bug where any path starting with a number gets treated as a segme…

bc834a6

…nt number

Address rabbit comment

af0b784

coderabbitai bot reviewed Jan 10, 2025

View reviewed changes

wraymo reviewed Jan 14, 2025

View reviewed changes

gibber9809 and others added 2 commits January 14, 2025 12:25

Apply suggestions from code review

85f4b5e

Co-authored-by: wraymo <[email protected]>

Address review comments

de9fb38

gibber9809 requested a review from wraymo January 14, 2025 17:52

coderabbitai bot reviewed Jan 14, 2025

View reviewed changes

components/core/src/clp_s/ArchiveReaderAdaptor.cpp Outdated Show resolved Hide resolved

components/core/src/clp_s/ArchiveReaderAdaptor.cpp Show resolved Hide resolved

Address rabbit comment

eff2d33

coderabbitai bot reviewed Jan 14, 2025

View reviewed changes

wraymo previously approved these changes Jan 15, 2025

View reviewed changes

Fix one more issue caught by rabbit

fe63954

gibber9809 dismissed wraymo’s stale review via fe63954 January 15, 2025 14:45

gibber9809 requested a review from wraymo January 15, 2025 14:46

coderabbitai bot reviewed Jan 15, 2025

View reviewed changes

components/core/src/clp_s/ArchiveReaderAdaptor.cpp Show resolved Hide resolved

components/core/src/clp_s/ArchiveReaderAdaptor.cpp Show resolved Hide resolved

components/core/src/clp_s/ArchiveReaderAdaptor.cpp Show resolved Hide resolved

wraymo reviewed Jan 15, 2025

View reviewed changes

gibber9809 requested a review from wraymo January 15, 2025 20:54

wraymo approved these changes Jan 15, 2025

View reviewed changes

gibber9809 merged commit 8f00463 into y-scope:main Jan 15, 2025
20 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(clp-s): Add support for reading and searching single file archives. #656

feat(clp-s): Add support for reading and searching single file archives. #656

gibber9809 commented Jan 8, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Jan 8, 2025 •

edited

Loading

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

coderabbitai bot left a comment

coderabbitai bot Jan 8, 2025

coderabbitai bot Jan 8, 2025

coderabbitai bot left a comment

coderabbitai bot Jan 10, 2025

wraymo left a comment

wraymo Jan 14, 2025

wraymo Jan 14, 2025

wraymo Jan 14, 2025

gibber9809 Jan 14, 2025

wraymo Jan 14, 2025

wraymo Jan 14, 2025

gibber9809 Jan 14, 2025

wraymo Jan 14, 2025

gibber9809 Jan 14, 2025

gibber9809 Jan 14, 2025

wraymo Jan 14, 2025

wraymo Jan 14, 2025

gibber9809 Jan 14, 2025

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot Jan 14, 2025

coderabbitai bot left a comment

wraymo Jan 15, 2025

gibber9809 Jan 15, 2025

	for (int i = 0; i < num_patterns; ++i) {
	for (uint64_t i = 0; i < num_patterns; ++i) {

feat(clp-s): Add support for reading and searching single file archives. #656

feat(clp-s): Add support for reading and searching single file archives. #656

Conversation

gibber9809 commented Jan 8, 2025 • edited by coderabbitai bot Loading

Description

Benchmarks

Validation performed

Summary by CodeRabbit

Release Notes

coderabbitai bot commented Jan 8, 2025 • edited Loading

Walkthrough

Changes

Sequence Diagram

Possibly related PRs

Suggested reviewers

Finishing Touches

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot Jan 8, 2025

Choose a reason for hiding this comment

coderabbitai bot Jan 8, 2025

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot Jan 10, 2025

Choose a reason for hiding this comment

wraymo left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot Jan 14, 2025

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gibber9809 commented Jan 8, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Jan 8, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)