-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(clp-s): Add support for reading and searching single file archives. #656
Conversation
…gle and multi-file archives
WalkthroughThis pull request introduces a comprehensive refactoring of the archive reading and processing infrastructure in the Changes
Sequence DiagramsequenceDiagram
participant AR as ArchiveReader
participant ARA as ArchiveReaderAdaptor
participant DR as DictionaryReader
participant ZD as ZstdDecompressor
AR->>ARA: load_archive_metadata()
ARA-->>AR: Metadata loaded
AR->>ARA: checkout_reader_for_section()
ARA-->>AR: Reader for section
AR->>DR: read_entries()
DR->>ARA: Get reader
ARA-->>DR: Reader interface
DR->>ZD: read with decompressor
ZD-->>DR: Entries read
The sequence diagram illustrates the new workflow for reading archives, highlighting the central role of the Possibly related PRs
Suggested reviewers
Finishing Touches
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 8
🔭 Outside diff range comments (3)
components/core/src/clp_s/ReaderUtils.cpp (2)
Line range hint
17-61
: Ensureschema_tree_reader
is always checked back inTo prevent potential resource leaks, consider using RAII (Resource Acquisition Is Initialization) or a scope guard to ensure that
adaptor.checkin_reader_for_section
is called even if an exception is thrown before it. This will guarantee that the reader is always checked back in, maintaining resource integrity.
Line range hint
90-141
: Ensureschema_id_reader
is always checked back inIn the
read_schemas
method, to prevent potential resource leaks, consider implementing RAII or a scope guard forschema_id_reader
. This will ensure thatadaptor.checkin_reader_for_section
is called even if exceptions occur, maintaining resource integrity.components/core/src/clp_s/PackedStreamReader.cpp (1)
Line range hint
121-126
: Fix incorrect assignment fromstd::unique_ptr<char[]>
tostd::shared_ptr<char[]>
The code incorrectly assigns a
std::unique_ptr<char[]>
to astd::shared_ptr<char[]>
, which is invalid as there is no implicit conversion between them for array types. This can lead to undefined behaviour.Apply the following fix to correctly create a
std::shared_ptr<char[]>
:- buf = std::make_unique<char[]>(uncompressed_size); + buf = std::shared_ptr<char[]>(new char[uncompressed_size]);Alternatively, consider using
std::vector<char>
for buffer management if appropriate.
🧹 Nitpick comments (12)
components/core/src/clp_s/PackedStreamReader.hpp (1)
47-49
: Consider passingadaptor
by const reference to improve efficiencyIn the
open_packed_streams
method, consider passing theadaptor
parameter as aconst std::shared_ptr<ArchiveReaderAdaptor>&
instead of by value. This avoids unnecessary copying and reference count modifications, which can improve performance.Suggested change:
-void open_packed_streams(std::shared_ptr<ArchiveReaderAdaptor> adaptor); +void open_packed_streams(const std::shared_ptr<ArchiveReaderAdaptor>& adaptor);components/core/src/clp_s/ArchiveWriter.cpp (2)
59-73
: Improve clarity of offset calculationThe current implementation temporarily uses the
o
field to store the size before calculating offsets. This dual use of the field is not immediately clear and could confuse future maintainers.Consider using a more explicit approach:
std::vector<ArchiveFileInfo> files{ {constants::cArchiveSchemaTreeFile, schema_tree_compressed_size}, {constants::cArchiveSchemaMapFile, schema_map_compressed_size}, {constants::cArchiveTableMetadataFile, table_metadata_compressed_size}, {constants::cArchiveVarDictFile, var_dict_compressed_size}, {constants::cArchiveLogDictFile, log_dict_compressed_size}, {constants::cArchiveArrayDictFile, array_dict_compressed_size}, {constants::cArchiveTablesFile, table_compressed_size} }; uint64_t offset = 0; for (auto& file : files) { - uint64_t original_size = file.o; - file.o = offset; - offset += original_size; + uint64_t const file_size = file.o; // Store the size from the initialization + file.o = offset; // Set the file's offset + offset += file_size; // Increment offset by file size }
86-89
: Improve readability of compressed size calculationThe calculation of
m_compressed_size
spans multiple lines and includes many components, making it difficult to verify that all necessary components are included.Consider using a more structured approach:
- m_compressed_size - = var_dict_compressed_size + log_dict_compressed_size + array_dict_compressed_size - + metadata_size + schema_tree_compressed_size + schema_map_compressed_size - + table_metadata_compressed_size + table_compressed_size + sizeof(ArchiveHeader); + // Calculate total compressed size from all components + m_compressed_size = sizeof(ArchiveHeader) // Header size + // Dictionary sizes + + var_dict_compressed_size + + log_dict_compressed_size + + array_dict_compressed_size + // Schema sizes + + schema_tree_compressed_size + + schema_map_compressed_size + // Table sizes + + table_metadata_compressed_size + + table_compressed_size + // Metadata size + + metadata_size;components/core/src/clp_s/TimestampDictionaryReader.cpp (1)
41-41
: Use consistent data types: change loop index touint64_t
The loop index
i
infor (int i = 0; i < num_patterns; ++i)
is declared as anint
, whilenum_patterns
is of typeuint64_t
. To avoid potential issues with signed-unsigned comparisons and to maintain consistency, consider declaringi
as auint64_t
.components/core/src/clp_s/TimestampDictionaryReader.hpp (1)
26-30
: Enhance method documentation forread
To improve clarity and maintainability, consider expanding the documentation for the
read
method. Including details such as any exceptions that may be thrown and elaborating on the method's behaviour will be beneficial.components/core/src/clp_s/DictionaryReader.hpp (1)
42-44
: Enhance method documentation forread_entries
Consider providing detailed documentation for the
read_entries
method, including descriptions of the parameters likelazy
, expected behaviour, and any exceptions that might be thrown. This will enhance code readability and ease future maintenance.components/core/src/clp_s/ReaderUtils.cpp (2)
67-70
: Consistent use of smart pointers for resource managementIn the
get_variable_dictionary_reader
method, ensure consistent and safe resource management by verifying that theVariableDictionaryReader
object correctly handles resource acquisition and release, especially in exceptional circumstances.
75-78
: Consistent use of smart pointers for resource managementSimilarly, in the
get_log_type_dictionary_reader
method, confirm that theLogTypeDictionaryReader
object properly manages resources to prevent leaks or dangling pointers.components/core/src/clp_s/ArchiveReaderAdaptor.hpp (1)
31-33
: Add move constructor/assignment operator declarations.The class manages resources (shared_ptr, unique_ptr) but doesn't declare move operations. Consider adding them to support efficient resource transfer.
Add the following declarations:
explicit ArchiveReaderAdaptor(Path const& archive_path, NetworkAuthOption const& network_auth); ~ArchiveReaderAdaptor(); + + // Move constructor/assignment + ArchiveReaderAdaptor(ArchiveReaderAdaptor&&) = default; + ArchiveReaderAdaptor& operator=(ArchiveReaderAdaptor&&) = default; + + // Explicitly delete copy operations + ArchiveReaderAdaptor(const ArchiveReaderAdaptor&) = delete; + ArchiveReaderAdaptor& operator=(const ArchiveReaderAdaptor&) = delete;components/core/src/clp_s/ZstdDecompressor.hpp (1)
46-46
: Rename parameter for clarity.The parameter name
file_read_buffer_capacity
suggests file-specific usage, but this method handles generic reader interfaces.Apply this diff to improve clarity:
- void open(clp::ReaderInterface& reader, size_t file_read_buffer_capacity) override; + void open(clp::ReaderInterface& reader, size_t read_buffer_capacity) override;components/core/src/clp_s/JsonConstructor.cpp (1)
46-46
: LGTM! Consider adding documentation.The addition of
open_packed_streams()
follows a logical sequence in the initialization process. Consider adding a comment explaining the initialization order requirements.+ // Open packed streams after reading dictionaries and metadata m_archive_reader->open_packed_streams();
components/core/src/clp_s/search/Output.cpp (1)
68-69
: LGTM! Consider adding error handling.The addition of
open_packed_streams()
is well-placed in the execution flow. Consider adding error handling for potential stream opening failures.- m_archive_reader->open_packed_streams(); + try { + m_archive_reader->open_packed_streams(); + } catch (const std::exception& e) { + SPDLOG_ERROR("Failed to open packed streams: {}", e.what()); + return false; + }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (23)
components/core/CMakeLists.txt
(1 hunks)components/core/src/clp_s/ArchiveReader.cpp
(6 hunks)components/core/src/clp_s/ArchiveReader.hpp
(6 hunks)components/core/src/clp_s/ArchiveReaderAdaptor.cpp
(1 hunks)components/core/src/clp_s/ArchiveReaderAdaptor.hpp
(1 hunks)components/core/src/clp_s/ArchiveWriter.cpp
(2 hunks)components/core/src/clp_s/ArchiveWriter.hpp
(1 hunks)components/core/src/clp_s/CMakeLists.txt
(1 hunks)components/core/src/clp_s/Decompressor.hpp
(2 hunks)components/core/src/clp_s/DictionaryReader.hpp
(5 hunks)components/core/src/clp_s/JsonConstructor.cpp
(1 hunks)components/core/src/clp_s/PackedStreamReader.cpp
(4 hunks)components/core/src/clp_s/PackedStreamReader.hpp
(3 hunks)components/core/src/clp_s/ReaderUtils.cpp
(4 hunks)components/core/src/clp_s/ReaderUtils.hpp
(2 hunks)components/core/src/clp_s/TimestampDictionaryReader.cpp
(2 hunks)components/core/src/clp_s/TimestampDictionaryReader.hpp
(1 hunks)components/core/src/clp_s/Utils.cpp
(1 hunks)components/core/src/clp_s/ZstdDecompressor.cpp
(5 hunks)components/core/src/clp_s/ZstdDecompressor.hpp
(4 hunks)components/core/src/clp_s/archive_constants.hpp
(1 hunks)components/core/src/clp_s/clp-s.cpp
(1 hunks)components/core/src/clp_s/search/Output.cpp
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (21)
components/core/src/clp_s/Decompressor.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/search/Output.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/archive_constants.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/TimestampDictionaryReader.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/Utils.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/JsonConstructor.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/ZstdDecompressor.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/clp-s.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/PackedStreamReader.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/ArchiveReader.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/ArchiveWriter.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/ArchiveWriter.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/TimestampDictionaryReader.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/ZstdDecompressor.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/ArchiveReader.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/PackedStreamReader.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/ReaderUtils.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/DictionaryReader.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/ReaderUtils.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/ArchiveReaderAdaptor.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
⏰ Context from checks skipped due to timeout of 90000ms (12)
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-focal-static-linked-bins
- GitHub Check: ubuntu-focal-dynamic-linked-bins
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: build-macos (macos-13, true)
- GitHub Check: build (macos-latest)
- GitHub Check: lint-check (macos-latest)
🔇 Additional comments (25)
components/core/src/clp_s/clp-s.cpp (1)
147-147
: LGTM! Method name change aligns with the new architecture.The change from
read_timestamp_dictionary()
toget_timestamp_dictionary()
reflects the architectural shift to use the ArchiveReaderAdaptor for handling both single and multi-file archives.Let's verify the implementation of the new method:
✅ Verification successful
Implementation verified: get_timestamp_dictionary() correctly supports the new architecture
The method is properly implemented in ArchiveReaderAdaptor and correctly delegated through ArchiveReader, with consistent usage across the codebase for timestamp operations.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the implementation of get_timestamp_dictionary() # Expected: The method should be defined in ArchiveReaderAdaptor and should handle both single and multi-file archives # Search for the method definition ast-grep --pattern 'get_timestamp_dictionary() { $$$ }' # Search for references to ensure consistent usage rg -A 2 'get_timestamp_dictionary'Length of output: 2331
components/core/src/clp_s/PackedStreamReader.hpp (3)
9-10
: Includes necessary headers for new functionalityThe addition of
ReaderInterface.hpp
andArchiveReaderAdaptor.hpp
is appropriate and ensures that the necessary interfaces are available for the updated implementation.
89-90
: Proper use of smart pointers for resource managementThe introduction of
m_adaptor
as astd::shared_ptr
and updatingm_packed_stream_reader
to astd::unique_ptr
align with best practices for smart pointer usage, ensuring appropriate ownership and lifetime management of resources.
93-93
: Initialization ofm_begin_offset
ensures defined behaviourInitializing
m_begin_offset
to zero ensures that it has a well-defined starting value, preventing potential issues with uninitialized variables.components/core/src/clp_s/archive_constants.hpp (1)
10-12
: LGTM! The new header constant follows consistent naming and organization.The addition of
cArchiveHeaderFile
and its placement in a dedicated "Header and metadata section" maintains good code organization. The constant's value follows the established pattern of other archive path constants.components/core/src/clp_s/Utils.cpp (2)
Line range hint
79-87
: LGTM! The updated file check aligns with the new archive format.The replacement of
cArchiveTimestampDictFile
withcArchiveHeaderFile
in the condition maintains the correct detection of multi-file archives while adapting to the new archive format. The condition follows the coding guideline of placing the constant on the left side of the equality operator.
Line range hint
79-87
: Verify comprehensive testing of archive operations.Since this change affects archive format detection, please ensure:
- All archive operations (read, write, search) are tested with both single and multi-file archives
- Migration paths exist for existing archives using the old timestamp dictionary format
Let's verify the usage of the old timestamp dictionary format:
components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1)
31-35
: Verify the conditional logic for settingm_single_file_archive
The condition
if (InputSource::Filesystem != archive_path.source || std::filesystem::is_regular_file(archive_path.path))
will setm_single_file_archive
totrue
if either the input source is not the filesystem or the path is a regular file. Please verify that this logic correctly identifies single-file archives in all intended scenarios.components/core/src/clp_s/ArchiveReader.cpp (1)
Line range hint
1-360
: Code changes look goodThe integration of
ArchiveReaderAdaptor
enhances the modularity and maintainability of the archive reading process. The updates are well-structured and align with the project architecture.components/core/src/clp_s/Decompressor.hpp (1)
54-59
: LGTM! Well-documented interface extension.The new virtual method enhances the Decompressor interface to support reading from a clp::ReaderInterface, which aligns well with the PR's objective of supporting single file archives.
components/core/src/clp_s/ReaderUtils.hpp (1)
30-34
: LGTM! Documentation updates are clear and consistent.The documentation has been updated to accurately reflect the switch from string paths to ArchiveReaderAdaptor references.
Also applies to: 37-41, 44-49, 53-58, 62-67
components/core/src/clp_s/ArchiveReaderAdaptor.hpp (2)
17-30
: LGTM! Well-designed class with proper error handling.The class follows SOLID principles and includes thorough error handling through a custom exception class.
43-59
: LGTM! Robust checkout/checkin pattern.The checkout/checkin pattern effectively prevents concurrent access to sections and enforces sequential access to prevent backwards seeks.
components/core/src/clp_s/ZstdDecompressor.hpp (1)
115-116
: LGTM! Consistent extension for ClpReader support.The InputType enum and member variable additions properly support the new reader interface capability.
Also applies to: 134-134
components/core/src/clp_s/ArchiveReader.hpp (4)
45-49
: LGTM! Documentation follows standard format.The new method declaration is well-documented and follows the established documentation pattern in the codebase.
113-113
: LGTM! Delegation to adaptor improves modularity.The change to delegate timestamp dictionary access to the adaptor improves separation of concerns and aligns with the adaptor pattern.
201-201
: LGTM! Member variable follows naming convention.The new member variable follows the established m_ prefix naming convention.
56-56
: Verify the impact of changing dictionary read behaviour.The change from
read_new_entries
toread_entries
could affect the dictionary loading behaviour. Please ensure this change is intentional and doesn't impact existing functionality.Also applies to: 66-66, 76-76
✅ Verification successful
The change is intentional and correct as part of the CLP_S implementation
This change is part of a new implementation in the
clp_s
directory, which is separate from the existing CLP/GLT implementations. The newread_entries
method includes enhanced functionality with optional lazy loading support. The existing implementations usingread_new_entries
remain unchanged and unaffected.🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Search for any other uses of read_new_entries to ensure all occurrences are updated rg "read_new_entries" -A 2Length of output: 2561
Script:
#!/bin/bash # Search for read_entries implementation and usage ast-grep --pattern 'read_entries($_) { $$$ }' # Also search with ripgrep for any other occurrences rg "read_entries" -A 2Length of output: 1539
components/core/src/clp_s/ArchiveWriter.hpp (1)
183-183
: LGTM! Documentation update improves clarity.The documentation change appropriately generalizes the method's purpose to support both single and multi-file archives.
components/core/src/clp_s/CMakeLists.txt (1)
91-92
: LGTM! New source files are properly integrated.The new ArchiveReaderAdaptor files are correctly added to the CLP_S_SOURCES set, maintaining alphabetical order.
components/core/CMakeLists.txt (1)
264-265
: LGTM! New source files are properly integrated into the test suite.The ArchiveReaderAdaptor files are correctly added to SOURCE_FILES_clp_s_unitTest, ensuring they'll be included in the unit test build.
Let's verify the test coverage for the new ArchiveReaderAdaptor:
components/core/src/clp_s/PackedStreamReader.cpp (4)
3-5
: Appropriate inclusion of new headersThe added header files
BoundedReader.hpp
,archive_constants.hpp
, andArchiveReaderAdaptor.hpp
are necessary for the new functionality and are correctly included.
44-59
:open_packed_streams
method updated appropriatelyThe
open_packed_streams
method correctly updates the state, initializesm_adaptor
, and handles errors appropriately when checking the position. The use ofArchiveReaderAdaptor
enhances the modularity and maintainability of the code.
63-78
: Proper resource management inclose
methodThe
close
method effectively uses theneeds_checkin
flag to determine if resources need to be checked in. The state transitions and resetting of member variables are correctly handled to ensure proper cleanup.
Line range hint
94-114
: Correct state management and error handling inread_stream
The
read_stream
method appropriately manages state transitions, validates stream IDs, calculates adjusted file offsets, and handles errors properly. The integration withBoundedReader
and usage of the decompressor are correctly implemented.
entry.try_read_from_file(decompressor); | ||
if (false == StringUtils::tokenize_column_descriptor(entry.get_key_name(), tokens)) { | ||
throw OperationFailed(ErrorCodeCorrupt, __FILENAME__, __LINE__); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Ensure consistent error handling: replace exception with error code return
In the read
method, error handling has shifted to returning ErrorCode
instead of throwing exceptions. However, the call to StringUtils::tokenize_column_descriptor
still throws an exception upon failure. For consistency and to maintain uniform error handling, consider returning an appropriate ErrorCode
instead of throwing an exception.
// TODO: attempt to translate clp error codes | ||
return ErrorCodeFailure; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Implement error code translation for CLP errors
The TODO comment indicates that CLP error codes are not currently translated. Instead of returning a generic ErrorCodeFailure
, consider mapping the CLP error codes to the appropriate internal error codes to provide more informative error handling.
Do you want assistance in implementing the error code translation or should I open a GitHub issue to track this task?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (2)
components/core/src/clp_s/TimestampDictionaryReader.cpp (2)
19-21
:⚠️ Potential issueEnsure consistent error handling approach.
The method mixes error handling approaches by throwing exceptions in these sections while using error codes elsewhere. Consider maintaining consistency by returning error codes instead of throwing exceptions.
- if (auto rc = entry.try_read_from_file(decompressor); ErrorCodeSuccess != rc) { - throw OperationFailed(rc, __FILENAME__, __LINE__); + if (auto rc = entry.try_read_from_file(decompressor); ErrorCodeSuccess != rc) { + return rc; } - if (false == StringUtils::tokenize_column_descriptor(entry.get_key_name(), tokens)) { - throw OperationFailed(ErrorCodeCorrupt, __FILENAME__, __LINE__); + if (false == StringUtils::tokenize_column_descriptor(entry.get_key_name(), tokens)) { + return ErrorCodeCorrupt; }Also applies to: 22-25
41-58
:⚠️ Potential issueAdd bounds checking for numeric inputs.
The pattern reading section should validate
num_patterns
andpattern_len
to prevent potential memory exhaustion or buffer overflow attacks from malformed input.error = decompressor.try_read_numeric_value<uint64_t>(num_patterns); if (ErrorCodeSuccess != error) { return error; } + if (num_patterns > MAX_PATTERNS) { + return ErrorCodeInvalidInput; + } for (int i = 0; i < num_patterns; ++i) { uint64_t id, pattern_len; std::string pattern; error = decompressor.try_read_numeric_value<uint64_t>(id); if (ErrorCodeSuccess != error) { return error; } error = decompressor.try_read_numeric_value<uint64_t>(pattern_len); if (ErrorCodeSuccess != error) { return error; } + if (pattern_len > MAX_PATTERN_LENGTH) { + return ErrorCodeInvalidInput; + } error = decompressor.try_read_string(pattern_len, pattern);
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
components/core/src/clp_s/TimestampDictionaryReader.cpp
(2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
components/core/src/clp_s/TimestampDictionaryReader.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
⏰ Context from checks skipped due to timeout of 90000ms (12)
- GitHub Check: ubuntu-focal-static-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-focal-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: lint-check (macos-latest)
- GitHub Check: build-macos (macos-13, true)
- GitHub Check: build (macos-latest)
🔇 Additional comments (1)
components/core/src/clp_s/TimestampDictionaryReader.cpp (1)
8-14
: LGTM! Well-structured error handling pattern.The new method signature and initial read operation demonstrate good practices with consistent error handling and clear variable naming.
if (ErrorCodeSuccess != error) { | ||
throw OperationFailed(error, __FILENAME__, __LINE__); | ||
return error; | ||
} | ||
for (int i = 0; i < num_patterns; ++i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix type mismatch in loop counter.
The loop counter 'i' is declared as int
while num_patterns
is uint64_t
. This could cause overflow issues if num_patterns
exceeds INT_MAX
.
- for (int i = 0; i < num_patterns; ++i) {
+ for (uint64_t i = 0; i < num_patterns; ++i) {
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
for (int i = 0; i < num_patterns; ++i) { | |
for (uint64_t i = 0; i < num_patterns; ++i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! Here are some initial comments.
ErrorCode try_read_archive_file_info(ZstdDecompressor& decompressor, size_t size); | ||
|
||
ErrorCode try_read_timestamp_dictionary(ZstdDecompressor& decompressor, size_t size); | ||
|
||
ErrorCode try_read_archive_info(ZstdDecompressor& decompressor, size_t size); | ||
|
||
std::shared_ptr<clp::ReaderInterface> try_create_reader_at_header(); | ||
|
||
std::unique_ptr<clp::ReaderInterface> checkout_reader_for_sfa_section(std::string_view section); | ||
|
||
ErrorCode try_read_header(clp::ReaderInterface& reader); | ||
|
||
ErrorCode try_read_archive_metadata(ZstdDecompressor& reader); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add descriptions for these methods?
m_current_reader_holder.reset(); | ||
} | ||
|
||
std::shared_ptr<clp::ReaderInterface> ArchiveReaderAdaptor::try_create_reader_at_header() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move this method before ArchiveReaderAdaptor::checkout...
@@ -162,6 +186,28 @@ void ZstdDecompressor::open(FileReader& file_reader, size_t file_read_buffer_cap | |||
reset_stream(); | |||
} | |||
|
|||
void ZstdDecompressor::open(clp::ReaderInterface& reader, size_t file_read_buffer_capacity) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe in the future FileReader
is an inherited class of clp::ReaderInterface
, so we don't need to have two methods and two types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, will probably want to refactor this in the future + converge on one set of readers for clp and clp-s. I could put up a github issue to delete clp_s::FileReader as a near-term step if you want? I don't think we have too many uses of it left in our codebase, and it should be possible to switch the ones we do have over to clp::FileReader.
@@ -239,9 +287,22 @@ ErrorCode ZstdDecompressor::open(std::string const& compressed_file_path) { | |||
|
|||
void ZstdDecompressor::reset_stream() { | |||
if (InputType::File == m_input_type) { | |||
m_file_reader->seek_from_begin(m_file_reader_initial_pos); | |||
auto rc = m_file_reader->try_seek_from_begin(m_file_reader_initial_pos); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we do something like
auto rc = m_file_reader->try_seek_from_begin(m_file_reader_initial_pos); | |
if (auto rc = m_file_reader->try_seek_from_begin(m_file_reader_initial_pos); ErrorCodeSuccess != rc && ErrorCodeEndOfFile != rc) { | |
throw OperationFailed(rc, __FILENAME__, __LINE__); | |
} | |
@@ -1,5 +1,7 @@ | |||
#include "Utils.hpp" | |||
|
|||
#include <charconv> | |||
#include <cstdint> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this header file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes -- std::from_chars comes from charconv, and we need cstdint for the (u)int(X)_t types.
* Load metadata for an archive including the header and metadata section. This method must be | ||
* invoked before checking out any section of an archive, or calling `get_timestamp_dictionary`. | ||
* @return ErrorCodeSuccess on success | ||
* @return ErrorCode_errno on failure |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ErrorCodeErrno
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Meant to imply that it can return any error. Maybe I can rephrase as "relevant error code on failure" or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rephrased as "relevant ErrorCode on failure."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think ErrorCodeSuccess
is in clp-s, while ErrorCode_error
is in clp. "Relevant ErrorCode on failure' also seems good.
} | ||
|
||
ArchiveReaderAdaptor::~ArchiveReaderAdaptor() { | ||
m_reader.reset(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not. I will get rid of it.
Co-authored-by: wraymo <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (12)
components/core/src/clp_s/ArchiveReaderAdaptor.cpp (3)
31-35
: Improve readability of single file archive condition.The condition for determining if an archive is single file could be more explicit.
- if (InputSource::Filesystem != archive_path.source - || std::filesystem::is_regular_file(archive_path.path)) - { - m_single_file_archive = true; - } + m_single_file_archive = (InputSource::Filesystem != archive_path.source) + || std::filesystem::is_regular_file(archive_path.path);
95-97
: Define a constant for the expected segment count.The magic number
1
for the expected number of segments should be defined as a named constant to improve code maintainability and readability.+ static constexpr uint32_t cExpectedNumSegments = 1; - if (1 != m_archive_info.num_segments) { + if (cExpectedNumSegments != m_archive_info.num_segments) { return ErrorCodeUnsupported; }
222-226
: Optimize lambda capture in find_if.The lambda captures
section
by reference unnecessarily, which could be captured by value since it's a string_view.auto it = std::find_if( m_archive_file_info.files.begin(), m_archive_file_info.files.end(), - [&](ArchiveFileInfo& info) { return info.n == section; } + [section](ArchiveFileInfo& info) { return info.n == section; } );components/core/src/clp_s/ArchiveReaderAdaptor.hpp (7)
1-16
: Consider adding file-level documentation.Add a file-level documentation block explaining the purpose of this header file, its role in the archive reading infrastructure, and any important implementation notes.
+/** + * @file ArchiveReaderAdaptor.hpp + * @brief Adaptor class for reading single and multi-file archives from various sources + * + * This header defines the ArchiveReaderAdaptor class which provides a unified interface + * for reading archives regardless of their storage type (single/multi-file) or location + * (S3/local filesystem). + */ #ifndef CLP_S_ARCHIVEREADERADAPTOR_HPP
18-22
: Enhance class documentation.While the current documentation is good, consider adding:
- @brief tag for quick reference
- @details section explaining the adaptor pattern usage
- @note section about thread safety guarantees
/** + * @brief Adaptor for unified access to different types of archives + * * ArchiveReaderAdaptor is an adaptor class which helps with reading single and multi-file archives * which exist on either S3 or a locally mounted file system. + * + * @details This class implements the adaptor pattern to provide a unified interface for reading + * archives, abstracting away the differences between single and multi-file archives, as well as + * their storage locations (S3 or local filesystem). + * + * @note This class is not thread-safe and should not be accessed concurrently from multiple threads. */
24-29
: Document the exception class.Add documentation for the
OperationFailed
exception class explaining when it's thrown and what the error codes mean.+ /** + * @brief Exception thrown when archive operations fail + * + * This exception is thrown when operations such as checking out/in readers + * or accessing sections fail due to invalid state or I/O errors. + */ class OperationFailed : public TraceableException {
31-31
: Document the constructor.Add documentation for the constructor explaining the parameters and any preconditions.
+ /** + * @brief Constructs an ArchiveReaderAdaptor + * @param archive_path Path to the archive file + * @param network_auth Network authentication options for remote archives + */ explicit ArchiveReaderAdaptor(Path const& archive_path, NetworkAuthOption const& network_auth);
59-64
: Document getter methods.Add documentation for get_timestamp_dictionary() and get_header() methods explaining their purpose and any preconditions.
+ /** + * @brief Gets the timestamp dictionary reader + * @return Shared pointer to the timestamp dictionary reader + * @note load_archive_metadata() must be called before using this method + */ std::shared_ptr<TimestampDictionaryReader> get_timestamp_dictionary() { return m_timestamp_dictionary; } + /** + * @brief Gets the archive header + * @return Constant reference to the archive header + * @note load_archive_metadata() must be called before using this method + */ ArchiveHeader const& get_header() const { return m_archive_header; }
125-135
: Document member variables and consider using polymorphism.
- Add documentation for member variables to improve maintainability.
- Consider using polymorphism instead of the boolean flag for handling different archive types.
+ /// Path to the archive file Path m_archive_path{}; + /// Network authentication options for remote archives NetworkAuthOption m_network_auth{}; + /// Flag indicating whether this is a single file archive + /// @note Consider using polymorphism instead of this flag bool m_single_file_archive{false}; + /// Information about files in the archive ArchiveFileInfoPacket m_archive_file_info{}; + /// Archive header information ArchiveHeader m_archive_header{}; + /// General archive information ArchiveInfoPacket m_archive_info{}; + /// Offset to the files section in the archive size_t m_files_section_offset{}; + /// Name of the section currently being read, if any std::optional<std::string> m_current_reader_holder; + /// Reader for accessing timestamp information std::shared_ptr<TimestampDictionaryReader> m_timestamp_dictionary; + /// Current reader interface std::shared_ptr<clp::ReaderInterface> m_reader;Consider refactoring to use polymorphism:
class ArchiveReaderInterface { public: virtual ~ArchiveReaderInterface() = default; virtual ErrorCode load_archive_metadata() = 0; // ... other methods }; class SingleFileArchiveReader : public ArchiveReaderInterface { // Implementation for single file archives }; class MultiFileArchiveReader : public ArchiveReaderInterface { // Implementation for multi-file archives }; // Factory function std::unique_ptr<ArchiveReaderInterface> create_archive_reader( Path const& archive_path, NetworkAuthOption const& network_auth);
22-22
: Consider adding move semantics support.The class manages resources (readers, paths) but lacks move constructor and assignment operator. Consider adding them to support efficient resource transfer.
class ArchiveReaderAdaptor { public: + // Rule of 5 + ArchiveReaderAdaptor(ArchiveReaderAdaptor&&) noexcept = default; + ArchiveReaderAdaptor& operator=(ArchiveReaderAdaptor&&) noexcept = default; + ArchiveReaderAdaptor(const ArchiveReaderAdaptor&) = delete; + ArchiveReaderAdaptor& operator=(const ArchiveReaderAdaptor&) = delete;components/core/src/clp_s/ZstdDecompressor.cpp (2)
189-209
: Consider refactoring to reduce code duplication with FileReader open method.The implementation is nearly identical to the
open(FileReader&, size_t)
method. Consider extracting the common logic into a private helper method to improve maintainability and reduce duplication.-void ZstdDecompressor::open(clp::ReaderInterface& reader, size_t file_read_buffer_capacity) { +template<typename ReaderType> +void ZstdDecompressor::open_reader(ReaderType& reader, InputType type, size_t file_read_buffer_capacity) { if (InputType::NotInitialized != m_input_type) { throw OperationFailed(ErrorCodeNotReady, __FILENAME__, __LINE__); } - m_input_type = InputType::ClpReader; + m_input_type = type; m_reader = &reader; m_file_reader_initial_pos = m_reader->get_pos(); // Avoid reallocating the internal buffer if this instance is being re-used with an // unchanged buffer size. if (file_read_buffer_capacity != m_file_read_buffer_capacity) { m_file_read_buffer_capacity = file_read_buffer_capacity; m_file_read_buffer = std::make_unique<char[]>(m_file_read_buffer_capacity); } m_file_read_buffer_length = 0; m_compressed_stream_block = {m_file_read_buffer.get(), m_file_read_buffer_length, 0}; reset_stream(); } + +void ZstdDecompressor::open(clp::ReaderInterface& reader, size_t file_read_buffer_capacity) { + open_reader(reader, InputType::ClpReader, file_read_buffer_capacity); +}
290-306
: Improve error handling consistency and code structure.
- Error handling differs between FileReader and ClpReader cases. Consider using the same pattern for both.
- Based on previous review comments, the seek condition could be simplified.
- if (InputType::File == m_input_type) { - if (auto rc = m_file_reader->try_seek_from_begin(m_file_reader_initial_pos); - ErrorCodeSuccess != rc && ErrorCodeEndOfFile != rc) - { - throw OperationFailed(rc, __FILENAME__, __LINE__); - } - m_file_read_buffer_length = 0; - m_compressed_stream_block.size = m_file_read_buffer_length; - } else if (InputType::ClpReader == m_input_type) { - auto rc = m_reader->try_seek_from_begin(m_file_reader_initial_pos); - m_file_read_buffer_length = 0; - m_compressed_stream_block.size = m_file_read_buffer_length; - if (false - == (clp::ErrorCode::ErrorCode_Success == rc || clp::ErrorCode::ErrorCode_EndOfFile == rc - )) - { - throw OperationFailed(static_cast<ErrorCode>(rc), __FILENAME__, __LINE__); - } + if (InputType::File == m_input_type || InputType::ClpReader == m_input_type) { + auto* reader = (InputType::File == m_input_type) ? m_file_reader : m_reader; + auto rc = reader->try_seek_from_begin(m_file_reader_initial_pos); + if (auto success = (InputType::File == m_input_type) + ? (ErrorCodeSuccess == rc || ErrorCodeEndOfFile == rc) + : (clp::ErrorCode::ErrorCode_Success == rc || clp::ErrorCode::ErrorCode_EndOfFile == rc); + !success) + { + throw OperationFailed(static_cast<ErrorCode>(rc), __FILENAME__, __LINE__); + } + m_file_read_buffer_length = 0; + m_compressed_stream_block.size = m_file_read_buffer_length; }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
components/core/src/clp_s/ArchiveReaderAdaptor.cpp
(1 hunks)components/core/src/clp_s/ArchiveReaderAdaptor.hpp
(1 hunks)components/core/src/clp_s/TimestampDictionaryReader.cpp
(2 hunks)components/core/src/clp_s/ZstdDecompressor.cpp
(5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- components/core/src/clp_s/TimestampDictionaryReader.cpp
🧰 Additional context used
📓 Path-based instructions (3)
components/core/src/clp_s/ZstdDecompressor.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/ArchiveReaderAdaptor.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
⏰ Context from checks skipped due to timeout of 90000ms (12)
- GitHub Check: ubuntu-focal-static-linked-bins
- GitHub Check: ubuntu-focal-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: build-macos (macos-13, true)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: lint-check (macos-latest)
- GitHub Check: build (macos-latest)
🔇 Additional comments (4)
components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1)
49-72
: Resolve the FIXME and simplify deserialization.The current implementation manually parses the MessagePack object due to the commented-out line
m_archive_file_info = obj.as<clp_s::ArchiveFileInfoPacket>();
. Addressing the FIXME by fixing the issue withobj.as<clp_s::ArchiveFileInfoPacket>()
would simplify the code and improve maintainability.components/core/src/clp_s/ZstdDecompressor.cpp (3)
86-109
: Implement error code translation for CLP errors.The TODO comment indicates that CLP error codes are not currently translated. Instead of returning a generic
ErrorCodeFailure
, consider mapping the CLP error codes to the appropriate internal error codes to provide more informative error handling.Do you want assistance in implementing the error code translation or should I open a GitHub issue to track this task?
220-225
: LGTM! Clean and consistent implementation.The ClpReader case is handled appropriately, maintaining consistency with the existing pattern.
238-238
: Consider future refactoring opportunity.The condition has been correctly updated to include ClpReader. However, as discussed in previous comments, if FileReader becomes an inherited class of ReaderInterface in the future, this check could be simplified.
Let's check for any remaining uses of FileReader that could be migrated:
✅ Verification successful
Refactoring already implemented in most components
The condition in ZstdDecompressor.cpp is correct for the current architecture. Our verification shows that FileReader already inherits from ReaderInterface in both clp and glt namespaces. Only the clp_s::FileReader implementation remains to be refactored to follow this pattern.
components/core/src/clp_s/FileReader.hpp
: Consider refactoring this implementation to inherit from ReaderInterface, matching the pattern used in other components.🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Search for FileReader usage rg -l "FileReader" | grep -v "ZstdDecompressor"Length of output: 4204
Script:
#!/bin/bash # Check FileReader class definitions and ReaderInterface echo "=== FileReader class definitions ===" ast-grep --pattern 'class FileReader$$$' echo -e "\n=== ReaderInterface definitions ===" ast-grep --pattern 'class ReaderInterface$$$'Length of output: 260
Script:
#!/bin/bash # Examine FileReader and ReaderInterface class definitions with context echo "=== FileReader Headers ===" rg -U "class FileReader" --type hpp -A 5 echo -e "\n=== ReaderInterface Headers ===" rg -U "class ReaderInterface" --type hpp -A 5Length of output: 288
Script:
#!/bin/bash echo "=== FileReader in clp ===" cat "components/core/src/clp/FileReader.hpp" echo -e "\n=== FileReader in clp_s ===" cat "components/core/src/clp_s/FileReader.hpp" echo -e "\n=== FileReader in glt ===" cat "components/core/src/glt/FileReader.hpp" echo -e "\n=== ReaderInterface definitions ===" cat "components/core/src/clp/ReaderInterface.hpp" "components/core/src/glt/ReaderInterface.hpp"Length of output: 22712
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (4)
components/core/src/clp_s/ArchiveReaderAdaptor.cpp (4)
70-72
: Log exceptions when catching them to aid in debuggingIn the catch block, the exception is caught but not logged. Logging the exception message can help with debugging and provide more context when errors occur.
Apply this diff to add logging:
} catch (std::exception const& e) { + SPDLOG_ERROR("Exception while reading archive file info: {}", e.what()); return ErrorCodeCorrupt; }
91-93
: Log exceptions when deserializing archive infoThe exception caught during deserialization of archive info is not logged. Logging the exception can assist in diagnosing issues during runtime.
Apply this diff to add logging:
} catch (std::exception const& e) { + SPDLOG_ERROR("Exception while reading archive info: {}", e.what()); return ErrorCodeCorrupt; }
191-193
: Usestd::filesystem::path
for path concatenation intry_create_reader_at_header
Concatenating paths using string operations may result in invalid paths if path separators are not properly handled. It is recommended to use
std::filesystem::path
to safely concatenate paths.Apply this diff to use
std::filesystem::path
:return std::make_shared<clp::FileReader>( - m_archive_path.path + constants::cArchiveHeaderFile + (std::filesystem::path(m_archive_path.path) / constants::cArchiveHeaderFile).string() );
214-215
: Usestd::filesystem::path
for path concatenation to handle path separators correctlyConcatenating paths using string concatenation may result in invalid file paths if path separators are not correctly managed. It is recommended to use
std::filesystem::path
for path concatenation to ensure correctness and portability.Apply this diff to use
std::filesystem::path
:return std::make_unique<clp::FileReader>( - m_archive_path.path + std::string{section} + (std::filesystem::path(m_archive_path.path) / section).string() );
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
components/core/src/clp_s/ArchiveReaderAdaptor.cpp
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: ubuntu-focal-static-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-focal-dynamic-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: build-macos (macos-13, true)
🔇 Additional comments (2)
components/core/src/clp_s/ArchiveReaderAdaptor.cpp (2)
49-72
: Address the FIXME and resolve the deserialization issueThe code is currently manually parsing the MessagePack object due to the commented-out line
m_archive_file_info = obj.as<clp_s::ArchiveFileInfoPacket>();
. Resolving theFIXME
by fixing the deserialization issue would simplify the code and improve maintainability.
178-180
: Handle unknown metadata packet types appropriatelyThe default case in the switch statement silently ignores unknown metadata packet types, which could lead to unnoticed issues. Consider logging a warning or returning an error when an unknown packet type is encountered.
Apply this diff to handle unknown packet types:
default: + SPDLOG_WARN("Unknown metadata packet type: {}", static_cast<int>(packet_type)); + return ErrorCodeUnsupported; - break;
auto const clp_rc = m_reader->try_read_exact_length( | ||
reinterpret_cast<char*>(&m_archive_header), | ||
sizeof(m_archive_header) | ||
); | ||
if (clp::ErrorCode::ErrorCode_Success != clp_rc) { | ||
return ErrorCodeErrno; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the passed reader
object instead of m_reader
in try_read_header
The function try_read_header
should use the reader
parameter provided to it rather than m_reader
. Using m_reader
may lead to unexpected behaviour if it is not properly initialised or if a different reader is intended.
Apply this diff to fix the issue:
- auto const clp_rc = m_reader->try_read_exact_length(
+ auto const clp_rc = reader.try_read_exact_length(
reinterpret_cast<char*>(&m_archive_header),
sizeof(m_archive_header)
);
Committable suggestion skipped: line range outside the PR's diff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Nitpick comments (4)
components/core/src/clp_s/ArchiveReaderAdaptor.cpp (4)
95-98
: Enhance error message for unsupported segment count.When multiple segments are detected, provide more context in the error message to aid debugging.
if (1 != m_archive_info.num_segments) { + SPDLOG_ERROR("Unsupported number of segments: {}", m_archive_info.num_segments); return ErrorCodeUnsupported; }
102-102
: Improve naming of the decompressor buffer capacity constant.The constant
cDecompressorFileReadBufferCapacity
could be more descriptive. Consider a name that indicates its purpose and unit, e.g.,cMetadataDecompressorBufferSizeBytes
.
130-138
: Use constant-time comparison for magic number.When comparing magic numbers or other security-sensitive data, use a constant-time comparison to prevent timing attacks.
- if (0 - != std::memcmp( - m_archive_header.magic_number, - cStructuredSFAMagicNumber, - sizeof(cStructuredSFAMagicNumber) - )) + if (!constant_time_compare( + m_archive_header.magic_number, + cStructuredSFAMagicNumber, + sizeof(cStructuredSFAMagicNumber)))
203-216
: Use RAII pattern for reader checkout/checkin.The current implementation relies on manual checkin. Consider using RAII to automatically checkin readers when they go out of scope.
Example implementation:
class SectionReaderGuard { ArchiveReaderAdaptor& adaptor; std::string section; public: SectionReaderGuard(ArchiveReaderAdaptor& a, std::string_view s) : adaptor(a), section(s) {} ~SectionReaderGuard() { adaptor.checkin_reader_for_section(section); } };
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
components/core/src/clp_s/ArchiveReaderAdaptor.cpp
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-focal-static-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-focal-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: build-macos (macos-13, true)
🔇 Additional comments (1)
components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1)
49-50
: Resolve the FIXME and simplify deserialization.The current implementation manually parses the MessagePack object due to the commented-out line. Addressing the FIXME would simplify the code and improve maintainability.
@@ -119,7 +119,7 @@ ErrorCode ArchiveReaderAdaptor::load_archive_metadata() { | |||
} | |||
|
|||
ErrorCode ArchiveReaderAdaptor::try_read_header(clp::ReaderInterface& reader) { | |||
auto const clp_rc = m_reader->try_read_exact_length( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems that we don't need to pass reader
in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Passing reader to make it more generic and follow the convention of the other helpers that read some part of the metadata. I prefer doing it this way because I don't want to embed the assumption that the MFA flow initializes m_reader in a particular way.
I was erroneously using m_reader in this function before the most recent commit.
Description
This PR adds support for reading and searching clp-s single file archives. The general approach is to introduce a new "adaptor" class that allows most of the code to be oblivious as to whether the underlying archive is a single or multi-file archive.
This class is called "ArchiveReaderAdaptor" and makes up the bulk of this change.
Other notable changes include:
Benchmarks
Some benchmarking was performed to get an idea of compression, decompression, and search performance with single file archives. All of the following benchmarking results are from averaging over 3-5 tests on the local filesystem clearing the cache between runs.
Compression is slightly slower (~1.2%) when compressing single-file archives (compared to multi-file archives).
Surprisingly decompression is also slightly slower (~1%). This should be acceptable for now though, and we can work on optimizing bottlenecks later.
Likewise search seems to be slower (~3.7%). Again this is worth optimizing for later, but should be sufficient for now.
Validation performed
Summary by CodeRabbit
Release Notes
New Features
ArchiveReaderAdaptor
to enhance archive reading capabilities.Improvements
Technical Updates
These changes improve the overall robustness and flexibility of the archive reading and processing infrastructure.