Fixing snapshot creation for earlier versions than the latest checkpoint #322
base: main
Commits on Sep 5, 2024
Add test for snapshot creation at version before latest checkpoint
This commit introduces a new unit test to verify that the Delta table implementation can correctly build a snapshot at a version that is earlier than the latest checkpoint. Specifically, it:
- Tests snapshot creation at version 10 when later checkpoints exist
- Adds delta dataset with multiple checkpoints as test data
Commit 016dfc2
Commits on Sep 6, 2024
Add temporary fix for checkpoint metadata field naming inconsistency
- Added serde alias for 'size_in_bytes' field in CheckpointMetadata struct
- This allows deserialization of both camelCase and snake_case variants
- Addresses issue with inconsistent field naming in _last_checkpoint file

This is a temporary workaround for the issue described in delta-incubator#326. The long-term solution will involve aligning the checkpoint writing logic with the Delta protocol specification to use camelCase field names consistently. See delta-incubator#326 for full details.
Commit 5270ea7
Add commit files after last checkpoint in multiple-checkpoint test data
- Added commit files for versions 25, 26, 27, and 28 to the multiple-checkpoint test dataset
- Last checkpoint remains at version 24
- Purpose: Enable testing of snapshot creation for versions between the last checkpoint and the latest commit

This change allows us to test scenarios where:
1. A snapshot is requested for a version after the last checkpoint
2. The behavior of version selection when commits exist beyond the last checkpoint
3. The correct handling of file listing and filtering for versions between checkpoints and the latest commit

These additions will help ensure the snapshot creation logic correctly handles various version scenarios, particularly focusing on the interaction between checkpoints and subsequent commits.
Commit 4f6b1d6
Add test for snapshot creation with version after last checkpoint
This commit introduces a new unit test 'test_snapshot_with_version_after_last_checkpoint' to verify correct snapshot behavior when requesting a version that is after the last checkpoint but not the latest commit.

Test data state:
- Located in ./tests/data/multiple-checkpoint/
- Contains commits up to version 28
- Last checkpoint is at version 24
- Requested snapshot version is 26

The test ensures:
1. Snapshot creation succeeds for version 26
2. Correct commit files are included (versions 25 and 26)
3. Older commits are excluded (version 24 and earlier)
4. Newer commits are excluded (versions 27 and 28)
5. The correct checkpoint file (version 24) is used
6. The effective version of the snapshot is set correctly

This test improves coverage of the snapshot creation logic, particularly for cases where the requested version falls between the last checkpoint and the latest commit.
Commit 1a8f8fe
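The inclusion and exclusion rules that this test checks can be illustrated with a small filter. This is only a sketch under stated assumptions: `commits_to_replay` is a hypothetical helper, not a function from the PR, and versions are modeled as plain `u64` values rather than log file paths.

```rust
/// Hypothetical helper (not part of the PR): given every commit version found
/// in the log, return the ones a snapshot at `requested` would replay on top
/// of a checkpoint at `checkpoint_version`.
fn commits_to_replay(all_commits: &[u64], checkpoint_version: u64, requested: u64) -> Vec<u64> {
    let mut selected: Vec<u64> = all_commits
        .iter()
        .copied()
        // Keep commits strictly after the checkpoint and up to the requested version.
        .filter(|v| *v > checkpoint_version && *v <= requested)
        .collect();
    selected.sort_unstable();
    selected
}
```

With commits 0 through 28, a checkpoint at version 24, and a requested version of 26, the filter keeps versions 25 and 26 and drops everything at or before 24 and everything after 26, which mirrors the expectations listed in the commit message.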
Refine snapshot creation logic using last checkpoint
This commit updates the snapshot creation process to more efficiently utilize the last checkpoint information. Key changes include:
1. Streamlined logic for determining which log files to list based on the presence of a checkpoint and the requested version.
2. Use checkpoint data to list files when available, regardless of the requested version, allowing for more efficient file retrieval.
3. Fall back to listing all log files when no checkpoint is found.

This approach optimizes file reading operations, particularly for tables with long histories, while maintaining correct behavior for all version request scenarios. The subsequent filtering of commits based on the requested version remains unchanged, ensuring accurate snapshot creation.
Commit 576a136
Add test for snapshot creation at latest checkpoint version
This commit introduces a new unit test: 'test_snapshot_at_latest_checkpoint_version'. The test verifies that:
1. Snapshot creation succeeds when requesting the exact version of the latest checkpoint.
2. The created snapshot has the correct version.
3. The appropriate checkpoint file is used.
4. No commit files after the checkpoint version are included.
5. The effective version matches the checkpoint version.

This test covers an important edge case in snapshot creation, ensuring correct behavior when the requested version aligns exactly with the latest checkpoint. It complements existing tests and improves coverage of the snapshot creation logic.
Commit 45bd64e
Refactor log file listing to include version filtering
This commit updates the `list_log_files_with_checkpoint` function to incorporate version filtering, previously handled in `try_new`. Changes include:
1. Add `requested_version: Option<Version>` parameter to `list_log_files_with_checkpoint`.
2. Implement version filtering logic within the commit file selection process.
3. Remove redundant version filtering from `try_new`.
Commit 8d2754c
Commits on Sep 7, 2024
Refactor and improve Delta log file listing logic
- Merged list_log_files and list_log_files_with_checkpoint into a single function
- Enhanced file filtering to correctly handle checkpoint boundaries
- Updated test cases to cover all scenarios, including:
  * Initial commits without checkpoints
  * Checkpoint versions
  * Versions between checkpoints
  * Accumulating commits after checkpoints
- Added detailed comments explaining each test case
- Improved handling of requested versions at or near checkpoint versions
- Optimized file sorting and filtering for better performance

This refactor simplifies the codebase, improves test coverage, and ensures correct behavior for all Delta log file listing scenarios, particularly around checkpoint boundaries.
Commit 8cb6544
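The scenarios listed in this commit (no checkpoint yet, a request exactly at a checkpoint, a request between checkpoints, and commits accumulating after the last checkpoint) can be covered by one selection rule: pick the newest checkpoint at or before the target version, then replay the commits after it. The sketch below is an illustration only. `LogSegmentPlan` and `plan_log_segment` are hypothetical names, versions are plain `u64` values, and the commit file at the checkpoint version itself is left out for simplicity, even though a later commit in this PR (dda9911) notes that the real listing includes both files in that case.

```rust
/// Hypothetical result type (not from the PR): the checkpoint to read, if any,
/// and the commit versions to replay on top of it.
#[derive(Debug, PartialEq)]
struct LogSegmentPlan {
    checkpoint: Option<u64>,
    commits: Vec<u64>,
}

/// Hypothetical planner: choose the newest checkpoint at or before the target
/// version, then select the commits after that checkpoint up to the target.
fn plan_log_segment(checkpoints: &[u64], commits: &[u64], requested: Option<u64>) -> LogSegmentPlan {
    // With no requested version, target the newest commit available.
    let target = match requested.or_else(|| commits.iter().copied().max()) {
        Some(t) => t,
        None => return LogSegmentPlan { checkpoint: None, commits: Vec::new() },
    };
    // Checkpoints newer than the target cannot be used for this snapshot.
    let checkpoint = checkpoints.iter().copied().filter(|c| *c <= target).max();
    let mut selected: Vec<u64> = commits
        .iter()
        .copied()
        // Without a usable checkpoint, replay every commit from the start of the log.
        .filter(|v| *v <= target && checkpoint.map_or(true, |c| *v > c))
        .collect();
    selected.sort_unstable();
    LogSegmentPlan { checkpoint, commits: selected }
}
```

For a log with commits 0 through 28 and a last checkpoint at 24, a request for version 26 yields checkpoint 24 plus commits 25 and 26, while a request for a version older than every checkpoint falls back to replaying commits from version 0.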
Refactor list_log_files for improved version handling
- Optimize file selection based on checkpoints and requested versions
- Ensure correct handling of commit files and checkpoints
- Improve efficiency by leveraging most recent checkpoints
- Add logic to handle cases before and after checkpoints
Commit f0ae2b9
Commit c67af78
Commits on Sep 10, 2024
Commit 83459ea
Fix documentation for checkpoint version handling
- Correct explanation for requested version matching checkpoint
- Clarify that both commit and checkpoint files are included
- Align comment with existing test cases and implementation
Commit dda9911
Commits on Sep 17, 2024
Implement DeltaLogGroupingIterator for efficient log file processing
This commit introduces the DeltaLogGroupingIterator, a crucial component for processing Delta Lake log files. The iterator groups log files into checkpoint nodes, handling various scenarios including single-part checkpoints, multi-part checkpoints, and commits without checkpoints.

Key features and improvements:
1. Efficient sorting and processing of log files:
   - Files are sorted by version and type (checkpoints before commits)
   - Handles version gaps and ensures proper sequencing of files
2. Flexible checkpoint handling:
   - Supports both single-part and multi-part checkpoints
   - Correctly groups multi-part checkpoint files
   - Detects and reports incomplete multi-part checkpoints
3. Robust error handling:
   - Detects and reports version gaps in the log
   - Ensures the log starts from version 0 when required
   - Reports incomplete multi-part checkpoints
4. Memory-efficient linked list structure:
   - Uses Rc<RefCell<>> for shared ownership and interior mutability
   - Allows for easy traversal of the log structure
5. Iterator implementation:
   - Provides a standard Rust iterator interface for easy consumption of log data
Commit 96765ba
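The sorting rule and gap checks described in this commit can be sketched without the linked-list machinery. The code below is an assumption-laden illustration, not the DeltaLogGroupingIterator itself: `FileKind`, `LogFile`, and `sort_and_validate` are hypothetical names, errors are plain strings, and only commit files are checked for contiguity.

```rust
/// Hypothetical file kind. `Checkpoint` is declared first so that the derived
/// ordering places checkpoints before commits at the same version.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum FileKind {
    Checkpoint,
    Commit,
}

/// Hypothetical log entry. The derived `Ord` compares `version` first, then `kind`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct LogFile {
    version: u64,
    kind: FileKind,
}

/// Sorts the files and checks that commit versions are contiguous.
/// When `must_start_at_zero` is set, the first commit must be version 0.
fn sort_and_validate(mut files: Vec<LogFile>, must_start_at_zero: bool) -> Result<Vec<LogFile>, String> {
    files.sort();
    let mut previous: Option<u64> = None;
    for file in files.iter().filter(|f| f.kind == FileKind::Commit) {
        match previous {
            None if must_start_at_zero && file.version != 0 => {
                return Err(format!("log starts at version {} instead of 0", file.version));
            }
            // A repeated version is also rejected here, since it breaks the sequence.
            Some(prev) if file.version != prev + 1 => {
                return Err(format!("version gap between {} and {}", prev, file.version));
            }
            _ => {}
        }
        previous = Some(file.version);
    }
    Ok(files)
}
```

Deriving `Ord` on both types keeps the "version, then checkpoints before commits" rule in one place, at the cost of making the variant declaration order significant.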
Add multi-part checkpoint detection and parsing to LogPath
This commit enhances the LogPath struct with new functionality to handle multi-part checkpoint files in Delta Lake log processing. Two new methods have been added to improve the identification and parsing of multi-part checkpoint files:
1. is_multi_part_checkpoint():
   - Determines if a file is a multi-part checkpoint
   - Handles both single-part and multi-part checkpoint file formats
   - Returns a boolean indicating if the file is a multi-part checkpoint
2. get_checkpoint_part_numbers():
   - Extracts part number and total parts for multi-part checkpoints
   - Returns Option<(u64, u64)> representing (part_number, total_parts)
   - Returns None for single-part checkpoints or non-checkpoint files

Key improvements:
- Robust parsing of checkpoint filenames
- Clear distinction between single-part and multi-part checkpoints
- Efficient extraction of part information from multi-part checkpoints
Commit c820895
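To make the two methods concrete, here is a minimal sketch of the filename parsing they describe. It assumes the naming used by the Delta protocol, where a single-part checkpoint looks like `<version>.checkpoint.parquet` and a multi-part checkpoint looks like `<version>.checkpoint.<part>.<total>.parquet`. The free functions below operate on a bare filename and are hypothetical stand-ins; the PR's actual methods live on LogPath and may differ in signature and validation.

```rust
/// Hypothetical stand-in for `get_checkpoint_part_numbers()`: returns
/// `(part_number, total_parts)` for a multi-part checkpoint filename, or
/// `None` for single-part checkpoints and non-checkpoint files.
fn checkpoint_part_numbers(filename: &str) -> Option<(u64, u64)> {
    let parts: Vec<&str> = filename.split('.').collect();
    // Expected shape: <version>.checkpoint.<part>.<total>.parquet
    match parts.as_slice() {
        [version, "checkpoint", part, total, "parquet"] => {
            // The version must be numeric, although its value is not needed here.
            version.parse::<u64>().ok()?;
            let part = part.parse::<u64>().ok()?;
            let total = total.parse::<u64>().ok()?;
            // Part numbers are 1-based and cannot exceed the total.
            (part >= 1 && part <= total).then_some((part, total))
        }
        _ => None,
    }
}

/// Hypothetical stand-in for `is_multi_part_checkpoint()`.
fn is_multi_part_checkpoint(filename: &str) -> bool {
    checkpoint_part_numbers(filename).is_some()
}
```

Deriving the boolean check from the parser keeps the two functions from disagreeing about what counts as a multi-part checkpoint.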
Add InvalidDeltaLog error variant
- Introduce new Error variant for invalid Delta Log structures
- Improve error reporting for log processing issues
- Supports recent changes in DeltaLogGroupingIterator and LogPath
Commit a940266
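As a rough picture of how such a variant might be used, the sketch below defines a standalone error enum with an `InvalidDeltaLog` variant and one check that produces it. Everything except the variant name is an assumption: the payload type, the message wording, and the `check_multi_part_complete` helper are hypothetical, and the project's real `Error` enum has other variants and may be defined differently.

```rust
use std::fmt;

/// Hypothetical, reduced error enum. Only the variant name comes from the
/// commit message; the `String` payload is an assumption.
#[derive(Debug)]
enum Error {
    /// The log's structure is inconsistent, for example an incomplete
    /// multi-part checkpoint or a version gap.
    InvalidDeltaLog(String),
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Error::InvalidDeltaLog(msg) => write!(f, "Invalid Delta log: {}", msg),
        }
    }
}

impl std::error::Error for Error {}

/// Hypothetical check: a multi-part checkpoint is usable only when every part is present.
fn check_multi_part_complete(version: u64, parts_found: u64, parts_expected: u64) -> Result<(), Error> {
    if parts_found == parts_expected {
        Ok(())
    } else {
        Err(Error::InvalidDeltaLog(format!(
            "checkpoint at version {} has {} of {} parts",
            version, parts_found, parts_expected
        )))
    }
}
```

A dedicated variant lets callers distinguish a structurally broken log from I/O or parsing failures without matching on message text.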
Refactor list_log_files function using DeltaLogGroupingIterator
- Replace manual file processing with DeltaLogGroupingIterator
- Improve handling of multi-part checkpoints and version requests
- Enhance error handling for invalid Delta log structures
- Optimize file filtering and sorting for different scenarios
- Update comments to explain complex logic and edge cases
- Maintain backwards compatibility with existing test cases
Commit 2af625e