Fixing snapshot creation for earlier versions than the latest checkpoint #322

Open: 16 commits to be merged into base: main

Commits on Sep 5, 2024

  1. Add test for snapshot creation at version before latest checkpoint

    This commit introduces a new unit test to verify that the Delta table
    implementation can correctly build a snapshot at a version that is
    earlier than the latest checkpoint. Specifically, it:
    
    - Tests snapshot creation at version 10 when later checkpoints exist
    - Adds a Delta dataset with multiple checkpoints as test data
    hackintoshrao committed Sep 5, 2024
    Commit 016dfc2

Commits on Sep 6, 2024

  1. Add temporary fix for checkpoint metadata field naming inconsistency

    - Added serde alias for 'size_in_bytes' field in CheckpointMetadata struct
    - This allows deserialization of both camelCase and snake_case variants
    - Addresses issue with inconsistent field naming in _last_checkpoint file
    
    This is a temporary workaround for the issue described in delta-incubator#326. The long-term
    solution will involve aligning the checkpoint writing logic with the Delta
    protocol specification to use camelCase field names consistently.
    
    See delta-incubator#326 for full details.
    hackintoshrao committed Sep 6, 2024
    Commit 5270ea7
  2. Add commit files after last checkpoint in multiple-checkpoint test data

    - Added commit files for versions 25, 26, 27, and 28 to the multiple-checkpoint test dataset
    - Last checkpoint remains at version 24
    - Purpose: Enable testing of snapshot creation for versions between the last checkpoint and the latest commit
    
    This change allows us to test:
    1. Snapshot requests for a version after the last checkpoint
    2. Version selection when commits exist beyond the last checkpoint
    3. File listing and filtering for versions between checkpoints and the latest commit
    
    These additions will help ensure the snapshot creation logic correctly handles
    various version scenarios, particularly focusing on the interaction between
    checkpoints and subsequent commits.
    hackintoshrao committed Sep 6, 2024
    Commit 4f6b1d6
  3. Add test for snapshot creation with version after last checkpoint

    This commit introduces a new unit test 'test_snapshot_with_version_after_last_checkpoint'
    to verify correct snapshot behavior when requesting a version that is after the last
    checkpoint but not the latest commit.
    
    Test data state:
    - Located in ./tests/data/multiple-checkpoint/
    - Contains commits up to version 28
    - Last checkpoint is at version 24
    - Requested snapshot version is 26
    
    The test ensures:
    1. Snapshot creation succeeds for version 26
    2. Correct commit files are included (versions 25 and 26)
    3. Older commits are excluded (version 24 and earlier)
    4. Newer commits are excluded (versions 27 and 28)
    5. The correct checkpoint file (version 24) is used
    6. The effective version of the snapshot is set correctly
    
    This test improves coverage of the snapshot creation logic, particularly for cases
    where the requested version falls between the last checkpoint and the latest commit.
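
The selection rule this test exercises can be sketched as below. This is a minimal illustration, not the code from the PR: `select_commit_versions` is a hypothetical helper, and plain `u64` values stand in for the log file metadata the real implementation works with.

```rust
/// Hypothetical helper: pick the commit versions a snapshot should replay.
/// `commits` holds every commit version found in the log directory,
/// `checkpoint` is the version of the checkpoint in use (if any), and
/// `requested` is the snapshot version asked for (`None` means latest).
fn select_commit_versions(
    commits: &[u64],
    checkpoint: Option<u64>,
    requested: Option<u64>,
) -> Vec<u64> {
    let mut selected: Vec<u64> = commits
        .iter()
        .copied()
        // Commits at or before the checkpoint are excluded (test point 3).
        .filter(|v| checkpoint.map_or(true, |cp| *v > cp))
        // Commits newer than the requested version are excluded (test point 4).
        .filter(|v| requested.map_or(true, |req| *v <= req))
        .collect();
    // Newest first; an empty result means the checkpoint alone is used.
    selected.sort_unstable_by(|a, b| b.cmp(a));
    selected
}
```

With commits 0 through 28, a checkpoint at 24, and a requested version of 26, the helper returns versions 26 and 25, matching the expectations listed in the commit message.
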
    hackintoshrao committed Sep 6, 2024
    Commit 1a8f8fe
  4. Refine snapshot creation logic using last checkpoint

    This commit updates the snapshot creation process to more efficiently
    utilize the last checkpoint information. Key changes include:
    
    1. Streamlined logic for determining which log files to list based on
       the presence of a checkpoint and the requested version.
    
    2. Use checkpoint data to list files when available, regardless of
       the requested version, allowing for more efficient file retrieval.
    
    3. Fall back to listing all log files when no checkpoint is found.
    
    This approach optimizes file reading operations, particularly for
    tables with long histories, while maintaining correct behavior for
    all version request scenarios. The subsequent filtering of commits
    based on the requested version remains unchanged, ensuring accurate
    snapshot creation.
    hackintoshrao committed Sep 6, 2024
    Commit 576a136
  5. Add test for snapshot creation at latest checkpoint version

    This commit introduces a new unit test:
    'test_snapshot_at_latest_checkpoint_version'. The test verifies that:
    
    1. Snapshot creation succeeds when requesting the exact version of the
       latest checkpoint.
    2. The created snapshot has the correct version.
    3. The appropriate checkpoint file is used.
    4. No commit files after the checkpoint version are included.
    5. The effective version matches the checkpoint version.
    
    This test covers an important edge case in snapshot creation, ensuring
    correct behavior when the requested version aligns exactly with the
    latest checkpoint. It complements existing tests and improves coverage
    of the snapshot creation logic.
    hackintoshrao committed Sep 6, 2024
    Commit 45bd64e
  6. Refactor log file listing to include version filtering

    This commit updates the `list_log_files_with_checkpoint` function to
    incorporate version filtering, previously handled in `try_new`. Changes include:
    
    1. Add `requested_version: Option<Version>` parameter to
       `list_log_files_with_checkpoint`.
    2. Implement version filtering logic within the commit file selection process.
    3. Remove redundant version filtering from `try_new`.
    hackintoshrao committed Sep 6, 2024
    Commit 8d2754c

Commits on Sep 7, 2024

  1. Refactor and improve Delta log file listing logic

    - Merged list_log_files and list_log_files_with_checkpoint into a single function
    - Enhanced file filtering to correctly handle checkpoint boundaries
    - Updated test cases to cover all scenarios, including:
      * Initial commits without checkpoints
      * Checkpoint versions
      * Versions between checkpoints
      * Accumulating commits after checkpoints
    - Added detailed comments explaining each test case
    - Improved handling of requested versions at or near checkpoint versions
    - Optimized file sorting and filtering for better performance
    
    This refactor simplifies the codebase, improves test coverage, and ensures
    correct behavior for all Delta log file listing scenarios, particularly
    around checkpoint boundaries.
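
The checkpoint-boundary scenarios listed above come down to one question: which checkpoint, if any, can a given requested version start from? A minimal sketch follows, assuming checkpoints are identified only by their version. `checkpoint_for_version` is a hypothetical name, not a function from this PR.

```rust
/// Hypothetical helper: choose the newest checkpoint that does not exceed
/// the requested version. `None` as the request means "latest version".
/// Returns `None` when no usable checkpoint exists, in which case the
/// snapshot has to be built from commit files alone.
fn checkpoint_for_version(checkpoints: &[u64], requested: Option<u64>) -> Option<u64> {
    checkpoints
        .iter()
        .copied()
        // A checkpoint newer than the requested version cannot be used.
        .filter(|cp| requested.map_or(true, |req| *cp <= req))
        .max()
}
```

For example, with illustrative checkpoints at versions 4, 14, and 24, a request for version 10 would start from checkpoint 4, a request for version 24 from checkpoint 24, and a request for version 3 from no checkpoint at all.
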
    hackintoshrao committed Sep 7, 2024
    Commit 8cb6544
  2. Refactor list_log_files for improved version handling

    - Optimize file selection based on checkpoints and requested versions
    - Ensure correct handling of commit files and checkpoints
    - Improve efficiency by leveraging most recent checkpoints
    - Add logic to handle cases before and after checkpoints
    hackintoshrao committed Sep 7, 2024
    Commit f0ae2b9
  3. Commit c67af78

Commits on Sep 10, 2024

  1. Commit 83459ea
  2. Fix documentation for checkpoint version handling

    - Correct explanation for requested version matching checkpoint
    - Clarify that both commit and checkpoint files are included
    - Align comment with existing test cases and implementation
    hackintoshrao committed Sep 10, 2024
    Commit dda9911

Commits on Sep 17, 2024

  1. Implement DeltaLogGroupingIterator for efficient log file processing

    This commit introduces the DeltaLogGroupingIterator, a crucial component for
    processing Delta Lake log files. The iterator groups log files into checkpoint
    nodes, handling various scenarios including single-part checkpoints, multi-part
    checkpoints, and commits without checkpoints.
    
    Key features and improvements:
    
    1. Efficient sorting and processing of log files:
       - Files are sorted by version and type (checkpoints before commits)
       - Handles version gaps and ensures proper sequencing of files
    
    2. Flexible checkpoint handling:
       - Supports both single-part and multi-part checkpoints
       - Correctly groups multi-part checkpoint files
       - Detects and reports incomplete multi-part checkpoints
    
    3. Robust error handling:
       - Detects and reports version gaps in the log
       - Ensures the log starts from version 0 when required
       - Reports incomplete multi-part checkpoints
    
    4. Memory-efficient linked list structure:
       - Uses Rc<RefCell<>> for shared ownership and interior mutability
       - Allows for easy traversal of the log structure
    
    5. Iterator implementation:
       - Provides a standard Rust iterator interface for easy consumption of log data
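
The ordering and gap-detection rules in points 1 and 3 can be illustrated with a small sketch. The types and function below (`FileKind`, `LogFile`, `sort_and_validate`) are hypothetical stand-ins, not the iterator from this commit, and the sketch assumes gaps are checked on commit files only.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum FileKind {
    // Declared first so a checkpoint sorts before a commit of the same version.
    Checkpoint,
    Commit,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct LogFile {
    version: u64,
    kind: FileKind,
}

/// Sort log files by (version, kind) and reject non-contiguous commit versions.
fn sort_and_validate(
    mut files: Vec<LogFile>,
    must_start_at_zero: bool,
) -> Result<Vec<LogFile>, String> {
    // The derived `Ord` compares `version` first, then `kind`.
    files.sort();
    let commits: Vec<u64> = files
        .iter()
        .filter(|f| f.kind == FileKind::Commit)
        .map(|f| f.version)
        .collect();
    if must_start_at_zero {
        if let Some(&first) = commits.first() {
            if first != 0 {
                return Err(format!("log starts at version {first}, expected 0"));
            }
        }
    }
    // Each commit version must be exactly one more than the previous one.
    for pair in commits.windows(2) {
        if pair[1] != pair[0] + 1 {
            return Err(format!("version gap between {} and {}", pair[0], pair[1]));
        }
    }
    Ok(files)
}
```

Relying on the derived ordering keeps the "checkpoints before commits" rule in the type definition rather than in a hand-written comparator.
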
    hackintoshrao committed Sep 17, 2024
    Commit 96765ba
  2. Add multi-part checkpoint detection and parsing to LogPath

    This commit enhances the LogPath struct with new functionality to handle
    multi-part checkpoint files in Delta Lake log processing. Two new methods
    have been added to improve the identification and parsing of multi-part
    checkpoint files:
    
    1. is_multi_part_checkpoint():
       - Determines if a file is a multi-part checkpoint
       - Handles both single-part and multi-part checkpoint file formats
       - Returns a boolean indicating if the file is a multi-part checkpoint
    
    2. get_checkpoint_part_numbers():
       - Extracts part number and total parts for multi-part checkpoints
       - Returns Option<(u64, u64)> representing (part_number, total_parts)
       - Returns None for single-part checkpoints or non-checkpoint files
    
    Key improvements:
    - Robust parsing of checkpoint filenames
    - Clear distinction between single-part and multi-part checkpoints
    - Efficient extraction of part information from multi-part checkpoints
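
A rough sketch of the parsing described above is shown below. It assumes the Delta protocol naming scheme for multi-part checkpoints, `<version>.checkpoint.<part>.<total>.parquet`, and uses free functions over a bare file name; the methods in this commit live on LogPath and may differ in detail.

```rust
/// Sketch: extract (part_number, total_parts) from a checkpoint file name.
/// Returns `None` for single-part checkpoints and non-checkpoint files.
fn get_checkpoint_part_numbers(filename: &str) -> Option<(u64, u64)> {
    let pieces: Vec<&str> = filename.split('.').collect();
    match pieces.as_slice() {
        // Expected shape: version, "checkpoint", part, total, "parquet".
        [version, "checkpoint", part, total, "parquet"] => {
            // The version must be numeric, but its value is not needed here.
            version.parse::<u64>().ok()?;
            let part = part.parse::<u64>().ok()?;
            let total = total.parse::<u64>().ok()?;
            // Part numbers are 1-based and cannot exceed the total.
            (1..=total).contains(&part).then_some((part, total))
        }
        _ => None,
    }
}

/// Sketch: a file is a multi-part checkpoint if part numbers can be extracted.
fn is_multi_part_checkpoint(filename: &str) -> bool {
    get_checkpoint_part_numbers(filename).is_some()
}
```

Under these assumptions, `00000000000000000010.checkpoint.0000000001.0000000003.parquet` yields part 1 of 3, while `00000000000000000010.checkpoint.parquet` is treated as a single-part checkpoint and yields `None`.
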
    hackintoshrao committed Sep 17, 2024
    Commit c820895
  3. Add InvalidDeltaLog error variant

    - Introduce new Error variant for invalid Delta Log structures
    - Improve error reporting for log processing issues
    - Support recent changes in DeltaLogGroupingIterator and LogPath
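
As a rough illustration of how such a variant might be used, the sketch below defines a standalone error type with only the standard library. The variant name `InvalidDeltaLog` comes from the commit; its `String` payload, the message format, and the `check_parts_complete` helper are assumptions made for this example.

```rust
use std::fmt;

#[derive(Debug)]
enum Error {
    /// The log directory violates a structural expectation (assumed payload).
    InvalidDeltaLog(String),
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Error::InvalidDeltaLog(msg) => write!(f, "Invalid Delta log: {msg}"),
        }
    }
}

impl std::error::Error for Error {}

/// Hypothetical check: report an incomplete multi-part checkpoint.
fn check_parts_complete(version: u64, found: u64, expected: u64) -> Result<(), Error> {
    if found == expected {
        Ok(())
    } else {
        Err(Error::InvalidDeltaLog(format!(
            "checkpoint at version {version} has {found} of {expected} parts"
        )))
    }
}
```

A dedicated variant lets callers distinguish a structurally broken log from I/O or parsing failures without inspecting message strings.
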
    hackintoshrao committed Sep 17, 2024
    Commit a940266
  4. Refactor list_log_files function using DeltaLogGroupingIterator

    - Replace manual file processing with DeltaLogGroupingIterator
    - Improve handling of multi-part checkpoints and version requests
    - Enhance error handling for invalid Delta log structures
    - Optimize file filtering and sorting for different scenarios
    - Update comments to explain complex logic and edge cases
    - Maintain backwards compatibility with existing test cases
    hackintoshrao committed Sep 17, 2024
    Commit 2af625e