Fix an edge case of getting duplicated records when using TextIO. #30026

Merged 1 commit on Jan 17, 2024

Commits on Jan 17, 2024

  1. Fix an edge case of getting duplicated records when using TextIO.

    When processing a CRLF-delimited file and the read buffer has
    CR as the last character, startOfNextRecord will be set to the
    position after the CR, i.e. the following LF. Let's say the
    position of this LF is p.
    
    In the next buffer, even though the actual start of record should be
    p+1, startOfRecord is set to startOfNextRecord, which is p.
    
    Then the code processes the next record by skipping the LF and yields
    a record starting from p+1. It decides whether the record is valid by
    checking if startOfRecord is in the range defined in RangeTracker.
    
    If there is a split right after p, i.e. we have ranges [a, p+1) and [p+1, b),
    then the above record is considered valid in the split [a, p+1),
    because its startOfRecord is p < p+1. However, the record is also
    considered valid when split [p+1, b) is processed, because it actually
    starts at p+1, resulting in duplicated records in the output.
    shunping committed Jan 17, 2024
    2f226f0
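
The overlap described in the commit message can be illustrated with a small standalone sketch. This is not Beam's TextIO code: `in_range` is a hypothetical stand-in for the RangeTracker check, the sample bytes are made up, and the sketch assumes the reader of the second split locates the record at its true start, p+1.

```python
def in_range(start_of_record, split):
    """Half-open containment test, standing in for the RangeTracker check."""
    lo, hi = split
    return lo <= start_of_record < hi


# Hypothetical CRLF-delimited input: "ab", CR, LF, "cd", CR, LF.
data = b"ab\r\ncd\r\n"

# p is the offset of the LF that follows a CR ending the read buffer.
p = data.index(b"\n")

# A split right after p yields the ranges [a, p+1) and [p+1, b).
left = (0, p + 1)
right = (p + 1, len(data))

stale_start = p      # startOfRecord copied from startOfNextRecord
true_start = p + 1   # where the record "cd" actually begins

# The first split accepts the record because p < p+1 ...
claimed_by_left = in_range(stale_start, left)
# ... and the second split accepts it too, since it really starts at p+1.
claimed_by_right = in_range(true_start, right)

# If the start were reported as p+1, only the second split would accept it.
left_with_true_start = in_range(true_start, left)

print(claimed_by_left, claimed_by_right, left_with_true_start)
```

Both splits accept the same record only when the stale offset p is used for the first one; with p+1 the half-open ranges partition the records cleanly.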