Fix an edge case of getting duplicated records when using TextIO. #30026

When processing a CRLF-delimited file, if the read buffer ends with CR, startOfNextRecord is set to the position right after the CR, which is the following LF. Call the position of this LF p. In the next buffer, although the record actually starts at p+1, startOfRecord is set to startOfNextRecord, which is p. The code then skips the LF and yields a record whose content starts at p+1.

Whether a record is valid is decided by checking if startOfRecord falls within the range defined in RangeTracker. If there is a split right after p, i.e. we have ranges [a, p+1) and [p+1, b), the record above is considered valid in split [a, p+1), because its startOfRecord is p < p+1. However, the record is also considered valid when split [p+1, b) is processed, resulting in duplicated records in the output.
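The double claim can be illustrated with a small standalone model. This is only a sketch, not the TextIO code: `CrlfSplitSketch`, `inRange`, and `emitCount` are hypothetical names, the half-open range check stands in for the RangeTracker decision, and the reader of the second split is assumed to locate the record at its true offset p+1.

```java
class CrlfSplitSketch {

    // Half-open range check [start, end), standing in for the RangeTracker decision.
    static boolean inRange(long start, long end, long offset) {
        return start <= offset && offset < end;
    }

    // Counts how many of the two splits [a, p+1) and [p+1, b) emit the record
    // whose content truly starts at p+1. startSeenByFirstSplit is the value of
    // startOfRecord that the reader of the first split reports for that record.
    static int emitCount(long a, long p, long b, long startSeenByFirstSplit) {
        int count = 0;
        if (inRange(a, p + 1, startSeenByFirstSplit)) {
            count++; // first split accepts the record
        }
        // Assumption: the second split sees the record at its true start p+1.
        if (inRange(p + 1, b, p + 1)) {
            count++; // second split accepts the record
        }
        return count;
    }

    public static void main(String[] args) {
        // Example file "ab\r\ncd\r\n": the LF of the first delimiter is at p = 3,
        // so the record "cd" truly starts at offset 4.
        long a = 0, p = 3, b = 8;
        System.out.println("startOfRecord = p:   emitted " + emitCount(a, p, b, p) + " time(s)");
        System.out.println("startOfRecord = p+1: emitted " + emitCount(a, p, b, p + 1) + " time(s)");
    }
}
```

With startOfRecord reported as p, both splits accept the record; once the reported start is p+1, only the second split does.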

Commits on Jan 17, 2024