Commit
Fix an edge case of getting duplicated records when using TextIO. (#3…
…0026)

When processing a CRLF-delimited file, if the read buffer ends with CR, startOfNextRecord is set to the position right after the CR, which is the position of the following LF. Call that position p.

When the next buffer is read, the actual start of the record is p+1, but startOfRecord is set to startOfNextRecord, which is p. The code then skips the LF and yields a record whose content starts at p+1.

A record is considered valid if its startOfRecord falls in the range defined by the RangeTracker. If there is a split right after p, i.e. the ranges are [a, p+1) and [p+1, b), the record above is considered valid in split [a, p+1) because its startOfRecord is p < p+1. It is also considered valid when split [p+1, b) is processed, resulting in duplicated records in the output.
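The scenario can be reproduced with a small toy model. This is only a sketch of the offset bookkeeping described in the commit message, not the actual TextIO code: the helper names (`records`, `read_split`), the `buggy` flag, and the fixed-size buffer aligned to the split start are all assumptions made for illustration.

```python
CRLF = b"\r\n"


def records(data):
    """Yield (true_start, text) for every CRLF-delimited record."""
    pos = 0
    while pos < len(data):
        end = data.find(CRLF, pos)
        if end < 0:
            end = len(data)
        yield pos, data[pos:end].decode()
        pos = end + len(CRLF)


def read_split(data, start, end, buf_size, buggy):
    """Hypothetical reader for the half-open split [start, end).

    Buffers of buf_size bytes are assumed to be aligned to `start`.
    With buggy=True, a record whose preceding CR is the last byte of
    a buffer reports its start as the LF position p instead of p+1.
    """
    out = []
    for true_start, text in records(data):
        if true_start < start:
            continue  # record belongs to an earlier split
        cr = true_start - 2  # position of the CR before this record
        cr_ends_buffer = cr >= start and (cr - start) % buf_size == buf_size - 1
        reported = true_start - 1 if (buggy and cr_ends_buffer) else true_start
        if reported >= end:
            break  # the range check rejects the record
        out.append(text)
    return out


data = b"aa\r\nbb\r\ncc\r\n"  # records start at offsets 0, 4, 8
# The first 3-byte buffer is b"aa\r", so p = 3 and the split point is p+1 = 4.
buggy = read_split(data, 0, 4, 3, True) + read_split(data, 4, 12, 3, True)
fixed = read_split(data, 0, 4, 3, False) + read_split(data, 4, 12, 3, False)
print(buggy)  # ['aa', 'bb', 'bb', 'cc']
print(fixed)  # ['aa', 'bb', 'cc']
```

In the buggy run, the split [0, 4) claims "bb" because its reported start is 3, and the split [4, 12) claims it again at its true start 4. Reporting p+1 in both cases removes the duplicate.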
Showing 2 changed files with 82 additions and 36 deletions.