Recovering from CSV read errors leads to infinite loops when caused by IO errors #207
Hi again.
I don't really understand what needs to be fixed in the csv crate here. You're getting an error and then explicitly continuing. If you don't want to continue, then stop the loop when an error occurs.
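The "stop the loop when an error occurs" approach can be sketched generically. This is a hypothetical helper added for illustration, not part of the csv crate; it works for any iterator of results, such as the one `deserialize` returns:

```rust
// Collect records, but propagate the first error instead of continuing.
// `first_error_stops` is a hypothetical name, not a csv crate API.
fn first_error_stops<T, E>(
    results: impl IntoIterator<Item = Result<T, E>>,
) -> Result<Vec<T>, E> {
    let mut out = Vec::new();
    for result in results {
        // `?` returns on the first error, ending the loop immediately.
        out.push(result?);
    }
    Ok(out)
}

fn main() {
    let ok: Result<Vec<i32>, &str> = first_error_stops(vec![Ok(1), Ok(2)]);
    assert_eq!(ok, Ok(vec![1, 2]));

    let bad: Result<Vec<i32>, &str> =
        first_error_stops(vec![Ok(1), Err("broken record"), Ok(2)]);
    assert_eq!(bad, Err("broken record"));
}
```

The same behavior is also what `results.into_iter().collect::<Result<Vec<_>, _>>()` gives you; the explicit loop just makes the early return visible.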
We want to keep parsing the other lines when there's a broken record in the actual CSV, as it's very common for the CSVs we handle to be broken in some way (e.g. on a single line: not all fields set, broken escapes leading to too many fields, failure to deserialize because of wrong content...). So we need to iterate over all the records, handling the lines that are not broken and logging the errors. That is what this example code is meant to do (though it is a much reduced example: in practice it's actually hundreds of lines of code, generic over iterators of results, with multiple conversion steps). The issue is that when the underlying reader breaks, every subsequent read fails with the same I/O error, so the "log and continue" loop never terminates.

I'm thinking about e.g. using some kind of wrapper around the reader, like this, in order to avoid the issue:

```rust
use std::io::{self, Read};

/// Wrapper around a reader to simulate an "end of file" after we hit an error.
///
/// This is so that users of the reader who recover from some errors
/// (e.g. users of the csv reader) don't infinite-loop.
/// "End of file" in Rust is represented by `read` returning `Ok(0)`.
pub struct PretendEofAfterError<R: Read> {
    reader: R,
    has_errored: bool,
}

impl<R: Read> PretendEofAfterError<R> {
    pub fn new(reader: R) -> Self {
        Self {
            reader,
            has_errored: false,
        }
    }
}

impl<R: Read> Read for PretendEofAfterError<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        if self.has_errored {
            Ok(0)
        } else {
            let res = self.reader.read(buf);
            if res.is_err() {
                self.has_errored = true;
            }
            res
        }
    }
}
```

(This is the workaround I implemented after opening this issue, but forgetting to use it in one place is what eventually cost us thousands of euros.)
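A minimal, self-contained check of the wrapper's behavior. The failing reader and the `main` below are hypothetical, added for illustration; the wrapper definition is repeated so the snippet compiles on its own:

```rust
use std::io::{self, Read};

// Copy of the `PretendEofAfterError` wrapper from above, repeated here
// so this example is self-contained.
pub struct PretendEofAfterError<R: Read> {
    reader: R,
    has_errored: bool,
}

impl<R: Read> PretendEofAfterError<R> {
    pub fn new(reader: R) -> Self {
        Self { reader, has_errored: false }
    }
}

impl<R: Read> Read for PretendEofAfterError<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        if self.has_errored {
            Ok(0)
        } else {
            let res = self.reader.read(buf);
            if res.is_err() {
                self.has_errored = true;
            }
            res
        }
    }
}

/// A reader that fails on every call, simulating a permanently broken source.
struct AlwaysFails;

impl Read for AlwaysFails {
    fn read(&mut self, _buf: &mut [u8]) -> io::Result<usize> {
        Err(io::Error::new(io::ErrorKind::Other, "transport broken"))
    }
}

fn main() {
    let mut wrapped = PretendEofAfterError::new(AlwaysFails);
    let mut buf = [0u8; 16];

    // The first read surfaces the underlying error...
    assert!(wrapped.read(&mut buf).is_err());
    // ...but every subsequent read reports EOF instead of erroring again,
    // so a "skip bad records and continue" loop terminates.
    assert_eq!(wrapped.read(&mut buf).unwrap(), 0);
    assert_eq!(wrapped.read(&mut buf).unwrap(), 0);
}
```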
Thanks for elaborating. That's a much better bug report.

There is some subtlety to this though. It sounds like the iterator's behavior needs to depend on the kind of error that occurred. If an I/O error occurs, then yeah, probably the iterator should not yield anything else. But if a deserialization error occurs, then the CSV reader should advance to the next record. The latter behavior exists today:

```rust
#[derive(Debug, serde::Deserialize)]
struct Record {
    field1: String,
    field2: i32,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut rdr = csv::Reader::from_path("weird.csv")?;
    for result in rdr.deserialize::<Record>() {
        match result {
            Ok(record) => println!("OK: {:?}", record),
            Err(err) => println!("ERROR: {:?}", err),
        }
    }
    Ok(())
}
```

```toml
[package]
name = "csv-207"
version = "0.1.0"
authors = ["Andrew Gallant <[email protected]>"]
edition = "2018"

[dependencies]
csv = "1"
serde = { version = "1", features = ["derive"] }
```

I guess this is what you're saying: the "continuing" behavior on errors is correct for some kinds of errors but not for others. That distinction wasn't obvious to me. But yeah, I agree with you: if an iterator hits an I/O error, then the iterator should be marked as exhausted. And this should be clearly documented. It shouldn't be too hard to implement, but it would probably be tedious. I don't know when I'll get around to it, so if you wanted to submit a PR, that would be great.
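The agreed-upon fix can be sketched generically. This is not the csv crate's internals; `CsvError` and `RecordIter` are hypothetical names illustrating the pattern: an iterator that continues past deserialization errors but marks itself exhausted after an I/O error:

```rust
// Hypothetical error type distinguishing the two kinds of failure
// discussed above.
#[derive(Debug)]
enum CsvError {
    Io(std::io::Error),
    Deserialize(String),
}

// Hypothetical record iterator that "fuses" itself after an I/O error.
struct RecordIter<I> {
    inner: I,
    done: bool,
}

impl<I, T> Iterator for RecordIter<I>
where
    I: Iterator<Item = Result<T, CsvError>>,
{
    type Item = Result<T, CsvError>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.done {
            return None;
        }
        let item = self.inner.next();
        // Only an I/O error marks the iterator exhausted; deserialization
        // errors are recoverable, so iteration continues past them.
        if let Some(Err(CsvError::Io(_))) = &item {
            self.done = true;
        }
        item
    }
}

fn main() {
    // Simulate: good record, bad field, good record, then a dead reader.
    let source: Vec<Result<i32, CsvError>> = vec![
        Ok(1),
        Err(CsvError::Deserialize("bad field".into())),
        Ok(2),
        Err(CsvError::Io(std::io::Error::new(
            std::io::ErrorKind::Other,
            "disk gone",
        ))),
        Ok(3), // never reached: the iterator is exhausted after the I/O error
    ];
    let mut iter = RecordIter { inner: source.into_iter(), done: false };

    assert!(matches!(iter.next(), Some(Ok(1))));
    assert!(matches!(iter.next(), Some(Err(CsvError::Deserialize(_)))));
    assert!(matches!(iter.next(), Some(Ok(2))));
    assert!(matches!(iter.next(), Some(Err(CsvError::Io(_)))));
    assert!(iter.next().is_none());
}
```

With this shape, a "log errors and continue" loop over the iterator still skips broken records but terminates once the underlying reader dies.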
Resolves BurntSushi#207 When reading and parsing a record fails, it's correct behaviour in most cases to keep on trying to read and parse the next record. However when the Error is caused by failing to read from the underlying `impl Read`, the next read would almost always fail with the exact same error. In that scenario, reading from the underlying `impl Read` should not be attempted when trying to deserialize the next record, as this may lead to infinite loops. Instead, the `Reader` will behave in the same way as if an end-of-file had been reached.
This commit changes the behavior of the CSV reader (and its iterators) to stop when an I/O error from the underlying reader is encountered. When reading and parsing a record fails, it is correct behavior in most cases to keep on trying to read and parse the next record; that behavior remains the same after this commit. However, when the error is caused by a failure to read from the underlying reader, the next read would almost always fail with the exact same error. In that scenario, reading from the underlying reader should not be attempted when trying to extract the next record, as this may lead to infinite loops. Instead, the reader will behave in the same way as if an end-of-file had been reached. Fixes #207
What version of the csv crate are you using?
1.1.3
Briefly describe the question, bug or feature request.
Recovering from CSV errors leads to infinite loops when caused by IO errors
Include a complete program demonstrating a problem.
What is the observed behavior of the code above?
Infinite loop of "This record is broken :(".
What is the expected or desired behavior of the code above?
The program exits.
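The original reproduction program is not included above. A hypothetical sketch of the failure mode it describes, using only the standard library (the I/O failure is simulated by a custom reader, and the loop is bounded so the demo terminates instead of looping forever as in the real bug):

```rust
use std::io::{self, BufRead, BufReader, Read};

/// Hypothetical stand-in for the broken data source: yields one good
/// line, then fails forever (e.g. a dying network connection).
struct FailsAfterOneLine {
    sent: bool,
}

impl Read for FailsAfterOneLine {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        if self.sent {
            Err(io::Error::new(io::ErrorKind::Other, "connection lost"))
        } else {
            self.sent = true;
            let line = b"a,b,c\n";
            buf[..line.len()].copy_from_slice(line);
            Ok(line.len())
        }
    }
}

fn main() {
    let mut lines = BufReader::new(FailsAfterOneLine { sent: false }).lines();
    let mut errors = 0;
    // "Log and continue" loop, bounded here so the demo terminates.
    // Without the bound it would print the error message forever,
    // because every retry hits the same underlying I/O error.
    loop {
        match lines.next() {
            Some(Ok(line)) => println!("OK: {}", line),
            Some(Err(_)) => {
                println!("This record is broken :(");
                errors += 1;
                if errors >= 3 {
                    break;
                }
            }
            None => break,
        }
    }
    assert_eq!(errors, 3); // the same error was observed again and again
}
```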