Recovering from CSV read errors leads to infinite loops when caused by IO errors #207
Hi again.
I don't really understand what needs to be fixed in the csv crate here. You're getting an error and then explicitly continuing. If you don't want to continue, then stop the loop when an error occurs.
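The "stop the loop when an error occurs" approach can be sketched generically. This is a hypothetical helper added for illustration, not part of the csv crate; it works for any iterator of results, such as the one `deserialize` returns:

```rust
// Collect records, but propagate the first error instead of continuing.
// `first_error_stops` is a hypothetical name, not a csv crate API.
fn first_error_stops<T, E>(
    results: impl IntoIterator<Item = Result<T, E>>,
) -> Result<Vec<T>, E> {
    let mut out = Vec::new();
    for result in results {
        // `?` returns on the first error, ending the loop immediately.
        out.push(result?);
    }
    Ok(out)
}

fn main() {
    let ok: Result<Vec<i32>, &str> = first_error_stops(vec![Ok(1), Ok(2)]);
    assert_eq!(ok, Ok(vec![1, 2]));

    let bad: Result<Vec<i32>, &str> =
        first_error_stops(vec![Ok(1), Err("broken record"), Ok(2)]);
    assert_eq!(bad, Err("broken record"));
}
```

The same behavior is also what `results.into_iter().collect::<Result<Vec<_>, _>>()` gives you; the explicit loop just makes the early return visible.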
We want to keep parsing the other lines when there's a broken record in the actual CSV, as it's very common for the CSVs we handle to be broken in some way (e.g. on a single line: not all fields set, broken escapes leading to too many fields, failure to deserialize because of wrong content...). So we need to iterate over all the records, handling the lines that are not broken and logging the errors. That is what this example code is meant to do (though it is a much reduced example: in practice it's actually hundreds of lines of code, generic over iterators of results, with multiple conversion steps). The issue is that when the underlying reader breaks, every subsequent read fails with the same I/O error, so the "log and continue" loop never terminates.

I'm thinking about e.g. using some kind of wrapper around the reader, like this, in order to avoid the issue:

```rust
use std::io::{self, Read};

/// Wrapper around a reader to simulate an "end of file" after we hit an error.
///
/// This is so that users of the reader who recover from some errors
/// (e.g. users of the csv reader) don't infinite-loop.
/// "End of file" in Rust is represented by `read` returning `Ok(0)`.
pub struct PretendEofAfterError<R: Read> {
    reader: R,
    has_errored: bool,
}

impl<R: Read> PretendEofAfterError<R> {
    pub fn new(reader: R) -> Self {
        Self {
            reader,
            has_errored: false,
        }
    }
}

impl<R: Read> Read for PretendEofAfterError<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        if self.has_errored {
            Ok(0)
        } else {
            let res = self.reader.read(buf);
            if res.is_err() {
                self.has_errored = true;
            }
            res
        }
    }
}
```

(This is the workaround I implemented after opening this issue, but forgetting to use it in one place is what eventually cost us thousands of euros.)
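A minimal, self-contained check of the wrapper's behavior. The failing reader and the `main` below are hypothetical, added for illustration; the wrapper definition is repeated so the snippet compiles on its own:

```rust
use std::io::{self, Read};

// Copy of the `PretendEofAfterError` wrapper from above, repeated here
// so this example is self-contained.
pub struct PretendEofAfterError<R: Read> {
    reader: R,
    has_errored: bool,
}

impl<R: Read> PretendEofAfterError<R> {
    pub fn new(reader: R) -> Self {
        Self { reader, has_errored: false }
    }
}

impl<R: Read> Read for PretendEofAfterError<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        if self.has_errored {
            Ok(0)
        } else {
            let res = self.reader.read(buf);
            if res.is_err() {
                self.has_errored = true;
            }
            res
        }
    }
}

/// A reader that fails on every call, simulating a permanently broken source.
struct AlwaysFails;

impl Read for AlwaysFails {
    fn read(&mut self, _buf: &mut [u8]) -> io::Result<usize> {
        Err(io::Error::new(io::ErrorKind::Other, "transport broken"))
    }
}

fn main() {
    let mut wrapped = PretendEofAfterError::new(AlwaysFails);
    let mut buf = [0u8; 16];

    // The first read surfaces the underlying error...
    assert!(wrapped.read(&mut buf).is_err());
    // ...but every subsequent read reports EOF instead of erroring again,
    // so a "skip bad records and continue" loop terminates.
    assert_eq!(wrapped.read(&mut buf).unwrap(), 0);
    assert_eq!(wrapped.read(&mut buf).unwrap(), 0);
}
```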
Thanks for elaborating. That's a much better bug report.

There is some subtlety to this though. It sounds like the iterator's behavior needs to depend on the kind of error that occurred. If an I/O error occurs, then yeah, probably the iterator should not yield anything else. But if a deserialization error occurs, then the CSV reader should advance to the next record. The latter behavior exists today:

```rust
#[derive(Debug, serde::Deserialize)]
struct Record {
    field1: String,
    field2: i32,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut rdr = csv::Reader::from_path("weird.csv")?;
    for result in rdr.deserialize::<Record>() {
        match result {
            Ok(record) => println!("OK: {:?}", record),
            Err(err) => println!("ERROR: {:?}", err),
        }
    }
    Ok(())
}
```

```toml
[package]
name = "csv-207"
version = "0.1.0"
authors = ["Andrew Gallant <[email protected]>"]
edition = "2018"

[dependencies]
csv = "1"
serde = { version = "1", features = ["derive"] }
```

I guess this is what you're saying: the "continuing" behavior on errors is correct for some kinds of errors but not for others. That distinction wasn't obvious to me. But yeah, I agree with you: if an iterator hits an I/O error, then the iterator should be marked as exhausted. And this should be clearly documented. It shouldn't be too hard to implement, but it would probably be tedious. I don't know when I'll get around to it, so if you wanted to submit a PR, that would be great.
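The agreed-upon fix can be sketched generically. This is not the csv crate's internals; `CsvError` and `RecordIter` are hypothetical names illustrating the pattern: an iterator that continues past deserialization errors but marks itself exhausted after an I/O error:

```rust
// Hypothetical error type distinguishing the two kinds of failure
// discussed above.
#[derive(Debug)]
enum CsvError {
    Io(std::io::Error),
    Deserialize(String),
}

// Hypothetical record iterator that "fuses" itself after an I/O error.
struct RecordIter<I> {
    inner: I,
    done: bool,
}

impl<I, T> Iterator for RecordIter<I>
where
    I: Iterator<Item = Result<T, CsvError>>,
{
    type Item = Result<T, CsvError>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.done {
            return None;
        }
        let item = self.inner.next();
        // Only an I/O error marks the iterator exhausted; deserialization
        // errors are recoverable, so iteration continues past them.
        if let Some(Err(CsvError::Io(_))) = &item {
            self.done = true;
        }
        item
    }
}

fn main() {
    // Simulate: good record, bad field, good record, then a dead reader.
    let source: Vec<Result<i32, CsvError>> = vec![
        Ok(1),
        Err(CsvError::Deserialize("bad field".into())),
        Ok(2),
        Err(CsvError::Io(std::io::Error::new(
            std::io::ErrorKind::Other,
            "disk gone",
        ))),
        Ok(3), // never reached: the iterator is exhausted after the I/O error
    ];
    let mut iter = RecordIter { inner: source.into_iter(), done: false };

    assert!(matches!(iter.next(), Some(Ok(1))));
    assert!(matches!(iter.next(), Some(Err(CsvError::Deserialize(_)))));
    assert!(matches!(iter.next(), Some(Ok(2))));
    assert!(matches!(iter.next(), Some(Err(CsvError::Io(_)))));
    assert!(iter.next().is_none());
}
```

With this shape, a "log errors and continue" loop over the iterator still skips broken records but terminates once the underlying reader dies.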
Resolves BurntSushi#207 When reading and parsing a record fails, it's correct behaviour in most cases to keep on trying to read and parse the next record. However when the Error is caused by failing to read from the underlying `impl Read`, the next read would almost always fail with the exact same error. In that scenario, reading from the underlying `impl Read` should not be attempted when trying to deserialize the next record, as this may lead to infinite loops. Instead, the `Reader` will behave in the same way as if an end-of-file had been reached.
This commit changes the behavior of the CSV reader (and its iterators) to stop when an I/O error from the underlying reader is encountered. When reading and parsing a record fails, it is correct behavior in most cases to keep on trying to read and parse the next record; that behavior remains the same after this commit. However, when the error is caused by a failure to read from the underlying reader, the next read would almost always fail with the exact same error. In that scenario, reading from the underlying reader should not be attempted when trying to extract the next record, as this may lead to infinite loops. Instead, the reader will behave in the same way as if an end-of-file had been reached. Fixes #207
What version of the csv crate are you using?
1.1.3
Briefly describe the question, bug or feature request.
Recovering from CSV errors leads to infinite loops when caused by IO errors
Include a complete program demonstrating a problem.
What is the observed behavior of the code above?
Infinite loop of "This record is broken :(".
What is the expected or desired behavior of the code above?
The program exits.
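The original reproduction program is not included above. A hypothetical sketch of the failure mode it describes, using only the standard library (the I/O failure is simulated by a custom reader, and the loop is bounded so the demo terminates instead of looping forever as in the real bug):

```rust
use std::io::{self, BufRead, BufReader, Read};

/// Hypothetical stand-in for the broken data source: yields one good
/// line, then fails forever (e.g. a dying network connection).
struct FailsAfterOneLine {
    sent: bool,
}

impl Read for FailsAfterOneLine {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        if self.sent {
            Err(io::Error::new(io::ErrorKind::Other, "connection lost"))
        } else {
            self.sent = true;
            let line = b"a,b,c\n";
            buf[..line.len()].copy_from_slice(line);
            Ok(line.len())
        }
    }
}

fn main() {
    let mut lines = BufReader::new(FailsAfterOneLine { sent: false }).lines();
    let mut errors = 0;
    // "Log and continue" loop, bounded here so the demo terminates.
    // Without the bound it would print the error message forever,
    // because every retry hits the same underlying I/O error.
    loop {
        match lines.next() {
            Some(Ok(line)) => println!("OK: {}", line),
            Some(Err(_)) => {
                println!("This record is broken :(");
                errors += 1;
                if errors >= 3 {
                    break;
                }
            }
            None => break,
        }
    }
    assert_eq!(errors, 3); // the same error was observed again and again
}
```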