Carriage Return 0x0d rewritten in cell values #50
Comments
Elsewhere we've tended to use commons-csv rather than opencsv, partly because it's included in Jena. Certainly support
Legal line endings are normally Though, if you can provide a small test case then I guess we could check commons-csv and the updated opencsv in case they behave differently.
Totally agree. Interestingly (maybe) the
which is common at least with I guess a forensic reading would note that the absence of the word 'only' and the note to implementers do not prohibit the use of other line breaks (specifically a single LF or CR).
While platform line endings are indeed an issue, the crux of this particular issue is that we were surprised that the parser essentially strips any The grammar in RFC4180 allows a quoted cell value to have When digging down, we saw that BufferedReader, to this day, reads lines and gobbles up any I've done some quick checking and it looks as though

    import au.com.bytecode.opencsv.{CSVParser, CSVReader}
    import org.apache.commons.csv.CSVFormat
    import java.io.StringReader
    import scala.jdk.CollectionConverters.IterableHasAsScala

    object CSVParserTest extends Examples with App {
      // commons-csv in RFC4180 mode: assert the embedded CRLF comes back unchanged.
      val acc_records = CSVFormat.RFC4180.parse(new StringReader(eg1)).asScala.toArray
      assert(acc_records(1).get(0) == "line1\r\nline2")

      // opencsv: assert the embedded CRLF comes back as a single LF.
      val oc_reader = new CSVReader(
        new StringReader(eg1),
        CSVParser.DEFAULT_SEPARATOR, CSVParser.DEFAULT_QUOTE_CHARACTER, '\u0000')
      val oc_records = oc_reader.readAll().asScala.toArray
      assert(oc_records(1)(0) == "line1\nline2")
    }

    trait Examples {
      // Header row, then one record whose first cell is quoted and contains a CRLF.
      val eg1: String = "a,b\r\n\"line1\r\nline2\",1"
    }
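For comparison, here is a rough sketch of what reading a quoted cell character by character might look like, so that CR and LF inside the quotes survive untouched (the `escaped` rule in RFC 4180 permits both). `QuotedFieldSketch` and `readQuotedField` are names made up for this illustration; this is not how opencsv or commons-csv are implemented.

```scala
import java.io.StringReader

object QuotedFieldSketch {
  // Hypothetical helper: reads one quoted field from the start of `text`.
  // Every character between the quotes, including CR and LF, is copied
  // unchanged; a doubled quote ("") becomes a single quote.
  def readQuotedField(text: String): String = {
    val in = new StringReader(text)
    require(in.read() == '"', "field must start with a double quote")
    val sb = new StringBuilder
    var done = false
    while (!done) {
      val c = in.read()
      if (c == -1) {
        throw new IllegalArgumentException("unterminated quoted field")
      } else if (c == '"') {
        // A quote is either the first half of an escaped quote or the closing quote.
        if (in.read() == '"') sb.append('"') else done = true
      } else {
        sb.append(c.toChar) // CR and LF fall through here verbatim
      }
    }
    sb.toString
  }
}
```

Because nothing here goes through `readLine`, no line terminator is ever dropped or substituted.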
Well, very few sources of CSV conform to other aspects of RFC4180 (hence all parsers having lots of configuration options to try to cope with the vagaries of CSVs), so why should this be different? :) In particular, all CSV tools will normally cope with, and typically generate, platform-compliant non-quoted line endings not RFC4180 I still think this specific case is simply broken data that should be cleaned by preprocessing rather than by text processing in the dclib rules; that feels like a better separation of concerns. However, if this is causing problems then I'd be OK to switch dclib to commons-csv and test whether using RFC4180 mode would break too many other things or, more likely, make it an option.
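The preprocessing route could be as small as the sketch below, run over the raw file text before it reaches the CSV parser. `LineBreakCleanup` and `collapseRepeatedCr` are hypothetical names, not anything in dclib, and the choice to normalise to CRLF (rather than LF) is an assumption.

```scala
object LineBreakCleanup {
  // Hypothetical helper: collapses any run of CRs followed by an LF
  // (CR CR LF, CR CR CR LF, ...) into a single CRLF.
  // Lone CRs and lone LFs are left untouched.
  def collapseRepeatedCr(text: String): String =
    text.replaceAll("\r+\n", "\r\n")

  def main(args: Array[String]): Unit = {
    val raw = "id,description\r\n1,\"para1\r\r\npara2\"\r\n"
    val cleaned = collapseRepeatedCr(raw)
    println(cleaned.contains("\r\r\n")) // prints false
  }
}
```

This only deals with the CR CR LF pattern reported here; other kinds of broken input would need their own rules.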
We've had a few source CSV files with the character sequence CR CR LF inside cells, used to denote a single line break. The resulting sequence in dclib (i.e. when dealing with values in templates) is LF LF, essentially doubling the line breaks. This becomes a problem when users are trying to separate paragraphs in a description, for instance, where the usual practice is to use a double line break to separate paragraphs, which comes out as four LF characters and often gets turned into two <br />s in the resulting HTML.

@skwlilac and I tracked this down to the CSV parser opencsv (version 2.3), which underneath uses Java's BufferedReader to iterate over lines of text, where lines are delimited by either CR, LF, or CR LF. As far as the CSV parser is concerned, if a line ends in the middle of a quoted cell, then it adds back an LF and continues reading the value.

This dependency comes from lib version 2.0.0, which in turn comes from appbase 2.0.0. opencsv looks to have had a number of forks and owners over the intervening years, but does now say that it deals properly with CR characters in values.

While this character sequence shouldn't be used in the first place, I'd argue that the parser shouldn't be interpreting characters in cell values and should pass things through verbatim for processing within templates.
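The doubling can be reproduced with nothing but `BufferedReader`. The sketch below imitates the behaviour described above: read line by line, then glue the pieces back together with LF. `LineEndingDemo` and `rejoinWithLf` are illustrative names only, and the rejoining step is an assumption about what the parser does, not a copy of the opencsv source.

```scala
import java.io.{BufferedReader, StringReader}

object LineEndingDemo {
  // Hypothetical helper: reads `text` line by line and rejoins the lines
  // with a single LF, as a readLine-based reader would when a quoted cell
  // spans several lines.
  def rejoinWithLf(text: String): String = {
    val reader = new BufferedReader(new StringReader(text))
    // readLine returns null at end of input; CR, LF and CRLF each end a line.
    val lines = Iterator.continually(reader.readLine()).takeWhile(_ != null).toList
    lines.mkString("\n")
  }

  def main(args: Array[String]): Unit = {
    val cell = "para1\r\r\npara2"
    val result = rejoinWithLf(cell)
    // Show the control characters as escapes so the doubled LF is visible.
    println(result.replace("\r", "\\r").replace("\n", "\\n")) // prints para1\n\npara2
  }
}
```

The first CR ends a line on its own, the following CR LF ends an empty line, and so one intended line break becomes two.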