Allow CSV dialect to specify the meaning of blank lines #150

hubgit · 2014-11-12T14:35:28Z

A blank/empty line in a CSV file can have several meanings.

A StackOverflow discussion lists these possibilities for how a CSV parser could handle blank lines:

- Ignore blank lines
- Treat a blank line as a row with zero fields
- Treat a blank line as a row with one empty field
- Treat a blank line as the end of the input file

A CSV parser needs to know which of those meanings applies to a particular CSV file.
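As a rough illustration of the four meanings listed above, here is a minimal Python sketch. The function name `read_rows` and the policy strings are hypothetical labels, not part of any spec or library; the only library behaviour relied on is that Python's `csv.reader` yields an empty list for a blank line.

```python
import csv

def read_rows(lines, blank_lines="ignore"):
    """Yield rows from an iterable of CSV lines, applying one blank-line policy.

    The policy names are hypothetical labels for the four behaviours in the list.
    """
    for row in csv.reader(lines):
        if row:
            # Ordinary row with at least one field.
            yield row
        elif blank_lines == "ignore":
            continue
        elif blank_lines == "zeroFields":
            yield []
        elif blank_lines == "oneEmptyField":
            yield [""]
        elif blank_lines == "endOfInput":
            return
        else:
            raise ValueError(f"unknown blank-line policy: {blank_lines!r}")

sample = ["a,b", "1,2", "", "free text"]
data_only = list(read_rows(sample, blank_lines="endOfInput"))
```

With the `endOfInput` policy the trailing free text never reaches the caller, which is the behaviour the file described below would need.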

To see a CSV file that ends the data with a blank line then continues with free text metadata, run a search in the Kew Herbarium Catalogue, then choose one of the options under "Download specimen records".

rufuspollock · 2015-05-26T12:14:01Z

@hubgit this is a great suggestion. Do you want to make a formal suggestion for a mod to the spec?

rufuspollock · 2015-09-24T15:37:42Z

@hubgit ping re above comment ^^^

pwalsh · 2016-03-07T06:36:40Z

@hubgit ( @rgrp @danfowler )

I'm interested to have this as part of CSV dialect. In GoodTables there are config options for this as part of a validation run over a CSV file. To declare the behaviour as part of the CSV spec is appealing.

rufuspollock · 2016-03-07T08:36:07Z

+1 on this as a nice addition.

I think the list of options is:

- ignore: Ignore blank lines
- endOfInput: Treat a blank line as the end of the input file
- error: A completely blank row signifies an error and should terminate processing?
  - Is this a duplicate of the constraints?
- zeroFields: Treat a blank line as a row with zero fields
  - What does this mean?
- oneEmptyField: Treat a blank line as a row with one empty field
  - What does this mean?

@pwalsh what are the options to GoodTables?

pwalsh · 2016-03-07T08:37:45Z

@rgrp GoodTables can ignore blank rows, and ignore ragged rows (in both cases, instead of raising an exception).

hubgit · 2016-03-07T09:06:55Z

oneEmptyField: Treat a blank line as a row with one empty field

This is what PHP's CSV parser does - a blank line is parsed as [null] (an array with one column containing null)

zeroFields: Treat a blank line as a row with zero fields

This is an alternative to the above, avoiding the null column.

I suppose they're both describing the parser behaviour more than the data, so debatable whether they should be included in a CSV dialect.
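For comparison, Python's standard `csv` module takes the zero-fields route: a blank line comes back as an empty list. A client that wants to treat both representations alike could normalise them with a small helper. This is only a sketch, and `is_blank_row` is a hypothetical name.

```python
import csv

def is_blank_row(row):
    """True for either representation of a blank line: [] or one empty/None field."""
    return len(row) == 0 or (len(row) == 1 and row[0] in ("", None))

# csv.reader yields [] for the blank line in the middle.
rows = list(csv.reader(["a,b", "", "c,d"]))
non_blank = [row for row in rows if not is_blank_row(row)]
```

One caveat: in single-column data a row with one empty field can be a genuine empty value, so this helper is only safe when the table has more than one column.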

hubgit · 2016-03-07T09:08:54Z

Maybe there should also be:

startOfInput: Treat a blank line as the start of the input file (for when the metadata header is at the top of the file). Sometimes this would be covered by a headerRows integer, but if that's not a fixed number then the data might instead be found after one or more blank lines.
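A minimal sketch of that startOfInput idea in Python, assuming the metadata header is separated from the data by one or more blank lines, and assuming that whitespace-only lines count as blank. The function name `rows_after_blank` is hypothetical.

```python
import csv
from itertools import dropwhile

def rows_after_blank(lines):
    """Parse only the CSV data that follows the first run of blank lines."""
    it = iter(lines)
    # Skip the free-text metadata up to and including the first blank line.
    for line in it:
        if line.strip() == "":
            break
    # Skip any further consecutive blank lines before the data starts.
    rest = dropwhile(lambda l: l.strip() == "", it)
    return list(csv.reader(rest))

lines = ["Title: example", "Source: example", "", "", "a,b", "1,2"]
data = rows_after_blank(lines)
```

If the input contains no blank line at all, this sketch returns an empty list, so a caller would need to decide separately whether that case means "no metadata" or "no data".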

rufuspollock · 2016-03-07T11:18:26Z

@hubgit thanks for the clarification. I guess my sense is that "oneEmptyField" and "zeroFields" are a bit odd in that if you have e.g. headers then I'd just say an empty row means that all fields are empty rather than zero or one. I wonder if we could just have: an empty option which means treat the row as valid but empty (and the parser can determine what they mean by "empty").
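One possible reading of that "empty" option, sketched in Python: a blank line becomes a row whose fields are all empty, with the width taken from the header row. This is an interpretation rather than anything defined in the spec, and `rows_with_empty` is a hypothetical name.

```python
import csv

def rows_with_empty(lines):
    """Yield the header, then each row; blank lines become all-empty rows."""
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is None:
        return  # no input at all
    yield header
    for row in reader:
        if not row:
            # Blank line: one empty string per header column.
            yield [""] * len(header)
        else:
            yield row

rows = list(rows_with_empty(["a,b", "", "1,2"]))
```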

rufuspollock · 2016-04-19T10:54:26Z

@hubgit any final thoughts before this goes in (including on my last comment)?

hubgit · 2016-04-19T11:52:49Z

@rgrp I think ignore, empty (default?), end (and possibly start) sound reasonable, though I haven't used them enough in practice to say for sure.

rufuspollock · 2016-05-19T09:24:07Z

I am just recording that I am hesitating a bit on this one. Looking at various parsers these do not seem to be very common options and add a fair amount of complexity to something implementing CSV DDF - it also seems to be extending beyond a pure dialect description to something about how the data is formatted.

My thought is that we might write this up as a pattern rather than something in the primary spec.

hubgit · 2016-05-25T13:06:15Z

That's fair enough - I imagine it could be handled fairly easily by a client, outside of the CSV parser, by simply ignoring everything before or after the first empty row.
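A sketch of that client-side approach in Python, for the case where data comes first and free text follows the first empty row. The name `split_at_first_blank` is hypothetical, and treating whitespace-only lines as empty is an assumption.

```python
import csv
from itertools import takewhile

def split_at_first_blank(lines):
    """Return (parsed CSV rows before the first blank line, raw lines after it)."""
    it = iter(lines)
    # takewhile consumes the blank line itself, so it lands in neither part.
    data_lines = list(takewhile(lambda l: l.strip() != "", it))
    trailing = list(it)
    return list(csv.reader(data_lines)), trailing

data, trailing = split_at_first_blank(["a,b", "1,2", "", "free text note"])
```

Because the split happens on raw lines before parsing, a quoted field that itself contains an empty line would be cut in two, so this only suits files without multi-line fields.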

pwalsh · 2016-07-12T10:05:33Z

I'm suggesting we close this as a nice idea and WONTFIX: implementors can do it, but let's keep it out of the spec. @rgrp are you ok with that?

rufuspollock · 2016-08-09T14:12:31Z

WONTFIX. As per above discussion.

jpmckinney added the Table Dialect label Feb 3, 2015

rufuspollock added the Ready for PR label Mar 7, 2016

rufuspollock added FAQ / Pattern / Best Practice and removed Ready for PR labels May 19, 2016

roll added the backlog label Aug 8, 2016

rufuspollock closed this as completed Aug 9, 2016

rufuspollock removed the backlog label Aug 9, 2016

rufuspollock mentioned this issue Nov 17, 2016

Create a Patterns / FAQ section #321

Closed

roll added this to Open Knowledge Jun 27, 2023
