Allow CSV dialect to specify the meaning of blank lines #150

Closed
hubgit opened this issue Nov 12, 2014 · 14 comments

Comments

@hubgit

hubgit commented Nov 12, 2014

A blank/empty line in a CSV file can have several meanings.

A StackOverflow discussion lists these possibilities for how a CSV parser could handle blank lines:

  • Ignore blank lines
  • Treat a blank line as a row with zero fields
  • Treat a blank line as a row with one empty field
  • Treat a blank line as the end of the input file

A CSV parser needs to know which of those meanings applies to a particular CSV file.
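To make the four interpretations concrete, here is a minimal Python sketch. The function `parse_rows` and its `blank_lines` parameter are hypothetical names invented for illustration; they are not part of any spec or library. The sample data is made up as well.

```python
import csv
import io

# Hypothetical policy names, mirroring the four meanings listed above.
POLICIES = {"ignore", "zeroFields", "oneEmptyField", "endOfInput"}


def parse_rows(text, blank_lines="ignore"):
    """Parse CSV text, applying one policy to blank lines."""
    if blank_lines not in POLICIES:
        raise ValueError(f"unknown blank-line policy: {blank_lines}")
    rows = []
    # Python's csv.reader yields an empty list for a blank line.
    for row in csv.reader(io.StringIO(text)):
        if row:
            rows.append(row)
        elif blank_lines == "zeroFields":
            rows.append([])
        elif blank_lines == "oneEmptyField":
            rows.append([""])
        elif blank_lines == "endOfInput":
            break
        # "ignore": drop the blank line and keep reading.
    return rows


# Made-up sample: two data rows, a blank line, then free text.
sample = "id,name\n1,Quercus\n\ncollected by the herbarium\n"
print(parse_rows(sample, "ignore"))
print(parse_rows(sample, "endOfInput"))
```

With `"endOfInput"` the free text after the blank line never reaches the result; with `"ignore"` it comes back as a one-field row, which is usually not what the publisher intended.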

To see a CSV file that ends the data with a blank line then continues with free text metadata, run a search in the Kew Herbarium Catalogue, then choose one of the options under "Download specimen records".

@rufuspollock
Contributor

@hubgit this is a great suggestion. Do you want to make a formal suggestion for a mod to the spec?

@rufuspollock
Contributor

@hubgit ping re above comment ^^^

@pwalsh
Member

pwalsh commented Mar 7, 2016

@hubgit ( @rgrp @danfowler )

I'm interested in having this as part of CSV dialect. In GoodTables there are config options for this as part of a validation run over a CSV file. Declaring the behaviour as part of the CSV spec is appealing.

@rufuspollock
Contributor

+1 on this as a nice addition.

I think the list of options is:

  • ignore: Ignore blank lines
  • endOfInput: Treat a blank line as the end of the input file
  • error: a completely blank row signifies an error and should terminate processing?
    • is this a duplicate of the constraints?
  • zeroFields: Treat a blank line as a row with zero fields
    • What does this mean?
  • oneEmptyField: Treat a blank line as a row with one empty field
    • What does this mean?

@pwalsh what are the options to GoodTables?

@pwalsh
Member

pwalsh commented Mar 7, 2016

@rgrp GoodTables can ignore blank rows, and ignore ragged rows (in both cases, instead of raising an exception).

@hubgit
Author

hubgit commented Mar 7, 2016

oneEmptyField: Treat a blank line as a row with one empty field

This is what PHP's CSV parser does - a blank line is parsed as [null] (an array with one column containing null)

zeroFields: Treat a blank line as a row with zero fields

This is an alternative to the above, avoiding the null column.

I suppose they're both describing the parser behaviour more than the data, so debatable whether they should be included in a CSV dialect.
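For comparison, Python's standard `csv` module takes the zero-fields route: a blank line comes back as an empty list. Code that consumes rows from more than one parser may want to treat both representations as the same thing. A small sketch, where `is_blank_row` is a hypothetical helper name:

```python
import csv
import io


def is_blank_row(row):
    """True for either parser-level representation of a blank line:
    zero fields ([]) or a single empty field ([None] or [""])."""
    return len(row) == 0 or (len(row) == 1 and row[0] in (None, ""))


rows = list(csv.reader(io.StringIO("a,b\n\nc,d\n")))
print(rows)  # [['a', 'b'], [], ['c', 'd']]
print([is_blank_row(r) for r in rows])  # [False, True, False]
```

One caveat: a line holding a single quoted empty value (`""`) also parses to one empty field, so the one-empty-field form cannot always be told apart from a real single-column row.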

@hubgit
Author

hubgit commented Mar 7, 2016

Maybe there should also be:

startOfInput: Treat a blank line as the start of the input file (for when the metadata header is at the top of the file). Sometimes this would be covered by a headerRows integer, but if that's not a fixed number then the data might instead be found after one or more blank lines.

@rufuspollock
Contributor

@hubgit thanks for the clarification. I guess my sense is that "oneEmptyField" and "zeroFields" are a bit odd in that if you have e.g. headers then I'd just say an empty row means that all fields are empty rather than zero or one. I wonder if we could just have: an empty option which means treat the row as valid but empty (and the parser can determine what they mean by "empty").
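One possible reading of such an `empty` option, sketched in Python: when headers are known, a blank line becomes a record in which every field is the empty string. The function name `read_with_empty_rows` and the sample data are assumptions made for illustration only.

```python
import csv
import io


def read_with_empty_rows(text):
    """Read CSV text with a header row; a blank line becomes a record
    whose fields are all empty strings (one reading of "empty")."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, [])
    records = []
    for row in reader:
        if not row:  # blank line: csv.reader yields []
            row = [""] * len(header)
        records.append(dict(zip(header, row)))
    return header, records


header, records = read_with_empty_rows("id,name\n1,Quercus\n\n2,Acer\n")
for record in records:
    print(record)
```

Another parser could just as reasonably use `None` instead of `""`, which is the latitude the comment above leaves to implementers.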

@rufuspollock
Contributor

@hubgit any final thoughts before this goes in (including on my last comment)?

@hubgit
Author

hubgit commented Apr 19, 2016

@rgrp I think ignore, empty (default?), end (and possibly start) sound reasonable, though I haven't used them enough in practice to say for sure.

@rufuspollock
Contributor

I am just recording that I am hesitating a bit on this one. Looking at various parsers these do not seem to be very common options and add a fair amount of complexity to something implementing CSV DDF - it also seems to be extending beyond a pure dialect description to something about how the data is formatted.

My thought is that we might write this up as a pattern rather than something in the primary spec.

@hubgit
Author

hubgit commented May 25, 2016

That's fair enough - I imagine it could be handled fairly easily by a client, outside of the CSV parser, by simply ignoring everything before or after the first empty row.
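A rough Python sketch of that client-side approach, splitting the raw lines at the first blank one before handing anything to the CSV parser. `split_at_first_blank` is a hypothetical helper and the sample text is made up.

```python
import csv


def split_at_first_blank(lines):
    """Return (before, after): the lines on either side of the first
    blank line. If there is no blank line, `after` is empty."""
    lines = list(lines)
    for i, line in enumerate(lines):
        if line.strip() == "":
            return lines[:i], lines[i + 1:]
    return lines, []


raw = "id,name\n1,Quercus\n\nfree text, not CSV\n".splitlines()
data, trailer = split_at_first_blank(raw)
rows = list(csv.reader(data))
print(rows)     # [['id', 'name'], ['1', 'Quercus']]
print(trailer)  # ['free text, not CSV']
```

For a file with metadata at the top, the same helper works with the roles swapped: parse the second element and keep the first as metadata. Because the split happens on raw lines, a quoted field that itself contains a blank line would be cut in two, so this only suits files where that cannot occur.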

@pwalsh
Member

pwalsh commented Jul 12, 2016

I'm suggesting we close this as a nice idea and WONTFIX: implementors can do it, but let's keep it out of the spec. @rgrp are you ok with that?

@roll added the backlog label Aug 8, 2016

@rufuspollock
Contributor

WONTFIX. As per above discussion.


5 participants