Support ignoreRows for TabularResource #344

Closed
roll opened this issue Dec 20, 2016 · 4 comments · Fixed by frictionlessdata/datapackage-v2-draft#41

@roll
Member

roll commented Dec 20, 2016

Overview

The Resource specification exists to describe a concrete data source with metadata. Real-world data sources often have corner cases such as commented rows or blank rows at the top. A publisher needs a way to share this information with implementations.

Example

https://github.com/frictionlessdata/ADB-User-Study/blob/master/metadata.tsv

It's a valid resource (checked by goodtables) except for rows 2 and 3, which are comments and can't be removed because they are vital metadata for this publisher's tools.

Proposal

Introduce an ignoreRows (or skipRows or informationalRows or ?) attribute for the TabularResource specification. This attribute MUST be an array of integers and strings where:

  • a number is the number of a row to ignore
  • a string is a prefix: any row whose first characters match it is ignored

Example

ignoreRows = [1, 2, "#", "//"]
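
As a rough illustration of how an implementation might honour such an attribute, here is a minimal Python sketch. It is not part of any spec or existing library: the function name `read_rows` and the sample data are hypothetical, row numbers are assumed to be 1-based, and string entries are assumed to be matched against the raw text of the line before CSV parsing. It also treats every physical line as one row, so quoted fields containing newlines are not handled.

```python
import csv
import io


def read_rows(text, ignore_rows, delimiter=","):
    """Yield parsed rows from `text`, skipping rows matched by `ignore_rows`.

    Assumptions (not defined by the proposal): row numbers are 1-based and
    string entries are compared with the start of the raw, unparsed line.
    """
    numbers = {item for item in ignore_rows if isinstance(item, int)}
    prefixes = tuple(item for item in ignore_rows if isinstance(item, str))
    for number, line in enumerate(io.StringIO(text), start=1):
        line = line.rstrip("\r\n")
        if number in numbers:
            continue  # ignored by row number
        if prefixes and line.startswith(prefixes):
            continue  # ignored because the row starts with a listed prefix
        # Parse the surviving line as a single CSV record.
        yield next(csv.reader([line], delimiter=delimiter))


# Hypothetical sample: a banner line, a blank line, then data with comments.
sample = (
    "generated by some tool\n"
    "\n"
    "id,name\n"
    "# a comment row\n"
    "1,apple\n"
    "// another comment row\n"
    "2,pear\n"
)
rows = list(read_rows(sample, ignore_rows=[1, 2, "#", "//"]))
print(rows)
```

Here rows 1 and 2 are dropped by number and the two comment rows by prefix, so only the header and the two data rows remain.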

Related

Headers are another example where a publisher may need more granular control over data source rows: #326

References

@roll roll changed the title Support ignore rows for tabular resource Support ignoreRows for tabular resource Dec 20, 2016
@roll roll changed the title Support ignoreRows for tabular resource Support ignoreRows for TabularResource Dec 20, 2016
@pwalsh
Member

pwalsh commented Dec 20, 2016

closely related to #326

@rufuspollock
Contributor

rufuspollock commented Dec 21, 2016

@roll I'm super cautious about this kind of stuff, as it is a place where "ETL" logic starts to bleed into the spec, and that's a slippery slope. If you delete rows, what about columns, what about transforms, etc.?

Thus, my sense is that ETL stuff like this should not go into the spec for now. At most it should be in patterns, and even there I'm cautious.

PS: I am willing to consider #326 because it is so common and it is about the presence of a header row.

@roll
Member Author

roll commented Dec 22, 2016

@rufuspollock
@pwalsh has said the same, but this is a very common real-world problem and it needs some help from the specs (maybe patterns?). It's a datapackage-only problem: at other levels implementations can use their own options, but a datapackage encapsulates all knowledge about its data sources (that's the whole point of data containerization), so we need some way to inject this information (cc @danfowler)

@pwalsh pwalsh added this to the v1.0 milestone Feb 5, 2017
@rufuspollock
Contributor

rufuspollock commented Feb 5, 2017

AGREED with @pwalsh: this should go to "Best Practice" rather than spec for now.

@rufuspollock rufuspollock modified the milestones: Backlog, v1.0 Feb 5, 2017
@roll roll removed this from the Backlog milestone Apr 14, 2023
@roll roll added this to the v2 milestone Jan 3, 2024
@roll roll self-assigned this Feb 21, 2024
@roll roll added the proposal label Feb 22, 2024
@roll roll modified the milestones: v2.0-draft, v2.0 Sep 25, 2024