
Data Summary and Quality description spec #364

Closed
rufuspollock opened this issue Jan 29, 2017 · 19 comments

Comments

@rufuspollock
Contributor

This would be a spec for describing summary information about data -- often on a per-field (column) or per-resource (table) basis. Things like:

  • Most common values
  • Mean, median, max, min, etc. (for numeric fields)
  • Histogram of values
  • Sample values
  • Type guess (if type not known)
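For illustration, a per-field summary like this could serialize as JSON along the following lines (a purely hypothetical sketch; none of these property names exist in any spec):

    {
      "name": "price",
      "type": "number",
      "summary": {
        "count": 10000,
        "mean": 12.34,
        "median": 11.9,
        "min": 0.05,
        "max": 99.0,
        "mostCommonValues": [9.99, 19.99, 4.99],
        "sampleValues": [12.5, 3.2, 45.0],
        "histogram": {"binEdges": [0, 25, 50, 75, 100], "counts": [7200, 2100, 500, 200]}
      }
    }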

Quality side:

Note: I think we probably have a proto version of this spec, at least for reporting errors against a schema, in the goodtables library.

  • Percentage of values in this field inconsistent with the schema
  • List of erroneous rows / columns
  • Example errors
  • "Probable errors" - useful where the schema is missing (e.g. 99.9% of a field's values are positive and this value is negative => it may be an error)
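The quality side might serialize similarly (again, every property name here is hypothetical):

    {
      "name": "price",
      "quality": {
        "schemaErrorPercent": 0.4,
        "erroneousRows": [12, 87, 1043],
        "exampleErrors": [
          {"row": 12, "code": "type-or-format-error", "value": "N/A"}
        ],
        "probableErrors": [
          {"row": 2001, "reason": "negative value in a field that is 99.9% positive"}
        ]
      }
    }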

Background

We have talked about this for years - in fact all the way back to early versions of CKAN.

It is common practice to acquire and present this data in data analytics and data warehouse applications.

@pwalsh
Member

pwalsh commented Jan 30, 2017

See https://github.com/frictionlessdata/data-quality-spec

This was extracted out of the first version of goodtables into a form that can be used in any code, or just used as a reference. This extracted form of the data quality spec is now used as-is in the updated goodtables codebase, and I'd love to get more feedback on the spec itself and see more potential use cases.

@patcon

patcon commented Feb 7, 2017

Carrying over suggestion from #324:

Add a quantitative metric for "data point" counts. The idea would be to allow a better measure of portal-wide data quantity than the current favourite, "total number of datasets or resources".

This would be most useful per-resource and perhaps per-dataset. The definition of a datapoint would presumably be non-empty values in columns indicated to contain data.
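A rough sketch of the metric under that definition, treating a data point as any non-empty cell (assuming pandas; the function name and signature are invented for illustration):

    import pandas as pd

    def count_data_points(csv_path, data_columns=None):
        """Count non-empty cells, optionally restricted to columns known to contain data."""
        df = pd.read_csv(csv_path)
        if data_columns is not None:
            df = df[data_columns]
        # notna() marks non-empty cells; the double sum collapses columns, then rows
        return int(df.notna().sum().sum())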

@Stephen-Gates
Contributor

Stephen-Gates commented Aug 3, 2017

Some other potential values:

  • Row and Column count
  • Outliers (see the sketch after this list):
    • Outliers from the mean are numbers more than three standard deviations from the mean
    • Outliers from the median are numbers more than three median absolute deviations from the median.
  • Completeness - calculate, for each column without a “required” constraint, the percentage of rows that are empty
  • Uniqueness %
  • Unique count
  • Min / Max Length
  • Missing Values Y/N
  • Null Count
  • Precision
  • and many more
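A minimal sketch of the two outlier definitions above, using the thresholds exactly as stated (assuming numpy; the function and names are illustrative only):

    import numpy as np

    def find_outliers(values):
        x = np.asarray(values, dtype=float)
        # outliers from the mean: more than three standard deviations away
        from_mean = x[np.abs(x - x.mean()) > 3 * x.std()]
        # outliers from the median: more than three median absolute deviations away
        med = np.median(x)
        mad = np.median(np.abs(x - med))
        from_median = x[np.abs(x - med) > 3 * mad]
        return from_mean, from_median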

Really like this idea. Would the values be published in a Readme.md, or a separate data-quality.csv file in the data package?

@Stephen-Gates
Contributor

Stephen-Gates commented Aug 4, 2017

@rufuspollock @pwalsh I'd like to make a start on this. I'm thinking:

  • a csv reference of quality measures, e.g.
    quality-measure-name, description, equation, type, url-to-further-detail
    column-mean, mean of all values in one column, column-sum / row-count, number, fd.io\quality\column-mean.md
  • at the url-to-further-detail discussion on:
  • for a table in a data package, a csv describing quality e.g.
    column-name, quality-measure-name, value
    price, column-mean, 12.345
  • a spec to describe how the data-quality.csv for each table is included in a datapackage.zip, e.g.

    datapackage.zip
    ├ datapackage.json
    ├ readme.md
    ├ data
    │   ├ pricing.csv
    │   └ inventory.csv
    ├ quality
    │   ├ pricing-quality.csv
    │   └ inventory-quality.csv
    └ scripts
Would a file like pricing-quality.csv be a type of data-resource e.g. a tabular-data-quality-resource with a fixed table-schema?

Am I on the right track?

One problem is that the value in the tabular-data-quality-resource could be of varying data types, making validation tricky. It would be good if it could be validated in the same way as a tabular-data-resource.
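For concreteness, pricing-quality.csv could be declared in datapackage.json along these lines (a hypothetical sketch; the tabular-data-quality-resource profile is only the suggestion above, not an existing spec):

    {
      "name": "pricing-quality",
      "profile": "tabular-data-quality-resource",
      "path": "quality/pricing-quality.csv",
      "schema": {
        "fields": [
          {"name": "column-name", "type": "string"},
          {"name": "quality-measure-name", "type": "string"},
          {"name": "value", "type": "any"}
        ]
      }
    }

On the varying-data-types concern: Table Schema's any type would accept the mixed values, though at the cost of weaker validation than a typed column.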

@pwalsh
Member

pwalsh commented Aug 4, 2017

cc @roll

@Stephen-Gates we've been introducing such quality checks in goodtables over the last few weeks, so I'm cc'ing @roll in case he sees a crossover and can say how this might tie in.

@MAliNaqvi

@rufuspollock @Stephen-Gates @pwalsh
Is there a need for a separate quality file? Could we not create an optional section in datapackage.json to store the metrics?

@Stephen-Gates
Contributor

@MAliNaqvi I can see benefits in both approaches.

If csv:

  • datapackage.json is smaller
  • quality data may be more accessible to non-technical people
  • tools that don't support this part of the spec won't drop the data quality information when rewriting datapackage.json (a risk with the JSON approach)

If json:

  • easy for computers and experts to work with

The spec at present has options to support data in-line, in a path or at a url, so I'm sure we could cater for a few options here.

@patcon

patcon commented Jan 17, 2018

quality data may be more accessible to non-technical people

Great point. That's reason enough. IMHO, JSON means "you are looking where you're not supposed to" for non-technical people.

@Stephen-Gates
Contributor

I'm happy to draft something if I can get a "steer" from @rufuspollock, @pwalsh, or @roll on the general direction to take.

I think you could capture at least 6 types of measures:

  • package errors e.g. valid/pass
  • package stats e.g. 3 data resources
  • table errors e.g. encoding-error, blank-row, schema-error from data-quality-spec
  • table stats e.g. number of columns and rows
  • column errors e.g. type-or-format-error, pattern-constraint from data-quality-spec
  • column stats e.g. min/max value, outliers, mean, implied precision, completeness etc.
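Those six levels could nest naturally in a single JSON structure, e.g. (hypothetical property names throughout):

    {
      "quality": {
        "errors": [],
        "stats": {"resourceCount": 3},
        "resources": {
          "pricing": {
            "errors": ["blank-row"],
            "stats": {"rowCount": 5000, "columnCount": 7},
            "fields": {
              "price": {
                "errors": ["pattern-constraint"],
                "stats": {"min": 0.05, "max": 99.0, "mean": 12.34, "completeness": 0.98}
              }
            }
          }
        }
      }
    }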

I was speaking with others yesterday about whether Data Curator should allow you to publish a data package with data that doesn't validate against the schema. We ended up letting the publisher decide to publish, but adding the residual validation errors to the README.md to warn the data consumer of data errors. It's a bit of a hack that hopefully this extension to the spec could solve.

Looking forward to your thoughts 🤔

@Stephen-Gates
Contributor

OK, I've made a start on a Data Quality Pattern. It is still a work in progress but probably good enough to get some feedback and work out if I'm going in the right direction.

@rufuspollock
Contributor Author

@Stephen-Gates great to get this started.

Comments:

  • Process issue: suggest putting the spec into a hackmd for now as that is the easiest way to collaboratively edit it
  • I think it would be good to start by listing user stories explicitly - I've put some example ones below
  • definitely think this is something that, by default, would go into the datapackage.json. I think we should design around a "JSON" format; serializing this to CSV rather than JSON is not that hard (sketched below).
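Indeed, flattening a nested JSON quality structure down to the CSV shape proposed earlier is straightforward; a rough sketch in Python (the function and field names are illustrative, reusing the hypothetical nesting from the example above):

    import csv

    def quality_to_csv(field_quality, out_path):
        """Flatten per-field quality stats into column-name, quality-measure-name, value rows."""
        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["column-name", "quality-measure-name", "value"])
            for field_name, info in field_quality.items():
                for measure, value in info.get("stats", {}).items():
                    writer.writerow([field_name, measure, value])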

User stories

As a Consumer of a Dataset I want to know how many rows there are without having to load and parse the whole dataset myself so that I can display this information to others or ...

As a Publisher or Consumer of a Dataset I want to know what validation errors there are with this dataset without having to validate it myself, so that [as a Consumer] I know what issues to expect or [as a Publisher team member] I know what I need to fix.

Note: we already have a data validation reporting format in the form of the goodtables reports. I've also just opened an epic about unifying validation reporting in a consistent way: https://github.com/frictionlessdata/implementations/issues/30.

@Stephen-Gates
Contributor

@rufuspollock I'm happy for a quality assessment tool to produce JSON, and then produce a CSV. What I wrote is focussed on sharing the data quality measures with others.

Happy to have a consistent approach between data validation and quality assessment, so thanks for the Epic.

Requirements are in the document, just not written in user story format.

The pattern supports the requirement to:

  • associate data with data quality measures and annotations
  • associate data quality measures with a set of documented, objective data quality metrics
  • support user-defined or domain-specific data quality metrics
  • discover data packages and data resources that contain data quality measures
  • compare data quality measures across similar types of data

I'm not sure about your statement,

definitely think this is something that, by default, would go into the datapackage.json

That's what I've proposed. Unless you mean you want to assess everything in the data package at once and place all the results in one measurement file? I'm not sure this works as different types of data in the same package could be measured by different metrics (e.g. spatial vs tabular).

I started in HackMD but created a repo to help me think things through, split up a document that was becoming too big, and provide examples.

@Stephen-Gates
Contributor

HackMD version of pattern from GitHub - https://hackmd.io/s/BJeKJgW8G

✏️ Comments and edits are welcome

@patcon

patcon commented Feb 1, 2018

Just an FYI: hackmdio/codimd#579

@Stephen-Gates
Contributor

Thanks @patcon

@rufuspollock the Data Quality Pattern was posted to HackMD on your advice. Are you considering alternate platforms going forward given hackmdio/codimd#579?

Happy to collaborate on https://github.com/Stephen-Gates/data-quality-pattern if people aren't happy with HackMD

@Stephen-Gates
Contributor

I wonder if it's worth folding the ideas from #281 (Support for observational error measurements in data) into this spec?

@patcon

patcon commented Feb 2, 2018

No pressure to bikeshed the tool for this specific doc :)

Just recalled introducing it to someone here, and wanted to ensure folks at OKFN had full context going forward.

@Stephen-Gates
Contributor

Added user stories: https://github.com/Stephen-Gates/data-quality-pattern/blob/master/user-stories.md

Note: we already have a data validation reporting format in the form of the goodtables reports. I've also just opened an epic about unifying validation reporting in a consistent way: frictionlessdata/implementations#30.

Stories include reporting validation results

Contributions welcome

@rufuspollock
Contributor Author

See also this recent pandas discussion: pandas-dev/pandas#22819

@roll roll removed this from the Backlog milestone Apr 14, 2023
@roll roll added Patterns and removed New Spec labels Jan 3, 2024
@frictionlessdata frictionlessdata locked and limited conversation to collaborators Apr 12, 2024
@roll roll converted this issue into discussion #909 Apr 12, 2024
