Options for metadata to describe tabular data in a more structured way? #1418

Closed
sabinem opened this issue Oct 30, 2021 · 18 comments
Labels: dcat, dct:conformsTo, requires discussion (Issue to be discussed in a telecon (group or plenary))

@sabinem

sabinem commented Oct 30, 2021

I am wondering why DCAT does not offer a more structured way to provide metadata about tabular data.

A lot of data is provided as CSV files, which would then be dcat:Distributions. For a data user it is very important to understand the tabular structure of these data files. In the CSV example: what is the meaning of each column?
DCAT provides an unstructured dct:description for this purpose, but as far as I know it offers no support for describing tabular data in a more structured way. Or am I wrong about this? If I am right: why did DCAT go this way?

What I found on describing metadata for tabular data is this: https://www.w3.org/TR/tabular-data-model/#locating-metadata

I also found a DCAT-US property, https://resources.data.gov/resources/dcat-us/#distribution-describedBy, that might help in this regard.

This issue is mainly about understanding.

@makxdekkers
Contributor

@sabinem As far as I understand, DCAT specifies a way to describe datasets in a catalogue, not the data inside a dataset. It defines a way to describe characteristics like what it is about, who is responsible, and for the distribution where it is and what format it is in. DCAT does not look inside the data in the dataset in any detail. This is, in my mind, because DCAT does not want to limit the kind of datasets it can describe. Data may indeed often be tabular but there are many cases where it is not.

Other groups at W3C have concentrated on the data, and the one that would interest you the most I guess is CSV on the Web, but there are others like Data Cube for multi-dimensional data like statistics, Spatial Data on the Web and others.

DCAT provides a way to describe these various types of datasets with different kinds of data together in a catalogue.

@makxdekkers
Contributor

And by the way, DCAT specifies the use of the property dct:conformsTo to be used on a Distribution "to indicate the model, schema, ontology, view or profile that this representation of a dataset conforms to".
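As an illustration of the property mentioned above, here is a minimal sketch of a distribution description serialized as JSON-LD. All example.org identifiers are hypothetical; only the dcat: and dct: terms come from the published vocabularies.

```python
import json

# Hypothetical identifiers under example.org; only the dcat: and dct: terms
# are taken from the DCAT and DCMI Terms vocabularies.
distribution = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@id": "https://example.org/dataset/air-quality/csv",
    "@type": "dcat:Distribution",
    "dcat:downloadURL": {"@id": "https://example.org/files/air-quality.csv"},
    # The schema that this representation of the dataset conforms to.
    "dct:conformsTo": {"@id": "https://example.org/schemas/air-quality-columns"},
}

print(json.dumps(distribution, indent=2))
```

Nothing in this description says what kind of resource the dct:conformsTo target is; that has to be learned from the target itself.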

@kcoyle
Contributor

kcoyle commented Oct 30, 2021

It makes sense to me that the catalog entry would include a link to the specification that defines the dataset structure and semantics. Not providing this link as part of the catalog entry seems to be a real disservice, causing users to hunt around to discover if such a document exists.

I don't think dct:conformsTo is sufficient. In the DC vocabulary, dct:conformsTo is defined as: "An established standard to which the described resource conforms", which is fairly non-specific. Given that the DCAT definition of conformsTo also covers a wide range ("model, schema, ontology, view or profile") and that more than one of those might be needed in the description of a Distribution, it appears that there is no way to indicate which object of conformsTo defines the data structure, or which defines, for example, a profile in a textual document. While it may be hard to distinguish between all of the possible objects of conformsTo, the data structure of the dataset does seem to be necessary for use of the dataset.

@sabinem
Author

sabinem commented Oct 30, 2021

@makxdekkers and @kcoyle thank you both for your answers.

@makxdekkers I do understand that datasets differ, and I also know that some distribution formats have metadata built into them (shapefiles, for example, can come with metadata). I also get that DCAT does not want to restrict the kinds of datasets, but I still think it would be great if it came with an optional hint on how to add metadata to distributions that don't come with metadata of their own. The reason is that data publishers, and probably everybody, tend to focus on the things that are explicitly mentioned: it is just normal to forget about everything that is not mentioned. Therefore there are many open datasets today with tabular data where the field descriptions are missing.

I agree with @kcoyle that dct:conformsTo is not enough in that regard, since it does not link to a description, but just states which standard was applied.

In my opinion it would be helpful to have the possibility to add an attribute similar to the one in DCAT-US: distribution → describedBy (see https://resources.data.gov/resources/dcat-us/#distribution-describedBy). With the usage note there, it would help make data publishers aware that in certain cases it might be necessary to add a link to a more structured description of the tabular structure of the distribution. As an optional field it would not be restrictive on datasets in general and would be a sort of minimal solution, adding a hint on what to provide if a distribution has no metadata description built into it.

@makxdekkers
Contributor

Good points @sabinem @kcoyle .
I am not questioning the usefulness of linking to information about the structure of data that is not 'self-describing'. The question that I have is whether this needs to be done as part of DCAT or by some other specification.

In the case of CSV, which is the case that @sabinem raises, my worry would be that if DCAT tries to find a solution for describing the meaning of the columns in the file, we are in fact doing something in parallel to the work done by the CSV on the Web working group, and in particular the W3C recommendation Metadata Vocabulary for Tabular Data which looks like a really thorough (and standard!) solution for this. In this case, could our approach be to explain how to use that recommendation in conjunction with DCAT?

The more general remark that I have is that I would not want to just add one or more properties as a quick solution without a thorough analysis of the use cases, the expectations of publishers and users, interoperability aspects and the availability of existing standards to meet the requirements.
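For readers who have not seen the Metadata Vocabulary for Tabular Data, the following is a rough sketch of the kind of column description it allows. The file name, column names and descriptions are invented for illustration; the keys follow the recommendation as far as I am aware, with dc:description used as a common property.

```python
# Hypothetical file name and columns; the keys ("@context", "url",
# "tableSchema", "columns", "name", "titles", "datatype") follow the
# Metadata Vocabulary for Tabular Data.
csvw_metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "air-quality.csv",
    "tableSchema": {
        "columns": [
            {"name": "station", "titles": "Station", "datatype": "string",
             "dc:description": "Identifier of the measuring station"},
            {"name": "pm10", "titles": "PM10", "datatype": "decimal",
             "dc:description": "Daily mean PM10 concentration"},
        ]
    },
}


def describe_columns(metadata):
    """Return (name, datatype, description) for every described column."""
    return [
        (col["name"], col.get("datatype", "string"), col.get("dc:description", ""))
        for col in metadata["tableSchema"]["columns"]
    ]


for name, datatype, description in describe_columns(csvw_metadata):
    print(f"{name} ({datatype}): {description}")
```

A document like this answers the "what is the meaning of each column?" question; the open point in this thread is only how a DCAT record should lead a user to it.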

@sabinem changed the title from "Options for metadata to describe tabular data in a more strutured way?" to "Options for metadata to describe tabular data in a more structured way?" on Oct 31, 2021
@sabinem
Author

sabinem commented Oct 31, 2021

@makxdekkers

> In this case, could our approach be to explain how to use that recommendation in conjunction with DCAT?

Exactly, that is what I mean.

See here on the location of metadata for a CSV file in the W3C recommendation Metadata Vocabulary for Tabular Data: one option there is to link the metadata by a URL. But how will a user be able to locate that metadata description? The instructions don't seem very straightforward to me. Why can't DCAT offer that as an optional property:

    rel="describedby", and
    type="application/csvm+json", type="application/ld+json" or type="application/json".

I am just wondering: can this be made into a DCAT property "dcat:describedby" or similar, with domain resource and the usage note linking to the W3C recommendation Metadata Vocabulary for Tabular Data?

This would help a lot:

  • making both users and publishers aware of a good way to add/find metadata for Distributions that don't have metadata built in by default
  • raising awareness of that other W3C project, the recommendation Metadata Vocabulary for Tabular Data (I myself did not even know about that project, even though I have been working with open data and DCAT for 2 years now).
  • improving data quality overall, since profiles can then hook on to that and add their own usage notes regarding metadata for distributions.

With a hook like this the work would not be duplicated, just built upon, and I think it would make sense to reach out to the people behind the tabular data work and ask them how their work can best be integrated into DCAT and whether a property such as the one proposed above would make sense to them.
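To make the rel="describedby" mechanism quoted above more concrete, here is a small sketch of how a client might pick the metadata URL out of an HTTP Link header. The function name and the example URLs are hypothetical, and the header parsing is deliberately naive (it is not a full RFC 8288 parser).

```python
import re
from urllib.parse import urljoin

# Media types listed in the passage quoted from the recommendation.
CSVW_METADATA_TYPES = {"application/csvm+json", "application/ld+json", "application/json"}


def metadata_url_from_link_header(csv_url, link_header):
    """Return the first describedby link with an accepted type, or None."""
    # Split the header into link-values; each one starts with "<target>".
    for link_value in re.split(r",\s*(?=<)", link_header):
        match = re.match(r"\s*<([^>]*)>(.*)", link_value)
        if not match:
            continue
        target, params_text = match.groups()
        # Collect parameters such as rel="describedby" and type="...".
        params = {
            key.lower(): value.strip('"')
            for key, value in re.findall(
                r';\s*([\w-]+)\s*=\s*("[^"]*"|[^;,\s]+)', params_text
            )
        }
        rels = params.get("rel", "").lower().split()
        if "describedby" in rels and params.get("type") in CSVW_METADATA_TYPES:
            # Relative targets are resolved against the URL of the CSV file.
            return urljoin(csv_url, target)
    return None


header = '<air-quality-metadata.json>; rel="describedby"; type="application/csvm+json"'
print(metadata_url_from_link_header("https://example.org/files/air-quality.csv", header))
```

This only works once the file has been requested over HTTP, which is part of the point made above: a catalogue user browsing DCAT records has no equivalent pointer unless the record itself carries one.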

@makxdekkers
Contributor

@sabinem I think this is a good discussion for a potential addition in DCAT4.

I would, however, prefer to make the discussion wider than just for CSV files. The more general use case would be that the file accessed through the downloadURL is not self-describing but is associated with one or more resources that describe various aspects necessary to be able to understand and/or process the file.

We can then see if we can find a general solution that helps interoperability.

Would that make sense @pwin, @riccardoAlbertoni, @dr-shorthair, @agbeltran, @davebrowning, @andrea-perego, or would it be out of scope?

@dr-shorthair
Contributor

dr-shorthair commented Nov 1, 2021

> I agree with @kcoyle that dct:conformsTo is not enough in that regard, since it does not link to a description, but just states which standard was applied.

I disagree. There is nothing to stop you using dct:conformsTo to point to a detailed highly specific schema - in fact the DCAT usage note is very clear on this point: "This property SHOULD be used to indicate the model, schema, ontology, view or profile that this representation of a dataset conforms to." @kcoyle may be referring to the DCMI usage, but the DCAT usage note is deliberately more detailed.

On the wider issue: I see the tabular data descriptions as complementary to DCAT.

@rob-metalinkage
Contributor

rob-metalinkage commented Nov 1, 2021

The "correct" approach according to the scoping discussions would be to define a profile of DCAT that defines a canonical structural description. RDF-QB is probably the leading candidate for such a description, and the StatDCAT-AP profile is a potential starting point, see: https://joinup.ec.europa.eu/collection/semantic-interoperability-community-semic/solution/statdcat-application-profile-data-portals-europe/release/100

The question is whether there should be a profile hierarchy:

  1. DCAT-QB profile of DCAT and QB for any structural case (not just statistical data)
  2. StatDCAT being a profile of DCAT-QB
  3. StatDCAT-AP being a profile of StatDCAT and DCAT-AP

(this would make simple declarative statements about the expected interoperability as well as provide logical extension points for new profiles to suit any similar requirements without re-inventing these patterns.)

For each profile, SHACL validations, JSON schema, JSON-LD contexts etc could be implemented in a modular fashion to simplify implementation. (This is just extending what the DCAT-AP community is already doing toward a scalable approach to support machine access and readability of interoperability requirements.)

Note that publishing profiles of DCAT is explicitly out of scope of this working group - but there must be a sizeable community who would have an interest in doing that in a coherent way - it would be helpful if the DXWG could pass off these matters to an appropriate forum.

@kcoyle
Contributor

kcoyle commented Nov 1, 2021

> I disagree. There is nothing to stop you using dct:conformsTo to point to a detailed highly specific schema

You can point to a detailed schema. You could also point to a very general profile document. How would a user know which it is? It's the OR that is the problem in that definition. I read the opening comment here as someone wanting to obtain a specific type of resource that gives the schema describing the data in the dataset, and dct:conformsTo doesn't guarantee that. So the question is whether a DCAT description should specifically point to a schema that defines the data in the dataset, regardless of its format.

@smrgeoinfo
Contributor

Maybe there are a couple different use cases here:

  1. what are the variables that are specified in the dataset (whether they are columns in a csv or keys in a JSON object, etc...). This is useful for discovery. Schema.org variableMeasured (see Science on schema.org discussion) might be sufficient for this (see
  2. what is the data structure (tabular-- long? wide?, objects, grids...). This is useful to the reuse part for FAIR, and making metadata machine-actionable. W3C Data Cube and DDI provide tools for this-- can DCAT make a recommendation on which to use, how, and when?

@rob-metalinkage
Contributor

I suspect it will be a very common pattern to "promote" up out of detailed structural and semantic descriptions one or more simple properties to support faceted searching - such as variableMeasured, dct:subject etc.

Since we don't want to mandate the form of those structural descriptions, profiles of DCAT that assume different structural descriptions could define an entailment rule (property chain, SHACL-AF rule, etc.) to derive these from a particular form of detailed description.

So one could envisage a profile of DCAT for observation collections - where the simple properties include the ObservedProperty, the type(s) of the Feature of Interest etc - and then profile this for different detailed structural patterns and descriptions that are common to significant communities of practice.

@makxdekkers
Contributor

I agree with @rob-metalinkage that adding specificity to the 'general' property conformsTo is the role of a profile. For example, the European DCAT-AP adds details: for Dataset, it refers to "an implementing rule or other specification" while for Distribution, it specifies "an established schema". Both fit in the general semantics of conformsTo. But if for some reason, an application would find this still too vague -- maybe because of a stronger need for validation -- the profile could create subclasses of conformsTo, e.g. conformsToSpec and conformsToSchema.

Maybe then this group could investigate whether there is a set of 'common' subproperties of conformsTo for the description of datasets that could be added to DCAT?

@sabinem
Author

sabinem commented Nov 2, 2021

@makxdekkers I think your suggestion and explanation on how to work with conformsTo is very helpful. It makes sense to me that it might mean different things as a property of a dataset and as a property of a distribution. And it does also make sense that a profile can further specify this exact meaning of it.

I wasn't aware that a profile can define subclasses. So this is also a good suggestion. I agree with @kcoyle that it helps users if properties that mean different things have different names: the subclasses would solve this.

So it seems DCAT as it is can already cover this demand: that is good to know.

@makxdekkers
Contributor

Apologies, when I wrote "the profile could create subclasses of conformsTo, e.g. conformsToSpec and conformsToSchema", that should of course have been subproperties.
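A minimal sketch of what such subproperties would buy, using plain Python tuples instead of an RDF library. The ex: names are hypothetical; conformsToSpec and conformsToSchema are the names floated above, not terms of any published vocabulary.

```python
DCT_CONFORMS_TO = "dct:conformsTo"

# Hypothetical profile-level declarations:
# ex:conformsToSchema rdfs:subPropertyOf dct:conformsTo, and likewise for Spec.
SUBPROPERTY_OF = {
    "ex:conformsToSchema": DCT_CONFORMS_TO,
    "ex:conformsToSpec": DCT_CONFORMS_TO,
}

triples = {
    ("ex:distribution1", "ex:conformsToSchema", "ex:columnsSchema"),
    ("ex:distribution1", "ex:conformsToSpec", "ex:publishingGuideline"),
}


def with_superproperties(triples, subproperty_of):
    """Add the general statement implied by each specific one (one level only)."""
    inferred = set(triples)
    for subject, predicate, obj in triples:
        if predicate in subproperty_of:
            inferred.add((subject, subproperty_of[predicate], obj))
    return inferred


for triple in sorted(with_superproperties(triples, SUBPROPERTY_OF)):
    print(triple)
```

A consumer that only understands dct:conformsTo still sees both targets, while a consumer that knows the profile can tell the schema from the specification.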

@riccardoAlbertoni added the dct:conformsTo and requires discussion (Issue to be discussed in a telecon (group or plenary)) labels Nov 2, 2021
@sabinem
Author

sabinem commented Nov 2, 2021

@makxdekkers Yes of course it is subproperties. My apologies as well. I should have applied my critical thinking skills.

@matthiaspalmer

matthiaspalmer commented Nov 17, 2021

A bit late to the game, but I agree with @kcoyle, @sabinem and several others that the main problem is knowing what conformsTo points to. Introducing subproperties as @makxdekkers is suggesting is a clear and simple way, but perhaps not flexible enough. I think it would be better to use a classifying property on the linked schema itself. The most obvious would be to use rdf:type and express the instance relation to a hierarchy of classes corresponding to a CSV on the Web schema, XML Schema, JSON Schema, Data Cube structure definition etc. However, introducing a hierarchy of classes seems to be similar in rigidity to introducing subproperties. (From what I have seen, profiles like DCAT-AP are not keen on introducing new classes; the preferred option seems to be to use vocabularies.)

Hence, another alternative (which I also find a bit more appealing) would be to use two levels of conformsTo. E.g.

someDistribution dcterms:conformsTo tabularDataSchemaInJSON
tabularDataSchemaInJSON dcterms:conformsTo CSVOnTheWebSpecification

Where CSVOnTheWebSpecification would be one of several well-known data specifications maintained as a vocabulary by those providing profiles of DCAT. The CSVOnTheWebSpecification could also be described as an instance of dcterms:Standard and perhaps utilize the profiles vocabulary for a richer description.
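A rough sketch of how a consumer could use the two levels, again with plain tuples rather than an RDF library. The ex: prefix and the extra ex:publishingGuideline target are hypothetical additions for illustration.

```python
CONFORMS_TO = "dcterms:conformsTo"

triples = [
    ("ex:someDistribution", CONFORMS_TO, "ex:tabularDataSchemaInJSON"),
    ("ex:tabularDataSchemaInJSON", CONFORMS_TO, "ex:CSVOnTheWebSpecification"),
    # A second, untyped target to show what the second level disambiguates.
    ("ex:someDistribution", CONFORMS_TO, "ex:publishingGuideline"),
]


def objects(subject, predicate, triples):
    """Return all objects of (subject, predicate, ?) in insertion order."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def targets_conforming_to(distribution, specification, triples):
    """Return the conformsTo targets of a distribution that themselves conform to the given specification."""
    return [
        target
        for target in objects(distribution, CONFORMS_TO, triples)
        if specification in objects(target, CONFORMS_TO, triples)
    ]


print(targets_conforming_to("ex:someDistribution", "ex:CSVOnTheWebSpecification", triples))
```

The distribution keeps a single, general property; what kind of thing each target is gets answered by the target's own description.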

@riccardoAlbertoni
Contributor

Thanks all for the valuable input.

In the last call, we discussed this issue, and we agreed to have a new requirement for a future version of DCAT 4 and close this issue. (see https://www.w3.org/2021/11/16-dxwgdcat-minutes#r02 )

The new requirement is now tracked by issue #1426.
