Options for metadata to describe tabular data in a more structured way? #1418
Comments
@sabinem As far as I understand, DCAT specifies a way to describe datasets in a catalogue, not the data inside a dataset. It defines a way to describe characteristics like what it is about, who is responsible, and for the distribution where it is and what format it is in. DCAT does not look inside the data in the dataset in any detail. This is, in my mind, because DCAT does not want to limit the kind of datasets it can describe. Data may indeed often be tabular but there are many cases where it is not. Other groups at W3C have concentrated on the data, and the one that would interest you the most I guess is CSV on the Web, but there are others like Data Cube for multi-dimensional data like statistics, Spatial Data on the Web and others. DCAT provides a way to describe these various types of datasets with different kinds of data together in a catalogue.
And by the way, DCAT specifies that the property dct:conformsTo can be used on a Distribution "to indicate the model, schema, ontology, view or profile that this representation of a dataset conforms to".
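To make the `dct:conformsTo` usage concrete, here is a minimal sketch of a distribution description built as a JSON-LD-style Python dictionary. Every `example.org` URL and the `trees.csv` file are hypothetical placeholders, not part of DCAT or of this thread; only the `dcat:` and `dct:` property names come from the vocabularies themselves.

```python
import json

# All example.org identifiers below are made-up placeholders for illustration.
distribution = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@id": "https://example.org/dataset/1/distribution/csv",
    "@type": "dcat:Distribution",
    "dcat:downloadURL": {"@id": "https://example.org/data/trees.csv"},
    # Points at whatever model, schema, ontology, view or profile the file follows.
    "dct:conformsTo": {"@id": "https://example.org/schemas/trees-schema"},
}

serialized = json.dumps(distribution, indent=2)
print(serialized)
```

Note that nothing in this description tells a consumer what kind of resource sits behind the `dct:conformsTo` link, which is the concern raised later in the thread.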
It makes sense to me that the catalog entry would include a link to the specification that defines the dataset structure and semantics. Not providing this link as part of the catalog entry seems to be a real disservice, causing users to hunt around to discover if such a document exists. I don't think |
@makxdekkers and @kcoyle thank you both for your answers. @makxdekkers I do understand that datasets differ, and I also know that some distribution formats have metadata built into them (shapefiles, for example, can come with metadata). I also get that DCAT does not want to restrict the kinds of dataset, but I still think it would be great if it came with an optional hint on how to add metadata to distributions that don't come with metadata of their own. The reason is that data publishers, and probably everybody, tend to focus on the things that are explicitly mentioned: it is just normal to forget about everything that is not mentioned. As a result, there are many open datasets today with tabular data where the field descriptions are missing. I agree with @kcoyle that In my opinion it would be helpful to have the possibility to add an attribute similar to the one in DCAT-US:
Good points @sabinem @kcoyle. In the case of CSV, which is the case that @sabinem raises, my worry would be that if DCAT tries to find a solution for describing the meaning of the columns in the file, we are in fact doing something in parallel to the work done by the CSV on the Web working group, and in particular the W3C recommendation Metadata Vocabulary for Tabular Data which looks like a really thorough (and standard!) solution for this. In this case, could our approach be to explain how to use that recommendation in conjunction with DCAT? The more general remark that I have is that I would not want to just add one or more properties as a quick solution without a thorough analysis of the use cases, the expectations of publishers and users, interoperability aspects and the availability of existing standards to meet the requirements.
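For readers who have not seen the Metadata Vocabulary for Tabular Data, the following is a rough sketch of what a column description in that style might look like, written as a Python dictionary. The file name `trees.csv`, the column names and the descriptions are invented for illustration; the helper `describe_columns` is hypothetical and not part of any library.

```python
import json

# A minimal CSVW-style metadata document for a hypothetical trees.csv.
table_metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "trees.csv",
    "tableSchema": {
        "columns": [
            {
                "name": "tree_id",
                "titles": "Tree ID",
                "datatype": "integer",
                "dc:description": "Identifier of the tree in the municipal register.",
            },
            {
                "name": "species",
                "titles": "Species",
                "datatype": "string",
                "dc:description": "Botanical name of the tree species.",
            },
        ],
        "primaryKey": "tree_id",
    },
}


def describe_columns(metadata):
    """Return a mapping from column name to its human-readable description."""
    columns = metadata["tableSchema"]["columns"]
    return {col["name"]: col.get("dc:description", "") for col in columns}


print(json.dumps(describe_columns(table_metadata), indent=2))
```

This is the per-column meaning the original question asks about; the open point in the thread is how a DCAT catalogue entry should point to such a document.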
Exactly, that is what I mean. See here on the location of metadata for a CSV file in the W3C recommendation Metadata Vocabulary for Tabular Data: One option there is to link the metadata by a URL. But how will a user be able to locate that metadata description? The instructions don't seem very straightforward to me. Why can't DCAT offer that as an optional property:
I am just wondering: can this be made into a DCAT property "dcat:describedby" or similar, with domain resource and the usage note linking to the W3C recommendation Metadata Vocabulary for Tabular Data? This would help a lot:
With a hook like this the work would not be duplicated, just built upon. I think it would make sense to reach out to the people behind the tabular data work and ask them how it can best be integrated into DCAT, and whether a property such as the one proposed above would make sense to them.
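To illustrate why locating the metadata can feel indirect, here is a small sketch of the default lookup as I read the "locating metadata" section of the Tabular Data Model: when no metadata is supplied directly or via a `Link` header, a client tries `{+url}-metadata.json` and then `csv-metadata.json` relative to the CSV file. The function name and the `example.org` URL are hypothetical, and the sketch ignores site-wide configuration under `/.well-known/csvm`.

```python
from urllib.parse import urldefrag, urljoin


def default_metadata_candidates(csv_url):
    """List the default metadata locations to try for a CSV file, in order."""
    base = urldefrag(csv_url).url  # drop any #fragment before building candidates
    return [
        base + "-metadata.json",             # file-specific metadata
        urljoin(base, "csv-metadata.json"),  # directory-level metadata
    ]


for candidate in default_metadata_candidates("https://example.org/data/trees.csv"):
    print(candidate)
```

A catalogue-level property pointing straight at the metadata document would spare the user this guessing step, which is the motivation for the hook proposed above.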
@sabinem I think this is a good discussion for a potential addition in DCAT4. I would, however, prefer to make the discussion wider than just for CSV files. The more general use case would be that the file accessed through the downloadURL is not self-describing but is associated with one or more resources that describe various aspects necessary to be able to understand and/or process the file. We can then see if we can find a general solution that helps interoperability. Would that make sense @pwin, @riccardoAlbertoni, @dr-shorthair, @agbeltran, @davebrowning, @andrea-perego, or would it be out of scope?
I disagree. There is nothing to stop you using On the wider issue: I see the tabular data descriptions as complementary to DCAT. |
The "correct" approach according to the scoping discussions would be to define a profile of DCAT that defines a canonical structural description. RDF-QB is probably the leading candidate for such a description, and the StatDCAT-AP profile is a potential starting point; see: https://joinup.ec.europa.eu/collection/semantic-interoperability-community-semic/solution/statdcat-application-profile-data-portals-europe/release/100 The question is whether there should be a profile hierarchy:
(this would make simple declarative statements about the expected interoperability as well as provide logical extension points for new profiles to suit any similar requirements without re-inventing these patterns.) For each profile, SHACL validations, JSON schema, JSON-LD contexts etc. could be implemented in a modular fashion to simplify implementation. (This is just extending what the DCAT-AP community is already doing toward a scalable approach to support machine access and readability of interoperability requirements.) Note that publishing profiles of DCAT is explicitly out of scope of this working group, but there must be a sizeable community who would have an interest in doing that in a coherent way. It would be helpful if the DXWG could pass off these matters to an appropriate forum.
You can point to a detailed schema. You could also point to a very general profile document. How would a user know which it is? It's the OR that is the problem in that definition. I read the opening comment here as someone wanting to obtain a specific type of resource that gives the schema describing the data in the dataset, and |
Maybe there are a couple different use cases here:
I suspect it will be a very common pattern to "promote" up out of detailed structural and semantic descriptions one or more simple properties to support faceted searching, such as variableMeasured, dct:subject etc. Since we don't want to mandate the form of those structural descriptions, profiles of DCAT that assume different structural descriptions could define an entailment rule (property chain, SHACL-AF rule etc.) to derive these from a particular form of detailed description. So one could envisage a profile of DCAT for observation collections, where the simple properties include the ObservedProperty, the type(s) of the Feature of Interest etc., and then profile this for different detailed structural patterns and descriptions that are common to significant communities of practice.
I agree with @rob-metalinkage that adding specificity to the 'general' property Maybe then this group could investigate whether there is a set of 'common' subproperties of |
@makxdekkers I think your suggestion and explanation on how to work with I wasn't aware that a profile can define subclasses. So this is also a good suggestion. I agree with @kcoyle that it helps users if properties that mean different things have different names: the subclasses would solve this. So it seems DCAT as it is can already cover this demand: that is good to know. |
Apologies, when I wrote "the profile could create subclasses of |
@makxdekkers Yes, of course it is subproperties. My apologies as well. I should have applied my critical thinking skills.
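The subproperty idea discussed above can be sketched in a few lines. This assumes the property being specialised is `dct:conformsTo`, which is the property introduced earlier in the thread; the two `example.org/profile#` properties are hypothetical names that a profile might define, and the function implements only a single level of `rdfs:subPropertyOf` entailment.

```python
DCT_CONFORMS_TO = "http://purl.org/dc/terms/conformsTo"

# Hypothetical subproperties a DCAT profile could declare.
EX_CONFORMS_TO_SCHEMA = "https://example.org/profile#conformsToSchema"
EX_CONFORMS_TO_PROFILE = "https://example.org/profile#conformsToProfile"

SUBPROPERTY_OF = {
    EX_CONFORMS_TO_SCHEMA: DCT_CONFORMS_TO,
    EX_CONFORMS_TO_PROFILE: DCT_CONFORMS_TO,
}


def entail_superproperties(triples):
    """Add the parent-property triple for every triple using a known subproperty."""
    inferred = set(triples)
    for subject, predicate, obj in triples:
        parent = SUBPROPERTY_OF.get(predicate)
        if parent is not None:
            inferred.add((subject, parent, obj))
    return inferred


triples = {
    (
        "https://example.org/dataset/1/distribution/csv",
        EX_CONFORMS_TO_SCHEMA,
        "https://example.org/schemas/trees-schema",
    )
}
for triple in sorted(entail_superproperties(triples)):
    print(triple)
```

A client that only understands plain DCAT still sees the general link, while a client that knows the profile can tell a detailed schema from a general profile document by the property name.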
A bit late to the game, but I agree with @kcoyle, @sabinem and several others that the main problem is knowing what Hence, another alternative (which I also find a bit more appealing) would be to use two levels of
Where CSVOnTheWebSpecification would be one of several well-known data specifications maintained as a vocabulary by those providing profiles of DCAT. The CSVOnTheWebSpecification could also be described as an instance of dcterms:Standard and perhaps use the profiles vocabulary for a richer description.
Thanks all for the valuable input. In the last call, we discussed this issue, and we agreed to have a new requirement for a future version of DCAT 4 and close this issue (see https://www.w3.org/2021/11/16-dxwgdcat-minutes#r02). The new requirement is now tracked by issue #1426.
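One possible reading of the two-level pattern, sketched under the assumption that both levels use `dct:conformsTo`: the distribution points to its concrete schema document, and that document in turn points to a well-known specification resource that says what kind of schema it is. All identifiers here are hypothetical, including the URL chosen for `CSVOnTheWebSpecification`, and `schema_kinds` is an illustrative helper.

```python
DIST = "https://example.org/dataset/1/distribution/csv"
SCHEMA = "https://example.org/schemas/trees-metadata.json"
CSVW_SPEC = "https://example.org/specs/CSVOnTheWebSpecification"

# A tiny in-memory graph: resource identifier -> properties.
graph = {
    DIST: {"@type": "dcat:Distribution", "dct:conformsTo": [SCHEMA]},
    SCHEMA: {"dct:conformsTo": [CSVW_SPEC]},  # second level: kind of schema
    CSVW_SPEC: {"@type": "dct:Standard"},
}


def schema_kinds(graph, distribution_id):
    """Map each schema of a distribution to the specifications it conforms to."""
    result = {}
    for schema_id in graph.get(distribution_id, {}).get("dct:conformsTo", []):
        result[schema_id] = graph.get(schema_id, {}).get("dct:conformsTo", [])
    return result


print(schema_kinds(graph, DIST))
```

With this shape a consumer can filter for distributions whose schema is of a known kind without DCAT itself having to enumerate those kinds.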
I am just wondering, why does DCAT not help in a more structured way with providing metadata about tabular data?
A lot of data is provided as CSV files, which would then be dcat:Distributions. For a data user it is of great importance to understand the tabular structure of these data files. So in the CSV file example: what is the meaning of each column?
DCAT provides an unstructured dct:description for this purpose, but as far as I know there is no support for describing tabular data in a more structured way with it. Or am I wrong about this? In case I am right: why did DCAT go this way?
What I found on describing metadata of tabular data was this: https://www.w3.org/TR/tabular-data-model/#locating-metadata
I also found a property in DCAT-US, https://resources.data.gov/resources/dcat-us/#distribution-describedBy, that might help in this regard.
This issue is mainly about understanding.
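For comparison, this is roughly how the DCAT-US `describedBy` field mentioned above could appear on a distribution entry, shown as a Python dictionary. The URLs and values are placeholders, and the helper `data_dictionary` is hypothetical; only the field names `describedBy` and `describedByType` are taken from the DCAT-US schema.

```python
# A DCAT-US style distribution entry with placeholder values.
distribution = {
    "@type": "dcat:Distribution",
    "downloadURL": "https://example.org/data/trees.csv",
    "mediaType": "text/csv",
    # Link to the data dictionary for the file at downloadURL.
    "describedBy": "https://example.org/data/trees.csv-metadata.json",
    # Media type of the data dictionary itself.
    "describedByType": "application/json",
}


def data_dictionary(dist):
    """Return (url, media_type) of the data dictionary, or (None, None) if absent."""
    return dist.get("describedBy"), dist.get("describedByType")


print(data_dictionary(distribution))
```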