Retrieving, publishing and loading data in a single DCAT-centric tool.
One could say that DCAT is to datasets what the POM is to Java software projects.
Question: How many commands does it take to load the 50+ files of this CKAN record into a Virtuoso triple store (with default port and credentials)?
Answer: 2
dcat import ckan --url=http://ckan.qrowd.aksw.org --dataset=org-linkedgeodata-osm-bremen-2018-04-04 > /tmp/dcat.nt
dcat deploy virtuoso --allowed=/writeable/dir/readable/by/virtuoso /tmp/dcat.nt
Note: This also works for the DCAT-based DBpedia DataID datasets:
dcat show http://downloads.dbpedia.org/2016-10/core-i18n/en/2016-10_dataid_en.ttl > /tmp/dcat.ttl
Question: And how do I create a graph group so I can view all these files as a single graph?
Answer: It already happened
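The two commands from the first answer can be wrapped into a small shell function. This is only a sketch: the function name import_and_load and its positional parameters are hypothetical, while the dcat subcommands and options are the ones shown above.

```shell
# Hypothetical wrapper around the two commands shown above.
# Usage: import_and_load CKAN_URL DATASET_ID ALLOWED_DIR [OUTPUT_FILE]
import_and_load() {
  ckan_url="$1"
  dataset="$2"
  allowed_dir="$3"            # directory Virtuoso is allowed to read from
  out="${4:-/tmp/dcat.nt}"    # where the retrieved DCAT description is stored

  # Step 1: retrieve the DCAT description of the CKAN dataset; stop on failure.
  dcat import ckan --url="$ckan_url" --dataset="$dataset" > "$out" || return 1

  # Step 2: load the files referenced by the description into Virtuoso.
  dcat deploy virtuoso --allowed="$allowed_dir" "$out"
}

# Example call (not executed here):
# import_and_load http://ckan.qrowd.aksw.org org-linkedgeodata-osm-bremen-2018-04-04 /writeable/dir/readable/by/virtuoso
```

Stopping after a failed import avoids handing an empty or partial description to the deploy step.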
Question: So I have this DCAT file with dcat:downloadURL pointing to local files. How can I publish it to CKAN?
Answer: Like this:
dcat deploy ckan --url=http://ckan.example.org --apikey=my-ckan-api-key dcat.nt
Installing as root will perform a global install in the folders /usr/local/share/dcat-suite and /usr/local/bin.
For non-root users, the folders are ~/Downloads/dcat-suite and ~/bin.
Run setup-latest-release.sh uninstall to conveniently remove downloaded and generated files.
- via curl
bash -c "$(curl -fsSL https://raw.githubusercontent.com/SmartDataAnalytics/dcat-suite/develop/setup-latest-release.sh)"
- via wget
bash -c "$(wget -O- https://raw.githubusercontent.com/SmartDataAnalytics/dcat-suite/develop/setup-latest-release.sh)"
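A non-root install uses ~/bin, which is not on the PATH of every shell. The following check is a sketch under that assumption; path_contains is a hypothetical helper and not part of dcat-suite.

```shell
# Hypothetical helper: succeed if directory $1 is an entry of the
# colon-separated, PATH-like string $2.
path_contains() {
  case ":$2:" in
    *":$1:"*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Print a hint if the non-root install location ~/bin is not on PATH.
if ! path_contains "$HOME/bin" "$PATH"; then
  echo "Note: $HOME/bin is not on PATH; consider adding: export PATH=\"\$HOME/bin:\$PATH\""
fi
```

Wrapping both strings in colons makes the match exact, so an entry such as ~/bin2 is not mistaken for ~/bin.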
API | DCAT retrieval | Deploy RDF | Deploy non RDF |
---|---|---|---|
CKAN | X | X | x |
Virtuoso RDF Bulk Loader | . | X | n/a |
Generic SPARQL | . | . | |
URL to DCAT resource | X | n/a | n/a |
. = future work
Here is a short example of a DCAT dataset description in order to give you an impression of what we are talking about.
@prefix eg: <http://example.org/> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .
eg:myDataset
a dcat:Dataset ;
dct:identifier "my-dataset" ;
dct:title "My Dataset" ;
dct:description "Really useful dataset" ;
dcat:distribution eg:myFirstDistribution-of-myDataset ;
.
eg:myFirstDistribution-of-myDataset
a dcat:Distribution ;
dct:title "My Distribution" ;
dct:description "Download of my distribution" ;
dcat:accessURL <a/relative/path/or/a/url/of/a/web/resource/or/a/named/graph> ;
.
- Show help
dcat --help
- Show all DCAT related information from an RDF URI or filename
dcat show my-dcat.nt
- Deploy datasets based on a DCAT description to CKAN
dcat deploy ckan --apikey=yourApiKey --url=yourCkanUrl my-dcat.nt
This will create a copy of the input DCAT file under target/ckan/deploy-dcat.nt with the dcat:accessURL values replaced by those of the CKAN resources. If you host this file anywhere on the Web, it will give you working download links - neat!
- Deploy a self-describing dataset (see below) to CKAN
dcat deploy ckan --apikey=yourApiKey --url=yourCkanUrl mySelfDescribingDataset.nq
- Expand the graphs of a self-describing dataset to individual files based on its contained DCAT description
dcat expand mySelfDescribingDataset.nq
# Now you can also deploy the expanded form:
cd target/dcat/mySelfDescribingDataset
dcat deploy ckan dcat.nt --url=yourCkanUrl --apikey=yourSecretKey
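To check the result of a CKAN deployment, the access URLs in the generated target/ckan/deploy-dcat.nt can be listed with standard tools. This is a sketch, assuming the file is line-based N-Triples as its .nt extension suggests; list_access_urls is a hypothetical helper.

```shell
# Hypothetical helper: print the object IRI of every dcat:accessURL triple
# in an N-Triples file (one triple per line: <s> <p> <o> .).
list_access_urls() {
  grep -F '<http://www.w3.org/ns/dcat#accessURL>' "$1" \
    | awk '{ print $3 }' \
    | tr -d '<>'
}

# Example call after "dcat deploy ckan ..." (not executed here):
# list_access_urls target/ckan/deploy-dcat.nt
```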
mvn clean install
After the build run
./reinstall-debs.sh
A self-describing dataset (SDD) is simply a quad-based dataset that contains DCAT dataset and distribution information in its default graph.
The dcat:accessURL attribute of distributions is thereby interpreted as follows:
- If at least one of the given accessURLs matches the IRI of a graph within the SDD, ckan-deploy will deploy an RDF file to CKAN that is the union of all graphs denoted by accessURLs. An error will be raised if any other accessURL points to a non-existent graph.
- If there is at most one accessURL, a CKAN resource will be created, with the URL attribute set if present.
- An error is raised otherwise.
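The three cases above can be restated as a small decision function. This is an illustrative sketch of the rule as described here, not the code used by ckan-deploy; the function name sdd_action and the result labels union, resource and error are made up for this example.

```shell
# Hypothetical sketch of the accessURL rule for one distribution.
# $1: newline-separated list of graph IRIs present in the SDD
# remaining arguments: the accessURLs of the distribution
sdd_action() {
  graphs="$1"; shift
  total=$#
  matched=0
  for url in "$@"; do
    # Count accessURLs that are exactly the IRI of a graph in the SDD.
    if printf '%s\n' "$graphs" | grep -Fxq -- "$url"; then
      matched=$((matched + 1))
    fi
  done

  if [ "$matched" -ge 1 ]; then
    # Case 1: every accessURL must denote an existing graph.
    if [ "$matched" -eq "$total" ]; then echo union; else echo error; fi
  elif [ "$total" -le 1 ]; then
    # Case 2: zero or one plain URL becomes a CKAN resource.
    echo resource
  else
    # Case 3: several accessURLs, none of which is a graph.
    echo error
  fi
}

graphs="$(printf '%s\n' http://example.org/g1 http://example.org/g2)"
sdd_action "$graphs" http://example.org/g1 http://example.org/g2      # union
sdd_action "$graphs" http://example.org/file.csv                      # resource
sdd_action "$graphs" http://example.org/g1 http://example.org/missing # error
```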
You can use your favourite RDF tool.
Shameless self-advertisement: Sparql Integrate is a tool that enables expressing data integration workflows as a sequence of SPARQL queries that make use of function extensions for XML, CSV and JSON processing. Hence, it makes it fairly easy to create quad-based datasets. You only need to design your workflow such that it outputs appropriate DCAT descriptions.
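For very small datasets, a self-describing dataset can also be written by hand. The following sketch creates a minimal N-Quads file whose default graph holds the DCAT description and whose single named graph holds the data. All example.org IRIs and the output location are made up for illustration.

```shell
# Write a minimal, hand-made self-describing dataset (all IRIs are made up).
sdd_file="${TMPDIR:-/tmp}/mySelfDescribingDataset.nq"
cat > "$sdd_file" <<'EOF'
<http://example.org/myDataset> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/dcat#Dataset> .
<http://example.org/myDataset> <http://www.w3.org/ns/dcat#distribution> <http://example.org/myDist> .
<http://example.org/myDist> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/dcat#Distribution> .
<http://example.org/myDist> <http://www.w3.org/ns/dcat#accessURL> <http://example.org/graph1> .
<http://example.org/s1> <http://example.org/p> "hello" <http://example.org/graph1> .
<http://example.org/s2> <http://example.org/p> "world" <http://example.org/graph1> .
EOF

# Statements without a graph term have 4 whitespace-separated fields here,
# quads have 5 (this only holds because the literals contain no spaces).
awk 'NF == 4 { d++ } NF == 5 { n++ }
     END { print d " default-graph triples, " n " named-graph quads" }' "$sdd_file"
# prints: 4 default-graph triples, 2 named-graph quads

# The file could then be deployed as shown earlier (not executed here):
# dcat deploy ckan --apikey=yourApiKey --url=yourCkanUrl "$sdd_file"
```

The accessURL of the distribution is the IRI of the named graph, so this file falls under the first case of the interpretation rule.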
This example assumes that the Debian packages of ckan-deploy and sparql-integrate are installed.
cd /tmp
git clone https://github.com/QROWD/QROWD-RDF-Data-Integration.git qrowd-rdf-data-integration
cd qrowd-rdf-data-integration/datasets/1046-1051
sparql-integrate workloads.sparql process.sparql emit.sparql > dataset.nq
dcat deploy ckan --url=yourCkanInstance --apikey=yourApiKey dataset.nq
The dataset entry on our CKAN: http://ckan.qrowd.aksw.org/dataset/trento-railway-time-tables
For explanations of the transformations using the *.sparql files, please refer to this page.
These commands are not yet implemented, but appear to be useful. These descriptions are not final.
- Generate a meta dcat file that treats another dcat file as a dataset. The meta file can be used to deploy the described file.
dcat meta my-datasets.dcat.nt > meta.dcat.nt
- Upload an RDF file via SPARQL Update
dcat deploy sparql --user=dba --pass=dba --url=http://example.org/sparql dcat.nt
- Add support for user agent field on upload
- Possibly add support for profiles that bundle commonly needed information, such as apikey and user agent