This repository contains the overview documentation describing the resources and tools developed and used by the Semantic Interoperability Community (SEMIC).
The SEMICeu GitHub space was created to support the Semantic Interoperability Community in creating data specifications (e-Government Core Vocabularies, Application Profiles, etc.). A list of these data specifications can be found on the SEMIC website. This documentation describes how the different SEMIC resources are organised in GitHub repositories, and which supporting assets need to be set up, configured, and used to realise (i.e. edit and publish) such SEMIC data specifications.
This documentation covers the state of affairs as of May 2022. The objective is to give readers sufficient starting points to begin mastering the methods and tooling used to create and manage data specifications.
For readers not acquainted with SEMIC, we recommend reading this page first and then going through the different markdown files in the following order: Actors > Artefact generation > Data model > Editorial flow > Toolchain
Other resources in the documentation are: an explanation of PURIs, the Glossary, and additional information about XSD artefact generation.
Links to relevant contents and other pages are numbered throughout the text in a bibliographic reference style (e.g. [1], [2]), and are provided in a dedicated 'Links' section at the bottom of each page.
A Status page displays the current editorial status of the different documentation pages.
The creation of a data specification is an activity that involves design decisions regarding two aspects:
- Creation or generation of the artefacts
- Publication of the artefacts
These activities are connected, but require different tools and methods. Some organisations may decide to have these activities performed by different teams. In SEMIC they are performed by the same team; however, the connection between the two activities has not yet been fully automated. This has historical reasons. SEMIC started by managing data specifications independently, with independent editorial teams producing artefacts manually. Step by step, over the past 10 years, more harmonisation occurred. The tooling documented here is a further step towards an integrated and automated flow between the creation and the publication of the artefacts. Nevertheless, the current design of the architecture reflects the organic growth of the system, during which decisions were taken so as to minimise the impact on the existing processes and methods.
There are three main types of actors that will interact with the resources and assets produced and managed by SEMIC on the SEMICeu GitHub space:
- the data specification consumers
- the data specification editors
- the supporting asset developers
The tools and methods are set up by the developers to support the editors in their work, so that together they provide a coherent experience to the consumers. Since the consumers are the "end users" of the data specifications, they will not need to interact directly with the SEMICeu GitHub space. Therefore, this documentation focuses only on the editors and developers, who constitute its target audience.
These actors are expected to perform the following actions:
- Editing of artefacts
- Publication of artefacts
The following paragraphs describe the editing and publishing activities. For more information about the user roles please check out the Actors page.
Data specifications evolve over time. This evolution is the result of addressing the use cases provided by the SEMIC community for the data specifications. During the editorial process the editors adapt the data specification, i.e. the artefacts that are part of the data specification.
The assessment of how to best address the different use cases is beyond the scope of this documentation. Nevertheless, this documentation provides insights into how a resolution for a given change is integrated in the published artefacts of the data specification.
To adequately respond to a use case, editors should understand how to operate the toolchain to efficiently create the artefacts for a data specification. (For more details, consult section 4. Usage.)
The created artefacts are always represented in a human readable form (e.g. HTML) together with other supportive representations (JSON-LD context files, SHACL shapes, RDF vocabularies, etc.), depending on the nature of the data specification.
The editorial activity typically requires several iterations in order to reach the final resolution. The toolchain reduces the workload that these iterations place on editors, makes the artefact creation less error-prone, and at the same time increases the coherency among the different data specifications. It can also play an important role in training future editors to maintain the data specifications.
After creating the artefacts, the editor has to make them accessible to the consumers. For this the artefacts have to be stored in a publication environment.
Currently, the publication environment that has been adopted by SEMIC is GitHub: one repository per data specification. Over the past years, the organisation and structure of these repositories underwent a gradual harmonisation process. Moreover, the consumers' access to the content of these repositories is facilitated through GitHub Pages, a service offered to render webpages stored in a repository.
Since these repositories were created before automated artefact generation tooling was applied, they are set up and organised with the assumption that their content is created manually. Fully integrating this publication design into the automation process requires adaptations to the automation, organisation and setup of the data specification repositories. This is future work. Therefore, editors are today required to manually collect the generated artefacts and store them in the appropriate locations within the data specification repository.
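The manual collection step can be supported by a small helper. The following is only a sketch, not part of the SEMIC toolchain: it assumes that the generated artefacts sit in a single local folder and that the repository clone uses the releases/<version>/ layout described for the Core Vocabulary repositories. The function name and both folder arguments are hypothetical.

```python
import shutil
from pathlib import Path


def store_release(generated_dir: Path, repo_dir: Path, version: str) -> Path:
    """Copy generated artefacts into releases/<version>/ of a local repository clone.

    Assumption: the release folder layout is releases/<version>/; adjust the
    target path if the repository is organised differently.
    """
    target = repo_dir / "releases" / version
    # dirs_exist_ok=True (Python 3.8+) lets a repeated run refresh an existing folder
    shutil.copytree(generated_dir, target, dirs_exist_ok=True)
    return target
```

The copied files still have to be committed and pushed by the editor; the helper only removes the error-prone copying step.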
The SEMIC GitHub space contains several dozen repositories, each falling into a specific category. There are data specification repositories, repositories supporting the generation and publication of the data specifications, and others.
The data specification repositories are tagged with the topic data-specification in GitHub. The full list of SEMICeu repositories tagged with data-specification can be found here.
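The same list can also be retrieved programmatically. As a minimal sketch, the function below builds a query URL for the GitHub REST repository search endpoint, combining the org: and topic: search qualifiers. It assumes that the organisation account is named SEMICeu, as used throughout this documentation; the function name is hypothetical, and the request itself (and any authentication or paging) is left to the caller.

```python
from urllib.parse import urlencode

# GitHub REST API endpoint for searching repositories
GITHUB_SEARCH = "https://api.github.com/search/repositories"


def topic_search_url(org: str, topic: str) -> str:
    """Build a search URL for all repositories of `org` tagged with `topic`."""
    query = f"org:{org} topic:{topic}"
    # urlencode percent-encodes the colons and turns the space into '+'
    return f"{GITHUB_SEARCH}?{urlencode({'q': query})}"


if __name__ == "__main__":
    print(topic_search_url("SEMICeu", "data-specification"))
```

Passing a different topic to the same function selects another category of repositories.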
Each data specification follows its own life-cycle, and so does its repository. Nevertheless, as part of the editorial process multiple data specification repositories might be updated to address a given issue.
A subset of the data specification repositories that are set up for the editing of Core Vocabularies is tagged with the topic core-vocabulary. Here is the current list of Core Vocabulary repositories.
The Core Vocabulary repositories (e.g. Core Person Vocabulary) are organised as follows:
- Webinars: any documentation, meeting minutes, or recordings relating to a webinar on the data specification
- releases: the data specification releases, organised per version. Each release is identified by its version number, and has a dedicated sub-folder. The content of a specific data specification release version folder (e.g. version 2.00 of the Core Person Vocabulary) is not fixed. The structure of the latest release will conform to the setup of the artefact generating toolchain. It is assumed that there is an index.html file, as the data specification is then rendered using GitHub Pages, a free service for public open source GitHub repositories (e.g. version 2.00 of the Core Person Vocabulary would be published at https://semiceu.github.io/Core-Person-Vocabulary/releases/2.00/).
Further governance agreements regarding the structure of the Core Vocabulary repositories are not made.
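The published address of a release follows directly from the repository name and the version folder. The sketch below reproduces the pattern of the example URL given above; the function name is hypothetical, and the pattern only holds for repositories that use the releases/<version>/ layout.

```python
def release_url(repository: str, version: str, owner: str = "semiceu") -> str:
    """Return the GitHub Pages URL of a release folder.

    Assumption: the repository is published with GitHub Pages from its root,
    so that releases/<version>/index.html is served at this address.
    """
    return f"https://{owner}.github.io/{repository}/releases/{version}/"


if __name__ == "__main__":
    print(release_url("Core-Person-Vocabulary", "2.00"))
```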
Current list of Core Vocabulary repositories
Example of data specification repository tagged 'core-vocabulary'| Core Person Vocabulary
Example of data specification release version folder | Core Person Vocabulary
The data specification repositories that are set up for the editing of Application Profiles are tagged with the topic application-profile. Here is the current list of Application Profile repositories.
Some Application Profile repositories follow the same organisational structure as the Core Vocabularies, but others follow a different structure. This has historical reasons.
When a new release of such an Application Profile is being prepared, restructuring the content in accordance with the structure described above is highly recommended.
Current list of Application Profile repositories
The repositories that provide the support for generating and publishing data specifications are tagged with the topic tooling.
These repositories have no common organisational structure; however, their README.md file should provide a good overview of their organisation and usage.
Additional information about what these tools are meant to do and when and how to use them can be found in the Toolchain page.
Full list of repositories tagged 'tooling'
The toolchain is an online service built from a collection of GitHub repositories in the SEMIC GitHub space. The repositories are interconnected through automated processing. No local installation is required, apart from a UML editing tool.
The basic idea behind the toolchain service is to treat a data specification as software source code. Software development has a long history of tooling and approaches for managing code changes made by different contributors. The rise of Open Source software development in the past decades has made the tools and approaches for building software freely and reliably available to anybody. One of the best practices within software development is to build tools that relieve developers of repetitive work and reduce other risks that are part of the software development process. Here, the toolchain relieves editors of manually building the artefacts by exploiting software building tools. As a consequence, editors must familiarise themselves with the software development mindset.
Using the toolchain for editing data specifications requires editors to have insight into:
- the interplay between the GitHub repositories (see the chapter on the Editorial flow)
- what data specifications are, and what information they are built from (see the chapter on the Data specifications)
- what persistent unique identifiers (PURIs) are, and how their use is supported by SEMIC (see the chapter on Persistent identifiers)
- the deployed tooling for the generation of artefacts (see the chapter on the Toolchain)
The purpose of these chapters is to introduce editors, but also developers, to foundational aspects of the editorial flow and how it is supported by the toolchain.
Note that the documentation is not a reflection on design decisions. Only those decisions are included that help improve the understanding of how the tools and processes work. The documentation merely provides a description of the current state of affairs, with references to further information where this might be necessary for a given task. As a result, the documentation may raise valuable (design) questions without answering them here. For instance, UML data modeling guidelines are not part of this documentation, despite being valuable knowledge for an editor. However, such aspects might be included in this documentation in the future.
The chapters are written with the assumption that the reader has at least a basic knowledge of the Semantic Web, UML modeling, GitHub and software development. Despite the efforts to make the text understandable for readers with diverse backgrounds, readers might encounter parts using unfamiliar terminology or approaches. If this blocks the reader, it may be helpful to perform the editorial flow as a hands-on exercise, to experience the expressed ideas. Watching the included screen-casts might also be helpful.
The different chapters offer different perspectives on creating and publishing data specifications. Each chapter can be read, to a large extent, independently. Nevertheless, they might use terms or refer to knowledge that is explained in more depth in other chapters. Cross-referencing among chapters enables readers to find these explanations more quickly. This organisation keeps the documents concise and easy for the reader to process.