CLDF is designed to maximize re-usability of linguistic data. Often, though, reuse of linguistic data does not mean feeding it into analysis tools, but using it in descriptions as examples. To pave the path for supporting this usage scenario, CLDF Markdown specifies how to add references to CLDF data objects in text.
CLDF Markdown is supposed to work in much the same way that BibTeX functions for citations in LaTeX documents. I.e.
- CLDF defines a file format for the data - like the BibTeX format.
- CLDF Markdown specifies how to reference data objects - like the
\cite
command in LaTeX. - CLDF Markdown implementations render CLDF Markdown documents into target formats.
CLDF Markdown is regular markdown where appropriately formatted URLs specified in markdown links are interpreted as references to objects in a CLDF dataset.
The link syntax to reference a single object, i.e. a row in a CLDF table looks as follows:
[An arbitrary label](some/path/<component-name-or-csv-filename>#cldf:<obect-id>)
e.g.
[Example 1](ExampleTable#cldf:ex1)
All rows in a table can be referenced using the special __all__
identifier.
[An arbitrary label](some/path/<component-name-or-csv-filename>#cldf:__all__)
Only the last component of the URL path is used to determine a CLDF component or table of the dataset, while the first part is ignored. This allows using URLs that are even somewhat functional in unrendered document. E.g.
[Meier 2020](cldf/sources.bib#cldf:Meier2020)
will render as Meier 2020
, linking to the BibTeX file when the document is simply rendered as markdown by
a service like GitHub (while the enhanced document created from a CLDF Markdown renderer may replace the link with
the reference data expanded to a full citation according to the Unified Stylesheet for Linguistics).
Whether to use the somewhat more transparent CLDF component names for CLDF Markdown references or the respective filenames of the "data on disk" is up to the author of a CLDF Markdown document.
To reference objects in multiple datasets, the fragment identifier MUST start with cldf-dsid:
, where dsid
is
an identifier that can be mapped to a dataset by the processing application. To avoid ambiguities due to the dataset
identifier being used as part of a URL, it is limited to contain only ASCII alphanumerical characters and _
, i.e.
must match the regular expression [a-zA-Z0-9_]+
.
For example CLDF Markdown referencing data from WALS as well as
APiCS could use dataset identifiers wals
and apics
respectively:
See [Voorhoeve](Source#cldf-apics:1559) and [Abbott](Source#cldf-wals:Abbott-2000)
The mapping of dataset identifiers to dataset locators (see dataset discovery) must be passed into the processing application, or provided in the CLDF Markdown via YAML frontmatter.
CLDF Markdown processors MAY accept input to customize rendering of CLDF objects. This additional information can be passed in the query string of the URL. CLDF Markdown processors MUST ignore unknown parameters (to make CLDF Markdown documents renderable with multiple processors) - but MAY issue a warning about this (to give the user feedback about accepted parameters).
For example a CLDF Markdown formatter may offer different ways to render an example, e.g. including the Primary_Text or not. Such a switch could be controlled with a query parameter as follows:
[Example 1](ExampleTable?with_primaryText#cldf:ex1)
Options requiring a value, can be passed using the key=value
notation, e.g.
[Example 1](ExampleTable?with=primaryText#cldf:ex1)
And multiple parameters can be passed using &
as separator, e.g.
[Example 1](ExampleTable?with=primaryText&showReferences#cldf:ex1)
CLDF datasets may come with rich metadata, serialized as JSON. To reference (and insert) bits and pieces of this
metadata in CLDF Markdown documents, the special component name Metadata
(or the name of the JSON metadata
file) may be used in the URL of a CLDF Markdown link. The particular piece of metadata must be selected using
JMESPath syntax.
So in the simplest case this could be [](Metadata#cldf:"dc:license")
. A more complex example would be
[LaguageTable](Metadata#cldf:tables[?"dc:conformsTo"=='http://cldf.clld.org/v1.0/terms.rdf#LanguageTable'].url | [0])
to select the filename used for the LanguageTable
component of the dataset.
In addition to the CLDF Markdown link syntax, a text document may provide a mapping of prefixes to dataset locators
(see dataset discovery) - in analogy to XML namespaces - as
cldf-datasets
key of a YAML frontmatter block. The value of this key may
be a single locator string in case a single dataset is referenced, or a YAML mapping of
prefixes to dataset locators if multiple datasets are referenced.
So a CLDF Markdown document referencing data from WALS as well as APiCS could declare this in frontmatter as follows:
---
cldf-datasets: {
wals: https://doi.org/10.5281/zenodo.6806407#rdf:ID=wals
apics: https://doi.org/10.5281/zenodo.3823888#rdf:ID=apics
}
---
CLDF Markdown processors SHOULD allow to override dataset mappings from YAML frontmatter in order to make processing possible with local data rather than fetching remote data each time.
A typical use case for CLDF Markdown are short descriptions of the contents of a database, for example the feature descriptions of the WALS database (e.g. feature 28A). Ideally, these descriptions would be included in the CLDF dataset, too. This could be done in two ways:
- Include CLDF Markdown text as content of a column in a CLDF table. In this case, the respective column SHOULD be
marked adding the following properties to its column description:
"dc:format": "text/markdown", "dc:conformsTo": "CLDF Markdown",
- Reference CLDF Markdown documents using items in a CLDF MediaTable. In this case, these items
SHOULD have the value
text/markdown
in the Media_Type column, and the valueCLDF Markdown
in an additional column of the MediaTable withpropertyUrl
http://purl.org/dc/terms/conformsTo.
If no explicit dataset mapping via YAML frontmatter is given in such CLDF Markdown text, it MUST be rendered by CLDF Markdown renderers with the containing dataset as single data source.