- Introduction
- nyaml Workflow
- How to Use nyaml Tool
- Conversion from YAML to XML
- Design of NeXus Ontology and Terms in YAML
- Special Keywords in YAML
- How to Install nyaml
- Conclusion
- References
The NeXus data format, described by the NeXus Definition Language (NXDL), represents a concerted effort aimed at facilitating data exchange within scientific communities, particularly among those engaged in neutron, X-ray, and muon research J. Appl. Cryst. (2015). 48, 301-305. The data format is also being used by the material science community under the project NeXus-FAIRmat supporting FAIR (Findable, Accessible, Interoperable and Reuseable) data principle. It serves as a standardized framework for both data exchange and storage. At its core, the NeXus Definition Language (NXDL) functions as the cornerstone through which scientists delineate the nomenclature and organizational structure of information within NeXus data files, tailored to specific scientific techniques.
NXDL is used to define general data storage objects (base classes) and the base classes are the building blocks for defining measurement-specific or even instrument-specific or software-specific data storage objects (application definitions). In this process, members and definitions of individual base classes can be used as is or customized. In essence, the process of schema development, whether for a base class or an application definition, entails crafting an NXDL schema definition file with the extension 'nxdl.xml', utilizing the Extensible Markup Language, XML .
To expedite the schema development process, we have introduced the use of Yet Another Markup Language (YAML), which provides a syntax or style specifically tailored for defining scientific domain-driven schemas with NXDL. One significant advantage of YAML over XML is its indentation-driven approach, which eliminates the need for starting and ending tags for each entity within the schema. The YAML
format results in a reduction of NXDL keyword repetition and allows for a intuitive grasp with object oriented programing domain, such as class inheritance. These benefits are attained without compromising the integrity of the original NeXus schema, which is traditionally expressed in XML format.
The YAML
format, while not yet an official version of NeXus application definitions or base classes, necessitates a method for transcoding it into XML
. The nyaml Python package serves as a converter tool designed specifically for this purpose. It enables the conversion of NXDL from YAML
format to XML
, thereby enhancing the capability of NeXus schema developers to incorporate domain-specific scientific knowledge into the schema. Furthermore, the tool offers the flexibility to extend existing NeXus schemas in XML by facilitating conversion back and forth between the two formats. It is important to note that here we do not introduce NeXus data objects, terms, or types, which are fundamental for writing base class schemas or application definition schemas. For individuals new to NeXus, we refer to the official NeXus site at NeXus official site.
Like every scientific software, the nyaml
tool also follows a specific workflow.
graph TD;
subgraph Start
id1["Input File (YAML or XML)"]
end
subgraph Correct File Converter
id2["YAML Converter"]
id3["XML Converter"]
end
subgraph YAML converter
id4["Comment Collector"]
id5["Python Dictionary Object"]
end
subgraph XML converter
id6["XML Object"]
end
subgraph Final result
id7["Write XML File"]
id8["Write YAML File"]
end
id1--> |YAML File|id2
id1--> |XML File|id3
id2-->id4
id4-->id5
id3-->id6
id5-->id7
id6-->id8
For a given input file, the nyaml
converter checks for the correct file type and call appropriate converter. For an XML file, the XML converter parses the XML
file, by means of lxml python library, into an XML
tree object. Adhering to the NXDL rules, the converter writes the application definition or the base class object to a yaml
file that matches the nyaml
syntax. If the input file is a yaml
file, the yaml
converter collects the comments in a Comments
object and parses the yaml
file into a python dictionary
object. Later, the application definition or base classes will be converted into an XML
file by combining the Comments
and the python dictionary
object.
The tool is published to PyPI
and available for pip install
$ pip install nyaml
To contribute to the tool or to install it in development mode
$ git clone https://github.com/FAIRmat-NFDI/nyaml.git
$ cd nyaml
$ pip install -e ".[dev]"
There is also a pre-commit hook available which formats the code and checks the linting before actually commiting. It can be installed with
$ pre-commit install
The nyaml
works as a command line tool to convert NeXus application definition or base class from yaml
file format into the nxdl.xml
file format and vice-versa. The converter can be called by the command
$ nyaml2nxdl [OPTIONS] [INPUT_FILE]
with the available options:
--output-file TEXT Specify the output file path for the converted file.
--check-consistency Check whether YAML and NXDL can be recursively
converted, ensuring version consistency.
--do-not-store-nxdl Prevent the input NXDL file from being stored as a
comment at the end of the output YAML file.
--verbose Display keywords and value types in standard output to
assist in identifying issues in YAML files.
--help Show this message and exit.
The --output-file
option can be used to define the output file name (including the fle extension), otherwise the converter will define the output file name from the input file, e.g., for the input file NXapplication.nxdl.xml (NXapplication.yaml)
, the resultant file will be NXapplication_parser.yaml (NXapplication.nxdl.xml)
. With the option --check-consistency
the converter produces the same type of file as the input, e.g. for input NXapplication.nxdl.xml
the output file is NXapplication_consistency.nxd.xml
. The intention for this option is to verify proper file and version conversion of the file. When converting the nxdl.xml
file into yaml
it also stores the nxdl.xml
file at the end of yaml
file with a hash. The option --do-not-store-nxdl
prevents the yaml
file from storing the original nxdl.xml
text. The verbose
option is to identify any issues arising from unexpected conversion or syntax errors that occur while converting the file from one to another.
The --output-file
option if user wants to define output file name (including extension) otherwise converter will define the output file name e.g. from input file NXapplication.nxdl.xml (NXapplication.yaml)
the resultant file will be NXapplication_parser.yaml (NXapplication.nxdl.xml)
. With the option --check-consistency
the converter produces the same type of file as the input, e.g. for input NXapplication.nxdl.xml
the output file is NXapplication_consistency.nxd.xml
. The intention for this option is to verify proper file and version conversion of the file. When converting the nxdl.xml
file into yaml
it also stores the nxdl.xml
file at the end of yaml
file with a hash. The option --do-not-store-nxdl
prevents the yaml
file from storing nxdl.xml
text. The verbose
option is to identify the issue, if there are some unexpected conversion, while converting the file from one to another.
Presented below is a concise and trimmed example of the NXmpes
application definition (not a full application definition) in YAML
format, alongside its corresponding translation into XML
format, as illustrated below. Subsequently, the fundamental rules governing this conversion process are elucidated. For a comprehensive understanding of the basic structure of NXDL, readers are encouraged to explore the NeXus Manual. Throughout the following discussions, various components of the NXmpes application definition will be discussed in the light of nyaml
converter.
NXmpes application definition in YAML format
category: application
type: group
doc: |
This is the most general application definition for multidimensional photoelectron spectroscopy.
.. _ISO 18115-1:2023: https://www.iso.org/standard/74811.html
.. _IUPAC Recommendations 2020: https://doi.org/10.1515/pac-2019-0404
symbols:
doc: |
The symbols used in the schema to specify e.g. dimensions of arrays
n_transmission_function: |
Number of data points in the transmission function.
NXmpes(NXobject):
(NXentry):
exsits: required
definition:
\@version:
enumeration: [NXmpes]
title:
start_time(NX_DATE_TIME):
doc: |
Datetime of the start of the measurement.
end_time(NX_DATE_TIME):
exists: recommended
doc: |
Datetime of the end of the measurement.
(NXinstrument):
doc:
- |
Description of the MPES spectrometer and its individual parts.
- |
xref:
spec: ISO 18115-1:2023
term: 12.58
url: https://www.iso.org/obp/ui/en/#iso:std:iso:18115:-1:ed-3:v1:en:term:12.58
source_TYPE(NXsource):
exists: recommended
doc: |
A source used to generate a beam.
(NXmanipulator):
exists: optional
doc: |
Manipulator for positioning of the sample.
value_log(NXlog):
exists: optional
value(NX_NUMBER):
unit: NX_PRESSURE
doc: |
In the case of an experiment in which the gas pressure changes and is recorded,
this is an array of length m of gas pressures.
(NXprocess):
exists: recommended
doc: |
Document an event of data processing, reconstruction, or analysis for this data.
transmission_correction(NXcalibration):
exists: optional
doc: |
This calibration procedure is used to account for the different tranmsission efficiencies.
transmission_function(NXdata):
exists: recommended
doc: |
Transmission function of the electron analyser.
\@axes:
enumeration: [kinetic_energy]
kinetic_energy(NX_FLOAT):
unit: NX_ENERGY
doc: |
Kinetic energy values
dimensions:
rank: 1
dim: [[1, n_transmission_function]]
NXmpes application definition in nxdl.xml format
<?xml version='1.0' encoding='UTF-8'?>
<?xml-stylesheet type="text/xsl" href="nxdlformat.xsl"?>
<definition xmlns="http://definition.nexusformat.org/nxdl/3.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" category="application" type="group" name="NXmpes" extends="NXobject" xsi:schemaLocation="http://definition.nexusformat.org/nxdl/3.1 ../nxdl.xsd">
<symbols>
<doc>
The symbols used in the schema to specify e.g. dimensions of arrays
</doc>
<symbol name="n_transmission_function">
<doc>
Number of data points in the transmission function.
</doc>
</symbol>
</symbols>
<doc>
This is the most general application definition for multidimensional
photoelectron spectroscopy.
.. _ISO 18115-1:2023: https://www.iso.org/standard/74811.html
.. _IUPAC Recommendations 2020: https://doi.org/10.1515/pac-2019-0404
</doc>
<group type="NXentry">
<field name="definition">
<attribute name="version"/>
<enumeration>
<item value="NXmpes"/>
</enumeration>
</field>
<field name="title"/>
<field name="start_time" type="NX_DATE_TIME">
<doc>
Datetime of the start of the measurement.
</doc>
</field>
<field name="end_time" type="NX_DATE_TIME" recommended="true">
<doc>
Datetime of the end of the measurement.
</doc>
</field>
<group type="NXinstrument">
<doc>
Description of the MPES spectrometer and its individual parts.
This concept is related to term `12.58`_ of the ISO 18115-1:2023 standard.
.. _12.58: https://www.iso.org/obp/ui/en/#iso:std:iso:18115:-1:ed-3:v1:en:term:12.58
</doc>
<group name="source_TYPE" type="NXsource" recommended="true">
<doc>
A source used to generate a beam.
</doc>
</group>
<group type="NXmanipulator" optional="true">
<doc>
Manipulator for positioning of the sample.
</doc>
<group name="value_log" type="NXlog" optional="true">
<field name="value" type="NX_NUMBER" units="NX_PRESSURE">
<doc>
In the case of an experiment in which the gas pressure changes and is recorded,
this is an array of length m of gas pressures.
</doc>
</field>
</group>
</group>
</group>
<group type="NXprocess" recommended="true">
<doc>
Document an event of data processing, reconstruction, or analysis for this data.
</doc>
<group name="transmission_correction" type="NXcalibration" optional="true">
<doc>
This calibration procedure is used to account for the different tranmsission
efficiencies.
</doc>
<group name="transmission_function" type="NXdata" recommended="true">
<doc>
Transmission function of the electron analyser.
</doc>
<attribute name="axes">
<enumeration>
<item value="kinetic_energy"/>
</enumeration>
</attribute>
<field name="kinetic_energy" type="NX_FLOAT" units="NX_ENERGY">
<doc>
Kinetic energy values
</doc>
<dimensions rank="1">
<dim index="1" value="n_transmission_function"/>
</dimensions>
</field>
</group>
</group>
</group>
</group>
</definition>
Within the YAML format, the root section denotes the top-level description of the application definition or base class schema, comprising the category
, type
, doc
, symbols
block, and the name of the schema (e.g. NXmpes(NXobject)
). Correspondingly, the root section refers to the XML element definition
, encompassing the first doc
child of the definition
and symbols
. The definition element encapsulates essential XML attributes such as the schema's name
(and xml attribute), the object it extends
(an xml attribute), and the schema type
(an xml attribute), with additional XML attributes (e.i. xmlns:xsi
) handled by the nyaml converter. The accurate designation of category as either base
or application
distinguishes between an application definition
and a base class
. The schema name (e.i. NXmpes(NXobject)
) with paranthesis indicates the extension of the current application definition, noting that base classes must extends
NXobject, whereas application definitions may extends
either NXobject
or another application definition
(excluding base classes). Schemas may incorporate one or multiple symbols, each imbued with specialized physical meanings beyond their literal interpretation, which are utilised over the application definition.
Within the YAML format, the root section denotes the top-level description of the application definition or base class schema, comprising the category
, type
, doc
, symbols
block, and the name of the schema (e.g. NXmpes(NXobject)
). Correspondingly, the root section refers to the XML element definition
, encompassing the first doc
child of the definition
and symbols
. The definition element encapsulates essential xml attributes such as the schema's name
(and xml attribute), the object it extends
(an xml attribute), and the schema type
(an xml attribute), with additional XML attributes (e.i. xmlns:xsi
) handled by the nyaml converter. The accurate designation of category as either base
or application
distinguishes between an application definition
and a base class
. The schema name (e.i. NXmpes(NXobject)
) with paranthesis indicates the extension of the current application definition, noting that base classes must extends
NXobject, whereas application definitions may extends
either NXobject
or another application definition
(excluding base classes). Schemas may incorporate one or multiple symbols, each imbued with specialized physical meanings beyond their literal interpretation, which are utilised over the application definition.
A typical root section for the application definition NXmpes
outlined
category: application
type: group
doc: |
This is the most general application definition for multidimensional photoelectron spectroscopy.
.. _ISO 18115-1:2023: https://www.iso.org/standard/74811.html
.. _IUPAC Recommendations 2020: https://doi.org/10.1515/pac-2019-0404
symbols:
doc: |
The symbols used in the schema to specify e.g. dimensions of arrays
n_transmission_function: |
Number of data points in the transmission function.
NXmpes(NXobject):
NeXus groups, as instances of NeXus base classes, embody the compositional structure of application definitions. These groups can be initialized dynamically or statically, each approach offering distinct advantages.
Dynamic initialization allows for the instantiation of groups while implementing the NeXus definition to store data (in HDF5 file format called NeXus file). This method provides flexibility for multiple instances at the same level within the NeXus file. For instance, the group (NXmanipulator)
can initialize multiple groups such as manipulator1
and manipulator2
of the base class NXmanipulator
during data writing.
Descriptive information about NeXus groups is encapsulated within the doc
child of the respective group. It is important to note that the group annotation of source_TYPE(NXsource)
or (NXsource)source_TYPE
signifies the encapsulation of the group's name
as source_TYPE
and its type as NXsource
base class. Notably, the order between name
and type
within the XML element must be inverted such two different syntax.
Furthermore, the uppercase part of the group's name can be dynamically overwritten, allowing for the instantiation of multiple instances. For example, source_electric
and source_magnetic
can coexist from NXsource
. It is essential to adhere to the uppercase dynamic rules for NeXus groups, fields, and attributes.
NeXus Groups in YAML format
# NeXus groups in YAML format
source_TYPE(NXsource):
exists: recommended
doc: |
A source used to generate a beam.
(NXmanipulator):
exists: optional
doc: |
Manipulator for positioning of the sample.
value_log(NXlog):
exists: optional
A NeXus group may contain NeXus fields, NeXus attributes, and NeXus groups. A field, that does not have preceding NX
, and a attribute, preceded by \@
, must have a NeXus type (e.g.NX_FLOAT
, NX_CHAR
). In the YAML notation, each NeXus field or NeXus attribute has a implicit type NX_CHAR
; otherwise its type must be denoted inside the parenthesis (e.g. end_time(NX_DATE_TIME)
). Other XML attributes of the NeXus field
and NeXus attribute
comes as children of the field and attribute (the special keywords will be discussed on next section). The introductory text of the NeXus fields or attributes goes under doc
child.
A NeXus group may encompass multiple field
, attribute
, and subgroup, each serving distinct purposes within the data structure. The field
, denoted without the prefix NX, and the attribute
, indicated by \@
, must be associated with a NeXus type (e.g., NX_FLOAT
, NX_CHAR
). In YAML format, each field or attribute (NeXus attribute) implicitly assumes the type NX_CHAR
, unless explicitly specified within parentheses (e.g., end_time(NX_DATE_TIME)
).
Additionally, XML
attributes specific to NeXus field and attribute are represented as children of the corresponding field
or attribute
elements (further details on special keywords will be discussed in the following section). Descriptive information pertaining to NeXus field
s or attribute
s is encapsulated within the doc
child element.
NeXus field and attribute in YAML format
(NXentry):
exsits: required
definition: # Field type: NX_CHAR
\@version: # Attribute type: NX_CHAR
enumeration: [NXmpes]
title:
start_time(NX_DATE_TIME): # Field type: NX_DATE_TIME
doc: Datetime of the start of the measurement.
end_time(NX_DATE_TIME): # Field type: NX_DATE_TIME
exists: recommended
doc: Datetime of the end of the measurement.
The NeXus link
concept reduces duplication of the data since several concepts of the same kind (e.g., NeXus field or NeXus attribute) can refer to a single copy of a data element . In YAML format, NeXus link
is defined denoting the link in side parenthesis. At the same time, the concept containing the data must be mentioned under the target
child.
NeXus link in YAML format
reference_measurement(link):
target: /NXentry
doc: A link to a full data collection.
In the provided YAML example, reference_measurement
is defined as a link refering the NXentry
group with its target specified as /NXentry
. This structure ensures that the concept referencing the data is effectively linked to the designated target, thereby reducing redundancy and maintaining data integrity within the NeXus framework.
NeXus choice
concept is designed to choose a concept from a number of concepts of the same kind (e.g., a NeXus field). The choice
options allows for defining a scientific concept in several modes for different situations (e.g., for different instrument configurations or measurement modes).
NeXus choice in YAML format
pixel_shape(choice):
(NXoff_geometry):
doc: Shape description of each pixel. Use only if all pixels in the detector
are of uniform shape.
(NXcylindrical_geometry):
doc: Shape description of each pixel. Use only if all pixels in the detector
are of uniform shape and require being described by cylinders.
In this choice
example, pixes_shape
could be any of the groups (NXoff_geometry)
and (NXcylindrical_geometry)
, depending on the geometry of the pixelx.
To explain the context of NeXus, certain keywords hold significance beyond their literal interpretations. These special keywords are utilized to elucidate and denote various NeXus terms like attributes, fields, links, and groups, thereby enhancing the clarity and specificity of the data representation.
The exists
keyword plays a pivotal role in delineating the optionality of NeXus concepts attribute
, field
, choice
link
, and group
, during the implementation of NeXus definitions in NeXus files. It provides crucial insights into the expected presence or absence of these concepts within the NeXus data structure. By default, all the concepts of a base class are optional, while in an application definition, all concepts are required.
Presently, the accepted values for the exists
keyword encompass:
optional
: Denotes that the NeXus concept is not mandatory and may be absent.
recommended
: Suggests that the NeXus concept is advisable, but not mandatory.
required
: Indicates that the NeXus concept must be present within the structure. Any validation of a NeXus file will fail if required concepts (for a given application definition) are not available.
[min, <number>, max, <number> or infty]
: Represents an array type value that signifies the multiplicity of the NeXus concepts. For instance, a concept having the keyword exists: [min, 3, max, infty]
implies that this concept must come with a minimum of three instances and may extend to any number of instances.
exists
in YAML
transmission_correction(NXcalibration):
exists: optional
doc: |
This calibration procedure is used to account for the different tranmsission efficiencies.
In the above example the greoup transmission_correction
is a optional group.
A statement introducing NeXus-compliant NXDL units
attribute to the field
, e.g. NX_VOLTAGE
to assign a predefined physical unit.
unit
in YAML
detector_voltage(NX_FLOAT):
unit: NX_VOLTAGE
doc: |
Voltage applied to detector.
The dimensions
term describes the multidimensional nature of the data, specifying its rank, dimensional indices, and corresponding length of the rank. For example, the attribute rank
defines the dimension of the data set. To elucidate each dimension, we use two other keywords: dim
and dim_parameters
. The dim
keyword comprises an array of arrays, the nested array encapsulates values for index
and value
(NeXus keywords) pairs. Each array within the dim
array corresponds to a specific dimension of the multidimensional data. For example, for 2D particle motion, the dim
array may be represented as [[0, nx], [1, ny]]
, indicating the axes index and its length. The keyword dim_parameters
contains further information of each dimension such as doc
, ref
, etc. It is important to note that each term or keyword within dim_parameters
must have the same length as the value of the rank keyword.
dimensions
in YAML
# 2D particle motion
dimensions:
rank: 2
dim: [[0, nx], [1, ny]]
dim_parameters:
doc: ["Position of particle on x-axis.","Position of particle on y-axis."]
The dimensions
can also be written in shorter form
Dimensions in YAML (shorter form)
# 2D particle motion
dimensions:
rank: 2
dim: (nx, ny)
Python-like list of strings which are considered as recommended items for the fields or attributes.
Enumeration in YAML
definition:
\@version:
enumeration: [NXmpes]
In the example the valid value for NeXus field definition
is NXmpes
.
The xref
keyword (which can only inside the keyword doc
) is used to refer any other ontology or any other standard such ISO
. The xref
in the example doc
will reflect the information inside the XML doc
. Note that the xref
keyword is only available in the YAML
representation and will be transformed into its textual representation inside the doc
text in XML
.
xref
in YAML
(NXinstrument):
doc:
- |
Description of the MPES spectrometer and its individual parts.
- |
xref:
spec: ISO 18115-1:2023
term: 12.58
url: https://www.iso.org/obp/ui/en/#iso:std:iso:18115:-1:ed-3:v1:en:term:12.58
Defining a NeXus application definition or base class in YAML format is not an official structure of NeXus. The single source of truth for the NeXus definitions remains the XML reprensentation. However, the NYAML format serves to reduce the effort of the developer to construct an application definition or base class. nyaml
is the tool to convert the application definitions or base classes from the YAML
reprensentation nxdl.xml
(XML
type) format with any knowledge of XML
style or syntax. This is a open source software funded by the German National Data Infrastructure (NFDI) within the FAIRmat project. Thus, the tool is open to contribution and suggestion for improvements, e.g., through creating issues on the public GitHub repository. nyaml
is also published in PyPi and can be installed with pip
python package manageer.
[@Könnecke]: J. Appl. Cryst. (2015). 48, 301-305 (https://doi.org/10.1107/S1600576714027575) NeXus-FAIRmat: https://fairmat-nfdi.github.io/nexus_definitions/