Best practice for associating units of measurement with simple metadata in extensions #569
Labels: category: proposal, priority: low, topic: docs
A few of us discussed at our meeting this week how best to associate units of measurement with simple attribute metadata, often about devices or protocols (e.g., emission_lambda (in nm), grid_spacing (in um), camera_width (in pixels), pulse_length (in ms), injection_volume (in mL)), where it makes sense to fix the unit to a particular value. I'm documenting this discussion here. It would also be good to collect input from others.
Any changes we make on this to nwb-schema would break backward compatibility. But we can provide best practices for extensions and later look at improving the core nwb schema. (tl;dr at bottom)
Field naming
Currently we have several methods for doing this in NWB:
1. A dataset with a unit attribute that is fixed to a particular value (we also have cases where the unit attribute has a default, recommended value that can be overridden, but let's consider only the fixed-value cases here)
2. An attribute, where the unit is given only in the docs
A problem with these methods is that when browsing or writing data naively, without looking at the docs, a user may guess the units of measurement incorrectly. In general, we recommend using SI base units, but most people don't know that, and for some fields, like emission_lambda, which is almost always communicated in nanometers, it is unintuitive to use the SI base unit (meters). This has resulted in incorrectly written data.
One pattern recommended by the LinkML group, which has extensive experience modeling data from different fields, is to put the unit abbreviation in the field name itself: https://linkml.io/linkml/howtos/model-measurements.html#simple-explicit-scalar-pattern (they also have other suggested approaches, but this is the simplest). For example, emission_lambda_in_nm, camera_width_in_px. We agreed that this approach would be best because it is clear and explicit, at the cost of being a little more verbose. We should also still have NWB Inspector check that the values are reasonable.
Use of non-base SI units
As mentioned above, for some metadata, it is unintuitive to use an SI base unit or other unprefixed unit (e.g., meters, seconds, liters) because it differs from the unit that is widely used to communicate the metadata in the community (e.g., nanometers, milliseconds, microliters). I propose that we recommend that extension writers use the units that are already widely used. When it is not clear what is widely used, we should try to poll the community and just pick one. Fixing the unit is better than letting people enter it, because they are unlikely to enter it in a standard form (across current dandisets, the set of all entered grid_spacing.unit values is {"microns", "micrometers", "millimeters", "meters", "microns per pixel", "mm"}).
Abbreviation in field names
I propose that unit abbreviations should come from CMIXF-12, with one modification: because / and ^ are not allowed in Python and MATLAB variable names (e.g., for W/m^2), / should be replaced with _ and ^ with nothing (e.g., intensity_in_W_m2). The true unit abbreviation must be written in the docs. I think the modified abbreviation will make sense in context, and confused users can consult the docs. (See also usage of CMIXF-12 in BIDS and relevant discussion and links.)
Data format
Should users use a dataset (option 1 above) or an attribute (option 2 above)? Attributes are generally preferred for small metadata that are more properties than measurements, especially scalar values, so I think option 2 is best, but I don't think we settled on this.
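As a rough illustration of the two options, the sketch below writes both as plain Python dictionaries whose keys (name, dtype, doc, attributes, value) mirror the NWB schema language. The field names, dtypes, doc strings, and the declared_unit helper are made up for this example and are not part of any existing schema, extension, or API.

```python
from typing import Optional

# Hypothetical specs written as plain dicts; nothing here is validated
# against nwb-schema or loaded by PyNWB.

# Option 1: a dataset whose "unit" attribute is fixed to a single value.
injection_volume_dataset = {
    "name": "injection_volume",
    "dtype": "float32",
    "doc": "Volume of the injection.",
    "attributes": [
        {
            "name": "unit",
            "dtype": "text",
            "value": "mL",  # fixed value, not a default that can be overridden
            "doc": "Unit of measurement, fixed to 'mL'.",
        }
    ],
}

# Option 2: an attribute with the unit abbreviation in its name.
injection_volume_attribute = {
    "name": "injection_volume_in_mL",
    "dtype": "float32",
    "doc": "Volume of the injection, in milliliters (mL).",
}


def declared_unit(spec: dict) -> Optional[str]:
    """Return the unit a reader can find without opening the docs, if any."""
    # Option 1: look for a fixed-value "unit" attribute on the dataset.
    for attribute in spec.get("attributes", []):
        if attribute["name"] == "unit" and "value" in attribute:
            return attribute["value"]
    # Option 2: look for an "_in_<unit>" suffix in the field name.
    _base, separator, suffix = spec["name"].rpartition("_in_")
    return suffix if separator else None
```

In this sketch both forms expose "mL" to someone browsing the file, whereas a bare attribute named injection_volume would expose nothing.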
An exception is DynamicTable columns that hold these metadata, for example, when parameters of a stimulus like pulse_length_ms change across trials/epochs. Columns are datasets. I propose that having an attribute named "unit" with a fixed value on the dataset is optional but recommended.
Note: This discussion is related to, but distinct from, the discussion on using a MeasurementData type that has attributes for unit, conversion, offset, and resolution (see #493). That type is designed for measured data from a data acquisition system. The best practices suggested here are mostly for small metadata (usually scalar properties of a device or protocol) where there is no conversion factor, offset, or resolution.
Summary
To summarize, I propose that for extensions, the best practice is that fields that represent metadata whose units should be a fixed value should be schematized as attributes where the unit abbreviation is in the field name, e.g.,
- emission_lambda_in_nm for OpticalChannel
- grid_spacing_in_um for ImagingPlane
- camera_width_in_px for a behavioral video
- pulse_length_in_ms for stimulation
- injection_volume_in_mL for viruses
- laser_power_in_mW for optogenetics
- core_diameter_in_um for optic fibers
- ap_in_mm for brain coordinates
- theta_filter_phase_in_deg for a closed loop feedback protocol
- titer_in_vg_mL (vg/mL; in context, it should make sense and the unit will be well described in the docs)
- intensity_in_W_m2 (W/m^2) for a light source
- diameter_in_um for spiral scanning stimulation
In addition, the unit should be the one widely used by the community, and the abbreviation should follow a modified CMIXF-12 convention as described above.
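The naming rule proposed above (append _in_ plus the unit abbreviation, replacing / with _ and dropping ^) can be sketched in a few lines of Python. The function names unit_suffix and field_name are hypothetical helpers for this example only, and the sketch does not check that the abbreviation is valid CMIXF-12.

```python
def unit_suffix(unit: str) -> str:
    """Make a unit abbreviation safe to use inside a field name.

    Applies only the two substitutions proposed in this issue:
    "/" becomes "_" and "^" is removed.
    """
    return unit.replace("/", "_").replace("^", "")


def field_name(base: str, unit: str) -> str:
    """Build a field name of the form <base>_in_<unit suffix>."""
    name = f"{base}_in_{unit_suffix(unit)}"
    # Characters the proposal does not address would still break
    # Python/MATLAB variable names, so reject them instead of guessing.
    if not name.isidentifier():
        raise ValueError(f"{name!r} is not a valid identifier")
    return name


print(field_name("intensity", "W/m^2"))
print(field_name("titer", "vg/mL"))
print(field_name("emission_lambda", "nm"))
```

Because the substitution is lossy (W_m2 could be read in more than one way), the true abbreviation still has to be stated in the docs, as proposed above.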
cc @oruebel @bendichter @CodyCBakerPhD @alessandratrapani