Best practice for associating units of measurement with simple metadata in extensions #569

rly · 2024-02-04T09:56:24Z

A few of us discussed at our meeting this week how best to associate units of measurement with simple attribute metadata, often about devices or protocols (e.g., emission_lambda (in nm), grid_spacing (in um), camera_width (in pixels), pulse_length (in ms), injection_volume (in mL)), where it makes sense to fix the unit to a particular value. I'm documenting this discussion here. It would also be good to collect input from others.

Any changes we make on this to nwb-schema would break backward compatibility. But we can provide best practices for extensions and later look at improving the core nwb schema. (tl;dr at bottom)

Field naming

Currently we have several methods for doing this in NWB:

1. a dataset with a unit attribute that is fixed to a particular value (we also have cases where the unit attribute has a default, recommended value that can be overridden, but let's consider only the fixed value cases here)
2. an attribute with the unit described in the API docstring and schema doc

A problem with these methods is that a user who browses or writes data naively, without looking at the docs, may guess incorrectly what the units of measurement are. In general, we recommend using SI base units, but most people don't know that, and for some fields, like emission_lambda, which is almost always communicated in nanometers, it is unintuitive to use the SI base unit (meters). This has resulted in incorrectly written data.

One pattern recommended by the LinkML group, which has extensive experience modeling data from different fields, is to put the unit abbreviation in the field name itself: https://linkml.io/linkml/howtos/model-measurements.html#simple-explicit-scalar-pattern (they also have other suggested approaches, but this is the simplest). For example, emission_lambda_in_nm, camera_width_in_px. We agreed that this approach would be best because it is clear and explicit, at the cost of being a little more verbose. We should also still have NWB Inspector check that the values are reasonable.
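A minimal sketch of the kind of "is this value reasonable?" check mentioned above. This is plain Python, not NWB Inspector's actual API; the function name, the table of ranges, and the bounds themselves are all illustrative assumptions.

```python
from typing import Optional

# Illustrative bounds only: these ranges are assumptions, not agreed NWB values.
PLAUSIBLE_RANGES = {
    "emission_lambda_in_nm": (200.0, 2000.0),
    "grid_spacing_in_um": (0.01, 10_000.0),
}


def check_plausible(field_name: str, value: float) -> Optional[str]:
    """Return a warning string if `value` looks implausible, else None."""
    bounds = PLAUSIBLE_RANGES.get(field_name)
    if bounds is None:
        return None  # no range registered for this field
    low, high = bounds
    if low <= value <= high:
        return None
    return (
        f"{field_name}={value!r} is outside the plausible range "
        f"[{low}, {high}]; was it entered in a different unit?"
    )


# 520 nm written correctly, then the same wavelength mistakenly written in meters.
print(check_plausible("emission_lambda_in_nm", 520.0))   # None
print(check_plausible("emission_lambda_in_nm", 5.2e-7))  # a warning message
```

Because the unit is part of the field name, a check like this can key its ranges on the name alone.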

Use of non-base SI units

As mentioned above, for some metadata, it is unintuitive to use an SI base unit or other unprefixed unit (e.g., meters, liters, seconds) because it differs from the unit that is widely used to communicate the metadata in the community (e.g., nanometers, microliters, milliseconds). I propose that we recommend that extension writers use the units that are already widely used. When it is not clear what is widely used, we should try to poll the community and just pick one. Using a fixed value is better than allowing people to enter a value because they are unlikely to enter the value in a standard form (across current dandisets, the set of all entered grid_spacing.unit values is {"microns", "micrometers", "millimeters", "meters", "microns per pixel", "mm"}).
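To illustrate the cost of free-text units, here is a rough sketch of what a reader has to do with the grid_spacing.unit values listed above. The lookup table and function name are hypothetical; only the set of observed strings comes from the discussion.

```python
from typing import Optional

# The free-text values reported above for grid_spacing.unit.
OBSERVED = {"microns", "micrometers", "millimeters", "meters", "microns per pixel", "mm"}

# Hypothetical lookup from free-text spellings to a short unit abbreviation.
KNOWN_SPELLINGS = {
    "microns": "um",
    "micrometers": "um",
    "millimeters": "mm",
    "mm": "mm",
    "meters": "m",
}


def normalize(unit_text: str) -> Optional[str]:
    """Map a free-text unit to an abbreviation, or None if it is not recognized."""
    return KNOWN_SPELLINGS.get(unit_text.strip().lower())


unresolved = sorted(u for u in OBSERVED if normalize(u) is None)
distinct = sorted({normalize(u) for u in OBSERVED} - {None})
print(unresolved)  # ['microns per pixel']
print(distinct)    # ['m', 'mm', 'um']
```

Even with a hand-maintained table, one entry cannot be mapped, and the rest still span three different units. A fixed unit removes the need for the table altogether.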

Abbreviation in field names

I propose that unit abbreviations should come from CMIXF-12, with the modification that because / and ^ are not allowed in Python and MATLAB variable names (e.g., for W/m^2), / should be replaced with _ and ^ with nothing (e.g., intensity_in_W_m2). The true unit abbreviation must be written in the docs. In context, I think it would make sense and confused users would consult the docs. (See also usage of CMIXF-12 in BIDS and relevant discussion and links.)
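The substitution proposed above is small enough to sketch directly. The helper names are hypothetical; the rule itself ("/" becomes "_", "^" is dropped) is the one described in this section.

```python
def unit_to_suffix(cmixf_unit: str) -> str:
    """Apply the substitution proposed above: '/' -> '_', '^' -> ''."""
    return cmixf_unit.replace("/", "_").replace("^", "")


def field_name(base: str, cmixf_unit: str) -> str:
    """Build '<base>_in_<unit>' and confirm it is a legal Python identifier."""
    name = f"{base}_in_{unit_to_suffix(cmixf_unit)}"
    if not name.isidentifier():
        raise ValueError(f"{name!r} is not a valid identifier")
    return name


print(field_name("intensity", "W/m^2"))     # intensity_in_W_m2
print(field_name("emission_lambda", "nm"))  # emission_lambda_in_nm
```

The `isidentifier()` check only covers Python's naming rules; MATLAB's rules would need a separate check.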

Data format

Should users use a dataset (option 1 above) or an attribute (option 2 above)? Attributes are generally preferred for small metadata that are more properties than measurements, especially scalar values, so I think option 2 is best, but I don't think we settled on this.

An exception is DynamicTable columns that hold these metadata, for example, when parameters of a stimulus like pulse_length_ms change across trials/epochs. Columns are datasets. I propose that having an attribute named "unit" with a fixed value on the dataset is optional but recommended.
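A rough sketch of the two shapes discussed in this section, written as plain Python dicts that mirror schema entries. The keys (name, dtype, doc, value, attributes, neurodata_type_inc) follow my understanding of the NWB specification language and should be checked against its documentation; the doc strings, dtypes, and the helper function are illustrative assumptions. The column uses the pulse_length_in_ms spelling from the Summary.

```python
from typing import Optional

# Option 2: a scalar property stored as an attribute, with the unit in the name.
attribute_spec = {
    "name": "emission_lambda_in_nm",
    "dtype": "float32",
    "doc": "Emission wavelength of the channel, in nanometers (nm).",
}

# The exception: a DynamicTable column (a dataset) whose values vary per row.
# The optional-but-recommended 'unit' attribute is fixed via the 'value' key.
column_spec = {
    "neurodata_type_inc": "VectorData",
    "name": "pulse_length_in_ms",
    "dtype": "float32",
    "doc": "Pulse length for each trial, in milliseconds (ms).",
    "attributes": [
        {"name": "unit", "dtype": "text", "value": "ms", "doc": "Unit of measurement."},
    ],
}


def fixed_unit(dataset_spec: dict) -> Optional[str]:
    """Return the fixed value of the dataset's 'unit' attribute, if it has one."""
    for attr in dataset_spec.get("attributes", []):
        if attr.get("name") == "unit":
            return attr.get("value")
    return None


print(fixed_unit(column_spec))  # ms
```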

Note: This discussion is related to, but distinct from, the discussion on using a MeasurementData type that has attributes for unit, conversion, offset, and resolution (see #493). That type is designed for measured data from a data acquisition system. The best practices suggested here are mostly for small metadata (usually scalar properties of a device or protocol) where there is no conversion factor, offset, or resolution.

Summary

To summarize, I propose that for extensions, the best practice is that fields representing metadata with a fixed unit should be schematized as attributes with the unit abbreviation in the field name, e.g.,

emission_lambda_in_nm for OpticalChannel
grid_spacing_in_um for ImagingPlane
camera_width_in_px for a behavioral video
pulse_length_in_ms for stimulation
injection_volume_in_mL for viruses
laser_power_in_mW for optogenetics
core_diameter_in_um for optic fibers
ap_in_mm for brain coordinates
theta_filter_phase_in_deg for a closed loop feedback protocol
titer_in_vg_mL (vg/mL; in context, it should make sense, and the unit will be well described in the docs)
intensity_in_W_m2 (W/m^2) for a light source
diameter_in_um for spiral scanning stimulation

that the unit is the one widely used by the community, and the abbreviation follows a modified CMIXF-12 convention as described above.
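One consequence of this naming pattern is that the unit can be recovered from the field name alone. A minimal sketch, assuming the last "_in_" in the name always separates the base name from the unit suffix (the function name is hypothetical):

```python
from typing import Tuple


def split_field_name(field_name: str) -> Tuple[str, str]:
    """Split '<base>_in_<unit>' at the last '_in_' into (base, unit suffix)."""
    base, sep, unit = field_name.rpartition("_in_")
    if not sep or not base or not unit:
        raise ValueError(f"{field_name!r} does not follow the '<base>_in_<unit>' pattern")
    return base, unit


for name in ["emission_lambda_in_nm", "ap_in_mm", "titer_in_vg_mL", "intensity_in_W_m2"]:
    print(split_field_name(name))
# ('emission_lambda', 'nm')
# ('ap', 'mm')
# ('titer', 'vg_mL')
# ('intensity', 'W_m2')
```

The suffix is the modified abbreviation, not the true CMIXF-12 string, so the docs remain the authority on what "vg_mL" or "W_m2" means.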

cc @oruebel @bendichter @CodyCBakerPhD @alessandratrapani


CodyCBakerPhD · 2024-02-04T17:21:41Z

Thanks for the writeup!

I agree that simply altering the variable name to clarify units is the simplest approach here

> I propose that unit abbreviations should come from CMIXF-12, with the modification that because / and ^ are not allowed in Python and MATLAB variable names (e.g., for W/m^2), / should be replaced with _ and ^ with nothing (e.g., intensity_in_W_m2)

I would, however, suggest replacing the / character with the word per, since the inclusion of _in_ indicates we're going for maximum verbosity, and since it disambiguates / from the other operator character in CMIXF-12 (the .). We should also think about what the rule should be for the . character: either a _ separator for items in the same numerator/denominator, or no separator at all.

Such that the example would be intensity_in_W_per_m2, which makes it a tad clearer how the W relates to the m2 (otherwise, stringification would not disambiguate W.m^2 from W/m^2 except by going to the docs)

Referring to the example section towards the bottom of https://people.csail.mit.edu/jaffer/MIXF/CMIXF-12 for other advanced examples, radiance would be W/(m^2.sr) -> W_per_m2_sr or W_per_m2sr. I favor the first: without a separator between units in the denominator, you cannot tell, except from context, whether m2sr means "meters squared times steradians" or "meters squared times seconds times revolutions".
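A minimal sketch of this variant of the rule: "/" becomes "_per_", "." becomes "_", and "^" is dropped. Dropping parentheses is an assumption inferred from the W/(m^2.sr) -> W_per_m2_sr example, and the function name is hypothetical.

```python
def unit_to_suffix_with_per(cmixf_unit: str) -> str:
    """Convert a CMIXF-12 unit string into an identifier-safe suffix.

    '/' -> '_per_', '.' -> '_', '^' -> '', and parentheses are dropped
    (the last step is an assumption based on the radiance example).
    """
    suffix = cmixf_unit.replace("(", "").replace(")", "")
    suffix = suffix.replace("/", "_per_").replace(".", "_").replace("^", "")
    if not f"x_in_{suffix}".isidentifier():
        raise ValueError(f"cannot build an identifier from {cmixf_unit!r}")
    return suffix


print(unit_to_suffix_with_per("W/m^2"))       # W_per_m2
print(unit_to_suffix_with_per("W.m^2"))       # W_m2
print(unit_to_suffix_with_per("W/(m^2.sr)"))  # W_per_m2_sr
```

With this rule, W/m^2 and W.m^2 no longer map to the same suffix. Units with negative exponents would still fail the identifier check and would need their own rule.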

rly · 2024-02-04T18:53:22Z

Thanks for the feedback!

> replacing the / character with the word per

Great point! I support this.

> _ separator for items in the same numerator/denominator

I think this is a good idea too. I support this.

rly added the labels "priority: low" (alternative solution already working and/or relevant to only specific user(s)), "category: proposal" (proposed enhancements or new features), and "topic: docs" (issues related to documentation) on Feb 4, 2024.

alessandratrapani mentioned this issue on Feb 5, 2024 in catalystneuro/ndx-patterned-ogen#6, "Change quantitative properties to add units to name" (merged).

rly mentioned this issue on Apr 16, 2024 in catalystneuro/ndx-fiber-photometry#1, "Create extension spec" (merged).
