Niru Maheswaranathan edited this page Jul 10, 2015 · 6 revisions

Data file format

This page describes the HDF5 format that is used to store raw data recorded on both the MCS and Hidens array systems.

HDF5

Files are stored in a format called Hierarchical Data Format version 5 (HDF5). This is a self-describing, widely used standard format for storing large amounts of structured data. The base library is written in C, but bindings exist for a wide range of programming languages.

File structure

HDF5 files are organized hierarchically, much like a file system. Data is organized into "groups", which are analogous to folders, and "datasets", which contain the actual data (stored as arrays). Files, groups, and datasets can all have "attributes", which are arbitrary metadata associated with the given object.

Files are stored in chunked format on disk, with a chunk size of 20000 samples. Chunked format allows datasets to be resized, and the particular chunk size is chosen to be consistent with traditional AIB binary files. This is unlikely to be important for most cases, but may matter for debugging or low-level work.
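
As a rough illustration of the chunked, resizable layout described above, here is a minimal sketch using h5py, the Python binding for HDF5. The channel count, the 16-bit sample type, and the choice of a chunk that spans all channels are assumptions made for the example; this page only specifies a chunk size of 20000 samples. The file is created in memory so nothing is written to disk.

```python
import h5py
import numpy as np

NCHANNELS = 4          # hypothetical channel count, for illustration only
CHUNK_SAMPLES = 20000  # chunk size in samples, as stated on this page

# In-memory file (core driver, no backing store): nothing touches the disk.
with h5py.File("example.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset(
        "data",
        shape=(NCHANNELS, CHUNK_SAMPLES),
        maxshape=(NCHANNELS, None),          # unlimited along the sample axis
        chunks=(NCHANNELS, CHUNK_SAMPLES),   # assumption: one chunk spans all channels
        dtype="int16",                       # assumption: 16-bit samples
    )
    dset[:, :] = np.zeros((NCHANNELS, CHUNK_SAMPLES), dtype="int16")

    # Chunked storage is what allows the dataset to grow after creation.
    dset.resize(2 * CHUNK_SAMPLES, axis=1)
    dset[:, CHUNK_SAMPLES:] = 1

    # Attributes are metadata attached to the dataset itself.
    dset.attrs["array"] = "hidens"

    final_shape = dset.shape
    chunk_shape = dset.chunks
    array_name = dset.attrs["array"]
```

Without `chunks` and `maxshape`, the dataset would be stored contiguously and could not be resized.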

Our files contain data from either the MCS or Hidens array systems. These systems are different, but there is a subset of datasets and attributes that must be in each file so that it can work with other components of the spike sorting system. Any tool or component that creates these files, and expects them to work with other components, must include at least the common components. Other attributes, datasets, etc. may be present, but they will not be referenced.

Common components

  • A single root group is defined, "/".
  • A single dataset with the raw voltage data is stored in "/data". This contains the actual raw data values recorded, in whatever bit-width the system defines. This data must be formatted as a 2-dimensional array, shaped as nchannels-by-nsamples.
  • The dataset must have the following attributes, with the data types given:
    • "date": a string formatted as "ddd, MMM dd, yyyy"
    • "time": a string formatted as "h:mm:ss AP"
    • "sample-rate": 4-byte IEEE floating point, giving the sample rate of the data
    • "gain": 4-byte IEEE floating point, giving the gain of the analog-digital conversion
    • "offset": 4-byte IEEE floating point, giving the offset of the analog-digital conversion
    • "array": a string identifying the array on which data was recorded (e.g., "hidens", "hexagonal", etc)
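
A tool that reads or writes these files might check for the common attributes before doing anything else. The following is a minimal sketch of such a check; the function name and the example values are hypothetical. It accepts any mapping, so a plain dictionary works, as should the attribute collection that an HDF5 binding exposes for "/data".

```python
# Attribute names required on "/data", as listed above.
REQUIRED_ATTRIBUTES = ("date", "time", "sample-rate", "gain", "offset", "array")


def missing_common_attributes(attrs):
    """Return the required attribute names absent from `attrs`, in listed order."""
    return [name for name in REQUIRED_ATTRIBUTES if name not in attrs]


# Hypothetical attribute values, for illustration only ("offset" is left out).
example_attrs = {
    "date": "Fri, Jul 10, 2015",
    "time": "2:30:00 PM",
    "sample-rate": 10000.0,
    "gain": 1.0,
    "array": "hexagonal",
}

missing = missing_common_attributes(example_attrs)
print(missing)
```

This only checks that the attributes are present. It does not verify their data types or the 2-dimensional shape of the dataset.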

MCS components

In addition to the common components, HDF5 files recorded from the MCS arrays also have the following:

  • "room": a string indicating the room in which data is recorded
  • "bin-file-version", "bin-file-type": unsigned 32-bit integers, only kept for compatibility with old AIB formats

Hidens components

  • An attribute of the dataset "/data", called "configuration". At the moment, this is a complicated record array. It contains one entry for each channel from which data is recorded. Each entry is a 5-tuple, which contains the following information about the electrode connected to the given channel:
    • "xpos" (unsigned 32-bit integer): The X-position in microns of the electrode on the array
    • "ypos" (unsigned 32-bit integer): The Y-position in microns of the electrode on the array
    • "x" (unsigned 16-bit integer): The X-index of the electrode
    • "y" (unsigned 16-bit integer): The Y-index of the electrode
    • "label" (string): A label that the Hierlemann group assigns to each electrode
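
The record layout above can be mirrored in memory with a NumPy structured array, as in the sketch below. The field names and integer widths follow the list. The 16-byte fixed-length label and the electrode values are assumptions made for the example, since this page only says the label is a string.

```python
import numpy as np

# One record per recorded channel, with fields in the order listed above.
config_dtype = np.dtype([
    ("xpos", np.uint32),   # X position in microns
    ("ypos", np.uint32),   # Y position in microns
    ("x", np.uint16),      # X index of the electrode
    ("y", np.uint16),      # Y index of the electrode
    ("label", "S16"),      # assumption: fixed-length 16-byte string
])

# Two hypothetical electrodes, for illustration only.
configuration = np.array(
    [(100, 200, 5, 10, b"el-0"), (118, 200, 6, 10, b"el-1")],
    dtype=config_dtype,
)

# Fields can be pulled out by name across all channels.
xpos_values = configuration["xpos"]
second_label = configuration[1]["label"]
```

Indexing by field name returns one value per channel, so the electrode positions for every recorded channel are available as ordinary arrays.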