data file format
This page describes the HDF5 format that is used to store raw data recorded on both the MCS and Hidens array systems.
Files are stored in a format called Hierarchical Data Format version 5 (HDF5). This is a self-describing and widely used standard format for storing large amounts of structured data. The base library is written in C; however, bindings exist for a wide range of programming languages.
Additionally, there are standalone apps that provide a graphical interface for browsing HDF5 files. For example, check out the cross-platform HDFView application.
HDF5 files are organized similarly to a file system, in that they are hierarchical. Data is organized into "groups", which are analogous to folders, and "datasets", which contain actual data (stored as arrays). Files, groups, and datasets can all have "attributes", which are arbitrary metadata associated with the given object.
Files are stored in chunked format on disk, with a chunk size of 20000 samples. Chunked format allows datasets to be resized, and the particular chunk size is chosen to be consistent with traditional AIB binary files. This is unlikely to be important for most cases, but may matter for debugging or low-level work.
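For low-level work it can help to know which chunk a given sample falls in. The following is a minimal sketch of that arithmetic, assuming the 20000-sample chunk size applies along the sample axis (the chunk extent along the channel axis is not stated here). The function names are hypothetical and used only for illustration.

```python
CHUNK_SAMPLES = 20000  # chunk size stated above


def chunk_count(nsamples, chunk=CHUNK_SAMPLES):
    """Number of chunks needed to cover nsamples along the sample axis."""
    # Integer ceiling division: a partially filled last chunk still counts.
    return (nsamples + chunk - 1) // chunk


def chunk_of_sample(sample, chunk=CHUNK_SAMPLES):
    """Return (chunk index, offset within that chunk) for a sample index."""
    return divmod(sample, chunk)


print(chunk_count(50000))      # 3
print(chunk_of_sample(45000))  # (2, 5000)
```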
Our files contain data from either the MCS or Hidens array systems. These systems are different, but there is a subset of datasets and attributes that must be in each file so that it can work with other components of the spike sorting system. Any tool or component creating these files, and expecting them to work with other components, must include at least the common components. Other attributes, datasets, etc., may be present, but they will not be referenced.
- A single root group is defined, "/".
- A single dataset with the raw voltage data is stored in "/data". This contains the actual raw data values recorded, in whatever bit-width the system defines. This data must be formatted as a 2-dimensional array, shaped as nchannels-by-nsamples.
- The data must have the following attributes, with the following data types:
  - "date": a string date, formatted in ISO-8601 format, with resolution down to the second, i.e., with format string "%Y-%m-%dT%H:%M:%S".
  - "sample-rate": 4-byte IEEE floating point, giving the sample rate of the data
  - "gain": 4-byte IEEE floating point, giving the gain of the analog-digital conversion
  - "offset": 4-byte IEEE floating point, giving the offset of the analog-digital conversion
  - "array": a string identifying the array on which data was recorded (e.g., "hidens", "hexagonal", etc.)
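A tool that writes these files might check the common attributes before saving. The sketch below is one possible check, not part of the spike sorting system itself. A plain dict stands in for the attributes of "/data", the function name is hypothetical, and the example values are made up. It only checks that the attributes are present and that the date parses; it does not check the 4-byte floating-point width, which depends on the HDF5 binding in use.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%dT%H:%M:%S"  # format string given in the list above
REQUIRED_ATTRS = ("date", "sample-rate", "gain", "offset", "array")


def check_common_attrs(attrs):
    """Return a list of problems found in a mapping of "/data" attributes."""
    problems = [
        f"missing attribute: {name}"
        for name in REQUIRED_ATTRS
        if name not in attrs
    ]
    if "date" in attrs:
        try:
            datetime.strptime(attrs["date"], DATE_FORMAT)
        except (ValueError, TypeError):
            problems.append("date does not match " + DATE_FORMAT)
    return problems


# Hypothetical attribute values, for illustration only.
example = {
    "date": "2016-03-01T14:30:00",
    "sample-rate": 10000.0,
    "gain": 0.5,
    "offset": 0.0,
    "array": "hidens",
}
print(check_common_attrs(example))  # []
```

An empty list means every required attribute is present and the date has second-level resolution in the expected format.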
Note that, after running the extract tool on the data file, the file will be updated to include a new attribute of the "/data" dataset. This attribute is called channel-means, and gives the mean of the raw data for each channel from which snippets were extracted (not all channels).
In addition to the common components, HDF5 files recorded from the MCS arrays also have the following:
- "room": a string indicating the room in which data is recorded
- "bin-file-version", "bin-file-type": unsigned 32-bit integers, only kept for compatibility with old AIB formats
Data files recorded from the Hidens array have an additional group, called "/configuration". This contains 6 datasets which together define the configuration of the array during the recording. Each of these is defined as follows, letting n be the total number of data channels (126 currently), and k be the number of electrodes connected to data channels, i.e., from which data is actively being recorded.
- "/configuration/xpos": shape {k}
  - The x-position, in microns, of the connected electrodes
- "/configuration/ypos": shape {k}
  - The y-position, in microns, of the connected electrodes
- "/configuration/x": shape {k}
  - The x-index of the connected electrodes
- "/configuration/y": shape {k}
  - The y-index of the connected electrodes
- "/configuration/label": shape {k}
  - A single character label associated with each electrode
- "/configuration/channels": shape {n}
  - The electrode index of each channel. This is a number on the interval [0, 11016) if the channel is actually connected to an electrode, and is that electrode's linear index. It is -1 if the channel is not connected to an electrode.
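The "channels" dataset can be turned into a channel-to-electrode lookup by dropping the -1 entries. The following is a minimal sketch of that, using a plain list in place of the dataset contents. The function name and the shortened example array are hypothetical.

```python
UNCONNECTED = -1         # stored for channels with no electrode
ELECTRODE_LIMIT = 11016  # exclusive upper bound of the interval above


def connected_channels(channels):
    """Map channel number -> electrode linear index, skipping unconnected channels."""
    mapping = {}
    for channel, electrode in enumerate(channels):
        if electrode == UNCONNECTED:
            continue
        if not 0 <= electrode < ELECTRODE_LIMIT:
            raise ValueError(
                f"channel {channel}: invalid electrode index {electrode}"
            )
        mapping[channel] = electrode
    return mapping


# Hypothetical, shortened "channels" array; a real one has n entries.
channels = [5021, -1, 17, -1, 11015]
mapping = connected_channels(channels)
print(mapping)       # {0: 5021, 2: 17, 4: 11015}
print(len(mapping))  # 3
```

If each connected channel maps to a distinct electrode, the number of entries in the mapping should match k, the length of the other five datasets.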