BrainScript CNTKBinary Reader
This page has migrated to our new site. Please update any bookmarks.
The CNTKBinaryReader (hereafter simply the binary reader) is designed to be used on large data corpora formatted according to the specification below. It supports the following main features:
- Multiple input streams (inputs) per file
- Both sparse and dense inputs
- Variable length sparse sequences
The Scripts/ctf2bin.py script can be used to convert data in the CNTKTextFormat into the CNTKBinaryFormat. Alternatively, one can implement the schema defined below.
The binary format is composed of 3 major sections:
- header
- offsets table
- data
The three sections are concatenated, in that order, into one contiguous file.
The header contains information regarding the data in the file. It enables the binary file to be used with minimal information provided in the config. Below is the format of the header.
count | type | name |
---|---|---|
1 | int64_t | version number |
1 | int64_t | number of chunks |
1 | int32_t | number of inputs |
numInputs | matHeader | input header |
version number is the version of the binary format that the file was created for. The current version number is 1.
number of chunks corresponds to the number of chunks (defined below) in the file.
number of inputs corresponds to the number of input streams in the file.
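For illustration only, below is a minimal Python sketch of how these three fields could be read back; the helper name and the little-endian byte order are assumptions, not part of the specification.

```python
import struct

# A minimal sketch of reading the three fixed header fields above.
# The helper name and little-endian byte order are assumptions, not part
# of the specification.
def read_file_header(f):
    version, num_chunks = struct.unpack("<qq", f.read(16))   # 2 x int64_t
    (num_inputs,) = struct.unpack("<i", f.read(4))            # 1 x int32_t
    # one matHeader per input (described below) follows immediately
    return version, num_chunks, num_inputs
```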
matHeader is a compound type describing the required information for each input stream. It is defined as:
count | type | name |
---|---|---|
1 | int32_t | len |
len | char | name |
1 | int32_t | deserializer |
1 | deserializer | deserializer header data |
len is the number of characters in the name variable.
name is the name of the stream. It is the same name that one will use as the name of the input node in the BrainScript specification.
deserializer is the deserializer that should be used to deserialize the binary data stored on disk. Currently there are two deserializers:
- DenseBinaryDataDeserializer
- SparseBinaryDataDeserializer
Each deserializer requires certain input parameters. These parameters are packed immediately after the deserializer type. Below are the specifications of the header data for each of the deserializers defined above.
DenseBinaryDataDeserializer header data:
count | type | name |
---|---|---|
1 | int32_t | elemType |
1 | int32_t | sampleSize |
elemType is 0 if the values are floats, and 1 if the values are doubles.
sampleSize is the size of one input sample (i.e., its number of columns).
SparseBinaryDataDeserializer header data:
count | type | name |
---|---|---|
1 | int32_t | storageType |
1 | int32_t | elemType |
1 | int32_t | isSequence |
1 | int32_t | sampleSize |
storageType is the storage type of the matrix. Currently only sparse CSC (with value 0) is supported.
elemType is 0 if the values are floats, and 1 if the values are doubles.
isSequence is 0 if the input is not a sequence, and 1 if the input is a sequence.
sampleSize is the size of one input sample (i.e., its number of columns).
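Continuing the sketch above, one matHeader plus its deserializer-specific header data could be read as follows; the numeric value identifying the dense deserializer (`dense_id`) is an assumption made purely for illustration.

```python
import struct

# Reads one matHeader and the deserializer header data that follows it.
# dense_id (the int32_t value denoting DenseBinaryDataDeserializer) is an
# illustrative assumption; byte order is assumed little-endian.
def read_mat_header(f, dense_id=0):
    (name_len,) = struct.unpack("<i", f.read(4))          # int32_t len
    name = f.read(name_len).decode("ascii")               # stream name
    (deserializer,) = struct.unpack("<i", f.read(4))      # deserializer type
    if deserializer == dense_id:
        # DenseBinaryDataDeserializer: elemType, sampleSize
        params = struct.unpack("<ii", f.read(8))
    else:
        # SparseBinaryDataDeserializer: storageType, elemType, isSequence, sampleSize
        params = struct.unpack("<iiii", f.read(16))
    return name, deserializer, params
```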
In the binary format, the data section is broken into chunks. The chunks should be sized such that reading one chunk can be done efficiently on the underlying platform. Since the chunks can be variable sized, an offsets table is required to indicate the starting position of each chunk in the data section. The offsets table also contains the number of sequences, and the number of samples in the chunk for efficient retrieval. Below is the specification for the offsets table:
count | type | name |
---|---|---|
numChunks | offsets table row | offsets row |
numChunks is the number of chunks defined in the header.
offsets table row is a compound type defined as:
count | type | name |
---|---|---|
1 | int64_t | offset |
1 | int32_t | numSequences |
1 | int32_t | numSamples |
offset is the number of bytes from the start of the data section to the point where the chunk begins.
numSequences is the number of sequences in the chunk.
numSamples is the number of samples in the chunk. This is defined as the sum of the number of samples in each sequence. The number of samples in a sequence is defined as the maximum number of samples over all inputs for that sequence.
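Under the same assumptions as the sketches above, the offsets table could be read as:

```python
import struct
from collections import namedtuple

OffsetsRow = namedtuple("OffsetsRow", ["offset", "num_sequences", "num_samples"])

# Reads numChunks rows of (int64_t offset, int32_t numSequences, int32_t numSamples).
def read_offsets_table(f, num_chunks):
    return [OffsetsRow(*struct.unpack("<qii", f.read(16))) for _ in range(num_chunks)]
```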
The data section contains the data outlined in chunks as described above. It is constructed by serializing all of the sequences for each input, in the order the inputs were defined in the header. Note that this means the inputs can be serialized in an arbitrary order, depending on preference, as long as that order matches the order in which they are declared in the header.
count | type | name |
---|---|---|
numInputs | serialized input | data |
Each input is serialized such that its deserializer can read the data from the chunk. Below are the serialization specifications for the deserializers defined in the matHeader section.
The DenseBinaryDataDeserializer's data is serialized as each sequence concatenated consecutively in memory. Specifically:
count | type | name |
---|---|---|
numSequences | elemType[sampleSize] | data
where:
numSequences is the number of sequences in the chunk (defined in the offsets table)
elemType[sampleSize] is the data encoded as an array of length sampleSize of type elemType, where sampleSize and elemType are defined in the header.
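Under the same assumptions as the sketches above, one dense input's portion of a chunk could be read as:

```python
import struct

# Reads numSequences blocks of sampleSize values each, stored back to back.
# elem_fmt is "f" for float or "d" for double, matching elemType in the header.
def read_dense_input(f, num_sequences, sample_size, elem_fmt="f"):
    fmt = "<%d%s" % (num_sequences * sample_size, elem_fmt)
    flat = struct.unpack(fmt, f.read(struct.calcsize(fmt)))
    # one tuple of sampleSize values per sequence
    return [flat[i * sample_size:(i + 1) * sample_size] for i in range(num_sequences)]
```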
The SparseBinaryDataDeserializer's data is serialized as a large matrix in CSC (compressed sparse column) format, with an added caveat of how row indices are packed (see below). Before continuing, it is recommended that one review the CSC sparse matrix format.
The data is therefore serialized as:
count | type | name |
---|---|---|
1 | int32_t | nnz |
1 | elemType[nnz] | vals |
1 | int32_t[nnz] | row indices |
1 | int32_t[numSequences+1] | condensed column indices |
where:
nnz is the number of non-zero values in the chunk.
vals is an array of elemType (float or double, as defined in the header) of length nnz containing the value of each element.
row indices is an int32_t array of length nnz containing the row indices for each element.
condensed column indices is an array of int32_t of length (numSequences+1) (where numSequences is defined in the offsets table) containing the offset at which each sequence begins.
In order to pack variable length sequences, we make one change to the definition of the row index. In particular, each row index[i] is encoded as follows:
new row index[i] = sample number * sampleSize + row index[i]
Where:
sample number is the (0 indexed) sample number for the value. Thus the first sample would have a sample number of 0, the second of 1, etc.
sampleSize is the width of a sample as defined in the header
row index[i] is the original row index for the element.
Thus one can recover the sample number for each element by integer-dividing the packed index by sampleSize, and the actual row index by taking the packed index modulo sampleSize.
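For example, with sampleSize = 100 a packed row index of 205 denotes row 5 of sample 2 (205 // 100 = 2 and 205 % 100 = 5). Under the same assumptions as the sketches above, one sparse input's portion of a chunk could be read and decoded as:

```python
import struct

# Reads the CSC-style payload for one sparse input in a chunk and undoes the
# packed row-index encoding. elem_fmt is "f" for float or "d" for double,
# matching elemType in the header; byte order is assumed little-endian.
def read_sparse_input(f, num_sequences, sample_size, elem_fmt="f"):
    (nnz,) = struct.unpack("<i", f.read(4))
    vals = struct.unpack("<%d%s" % (nnz, elem_fmt), f.read(nnz * struct.calcsize(elem_fmt)))
    packed = struct.unpack("<%di" % nnz, f.read(4 * nnz))
    col_offsets = struct.unpack("<%di" % (num_sequences + 1), f.read(4 * (num_sequences + 1)))
    samples = [p // sample_size for p in packed]   # sample number within the sequence
    rows = [p % sample_size for p in packed]       # original row index
    return vals, samples, rows, col_offsets
```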
The binary reader is configured through the reader section of the BrainScript config. It accepts the following parameters:

Parameter | Mandatory | Accepted values | Default value | Description |
---|---|---|---|---|
`readerType` | Yes | one of the supported CNTK readers | | Specifies the reader flavor to load (e.g., `CNTKBinaryReader`) |
`file` | Yes | File path | | Path to the file containing the input dataset (Windows or Linux style) |
`randomize` | No | `true`, `false` | `true` | Specifies whether the input should be randomized |
`randomizationSeed` | No | Positive integer | 0 | Initial randomization seed value (incremented every sweep when the input data is re-randomized). |
`randomizationWindow` | No | Positive integer | | Specifies the randomization range (in number of samples)¹. This controls how much of the dataset resides in memory. |
`traceLevel` | No | `0`, `1`, `2` | `1` | Output verbosity level. `0` - show only errors; `1` - show errors and warnings; `2` - show all output² |
`keepDataInMemory` | No | `true`, `false` | `false` | If `true`, the whole dataset will be cached in memory |
¹ If no `randomizationWindow` is specified, the randomization range is set to be equal to the size of the dataset (i.e., the input is randomized across the whole dataset). `randomizationWindow` is ignored when `randomize` is set to `false`.

² In order to force the reader to show a warning for each input error it ignores (in the sense of not raising an exception), a non-default `maxErrors` value should be used in combination with `traceLevel` set to `1` or above.
`input` combines a number of individual inputs, each with an appropriately labeled configuration sub-section. All parameters described below are specific to the sub-section associated with a particular input.
Parameter | Mandatory | Accepted values | Description |
---|---|---|---|
`alias` | No | String | A way of renaming an input in a converted binary file into a new stream for use in the network. |
The `alias` parameter is mainly used in case one has an already created binary file that one wants to use with an already created network. In such a case, if the input names do not match, one would have to regenerate the binary file (or at least rewrite the header) in order to use the two pieces together. Instead, the binary reader offers the ability to rename streams in the input into new output streams via the `alias` parameter. In such a case the input stream named by the `alias` parameter is renamed, and will be mapped to a new output stream.