BrainScript CNTKBinary Reader
This page has migrated to our new site. Please update any bookmarks.
The CNTKBinaryReader (hereafter simply the binary reader) is designed to be used on large data corpora formatted according to the specification below. It supports the following main features:
- Multiple input streams (inputs) per file
- Both sparse and dense inputs
- Variable length sparse sequences
The Scripts/ctf2bin.py script can be used to convert data in the CNTKTextFormat into the CNTKBinaryFormat. Alternatively, one can implement the schema defined below.
The binary format is composed of 3 major sections:
- header
- offsets table
- data
The three sections are concatenated, in that order, into one contiguous file.
The header contains information regarding the data in the file. It enables the binary file to be used with minimal information provided in the config. Below is the format of the header.
count | type | name |
---|---|---|
1 | int64_t | version number |
1 | int64_t | number of chunks |
1 | int32_t | number of inputs |
numInputs | matHeader | input header |
version number is the version of the binary format that the file was created for. The current version number is 1.
number of chunks corresponds to the number of chunks (defined below) in the file.
number of inputs corresponds to the number of input streams in the file.
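For illustration only, below is a minimal Python sketch of how these three fields could be read back; the helper name and the little-endian byte order are assumptions, not part of the specification.

```python
import struct

# A minimal sketch of reading the three fixed header fields above.
# The helper name and little-endian byte order are assumptions, not part
# of the specification.
def read_file_header(f):
    version, num_chunks = struct.unpack("<qq", f.read(16))   # 2 x int64_t
    (num_inputs,) = struct.unpack("<i", f.read(4))            # 1 x int32_t
    # one matHeader per input (described below) follows immediately
    return version, num_chunks, num_inputs
```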
matHeader is a compound type describing the required information for each input stream. It is defined as:
count | type | name |
---|---|---|
1 | int32_t | len |
len | char | name |
1 | int32_t | deserializer |
1 | deserializer | deserializer header data |
len is the number of characters in the name variable.
name is the name of the stream. It is the same name that one will use as the name of the input node in the BrainScript specification.
deserializer is the deserializer that should be used to deserialize the binary data stored on disk. Currently there are two deserializers:
- DenseBinaryDataDeserializer
- SparseBinaryDataDeserializer
Each deserializer requires certain input parameters. These parameters are packed immediately after the deserializer type. Below are the specifications of the header data for each of the deserializers defined above.
DenseBinaryDataDeserializer header data:
count | type | name |
---|---|---|
1 | int32_t | elemType |
1 | int32_t | sampleSize |
elemType is 0 if the values are floats, and 1 if the values are doubles.
sampleSize is the size of one input sample (i.e., its number of columns).
SparseBinaryDataDeserializer header data:
count | type | name |
---|---|---|
1 | int32_t | storageType |
1 | int32_t | elemType |
1 | int32_t | isSequence |
1 | int32_t | sampleSize |
storageType is the storage type of the matrix. Currently only sparse CSC (with value 0) is supported.
elemType is 0 if the values are floats, and 1 if the values are doubles.
isSequence is 0 if the input is not a sequence, and 1 if the input is a sequence.
sampleSize is the size of one input sample (i.e., its number of columns).
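Continuing the sketch above, one matHeader plus its deserializer-specific header data could be read as follows; the numeric value identifying the dense deserializer (`dense_id`) is an assumption made purely for illustration.

```python
import struct

# Reads one matHeader and the deserializer header data that follows it.
# dense_id (the int32_t value denoting DenseBinaryDataDeserializer) is an
# illustrative assumption; byte order is assumed little-endian.
def read_mat_header(f, dense_id=0):
    (name_len,) = struct.unpack("<i", f.read(4))          # int32_t len
    name = f.read(name_len).decode("ascii")               # stream name
    (deserializer,) = struct.unpack("<i", f.read(4))      # deserializer type
    if deserializer == dense_id:
        # DenseBinaryDataDeserializer: elemType, sampleSize
        params = struct.unpack("<ii", f.read(8))
    else:
        # SparseBinaryDataDeserializer: storageType, elemType, isSequence, sampleSize
        params = struct.unpack("<iiii", f.read(16))
    return name, deserializer, params
```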
In the binary format, the data section is broken into chunks. The chunks should be sized such that reading one chunk can be done efficiently on the underlying platform. Since the chunks can be variable sized, an offsets table is required to indicate the starting position of each chunk in the data section. The offsets table also contains the number of sequences, and the number of samples in the chunk for efficient retrieval. Below is the specification for the offsets table:
count | type | name |
---|---|---|
numChunks | offsets table row | offsets row |
numChunks is the number of chunks defined in the header.
offsets table row is a compound type defined as:
count | type | name |
---|---|---|
1 | int64_t | offset |
1 | int32_t | numSequences |
1 | int32_t | numSamples |
offset is the number of bytes from the start of the data section to the point where the chunk begins.
numSequences is the number of sequences in the chunk.
numSamples is the number of samples in the chunk. This is defined as the sum of the number of samples in each sequence. The number of samples in a sequence is defined as the maximum number of samples over all inputs for that sequence.
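Under the same assumptions as the sketches above, the offsets table could be read as:

```python
import struct
from collections import namedtuple

OffsetsRow = namedtuple("OffsetsRow", ["offset", "num_sequences", "num_samples"])

# Reads numChunks rows of (int64_t offset, int32_t numSequences, int32_t numSamples).
def read_offsets_table(f, num_chunks):
    return [OffsetsRow(*struct.unpack("<qii", f.read(16))) for _ in range(num_chunks)]
```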
The data section contains the data outlined in chunks as described above. It is constructed by serializing all of the sequences for each input, in the order the inputs were defined in the header. Note that this means the inputs can be serialized in an arbitrary order, depending on preference, as long as that order matches the order in which they are declared in the header.
count | type | name |
---|---|---|
numInputs | serialized input | data |
Each input is serialized such that its deserializer can read the data from the chunk. Below are the serialization specifications for the deserializers defined in the matHeader section.
The DenseBinaryDataDeserializer's data is serialized as each sequence concatenated consecutively in memory. Specifically:
count | type | name |
---|---|---|
numSequences | elemType[sampleSize] | data
where:
numSequences is the number of sequences in the chunk (defined in the offsets table)
elemType[sampleSize] is the data encoded as an array of length sampleSize of type elemType, where sampleSize and elemType are defined in the header.
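Under the same assumptions as the sketches above, one dense input's portion of a chunk could be read as:

```python
import struct

# Reads numSequences blocks of sampleSize values each, stored back to back.
# elem_fmt is "f" for float or "d" for double, matching elemType in the header.
def read_dense_input(f, num_sequences, sample_size, elem_fmt="f"):
    fmt = "<%d%s" % (num_sequences * sample_size, elem_fmt)
    flat = struct.unpack(fmt, f.read(struct.calcsize(fmt)))
    # one tuple of sampleSize values per sequence
    return [flat[i * sample_size:(i + 1) * sample_size] for i in range(num_sequences)]
```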
The SparseBinaryDataDeserializer's data is serialized as a large matrix in CSC (compressed sparse column) format, with an added caveat of how row indices are packed (see below). Before continuing, it is recommended that one review the CSC sparse matrix format.
The data is therefore serialized as:
count | type | name |
---|---|---|
1 | int32_t | nnz |
1 | elemType[nnz] | vals |
1 | int32_t[nnz] | row indices |
1 | int32_t[numSequences+1] | condensed column indices |
where:
nnz is the number of non-zero values in the chunk.
vals is an array of elemType (float or double, as defined in the header) of length nnz containing the value of each element.
row indices is an int32_t array of length nnz containing the row indices for each element.
condensed column indices is an array of int32_t of length (numSequences+1) (where numSequences is defined in the offsets table) containing the offset at which each sequence begins.
In order to pack variable length sequences, we make one change to the definition of the row index. In particular, each row index[i] is encoded as follows:
new row index[i] = sample number * sampleSize + row index[i]
Where:
sample number is the (0 indexed) sample number for the value. Thus the first sample would have a sample number of 0, the second of 1, etc.
sampleSize is the width of a sample as defined in the header
row index[i] is the original row index for the element.
Thus one can recover the sample number for each element by integer-dividing the packed index by sampleSize, and the actual row index by taking the packed index modulo sampleSize.
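For example, with sampleSize = 100 a packed row index of 205 denotes row 5 of sample 2 (205 // 100 = 2 and 205 % 100 = 5). Under the same assumptions as the sketches above, one sparse input's portion of a chunk could be read and decoded as:

```python
import struct

# Reads the CSC-style payload for one sparse input in a chunk and undoes the
# packed row-index encoding. elem_fmt is "f" for float or "d" for double,
# matching elemType in the header; byte order is assumed little-endian.
def read_sparse_input(f, num_sequences, sample_size, elem_fmt="f"):
    (nnz,) = struct.unpack("<i", f.read(4))
    vals = struct.unpack("<%d%s" % (nnz, elem_fmt), f.read(nnz * struct.calcsize(elem_fmt)))
    packed = struct.unpack("<%di" % nnz, f.read(4 * nnz))
    col_offsets = struct.unpack("<%di" % (num_sequences + 1), f.read(4 * (num_sequences + 1)))
    samples = [p // sample_size for p in packed]   # sample number within the sequence
    rows = [p % sample_size for p in packed]       # original row index
    return vals, samples, rows, col_offsets
```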
The binary reader is configured through the reader section of the BrainScript config. It accepts the following parameters:

Parameter | Mandatory | Accepted values | Default value | Description |
---|---|---|---|---|
`readerType` | Yes | one of the supported CNTK readers | | Specifies the reader flavor to load (e.g., `CNTKBinaryReader`) |
`file` | Yes | File path | | Path to the file containing the input dataset (Windows or Linux style) |
`randomize` | No | `true`, `false` | `true` | Specifies whether the input should be randomized |
`randomizationSeed` | No | Positive integer | 0 | Initial randomization seed value (incremented every sweep when the input data is re-randomized). |
`randomizationWindow` | No | Positive integer | | Specifies the randomization range (in number of samples)¹. This controls how much of the dataset resides in memory. |
`traceLevel` | No | `0`, `1`, `2` | `1` | Output verbosity level. `0` - show only errors; `1` - show errors and warnings; `2` - show all output² |
`keepDataInMemory` | No | `true`, `false` | `false` | If `true`, the whole dataset will be cached in memory |
¹ If no `randomizationWindow` is specified, the randomization range is set to be equal to the size of the dataset (i.e., the input is randomized across the whole dataset). `randomizationWindow` is ignored when `randomize` is set to `false`.

² In order to force the reader to show a warning for each input error it ignores (in the sense of not raising an exception), a non-default `maxErrors` value should be used in combination with `traceLevel` set to `1` or above.
`input` combines a number of individual inputs, each with an appropriately labeled configuration sub-section. All parameters described below are specific to the sub-section associated with a particular input.
Parameter | Mandatory | Accepted values | Description |
---|---|---|---|
`alias` | No | String | A way of renaming an input in a converted binary file into a new stream for use in the network. |
The `alias` parameter is mainly used in case one has an already created binary file that one wants to use with an already created network. In such a case, if the input names do not match, one would have to regenerate the binary file (or at least rewrite the header) in order to use the two pieces together. Instead, the binary reader offers the ability to rename streams in the input into new output streams via the `alias` parameter. In such a case the input stream named by the `alias` parameter is renamed, and will be mapped to a new output stream.