This version of the dataset stores the images, pre-shuffled, in chunks of 200,000 samples. Since the images are already stored with all augmentations applied, the chunks can be read sequentially and training can start without downloading the entire dataset.
Each `*.npz` file contains three arrays (a short reading example follows the list):

- `images` is a NumPy array with shape `[32, W, 3]` and type `np.uint8`, where $W$ is the sum of the widths of all images contained in the package.
- `width` is a NumPy array with shape `[200000]` and type `np.uint16`, containing the width of each image inside the package.
- `idx` is a NumPy array with shape `[200000]` and type `np.uint32`, containing the id of each image inside the package. From the id you can recover the `font` and the `word` shown in the image, using the corresponding dictionaries contained in the `fonts.json` and `words.json` files.
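As an illustration, here is a minimal sketch of how a single chunk could be read and split back into individual images; the chunk filename is hypothetical, and the slicing simply follows the `width` array described above:

import numpy as np

# Hypothetical chunk filename; each chunk holds 200,000 samples
data = np.load('chunk_000.npz')
images = data['images']   # shape [32, W, 3], dtype uint8
widths = data['width']    # shape [200000], dtype uint16
ids = data['idx']         # shape [200000], dtype uint32

# The images are concatenated along the width axis, so cumulative widths
# give the horizontal offset of each sample (int64 to avoid overflow)
offsets = np.concatenate(([0], np.cumsum(widths, dtype=np.int64)))
first_image = images[:, offsets[0]:offsets[1], :]   # 32 x widths[0] x 3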
The file `indices.npz` contains the list of indices of all samples in the dataset. It was generated with this code snippet:
import random
import numpy as np

# One global index per (font, word) pair, shuffled with a fixed seed
indices = list(range(len(fonts) * len(words)))
random.seed(742)
random.shuffle(indices)
indices = np.array(indices, dtype=np.uint32)
In any case, you can directly load the pre-shuffled indices with the following snippet:
indices = np.load('indices.npz')['indices']
To obtain the `font` and the `word` given the global `idx`:
font_id = idx // len(words)   # which font was used
word_id = idx % len(words)    # which word was rendered
font = fonts[font_id]
word = words[word_id]
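For instance, a minimal sketch of the full lookup, assuming `fonts.json` and `words.json` decode to lists (or other structures indexable by the integer ids computed above):

import json

# Load the id -> font and id -> word dictionaries
with open('fonts.json') as f:
    fonts = json.load(f)
with open('words.json') as f:
    words = json.load(f)

idx = 0  # any global id, e.g. one value taken from a chunk's idx array
font = fonts[idx // len(words)]
word = words[idx % len(words)]
print(font, word)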