Pre-shuffled dataset

@vittoriopippi vittoriopippi released this 14 Jun 09:46

This version of the dataset stores the images pre-shuffled in chunks of 200 000 samples. Since the images are already stored with all augmentations applied, the chunks can be read sequentially, so training can start without downloading the entire dataset first.

Each *.npz file contains three arrays inside:

  • images is a NumPy array with shape [32, W, 3] and type np.uint8, where W is the sum of the widths of all images in the package
  • width is a NumPy array with shape [200000] and type np.uint16, containing the width of each image in the package
  • idx is a NumPy array with shape [200000] and type np.uint32, containing the global id of each image in the package. Given the id, you can recover the font and the word rendered in the image using the corresponding dictionaries in the fonts.json and words.json files.

The file indices.npz contains the list of indices of all samples inside the dataset. The file is generated with this code snippet:

import random
import numpy as np

indices = list(range(len(fonts) * len(words)))
random.seed(742)
random.shuffle(indices)
indices = np.array(indices, dtype=np.uint32)

In any case, you can load the pre-shuffled indices directly with the following snippet:

indices = np.load('indices.npz')['indices']

To obtain font and word given the global idx:

font_id = idx // len(words)
word_id = idx % len(words)
font = fonts[font_id]
word = words[word_id]
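This mapping simply inverts how the global id is built, i.e. idx = font_id * len(words) + word_id. A minimal sketch, using toy stand-ins for the real fonts.json and words.json dictionaries:

```python
# Toy stand-ins for the dictionaries loaded from fonts.json / words.json.
fonts = ['FontA', 'FontB', 'FontC']
words = ['hello', 'world']

idx = 4  # hypothetical global sample id
font_id = idx // len(words)  # 4 // 2 == 2
word_id = idx % len(words)   # 4 %  2 == 0
assert (fonts[font_id], words[word_id]) == ('FontC', 'hello')

# Round trip: the global id is recovered from the pair.
assert font_id * len(words) + word_id == idx
```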