Nanoset's index builder does not re-shuffle the dataset and sample indices for the second, third, and later epochs. Instead, it concatenates a copy of the same indices for any repeated data. See the code reference at the end of this issue.
I would suggest changing this so the shuffling is applied within each epoch before concatenating the data. Here's the edited version I use:
    # Shuffle the indices within each epoch and concatenate them
    r = np.random.RandomState(self.random_seed)
    epoch_random_seeds = r.randint(0, 2**32 - 1, num_epochs)
    dataset_indices = []
    dataset_sample_indices = []
    for i in range(num_epochs):
        # Shuffle the sample and dataset indices in each epoch with the same seed
        numpy_random_state = np.random.RandomState(epoch_random_seeds[i])
        numpy_random_state.shuffle(dataset_index)
        numpy_random_state = np.random.RandomState(epoch_random_seeds[i])
        numpy_random_state.shuffle(dataset_sample_index)
        # shuffle() works in place, so append copies; otherwise every epoch
        # would alias the same array and end up with the last permutation
        dataset_indices.append(dataset_index.copy())
        dataset_sample_indices.append(dataset_sample_index.copy())
    # Concatenate the within-epoch shuffled indexes
    dataset_index = np.concatenate(dataset_indices)
    dataset_sample_index = np.concatenate(dataset_sample_indices)
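
For anyone who wants to try the idea outside of Nanoset, here is a minimal standalone sketch. It is not the library's code: the function name `build_epoch_shuffled_index` and its arguments are hypothetical, and it swaps the paired in-place `shuffle` calls for a single `permutation` per epoch applied to both arrays, which keeps the (dataset, sample) pairs aligned and avoids aliasing the same array across epochs.

```python
import numpy as np


def build_epoch_shuffled_index(dataset_index, dataset_sample_index, num_epochs, random_seed):
    """Hypothetical helper: repeat both indices num_epochs times, reshuffled per epoch."""
    dataset_index = np.asarray(dataset_index)
    dataset_sample_index = np.asarray(dataset_sample_index)

    # Derive one seed per epoch from the base seed so the result is reproducible.
    seed_rng = np.random.RandomState(random_seed)
    epoch_seeds = seed_rng.randint(0, 2**31 - 1, size=num_epochs)

    dataset_parts, sample_parts = [], []
    for seed in epoch_seeds:
        # One permutation per epoch, applied to both arrays so pairs stay aligned.
        perm = np.random.RandomState(seed).permutation(len(dataset_index))
        # Fancy indexing returns new arrays, so epochs never share storage.
        dataset_parts.append(dataset_index[perm])
        sample_parts.append(dataset_sample_index[perm])

    return np.concatenate(dataset_parts), np.concatenate(sample_parts)


# Toy input: two datasets with 2 and 3 samples (assumed values for illustration).
ds = np.array([0, 0, 1, 1, 1])
samples = np.array([0, 1, 0, 1, 2])
d, s = build_epoch_shuffled_index(ds, samples, num_epochs=3, random_seed=1234)
```

Each block of five entries in `d` and `s` is one epoch; every epoch contains the same (dataset, sample) pairs, ordered by that epoch's own permutation.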
If you think this is reasonable I can submit a pull request.
Referenced code: nanotron/src/nanotron/data/nanoset.py, lines 114 to 124 in 51ca40b