Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nanoset reuses same shuffling for repeated data (epochs) #237

Open
Lauler opened this issue Oct 31, 2024 · 1 comment
Open

Nanoset reuses same shuffling for repeated data (epochs) #237

Lauler opened this issue Oct 31, 2024 · 1 comment

Comments

@Lauler
Copy link

Lauler commented Oct 31, 2024

Nanoset's index builder does not re-shuffle dataset and sample indices within epochs when training secondary, third, etc epochs. It instead concatenates a copy of the same indices for any repeated data. See:

# Shuffle the indexes the same way
numpy_random_state = np.random.RandomState(self.random_seed)
numpy_random_state.shuffle(dataset_index)
numpy_random_state = np.random.RandomState(self.random_seed)
numpy_random_state.shuffle(dataset_sample_index)
# Concatenate num_epochs the shuffled indexes
dataset_index = np.concatenate([dataset_index for _ in range(num_epochs)])
dataset_sample_index = np.concatenate([dataset_sample_index for _ in range(num_epochs)])
# Just keep the necessary samples
dataset_index = dataset_index[: self.train_split_num_samples]
dataset_sample_index = dataset_sample_index[: self.train_split_num_samples]

I would suggest changing this so the shuffling is applied within each epoch before concatenating the data. Here's the edited version I use:

# Shuffle the indices within each epoch and concatenate them
r = np.random.RandomState(self.random_seed)
epoch_random_seeds = r.randint(0, 2**32 - 1, num_epochs)
dataset_indices = []
dataset_sample_indices = []
for i in range(num_epochs):
    # Shuffle the sample and dataset indices in epoch with same seed
    numpy_random_state = np.random.RandomState(epoch_random_seeds[i])
    numpy_random_state.shuffle(dataset_index)
    numpy_random_state = np.random.RandomState(epoch_random_seeds[i])
    numpy_random_state.shuffle(dataset_sample_index)

    dataset_indices.append(dataset_index)
    dataset_sample_indices.append(dataset_sample_index)

# Concatenate the within-epoch shuffled indexes
dataset_index = np.concatenate(dataset_indices)
dataset_sample_index = np.concatenate(dataset_sample_indices)

If you think this is reasonable I can submit a pull request.

@TJ-Solergibert
Copy link
Contributor

Yes! I warned this in the little docs I wrote a while back.

Looks good to me, make sure that the unit tests are passing and try to submit a PR!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants