Dataloader Draft #24
Conversation
src/data/sequence_dataloader.py
Outdated
from torch.utils.data import Dataset, DataLoader


class SequenceDatasetBase(Dataset):
    def __init__(self, data_path, transform=None):
Maybe add sequence_length=200 here so we have flexibility later.
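A minimal sketch of the suggested signature (a stand-in for illustration; in the real module the class subclasses `torch.utils.data.Dataset` and loads data from `data_path`, which is omitted here):

```python
class SequenceDatasetBase:  # subclasses torch.utils.data.Dataset in the real module
    def __init__(self, data_path, sequence_length=200, transform=None):
        self.data_path = data_path
        # Storing the window size lets later code use self.sequence_length
        # instead of a hard-coded 200.
        self.sequence_length = sequence_length
        self.transform = transform
```

Callers that need a different window size can then pass it at construction time.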
src/data/sequence_dataloader.py
Outdated
# Iterating through DNA sequences from dataset and one-hot encoding all nucleotides
current_seq = self.data["raw_sequence"][index]
if 'N' not in current_seq:
    X_seq = np.array(self.one_hot_encode(current_seq, ['A','C','T','G'], 200))
Here we can replace the hard-coded 200 with self.sequence_length.
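With the constructor storing the value, the call site can drop the literal. A minimal sketch of the change (the `one_hot_encode` stub and the output shape here are hypothetical stand-ins, not the PR's actual encoder):

```python
import numpy as np

class SequenceDatasetBase:
    def __init__(self, data, sequence_length=200):
        self.data = data  # assumed dict-like with a "raw_sequence" column
        self.sequence_length = sequence_length

    def one_hot_encode(self, seq, alphabet, sequence_length):
        # Stand-in encoder: returns a zero matrix of the target shape.
        return np.zeros((len(alphabet), sequence_length))

    def __getitem__(self, index):
        current_seq = self.data["raw_sequence"][index]
        if 'N' not in current_seq:
            # self.sequence_length replaces the literal 200
            return np.array(self.one_hot_encode(
                current_seq, ['A', 'C', 'T', 'G'], self.sequence_length))
```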
src/data/sequence_dataloader.py
Outdated
        return X_seq, X_cell_type

    # Function for one-hot encoding each line of the sequence dataset
    def one_hot_encode(self, seq, alphabet, max_seq_len):
Replace max_seq_len with sequence_length for consistency.
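After the rename, the encoder might look like this (a sketch only; the padding/truncation behaviour is an assumption, since the body of the original function is not shown in the diff):

```python
import numpy as np

def one_hot_encode(seq, alphabet, sequence_length):
    """One-hot encode seq over alphabet, padded/truncated to sequence_length."""
    encoding = np.zeros((len(alphabet), sequence_length))
    for i, base in enumerate(seq[:sequence_length]):
        # One row per alphabet symbol, one column per sequence position;
        # positions beyond len(seq) remain all-zero padding.
        encoding[alphabet.index(base), i] = 1.0
    return encoding
```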
The dataloader is now up to date with all changes regarding one-hot encoding of components and has been renamed to fit our new folder structure.
See #17 for earlier discussion.