Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wip: dataloader first draft #17

Merged
merged 3 commits into from
Oct 17, 2022
Merged

wip: dataloader first draft #17

merged 3 commits into from
Oct 17, 2022

Conversation

ssenan
Copy link
Collaborator

@ssenan ssenan commented Oct 16, 2022

No description provided.

@ssenan ssenan linked an issue Oct 16, 2022 that may be closed by this pull request
@mateibejan1
Copy link
Collaborator

Could you just change the name of the file to sequence_dataloader.py, so we can differentiate from future possible loaders?

@lucapinello
Copy link
Collaborator

Thanks guys!

Two comments:

  1. I am fine to have separate files, @mateibejan1 however we can also have different classes on this module and keep all the related classes and functions in a small number of files. I am envisioning to do the same for the metrics e.g. we can have a single file called metrics and we can import from there.
  2. @ssenan I think it may be better to not use the one-hot encoding for the components and using a simple integer so we can keep memory under control when we will load the entire 3.5M sequences!

@mateibejan1
Copy link
Collaborator

I understand you perspective @lucapinello. I would only like to apply this separation-by-class to diffusion code, denoising networks and dataloaders. The reasoning is that the code for these types of files tends to get repetitive and once you have 3-4 modules in the same file it becomes tiring to keep track of which section you're debugged or changed. I am ok with keeping all metrics, utils etc. in a small number of files in their designated directories.

self.num_workers = num_workers


def prepare_data(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably drop this redundant function definition.

import pytorch_lightning as pl
from torch.utils.data import Dataset, DataLoader

class SequenceDataset(Dataset):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Proposition: We could also separate this script into a dataloader.py and a dataset.py.

The dataset.py would contain the SequenceDataset Class and the dataloader.py can have the SequenceDataModule Class (or a BaseDataModule and further subclasses such as SequenceDataModule as suggested by @lucapinello)

@ssenan
Copy link
Collaborator Author

ssenan commented Oct 16, 2022

@mateibejan1 Thanks for the suggestion. Can you help me better understand an example file naming breakdown you had in mind in regards to the entire project?

My thought process was more inline with Luca's where all loaders can be kept in a single dataloader.py file, and they can then be called in a main script using something like "from dataloader import SequenceDataModule". This is in line with some other implementations using medical image data that I have seen.

I'm also available on discord @ssenan

@IhabBendidi
Copy link
Collaborator

There is currently a discussion on folder structure of the code base in #19 , referencing it here so that you guys can check it and add your ideas

@ssenan ssenan merged commit ce9b3bb into pinellolab:codebase Oct 17, 2022
This was referenced Oct 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Create a Data Loader Class with Pytorch Lightning
5 participants