-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Loader should expose a snapshot dimension interface #36
Comments
It is pretty easy to add to the data set or data loader as of PR #49 . The data members that have size information are already on the dataset object, and if you have the data loader, its easy to get the dataset. We can pretty easily syntactic sugar pass-through methods so that the dataset/dataloader distinction isn't something the user has to think about. Unfortunately getting access to the data loader from top level is a bit involved. Fibad.train (or .predict) runs a function that is essentially a script, which has the data loader as a local variable during the I think in order to interrogate snapshot dimensions, @aritraghsh09 (or another notebook user) is going to run something like this:
Note that I'm thinking the best way forward is to create some fields on Fibad to hold the current model and data loader, and then move the train/predict/download verbs inside the Fibad class replacing the stubs that exist there now, so they have easy access to the Fibad fields. This wouldn't change the functionality of the cli at all, since it already routes calls through the Fibad object, but it would cut out one of our earlier design ideas that fibad "verbs" correspond to a file with a @drewoldag @aritraghsh09 Thoughts? |
I see where you're coming from with this, it does make sense to avoid the expensive file-scan calls for sure. But I'm not confident that pulling in the definitions of the verb scripts into the Fibad object is the best way to maintain access to the instantiated data loader. I think I would be more in favor being able to rapidly instantiate a dataloader. Would it be possible to cache a metadata file to skip the file-scan and enable analysis outside of fibad? I think this is very vaguely similar to what @aritraghsh09 was mentioning at our last sync about getting the filenames appended to the catalog table. The cached metadata isn't specifically what Aritra was asking for, but could augment it and serve to help the user understand the data set. Similar to what you called out in the last comment, the user might do this:
Additionally, a cached metadata file could support something similar to what huggingface offers for looking at data set statistics. Which also might be a nice thing to have for down the line. For example: https://huggingface.co/datasets/mwalmsley/gz_desi/viewer/default/train Second point: Having the Fibad object dispatch to an externally defined download/train/predict |
Ok, I'm going to try @drewoldag 's suggestion here, and pull this into the download manifest/fail-file work in order to make roughly the notebook vision above a reality without doing redundant file system scans. |
The idea here is that in the situation where a researcher is loading HSC cutouts from a directory where there are significantly heterogenous image sizes, there should be some access to the file-scan data that the HSC data loader has.
This access will in theory allow the researcher to determine what an appropriate cutout size is to give the data loader. See: #35
The text was updated successfully, but these errors were encountered: