Confirm that HSCDataSet splits are working as expected #129

Open
drewoldag opened this issue Dec 4, 2024 · 3 comments

@drewoldag
Collaborator

I wasn't present at the discussions about how this should behave, so perhaps it's functioning exactly as expected, but it seems a little odd that the user can configure the splits in the config file such that some data will not be included in any splits.

@drewoldag drewoldag self-assigned this Dec 4, 2024
@aritraghsh09
Collaborator

I think this is the behavior we designed for the situation where a user might want to use only a small fraction of a large dataset.
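To make the case under discussion concrete, here is a minimal sketch of fractional splits that sum to less than 1, so part of the dataset is assigned to no split. It is not the HSCDataSet code: `split_indices`, the `validate_size` and `test_size` parameter names, and the floor rounding are all assumptions made for illustration.

```python
import random


def split_indices(n_items, train_size, validate_size, test_size, seed=0):
    """Hypothetical helper (not part of fibad): partition range(n_items).

    Each size is a fraction of the whole dataset. If the fractions sum to
    less than 1, the leftover indices are returned under "unused".
    """
    if train_size + validate_size + test_size > 1.0 + 1e-9:
        raise ValueError("split fractions must not sum to more than 1")

    # Shuffle once with a fixed seed so the splits are reproducible.
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)

    # Assumption: round each count down so the counts never exceed n_items.
    n_train = int(n_items * train_size)
    n_validate = int(n_items * validate_size)
    n_test = int(n_items * test_size)

    # Cut consecutive, non-overlapping slices out of the shuffled indices.
    a = n_train
    b = a + n_validate
    c = b + n_test
    return {
        "train": indices[:a],
        "validate": indices[a:b],
        "test": indices[b:c],
        "unused": indices[c:],
    }


# Using only 20% of a 1000-item dataset leaves 800 items in no split.
splits = split_indices(1000, train_size=0.1, validate_size=0.05, test_size=0.05)
for name, idx in splits.items():
    print(name, len(idx))
```

Reporting the unused count explicitly, as this sketch does, is one way to make it obvious to the user that some data was left out on purpose.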

@aritraghsh09
Collaborator

I have a related issue with HSCDataSet splits that I will bundle here. This happens when trying to load the entirety of the HSC 0.25 < z < 0.50 dataset with train_size = 0.8 and the other two sizes each set to 0.1.

[2024-12-09 07:11:00,625 fibad.data_sets.hsc_data_set:INFO] HSC Data set loader has 8088376 objects
[2024-12-09 07:11:02,224 fibad.data_sets.hsc_data_set:INFO] HSC Data Set Splits loaded are:
[2024-12-09 07:11:02,225 fibad.data_sets.hsc_data_set:INFO] test split contains 808838 items
[2024-12-09 07:11:02,225 fibad.data_sets.hsc_data_set:INFO] train split contains 6470701 items
[2024-12-09 07:11:02,225 fibad.data_sets.hsc_data_set:INFO] validate split contains 1 items
[2024-12-09 07:11:04,178 fibad.data_sets.hsc_data_set:INFO] Test split contains 808838 items
[2024-12-09 07:11:04,179 fibad.data_sets.hsc_data_set:INFO] Train split contains 6470701 items
[2024-12-09 07:11:04,179 fibad.data_sets.hsc_data_set:INFO] Validation split contains 808837 items

Interestingly, the first set of printed messages has the wrong number of items in the validation set. Also, why are there two sets of print statements, with the only difference being the capitalisation of the first letter?
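As a quick arithmetic check on the log above: 10% of 8088376 is 808837.6 and 80% is 6470700.8, so the second set of counts is what you get if the test and train counts are rounded to the nearest integer and validation takes whatever remains. The sketch below only reproduces those numbers; `expected_counts` is a hypothetical helper, and the rounding rule is an assumption inferred from the log, not a statement about what HSCDataSet does internally.

```python
def expected_counts(n_items, train_size, test_size):
    """Hypothetical helper: split counts under an assumed rounding rule.

    Assumption: train and test are rounded to the nearest integer, and
    validation receives the remainder so the three counts sum to n_items.
    """
    n_train = round(n_items * train_size)
    n_test = round(n_items * test_size)
    n_validate = n_items - n_train - n_test
    return {"test": n_test, "train": n_train, "validate": n_validate}


counts = expected_counts(8088376, train_size=0.8, test_size=0.1)
print(counts)
# {'test': 808838, 'train': 6470701, 'validate': 808837}
```

These values match the second set of log lines. The `validate split contains 1 items` line in the first set is not explained by this rule.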

@aritraghsh09 aritraghsh09 added the Data Loader Data Loader code primarily label Dec 9, 2024
@drewoldag
Collaborator Author

It's being printed twice because there are actually two approaches to splitting that happen in the HSCDataSet class right now. The code will be making use of the second method (the one that is producing the test=808k, train=6.47M, val=808k), but both have been kept for the time being.

Given your first comment, @aritraghsh09, it would be good for me to go back and make sure that the second method properly allows users to define a small fraction of a large dataset.
