Confirm that HSCDataSet splits are working as expected #129

drewoldag · 2024-12-04T03:23:19Z

I wasn't present at the discussions about how this should behave, so perhaps it's functioning exactly as expected, but it seems a little odd that the user can configure the splits in the config file such that some data will not be included in any split.

aritraghsh09 · 2024-12-09T14:55:55Z

I think this is how we designed the expected behavior to be in a situation where a user might want to use only a small fraction of a large dataset.
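To make the intended behavior concrete, here is a minimal sketch of fraction-based splitting in which the sizes may sum to less than 1.0 and the remainder is simply left out. The function name `split_indices`, its parameters, and the "unused" bucket are hypothetical and are not taken from the fibad code; this only illustrates the idea described above.

```python
import random


def split_indices(n_items, train_size, validate_size, test_size, seed=0):
    """Hypothetical helper, not fibad's API: partition indices by fraction.

    Fractions may sum to less than 1.0; leftover indices go to "unused".
    """
    fractions = {"train": train_size, "validate": validate_size, "test": test_size}
    if any(f < 0 for f in fractions.values()) or sum(fractions.values()) > 1.0 + 1e-9:
        raise ValueError("split sizes must be non-negative and sum to at most 1.0")

    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # deterministic shuffle for the sketch

    splits, start = {}, 0
    for name, fraction in fractions.items():
        count = int(n_items * fraction)  # truncate rather than round up
        splits[name] = indices[start:start + count]
        start += count
    splits["unused"] = indices[start:]  # data that belongs to no split
    return splits


splits = split_indices(1000, train_size=0.25, validate_size=0.125, test_size=0.125)
print({name: len(items) for name, items in splits.items()})
```

With these example sizes, half of the indices end up in no split, which is the "small fraction of a large dataset" case.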

aritraghsh09 · 2024-12-09T15:16:57Z

I have a related issue with HSCDataSet splits that I will bundle here. This occurs when trying to load the entirety of the HSC 0.25 < z < 0.50 dataset with train_size = 0.8 and the other two sizes each set to 0.1.

[2024-12-09 07:11:00,625 fibad.data_sets.hsc_data_set:INFO] HSC Data set loader has 8088376 objects
[2024-12-09 07:11:02,224 fibad.data_sets.hsc_data_set:INFO] HSC Data Set Splits loaded are:
[2024-12-09 07:11:02,225 fibad.data_sets.hsc_data_set:INFO] test split contains 808838 items
[2024-12-09 07:11:02,225 fibad.data_sets.hsc_data_set:INFO] train split contains 6470701 items
[2024-12-09 07:11:02,225 fibad.data_sets.hsc_data_set:INFO] validate split contains 1 items
[2024-12-09 07:11:04,178 fibad.data_sets.hsc_data_set:INFO] Test split contains 808838 items
[2024-12-09 07:11:04,179 fibad.data_sets.hsc_data_set:INFO] Train split contains 6470701 items
[2024-12-09 07:11:04,179 fibad.data_sets.hsc_data_set:INFO] Validation split contains 808837 items

Interestingly, the first set of printed messages has the wrong number of items in the validation set. Also, why are there two sets of print statements, with the only difference being the capitalisation of the first letter?
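The discrepancy can be checked with simple arithmetic on the logged counts. The sketch below uses only the numbers from the log above; the helper `unassigned` is a hypothetical name, and the rounding rule at the end is an assumption for illustration, not something the thread confirms about the fibad implementation.

```python
TOTAL = 8088376  # "HSC Data set loader has 8088376 objects"

# Counts copied from the two groups of log lines.
first_report = {"test": 808838, "train": 6470701, "validate": 1}
second_report = {"test": 808838, "train": 6470701, "validate": 808837}


def unassigned(total, counts):
    """Number of objects that appear in no split."""
    return total - sum(counts.values())


print(unassigned(TOTAL, first_report))   # objects missing from the first report
print(unassigned(TOTAL, second_report))  # 0 means every object is in some split

# Assumption for illustration: train and test sizes are rounded,
# and validation takes whatever remains.
n_train = round(TOTAL * 0.8)
n_test = round(TOTAL * 0.1)
n_validate = TOTAL - n_train - n_test
print(n_train, n_test, n_validate)
```

The second set of counts sums exactly to the total, and it is consistent with rounding the train and test sizes and giving validation the remainder. The first set leaves most of the expected validation objects unaccounted for.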

drewoldag · 2024-12-09T17:25:53Z

It's being printed twice because there are actually two approaches to splitting that happen in the HSCDataSet class right now. The code will be making use of the second method (the one that is producing the test=808k, train=6.47M, val=808k), but both have been kept for the time being.

Given your first comment, @aritraghsh09, it would be good for me to go back and make sure that the second method properly allows users to select only a small fraction of a large dataset.

drewoldag self-assigned this Dec 4, 2024

aritraghsh09 added the Data Loader label (Data Loader code primarily) Dec 9, 2024
