Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check uniqueness of input paths and obs_ids in merge tool #2611

Merged
merged 3 commits into from
Sep 11, 2024

Conversation

maxnoe
Copy link
Member

@maxnoe maxnoe commented Sep 4, 2024

This is a very simple check to partially address #2610

I will add the check for obs_id in another PR to fully address #2610

@maxnoe maxnoe changed the title Check uniqueness of input paths in merge tool Check uniqueness of input paths and obs_ids in merge tool Sep 4, 2024
docs/changes/2611.feature.rst Outdated Show resolved Hide resolved
src/ctapipe/io/hdf5merger.py Outdated Show resolved Hide resolved
@maxnoe
Copy link
Member Author

maxnoe commented Sep 5, 2024

The only real unit tests testing the merging thoroughly is merging the only two large files we have: the gamma and the proton training dl2 files.

However, due to the simulations re-using run ids for the different particles, the tool now correctly complains about duplicated obs_ids.

I am not sure how to address this, I think we should keep the test but adding an option to disable the uniqueness check just to test the that the merging works on this invalid input also seems strange.

I could modify the file on disk to have different obs_ids...

@maxnoe
Copy link
Member Author

maxnoe commented Sep 5, 2024

I could modify the file on disk to have different obs_ids...

This was pretty simple, so I opted for this approach

@kosack
Copy link
Contributor

kosack commented Sep 5, 2024

What happens if we do want to make a file with multiple particles types mixed in it, and they have overlapping obs_ids? I guess that requires a larger change since we assume obs_ids are unique, right? It's maybe not a very common use case, but I guess we should at least mention this in the docs of the merge tool, e.g. that we do not allow mixing of the same obs_id.

Or we just fix this in the OB/SB data model where the obs_ids are in the future supposed to have more digits pre-pended to make them more unique (north, south, simulation), so maybe in that case, we could consider defining the simulation pre-id to be something like (simulation_pre_id + particle_id)*X+obs_id so that all particles have a unique obs_id.

@maxnoe
Copy link
Member Author

maxnoe commented Sep 5, 2024

What happens if we do want to make a file with multiple particles types mixed in it, and they have overlapping obs_ids?

This is something that is not supported now in any case: the data model assumes that we have unique obs_id / event_id combinations and on this assumptions many things are build (TableLoader, StereoPrediction, etc.).

This is why we added the override_obs_id option to SimTelEventSource, so that we can change the obs_ids at the start to something unique.

@kosack
Copy link
Contributor

kosack commented Sep 5, 2024

This is why we added the override_obs_id option to SimTelEventSource, so that we can change the obs_ids at the start to something unique.

Yes, that's clear. So nothing to do in ctapipe, but the question is if we should update the way obs_ids are re-assigned for simulations so that each particle species is unique. That would have to just be a convention, not a software fix.

@maxnoe
Copy link
Member Author

maxnoe commented Sep 5, 2024

@GernotMaier or @orelgueta might confirm, but I think we talked about it and prod6 already has unique obs_ids, at least per site / pointing position

@maxnoe maxnoe requested review from Tobychev and kosack September 5, 2024 14:18
@maxnoe maxnoe linked an issue Sep 5, 2024 that may be closed by this pull request
@orelgueta
Copy link
Contributor

@GernotMaier or @orelgueta might confirm, but I think we talked about it and prod6 already has unique obs_ids, at least per site / pointing position

Unique obs_ids are only per site/pointing/particle. So we reuse the run numbers also for different primaries in the same pointing/site.

If I remember correctly our discussions, we realised that anyway you will have to assign a unique obs_id after the fact, so there is no reason for us to change the behaviour (which is done manually when launching the jobs so quite prone to error).

@maxnoe
Copy link
Member Author

maxnoe commented Sep 5, 2024

Ok, I don't really like adding more meaning to obs_ids... but looks like this might be the best option for now.

Including particle id automatically is easy, site probably as well. Pointing position maybe not so much.

At least for now, all our tools that need multiple species as input are designed to read from multiple files anyway.

@maxnoe maxnoe added this to the 0.22.0 milestone Sep 10, 2024
@kosack kosack merged commit b30984e into main Sep 11, 2024
12 checks passed
@maxnoe maxnoe deleted the uniqueness_check branch September 11, 2024 14:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Check uniqueness of obs_id / event_id when merging files.
4 participants