Check uniqueness of input paths and obs_ids in merge tool #2611
The only real unit test that exercises the merging thoroughly merges the only two large files we have: the gamma and the proton training DL2 files. However, because the simulations re-use run ids for the different particles, the tool now correctly complains about duplicated obs_ids. I am not sure how to address this. I think we should keep the test, but adding an option to disable the uniqueness check just to test that the merging works on this invalid input also seems strange. I could modify the file on disk to have different obs_ids...
This was pretty simple, so I opted for this approach.
What happens if we do want to make a file with multiple particle types mixed in it, and they have overlapping obs_ids? I guess that requires a larger change since we assume obs_ids are unique, right? It's maybe not a very common use case, but I guess we should at least mention this in the docs of the merge tool, e.g. that we do not allow mixing files that share the same obs_id. Or we just fix this in the OB/SB data model, where the obs_ids are in the future supposed to have more digits prepended to make them more unique (north, south, simulation), so maybe in that case we could consider defining the simulation pre-id to be something like
This is not supported now in any case: the data model assumes that we have unique obs_id / event_id combinations, and many things are built on this assumption (TableLoader, StereoPrediction, etc.). This is why we added the
Yes, that's clear. So nothing to do in ctapipe, but the question is whether we should update the way obs_ids are re-assigned for simulations so that each particle species is unique. That would have to just be a convention, not a software fix.
@GernotMaier or @orelgueta might confirm, but I think we talked about it and prod6 already has unique obs_ids, at least per site / pointing position.
Unique obs_ids are only per site/pointing/particle, so we also reuse the run numbers for different primaries in the same pointing/site. If I remember our discussions correctly, we realised that you will have to assign a unique obs_id after the fact anyway, so there is no reason for us to change the behaviour (which is done manually when launching the jobs, so quite prone to error).
Ok, I don't really like adding more meaning to obs_ids... but it looks like this might be the best option for now. Including the particle id automatically is easy, the site probably as well. The pointing position maybe not so much. At least for now, all our tools that need multiple species as input are designed to read from multiple files anyway.
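To make the idea of folding the particle id into the obs_id concrete, here is a minimal sketch. It is purely illustrative: the particle codes, the number of digits, and the function name `make_unique_obs_id` are assumptions made for this example, not a convention defined by ctapipe or by the simulation production.

```python
# Hypothetical particle codes, chosen only for this illustration.
PARTICLE_CODES = {"gamma": 1, "proton": 2, "electron": 3}


def make_unique_obs_id(particle, run_number, digits=6):
    """Prepend a particle code to a run number to build a unique obs_id.

    Assumes run numbers fit into ``digits`` decimal digits.
    """
    if particle not in PARTICLE_CODES:
        raise ValueError(f"Unknown particle type: {particle!r}")
    if not 0 <= run_number < 10**digits:
        raise ValueError(f"run_number {run_number} does not fit in {digits} digits")
    # Shift the particle code to the left of the run number digits.
    return PARTICLE_CODES[particle] * 10**digits + run_number


# The same run number no longer collides across particle types.
gamma_id = make_unique_obs_id("gamma", 42)
proton_id = make_unique_obs_id("proton", 42)
assert gamma_id != proton_id
```

Site could be handled with a further prefix in the same way; as noted above, pointing position does not map onto a small fixed code as easily.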
This is a very simple check to partially address #2610
I will add the check for obs_id in another PR to fully address #2610
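For readers who want to see what the two checks amount to, a minimal sketch in plain Python follows. It is not the code from this PR: the function names `find_duplicates` and `check_unique_inputs` are hypothetical, and the obs_ids are assumed to have already been read from each input file into a list.

```python
from collections import Counter
from pathlib import Path


def find_duplicates(values):
    """Return the values that occur more than once, in first-seen order."""
    counts = Counter(values)
    return [value for value, n in counts.items() if n > 1]


def check_unique_inputs(input_paths, obs_ids_per_file):
    """Raise ValueError if any input path or obs_id is given more than once.

    ``obs_ids_per_file`` is a list with one list of obs_ids per input file.
    """
    # Resolve paths so that "a.h5" and "./a.h5" count as the same file.
    resolved = [Path(path).resolve() for path in input_paths]
    duplicated_paths = find_duplicates(resolved)
    if duplicated_paths:
        raise ValueError(f"Duplicated input paths: {duplicated_paths}")

    # Flatten the obs_ids of all files and look for repeats across files.
    all_obs_ids = [obs_id for obs_ids in obs_ids_per_file for obs_id in obs_ids]
    duplicated_obs_ids = find_duplicates(all_obs_ids)
    if duplicated_obs_ids:
        raise ValueError(f"Duplicated obs_ids: {duplicated_obs_ids}")


# Distinct files with distinct obs_ids pass silently.
check_unique_inputs(["gamma.h5", "proton.h5"], [[1, 2], [3, 4]])
```

With the gamma and proton training files discussed above, the second check is the one that would fail, since both files contain the same run ids.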