DL1 data file layout discussion #1059
just noticed I'm missing the tel_id in the image tables, but that's easy to add (not so useful without it!) |
The easiest way to explore the file is probably with pytables (since the vector columns cause pandas to complain):
>>> import tables
>>> t = tables.open_file("example_dl1.h5")
>>> t.root.dl1_event.tel_image_GCT_CHEC
/dl1_event/tel_image_GCT_CHEC (Table(7,)) 'Storage of DL0Container,DL1CameraContainer,MCEventContainer'
description := {
"dl1camera_image": Float64Col(shape=(1, 2048), dflt=0.0, pos=0),
"dl1camera_pulse_time": Float64Col(shape=(1, 2048), dflt=0.0, pos=1),
"event_id": Int32Col(shape=(), dflt=0, pos=2),
"mc_alt": Float64Col(shape=(), dflt=0.0, pos=3),
"mc_az": Float64Col(shape=(), dflt=0.0, pos=4),
"mc_core_x": Float64Col(shape=(), dflt=0.0, pos=5),
"mc_core_y": Float64Col(shape=(), dflt=0.0, pos=6),
"mc_energy": Float64Col(shape=(), dflt=0.0, pos=7),
"mc_h_first_int": Float64Col(shape=(), dflt=0.0, pos=8),
"mc_shower_primary_id": Int32Col(shape=(), dflt=0, pos=9),
"mc_x_max": Float64Col(shape=(), dflt=0.0, pos=10),
"obs_id": Int32Col(shape=(), dflt=0, pos=11)}
byteorder := 'little'
chunkshape := (7,)
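>>> im = t.root.dl1_event.tel_image_GCT_CHEC.col("dl1camera_image")
>>> from ctapipe.visualization import CameraDisplay
>>> # geom is the CameraGeometry of this camera, assumed to be loaded beforehand
>>> CameraDisplay(geom, image=im.std(axis=0)[0]) |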
Looks good, thanks a lot. The only thing I would modify (unless there is some strong reason to keep it) is the introduction of the MC information in the image tables that you mention. It seems more logical to have the MCEventContainer, exactly as it is read in by pyeventio, propagated to the DL1 file as part of the sub_trigger table (and the same in later data levels, like DL1-param or DL2, only removing the pixel-wise information - true # of p.e. per pixel). You mentioned that some sort of links could be created inside the file to allow having the same data in different tables without physically writing them twice, right? Can that be used to have that separate, event-wise mc table, but at the same time link each telescope image with the corresponding mc info? - the same could be done with the instrument data (telescope coordinates and all the rest) which corresponds to a given image. |
Well, links just allow a dataset to be defined in a separate file. What you mean is a join (like in a relational database) on 2 tables, which is supported by things like pandas and astropy.table.Table, at the cost of some complexity and speed. It's not too bad to do that though, e.g.:
from astropy.table import Table, join
t1 = Table(dict(event_id=[1,4,5], value1=[2.5,2.3,2.6]))
t2 = Table(dict(event_id=[1,4,5], value2=[8.1,8.2,8.9]))
t3 = join(t1,t2, keys="event_id")
print(t1); print(t2); print(t3)
The problem I just found is that But yes, I think i'll just make a separate |
Reading the event tables has some problems...
h5py fails to read any tables that have unicode strings in their header attributes (which is the default output in pytables and python3). Bug report here (since 2015!): h5py/h5py#585. What are people using to read the data in the various pipelines? |
Indeed, that join is what I meant, i.e., with more descriptive names:
That is what we need i.e. to be able to train a type-wise RF to evaluate the energy with a single telescope's image, but making use of the position of the telescope relative to the shower (as estimated with all telescopes). I did not fully understand though the limitations of this approach (compared to repeating the MC info in the images' table) - if they are somehow solvable I think we should go for the separate "events" and "images" tables. |
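For reference, a minimal sketch of the one-to-many lookup discussed here, using plain NumPy structured arrays rather than the actual DL1 tables. The table contents are made up for illustration; only the obs_id, event_id, tel_id and mc_energy column names come from the file layout above. Each image row picks up the event-wise MC information without it being stored twice:

```python
import numpy as np

# Hypothetical event-wise table: one row per (obs_id, event_id).
events = np.array(
    [(1, 10, 0.5), (1, 11, 1.2), (1, 12, 3.4)],
    dtype=[("obs_id", "i4"), ("event_id", "i4"), ("mc_energy", "f8")],
)

# Hypothetical image table for one camera type: several images per event.
images = np.array(
    [(1, 10, 1), (1, 10, 2), (1, 12, 1), (1, 12, 3)],
    dtype=[("obs_id", "i4"), ("event_id", "i4"), ("tel_id", "i4")],
)

# Map each (obs_id, event_id) key to its row in the event table.
event_row = {
    (int(obs), int(evt)): row
    for row, (obs, evt) in enumerate(zip(events["obs_id"], events["event_id"]))
}

# For every image, find the row of the event it belongs to.
rows = np.array(
    [
        event_row[(int(obs), int(evt))]
        for obs, evt in zip(images["obs_id"], images["event_id"])
    ]
)

# Event-wise MC energy repeated per image, built on the fly.
image_mc_energy = events["mc_energy"][rows]
print(image_mc_energy)
```

In practice astropy.table.join or a pandas merge would likely do the same with less code. The sketch only shows that keeping separate "events" and "images" tables costs one key lookup per image.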
One way to read them that seems to work, and also supports joining (but is a bit of a hack, and maybe not efficient) is to read first with pytables and then convert to astropy.table format:
import tables
from astropy.table import Table
t = tables.open_file("example_dl1.h5")
table = Table( t.root.dl1_event.tel_image_GCT_CHEC[:] )
table
Then you could do the same for a MC table, and join it. This however loses the metadata and units... I suppose we could make a helper function that re-attaches those until astropy.table's hdf5 functionality is fixed. |
A "bug" I've noticed that will be fixed soon in ctapipe: the images have a shape of |
It might be better to add one level of sub-structure like:
So that the user doesn't have to parse the table name string. Or even deeper, but maybe this gets harder to use?
|
|
Ok, I've already modified it to add the MC table, and I think I agree that it's nicer to have the tables named purely by the camera/telescope name, so that the rest is taken into account by the hierarchy (e.g. |
I've updated the top-level of this issue with a new version, containing the split tables (and an example how to join them per-event and per-telescope). The deeper group hierarchy is not yet implemented. |
Version v3 looks fine to me, just two suggestions:
Other than that, from my side I think we can move on to bless this as the preliminary DL1-image file layout for the MC processing. |
That will happen automatically when ctapipe is updated (can't easily do it here)
Ok, I'll open an issue in ctapipe to fix that (it requires a change in |
Looks good to me as well, except for the naming of the event MC shower and trigger tables:
/dl1_event/sub_mc_shower Dataset {10/Inf}
/dl1_event/sub_trigger Dataset {10/Inf}
Then I would propose merging that and making a ctapipe release (e.g. without waiting for #1066), so that we can try to convert a bit of MC data and have people play around with that "new" data format... and make sure there is no significant flaw. |
Looking more closely at current developments, maybe we should still wait for #1026 (Major refactoring of calibration chain) before the next ctapipe release, as it seems to be mostly ready. |
Hi Is |
It's the dataset: each event is a row in the table (per camera type, so that the row length is fixed-width). The tables are chunked, so both column-wise and row-wise access are possible (though one might still optimize the chunk size for one or the other use case). In my mockup, the fields with |
Ok, yes, I prefer avoiding abbreviations as well. The "sub" was to follow the CTA naming guidelines, where event data may come from either a telescope (TEL) or the subarray (SUB). I'll probably just make those groups in the end as described above, but that requires a change to HDF5TableWriter that isn't there yet.
For that we have to wait for a few other PRs. Right now even this mock code won't work due to a bug that is fixed in PR #1060. Also, so far there is no code to produce this output in ctapipe, other than a hacky notebook linked above, but I'm working on migrating that to a standard tool. We don't have to have all of #1066 implemented to test though, just the basic image output part, so that should come quickly. |
Right now we are missing any kind of pre-generated index tables (as used I think in the dl1-data-handler, though there seems to be no documentation) other than the trigger table, so any event merging operations have to be done later in software. Should we include something like that? Or leave it open as a later development? |
Another thing that we discussed should be added is the Monte Carlo thrown energy histograms. I think I will write that under |
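As a rough idea of what such a histogram could hold, here is a sketch with NumPy. The energies, the TeV unit and the bin edges are invented for the example and are not taken from any simulation configuration:

```python
import numpy as np

# Hypothetical thrown (simulated) energies in TeV, triggered or not.
thrown_energy_tev = np.array([0.02, 0.05, 0.3, 0.7, 1.5, 8.0, 40.0])

# Assumed binning: one bin per decade from 0.01 to 100 TeV.
bin_edges_tev = np.logspace(-2, 2, 5)

# Number of thrown showers per energy bin.
thrown_counts, _ = np.histogram(thrown_energy_tev, bins=bin_edges_tev)

for low, high, n in zip(bin_edges_tev[:-1], bin_edges_tev[1:], thrown_counts):
    print(f"{low:g} - {high:g} TeV: {n} showers")
```

Storing the counts together with the bin edges lets later steps normalise triggered events to the thrown ones without keeping every thrown shower in the file.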
Ok. So in |
@bryankim96 would be able to help concerning the implementation in |
Thanks - yes I'd like it to work like dl1-data-handler - is the output documented anywhere @bryankim96 ? I guess I can look at the sample file from the meeting. |
In the meantime, I've updated the example to version 4 (now including the MC thrown events histograms). See the top-level of this issue for the latest version. |
I did notice that pytables allows a column index to be built for a table. We might look into that feature to support fast event merging, rather than doing it by hand. (https://www.pytables.org/usersguide/libref/structured_storage.html?highlight=create_index#tables.Column.create_index) Perhaps that's what the dl1-data-handler does already? |
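A small sketch of what that could look like, assuming PyTables is installed. The table name, its contents and the in-memory file are made up for the example. Column.create_index is the call from the documentation linked above, and Table.read_where is the standard PyTables query method:

```python
import numpy as np
import tables

# Hypothetical image table with the key columns discussed above.
data = np.array(
    [(1, 10, 1), (1, 10, 2), (1, 12, 1), (1, 12, 3)],
    dtype=[("obs_id", "i4"), ("event_id", "i4"), ("tel_id", "i4")],
)

# In-memory HDF5 file (nothing is written to disk), just for illustration.
h5 = tables.open_file(
    "index_demo.h5", mode="w", driver="H5FD_CORE", driver_core_backing_store=0
)
table = h5.create_table("/", "images", obj=data)
table.flush()

# Build a column index on event_id so queries on it can avoid a full scan.
table.cols.event_id.create_index()

# In-kernel query: all images belonging to event 10.
matching = table.read_where("event_id == 10")
tel_ids = sorted(int(t) for t in matching["tel_id"])
indexed = bool(table.cols.event_id.is_indexed)
print(tel_ids, indexed)

h5.close()
```

Whether this is fast enough for event merging on full-size files would still need to be measured. The sketch only shows the calls involved.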
Here's the wiki page describing the DL1 DH structure: https://github.com/cta-observatory/dl1-data-handler/wiki/CTA-ML-Data-Format
I took a look at your version 4 file and it seems like the v4 and DL1 DH format are very similar to each other (one table to store images for each telescope type, tables to store telescope and subarray info, tables for both triggered events and mc events, the need for some sort of indexing/mapping method between the event table and image tables).
I'd say the only meaningful differences in DL1DH are related to the issues which were discussed above regarding making joins/lookup of the images associated with each event as quick as possible (using indexing). Instead of recording the obs_id and event_id in the image tables we store the actual row index of the corresponding event (in the event table). This way mapping from images to events can always be done in constant time without the need to query/search by event_id and obs_id. Similarly, rather than just a binary trigger array, in the event table we store the row indices (rows in the image tables) in the trigger arrays. This way the lookup of the images from an event is also O(1).
This format was developed as a quick solution, so maybe there is a better way to handle this? You can find a more detailed description of the format in the wiki page linked above.
Regarding indexes on columns, DL1DH uses this functionality, but because we already set things up for O(1) lookup both ways (event -> image, image -> event), it was mainly to speed up queries or sorting on commonly used columns. We currently add indices on the mc_energy, alt, and az columns in the event table and on the event_index column in the image tables. |
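To make the two-way lookup concrete, here is a toy version with NumPy arrays. It is only a sketch of the idea as described in this comment, not the DL1DH code: the array names and the -1 marker for "no image" are choices made for this example.

```python
import numpy as np

# Hypothetical event table: one row per event.
mc_energy = np.array([0.5, 1.2, 3.4])

# Per event, the row of each telescope's image in the image table
# (3 events x 2 telescopes); -1 marks a telescope without an image.
image_rows = np.array([[0, 1], [-1, -1], [2, -1]])

# Hypothetical image table: 3 images of 4 pixels, plus the event row index.
images = np.arange(12, dtype=float).reshape(3, 4)
event_index = np.array([0, 0, 2])


def images_for_event(event_row):
    """Event -> images: direct indexing, no search over event_id."""
    rows = image_rows[event_row]
    return images[rows[rows >= 0]]


def energy_for_image(image_row):
    """Image -> event: follow the stored row index."""
    return mc_energy[event_index[image_row]]


print(images_for_event(0).shape)
print(energy_for_image(2))
```

The price of this layout is that the stored row indices must be rewritten whenever tables are merged, filtered or reordered, whereas obs_id/event_id keys stay valid.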
Ok, great. I'll take a look at that then. I'm working on a version 5, and already have the index tables auto-generated, and compression enabled. I tried doing some table merging with astropy.Table and found it to be quite fast (even though it doesn't use the index tables that pytables generates), so it may not be a large issue - certainly using pytables alone will be faster, though.
Note on compression: I tried a large set of options, and found in the end that BLOSC with ZSTD compression is the fastest and most efficient. However, it means that the files are no longer fully viewable in HDFView (which only seems to support bzip2, gzip and a few other compression schemes for some reason - you can view the structure, but not the table contents). VITables works fine though, and the files are much smaller than before, so I think it's a reasonable tradeoff. |
Ok, now I've created a version 5, with full substructure (which I think is nicer to read and work with).
Here's the new structure, and I'll update the top-level issue as well. |
That indeed looks really good, it's nice to see such progress! |
@kosack I can't open the notebook
Could you send the latest version please? |
Strange, maybe I made a cut/paste error. I'll update it now. |
Thank you, it works :)
… On 24 May 2019 at 13:58, Karl Kosack wrote:
should work now: https://gist.github.com/kosack/7cbbe848ab4840893810c4fa3d75be9a
|
@vuillaut You'll need the master version of ctapipe to run it |
Thanks, looks good indeed! I see no problem with the HDFView limitation, I agree the compression advantage is worth it (and vitables is not much harder to install even on a mac).
Can you explain what is the advantage of hand-made index tables over the PyTables ones? |
Hi @kosack |
Replying to myself: with this structure, it's not possible, as images are stored in the same table, which implies variable-length variables. Question: do we want to keep the possibility of loading tables with pandas (which proved useful in pipelines)? |
issue moved to #1163 |
Following the recent discussion in the ASWG, I've created a small notebook that uses the ctapipe functionality to write out a sample DL1 data file that contains only the DL1.TEL.EVT.IMAGE data (e.g. no cleaning or parameterization, which would be DL1.TEL.EVT.PARAM).
It also stores the instrument and simulation metadata, and some DL1.SUB.EVT data (the trigger pattern for now, but you could imagine this could also include an index map table). We can use this issue to discuss the hierarchy. For now, I tried to keep it not too complex (no deep sub-datasets), and I just use the default ctapipe containers, so didn't add any extra information.
For all tables, some extra custom metadata are stored, like the unit of each column and the ctapipe version.
I also attached the MC event info to all camera tables (it's repetitive, but convenient). That could also be stored in a separate table.
and the notebook I used is here:
https://gist.github.com/kosack/7cbbe848ab4840893810c4fa3d75be9a
The output file looks like this: