Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DL1 data file layout discussion #1059

Closed
kosack opened this issue Apr 25, 2019 · 39 comments
Closed

DL1 data file layout discussion #1059

kosack opened this issue Apr 25, 2019 · 39 comments

Comments

@kosack
Copy link
Contributor

kosack commented Apr 25, 2019

Following the recent discussion in the ASWG, I've created a small notebook that uses the ctapipe functionality to write out a sample DL1 data file that contains only the DL1.TEL.EVT.IMAGE data (e.g. no cleaning or parameterization, which would be DL1.TEL.EVT.PARAM).

It also stores the instrument and simulation metadata, and some DL1.SUB.EVT data (the trigger pattern for now, but you could imagine this could also include an index map table). We can use this issue to discuss the hierarchy. For now, I tried to keep it not too complex (no deep sub-datasets), and I just use the default ctapipe containers, so didn't add any extra information.

For all tables, some extra custom metadata are stored, like the unit of each column and the ctapipe version.

I also attached the MC event info to all camera tables (it's repetitive, but convenient). That could also be stored in a separate table.

  • The data file can be found here (private to CTA members):

image

@kosack
Copy link
Contributor Author

kosack commented Apr 25, 2019

just noticed I'm missing the tel_id in the image tables, but that's easy to add (not so useful without it!)

@kosack
Copy link
Contributor Author

kosack commented Apr 25, 2019

The easiest way to explore the file is with pytables probably (since the vector columns cause pandas to complain):

>>> import tables
>>> t = tables.open_file("example_dl1.h5")       
>>> t.root.dl1_event.tel_image_GCT_CHEC

/dl1_event/tel_image_GCT_CHEC (Table(7,)) 'Storage of DL0Container,DL1CameraContainer,MCEventContainer'
  description := {
  "dl1camera_image": Float64Col(shape=(1, 2048), dflt=0.0, pos=0),
  "dl1camera_pulse_time": Float64Col(shape=(1, 2048), dflt=0.0, pos=1),
  "event_id": Int32Col(shape=(), dflt=0, pos=2),
  "mc_alt": Float64Col(shape=(), dflt=0.0, pos=3),
  "mc_az": Float64Col(shape=(), dflt=0.0, pos=4),
  "mc_core_x": Float64Col(shape=(), dflt=0.0, pos=5),
  "mc_core_y": Float64Col(shape=(), dflt=0.0, pos=6),
  "mc_energy": Float64Col(shape=(), dflt=0.0, pos=7),
  "mc_h_first_int": Float64Col(shape=(), dflt=0.0, pos=8),
  "mc_shower_primary_id": Int32Col(shape=(), dflt=0, pos=9),
  "mc_x_max": Float64Col(shape=(), dflt=0.0, pos=10),
  "obs_id": Int32Col(shape=(), dflt=0, pos=11)}
  byteorder := 'little'
  chunkshape := (7,)

>>> im = t.root.dl1_event.tel_image_GCT_CHEC.col("dl1camera_image") 
>>> CameraDisplay(geom, image=im.std(axis=0)[0])                                                      

image

@moralejo
Copy link
Contributor

Following the recent discussion in the ASWG, I've created a small notebook that uses the ctapipe functionality to write out a sample DL1 data file that contains only the DL1.TEL.EVT.IMAGE data (e.g. no cleaning or parameterization, which would be DL1.TEL.EVT.PARAM).

It also stores the instrument and simulation metadata, and some DL1.SUB.EVT data (the trigger pattern for now, but you could imagine this could also include an index map table). We can use this issue to discuss the hierarchy. For now, I tried to keep it not too complex (no deep sub-datasets), and I just use the default ctapipe containers, so didn't add any extra information.

I also attached the MC event info to all camera tables (it's repetitive, but convenient). That could also be stored in a separate table.

Looks good, thanks a lot. The only thing I would modify (unless there is some strong reason to keep it) is the introduction of the MC information in the image tables that you mention. It seems more logical to have the MCEventContainer, exactly as it is read in by pyeventio, propagated to the DL1 file as part of the sub_trigger table (and the same in later data levels, like DL1-param or DL2, only removing the pixel-wise information - true # of p.e. per pixel).

You mentioned that some sort of links could be created inside the file to allow having the same data in different tables without physically writing them twice, right? Can that be used to have that separate, event-wise mc table, but at the same time link each telescope image with the corresponding mc info? - the same could be done with the instrument data (telescope coordinates and all the rest) which corresponds to a given image.

@kosack
Copy link
Contributor Author

kosack commented Apr 26, 2019

You mentioned that some sort of links could be created inside the file to allow having the same data in different tables without physically writing them twice, right?

Well, links are just to allow a dataset to be defined in a separate file, but what you mean is doing a join (like in a relational database) on 2 tables, which is supported by things like pandas and astropy.Table , but at the cost of some complexity and speed. It's not too bad to do that though.

e.g.:

from astropy.table import Table, join
t1 = Table(dict(event_id=[1,4,5], value1=[2.5,2.3,2.6]))   
t2 = Table(dict(event_id=[1,4,5], value2=[8.1,8.2,8.9]))      
t3 = join(t1,t2, keys="event_id")
print(t1); print(t2); print(t3)       
event_id value1
-------- ------
       1    2.5
       4    2.3
       5    2.6
event_id value2
-------- ------
       1    8.1
       4    8.2
       5    8.9
event_id value1 value2
-------- ------ ------
       1    2.5    8.1
       4    2.3    8.2
       5    2.6    8.9

The problem I just found is that astropy.table doesn't seem to read these tables correctly... it has a problem with unicode strings in the headers (related to a problem in h5py which it uses underneath). Pandas reads them only if you exclude the vector columns (the images), so that's not so useful.

But yes, I think i'll just make a separate mc_shower table, similar to the trigger one. Probably in the final DL2 table (or DL1 parameters tables), per-joining the MC info into the table is more useful for debugging though.

@kosack
Copy link
Contributor Author

kosack commented Apr 26, 2019

Reading the event tables has some problems...

  • pytables: works fine, bit a bit of a low-level interface (no nice output or join operations)
  • astropy: works for the instrument tables, but the event tables crash due to h5py bug shown below
  • pandas: fails due to vector columns (pure scalar column tables work fine though)
  • ctapipe.io.HDF5TableReader: should work fine for event-wise access (since it can read anything the HDF5TableWriter writes), but not really made for column-wise access.
  • xarray : seems to think the tables are NetCDF4 format, and then promptly crashes hard

h5py fails to read any tables that have unicode strings in their header attributes (which is the default output in pytables and python3). bug here (since 2015!): h5py/h5py#585

What are people using to read the data in the various pipelines?

@moralejo
Copy link
Contributor

Indeed, that join is what I meant, i.e., with more descriptive names:

from astropy.table import Table, join
triggered_events_table = Table(dict(event_id=[1,2,4,5,6], mcenergy=[2.5,2.3,2.6,3.1,5.0]))   
images_table = Table(dict(event_id=[1,1,1,2,4,4,4,4,5,5,6], tel_id=[1,2,3,1,1,2,3,4,2,3,1], npixels=[12,4,7,21,30,9,6,17,44,3,20]))
joined = join(triggered_events_table, images_table, keys="event_id")
print(triggered_events_table)
print(images_table)
print(joined)

event_id mcenergy
-------- --------
       1      2.5
       2      2.3
       4      2.6
       5      3.1
       6      5.0
event_id tel_id npixels
-------- ------ -------
       1      1      12
       1      2       4
       1      3       7
       2      1      21
       4      1      30
       4      2       9
       4      3       6
       4      4      17
       5      2      44
       5      3       3
       6      1      20
event_id mcenergy tel_id npixels
-------- -------- ------ -------
       1      2.5      1      12
       1      2.5      2       4
       1      2.5      3       7
       2      2.3      1      21
       4      2.6      1      30
       4      2.6      2       9
       4      2.6      3       6
       4      2.6      4      17
       5      3.1      2      44
       5      3.1      3       3
       6      5.0      1      20

That is what we need i.e. to be able to train a type-wise RF to evaluate the energy with a single telescope's image, but making use of the position of the telescope relative to the shower (as estimated with all telescopes). I did not fully understand though the limitations of this approach (compared to repeating the MC info in the images' table) - if they are somehow solvable I think we should go for the separate "events" and "images" tables.

@kosack
Copy link
Contributor Author

kosack commented Apr 26, 2019

One way to read them that seems to work, and also supports joining (but is a bit of a hack, and maybe not efficient) is to read first with pytables and then convert to astropy.table format:

import tables
from astropy.table import Table
t = tables.open_file("example_dl1.h5")       
table = Table( t.root.dl1_event.tel_image_GCT_CHEC[:] )
table
<Table length=7>
          dl1camera_image [1,2048]           ... obs_id
                  float64                    ... int32 
-------------------------------------------- ... ------
   0.3318978682429098 .. -0.4532076182151918 ...     10
    -0.2584160417636828 .. 2.736135692431119 ...     10
  -0.3184606437584632 .. -0.3663640898614069 ...     10
 -0.30672830749174085 .. -0.3623771232263547 ...     10
-0.24628153563156552 .. -0.42364885693236587 ...     10
 0.0032427051185768066 .. 0.5239655580436923 ...     10
   0.197100664181763 .. -0.15613698836044967 ...     10

Then you could do the same for a MC table, and join it. This however loses the metadata and units... I suppose we could make a helper function that re-attaches those until astropy.table's hdf5 functionality is fixed.

@kosack
Copy link
Contributor Author

kosack commented Apr 26, 2019

A "bug" I've noticed that will be fixed soon in ctapipe:

the images have a shape of (n_channels, n_pixels) i.e. a 2D array with one dimension that is length 1, which should be (n_pixels) i.e. 1D array, after gain-selection is applied (the ctapipe calibration code is currently being re-factored, and this will be included)

@kosack
Copy link
Contributor Author

kosack commented Apr 26, 2019

It might be better to add one level of sub-structure like:

- dl1_event/
    - tel/
        - image_MST_NectarCam
        - image_SST_1m_DigiCam
        - ...
        - params 
    - sub/
        - trigger

So that the user doesn't have to parse the table name string.
Right now this is not working in HDF5TableWriter, which needs a small fix to support arbitrarily deep table groups.

Or even deeper, but maybe this gets harder to use?

- dl1_event/
    - tel/
        - image/
            - MST_NectarCam
            - SST_1m_DigiCam
            - ...
        - params
    - sub/

@bregeon
Copy link

bregeon commented Apr 30, 2019

  • a separate MC table sounds a must (provided that the join stuff works)
  • a deeper structure, separating "image" and "params" also looks better to me
    but then we'll have to make sure to provide an interface to wrap the access to the deep parts of the structure and to join tables in just one line of code.
    As we won't be able to really test quickly the efficiency of the format, it's probably to decide quickly upon one and make a "real life" test by processing a significant MC data set.

@kosack
Copy link
Contributor Author

kosack commented Apr 30, 2019

Ok, I've already modified it to add the MC table, and I think I agree that it's nicer to have the tables named purely by the camera/telescope name, so that the rest is taken into account by the hierarchy (e.g. file['/dl1/event/tel/image/'] is the set of image tables per camera type) - for that I need to change HDF5TableWriter to allow for deeper hierarchies, but that shouldn't be too difficult.

@kosack
Copy link
Contributor Author

kosack commented Apr 30, 2019

I've updated the top-level of this issue with a new version, containing the split tables (and an example how to join them per-event and per-telescope). The deeper group hierarchy is not yet implemented.

@moralejo
Copy link
Contributor

moralejo commented May 2, 2019

Version v3 looks fine to me, just two suggestions:

  • discussed elsewhere: rename num_mirrors to mirror_class, and make it 1 for single-dish telescopes, 2 for dual mirror ones.
  • In instrument.subarray.layout replace "id" by "tel_id", so that the table can be joined with others like the dl1_event.tel_image*

Other than that, from my side I think we can move on to bless this as the preliminary DL1-image file layout for the MC processing.

@kosack
Copy link
Contributor Author

kosack commented May 2, 2019

discussed elsewhere: rename num_mirrors to mirror_class, and make it 1 for single-dish telescopes, 2 for dual mirror ones.

That will happen automatically when ctapipe is updated (can't easily do it here)

In instrument.subarray.layout replace "id" by "tel_id", so that the table can be joined with others like the dl1_event.tel_image*

Ok, I'll open an issue in ctapipe to fix that (it requires a change in SubarrayDescription.to_table())

@bregeon
Copy link

bregeon commented May 3, 2019

Looks good to me as well, but for the naming of the event mc shower and trigger:
`
/dl1_event Group

/dl1_event/sub_mc_shower Dataset {10/Inf}

/dl1_event/sub_trigger Dataset {10/Inf}
`
The "sub" prefix should be either "subarray" or may be just removed.

Then I would propose that to be merged and a ctapipe release to be made (e.g. without waiting for #1066 1066), so that we can try to convert a bit of MC data and have people play around with that "new" data format... and make sure there is no significant flow.

@bregeon
Copy link

bregeon commented May 3, 2019

looking better into current developments, may be we should still wait for #1026 (Major refactoring of calibration chain) for the next ctapipe release, as it seems to be mostly ready.

@vuillaut
Copy link
Member

vuillaut commented May 3, 2019

It might be better to add one level of sub-structure like:

- dl1_event/
    - tel/
        - image_MST_NectarCam
        - image_SST_1m_DigiCam
        - ...
        - params 
    - sub/
        - trigger

Hi

Is dl1_event a single event or just the name of the dataset?
--> are you proposing to store the data event-wise?

@kosack
Copy link
Contributor Author

kosack commented May 3, 2019

Is dl1_event a single event or just the name of the dataset?

it's the dataset - each event is a row in the table (per camera type, so that the row length is fixed-width). The tables are chunked, so column or row-wise access are both possible, though one might still optimize the chunksize for one or the other use case).

In my mockup, the fields with / are just dataset groups, so the table is e.g. image_MST_NectarCam

@kosack
Copy link
Contributor Author

kosack commented May 3, 2019

The "sub" prefix should be either "subarray" or may be just removed.

Ok, yes, I prefer avoiding abbreviations as well. The "sub" was to follow the CTA naming guidliens, where event data may come from either telescope (TEL) or subarray (SUB) . I'll probably just make those groups in the end as described above, but that requires a change to HDF5TableWriter that isn't there yet.

Then I would propose that to be merged and a ctapipe release to be made (e.g. without waiting for #1066 1066), so that we can try to convert a bit of MC data and have people play around with that "new" data format... and make sure there is no significant flow.

For that we have to wait for a few other PRs. Right now even this mock code won't work due to a bug that is fixed in PR #1060 . Also, so far there is no code to produce this output in ctapipe, other than a hacky notebook linked above, but I'm working on migrating that to a standard tool

We don't have to have all of #1066 implemented to test though, just the basic image output part, so that should come quickly,

@kosack
Copy link
Contributor Author

kosack commented May 3, 2019

Right now we are missing any kind of pre-generated index tables (as used I think in the dl1-data-handler, though there seems to be no documentation) other than the trigger table, so any event merging operations have to be done later in software. Should we include something like that? Or leave it open as a later development?

@kosack
Copy link
Contributor Author

kosack commented May 3, 2019

Another think that we discussed should be added is the monte-carlo thrown energy histograms. i think I will write that under /simulation/thrown_energy as a table indexed by mc_run_id, similar to the mc_run_config table.

@vuillaut
Copy link
Member

vuillaut commented May 3, 2019

Is dl1_event a single event or just the name of the dataset?

it's the dataset - each event is a row in the table (per camera type, so that the row length is fixed-width). The tables are chunked, so column or row-wise access are both possible, though one might still optimize the chunksize for one or the other use case).

In my mockup, the fields with / are just dataset groups, so the table is e.g. image_MST_NectarCam

Ok. So in image_MST_NectarCam for e.g. we have the entries event_id, telescope_id, image and peakpos, so that we can load all images as a column but also rebuild the events? (similar to dl1_data_handler

@vuillaut
Copy link
Member

vuillaut commented May 3, 2019

Right now we are missing any kind of pre-generated index tables (as used I think in the dl1-data-handler, though there seems to be no documentation) other than the trigger table, so any event merging operations have to be done later in software. Should we include something like that? Or leave it open as a later development?

@bryankim96 would be able to help concerning the implementation in dl1_data_handler

@kosack
Copy link
Contributor Author

kosack commented May 3, 2019

Thanks - yes I'd like it to work like dl1-data-handler - is the output documented anywhere @bryankim96 ? I guess I can look at the sample file from the meeting.

@kosack
Copy link
Contributor Author

kosack commented May 3, 2019

In the mean time, I've updated the example to version 4 (now including the MC thrown events histograms). See the top-level of this issue for the latest version

@kosack
Copy link
Contributor Author

kosack commented May 3, 2019

I did notice that pytables allows a column index to be built for a table. We might look into that feature to support fast event merging, rather than doing it by hand. (https://www.pytables.org/usersguide/libref/structured_storage.html?highlight=create_index#tables.Column.create_index)

Perhaps that's what the dl1-data-handler does already?

@bryankim96
Copy link
Member

bryankim96 commented May 3, 2019

Here's the wiki page describing the DL1 DH structure:

https://github.com/cta-observatory/dl1-data-handler/wiki/CTA-ML-Data-Format

I took a look at your version 4 file and it seems like the v4 and DL1 DH format are very similar to each other (one table to store images for each telescope type, tables to store telescope and subarray info, tables for both triggered events and mc events, the need for some sort of indexing/mapping method between the event table and image tables).

I'd say the only meaningful differences in DL1DH are related to the issues which were discussed above regarding making joins/lookup of the images associated with each event as quick as possible (using indexing). Instead of recording the obs_id and event_id in the image tables we store the actual row index of the corresponding event (in the event table). This way mapping from images to events can always be done in constant time without the need to query/search by event_id and obs_id. Similarly, rather than just a binary trigger array, in the event table we store the row indices (rows in the image tables) in the trigger arrays. This way the lookup of the images from an event is also O(1). This format was developed as a quick solution, so maybe there is a better way to handle this? You can find a more detailed description of the format in the wiki page linked above.

Regarding indexes on columns, DL1DH uses this functionality, but because we already set things up for O(1) lookup both ways (event -> image, image -> event), it was mainly to speed up queries or sorting on commonly used columns. We currently add indices on the mc_energy, alt, and az columns in the event table and on the event_index column in the image tables.

@kosack
Copy link
Contributor Author

kosack commented May 6, 2019

Ok, great. I'll take a look at that then. I'm working on a version 5, and already have the index tables auto-generated, and compression enabled. I tried doing some table merging with astropy.Table and found it to be quite fast (even though it doesn't use the index tables that pytables generates), so it may not be a large issue - certainly using pytables alone will be faster, though.

Note on compression: I tried a large set of options, and found in the end that BLOSC with ZSTD compression is the fastest and most efficient. However, it means that the files are no longer fully viewable in HDFView (which only seems to support bzip2, gzip and a few other compression schemes for some reason - you can view the structure, but not the table contents). VITables works fine though, and the files are much smaller than before, so I think it's a reasonable tradeoff.

@kosack
Copy link
Contributor Author

kosack commented May 9, 2019

Ok, now I've created a version 5, with full substructure (which I think is nicer to read and work with),

  • I generate PyTables indices for the event_id, tel_id, and obs_id columns (these are hidden tables with specialized mapping info generated by pytables, which get used internally for fast indexing). They show up only if you look at the tables with h5dump or HDFView

  • I still have not yet included "hand-made" index tables like in dl1-data-handler, but I'll try that next. It may not be necessary if we use pytables' indices, but that restricts us to using it as an access library (maybe not the best). I found that loading a few thousand events and using astropy.Table to merge by event_id was actually pretty fast, even without any indices, so maybe it's not so critical.

  • I also enabled BLOSC-ZSTD compression (so you can't open this in HDFView and see the data anymore, but it's much smaller - use vitables instead).

Here's the new structure, and I'll update the top-level issue as well

image

@bregeon
Copy link

bregeon commented May 10, 2019

That indeed looks really good, it's nice to see such progress!
Quick and dumb question: Looking into the notebook (In [15]), the /instrument/telescope/camera substructure does not seem to have a CHEC camera... from reading the code, I can't understand why ?

@kosack kosack mentioned this issue May 10, 2019
@vuillaut
Copy link
Member

@kosack I can't open the notebook

Unreadable Notebook: DL1_model.ipynb NotJSONError('Notebook does not appear to be JSON: \'{\\n "cells": [\\n {\\n "cell_type": "m...')

Could you send the latest version please?
Or the code to create the version 5?

@kosack
Copy link
Contributor Author

kosack commented May 24, 2019

strange- maybe I made a cut/paste error. I'll update it now

@kosack
Copy link
Contributor Author

kosack commented May 24, 2019

@vuillaut
Copy link
Member

vuillaut commented May 24, 2019 via email

@kosack
Copy link
Contributor Author

kosack commented May 24, 2019

@vuillaut You'll need the master version of ctapipe to run it

@moralejo
Copy link
Contributor

Ok, now I've created a version 5, with full substructure (which I think is nicer to read and work with),

Thanks, looks good indeed! I see no problem with the HDFView limitation, I agree the compression advantage is worth it (and vitables is not much harder to install even on a mac).

  • I generate PyTables indices for the event_id, tel_id, and obs_id columns (these are hidden tables with specialized mapping info generated by pytables, which get used internally for fast indexing). They show up only if you look at the tables with h5dump or HDFView
  • I still have not yet included "hand-made" index tables like in dl1-data-handler, but I'll try that next. It may not be necessary if we use pytables' indices, but that restricts us to using it as an access library (maybe not the best). I found that loading a few thousand events and using astropy.Table to merge by event_id was actually pretty fast, even without any indices, so maybe it's not so critical

Can you explain what is the advantage of hand-made index tables over the PyTables ones?
And, in any case, is the idea to store the indices with the file rather than generating them on the fly as needed? The data size increase is negligible, I assume. Related to data size, it would be nice (not urgent) to have an example file also for a "normal size" array like the north or south baselines, since this one is for a 566-telescopes super-array.

@vuillaut
Copy link
Member

Hi @kosack
With implementation v5, is the events table still readable with pandas?

@vuillaut
Copy link
Member

Hi @kosack
With implementation v5, is the events table still readable with pandas?

Replying to myself: with this structure, it's not possible as images are stored in the same table, which implies variable length variables.

question: do we want to keep the possibility to load tables with pandas (that proved useful in pipelines)?
If so, we need to store images and pulse_time in separated tables (that could be aligned with the parameters to ease filtering).

@kosack
Copy link
Contributor Author

kosack commented Oct 30, 2019

issue moved to #1163

@kosack kosack closed this as completed Oct 30, 2019
@kosack kosack mentioned this issue Apr 9, 2020
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants