-
Notifications
You must be signed in to change notification settings - Fork 270
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Discussion on separation of Event, Monitoring, and Service/header data #1041
Comments
One option I have is that the EventSource is no longer responsible for creating the DataContainer. Instead it is part of the chain responsible for filling the DataContainer, as are other Components. This makes this option somewhat a combination of option 1 and option 4.... but its probably actually close to something you intended with option 4. data = DataContainer() # Create the container
event_source = EventSource().from_url(x)
mon_source = MonitoringSource().from_url(y)
svc_source = ServiceSource().from_url(z)
for _ in event_source:
event_source.load_event(data) # Fill the DataContainer with the event
mon_source.load_event(data) # Fill the DataContainer with the monitor information corresponding to the event
svc_source.load_event(data) In this implementation, the MonitoringSource and ServiceSource requires the event information to be loaded into the data container before the monitor information corresponding can be loaded (+ interpolated) into the DataContainer. Perhaps moving the DataContainer creator outside of the EventSource then does not matter. I guess the idea here is that the highest rate the DataContainer is updated is defined by the event rate, therefore the iterator should be over the event source, and the other sources act accordingly to the current event. The advantage of this over option 1 is ONLY that the class responsible for filling the event information is separate to the one responsible for filling the monitor information. Though really the same could be achieved with a higher level class that contains all three of the data-filling classes. |
I guess in that example, it might be nicer to have the "load_event" happen automatically in the event_source iterator, but the others be explicit (since otherwise what is the point of the iteration) |
I am not a fan of option 2 - accessing monitor information from a component becomes difficult, unless its explicitly passed. And has the potential for unclarity. |
Though I don't like that the user has to do so much just to get the data, it should be made as fool-proof and automatic as possible. However, I do like the idea that filling monitoring info could be just a Component (which could interpolate or otherwise) |
agreed - it's also not so easy to allow for changing runs |
Then it sounds like the way to go with this option should be: class DataFiller: # Name TBD
def __init__(self, event_url, mon_url):
self.event_source = EventSource(event_url) # Class responsible for only filling the event container
self.mon_source = MonitorSource.from_url(mon_url) # Class responsible for only filling the monitor container
# ....other sources
def __iter__():
data = DataContainer()
self.event_source.fill_container(data)
self.mon_source.fill_container(data) # Interpolates according to the current event in the container
data_filler = DataFiller(event_url, mon_url)
for data in data_filler:
print(data.event.mc.energy)
print(data.mon.tel[i].pointing.az) # this is just the "latest?"
print(data.event.mon.tel[i].pointing.az) # interpolated
print(data.service.subarray.tel_coords) Alternatively the DataFiller could be defined as: class DataFiller: # Name TBD
def __init__(self, event_url, mon_url):
self.event_source = EventSource(event_url) # Class responsible for only filling the event container
self.mon_source = MonitorSource.from_url(mon_url) # Class responsible for only filling the monitor container
# ....other sources
def __iter__():
for data in self.event_source:
self.mon_source.fill_container(data) # Interpolates according to the current event in the container
yield data As it doesn't seem to make sense to skip the event_source step anyway. Unless we had simulators we may wish to use.... |
Hi, So, given the present situation (where archive tools are still rather absent), I would prefer to leave a bigger role to the event source:
So probably this is more option 1? |
One thing we can start is by just agreeing what the monitoring data is: is it per-event (e.g. some slower monitoring stream that has been interpolated to be the right value for an event) I kind of prefer the event-wise monitoring, since we can then read the "slow" monitoring stream in any way we like, and just update the event version for each event using a Component. In that case, we only have I could imagine the loop looking like: # reads a file and just constructs an interp function for each telescope
pointing = PointingInterpolator.from_file("tracking.hdf5")
# Pedestals are computed on-the-fly, not from a file, so each time it is called, it might
# measure some data from the event, update the pedestals, and output the latest per-event version
pedestals = PedestalMeasurer()
for service, event in event_source:
event.mon.pointing = pointing(event.time)
event.mon.pedestals = pedestals(event.time, event.dl1) |
For things like the pedestals, we also have to think of adding an event buffer of some kind, so it can essentially be n events ahead of the reconstruction (e.g. so that it can already gather enough statistics to compute the pedestals for the first few events). Or we always do multi-pass processing, where we first compute pedestals and gains, and then in a second pass read them in (as with the pointing example above), and loop a second time to do apply the calibration and do the reconstruction. I think the multi-pass system is what France has implemented already, but it wouldn't be too hard to do both (but supporting on-the-fly computation of calibration of course adds complexity) |
The on-the-fly approach means we do not need to try and match the calibration to the events, and we instead rely on the latest calculated values. It also avoids having to have two loops through the data stream (assuming the calibration events are interleaved among the cherenkov events). I believe the workflow Franca has been working on uses the on-the-fly approach... she utilises the concept of interleaved calibration (pedestal & flat-field) events, and updates the calibration coefficients as she progresses through the observation. She uses a flag inside the event container to identify the type of event. In Franca's design, there is also the possibility of creating a different calibration-calculator subclass that instead reads the correct calibration coefficients from a file (perhaps created with a separate pipeline, i.e. the two-pass approach). In the case of a on-the-fly calibration, it might make sense that at the start of each observation, a bunch of calibration events are taken, to build up the statistics of the calibration. This means the same, single pipeline can be relied upon. Alternatively, the class loads the latest coefficients to begin with. The great thing about Franca's design is we are not restricted in how the calibration is performed - as long as it obeys the principle of filling the mon container for observation events, so that they can be applied. For things like the pedestal and flat-field (and other calibration parameters), I don't think we should force it to be written to the event/data container at the start of the pipeline. This would complicate how we handle the calibration. These coefficients depend on factors such as the charge extraction method that is used. Therefore, it would be better that the coefficients are calculated via following the normal pipeline, up to charge extraction, and then uses the result of the charge extraction to populate the calibration coefficient calculation classes. Ensuring compatibility. The following is how I imagine a on-the-fly calibration would work:
Similarly for a two pass calibration, the calibration events would follow the same pipeline until charge extraction. Then the values are extracted and stored. This approach ensures that the correct calibration coefficients are used for the specified pipeline configuration. I think it also makes the pipeline easy to follow, and reduces the chance for mistakes. Regarding the decision between event.mon or data.mon, I am more inclined to the second. It may be that the calibration coefficients are updated at the same rate as event...but only if the telescope chooses to do such a thing as interpolate or utilises part of the waveform without signal. There is a possibility that the telescope chooses to only update the calibration every second. As they have deemed this as enough and see nothing gained with the complication of interpolating between calibration coefficients. There is nothing wrong with the data.mon being updated at the same rate as data.event, but it is not something we should force - we should stick to the original reason for the separation of event and mon, which is that there is the possibility of them being updated at different rates. |
Thanks for this precise description @watsonjj, indeed this is what we are planing for LST online calibration. In this discussion I think we should remember that calibration will be an iterative process: the on-fly calibration is meant for feeding the RTA pipeline and producing the first DL0/DL1 data to be written on disk. Most probably, there will be afterwards a more fine calibration based on a multi-pass process and interpolation for a optimised quality. Therefore, we should think to a data model that is suitable for both approaches. My major concern at this point is that for the prototype data we should be able (now!) to easily connect the calibration data to the event data without having the possibility to inquire an archive system that is not available yet. Therefore, whatever we chose we should think that the connection will be done (at least for a while) at the ctapipe code level. For that, as a first guess I thought to give to the event_souce the role of "connector" and then option 1 seemed the most natural to be used. But the DataFiller proposed by Jason above (second option?) will also play the role. |
Hi,
The DataFiller could be a plugin Component as the EventSource class now, so that for raw (or specific) data analysis each telescope can define its own DataFiller and eventually DataContainer (inherited by the general DataContainer), e.g: class LSTDataContainer(DataContainer):
"""
Data container including LST information
"""
lst = Field(LSTContainer(), "LST specific Information") This would permit an immediate use of this approach with the prototype data. |
A couple of comments:
|
@moralejo, I think you mis-understood me. I will reply to the question explaining which is my problem right now: The point here is: for LST (and NectarCAM, which will take data in June) is urgent to converge rapidly with this discussion and start to implement the solution that we think the best. This will permit also to push in clean environment the monitoring containers in ctapipe and allow their use, test and improvement by other telescopes. |
Ok, thanks for the clarification. If we follow the same approach as for the "event" data, then the creation and (partial) filling of the standard mon containers for the case of the MC should be done by pyeventio, and not by a telescope-type-specific plugin. Or do you prefer to separate the creation of the mon containers from the io machinery? Probably for the MC that would be a complication, since the mon info (including calibration mon info, in the case you want to use the one provided by simtelarray instead of calculating it) is read from the simtel files.
For the real data you say that the mon containers are initialized by the LST io plugin, and that should still be the case. What is missing, I guess, is to add the code to fill them (with whatever is not to be calculated during the analysis) - once we will know from where we will retrieve e.g. the LST-1 pointing information etc. |
To clarify again: what I want to is to test the LST calibration procedure with MC calibration data (flat-flied and pedestals), therefore I need to access and use the mon containers with MC calibration data exactly as I do with real data (even if obviously in the case of MC data the events will be read by pyeventio).
For the moment the mon containers ARE in the LST plugin, so we don't have other choses. When they will be moved in ctapipe, we can indeed keep the initialisation in each telescope plugin by mean of a DataFiller plugin or similar. @kosack what do you think? Could we perhaps plan a discussion on this point the next ctapipe meeting (after the next ASWG)? |
This is most probably a temporary solution, while waiting for the outcome of the issue cta-observatory#1041
* Added Monitoring containers * Move pedestals.py in camera direcotry Delete empty file * Added Component PedestalCalculator to calculate pedestals from R1.waveform * Add test for the pedestas.py module * Add EventAndMonDataContainer to handle monitoring events. This is most probably a temporary solution, while waiting for the outcome of the issue #1041 * Add the calc_pedestals_from_traces function to the __all__ list * Add flat-field calculator components and their test * Change phe to pe * test_pedestals.py is now included in calib/camera/tests * Improved variable names related to the pixel stus masks * Improved docstring comments * Delete import not used * Improved treatement of failing pixels * Correct bug in test_flatfield.py Change initialization of FlasherFlatFieldCalculator.time_cut_outliers * Correct several code style errors as suggested by travis * Add some other style corrections
if we want to follow the current data model I presented in Bologna, we should really separate four things:
As discussed a bit in #1041, at least some of this should be loaded automatically by the EventSource (at least event + the header info, perhaps not the monitoring or service by default since those can be loaded later on demand). So if you want to implement all that here, let's decide on what it should look like. i.e.: source = EventSource(input_url=x) already here, the input can be a single file (events), or in the case of most prototypes and probably all real data, a set of files: either we assume there is one "index" file that contains links to others, like in gammapy, or we accept a directory here, or assume there is always a single file and we use the features of e.g. HDF5 to generate a file that contains linked datasets. How does LSTChain or NectarChain handle this? Then there is the use of source: source.instrument.peek() # would be nice to be able to do this before the event loop
print(source.simulation.mcheader)
print(source.simulation.thrown_shower_distribution) # in SimTel at end of file only, so not easy to do. In the HDF5 DL1, this could work though
print(source.observation) # implement only when we define what is in a Observation That means in SimTelEventSource having perhaps to do some trickery to make sure it's loaded, but shouldn't be hard. Defining classes that need subarray info, either at construction time, or call time?
The event loop itself: for event in source:
event = calibrate(event)
event = parameterize(event)
# or
event = parameterize(event, subarray=source.instrument) Which also brings up: is there anything else in instrument than the subarray? |
In fact-tools, we have a Service approach to this. These are basically classes that provide information given some property. E.g. GainService gets a timestamp and returns gains for each pixel. This would roughly look like this:
And inside
|
The Problem:
The concept of an
EventSource
has been quite successful and easy to use in ctapipe, since it allows one to write clear loops over events, where the event data is part of an iterable object. However, we now need to think about other data streams that are not tied directly to each Cherenkov event.When processing data, we don't only need the data for an event, but several other pieces of information
SubarrayDescription
)These info come at different rates than the events (much more slowly). However, some form of synchronization to the event data is needed.
Special design considerations
for Monitoring data
The monitoring (MON) data comes rates around 1HZ, but it is not necessarily correct to simply synchronize the "latest" MON data with the current event: normally one would like to do some interpolation, so how the monitoring data is accessed may need some thought. This means there may be two types of MON data:
These do not need to be stored in the same way, e.g. one might first read the raw MON data, construct an interpolator, and then use it to fill
event.mon
info (event-wise interpolated MON data)Note also that there are also two further types of MON data:
For Service Data:
In general these are all static information for a given observation, but in the case of Monte-Carlo files, it is possible for these to change in the middle of a file (e.g. if multiple runs are merged). The question is whether or not we support these changing for a single
EventSource
loop, or assume they are header info that is read once at the start.Parallelization and distributed processing
Right now parallelizing the event loop (e.g.
some_parallel_system.map(processing_function, source)
) means sending around all of the event, mon, and service data every time an event changes (even though the mon and service info doesn't change). This is probably very inefficient, since that is a lot of data to transmit. Separating the streams makes this much cleaner.Implementation Options
Most of these options involve generalizing
EventSources
, e.g. to have theEventSource
also synchronize MON and SVC data.This could be done in a few ways:
Option 1
a higher-level
DataContainer
that containsdata.event
,data.mon
anddata.service
splits (e.g. see #1040). One issue with this is that semantically we are no longer looping over "Events", but we could stipulate that the event part is the one to which all others are synched, so it still make some sense, even if it's clunky.Option 2
we keep only the
event
part in the iterator loop of theEventSource
, and have the svc data accessed from the source itselfIn this case, it's not clear to the user that the monitoring and service data are synced. E.g. the the user needs to re-read the service data if it changes.
Option 3
we always return all three data streams separately:
Here, it's clear that you can pass each one to further algorithms, and you might get a pointer to the same info , or new info, and it shouldn't matter.
Option 4
we handle MON and SVC data separately (probably a bit of an ugly solution, and not clear if this would work for MON data generated during the event loop)
Here, it's not quite clear what the event loop(s) should look like. I guess only EventSource would provide an iterator, and the others would just be interpolators or static info.
Any other ideas?
The text was updated successfully, but these errors were encountered: