
Proposal to remove ext files from dev branch #852

Open
ajedele opened this issue Apr 26, 2023 · 19 comments

Comments

@ajedele
Contributor

ajedele commented Apr 26, 2023

The external files are generated for each experiment in the unpacker. The external files are used by the reader to map the channels correctly.

The external files are experiment specific: they should be generated by the experiment-specific unpacker and copied over for each experiment. Otherwise we risk incorrect mapping of channel numbers, ADC ranges, and trigger information. Our naming and mapping change often enough that this will be an issue.

Since this is in the 1st step of the analysis, most new users will be unaware of the potentially incorrect external files.

@klenze
Contributor

klenze commented May 5, 2023

While the unpackers are experiment specific, I do not think that the resulting files describing the mapped data are necessarily so.

Two typical cases of changes might appear:

  • One experiment uses more PSPs than another one, so that the number of channels in the ext_h101 file changes between experiments. This is not a problem, because one can use the file created from the experiment with more channels; the smaller experiment simply has some array-size overhead (a sketch follows this list).
  • One experiment's unpacker records additional features (as with the CALIFA per-channel WRTS). In principle, the ext_h101 format is backwards compatible there, but one might have to decide which fields are optional nice-to-haves and which are essential (e.g. the califa wrts).
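
For the first point, a minimal sketch of the idea (the struct and member names are made up for illustration, not taken from a real generated ext_h101 header):

    // Illustrative only -- not a real generated header; member names are made up.
    // The arrays are sized for the maximum channel count, so a header generated
    // for the larger setup also works for a smaller one: the extra entries are
    // simply never filled (only some memory overhead).
    #include <cstdint>

    struct EXT_STR_h101_FOO_sketch   // hypothetical detector "FOO"
    {
        uint32_t FOOM;        // number of channels that fired in this event
        uint32_t FOOMI[64];   // channel ids,    only entries [0, FOOM) are valid
        uint32_t FOOME[64];   // channel values, only entries [0, FOOM) are valid
    };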

It would be great if every ext h101 file committed to r3broot included the unpacker (name, git hash), the date, the path, and the exact invocation used to create it.

If we were to auto-generate the h101 headers during compilation, this would imply that each compile can only work with a single experiment. I would favor the current approach where one can analyze different experiments with the same compile. (I would however favor CI tests with experiment unpackers so we can ensure that nobody breaks the compatibility of the h101 files in git with any unpacker.)

On a side note, I think the construction of the overall h101 struct for the ucesb reader in the C macros, complete with offsetof's and everything, is something which should go. The overall struct consisting of N detector-specific structs does not seem to be required anywhere. Instead, each reader class could tell the R3BUcesbReader (or source or whatever) the size of its h101 struct at runtime, and the R3BUcesbSomething could use the arithmetic technique of addition to figure out how large the buffer should be in total, and provide each reader with a pointer to its region in the buffer. For bonus points, this should be accomplished by a variadic method instead of passing void pointers.
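
A minimal sketch of that runtime approach (all class and method names here are made up for illustration; only the idea of summing struct sizes and handing out pointers is from the paragraph above):

    // Sketch of the proposed runtime approach (hypothetical names, these are not
    // existing R3BRoot classes). Each reader reports sizeof() of its own h101
    // struct; the source sums the sizes, allocates one buffer and hands each
    // reader a pointer to its region.
    #include <cstddef>
    #include <vector>

    class ReaderBase
    {
      public:
        virtual ~ReaderBase() = default;
        virtual std::size_t StructSize() const = 0;  // e.g. sizeof(EXT_STR_h101_Foo)
        virtual void SetBuffer(void* region) = 0;    // region inside the shared buffer
    };

    class UcesbSource
    {
      public:
        void AddReader(ReaderBase* r) { fReaders.push_back(r); }

        void AllocateBuffer()
        {
            std::size_t total = 0;
            for (auto* r : fReaders)
                total += r->StructSize();            // the "arithmetic technique of addition"
            fBuffer.resize(total);

            std::size_t offset = 0;
            for (auto* r : fReaders)
            {
                r->SetBuffer(fBuffer.data() + offset);
                offset += r->StructSize();
                // A real implementation would also have to respect the alignment
                // of each struct instead of packing them back to back.
            }
        }

      private:
        std::vector<ReaderBase*> fReaders;
        std::vector<char> fBuffer;
    };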

Of course, it could also be argued that there is no reason why the readers one wants to instantiate should be specified in the macro at all, as they are implied by the mapping. If the unpacker of an experiment provides data of some detector Foo, just instantiate R3BFooReader. (In that case, there should be some way to override this. When analyzing calibration runs of Califa with the s522 unpacker, one probably would not want to instantiate the readers for all of the other detectors. Of course, one fix would be to use the califa_only unpacker for that (if it were not for the misguided decision to encode the presumed gain settings of the califa preamplifiers in the channel mapping).)

Another improvement would be to allow multiple instances of one reader which read data of the same type, but with different name prefixes.

For example, if both the PSPs and CALIFA use FEBEX with the TUM firmware, we can already have the same code handle the unpacking. If we break the 1:1 mapping between ext h101 structs (data types) and the name string prefix (instances), we could also use two instances of the same R3BTumFebexReader(std::string srcPrefix) for reading both of them into two TClonesArrays of identical type but different names.
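
A rough sketch of those two instances (the constructor signature is the one suggested above; the data class name and the registration details are illustrative, not existing R3BRoot code):

    // Illustrative sketch only: one reader class, instantiated twice with
    // different name prefixes, filling two TClonesArrays of the same type.
    // "R3BFebexMappedData" is a hypothetical data class.
    #include <string>
    #include <utility>
    #include "TClonesArray.h"
    #include "FairRootManager.h"

    class R3BTumFebexReader
    {
      public:
        explicit R3BTumFebexReader(std::string srcPrefix)
            : fPrefix(std::move(srcPrefix))
            , fArray(new TClonesArray("R3BFebexMappedData"))
        {
            // Register the output branch under a per-instance name,
            // e.g. "CalifaFebexMapped" vs. "PspFebexMapped".
            FairRootManager::Instance()->Register((fPrefix + "FebexMapped").c_str(),
                                                  "Mapped", fArray, kTRUE);
        }

      private:
        std::string fPrefix;   // would also select the h101 item prefix, e.g. "CALIFA" vs. "PSPX"
        TClonesArray* fArray;
    };

    // Usage: two instances of the same class, two differently named outputs.
    // R3BTumFebexReader califaReader("Califa");
    // R3BTumFebexReader pspReader("Psp");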

Perhaps @inkdot7 or @bl0x have some comments.

@inkdot7
Contributor

inkdot7 commented May 5, 2023

This seems related to the cleanup that is needed to address the issues of incomplete data mappings of #712 .
(I think my comment dated Dec 15 could be relevant.)

Keeping generated files away from version control systems is a good idea. (Edit: to avoid misunderstandings: this means either generating them on-the-fly or avoiding them altogether. Merely keeping the files outside version control is not an improvement.)

One thing that should for sure be changed is to stop using the ..._LAYOUT_INIT - they are not nice.

git grep LAYOUT_INIT

should then show nothing. That would be a step on the way.

Hmm, perhaps they are not actually used, just copy-pasted? Then it would be much easier to get rid of them. After that, the comment of Dec 15 could be re-read.

@inkdot7
Contributor

inkdot7 commented May 5, 2023

I should perhaps clarify a bit:

The structures declared/used in r3broot I imagine to be hand-crafted/cleaned versions of what may originally have come from ucesb as a suggestion for which structure members are available for a detector (i.e. a generated header). Likewise, the EXT_STR_xxx_ITEMS_INFO structure for that detector (or even part of a detector) is a (possibly sanitized) version, gradually modified as new members are added. That information is given to the ext_data interface on the reading end (r3broot) by ext_data_structure_info means in the call to ext_data_setup().

From the writing end, the ext_data interface gets a list of everything which is available from a certain unpacker. It does not need any generated header file for that. It always comes with the data!

There is thus no need to generate any headers again and again, e.g. during compilation. There is also no need to dump them for full experiments, since the bits and pieces are known by the substructures in r3broot.

@inkdot7
Contributor

inkdot7 commented May 6, 2023

UCESB has been fixed to not include the deprecated LAYOUT_INIT items in generated headers (unless particularly requested with LAYOUT next to STRUCT_HH). Also found+fixed a small bug when hunting around, so thanks for bringing this up.

@inkdot7
Contributor

inkdot7 commented May 6, 2023

The LAYOUT_INIT stuff is gone! #859

@ajedele
Contributor Author

ajedele commented May 7, 2023

Hi all,

As Haakan mentioned, 'Keeping generated files away from version control systems is a good idea.' That is my suggestion.

Most of our Readers are written to be versatile with regard to detector and channel numbers. They extract this information from the mapping information in the ext_h101_{det name}.h file. For 'frozen' detectors the mapping in the unpacker is consistent, so you can copy-paste the ext files from a previous experiment. However, the number of truly 'frozen' detectors is very small. In addition, the naming convention and channel mapping are commonly not 1-to-1 from experiment to experiment. For example, the mapping and naming conventions of the tofd have deviated slightly for every experiment.

Most students who do the calibrations do not know that the ext files are generated files. They just use the ext_h101_tofd.hh files in the Git repository. This is also not clearly documented anywhere in the code base, which is not friendly to basic users. If you (for example) use the ext_h101_tofd.hh file from the previous experiment and the mapping and naming conventions are not the same, the code should fail, but it does not always do so (especially if it is a mapping problem).

I do NOT want the ext files to be generated during the compile. That's overkill. Rather, the ext files should just be removed from version control and copied over for each experiment. Alternatively, as Philipp mentioned, we could just add the experiment to the end of the ext file name. I think that might be the easiest - especially since the current naming convention is all over the place and the ext file repository needs cleaning up.

@klenze
Contributor

klenze commented May 7, 2023

From my understanding, the typical ext_h101 file defines a struct with some member variables and a macro to map these to the corresponding data fields in the stream by name. (Onion data structures are just byte-compatible with the flat structures and are thus initialized with the same macro, it seems.)

As the mapping between struct member names and strings in the macro is 1:1, I would assume that a mapping file which has different strings for mapping fields from the stream will also have different names in the struct (i.e. will not compile with the reader code). I had a look at the ext_h101_tofd.h for S522 and S454 and the onion struct does not look very different (apart from array sizes differing, which seems fine (if you have the largest size in git)).

The actual mapping being different is a separate problem. I do not have any good solution for mapping in general. Doing the mapping in R3BRoot would involve tons of illegible parameter containers. Mapping in the unpacker (which is what we do) avoids this, but it turns out that there are mapping-adjacent problems (like "what is the febex path of that califa channel?" or "what is the trigger timing I have to use with that TofD channel?") which do not lend themselves to being encoded as a permutation. (Not impossible, though: califa encodes the gain setting as part of a channel id. Still, I very much disrecommend using the most significant byte of the TofD channel index to store the index of the trigger timing channel to use with it.) In practice, this means that the mapping is all over the place: the main mapping happens in the unpacker, and either the complete mapping or ancillary parts of it are also accessed from R3BRoot.

@inkdot7
Contributor

inkdot7 commented May 7, 2023

ext_data_fetch_event() does the remapping of item locations from the source to the destination structure based on names, as given by *struct_info in the call to ext_data_setup().

@klenze
Contributor

klenze commented May 7, 2023 via email

@inkdot7
Contributor

inkdot7 commented May 7, 2023

@klenze yes. And the name 'smuggling' too! I.e. the string name is what is matched, the 'bare' name is what is used to find the actual offset in the local (=destination) structure.
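
A self-contained toy illustration of that name matching (this is not the ext_data API, just the mechanism in miniature; the struct and item names are made up):

    // Toy model of name-based remapping -- NOT the real ext_data interface.
    // Destination items are declared as (name -> offset in the local struct);
    // the "fetch" step looks up each source item by name and writes it there.
    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <map>
    #include <string>

    struct Dest { uint32_t TOFD_T; uint32_t LOS_T; };   // hypothetical local struct

    int main()
    {
        // What the setup step conceptually receives via the structure info:
        std::map<std::string, std::size_t> destOffsets = {
            { "TOFD_T", offsetof(Dest, TOFD_T) },
            { "LOS_T",  offsetof(Dest, LOS_T)  },
        };

        // What conceptually arrives with the data from the writing end:
        std::map<std::string, uint32_t> sourceItems = { { "LOS_T", 42 }, { "TOFD_T", 7 } };

        // The fetch step: match by string name, place by local offset.
        Dest d{};
        for (const auto& [name, value] : sourceItems)
        {
            auto it = destOffsets.find(name);
            if (it != destOffsets.end())
                std::memcpy(reinterpret_cast<char*>(&d) + it->second, &value, sizeof(value));
        }
        return (d.TOFD_T == 7 && d.LOS_T == 42) ? 0 : 1;
    }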

@YanzhaoW
Contributor

Could a better solution be to have a separate git repository for all ext header files and to include that repository as a git submodule under this one?

We could tag the version of the ext header repository with the experiment name before each beam time. If someone needs to analyze old data, it is then also very flexible to switch the ext header repository back to a specific experiment version while keeping the latest version of R3BRoot.

@lr11345

lr11345 commented Nov 14, 2023

A candidate place to store them would be the experiment's upexps dir, since they are generated from there, e.g.
/u/land/fake_cvmfs/10/cvmfs_fairroot_v18.8.0_fs_nov22p1_extra/upexps/202402_s091_s118/ext/ext_...

@inkdot7
Contributor

inkdot7 commented Nov 14, 2023

Could a better solution be to have a separate git repository for all ext header files and to include that repository as a git submodule under this one?

We could tag the version of the ext header repository with the experiment name before each beam time. If someone needs to analyze old data, it is then also very flexible to switch the ext header repository back to a specific experiment version while keeping the latest version of R3BRoot.

Dear @YanzhaoW @lr11345, what you are proposing is duplication of code. That is the opposite of what I suggested and something which I will not deal with. In fact, the duplication in upexps is something I have pointed out for a very long time as not being good, and have asked to be cleaned up.

@YanzhaoW
Contributor

Dear @inkdot7

I agree. But I'm not sure it's possible to always keep this backward compatibility for all detectors. For NeuLAND, sure, an ext header file with trigger for 13 dps would also work for 12 dps without trigger (I will test this) iff we never change the var names or var types. But I don't know whether this could also be applied to all other detectors. If yes, we should just use the latest version of the ext header for old experiments as well. If not, then maybe a git submodule is an option.

@inkdot7
Contributor

inkdot7 commented Nov 14, 2023

I agree. But I'm not sure it's possible to always keep this backward compatibility for all detectors. For NeuLAND, sure, an ext header file with trigger for 13 dps would also work for 12 dps without trigger (I will test this) iff we never change the var names or var types.

If there is a need (or simply better names come up) then change all experiment analysis / unpackers to use those names. That will of course be much easier with less copy-paste in upexps.

But I don't know whether this could also be applied to all other detectors.

Most of them at least. Begin there, and hopefully the joy spreads (i.e. the amount of code in git shrinks).

@klenze
Contributor

klenze commented Nov 15, 2023

Could it be a better solution that we have a separate git repository for all ext header files and include this separate repository as a git submodule under this repository?

I absolutely would not recommend subjecting our users to git submodules.

@inkdot7: A practical problem @ajedele has encountered is that for some experiments, ROLU ran in a separate TDC (which has its own trigger timing channels which can and must be mapped) and for other experiments, ROLU ran in the same TDC module as LOS, so the trigger signals would have to be mapped to both LOS and ROLU, which I vaguely recall not working?

Also useful would be the ability to declare a signal which gets mapped to nothing (or a zero element signal array) to make a consumer happy. Of course I fully expect you to tell us to actually take ownership of our h101 in the consumer and just use EXT_DATA_ITEM_FLAGS_OPTIONAL. :-)

Looking at the PSP h101 file, it seems to me that the number of preprocessor lines generated scales with the number of detectors.

A decent language would provide a way to generate all these by just feeding the onion version of the psp struct into an appropriate macro, but the metaprogramming facilities of C++ (templates, preprocessor) are not suitable for that.

@inkdot7
Contributor

inkdot7 commented Nov 15, 2023

@inkdot7: A practical problem @ajedele has encountered is that for some experiments, ROLU ran in a separate TDC (which has its own trigger timing channels which can and must be mapped) and for other experiments, ROLU ran in the same TDC module as LOS, so the trigger signals would have to be mapped to both LOS and ROLU, which I vaguely recall not working?

Also useful would be the ability to declare a signal which gets mapped to nothing (or a zero element signal array) to make a consumer happy. Of course I fully expect you to tell us to actually take ownership of our h101 in the consumer and just use EXT_DATA_ITEM_FLAGS_OPTIONAL. :-)

Yes. :-)

To not duplicate code, it would be useful to track which detectors are measured on a global clock, and which are measured 'just' relative to a master start signal.

(Thanks for pointing this out, as it explained what is meant by trigger time channels.)

Looking at the PSP h101 file, it seems to me that the number of preprocessor lines generated scales with the number of detectors.

Sounds very plausible. Is unfortunate.

A decent language would provide a way to generate all these by just feeding the onion version of the psp struct into an appropriate macro, but the metaprogramming facilities of C++ (templates, preprocessor) are not suitable for that.

Yes, the proposal is to do that semi-manually, i.e. to extract the core out of the ext_files and then do some loop around that to set up multiple detectors.
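
One possible reading of that "loop around the core", sketched as an X-macro (detector and member names are made up; this is my illustration, not a concrete proposal from the thread): the per-detector core is written once and expanded once per detector, instead of maintaining N near-identical copies of the preprocessor lines.

    // Sketch of a "semi-manual loop": the per-detector core is written once and
    // expanded per detector with an X-macro. All names are illustrative only.
    #include <cstdint>

    #define FOR_EACH_DETECTOR(X) \
        X(PSPX1) \
        X(PSPX2) \
        X(PSPX3)

    // Core: the members every detector of this type contributes.
    #define DETECTOR_MEMBERS(det) \
        uint32_t det##M;          \
        uint32_t det##MI[64];     \
        uint32_t det##ME[64];

    struct EXT_STR_h101_PSP_sketch
    {
        FOR_EACH_DETECTOR(DETECTOR_MEMBERS)   // expands the core for each detector
    };

    #undef DETECTOR_MEMBERS
    #undef FOR_EACH_DETECTOR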

@YanzhaoW
Contributor

I absolutely would not recommend subjecting our users to git submodules.

Why?

A decent language would provide a way to generate all these by just feeding the onion version of the psp struct into an appropriate macro, but the metaprogramming facilities of C++ (templates, preprocessor) are not suitable for that.

Do you mean compile-time reflection?

@klenze
Contributor

klenze commented Nov 17, 2023

Why?
Submodules share some properties of the much-detested macros repository, though the details differ.
The increased flexibility comes at the price of increased complexity. Anyone making a PR which changes both an h101 file and a reader in a forward- and backward-incompatible way would have to make two PRs. CI would have to use the h101 commit before it is really merged. Users would have to go through additional steps to get a compilable checkout.

I think the main use case for git submodules might be instances of Conway's law. If you have a handful of different teams working with a frequently changing lightweight interface (such as a protocol or file format), then it might make sense to put that interface into a git submodule in all of their projects. (I dunno, I have never designed such a system.)

Maintaining one h101 branch per experiment would not be a good use of git submodules (where a commit in the parent repository always contains the hash of a single commit in the child repository).

Additionally, most of the arguments against the original proposal to auto-create the h101 files also apply:

  • The h101 format can handle the case of "there is a field missing" much more gracefully than C structs. We would have to have #if CALIFA_h101_has_TOT in the reader or something.
  • For each experiment, we would need to figure out in CMake which readers we actually can compile dependent on the detectors which were used.
  • Not being able to use the same build (or indeed working directory, with submodules) to analyze multiple experiments would be a hassle.

Anecdotally, the CALIFA DAQ originally used submodules within submodules. That was already a hassle even though I had push permissions, and I flattened everything into a single repository. (We have an independent repo for per-machine configurations, but that is in a .gitignored subdirectory.)

Do you mean compile-time reflection?
Yes
