Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Histogram Scope #114

Open
iguinn opened this issue Nov 1, 2024 · 5 comments
Open

Histogram Scope #114

iguinn opened this issue Nov 1, 2024 · 5 comments
Labels
question Further information is requested types LGDO types

Comments

@iguinn
Copy link
Contributor

iguinn commented Nov 1, 2024

I recently added a .fill function to the lgdo histogram. This turned out to be controversial, under the argument that any non-trivial histogram operations should be done by the hist library. So we should figure out the proper scope of this class. @lvarriano @ManuelHu @gipert @jasondet

Option 1: It's purely an I/O class. If we want to do any manipulations at all on it we should look towards other implementations
Option 2: It's a data container class. This means it should have both I/O functions and functions for setting the contents (this was the logic I was following when adding fill). Under this logic the only other functions from boost-hist (https://boost-histogram.readthedocs.io/en/latest/user-guide/histogram.html) that I could imagine wanting, are reset, and maybe add. A counter argument to this is the risk of scope creep.
Option 3 (which nobody wants but I include for completeness): A full fledged hist class that does all the things in https://boost-histogram.readthedocs.io/en/latest/user-guide/histogram.html. This seems clearly beyond the scope of what we want, and using view_as to get these kinds of manipulations seems clearly preferable.

@iguinn iguinn added question Further information is requested types LGDO types labels Nov 1, 2024
@lvarriano
Copy link
Contributor

Personally, I would use it purely to write and read the bins and data/weights from and to boost-hist, whatever is needed to recreate the boost-hist object without having to re-fill it with data. i.e. if you use LGDO to read a histogram from an .lh5 file, it should return a boost-hist, and if you use LGDO to write a boost-hist, it writes it to file in some predetermined way. That way, the histogram can be saved and communicated easily between different people and scripts and does not require adding any supporting methods to LGDO. It is a small price to convenience if the user cannot directly apply some methods to the LGDO object, but I would think it makes implementation much easier and cleaner. So this falls to your option 1.

@gipert
Copy link
Member

gipert commented Nov 1, 2024

As Louis said, we should try to keep LGDOs as pure I/O classes as possible. Third-party formats are much better than us at data manipulation and we want to keep our code as lightweight as possible. This is the overall philosophy behind this package.

As a side note: I personally think that the approach followed by the LH5IO.jl package (i.e. directly read data into Julia native data structure) is better than ours, but it would be harder to implement in Python, unfortunately.

@ManuelHu
Copy link
Contributor

ManuelHu commented Nov 1, 2024

I see one problem with option 1, especially with large histograms. The conversion into a hist.Hist or boost-histogram is always copying the whole data buffer. So if you want to update a histogram in an LGDO file, this way would always use twice the actually required memory.

For small histograms this would be negligible and not worth the code overhead here.
For optical maps this could easily be some GBs of extra memory allocation.
(I am not saying that I am at the moment thinking about using this fill function for reboost(-optical), but it might come in handy for similar cases).

@iguinn
Copy link
Contributor Author

iguinn commented Nov 1, 2024

I think part of the problem with trying to make them pure I/O, is that sometimes I/O is best done with somewhat specialized containers (see #109). When dealing with large amounts of data, iterating becomes important, and so having a way to fill in steps becomes useful (or in the case of other objects append). In this case, the hist library is quite good at filling in this way, so in this case we can probably get away more with pure I/O. However, in other cases, filling directly into the LGDO object is quite normal; also, the numpy arrays required a lot of re-allocating/copying to do any change in size, so I'm not sure we want to make pure I/O our goal?

The documentation makes it sound like hist.Hist.view() is no-copy (although I haven't tested it or anything...).

I'll also add that if the best way to fill histograms is through boost histogram, we should include filling in the tutorial.

@ManuelHu
Copy link
Contributor

ManuelHu commented Nov 2, 2024

The documentation makes it sound like hist.Hist.view() is no-copy (although I haven't tested it or anything...).

That's right. But there is no way to create a hist.Hist from a numpy buffer, without copying data (the data= constructor argument also copies in all cases...).
This is a different situation than for - just an example - pandas tables that can be initialized and manipulated from a loaded LGDO without copying all data again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested types LGDO types
Projects
None yet
Development

No branches or pull requests

4 participants