Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: histogram weights not handled correctly in hist / boost conversion #774

Merged
merged 31 commits into from
Nov 2, 2022

Conversation

lobis
Copy link
Collaborator

@lobis lobis commented Oct 31, 2022

Currently the fSumw2 (weights) are not being correctly read/written when performing conversion to/from hist/boost histograms.

This is causing issues in #672 and solving this will also close #722.

  • We now use hist storage_type property to check wether it's a weighted histogram or not, instead of the variances which is always defined and was cuasing false positives.
  • Add two new tests, to check the conversion with a histogram with weights and another one without weights.
  • Updated slicing of values / variances for categorical axes.
  • Refactored weighted property and to_boost method to use inheritance.
  • Write test for 2D, 3D histograms with weights / labels.

@lobis lobis changed the title Histogram weights not handled correctly in hist / boost conversion fix: histogram weights not handled correctly in hist / boost conversion Oct 31, 2022
@lobis
Copy link
Collaborator Author

lobis commented Nov 1, 2022

I refactored weighted property and to_boost method using inheritance with the goal of reducing code replication. However perhaps there is a reason this was not done in the first place, and I can see lines such as no_inherit = (uproot.behaviors.TH1.TH1,) hint me to think this way.

Let me know if I should remove it @jpivarski or if maybe I should have made a separate PR for this. Perhaps I should've implemented it in the TH1 class instead of Histogram class.

@lobis lobis marked this pull request as ready for review November 1, 2022 18:18
@lobis lobis added the next-release Required for the next release label Nov 1, 2022
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! All of my requested changes are inline.

src/uproot/behaviors/TH1.py Outdated Show resolved Hide resolved
src/uproot/behaviors/TH1.py Outdated Show resolved Hide resolved
tests/test_0167-use-the-common-histogram-interface.py Outdated Show resolved Hide resolved
src/uproot/behaviors/TH1.py Outdated Show resolved Hide resolved
src/uproot/behaviors/TH1.py Outdated Show resolved Hide resolved
src/uproot/behaviors/TH1.py Outdated Show resolved Hide resolved
src/uproot/behaviors/TH2.py Outdated Show resolved Hide resolved
src/uproot/behaviors/TH2.py Outdated Show resolved Hide resolved
src/uproot/behaviors/TH3.py Outdated Show resolved Hide resolved
src/uproot/behaviors/TH3.py Outdated Show resolved Hide resolved
@lobis
Copy link
Collaborator Author

lobis commented Nov 2, 2022

Looking at the code of TH{1,2,3}.py I see it can be refactored in a similar way to what is done in to_boost in this PR (implement to_numpy, etc in Histogram). This would leave only axes property and perhaps the axis method on TH{1,2,3}.py.

I don't think the complexity of the code increases enough to discourage this and it would reduce code duplication (triplication!), but I would love to hear some thoughts @jpivarski @agoose77.

@jpivarski
Copy link
Member

Generally, I didn't worry about code duplication up to three, particularly if it was right next to each other and formed a clear pattern. (The problem with duplicated code is that a future maintainer might fix one copy without applying the same update to the other copies, and then they get out of date until another future maintainer wonders if they were supposed to be the same or not.) In the case of histograms, I knew that the code duplication/triplication would never go beyond three, since ROOT only defines 1, 2, and 3-D histograms.

I'll let you decide whether you want to squash more duplication. As it is, I would accept this PR now. So if you consider yourself done, go ahead and do the honors by pressing the "squash and merge" button. (Make sure it's set to "squash and merge" mode; we don't want all of the commits in this PR to be individual items in the global history.)

image

@lobis lobis merged commit ec80b88 into scikit-hep:main Nov 2, 2022
@lobis lobis deleted the fix-hist-weights branch November 2, 2022 14:26
jpivarski added a commit that referenced this pull request Nov 2, 2022
* fix: initialize empty `TObject` members on `to_TObjString`

* add test for serialization of `TObjString`

* remove unused dependency on test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add `tojson` method to `TObjString`

* add additional check to `TObjString` write test

* fix bad field in `TList` tojson conversion

* add inexpensive `assert` to `TList` serialization

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add `to_THashList` method

* add aux method to create `THashList` from categorical axis

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update src/uproot/writing/identify.py

Co-authored-by: Angus Hollands <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix bad serialization of non-empty TList due to options (#763 (comment))

* add tests for TList serialization

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixed bad `__repr__` for `TObject`

* add serialization of `fUniqueID` to `TObject`

* add `empty` method to `TObject`

* remove redundant `TObject` member initialization

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add serialization for `THashList`

* made `THashList` writable

* add test to check serialization of `TList` vs `THashList` (which should be mostly equal)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* initialize boost histogram axis as `IntCategory` if labels can be converted to integers

* Update src/uproot/writing/identify.py

Co-authored-by: Jim Pivarski <[email protected]>

* moved `TList` serialization list to `serialize` method

* add helper serialization method `bytestring` as suggested in #763 (comment) by @agoose77

* keep `TList` `_options` as python `bytes` and update serialization to use the new `bytestring` helper

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add bin index as `TObject.fUniqueID` when setting hist labels as done by root

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* reset `serialization.py` to `main` branch status

* Revert "keep `TList` `_options` as python `bytes` and update serialization to use the new `bytestring` helper"

This reverts commit 8e6ad2e.

* Revert "[pre-commit.ci] auto fixes from pre-commit.com hooks"

This reverts commit 897972f.

# Conflicts:
#	src/uproot/serialization.py

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Revert "reset `serialization.py` to `main` branch status"

This reverts commit 9a500c1.

* Revert "Revert "keep `TList` `_options` as python `bytes` and update serialization to use the new `bytestring` helper""

This reverts commit 884ec4d.

* Apply suggestions from code review

Co-authored-by: Jim Pivarski <[email protected]>

* initialize `opt` with different list initialization for consistent style

* add back two spaces before imports (this is not handled by the pre-commit formatter!)

* Update src/uproot/models/TObject.py

Co-authored-by: Angus Hollands <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix bad label placement

* axis `fName` will always be one of the default values (`xaxis`, `yaxis`, `zaxis`)

* add test for categorical histogram

* hist categorical test: check for correctly overwritten axis name

* add missing assert to test

* updated tests for new axis naming behaviour

* remove unused dependency from test

* do not used mutable default arguments

* fix `to_hist` not using options from function arguments (was using default arguments all the time instead)

* fix not checking for all categories

* add test for TH1 -> hist/boost conversion

* TH1 to boost/hist do not use weights (temporary solution)

* build action: move checkout right before install

* add debug step to pipeline

* fix test not being updated to new functionality

* reverted build CI to original state

* updated test to check for working `to_hist` conversion

* simplified test

* add check for boost histogram conversion

* check also for intcategory to resize values

* add test for TH3 histogram

* remove redundant test

* add link to issue on test

* add temporary fix to bad `to_boost` conversion for categorical histograms

* add tmp fix

* reduced repeated code for slicing values when categorical axis via helper function

* implemented `to_boost` in `Histogram` parent class to reduce code duplication

* remove now unecessary helper function `_slice_values_if_categorical_axis`

* remove "temporary" `and False`

* remove unused dependency

* add missing asserts

* add test for TH1 weights from root

* add test for issue

* alternative way to detect weighted hist

* made `fSumw2` None if storage is not weights

* remove comments from test

* add test for hist with weights

* fix bad check for storage type

* remove comment

* add test for hist with(out) weights and labels

* updated TH1 `to_boost` to handle weights/labels better

* placed histogram `to_boost` in parent `Histogram` class to reduce code duplication

* updated `weighted` property

* implemented histogram `weighted` property in parent `Histogram` class

* using weighted property instead of copying check

* do not use mutable default arguments

* fix calling property

* add missing asserts to test

* add test for issue #722

* add weight test for 2D and 3D histograms

* add temporary skip to test until file is uploaded

* update issue test file

* add back temporary skip until file is available

* Apply suggestions from code review

Co-authored-by: Jim Pivarski <[email protected]>
Co-authored-by: Angus Hollands <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add suggestion from #774 (comment)

* add check for length of `fSumw2` greater than 0 so empty histograms are not weighted

* remove unnecessary subclass method implementation

* addressed #764 (comment)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Angus Hollands <[email protected]>
Co-authored-by: Jim Pivarski <[email protected]>
@jpivarski jpivarski removed the next-release Required for the next release label Feb 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cannot convert TH1 to hist via .to_boost()
3 participants