Correcting a JEC indexing bug for the pat::Jet class in association to scaleEnergy function calls #36559

Merged
merged 1 commit into from
Jan 8, 2022


@errai- errai- commented Dec 21, 2021

PR description:

  • When an energy scale was added to a jet using the scaleEnergy() function, the index of the current JEC level (currentJECLevel_) was not correctly updated
  • The scale is inserted into the JEC collection vector at index 0, so each call shifts the correct index of the current JEC level up by one
  • The bug is mostly visible in use cases of the function jecFactor(), involving additionally jet energy fraction functions and the function correctedJet() EDIT: AND the function correctedP4()
  • As additional energy scales are rarely added in Data, this issue is most likely to touch the analyses of MC samples
  • Special features of MC sample analysis:
    • Before the fix, if the original JEC level was set to L3Absolute, one "safe call" of scaleEnergy() was allowed, because L3Absolute is a unitary dummy; two calls would break the return values of jecFactor()
    • Before the fix, if the original JEC level was set to L2L3Residual, two "safe calls" of scaleEnergy() were allowed, because also L2L3Residual is a unitary dummy in MC; three calls would break the return values of jecFactor()
    • Standard jet smearing produces one call of scaleEnergy(), see here
    • E.g. systematic jet energy variations can produce another call to scaleEnergy(), which would be enough for breaking the JECs if the original level was L3Absolute
    • If the code is not fully optimized, there might exist several calls to scaleEnergy() instead of a single collective call, potentially causing the JECs to break
    • These points underline the fact that the bug can be very elusive and only appear e.g. in the study of systematics
  • Possible checks to see if an analysis or ntuple production is affected:
    • There is no silver bullet, as the bug can get activated in many ways
    • One could, however, start by checking whether L3Absolute or L2L3Residual was used for jets in MC
    • Next, one could run 'grep -rn . -e "scaleEnergy"' and/or 'grep -rn . -e "scaleEnergy("' on the whole code base and check the number of consecutive uses of the energy scaling function per single jet, including the one call from jet smearing
    • Consecutive calls keep accumulating even if a jet is copied to another jet collection
    • If one is still unsure about the number of scaleEnergy() calls per jet, one could check this by calling 'jet.availableJECLevels()' on the final jet collection and counting the "Unscaled" entries per jet
    • If there are at most one (L3Absolute) or two (L2L3Residual) scaleEnergy() calls per jet, one should be safe
    • If there are more calls than this but one does not utilize the functions jecFactor() or correctedJet(), the effects are very limited
    • EDIT: also the function correctedP4() needs to be checked, as it employs correctedJet()
    • One should probably check the usage of these functions with the same grep tricks as with scaleEnergy()
    • Also the jet energy fractions and hence the JetID are affected if the bug gets activated, but this is a very limited effect
    • If the bug gets active, the most likely consequence is that L2Rel or L1+L2Rel JECs are applied twice, causing a scaling effect around 5-10%
  • The virtual slides with exhaustive documentation can be found at https://indico.cern.ch/event/1005590/#5-patjet-bug-report-virtual
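
The availableJECLevels() check from the list above could be scripted roughly as follows. This is a minimal sketch and not part of the PR: the helper names countScaleEntries, safeCallBudgetMC and withinSafeBudgetMC are hypothetical, the level names are assumed to arrive as a std::vector<std::string> (for example copied from jet.availableJECLevels()), and the budget of zero for any other starting level is a conservative assumption rather than something stated in the description.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Number of entries in `levels` that carry the label written by scaleEnergy().
// "Unscaled" is the label quoted in the PR description.
inline int countScaleEntries(const std::vector<std::string>& levels, const std::string& label = "Unscaled") {
  return static_cast<int>(std::count(levels.begin(), levels.end(), label));
}

// "Safe call" budget for MC before the fix, as quoted in the PR description:
// one call for L3Absolute, two calls for L2L3Residual.
inline int safeCallBudgetMC(const std::string& originalLevel) {
  if (originalLevel == "L3Absolute")
    return 1;
  if (originalLevel == "L2L3Residual")
    return 2;
  return 0;  // assumption of this sketch: any other level has no unit factor to absorb an offset
}

// True if the number of recorded scaleEnergy() calls stays within the pre-fix safe budget.
inline bool withinSafeBudgetMC(const std::vector<std::string>& levels, const std::string& originalLevel) {
  assert(!levels.empty());
  return countScaleEntries(levels) <= safeCallBudgetMC(originalLevel);
}
```

For a level list with two "Unscaled" entries, withinSafeBudgetMC(levels, "L3Absolute") returns false, while the same count is still within the budget for "L2L3Residual".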

PR validation:

  • Explicit printouts of the contents of the vector jet.availableJECLevels() and the values of jet.jecFactor(idx) at the corresponding indices 'idx' were made before and after the update in association to a varying count of dummy energy scaling calls of the format jet.scaleEnergy(1.)
  • In a similar fashion, the pt values from jet.correctedJet() were studied before and after the update after a varying count of dummy energy scaling calls of the format jet.scaleEnergy(1.)
  • The basic test procedure suggested in the CMSSW PR instructions has also been run through.

if this PR is a backport please specify the original PR and why you need to backport that PR:

  • This is not a backport, but the bug has been there from CMSSW_10_6_X onwards, so if possible, this patch should be backported there

… not correctly updated. The scale is added to the JEC collection onto the index 0, so this will always affect the correct index for the current JEC level. This commit provides a simple correction for the issue.
@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-36559/27503

@cmsbuild
Contributor

A new Pull Request was created by @errai- (Hannu Siikonen) for master.

It involves the following packages:

  • DataFormats/PatCandidates (reconstruction)

@jpata, @cmsbuild, @clacaputo, @slava77 can you please review it and eventually sign? Thanks.
@gpetruc, @gouskos, @rovere, @hatakeyamak this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@errai-
Author

errai- commented Dec 21, 2021

There are no edits to scaleEnergy in the smeared MET PR, so there are no direct conflicts - and an edit to scaleEnergy is necessary to fix the bug at hand in this thread. However, as the MET PR also makes changes to pat::Jet and jet smearing introduces scaleEnergy calls, it seems worthwhile for me to take a moment to check the changes introduced in this MET smearing PR more deeply.

@jpata
Contributor

jpata commented Dec 22, 2021

@cmsbuild please test

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-38ac16/21430/summary.html
COMMIT: 07867f1
CMSSW: CMSSW_12_3_X_2021-12-21-2300/slc7_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/36559/21430/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 43
  • DQMHistoTests: Total histograms compared: 3461692
  • DQMHistoTests: Total failures: 0
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3461670
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 42 files compared)
  • Checked 181 log files, 42 edm output root files, 43 DQM output files
  • TriggerResults: no differences found

@slava77
Contributor

slava77 commented Dec 22, 2021

  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total failures: 0

@errai- were some differences actually expected in the persisted data?
If so, then it looks like relevant values are not monitored; which values changed (perhaps it can be added).

@@ -249,6 +249,7 @@ void Jet::scaleEnergy(double fScale, const std::string& level) {
   if (jecSetsAvailable()) {
     std::vector<float> factors = {float(jec_[0].correction(0, JetCorrFactors::NONE) / fScale)};
     jec_[0].insertFactor(0, std::make_pair(level, factors));
+    ++currentJECLevel_;
Contributor

this method Jet::scaleEnergy was added in #27428 with an intention to fix calls like this https://github.com/cms-sw/cmssw/blob/CMSSW_12_1_0/DataFormats/PatCandidates/interface/Jet.h#L407-L410

Now that you are bringing up also possible repeated calls of scaleEnergy, I'm not sure that this implementation which always inserts a new factor is going to do the right thing.

Is the information about the multiple scales needed/required?

  • If not, then (somewhat unsafe, but still) assuming that only scaleEnergy inserts a level=0,flavor=NONE correction at a set 0, should this code instead of repeated insertion accumulate the product?
  • Alternatively, if the information about multiple scalings is important, then the call to chargedEmEnergyFraction() const {return chargedEmEnergy() / (jecFactor(0) * energy());} would be incomplete because jecFactor(0) only looks at set=0, while in this context it should loop over all sets and take the product of level=0,flavor=NONE over all available sets

@cms-sw/jetmet-pog-l2 @laurenhay @ahinzmann please check/clarify.

Author

If I understood the second question correctly, I might add a quick comment on it. On line 250 each new scale is added relative to the previous index zero scale. Cumulatively this indicates that jecFactor(0) * energy() always gives the jet energy where all JECs and all scaleEnergy scales are undone. The functions chargedEmEnergy() et al. refer to the jet energy composition before all jet energy corrections and scales, so this is sensible.

The first question is open for political discussion. In well-optimized code there should be no need for more than two calls to scaleEnergy, so there is not much difference. There might be some use-cases where both of the scales are needed separately. On the other hand, not so-well optimized code could abuse scaleEnergy calls.

Contributor

If I understood the second question correctly ...

Thanks for the clarification.
I missed somehow the accumulation of /fScale from the previous state.
So, I'd say that no changes are needed here. The first case above is not that essential.
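
The accumulation of the 1/fScale factors discussed in this thread can be written out as a short derivation. It is only a sketch under stated assumptions: c_0 denotes the factor stored at index 0 of set 0, c_cur the factor at the current level, jecFactor(0) is assumed to be the ratio c_0 / c_cur, and currentJECLevel_ is assumed to follow the shift so that c_cur stays the same. The symbols f_i, E and n are introduced here for illustration.

```latex
% One call scaleEnergy(f): a new index-0 factor is inserted and the energy is scaled.
c_0 \;\to\; \frac{c_0}{f}, \qquad E \;\to\; f\,E

% After n calls with scales f_1, \dots, f_n:
c_0^{(n)} = \frac{c_0}{\prod_{i=1}^{n} f_i}, \qquad
E^{(n)} = E \prod_{i=1}^{n} f_i

% The product used by the energy-fraction functions is therefore unchanged:
\mathrm{jecFactor}(0)\cdot E^{(n)}
  = \frac{c_0^{(n)}}{c_{\mathrm{cur}}}\, E^{(n)}
  = \frac{c_0}{c_{\mathrm{cur}}}\, E
```

The scales cancel term by term, which is why jecFactor(0) * energy() keeps returning the energy with all JECs and all scaleEnergy scales undone, provided the current-level index points at the right entry.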

@errai-
Author

errai- commented Dec 22, 2021

  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total failures: 0

@errai- were some differences actually expected in the persisted data? If so, then it looks like relevant values are not monitored; which values changed (perhaps it can be added).

The expected output of the tests was actually unknown territory. Because the bug was there since July 2019, it was expected that the tests from back then didn't catch the issue, but there could have been new test developments in-between.

Before the bugfix, each call to scaleEnergy() would offset the current index (currentJECLevel_) by one w.r.t. the vector indices stored in the member jec_. Generally, scaleEnergy is used at least in jet smearing in MC, and likely also if systematic energy variations are applied in MC. The trick is that in MC, the L3Absolute and L2L3Residual jet energy corrections are unitary dummies. Thanks to this, if the current (MC) JEC level was set to L3Absolute, the bug would only appear after two or more calls to scaleEnergy. Similarly, if the current MC JEC level was set to L2L3Residual, the bug would only appear after three or more calls to scaleEnergy. If the tests do not include involved scenarios such as systematic variations, they simply would not bring up the issue. It is most likely to come up in private (full) analyses with full-blown systematics studies.
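
The offset described above can be reproduced with a small standalone toy. This is a sketch and not the pat::Jet implementation: ToyJet, makeToyMCJet and rawEnergyAfterCalls are hypothetical names, the correction values 0.9 and 1.2 and the scale 1.1 are made up, the stored factors are assumed to be cumulative relative to the uncorrected jet, and jecFactor(level) is assumed to be the ratio of the factor at that level to the factor at the current level.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a jet with a single JEC set (hypothetical, for illustration only).
struct ToyJet {
  std::vector<std::pair<std::string, double>> jec;  // (level name, cumulative factor)
  std::size_t currentLevel = 0;                     // index of the level the energy is at
  double energy = 0.0;                              // energy at the current level

  // Factor that takes the current energy to `level`.
  double jecFactor(std::size_t level) const {
    assert(level < jec.size() && currentLevel < jec.size());
    return jec[level].second / jec[currentLevel].second;
  }

  // Scale the energy and insert the factor that undoes the scale at index 0.
  // With applyFix == false the current index is left untouched (pre-fix behaviour).
  void scaleEnergy(double scale, bool applyFix) {
    assert(!jec.empty() && scale != 0.0);
    const double undo = jec[0].second / scale;
    energy *= scale;
    jec.insert(jec.begin(), std::make_pair(std::string("Unscaled"), undo));
    if (applyFix)
      ++currentLevel;
  }
};

// MC-like jet corrected up to L3Absolute, where L3Absolute is a unit factor.
inline ToyJet makeToyMCJet() {
  const double raw = 100.0, l1 = 0.9, l2 = 1.2, l3 = 1.0;
  ToyJet jet;
  jet.jec.push_back(std::make_pair(std::string("Uncorrected"), 1.0));
  jet.jec.push_back(std::make_pair(std::string("L1FastJet"), l1));
  jet.jec.push_back(std::make_pair(std::string("L2Relative"), l1 * l2));
  jet.jec.push_back(std::make_pair(std::string("L3Absolute"), l1 * l2 * l3));
  jet.currentLevel = 3;
  jet.energy = raw * l1 * l2 * l3;
  return jet;
}

// Energy with all corrections and scales undone, as seen through jecFactor(0).
inline double rawEnergyAfterCalls(int nCalls, bool applyFix) {
  ToyJet jet = makeToyMCJet();
  for (int i = 0; i < nCalls; ++i)
    jet.scaleEnergy(1.1, applyFix);
  return jet.jecFactor(0) * jet.energy;
}
```

In this toy the uncorrected energy is 100. With the index update it is recovered for any number of calls. Without it, one call is still harmless because the index lands on L2Relative, which has the same cumulative factor as the unit L3Absolute; after two calls the index lands on L1FastJet and the result becomes 120, i.e. the L2Relative factor of 1.2 is counted once too often.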

@slava77
Contributor

slava77 commented Dec 22, 2021

If the tests do not include involved scenarios such as systematic variations, they simply would not bring up the issue. It is most likely to come up in private (full) analyses with full-blown systematics studies.

tests include standard miniAOD production for data (run2 so far has miniAOD, around 1K events over several workflows) and MC (run2 and run3, but fewer events per workflow).
Is the scaleEnergy called in the production of the standard miniAOD jets?

@errai-
Author

errai- commented Dec 22, 2021

If the tests do not include involved scenarios such as systematic variations, they simply would not bring up the issue. It is most likely to come up in private (full) analyses with full-blown systematics studies.

tests include standard miniAOD production for data (run2 so far has miniAOD, around 1K events over several workflows) and MC (run2 and run3, but fewer events per workflow). Is the scaleEnergy called in the production of the standard miniAOD jets?

Thanks for the clarification! In my understanding, the generic miniAOD workflow shouldn't be calling scaleEnergy, as it only produces the generic slimmedJets collection. Analyzers are expected to apply jet smearing and jet energy variations themselves, either manually or via central tools. Both are most easily, and most likely, done with the scaleEnergy function. Thus, the issues materialize only late in the analysis chain. On a slightly related note: I'm not sure how and what is done in the nanoAOD workflow, and there were mentions that also this should be checked.

@slava77
Contributor

slava77 commented Dec 22, 2021

On a slightly related note: I'm not sure how and what is done in the nanoAOD workflow, and there were mentions that also this should be checked.

there are nanoAOD workflows in the tests as well, and at least the reco monitoring of the nanoAOD products is a bit more inclusive (all values in a list of hardcoded tables are used for comparisons).

@slava77
Contributor

slava77 commented Dec 22, 2021

Before signing, I'd still like to understand if the standard miniAOD/nanoAOD outputs are affected.
IIUC, some of the scaling may be present via MET uncertainties/corrections. The usual pat METs are monitored including the values stored in the corrections and uncertainties. But no differences are showing up.

Perhaps as a brute force check it's worthwhile to add a temporary commit that throws an exception in pat::Jet::scaleEnergy and rerun the PR tests.
@errai- may I ask you to do it locally with runTheMatrix.py -l limited -i all --ibeos
Please let me know.
Thank you.

@errai-
Author

errai- commented Dec 23, 2021

Before signing, I'd still like to understand if the standard miniAOD/nanoAOD outputs are affected. IIUC, some of the scaling may be present via MET uncertainties/corrections. The usual pat METs are monitored including the values stored in the corrections and uncertainties. But no differences are showing up.

Perhaps as a brute force check it's worthwhile to add a temporary commit that throws an exception in pat::Jet::scaleEnergy and rerun the PR tests. @errai- may I ask you to do it locally with runTheMatrix.py -l limited -i all --ibeos Please let me know. Thank you.

@slava77 I have already run this command before submitting the PR on lxplus within the CMSSW environment that contains the patch (as it was a part of the suggested standard tests). There were no errors back then. Do we expect different results if I run this command again? Or would I need to perform some other modifications and then run the command again?

@perrotta
Contributor

When running tests multithreaded on miniAOD for wf 28234.0 (TTbar 14 TeV 2026D60) in PR #36568 there is a hint of some non-reproducibility of the MET corrections:
image

I don't think it has anything to do with this proposed fix of the Jet energy corrections. But perhaps in this thread there is the expertise to evaluate it.

Let me try to re-run threaded tests here, to see if it can reproduce in another PR.

@perrotta
Contributor

enable threading

@perrotta
Contributor

please test

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-38ac16/21452/summary.html
COMMIT: 07867f1
CMSSW: CMSSW_12_3_X_2021-12-22-1100/slc7_amd64_gcc10
Additional Tests: THREADING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/36559/21452/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 6 differences found in the comparisons
  • DQMHistoTests: Total files compared: 43
  • DQMHistoTests: Total histograms compared: 3461692
  • DQMHistoTests: Total failures: 10
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 3461659
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.004 KiB( 42 files compared)
  • DQMHistoSizes: changed ( 312.0 ): 0.004 KiB MessageLogger/Warnings
  • Checked 181 log files, 42 edm output root files, 43 DQM output files
  • TriggerResults: no differences found

@errai-
Author

errai- commented Dec 27, 2021

Perhaps as a brute force check it's worthwhile to add a temporary commit that throws an exception in pat::Jet::scaleEnergy and rerun the PR tests. @errai- may I ask you to do it locally with runTheMatrix.py -l limited -i all --ibeos Please let me know. Thank you.

@slava77 I have already run this command before submitting the PR on lxplus within the CMSSW environment that contains the patch (as it was a part of the suggested standard tests). There were no errors back then. Do we expect different results if I run this command again? Or would I need to perform some other modifications and then run the command again?

I was asking to make the check in the context of a temporary commit that throws an exception in pat::Jet::scaleEnergy

@slava77 my bad for responding in a rush. If the info is still needed, the run with a scaleEnergy error thrown ended up with

exit: 0 0 0 16640 0
42 36 11 6 5 1 1 1 1 1 tests passed, 0 5 24 2 0 0 0 0 0 0 failed

so failures were indeed produced. I made a second test with a counter (originally initialized to zero) condition:

if (++countEx_ >= 2) throw cms::Exception("ScaleEnergy") << "This was visited!";

and now the printout was

exit: 0 0 0 0 0
42 41 40 30 18 4 1 1 1 1 tests passed, 0 0 0 0 0 0 0 0 0 0 failed

That is, there seems to be at most one scaleEnergy call - perhaps from jet smearing - in these tests. As long as the call was made for MC only, there is no reason to expect changes. If scaleEnergy was for some reason called for data, this would immediately cause problems with L2L3Res, but such calls seem unlikely. Hence the hypothesis stands: the scaleEnergy bug most likely materializes in private analyses performing e.g. jet energy scale variations. In the central workflow, it luckily seems to remain "silent".
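
The counter-based probe used for the second test can be sketched outside CMSSW as follows. This is an illustration under assumptions, not the code that was run: ScaleCallProbe and tripsProbe are hypothetical names, std::runtime_error stands in for cms::Exception so that the snippet compiles standalone, and the counter is assumed to be kept per object.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical probe: throws once a single object has been visited `limit` times.
class ScaleCallProbe {
public:
  explicit ScaleCallProbe(int limit) : limit_(limit) { assert(limit > 0); }

  // Meant to be called at the top of the function under study.
  void visit() {
    if (++count_ >= limit_)
      throw std::runtime_error("ScaleEnergy: This was visited!");
  }

  int count() const { return count_; }

private:
  int limit_;
  int count_ = 0;
};

// True if `nCalls` consecutive visits trip a probe with the given limit.
inline bool tripsProbe(int nCalls, int limit) {
  ScaleCallProbe probe(limit);
  try {
    for (int i = 0; i < nCalls; ++i)
      probe.visit();
  } catch (const std::runtime_error&) {
    return true;
  }
  return false;
}
```

With a limit of 2, a single visit passes silently and the second one throws, which matches the reading above: workflows that never fail with this probe call the function at most once per object.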

@slava77
Contributor

slava77 commented Dec 28, 2021

the run with a scaleEnergy error thrown ended up with

exit: 0 0 0 16640 0
42 36 11 6 5 1 1 1 1 1 tests passed, 0 5 24 2 0 0 0 0 0 0 failed

so failures were indeed produced.

Thank you for checking explicitly.
I would still like to understand which modules/computation is affected, to see if the related variables are monitored or not.

@slava77
Contributor

slava77 commented Jan 6, 2022

I would still like to understand which modules/computation is affected, to see if the related variables are monitored or not.

@errai-
sorry for not being explicit; this was in part a question to you: which variables are relying on this scaleEnergy?

@errai-
Author

errai- commented Jan 7, 2022

I would still like to understand which modules/computation is affected, to see if the related variables are monitored or not.

@errai- sorry for not being explicit; this was in part a question to you: which variables are relying on this scaleEnergy?

Within the pat::Jet class, scaleEnergy modifies std::vector<pat::JetCorrFactors> jec_ and the current p4 through the function setP4, inherited from LeafCandidate (ParticleState m_state). Moreover, it should modify currentJECLevel_, as was proposed in this fix.

The new scale is added to the JetCorrFactors object at index zero of the vector jec_ (within class pat::Jet). Here, the index zero refers to the jecSet index (not to be confused with jecLevel and currentJECLevel_). It is possible that instead of zero, one should use the index currentJECSet_. Typically there is only one set, and this does not make a difference. However, the intricacies of currentJECSet_ or 0 usage are beyond my understanding.

Within the index-zero JetCorrFactors, all the scales reside within (class JetCorrFactors) std::vector jec_. The factor for undoing the new scale is inserted at the beginning of this vector, thus increasing the index of the current factor by one. The pat::Jet index currentJECLevel_ refers to the position within this vector.

@slava77 let me know if this answers your question. The indexing and vectoring is split between the pat::Jet and pat::JetCorrFactors classes, adding a bit of complexity. I'd guess that this is a result of making small increments to the classes over the years, and avoiding refactoring to preserve full backwards compatibility.

@slava77
Contributor

slava77 commented Jan 7, 2022

@slava77 let me know if this answers your question. The indexing and vectoring is split between the pat::Jet and pat::JetCorrFactors classes, adding a bit of complexity. I'd guess that this is a result of making small increments to the classes over the years, and avoiding refactoring to preserve full backwards compatibility.

@errai-
thank you for the detailed accounting of the affected data members.
What remains unclear is which collections in miniAOD are affected. Is it applied to the regular slimmedJets?

@errai-
Author

errai- commented Jan 7, 2022

@slava77 let me know if this answers your question. The indexing and vectoring is split between the pat::Jet and pat::JetCorrFactors classes, adding a bit of complexity. I'd guess that this is a result of making small increments to the classes over the years, and avoiding refactoring to preserve full backwards compatibility.

@errai- thank you for the detailed accounting of the affected data members. What remains unclear is which collections in miniAOD are affected. Is it applied to the regular slimmedJets?

@slava77 in my understanding slimmedJets should not be affected. Jet smearing is recommended practically for all MC samples, but it should only be run by the end user. The same goes for JEC systematics variations. It's a mystery to me, where the scaleEnergy call is produced.

@ahinzmann
Contributor

ahinzmann commented Jan 7, 2022

As far as I remember, there is only one occasion where scaleEnergy is used in official production, namely in the MET uncertainties stored in MiniAOD+NanoAOD: one of them is JER smearing, another JES. As this calls scaleEnergy only once, the bug may not affect MiniAOD+NanoAOD content.
"Smear" appears in many places here:
https://github.com/cms-sw/cmssw/blob/master/PhysicsTools/PatUtils/python/tools/runMETCorrectionsAndUncertainties.py#L1128

@slava77
Contributor

slava77 commented Jan 7, 2022

@cmsbuild please test with cms-sw/cms-bot#1685

@cmsbuild
Contributor

cmsbuild commented Jan 7, 2022

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-38ac16/21564/summary.html
COMMIT: 07867f1
CMSSW: CMSSW_12_3_X_2022-01-07-1100/slc7_amd64_gcc10
Additional Tests: THREADING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/36559/21564/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 2 differences found in the comparisons
  • DQMHistoTests: Total files compared: 43
  • DQMHistoTests: Total histograms compared: 3461659
  • DQMHistoTests: Total failures: 5
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 3461631
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.004 KiB( 42 files compared)
  • DQMHistoSizes: changed ( 312.0 ): 0.004 KiB MessageLogger/Warnings
  • Checked 181 log files, 42 edm output root files, 43 DQM output files
  • TriggerResults: no differences found

@slava77
Contributor

slava77 commented Jan 8, 2022

+reconstruction

for #36559 07867f1

Based on this, it seems safe to also backport this feature, but I'd leave it to @cms-sw/jetmet-pog-l2 to confirm that this has to be backported.

@cmsbuild
Contributor

cmsbuild commented Jan 8, 2022

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@perrotta
Contributor

perrotta commented Jan 8, 2022

+1

@cmsbuild cmsbuild merged commit 7235ebc into cms-sw:master Jan 8, 2022
6 participants