Index error in L1Trigger/L1TTrackMatch/L1TrackJetEmulatorProducer #43723

Closed
dan131riley opened this issue Jan 16, 2024 · 15 comments

Comments

@dan131riley

this line

int j = eta_bin_firmwareStyle(L1TrkPtrs_[k]->getTanlWord()); //Function defined in L1TrackJetClustering.h

is occasionally producing a bin index of -1, which is subsequently used to index into the stack-allocated epbins array, overwriting the preceding stack allocations. On very rare occasions this results in a segfault, but the out-of-bounds access itself happens (without a segfault) fairly often for some workflows. The way to see this is to add

      assert(i < phiBins_ && i >= 0 && j < etaBins_ && j >= 0);

at line 300, and run wf 23234.0, 24834.0, or 25034.999, which should frequently give an assertion failure in step2 or step3. ASAN and UBSAN don't catch this, but valgrind memcheck does:

==113056== Thread 9:
==113056== Conditional jump or move depends on uninitialised value(s)
==113056==    at 0x1503E3C41: ap_fixed_base<15, 10, false, (ap_q_mode)5, (ap_o_mode)3, 0>& ap_fixed_base<15, 10, false, (ap_q_mode)5, (ap_o_mode)3, 0>::operator=<14, 9, false, (ap_q_mode)5, (ap_o_mode)0, 0>(ap_fixed_base<14, 9, false, (ap_q_mode)5, (ap_o_mode)0, 0> const&) (/cvmfs/cms.cern.ch/el9_amd64_gcc12/external/hls/2019.08-8afb4083e7b06154cf0bca6d787b688f/include/ap_fixed_base.h:665)
==113056==    by 0x15037286B: ap_fixed_base<15, 10, false, (ap_q_mode)5, (ap_o_mode)3, 0>::ap_fixed_base<14, 9, false, (ap_q_mode)5, (ap_o_mode)0, 0>(ap_fixed_base<14, 9, false, (ap_q_mode)5, (ap_o_mode)0, 0> const&) (/cvmfs/cms.cern.ch/el9_amd64_gcc12/external/hls/2019.08-8afb4083e7b06154cf0bca6d787b688f/include/ap_fixed_base.h:396)
==113056==    by 0x1503DCB9B: ap_fixed_base<14, 9, false, (ap_q_mode)5, (ap_o_mode)0, 0>::RType<14, 9, false>::plus ap_fixed_base<14, 9, false, (ap_q_mode)5, (ap_o_mode)0, 0>::operator+<14, 9, false, (ap_q_mode)5, (ap_o_mode)0, 0>(ap_fixed_base<14, 9, false, (ap_q_mode)5, (ap_o_mode)0, 0> const&) const (/cvmfs/cms.cern.ch/el9_amd64_gcc12/external/hls/2019.08-8afb4083e7b06154cf0bca6d787b688f/include/ap_fixed_base.h:1163)
==113056==    by 0x1503D77A3: ap_fixed_base<14, 9, false, (ap_q_mode)5, (ap_o_mode)0, 0>& ap_fixed_base<14, 9, false, (ap_q_mode)5, (ap_o_mode)0, 0>::operator+=<14, 9, false, (ap_q_mode)5, (ap_o_mode)0, 0>(ap_fixed_base<14, 9, false, (ap_q_mode)5, (ap_o_mode)0, 0> const&) (/cvmfs/cms.cern.ch/el9_amd64_gcc12/external/hls/2019.08-8afb4083e7b06154cf0bca6d787b688f/include/ap_fixed_base.h:1183)
==113056==    by 0x1503D4203: L1TrackJetEmulatorProducer::produce(edm::Event&, edm::EventSetup const&) (L1Trigger/L1TTrackMatch/plugins/L1TrackJetEmulatorProducer.cc:301)
==113056==  Uninitialised value was created by a stack allocation
==113056==    at 0x1503D777C: ap_fixed_base<14, 9, false, (ap_q_mode)5, (ap_o_mode)0, 0>& ap_fixed_base<14, 9, false, (ap_q_mode)5, (ap_o_mode)0, 0>::operator+=<14, 9, false, (ap_q_mode)5, (ap_o_mode)0, 0>(ap_fixed_base<14, 9, false, (ap_q_mode)5, (ap_o_mode)0, 0> const&) (/cvmfs/cms.cern.ch/el9_amd64_gcc12/external/hls/2019.08-8afb4083e7b06154cf0bca6d787b688f/include/ap_fixed_base.h:1183)
==113056== 
==113056== Conditional jump or move depends on uninitialised value(s)
==113056==    at 0x1503DCCA3: ap_fixed_base<14, 9, false, (ap_q_mode)5, (ap_o_mode)0, 0>& ap_fixed_base<14, 9, false, (ap_q_mode)5, (ap_o_mode)0, 0>::operator=<15, 10, false, (ap_q_mode)5, (ap_o_mode)3, 0>(ap_fixed_base<15, 10, false, (ap_q_mode)5, (ap_o_mode)3, 0> const&) (/cvmfs/cms.cern.ch/el9_amd64_gcc12/external/hls/2019.08-8afb4083e7b06154cf0bca6d787b688f/include/ap_fixed_base.h:665)
==113056==    by 0x1503D77B6: ap_fixed_base<14, 9, false, (ap_q_mode)5, (ap_o_mode)0, 0>& ap_fixed_base<14, 9, false, (ap_q_mode)5, (ap_o_mode)0, 0>::operator+=<14, 9, false, (ap_q_mode)5, (ap_o_mode)0, 0>(ap_fixed_base<14, 9, false, (ap_q_mode)5, (ap_o_mode)0, 0> const&) (/cvmfs/cms.cern.ch/el9_amd64_gcc12/external/hls/2019.08-8afb4083e7b06154cf0bca6d787b688f/include/ap_fixed_base.h:1183)
==113056==    by 0x1503D4203: L1TrackJetEmulatorProducer::produce(edm::Event&, edm::EventSetup const&) (L1Trigger/L1TTrackMatch/plugins/L1TrackJetEmulatorProducer.cc:301)
==113056==  Uninitialised value was created by a stack allocation
==113056==    at 0x1503D777C: ap_fixed_base<14, 9, false, (ap_q_mode)5, (ap_o_mode)0, 0>& ap_fixed_base<14, 9, false, (ap_q_mode)5, (ap_o_mode)0, 0>::operator+=<14, 9, false, (ap_q_mode)5, (ap_o_mode)0, 0>(ap_fixed_base<14, 9, false, (ap_q_mode)5, (ap_o_mode)0, 0> const&) (/cvmfs/cms.cern.ch/el9_amd64_gcc12/external/hls/2019.08-8afb4083e7b06154cf0bca6d787b688f/include/ap_fixed_base.h:1183)
==113056== 
==113056== Conditional jump or move depends on uninitialised value(s)
==113056==    at 0x1503DCD86: ap_fixed_base<14, 9, false, (ap_q_mode)5, (ap_o_mode)0, 0>& ap_fixed_base<14, 9, false, (ap_q_mode)5, (ap_o_mode)0, 0>::operator=<15, 10, false, (ap_q_mode)5, (ap_o_mode)3, 0>(ap_fixed_base<15, 10, false, (ap_q_mode)5, (ap_o_mode)3, 0> const&) (/cvmfs/cms.cern.ch/el9_amd64_gcc12/external/hls/2019.08-8afb4083e7b06154cf0bca6d787b688f/include/ap_fixed_base.h:745)
==113056==    by 0x1503D77B6: ap_fixed_base<14, 9, false, (ap_q_mode)5, (ap_o_mode)0, 0>& ap_fixed_base<14, 9, false, (ap_q_mode)5, (ap_o_mode)0, 0>::operator+=<14, 9, false, (ap_q_mode)5, (ap_o_mode)0, 0>(ap_fixed_base<14, 9, false, (ap_q_mode)5, (ap_o_mode)0, 0> const&) (/cvmfs/cms.cern.ch/el9_amd64_gcc12/external/hls/2019.08-8afb4083e7b06154cf0bca6d787b688f/include/ap_fixed_base.h:1183)
==113056==    by 0x1503D4203: L1TrackJetEmulatorProducer::produce(edm::Event&, edm::EventSetup const&) (L1Trigger/L1TTrackMatch/plugins/L1TrackJetEmulatorProducer.cc:301)
==113056==  Uninitialised value was created by a stack allocation
==113056==    at 0x1503D777C: ap_fixed_base<14, 9, false, (ap_q_mode)5, (ap_o_mode)0, 0>& ap_fixed_base<14, 9, false, (ap_q_mode)5, (ap_o_mode)0, 0>::operator+=<14, 9, false, (ap_q_mode)5, (ap_o_mode)0, 0>(ap_fixed_base<14, 9, false, (ap_q_mode)5, (ap_o_mode)0, 0> const&) (/cvmfs/cms.cern.ch/el9_amd64_gcc12/external/hls/2019.08-8afb4083e7b06154cf0bca6d787b688f/include/ap_fixed_base.h:1183)
==113056== 
==113056== Invalid read of size 1
==113056==    at 0x1503E0F13: ap_private<5, false, true>& ap_private<5, false, true>::operator+=<1, false>(ap_private<1, false, (1)<=(64)> const&) (/cvmfs/cms.cern.ch/el9_amd64_gcc12/external/hls/2019.08-8afb4083e7b06154cf0bca6d787b688f/include/etc/ap_private.h:1904)
==113056==    by 0x1503DD0EA: ap_int_base<5, false>& ap_int_base<5, false>::operator+=<1, false>(ap_int_base<1, false> const&) (/cvmfs/cms.cern.ch/el9_amd64_gcc12/external/hls/2019.08-8afb4083e7b06154cf0bca6d787b688f/include/ap_int_base.h:667)
==113056==    by 0x1503D7964: ap_int_base<5, false>::operator++() (/cvmfs/cms.cern.ch/el9_amd64_gcc12/external/hls/2019.08-8afb4083e7b06154cf0bca6d787b688f/include/ap_int_base.h:693)
==113056==    by 0x1503D445A: L1TrackJetEmulatorProducer::produce(edm::Event&, edm::EventSetup const&) (L1Trigger/L1TTrackMatch/plugins/L1TrackJetEmulatorProducer.cc:313)

In all the cases I've checked, L1TrkPtrs_[k]->getTanlWord() at the failure has the value

$1 = {
  <ap_int_base<16, false>> = {
    <ssdm_int_sim<16, false>> = {
      V = {
        static mask = 65535,
        VAL = 45824
      }
    }, <No data fields>}, <No data fields>}

and working through eta_bin_firmwareStyle() confirms this gives a -1 index.
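As an illustration of the kind of guard the assert above checks for, here is a minimal sketch of bounds-checked accumulation into a 2D bin grid. This is not the code from L1TrackJetEmulatorProducer and not necessarily how the eventual fix works: `BinGrid`, `inRange`, `cellAt`, `accumulate`, and the grid dimensions are all hypothetical names and values chosen for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical grid dimensions, chosen only for this example.
constexpr int kPhiBins = 4;
constexpr int kEtaBins = 6;

// Hypothetical stand-in for a stack-allocated grid of per-bin sums.
struct BinGrid {
  std::array<std::array<int, kEtaBins>, kPhiBins> pt{};  // zero-initialised sums
  int skipped = 0;                                       // entries rejected by the guard
};

// True when (i, j) addresses a valid cell of the grid.
constexpr bool inRange(int i, int j) {
  return i >= 0 && i < kPhiBins && j >= 0 && j < kEtaBins;
}

// Debug-checked cell access, using the same condition as the assert in the issue.
inline int& cellAt(BinGrid& grid, int i, int j) {
  assert(inRange(i, j));
  return grid.pt[static_cast<std::size_t>(i)][static_cast<std::size_t>(j)];
}

// Adds pt to cell (i, j); an out-of-range index (such as j == -1) is counted
// and skipped instead of being written outside the array.
inline bool accumulate(BinGrid& grid, int i, int j, int pt) {
  if (!inRange(i, j)) {
    ++grid.skipped;
    return false;
  }
  cellAt(grid, i, j) += pt;
  return true;
}
```

With a guard like this, a call such as `accumulate(grid, 0, -1, 5)` returns false and leaves the grid untouched. Whether out-of-range tracks should be skipped, clamped to an edge bin, or treated as an error is a physics decision that the sketch does not settle.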

@cmsbuild
Contributor

cmsbuild commented Jan 16, 2024

cms-bot internal usage

@cmsbuild
Contributor

A new Issue was created by @dan131riley Dan Riley.

@antoniovilela, @makortel, @smuzaffar, @Dr15Jones, @sextonkennedy, @rappoccio can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Contributor

assign l1

@cmsbuild
Contributor

New categories assigned: l1

@epalencia,@aloeliger you have been requested to review this Pull request/Issue and eventually sign? Thanks

@srimanob
Contributor

Just a note from #43735 (closed, with follow-up in this issue): it is unclear to me why the failure rates differ between very similar geometry versions. For example, I ran 20 jobs of ttbar nopu, 500 events per job with the same GEN events, for both 24834.0 and 25234.0. I don't see any failed jobs from 24834.0 (D98), but 4 jobs failed from 25234.0 (D99). In any case, it should be fixed before the Phase-2 production starts.

@srimanob
Contributor

@epalencia @aloeliger @BenjaminRS
Anything on L1T side to fix this issue? Thanks very much.

@aloeliger
Contributor

@BenjaminRS Do you know the track group responsible for this producer?

@BenjaminRS
Contributor

@SClarkPhysics - we have an issue with the line mentioned above. I see you added this particular line in this commit. Can you have a look into this problem as soon as you can please?
Thanks,
Benjamin

@srimanob
Contributor

srimanob commented Feb 1, 2024

To reproduce the error with CMSSW_14_0_0_pre2:

cmsDriver.py step2 -s DIGI:pdigi_valid,L1TrackTrigger,L1,DIGI2RAW,HLT:@relval2026 --conditions auto:phase2_realistic_T25 --datatier GEN-SIM-DIGI-RAW -n -1 --eventcontent FEVTDEBUGHLT --geometry Extended2026D98 --era Phase2C17I13M9 --python step2_D98_DIGIHLT_test.py --no_exec --filein file:step1_D98.root --fileout file:step2_D98.root --nThreads 1 --customise SLHCUpgradeSimulations/Configuration/aging.customise_aging_1000 --customise_commands "process.source.firstEvent = cms.untracked.uint32(200)"

with input file: /eos/cms/store/group/offcomp_upgrade-sw/srimanob/D99/HGCAL-DEBUG-3/GENSIM_D98_20072.root

I see the crash at event 209.

@skinnari
Contributor

skinnari commented Feb 1, 2024

Can maybe @NJManganelli help look at this from the GTT side?

@ccahoughton
Contributor

We are working on a fix. Should this be pushed to master? @srimanob

@srimanob
Contributor

srimanob commented Feb 2, 2024

> We are working on a fix. Should this be pushed to master? @srimanob

Hi @ccahoughton, please. Thanks very much.

@srimanob
Contributor

srimanob commented Feb 3, 2024

Fix is in #43852

Tested in a private production and the issue is gone. I can produce the complete sample with no crash (using 1000 events/lumi).

@smuzaffar
Contributor

Looks like this was fixed by #43852 .... @dan131riley can we close this issue?

@dan131riley
Author

> Looks like this was fixed by #43852 .... @dan131riley can we close this issue?

ok to close
