[GPU] Workflow failures when running the alpaka customization in presence of a Fake menu #44119

Closed
mmusich opened this issue Feb 27, 2024 · 30 comments · Fixed by #44221

Comments

@mmusich
Contributor

mmusich commented Feb 27, 2024

Several workflows {12434,12450}.{402,403,404,412} fail in the GPU IB tests in CMSSW_14_1_GPU_X_2024-02-26-2300 with:

DIGI:pdigi_valid,L1,DIGI2RAW,HLT:@relval2023,ENDJOB
We have determined that this is simulation (if not, rerun cmsDriver.py with --data)
with DB:
entry filelist:step1_dasquery.log
found files:  ['/store/relval/CMSSW_13_0_10/RelValTTbar_14TeV/GEN-SIM/130X_mcRun3_2023_realistic_withEarly2023BS_v1_2023-v1/2590000/4a9c4099-1812-4afd-9c94-6f9409595929.root', '/store/relval/CMSSW_13_0_10/RelValTTbar_14TeV/GEN-SIM/130X_mcRun3_2023_realistic_withEarly2023BS_v1_2023-v1/2590000/99db1b20-ec34-4bff-84df-dfffcbdfb184.root', '/store/relval/CMSSW_13_0_10/RelValTTbar_14TeV/GEN-SIM/130X_mcRun3_2023_realistic_withEarly2023BS_v1_2023-v1/2590000/c388e800-ddaa-408d-a2ec-b40a9b8c7a08.root']
Step: DIGI Spec: ['pdigi_valid']
Step: L1 Spec: 
Step: DIGI2RAW Spec: 
Step: HLT Spec: ['@relval2023']
Traceback (most recent call last):
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02826/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/bin/el8_amd64_gcc12/cmsDriver.py", line 40, in <module>
    run()
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02826/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/bin/el8_amd64_gcc12/cmsDriver.py", line 16, in run
    configBuilder.prepare()
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02826/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/src/Configuration/Applications/python/ConfigBuilder.py", line 2310, in prepare
    self.addStandardSequences()
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02826/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/src/Configuration/Applications/python/ConfigBuilder.py", line 850, in addStandardSequences
    getattr(self,"prepare_"+stepName)(stepSpec = '+'.join(stepSpec))
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02826/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/src/Configuration/Applications/python/ConfigBuilder.py", line 1670, in prepare_HLT
    self.loadAndRemember('HLTrigger/Configuration/HLT_%s_cff' % stepSpec)
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02826/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/src/Configuration/Applications/python/ConfigBuilder.py", line 376, in loadAndRemember
    self.process.load(includeFile)
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02826/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/src/FWCore/ParameterSet/python/Config.py", line 761, in load
    module = __import__(moduleName)
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02826/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/src/HLTrigger/Configuration/python/HLT_Fake2_cff.py", line 237, in <module>
    fragment = customizeHLTforCMSSW(fragment,"Fake2")
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02826/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/src/HLTrigger/Configuration/python/customizeHLTforCMSSW.py", line 262, in customizeHLTforCMSSW
    (alpaka & run3_common).makeProcessModifier(customizeHLTforAlpaka).apply(process)
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02826/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/src/FWCore/ParameterSet/python/Config.py", line 1980, in apply
    self.__func(process)
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02826/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/src/HLTrigger/Configuration/python/customizeHLTforAlpaka.py", line 917, in customizeHLTforAlpaka
    process = customizeHLTforAlpakaEcalLocalReco(process)
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02826/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/src/HLTrigger/Configuration/python/customizeHLTforAlpaka.py", line 908, in customizeHLTforAlpakaEcalLocalReco
    process.HLTDoFullUnpackingEgammaEcalTask = cms.ConditionalTask(process.HLTDoFullUnpackingEgammaEcalWithoutPreshowerTask, process.HLTPreshowerTask)
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02826/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/src/FWCore/ParameterSet/python/Config.py", line 1656, in __getattribute__
    return getattr(self.__process, name)
AttributeError: 'Process' object has no attribute 'HLTDoFullUnpackingEgammaEcalWithoutPreshowerTask'

this likely comes from the integration of #44026 that moved @relval2023 to @Fake2.
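
For readers without a CMSSW area at hand, the failure mode can be reproduced in plain Python. This is only a sketch: `fake_process`, `placeholderModule` and `customize_ecal_unguarded` are hypothetical stand-ins, a `SimpleNamespace` replaces the real `Process` object, and a tuple replaces `cms.ConditionalTask`. It assumes the Fake menu simply does not define the ECAL unpacking tasks.

```python
from types import SimpleNamespace

# Stand-in for a process built from a Fake menu: it defines a few objects
# but none of the ECAL unpacking tasks. (Hypothetical; the real object is
# the Process class from FWCore/ParameterSet/python/Config.py.)
fake_process = SimpleNamespace(placeholderModule="placeholder")


def customize_ecal_unguarded(process):
    # Mirrors the failing pattern: both attributes are read unconditionally.
    # A tuple stands in for cms.ConditionalTask(...) in this sketch.
    process.HLTDoFullUnpackingEgammaEcalTask = (
        process.HLTDoFullUnpackingEgammaEcalWithoutPreshowerTask,
        process.HLTPreshowerTask,
    )
    return process


try:
    customize_ecal_unguarded(fake_process)
    error_message = None
except AttributeError as exc:
    # Same exception type as in the traceback above.
    error_message = str(exc)

print(error_message)
```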

@mmusich
Contributor Author

mmusich commented Feb 27, 2024

assign hlt, heterogeneous

@cmsbuild
Contributor

New categories assigned: hlt,heterogeneous

@Martin-Grunewald,@mmusich,@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Contributor

cmsbuild commented Feb 27, 2024

cms-bot internal usage

@cmsbuild
Contributor

A new Issue was created by @mmusich.

@smuzaffar, @antoniovilela, @Dr15Jones, @makortel, @rappoccio, @sextonkennedy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@mmusich
Contributor Author

mmusich commented Feb 27, 2024

@thomreis FYI

@Martin-Grunewald
Contributor

The customisation should check whether HLTDoFullUnpackingEgammaEcalWithoutPreshowerTask actually exists, before messing with it.

@thomreis
Contributor

What menu was used for this?

@mmusich
Contributor Author

mmusich commented Feb 27, 2024

What menu was used for this?

The Fake one; see above. In any case it does not matter. Please provide a fix, since the customization needs to run regardless of the menu.

@Martin-Grunewald
Contributor

Martin-Grunewald commented Feb 27, 2024

Hmm, alternatively, it may be best to remove alpaka from these (failing) 2023 (HLT) workflows (as those are now using the Fake menus). Testing alpaka on Fake HLT menus does not make much sense!

@mmusich
Contributor Author

mmusich commented Feb 27, 2024

Hmm, alternatively, it may be best to remove alpaka from these (failing) 2023 (HLT) workflows (as those are now using the Fake menus). Testing alpaka on Fake HLT menus does not make much sense!

this is what PR #44075 is going to do. On the other hand the customization should not break in any circumstance IMHO.

@mmusich
Contributor Author

mmusich commented Feb 27, 2024

On the other hand the customization should not break in any circumstance IMHO.

in order to achieve that, though, all the other customization pieces also need to comply; perhaps it is better to remove all years with the fake menu from the alpaka customization

@mmusich
Contributor Author

mmusich commented Feb 27, 2024

assign pdmv

@cmsbuild
Contributor

New categories assigned: pdmv

@AdrianoDee,@sunilUIET,@miquork you have been requested to review this Pull request/Issue and eventually sign? Thanks

@thomreis
Contributor

Would adding a condition to this line fix this?

if hasattr(process, 'HLTDoFullUnpackingEgammaEcalWithoutPreshowerTask') and hasattr(process, 'HLTPreshowerTask'):
    process.HLTDoFullUnpackingEgammaEcalTask = cms.ConditionalTask(process.HLTDoFullUnpackingEgammaEcalWithoutPreshowerTask, process.HLTPreshowerTask)

@Martin-Grunewald
Contributor

This error, yes, I think so.

@mmusich
Contributor Author

mmusich commented Feb 27, 2024

Would adding a condition to this line fix this?

It does, but then it fails with:

DIGI:pdigi_valid,L1,DIGI2RAW,HLT:@relval2023,ENDJOB
We have determined that this is simulation (if not, rerun cmsDriver.py with --data)
with DB:
entry file:step1.root
Step: DIGI Spec: ['pdigi_valid']
Step: L1 Spec: 
Step: DIGI2RAW Spec: 
Step: HLT Spec: ['@relval2023']
Traceback (most recent call last):
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/bin/el8_amd64_gcc12/cmsDriver.py", line 40, in <module>
    run()
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/bin/el8_amd64_gcc12/cmsDriver.py", line 16, in run
    configBuilder.prepare()
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/src/Configuration/Applications/python/ConfigBuilder.py", line 2310, in prepare
    self.addStandardSequences()
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/src/Configuration/Applications/python/ConfigBuilder.py", line 850, in addStandardSequences
    getattr(self,"prepare_"+stepName)(stepSpec = '+'.join(stepSpec))
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/src/Configuration/Applications/python/ConfigBuilder.py", line 1670, in prepare_HLT
    self.loadAndRemember('HLTrigger/Configuration/HLT_%s_cff' % stepSpec)
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/src/Configuration/Applications/python/ConfigBuilder.py", line 376, in loadAndRemember
    self.process.load(includeFile)
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/src/FWCore/ParameterSet/python/Config.py", line 761, in load
    module = __import__(moduleName)
  File "/tmp/musich/CMSSW_14_1_GPU_X_2024-02-26-2300/src/HLTrigger/Configuration/python/HLT_Fake2_cff.py", line 237, in <module>
    fragment = customizeHLTforCMSSW(fragment,"Fake2")
  File "/tmp/musich/CMSSW_14_1_GPU_X_2024-02-26-2300/src/HLTrigger/Configuration/python/customizeHLTforCMSSW.py", line 262, in customizeHLTforCMSSW
    (alpaka & run3_common).makeProcessModifier(customizeHLTforAlpaka).apply(process)
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/src/FWCore/ParameterSet/python/Config.py", line 1980, in apply
    self.__func(process)
  File "/tmp/musich/CMSSW_14_1_GPU_X_2024-02-26-2300/src/HLTrigger/Configuration/python/customizeHLTforAlpaka.py", line 919, in customizeHLTforAlpaka
    process = customizeHLTforAlpakaPixelReco(process)
  File "/tmp/musich/CMSSW_14_1_GPU_X_2024-02-26-2300/src/HLTrigger/Configuration/python/customizeHLTforAlpaka.py", line 809, in customizeHLTforAlpakaPixelReco
    process = customizeHLTforAlpakaPixelRecoVertexing(process)
  File "/tmp/musich/CMSSW_14_1_GPU_X_2024-02-26-2300/src/HLTrigger/Configuration/python/customizeHLTforAlpaka.py", line 732, in customizeHLTforAlpakaPixelRecoVertexing
    process.hltTrimmedPixelVertices 
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_1_GPU_X_2024-02-26-2300/src/FWCore/ParameterSet/python/Config.py", line 1656, in __getattribute__
    return getattr(self.__process, name)
AttributeError: 'Process' object has no attribute 'hltTrimmedPixelVertices'

@thomreis
Contributor

But that is not an issue of the ECAL customisation anymore. Looks like Pixel in this case.

@Martin-Grunewald
Contributor

Martin-Grunewald commented Feb 27, 2024

It looks like there are more instances where alpaka customisation parts fail on Fake* menus.
#44075 (#44076 bp) would fix it from the workflow use-case side?!
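
Rather than finding these one traceback at a time, one could list the names each customisation step dereferences and check them against the process in one go. A minimal sketch under stated assumptions: the `REQUIRED` table and the `missing_attributes` helper are hypothetical (they do not exist in `customizeHLTforAlpaka.py`), the table only contains the names seen in the two tracebacks above, and a `SimpleNamespace` stands in for a process loaded from a Fake menu.

```python
from types import SimpleNamespace

# Names each customisation step reads from the process, taken from the
# tracebacks in this thread. The table itself is a hypothetical aid and
# is certainly incomplete.
REQUIRED = {
    "customizeHLTforAlpakaEcalLocalReco": [
        "HLTDoFullUnpackingEgammaEcalWithoutPreshowerTask",
        "HLTPreshowerTask",
    ],
    "customizeHLTforAlpakaPixelRecoVertexing": [
        "hltTrimmedPixelVertices",
    ],
}


def missing_attributes(process, required=REQUIRED):
    """Return {step name: [missing names]} for every step that would fail."""
    report = {}
    for step, names in required.items():
        missing = [name for name in names if not hasattr(process, name)]
        if missing:
            report[step] = missing
    return report


# Stand-in for a process loaded from a Fake menu (assumed to lack all names).
fake_process = SimpleNamespace()
report = missing_attributes(fake_process)
for step, names in report.items():
    print(f"{step}: missing {', '.join(names)}")
```

An empty report would mean that none of the listed lookups can raise an `AttributeError`; it says nothing about names that are not in the table.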

@mmusich
Contributor Author

mmusich commented Feb 27, 2024

But that is not an issue of the ECAL customisation anymore. Looks like Pixel in this case.

right, but it does not solve the issue.

@thomreis
Contributor

right, but it does not solve the issue.

Well it would solve this issue. But there seem to be others.

@mmusich changed the title from "[GPU] Workflow failures from missing HLTDoFullUnpackingEgammaEcalWithoutPreshowerTask" to "[GPU] Workflow failures when running the alpaka customization in presence of a Fake menu" on Feb 27, 2024

@Martin-Grunewald
Contributor

I guess it is faster to get the PRs in, rather than making alpaka customisations failsafe - given that the alpaka customisation will be folded into the ConfDb menus within a couple of weeks?
Or are there issues not fixed by the two PRs?

@mmusich
Contributor Author

mmusich commented Feb 27, 2024

Well it would solve this issue. But there seem to be others.

I edited the issue title to be more inclusive, so no, unfortunately it's not an adequate fix.

@mmusich
Contributor Author

mmusich commented Feb 27, 2024

Or are there issues not fixed by the two PRs?

getting the PR in will probably remove the failures from the IB tests, but the workflows will remain broken IIUC

@mmusich
Contributor Author

mmusich commented Feb 27, 2024

diff --git a/HLTrigger/Configuration/python/customizeHLTforAlpaka.py b/HLTrigger/Configuration/python/customizeHLTforAlpaka.py
index d1ca276fb3e..a9bdb2feae0 100644
--- a/HLTrigger/Configuration/python/customizeHLTforAlpaka.py
+++ b/HLTrigger/Configuration/python/customizeHLTforAlpaka.py
@@ -190,6 +190,10 @@ def customizeHLTforAlpakaParticleFlowClustering(process):
             pfRecHits = cms.InputTag("hltPFRecHitSoAProducerHCALCPUSerial"),
             )
 
+    ## failsafe for fake menus
+    if(not hasattr(process,'hltParticleFlowClusterHBHE')):
+        return process
+
     process.hltLegacyPFClusterProducer = cms.EDProducer("LegacyPFClusterProducer",
             src = cms.InputTag("hltPFClusterSoAProducer"),
             pfClusterParams = cms.ESInputTag("pfClusterParamsESProducer:"),
@@ -725,6 +729,10 @@ def customizeHLTforAlpakaPixelRecoVertexing(process):
         src = cms.InputTag("hltPixelVerticesCPUSerial")
     )
 
+    ## failsafe for fake menus
+    if(not hasattr(process,'hltTrimmedPixelVertices')):
+        return process
+
     process.HLTRecopixelvertexingTask = cms.ConditionalTask(
         process.HLTRecoPixelTracksTask,
         process.hltPixelVerticesSoA,
@@ -905,7 +913,9 @@ def customizeHLTforAlpakaEcalLocalReco(process):
         if hasattr(process, 'hltEcalUncalibRecHitSoA'):
             delattr(process, 'hltEcalUncalibRecHitSoA')
 
-    process.HLTDoFullUnpackingEgammaEcalTask = cms.ConditionalTask(process.HLTDoFullUnpackingEgammaEcalWithoutPreshowerTask, process.HLTPreshowerTask)
+        ## failsafe for fake menus
+        if hasattr(process, 'HLTDoFullUnpackingEgammaEcalWithoutPreshowerTask') and hasattr(process, 'HLTPreshowerTask'):
+            process.HLTDoFullUnpackingEgammaEcalTask = cms.ConditionalTask(process.HLTDoFullUnpackingEgammaEcalWithoutPreshowerTask, process.HLTPreshowerTask)
 
     return process
 

this seems to be enough to avoid runtime failures.
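
The same kind of failsafe could also be written once as a decorator instead of being repeated in each function. This is only a sketch of an alternative, not what the diff above does: the diff returns early partway through each function (after the new modules have been defined), whereas the hypothetical `requires` decorator below skips the whole function. `rebuild_ecal_unpacking_task` is likewise a made-up name, a tuple stands in for `cms.ConditionalTask`, and `SimpleNamespace` objects stand in for real processes.

```python
import functools
from types import SimpleNamespace


def requires(*names):
    """Run the wrapped customisation only if `process` has every name in `names`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(process):
            if not all(hasattr(process, name) for name in names):
                # Failsafe for Fake menus: return the process unchanged.
                return process
            return func(process)
        return wrapper
    return decorator


@requires("HLTDoFullUnpackingEgammaEcalWithoutPreshowerTask", "HLTPreshowerTask")
def rebuild_ecal_unpacking_task(process):
    # A plain tuple stands in for cms.ConditionalTask(...) in this sketch.
    process.HLTDoFullUnpackingEgammaEcalTask = (
        process.HLTDoFullUnpackingEgammaEcalWithoutPreshowerTask,
        process.HLTPreshowerTask,
    )
    return process


# Stand-ins for a Fake menu (no ECAL tasks) and a full menu (both present).
fake_menu = SimpleNamespace()
full_menu = SimpleNamespace(
    HLTDoFullUnpackingEgammaEcalWithoutPreshowerTask="ecal_task",
    HLTPreshowerTask="preshower_task",
)

rebuild_ecal_unpacking_task(fake_menu)  # skipped, no exception raised
rebuild_ecal_unpacking_task(full_menu)  # task rebuilt from the two inputs
print(hasattr(fake_menu, "HLTDoFullUnpackingEgammaEcalTask"))
print(full_menu.HLTDoFullUnpackingEgammaEcalTask)
```

Whether skipping a whole function is acceptable depends on what else that function sets up, so the in-place guards of the diff remain the smaller change.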

@AdrianoDee
Contributor

I don't think #44075 will fix this in the IBs since I didn't remove 2023 wfs but added 2024 ones (if I understood the issue here correctly). Alternatively to the solution here by @mmusich one could inhibit the *FakeHLT steps for the Alpaka wfs.

@mmusich
Contributor Author

mmusich commented Feb 27, 2024

one could inhibit the *FakeHLT steps for the Alpaka wfs.

this assumes that we are (correctly) running the FakeHLT RECO+DQM sequence in the workflows that run a Fake HLT menu, but this is not in general guaranteed nor enforced (even though we've been trying to be diligent with it). On the other hand since the whole customization thing will get reabsorbed soon, I guess it's an academic discussion.
I would open a PR now with #44119 (comment) to get rid of failures for the next few weeks and be done with it.

@AdrianoDee
Contributor

Alternatively to the solution here by @mmusich one could inhibit the *FakeHLT steps for the Alpaka wfs.

OK, on second thought this could overcomplicate things. I would protect the customizer with the failsafes.

@AdrianoDee
Contributor

this assumes that we are (correctly) running the FakeHLT RECO+DQM sequence in the workflows that run a Fake HLT menu, but this is not in general guaranteed nor enforced (even though we've been trying to be diligent with it). On the other hand since the whole customization thing will get reabsorbed soon, I guess it's an academic discussion.

Agreed, you just beat me to it.

@makortel
Contributor

+heterogeneous

@mmusich
Contributor Author

mmusich commented Feb 29, 2024

+hlt

  • for the record
