[13.0.X] Fixed channel decoding for the timeout error in SiPixel RawToDigi #42033
A new Pull Request was created by @sroychow (Suvankar Roy Chowdhury) for CMSSW_13_0_X. It involves the following packages:
@cmsbuild, @mandrenguyen, @clacaputo can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
type bug-fix |
urgent |
test parameters:
|
please test |
-1 Failed Tests: RelVals-INPUT
RelVals-INPUT: The relvals timed out after 4 hours.
Comparison Summary:
GPU Comparison Summary:
|
should the tests be re-triggered? |
please test |
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b59733/33338/summary.html
Comparison Summary:
GPU Comparison Summary:
|
@sroychow, would you please confirm that the differences are as expected, also in the master PR: #42010 (comment) |
@malbouis @jordan-martins After some checks, and after discussing with @mmusich, we think the differences in the GPU comparisons are spurious. If you look at older PRs that were already merged (e.g. 42014 or earlier), you should see the bot pointing out differences in the GPU comparisons there as well. I propose we merge this PR. |
Thanks, @sroychow! Indeed, it would be good to have this PR merged and a new release cut that includes it. |
@cms-sw/reconstruction-l2 can someone take a look at this urgent PR to sign off? |
Apologies, but can someone reiterate why we would like to put this bug-fix directly into reconstruction without proper release validation? But according to HLT, this PR will only affect the fallback mode when no GPU is available. |
Small clarification (already mentioned at the last ORP): the CPU unpacker also runs at HLT for the fraction of events used for GPU-vs-CPU comparisons. Edit: at HLT, the CPU pixel unpacker corresponds to the module
|
Do we have a number about the fraction of events that take this path?
I agree. This needs more validation. OTOH, done the standard way, it will come too late to be useful if the intention from PPD is not to reprocess the last chunk of 2023 data (and to have it consistent with the reprocessed part, which presumably will have the fix). |
This statement squarely contradicts the whole of #41715. |
(I edited my comment above about the GPU-vs-CPU comparisons, as already noticed)
I think this has happened very rarely. I think the number is |
#41715 relates to offline studies where we compare the trigger results 'running on GPU' vs 'running on CPU'. The goal is to make sure the two reconstructions give the same results, since (in general) GPU is the default online, while CPU is the fallback online and is used in most offline use cases (e.g. MC). Part of this validation also runs online as part of the HLT menu: there is a Path ( #41715 led to identifying a bugfix, and I thought PPD was in favour of backporting it (that's how I read #42010 (comment)). Whether or not this backport is critical for HLT can be debated. Maybe @silviodonato or @fwyzard have a different opinion. |
Hi @mandrenguyen, Yes, since it is a bug fix, we decided to get it in ASAP for the start of ERA D. We do indeed want to perform some validation, and we have asked TRK to propose approaches that would help PdmV set up a quick validation we could rely on. We wanted to get this in now because we only foresee a rereco of the initial chunk of the DATA from ERA A to C. Do you think this is too risky? Could you propose a way around this to help us move in the safest way possible? Thanks, FYI @cms-sw/ppd-l2 @cms-sw/pdmv-l2 |
It would appear this is a long-standing bug in the unpacker, which has a very minor effect on the offline reconstruction, at least judging by the comparisons. I understand that there is some CPU-GPU validation that we would like to converge, but do we really want to risk breaking reco to fix validation? Especially since we heard this is only a partial fix, and there are at least more changes coming on the GPU side. The risk of directly implementing this bug-fix is admittedly small, but the consequences could be quite bad. I suppose we should be able to fix the validation using the fixed CPU code without actually deploying the fix in prompt reco for Era D. |
So it validates a use case that never happens (in practice). Now that this is clarified I guess it also settles the urgency for a fix (and for all future fix requests of this type as well).
The effect is minor when the detector is well-behaved. I am under the impression that the effect becomes more sizeable in the presence of large rates of soft error recoveries. |
The bugfix should definitely be backported and included in the online reconstruction ASAP. RECO conveners may have a cavalier approach towards the integrity of the data taking, but DAQ and TSG (should) consider the stability and correctness of the online reconstruction and data taking of paramount importance. |
Mostly to clarify for myself
If one of the legacy unpacker event products were persisted to be output in the event stream, as we're planning in https://its.cern.ch/jira/browse/CMSHLT-2846, I would think it will start to be run as well. |
Does something prevent us from using an era or process modifier such that:
? |
backport of #42010 |
+reconstruction
|
This pull request is fully signed and it will be integrated in one of the next CMSSW_13_0_X IBs (tests are also fine) and once validation in the development release cycle CMSSW_13_2_X is complete. This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2) |
+1
|
@mandrenguyen that's an interesting approach, but looking at the changes, I do not think it can be done: an era or process modifier can affect only the Python configuration, while this bug fix is a C++ change. |
For completeness, I have to correct myself again wrt #42033 (comment) (apologies). As one can see from here, there are two instances of the CPU Pixel unpacker at HLT, |
PR description:
Backport of #42010
PR validation:
code compiles