Cut parser error in ROOT master IB #33084
Comments
A new Issue was created by @makortel Matti Kortelainen. @Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign core, xpog |
FYI @pcanal |
New categories assigned: core,xpog @Dr15Jones,@smuzaffar,@fgolf,@mariadalfonso,@makortel,@gouskos you have been requested to review this Pull request/Issue and eventually sign? Thanks |
assign dqm |
New categories assigned: dqm @jfernan2,@andrius-k,@ahmad3213,@kmaeshima,@rvenditti,@ErnestaP you have been requested to review this Pull request/Issue and eventually sign? Thanks |
FYI @peruzzim |
Another occurrence in CMSSW_11_3_DEVEL_X_2021-03-17-2300 1325.6 step 2
@gpetruc according to the GitHub history, you were the one who introduced nanoDQMC in the code. Could you please have a look, or point us to whoever is responsible? |
Is this issue still valid? Thanks |
We still see this exception intermittently, the latest I could find is last week: |
Thanks @dan131riley On the other hand, the error seems related to these lines: |
The IBs run with 4 threads, and what we see is all 4 threads failing on the first event for that thread. With all 4 threads failing, it's likely some kind of initialization failure, possibly a multi-thread race condition. |
Thanks @dan131riley |
More likely timing dependent. It's an all or none failure--either all the streams fail or none do, that's not consistent with an event dependent failure. Thread races can be very dependent on the system load, and the IB machines tend to be heavily loaded. |
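To make the "all or none" reasoning concrete, here is a minimal sketch of a one-time setup shared by several streams. It is not the CMSSW code: `LazyParser`, `initialize`, and `runFirstEvents` are hypothetical names, and `std::call_once` is just one standard way to make such a setup safe. The point is only that when every stream hits the same shared initialization on its first event, the outcome is naturally shared by all streams.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for an expensive one-time setup that is shared
// by all streams (for example, building a parsed expression).
struct LazyParser {
  std::once_flag once;
  std::atomic<int> initCount{0};
  bool ready = false;

  void initialize() {
    ++initCount;
    ready = true;
  }

  // Every caller goes through call_once, so initialize() runs exactly
  // once and every caller observes ready == true afterwards.
  bool use() {
    std::call_once(once, [this] { initialize(); });
    return ready;
  }
};

// Simulates nStreams threads each processing their first event at once.
// Returns how many times the setup ran, or -1 if any stream saw an
// uninitialized parser.
int runFirstEvents(int nStreams) {
  LazyParser parser;
  std::atomic<int> okCount{0};
  std::vector<std::thread> streams;
  for (int i = 0; i < nStreams; ++i) {
    streams.emplace_back([&parser, &okCount] {
      if (parser.use()) ++okCount;
    });
  }
  for (auto& t : streams) t.join();
  return (okCount == nStreams) ? parser.initCount.load() : -1;
}
```

Calling `runFirstEvents(4)` mirrors the 4-thread IB setup. Without the `call_once` guard, whether the streams see a consistent state would depend on timing, which is the kind of load-dependent behaviour described above.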
Ok, I understand, but that makes it even harder to reproduce... |
Thanks @dan131riley That is stranger since the method (even if it is not a DQM class) exists:
So, I do not understand |
+1 |
Occurred in CMSSW_12_3_X_2021-12-13-2300 slc7_ppc64le_gcc11
Thanks. I am still not able to reproduce that last example, either single-threaded or multi-threaded, and without reproducing it I cannot debug. The only thing I know, but do not understand, is that the crash is coming from: |
-1 |
This particular failure would most likely not have happened if #44575 were used as the module knows exactly the type of data product it uses. |
I have no objection to trying this patch in ROOT6 IBs. I will open a cms-sw/root PR to try out this patch |
Note that a more extensive version of the ROOT patch will be uploaded today. |
ok thanks @pcanal , I will wait for that then. |
Let's discuss further in the core software meeting (we just discussed with @Dr15Jones possible other avenues to debug the problem itself) |
First failure after merging #44590: link
Unfortunately, that one is a 'false positive'. It turns out the code expects exceptions to happen: it has a 'try...catch' block because it tests various class types: cmssw/PhysicsTools/PatUtils/interface/StringParserTools.h Lines 80 to 86 in 7e39f77
So the change just outright broke this code. |
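For readers without the file open, the general pattern looks roughly like the sketch below. This is not the code from StringParserTools.h: `parseFor`, `firstMatchingType`, and the type and expression names are hypothetical, chosen only to show a loop where a thrown exception is the normal "this type does not match" signal.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical parser: throws when the expression does not apply to the
// given type, returns true when it does.
bool parseFor(const std::string& typeName, const std::string& expr) {
  if (typeName == "Muon" && expr == "isGlobalMuon") return true;
  throw std::runtime_error("cannot parse '" + expr + "' for " + typeName);
}

// Tries each candidate type in turn. Returns the first type for which
// parsing succeeds, or an empty string if none does.
std::string firstMatchingType(const std::vector<std::string>& candidates,
                              const std::string& expr) {
  for (const auto& typeName : candidates) {
    try {
      if (parseFor(typeName, expr)) return typeName;
    } catch (const std::exception&) {
      // Expected: this candidate does not support the expression,
      // so move on to the next one.
    }
  }
  return {};
}
```

With this shape, any change that turns the parser's exception into a hard failure (or reports every throw as an error) will flag perfectly normal probing as a problem, which is consistent with the false positive seen here.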
I wonder if we should temporarily disable that test? |
CMSSW_14_1_NONLTO_X_2024-04-04-1100 wf 140.03 step 3 on the other hand shows a real occurrence
but the stack trace doesn't look too helpful |
It's very telling that we have two threads throwing at the same time. It would be nicer if we could get the details of the throw, but it's still a good clue. |
So 3 of our TBB threads are doing EventSetup related work and the other one (which fails) is running the cutParser. |
We still need to gather more stack traces, but from the two we have it seems to me like the problem doesn't happen in the cutParser itself, instead it looks like the problem happened sometime before the call and this is just a byproduct. If so, then it will be even harder to track down :(. |
Note: The related ROOT PR is root-project/root#15113, which addresses several forms of root-project/root#15090 |
Hit in CMSSW_14_1_CLANG_X_2024-04-07-2300
So here is a possibly related seg fault in cling, where a cut parser and a ROOT I/O call are BOTH inside cling concurrently. NOTE: this crash happened at initialization time, NOT Event time.
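As a generic illustration of why two callers inside a non-thread-safe component at the same time is a problem, and what serializing them looks like, here is a small sketch. It is an assumption-laden toy, not how ROOT, cling, or CMSSW handle this: `FakeInterpreter`, `gInterpreterMutex`, and `runConcurrently` are hypothetical names.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for interpreter state that is not thread-safe:
// evaluate() modifies shared data without any internal synchronization.
struct FakeInterpreter {
  long counter = 0;
  void evaluate() { ++counter; }
};

// Hypothetical global lock that every caller must take before entering
// the interpreter.
std::mutex gInterpreterMutex;

// Runs nThreads workers that each call into the interpreter
// callsPerThread times, always under the lock. Returns the final counter.
long runConcurrently(int nThreads, int callsPerThread) {
  FakeInterpreter interp;
  std::vector<std::thread> workers;
  for (int i = 0; i < nThreads; ++i) {
    workers.emplace_back([&interp, callsPerThread] {
      for (int j = 0; j < callsPerThread; ++j) {
        std::lock_guard<std::mutex> lock(gInterpreterMutex);
        interp.evaluate();
      }
    });
  }
  for (auto& t : workers) t.join();
  return interp.counter;
}
```

If one code path (say, an I/O call) reaches the shared state without taking the lock, the guarantee is lost for everyone, including callers that do lock correctly.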
@pcanal FYI |
Indeed. #33084 (comment) is fixed by root-project/root@c1a0840 which is part of root-project/root#15113 |
Another one in CMSSW_14_1_NONLTO_X_2024-04-11-1100 wf 11634.914 step 3
We had another failure, this time in https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el9_amd64_gcc12/CMSSW_14_1_X_2024-04-12-2300/pyRelValMatrixLogs/run/11634.911_TTbar_14TeV+2021_DD4hep/step3_TTbar_14TeV+2021_DD4hep.log#/
Based on the errors we've seen, it seems likely to me that there is NOT a race condition happening while the cut parser is running. Instead, a cling related error has happened earlier in the job and these failures are just a symptom of that earlier problem. |
Caught another one
Another assert has occurred
This is odd (and/or missing some threads): #33084 (comment)
but none seems to be holding the lock (unless there is some ROOT stuff in the cut part of thread 5) |
We seemingly have the same situation in #33084 (comment) |
There shouldn't be. The cut out part should contain only framework's (ROOT-independent) functions. |
I guess we haven't seen this error since all the workarounds and the ROOT update (which had at least a partial, if not full, fix), so I'm now wondering whether there is any issue left and, if there is, whether it has become so rare that it has negligible practical impact. |
Workflow 1325.61 step 2 fails in CMSSW_11_3_ROOT6_X_2021-03-04-2300 with
https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc900/CMSSW_11_3_ROOT6_X_2021-03-04-2300/pyRelValMatrixLogs/run/1325.61_TTbar_13_106Xv1NanoAODINPUT+TTbar_13_106Xv1NanoAODINPUT+NANOAODMC2017_106XMiniAODv1/step2_TTbar_13_106Xv1NanoAODINPUT+TTbar_13_106Xv1NanoAODINPUT+NANOAODMC2017_106XMiniAODv1.log#/