Components stuck due to failure to send logs #7851
Comments
can we patch
but |
Another case. If I use
2024-10-23 07:24:59 UTC DataManagement/FTS3Agent/treatOperation/9186766 INFO: FTS3Operation 9186766: Submitted job for 1 transfers
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent/submit/9186784/CERN_MC-DST-EOS_IN2P3-ARCHIVE INFO: Got GUID e90bc34c-910f-11ef-a5b7-fa163e0a22e3
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f26805295d0> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f26c0268790> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f27405f1c10> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f26e00de9d0> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f274053f550> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f26c00242d0> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f2740515750> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f2754211750> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f26806edb10> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent/submit/9186782/GRIDKA_MC-DST_CNAF-ARCHIVE INFO: Got GUID e903e244-910f-11ef-bcef-fa163e0a22e3
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent/treatOperation/9186784 INFO: FTS3Operation 9186784: Submitted job for 2 transfers
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent/treatOperation/9186782 INFO: FTS3Operation 9186782: Submitted job for 1 transfers
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f2700160350> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog |
It looks like FTS3Agent is still stuck, with 48 threads waiting to acquire this lock:
And the reason they're waiting is that a 49th thread is stuck in this state:
Thread 774821 (idle): "Thread-81 (worker)"
send (stomp/transport.py:639)
Arguments:
self: <Transport at 0x7f275dca8150>
encoded_frame: <bytes at 0x7f2740790cb0>
transmit (stomp/transport.py:282)
Arguments:
self: <Transport at 0x7f275dca8150>
frame: <Frame at 0x7f275cd3ed10>
Locals:
listeners: [("protocol-listener", <StompConnection11 at 0x7f275dcaa510>)]
_: "protocol-listener"
listener: <StompConnection11 at 0x7f275dcaa510>
lines: [<bytes at 0x7f27405e3bd0>, <bytes at 0x55adc06cd5b0>, <bytes at 0x7f26e009f630>, <bytes at 0x7f26e0784b20>, <bytes at 0x55adc06cd5b0>, <bytes at 0x7f27406e23a0>, <bytes at 0x7f275cb43720>]
packed_frame: <bytes at 0x7f2740790cb0>
send_frame (stomp/protocol.py:248)
Arguments:
self: <StompConnection11 at 0x7f275dcaa510>
cmd: "SEND"
headers: {"destination": "/queue/lhcb.dirac.logging", "content-length": 298}
body: "{\"levelname\": \"INFO\", \"message\": \"Got GUID ca170fea-912e-11ef-af36-fa163e5a69c7\", \"asctime\": \"2024-10-23 11:06:01,727334\", \"componentname\": \"DataManagement/FTS3Agent\", \"varmessage\": \"\", \"customname\": \"\", \"local_name\": \"submit/9186818/IN2P3_MC-DST_NIPNE-07_MC-DST\", \"hostname\": \"lbvobox303.cern.ch\"}"
Locals:
frame: <Frame at 0x7f275cd3ed10>
send (stomp/protocol.py:396)
Arguments:
self: <StompConnection11 at 0x7f275dcaa510>
destination: "/queue/lhcb.dirac.logging"
body: "{\"levelname\": \"INFO\", \"message\": \"Got GUID ca170fea-912e-11ef-af36-fa163e5a69c7\", \"asctime\": \"2024-10-23 11:06:01,727334\", \"componentname\": \"DataManagement/FTS3Agent\", \"varmessage\": \"\", \"customname\": \"\", \"local_name\": \"submit/9186818/IN2P3_MC-DST_NIPNE-07_MC-DST\", \"hostname\": \"lbvobox303.cern.ch\"}"
content_type: None
headers: {"destination": "/queue/lhcb.dirac.logging", "content-length": 298}
Locals:
keyword_headers: {}
put (DIRAC/Resources/MessageQueue/StompMQConnector.py:166)
Arguments:
self: <StompMQConnector at 0x7f275dc5c410>
message: {"levelname": "INFO", "message": "Got GUID ca170fea-912e-11ef-af36-fa163e5a69c7", "asctime": "2024-10-23 11:06:01,727334", "componentname": "DataManagement/FTS3Agent", "varmessage": "", "customname": "", "local_name": "submit/9186818/IN2P3_MC-DST_NIPNE-07_MC-DST", "hostname": "lbvobox303.cern.ch"}
parameters: {"destination": "/queue/lhcb.dirac.logging"}
Locals:
log: <Logging at 0x7f275c275090>
destination: "/queue/lhcb.dirac.logging"
put (DIRAC/Resources/MessageQueue/MQProducer.py:20)
Arguments:
self: <MQProducer at 0x7f275dc64b50>
msg: {"levelname": "INFO", "message": "Got GUID ca170fea-912e-11ef-af36-fa163e5a69c7", "asctime": "2024-10-23 11:06:01,727334", "componentname": "DataManagement/FTS3Agent", "varmessage": "", "customname": "", "local_name": "submit/9186818/IN2P3_MC-DST_NIPNE-07_MC-DST", "hostname": "lbvobox303.cern.ch"}
Locals:
result: {"OK": True, "Value": <StompMQConnector at 0x7f275dc5c410>}
connector: <StompMQConnector at 0x7f275dc5c410>
emit (DIRAC/FrameworkSystem/private/standardLogging/Handler/MessageQueueHandler.py:49)
Arguments:
self: <MessageQueueHandler at 0x7f275e0dc8d0>
record: <LogRecord at 0x7f26e00ae890>
Locals:
strRecord: "{\"levelname\": \"INFO\", \"message\": \"Got GUID ca170fea-912e-11ef-af36-fa163e5a69c7\", \"asctime\": \"2024-10-23 11:06:01,727334\", \"componentname\": \"DataManagement/FTS3Agent\", \"varmessage\": \"\", \"customname\": \"\", \"local_name\": \"submit/9186818/IN2P3_MC-DST_NIPNE-07_MC-DST\", \"hostname\": \"lbvobox303.cern.ch\"}"
handle (logging/__init__.py:978)
Arguments:
self: <MessageQueueHandler at 0x7f275e0dc8d0>
record: <LogRecord at 0x7f26e00ae890>
Locals:
rv: True
callHandlers (logging/__init__.py:1706)
Arguments:
self: <Logger at 0x7f275edd3e90>
record: <LogRecord at 0x7f26e00ae890>
Locals:
c: <Logger at 0x7f275edd3e90>
found: 2
hdlr: <MessageQueueHandler at 0x7f275e0dc8d0>
handle (logging/__init__.py:1644)
Arguments:
self: <Logger at 0x7f275edd3e90>
record: <LogRecord at 0x7f26e00ae890>
_log (logging/__init__.py:1634)
Arguments:
self: <Logger at 0x7f275edd3e90>
level: 20
msg: "%s"
args: ("Got GUID ca170fea-912e-11ef-af36-fa163e5a69c7")
exc_info: False
extra: {"componentname": "DataManagement/FTS3Agent", "varmessage": "", "spacer": "", "customname": "", "headerIsShown": True, "timeStampIsShown": True, "contextIsShown": True, "threadIDIsShown": False, "color": True, "local_name": "submit/9186818/IN2P3_MC-DST_NIPNE-07_MC-DST"}
stack_info: False
stacklevel: 1
Locals:
sinfo: None
fn: "(unknown file)"
lno: 0
func: "(unknown function)"
record: <LogRecord at 0x7f26e00ae890>
log (logging/__init__.py:1559)
Arguments:
self: <Logger at 0x7f275edd3e90>
level: 20
msg: "%s"
Locals:
args: ("Got GUID ca170fea-912e-11ef-af36-fa163e5a69c7")
kwargs: {"exc_info": False, "extra": {"componentname": "DataManagement/FTS3Agent", "varmessage": "", "spacer": "", "customname": "", "headerIsShown": True, "timeStampIsShown": True, "contextIsShown": True, "threadIDIsShown": False, "color": True, "local_name": "submit/9186818/IN2P3_MC-DST_NIPNE-07_MC-DST"}}
_createLogRecord (DIRAC/FrameworkSystem/private/standardLogging/Logging.py:421)
Arguments:
.... |
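To illustrate the pattern in the dump above: `logging.Handler.handle` holds the handler's lock while `emit` runs, so one `emit` that never returns leaves every other logging thread waiting on that lock. This is a minimal stdlib-only sketch, not DIRAC code; `BlockingHandler`, `count_blocked_threads` and the logger name are hypothetical, and an `Event` stands in for the blocked `send`.

```python
import logging
import threading
import time


class BlockingHandler(logging.Handler):
    """Hypothetical stand-in for a handler whose emit() blocks on a network send."""

    def __init__(self, unblock: threading.Event):
        super().__init__()
        self.unblock = unblock
        self.emitted = 0

    def emit(self, record):
        # logging.Handler.handle() calls emit() with the handler lock held,
        # so while we wait here no other thread can enter emit().
        self.unblock.wait()
        self.emitted += 1


def count_blocked_threads(n_threads: int = 5, wait: float = 0.5):
    """Start n_threads loggers; return (threads still stuck after `wait`, records emitted)."""
    unblock = threading.Event()
    logger = logging.getLogger("demo.stuck")  # hypothetical logger name
    logger.propagate = False
    logger.setLevel(logging.INFO)
    handler = BlockingHandler(unblock)
    logger.addHandler(handler)

    threads = [
        threading.Thread(target=logger.info, args=("msg %d", i), daemon=True)
        for i in range(n_threads)
    ]
    for t in threads:
        t.start()

    time.sleep(wait)
    # One thread is inside emit(); the others wait for the handler lock.
    blocked = sum(t.is_alive() for t in threads)

    unblock.set()  # simulate the send finally returning
    for t in threads:
        t.join(timeout=5)
    logger.removeHandler(handler)
    return blocked, handler.emitted
```

Calling `count_blocked_threads(4, 0.3)` should report all four threads as stuck until the event is set, after which every record goes through.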
I think the problem is that we need to set a timeout in
I think it can be set here as
The alternative is "heartbeats": https://jasonrbriggs.github.io/stomp.py/api.html#dealing-with-disconnects
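For reference, here is what a send timeout buys at the socket level. This is a stdlib-only sketch under stated assumptions: it does not use stomp.py or DIRAC, and `send_with_timeout` is a hypothetical helper. A peer that never reads stands in for an unresponsive broker; where exactly such a timeout would be set in stomp.py is not shown here.

```python
import socket


def send_with_timeout(timeout: float, payload_size: int = 16 * 1024 * 1024) -> bool:
    """Return True if sendall() gave up with a timeout instead of blocking forever."""
    sender, receiver = socket.socketpair()
    try:
        # Without this call the socket is blocking and sendall() would hang
        # as soon as the kernel buffers are full.
        sender.settimeout(timeout)
        try:
            # The receiver never reads, so the payload cannot be fully sent.
            sender.sendall(b"x" * payload_size)
        except socket.timeout:
            return True
        return False
    finally:
        sender.close()
        receiver.close()
```

With a timeout set, the stuck send turns into an exception that the caller can handle, so the thread holding the logging lock would release it rather than blocking the other threads indefinitely.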
In LHCb we're seeing many services stuck. If I attach py-spy, they spit out a bunch of errors and then start working again: