Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Components stuck due to failure to send logs #7851

Closed
chrisburr opened this issue Oct 23, 2024 · 7 comments
Closed

Components stuck due to failure to send logs #7851

chrisburr opened this issue Oct 23, 2024 · 7 comments
Labels

Comments

@chrisburr
Copy link
Member

In LHCb we're seeing many services stuck. If I attach py-spy the spit out a bunch of errors and then start working again:

2024-10-23 07:35:36 UTC WorkloadManagement/JobMonitoring NOTICE: Executing action ([2001:1458:d03:3::100:319]:37772)[lhcb_user:menghao] RPC/getJobsStates(<masked>)
2024-10-23 07:35:36 UTC WorkloadManagement/JobMonitoring NOTICE: Returning response ([2001:1458:d03:3::100:319]:37772)[lhcb_user:menghao] (0.04 secs) OK
2024-10-23 07:35:40 UTC WorkloadManagement/JobMonitoring NOTICE: Executing action ([2001:1458:d00:2f::100:1fb]:35594)[hosts:lbvobox301.cern.ch] RPC/getJobsStatus(<masked>)
2024-10-23 07:35:40 UTC WorkloadManagement/JobMonitoring NOTICE: Returning response ([2001:1458:d00:2f::100:1fb]:35594)[hosts:lbvobox301.cern.ch] (0.01 secs) OK
2024-10-23 11:00:33 UTC WorkloadManagement/JobMonitoring WARN: Sleeping as service requested throttling WorkloadManagement/JobMonitoring
2024-10-23 11:00:33 UTC WorkloadManagement/JobMonitoring ERROR: Error processing proposal Error in _write: (110, 'Connection timed out') SSLError(110, 'Connection timed out')
2024-10-23 11:00:33 UTC WorkloadManagement/JobMonitoring ERROR: Exception in handleConnection thread
Traceback (most recent call last):
  File "/opt/dirac/versions/v11.0.49-1729166531/Linux-x86_64/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dirac/versions/v11.0.49-1729166531/Linux-x86_64/lib/python3.11/site-packages/DIRAC/Core/DISET/private/Service.py", line 351, in _processInThread
    if result["closeTransport"] or not result["OK"]:
       ~~~~~~^^^^^^^^^^^^^^^^^^
KeyError: 'closeTransport'
2024-10-23 11:00:33 UTC WorkloadManagement/JobMonitoring ERROR: Exception in handleConnection thread
Traceback (most recent call last):
  File "/opt/dirac/versions/v11.0.49-1729166531/Linux-x86_64/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dirac/versions/v11.0.49-1729166531/Linux-x86_64/lib/python3.11/site-packages/DIRAC/Core/DISET/private/Service.py", line 351, in _processInThread
    if result["closeTransport"] or not result["OK"]:
       ~~~~~~^^^^^^^^^^^^^^^^^^
KeyError: 'closeTransport'
2024-10-23 11:00:33 UTC WorkloadManagement/JobMonitoring ERROR: Exception in handleConnection thread
Traceback (most recent call last):
  File "/opt/dirac/versions/v11.0.49-1729166531/Linux-x86_64/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dirac/versions/v11.0.49-1729166531/Linux-x86_64/lib/python3.11/site-packages/DIRAC/Core/DISET/private/Service.py", line 351, in _processInThread
    if result["closeTransport"] or not result["OK"]:
       ~~~~~~^^^^^^^^^^^^^^^^^^
KeyError: 'closeTransport'
2024-10-23 11:00:33 UTC WorkloadManagement/JobMonitoring ERROR: Exception in handleConnection thread
Traceback (most recent call last):
  File "/opt/dirac/versions/v11.0.49-1729166531/Linux-x86_64/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dirac/versions/v11.0.49-1729166531/Linux-x86_64/lib/python3.11/site-packages/DIRAC/Core/DISET/private/Service.py", line 351, in _processInThread
    if result["closeTransport"] or not result["OK"]:
       ~~~~~~^^^^^^^^^^^^^^^^^^
KeyError: 'closeTransport'
... more of the same
2024-10-23 11:00:33 UTC WorkloadManagement/JobMonitoring WARN: Error while accepting a connection:  Error in acceptConnection: [Errno 107] Transport endpoint is not connected OSError(107, 'Transport endpoint is not connected')
2024-10-23 11:00:33 UTC WorkloadManagement/JobMonitoring WARN: Error while accepting a connection:  Error in acceptConnection: [Errno 107] Transport endpoint is not connected OSError(107, 'Transport endpoint is not connected')
2024-10-23 11:00:33 UTC WorkloadManagement/JobMonitoring WARN: Error while accepting a connection:  Error in acceptConnection: [Errno 107] Transport endpoint is not connected OSError(107, 'Transport endpoint is not connected')
2024-10-23 11:00:33 UTC WorkloadManagement/JobMonitoring WARN: Error while accepting a connection:  Error in acceptConnection: [Errno 107] Transport endpoint is not connected OSError(107, 'Transport endpoint is not connected')
2024-10-23 11:00:33 UTC WorkloadManagement/JobMonitoring NOTICE: Executing action ([2001:1458:d00:68::100:468]:55134)[hosts:lbvobox900.cern.ch] RPC/getJobsStatus(<masked>)
2024-10-23 11:00:33 UTC WorkloadManagement/JobMonitoring NOTICE: Returning response ([2001:1458:d00:68::100:468]:55134)[hosts:lbvobox900.cern.ch] (0.00 secs) OK
2024-10-23 11:00:33 UTC WorkloadManagement/JobMonitoring NOTICE: Executing action ([2001:1458:d00:68::100:468]:55144)[hosts:lbvobox900.cern.ch] RPC/getJobsStatus(<masked>)
2024-10-23 11:00:33 UTC WorkloadManagement/JobMonitoring NOTICE: Returning response ([2001:1458:d00:68::100:468]:55144)[hosts:lbvobox900.cern.ch] (0.01 secs) OK
2024-10-23 11:00:33 UTC WorkloadManagement/JobMonitoring NOTICE: Executing action ([2001:1458:d00:2f::100:1fb]:51200)[hosts:lbvobox301.cern.ch] RPC/getJobParameters(<masked>)
2024-10-23 11:00:33 UTC WorkloadManagement/JobMonitoring NOTICE: Returning response ([2001:1458:d00:2f::100:1fb]:51200)[hosts:lbvobox301.cern.ch] (0.03 secs) OK
@ryuwd
Copy link

ryuwd commented Oct 23, 2024

can we patch result["closeTransport"] with result.get("closeTransport", None)?

_processProposal isn't guaranteed to set closeTransport at all if the socket fails to send initially

https://github.com/DIRACGrid/DIRAC/blob/integration/src/DIRAC/Core/DISET/private/Service.py#L349-L354

but

https://github.com/DIRACGrid/DIRAC/blob/integration/src/DIRAC/Core/DISET/private/Service.py#L510-L515

@chrisburr
Copy link
Member Author

Another case. If I use py-spy --nonblocking it doesn't revive the process but if I don't include that flag it spits out this error:

2024-10-23 07:24:59 UTC DataManagement/FTS3Agent/treatOperation/9186766 INFO: FTS3Operation 9186766: Submitted job for 1 transfers
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent/submit/9186784/CERN_MC-DST-EOS_IN2P3-ARCHIVE INFO: Got GUID e90bc34c-910f-11ef-a5b7-fa163e0a22e3
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f26805295d0> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f26c0268790> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f27405f1c10> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f26e00de9d0> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f274053f550> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f26c00242d0> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f2740515750> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f2754211750> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f26806edb10> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent/submit/9186782/GRIDKA_MC-DST_CNAF-ARCHIVE INFO: Got GUID e903e244-910f-11ef-bcef-fa163e0a22e3
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent/treatOperation/9186784 INFO: FTS3Operation 9186784: Submitted job for 2 transfers
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent/treatOperation/9186782 INFO: FTS3Operation 9186782: Submitted job for 1 transfers
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Issue getting socket: <DIRAC.Core.DISET.private.Transports.M2SSLTransport.SSLTransport object at 0x7f2700160350> : ('dips', 'lbvobox306.cern.ch', 9197, 'DataManagement/FileCatalog') : timed
out:TimeoutError('timed out')
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent WARN: Non-responding URL temporarily banned dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog
2024-10-23 11:05:54 UTC DataManagement/FTS3Agent INFO: Retry connection : 1 to dips://lbvobox306.cern.ch:9197/DataManagement/FileCatalog

@chrisburr
Copy link
Member Author

It looks like FTS3Agent is still stuck, with 48 threads waiting to acquire this lock:

https://github.com/DIRACGrid/DIRAC/blob/rel-v8r0/src/DIRAC/FrameworkSystem/private/standardLogging/Logging.py#L398

@chrisburr
Copy link
Member Author

And the reason they're waiting is because a 49th thread is stuck in this state:

Thread 774821 (idle): "Thread-81 (worker)"
    send (stomp/transport.py:639)
        Arguments:
            self: <Transport at 0x7f275dca8150>
            encoded_frame: <bytes at 0x7f2740790cb0>
    transmit (stomp/transport.py:282)
        Arguments:
            self: <Transport at 0x7f275dca8150>
            frame: <Frame at 0x7f275cd3ed10>
        Locals:
            listeners: [("protocol-listener", <StompConnection11 at 0x7f275dcaa510>)]
            _: "protocol-listener"
            listener: <StompConnection11 at 0x7f275dcaa510>
            lines: [<bytes at 0x7f27405e3bd0>, <bytes at 0x55adc06cd5b0>, <bytes at 0x7f26e009f630>, <bytes at 0x7f26e0784b20>, <bytes at 0x55adc06cd5b0>, <bytes at 0x7f27406e23a0>, <bytes at 0x7f275cb43720>]
            packed_frame: <bytes at 0x7f2740790cb0>
    send_frame (stomp/protocol.py:248)
        Arguments:
            self: <StompConnection11 at 0x7f275dcaa510>
            cmd: "SEND"
            headers: {"destination": "/queue/lhcb.dirac.logging", "content-length": 298}
            body: "{\"levelname\": \"INFO\", \"message\": \"Got GUID ca170fea-912e-11ef-af36-fa163e5a69c7\", \"asctime\": \"2024-10-23 11:06:01,727334\", \"componentname\": \"DataManagement/FTS3Agent\", \"varmessage\": \"\", \"customname\": \"\", \"local_name\": \"submit/9186818/IN2P3_MC-DST_NIPNE-07_MC-DST\", \"hostname\": \"lbvobox303.cern.ch\"}"
        Locals:
            frame: <Frame at 0x7f275cd3ed10>
    send (stomp/protocol.py:396)
        Arguments:
            self: <StompConnection11 at 0x7f275dcaa510>
            destination: "/queue/lhcb.dirac.logging"
            body: "{\"levelname\": \"INFO\", \"message\": \"Got GUID ca170fea-912e-11ef-af36-fa163e5a69c7\", \"asctime\": \"2024-10-23 11:06:01,727334\", \"componentname\": \"DataManagement/FTS3Agent\", \"varmessage\": \"\", \"customname\": \"\", \"local_name\": \"submit/9186818/IN2P3_MC-DST_NIPNE-07_MC-DST\", \"hostname\": \"lbvobox303.cern.ch\"}"
            content_type: None
            headers: {"destination": "/queue/lhcb.dirac.logging", "content-length": 298}
        Locals:
            keyword_headers: {}
    put (DIRAC/Resources/MessageQueue/StompMQConnector.py:166)
        Arguments:
            self: <StompMQConnector at 0x7f275dc5c410>
            message: {"levelname": "INFO", "message": "Got GUID ca170fea-912e-11ef-af36-fa163e5a69c7", "asctime": "2024-10-23 11:06:01,727334", "componentname": "DataManagement/FTS3Agent", "varmessage": "", "customname": "", "local_name": "submit/9186818/IN2P3_MC-DST_NIPNE-07_MC-DST", "hostname": "lbvobox303.cern.ch"}
            parameters: {"destination": "/queue/lhcb.dirac.logging"}
        Locals:
            log: <Logging at 0x7f275c275090>
            destination: "/queue/lhcb.dirac.logging"
    put (DIRAC/Resources/MessageQueue/MQProducer.py:20)
        Arguments:
            self: <MQProducer at 0x7f275dc64b50>
            msg: {"levelname": "INFO", "message": "Got GUID ca170fea-912e-11ef-af36-fa163e5a69c7", "asctime": "2024-10-23 11:06:01,727334", "componentname": "DataManagement/FTS3Agent", "varmessage": "", "customname": "", "local_name": "submit/9186818/IN2P3_MC-DST_NIPNE-07_MC-DST", "hostname": "lbvobox303.cern.ch"}
        Locals:
            result: {"OK": True, "Value": <StompMQConnector at 0x7f275dc5c410>}
            connector: <StompMQConnector at 0x7f275dc5c410>
    emit (DIRAC/FrameworkSystem/private/standardLogging/Handler/MessageQueueHandler.py:49)
        Arguments:
            self: <MessageQueueHandler at 0x7f275e0dc8d0>
            record: <LogRecord at 0x7f26e00ae890>
        Locals:
            strRecord: "{\"levelname\": \"INFO\", \"message\": \"Got GUID ca170fea-912e-11ef-af36-fa163e5a69c7\", \"asctime\": \"2024-10-23 11:06:01,727334\", \"componentname\": \"DataManagement/FTS3Agent\", \"varmessage\": \"\", \"customname\": \"\", \"local_name\": \"submit/9186818/IN2P3_MC-DST_NIPNE-07_MC-DST\", \"hostname\": \"lbvobox303.cern.ch\"}"
    handle (logging/__init__.py:978)
        Arguments:
            self: <MessageQueueHandler at 0x7f275e0dc8d0>
            record: <LogRecord at 0x7f26e00ae890>
        Locals:
            rv: True
    callHandlers (logging/__init__.py:1706)
        Arguments:
            self: <Logger at 0x7f275edd3e90>
            record: <LogRecord at 0x7f26e00ae890>
        Locals:
            c: <Logger at 0x7f275edd3e90>
            found: 2
            hdlr: <MessageQueueHandler at 0x7f275e0dc8d0>
    handle (logging/__init__.py:1644)
        Arguments:
            self: <Logger at 0x7f275edd3e90>
            record: <LogRecord at 0x7f26e00ae890>
    _log (logging/__init__.py:1634)
        Arguments:
            self: <Logger at 0x7f275edd3e90>
            level: 20
            msg: "%s"
            args: ("Got GUID ca170fea-912e-11ef-af36-fa163e5a69c7")
            exc_info: False
            extra: {"componentname": "DataManagement/FTS3Agent", "varmessage": "", "spacer": "", "customname": "", "headerIsShown": True, "timeStampIsShown": True, "contextIsShown": True, "threadIDIsShown": False, "color": True, "local_name": "submit/9186818/IN2P3_MC-DST_NIPNE-07_MC-DST"}
            stack_info: False
            stacklevel: 1
        Locals:
            sinfo: None
            fn: "(unknown file)"
            lno: 0
            func: "(unknown function)"
            record: <LogRecord at 0x7f26e00ae890>
    log (logging/__init__.py:1559)
        Arguments:
            self: <Logger at 0x7f275edd3e90>
            level: 20
            msg: "%s"
        Locals:
            args: ("Got GUID ca170fea-912e-11ef-af36-fa163e5a69c7")
            kwargs: {"exc_info": False, "extra": {"componentname": "DataManagement/FTS3Agent", "varmessage": "", "spacer": "", "customname": "", "headerIsShown": True, "timeStampIsShown": True, "contextIsShown": True, "threadIDIsShown": False, "color": True, "local_name": "submit/9186818/IN2P3_MC-DST_NIPNE-07_MC-DST"}}
    _createLogRecord (DIRAC/FrameworkSystem/private/standardLogging/Logging.py:421)
        Arguments:
....

@chrisburr
Copy link
Member Author

I think the problem is that we need to set a timeout in StompMQConnector.setupConnection.

@ryuwd
Copy link

ryuwd commented Oct 23, 2024

@ryuwd
Copy link

ryuwd commented Oct 23, 2024

@chrisburr chrisburr changed the title Services stuck Components stuck due to failure to send logs Oct 23, 2024
@chrisburr chrisburr added the Bug label Oct 23, 2024
@fstagni fstagni closed this as completed Oct 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants