
bobd freezing inside io_uring call #335

Closed
ikopylov opened this issue Sep 30, 2021 · 5 comments · Fixed by #357
Labels: bug (Something isn't working), investigation

Comments

@ikopylov (Member)

bobd sometimes freezes under unclear circumstances.

ikopylov added the bug and investigation labels on Sep 30, 2021

@idruzhitskiy (Contributor) commented Sep 30, 2021

Freeze log (3-node config, 2 nodes down):

2021-09-30 19:09:44  bob::cluster::quorum:133 DEBUG    PUT[ED04000000000000] ~~~TRY PUT TO REMOTE ALIENS FIRST~~~
2021-09-30 19:09:44  bob::cluster::quorum:141 TRACE    selection of free nodes available for data writing
2021-09-30 19:09:44  bob::cluster::quorum:143 DEBUG    PUT[ED04000000000000] sup put nodes: [third_node=192.168.1.0:20000]
2021-09-30 19:09:44  bob::cluster::quorum:150 DEBUG    PUT[ED04000000000000] additional alien requests: [(third_node=192.168.1.0:20000, PutOptions { remote_nodes: ["second_node"], force_node: true, overwrite: false })]
2021-09-30 19:09:44  bob::cluster::operations:221 DEBUG    Err(Output { node_name: "third_node", inner: Error { ctx: Failed("No active connection third_node=192.168.1.0:20000") } })
2021-09-30 19:09:44  bob::cluster::quorum:152 DEBUG    support nodes errors: [Output { node_name: "second_node", inner: Error { ctx: Failed("No active connection third_node=192.168.1.0:20000") } }]
2021-09-30 19:09:44  bob::cluster::quorum:159 DEBUG    need additional local alien copies: 1
2021-09-30 19:09:44  bob::cluster::operations:195 DEBUG    PUT[ED04000000000000] put to local alien: "second_node"
2021-09-30 19:09:44           bob::server: 82 TRACE    - - - - - SERVER PUT START - - - - -
2021-09-30 19:09:44           bob::server: 84 TRACE    process incoming put request /0.000ms/
2021-09-30 19:09:44           bob::server: 82 TRACE    - - - - - SERVER PUT START - - - - -
2021-09-30 19:09:44           bob::server: 89 TRACE    convert request into inner, /0.089ms/
2021-09-30 19:09:44           bob::server: 84 TRACE    process incoming put request /0.002ms/
2021-09-30 19:09:44           bob::server: 95 TRACE    extract params from request, /0.532ms/
2021-09-30 19:09:44           bob::server: 89 TRACE    convert request into inner, /0.497ms/
2021-09-30 19:09:44           bob::server:102 TRACE    PUT[F403000000000000] data size: 1024, /1.001ms/
2021-09-30 19:09:44           bob::server: 95 TRACE    extract params from request, /0.959ms/
2021-09-30 19:09:44          bob::grinder: 82 DEBUG    PUT[F403000000000000] will route to cluster
2021-09-30 19:09:44           bob::server:102 TRACE    PUT[0601000000000000] data size: 1024, /1.270ms/
2021-09-30 19:09:44  bob::cluster::quorum: 29 DEBUG    PUT[F403000000000000] ~~~PUT LOCAL NODE FIRST~~~
2021-09-30 19:09:44          bob::grinder: 82 DEBUG    PUT[0601000000000000] will route to cluster
2021-09-30 19:09:44  bob::cluster::quorum: 36 DEBUG    disk path is present, try put local
2021-09-30 19:09:44  bob::cluster::quorum: 29 DEBUG    PUT[0601000000000000] ~~~PUT LOCAL NODE FIRST~~~
2021-09-30 19:09:44  bob::cluster::operations:242 DEBUG    local node has vdisk replica, put local
2021-09-30 19:09:44  bob::cluster::quorum: 36 DEBUG    disk path is present, try put local
2021-09-30 19:09:44  bob::cluster::operations:242 DEBUG    local node has vdisk replica, put local
2021-09-30 19:09:44  bob::cluster::quorum: 44 DEBUG    PUT[F403000000000000] local node put successful
2021-09-30 19:09:44  bob::cluster::quorum: 49 DEBUG    PUT[F403000000000000] need at least 1 additional puts
2021-09-30 19:09:44  bob::cluster::quorum: 51 DEBUG    PUT[F403000000000000] ~~~PUT TO REMOTE NODES~~~
2021-09-30 19:09:44  bob::cluster::quorum:118 DEBUG    PUT[F403000000000000] cluster quorum put remote nodes 2 total target nodes
2021-09-30 19:09:44  bob::cluster::operations: 26 DEBUG    PUT[F403000000000000] put to second_node
2021-09-30 19:09:44  bob::cluster::operations: 81 DEBUG    total handles count: 1
2021-09-30 19:09:44  bob::cluster::quorum: 44 DEBUG    PUT[0601000000000000] local node put successful
2021-09-30 19:09:44  bob::cluster::operations: 37 DEBUG    handle returned
2021-09-30 19:09:44  bob::cluster::quorum: 49 DEBUG    PUT[0601000000000000] need at least 1 additional puts
2021-09-30 19:09:44  bob::cluster::operations: 42 ERROR    Output { node_name: "second_node", inner: Error { ctx: Failed("No active connection second_node=127.0.0.1:20001") } }
2021-09-30 19:09:44  bob::cluster::quorum: 51 DEBUG    PUT[0601000000000000] ~~~PUT TO REMOTE NODES~~~
2021-09-30 19:09:44  bob::cluster::operations: 66 DEBUG    ok_count/at_least: 0/1
2021-09-30 19:09:44  bob::cluster::quorum:118 DEBUG    PUT[0601000000000000] cluster quorum put remote nodes 2 total target nodes
2021-09-30 19:09:44  bob::cluster::operations: 83 DEBUG    remains: 0, errors: 1
2021-09-30 19:09:44  bob::cluster::operations: 26 DEBUG    PUT[0601000000000000] put to second_node
2021-09-30 19:09:44  bob::cluster::quorum: 62 WARN    PUT[F403000000000000] quorum was not reached. ok 1, quorum 2, errors: [Output { node_name: "second_node", inner: Error { ctx: Failed("No active connection second_node=127.0.0.1:20001") } }]
2021-09-30 19:09:44  bob::cluster::quorum:133 DEBUG    PUT[F403000000000000] ~~~TRY PUT TO REMOTE ALIENS FIRST~~~
2021-09-30 19:09:44  bob::cluster::operations: 81 DEBUG    total handles count: 1
2021-09-30 19:09:44  bob::cluster::quorum:141 TRACE    selection of free nodes available for data writing
2021-09-30 19:09:44  bob::cluster::quorum:143 DEBUG    PUT[F403000000000000] sup put nodes: [third_node=192.168.1.0:20000]
2021-09-30 19:09:44  bob::cluster::quorum:150 DEBUG    PUT[F403000000000000] additional alien requests: [(third_node=192.168.1.0:20000, PutOptions { remote_nodes: ["second_node"], force_node: true, overwrite: false })]
2021-09-30 19:09:44  bob::cluster::operations:221 DEBUG    Err(Output { node_name: "third_node", inner: Error { ctx: Failed("No active connection third_node=192.168.1.0:20000") } })
2021-09-30 19:09:44  bob::cluster::quorum:152 DEBUG    support nodes errors: [Output { node_name: "second_node", inner: Error { ctx: Failed("No active connection third_node=192.168.1.0:20000") } }]
2021-09-30 19:09:44  bob::cluster::quorum:159 DEBUG    need additional local alien copies: 1
2021-09-30 19:09:44  bob::cluster::operations:195 DEBUG    PUT[F403000000000000] put to local alien: "second_node"
2021-09-30 19:09:44  bob::cluster::operations: 37 DEBUG    handle returned
2021-09-30 19:09:44  bob::cluster::operations: 42 ERROR    Output { node_name: "second_node", inner: Error { ctx: Failed("No active connection second_node=127.0.0.1:20001") } }
2021-09-30 19:09:44  bob::cluster::operations: 66 DEBUG    ok_count/at_least: 0/1
2021-09-30 19:09:44  bob::cluster::operations: 83 DEBUG    remains: 0, errors: 1
2021-09-30 19:09:44  bob::cluster::quorum: 62 WARN    PUT[0601000000000000] quorum was not reached. ok 1, quorum 2, errors: [Output { node_name: "second_node", inner: Error { ctx: Failed("No active connection second_node=127.0.0.1:20001") } }]
2021-09-30 19:09:44  bob::cluster::quorum:133 DEBUG    PUT[0601000000000000] ~~~TRY PUT TO REMOTE ALIENS FIRST~~~
2021-09-30 19:09:44  bob::cluster::quorum:141 TRACE    selection of free nodes available for data writing
2021-09-30 19:09:44  bob::cluster::quorum:143 DEBUG    PUT[0601000000000000] sup put nodes: [third_node=192.168.1.0:20000]
2021-09-30 19:09:44  bob::cluster::quorum:150 DEBUG    PUT[0601000000000000] additional alien requests: [(third_node=192.168.1.0:20000, PutOptions { remote_nodes: ["second_node"], force_node: true, overwrite: false })]
2021-09-30 19:09:44  bob::cluster::operations:221 DEBUG    Err(Output { node_name: "third_node", inner: Error { ctx: Failed("No active connection third_node=192.168.1.0:20000") } })
2021-09-30 19:09:44  bob::cluster::quorum:152 DEBUG    support nodes errors: [Output { node_name: "second_node", inner: Error { ctx: Failed("No active connection third_node=192.168.1.0:20000") } }]
2021-09-30 19:09:44  bob::cluster::quorum:159 DEBUG    need additional local alien copies: 1
2021-09-30 19:09:44  bob::cluster::operations:195 DEBUG    PUT[0601000000000000] put to local alien: "second_node"
2021-09-30 19:09:45     bob::link_manager: 31 ERROR    No connection to third_node:[192.168.1.0:20000] - transport error
2021-09-30 19:09:45     bob::link_manager: 43 INFO    Node status: first_node second_node third_node 
2021-09-30 19:09:47           bob::server:187 DEBUG    PING
2021-09-30 19:09:47     bob::link_manager: 31 ERROR    No connection to second_node:[127.0.0.1:20001] - transport error
2021-09-30 19:09:48     bob::link_manager: 31 ERROR    No connection to third_node:[192.168.1.0:20000] - transport error
2021-09-30 19:09:48     bob::link_manager: 43 INFO    Node status: first_node second_node third_node 
2021-09-30 19:09:52           bob::server:187 DEBUG    PING
2021-09-30 19:09:52     bob::link_manager: 31 ERROR    No connection to second_node:[127.0.0.1:20001] - transport error
2021-09-30 19:09:55     bob::link_manager: 31 ERROR    No connection to third_node:[192.168.1.0:20000] - transport error
2021-09-30 19:09:55     bob::link_manager: 43 INFO    Node status: first_node second_node third_node 
2021-09-30 19:09:57           bob::server:187 DEBUG    PING
2021-09-30 19:09:57     bob::link_manager: 31 ERROR    No connection to second_node:[127.0.0.1:20001] - transport error
2021-09-30 19:09:58     bob::link_manager: 31 ERROR    No connection to third_node:[192.168.1.0:20000] - transport error
2021-09-30 19:09:58     bob::link_manager: 43 INFO    Node status: first_node second_node third_node 
2021-09-30 19:10:02           bob::server:187 DEBUG    PING
2021-09-30 19:10:02     bob::link_manager: 31 ERROR    No connection to second_node:[127.0.0.1:20001] - transport error
2021-09-30 19:10:05     bob::link_manager: 31 ERROR    No connection to third_node:[192.168.1.0:20000] - transport error
2021-09-30 19:10:05     bob::link_manager: 43 INFO    Node status: first_node second_node third_node 
2021-09-30 19:10:07           bob::server:187 DEBUG    PING
2021-09-30 19:10:07     bob::link_manager: 31 ERROR    No connection to second_node:[127.0.0.1:20001] - transport error
2021-09-30 19:10:08     bob::link_manager: 31 ERROR    No connection to third_node:[192.168.1.0:20000] - transport error
2021-09-30 19:10:08     bob::link_manager: 43 INFO    Node status: first_node second_node third_node 
^C2021-09-30 19:10:09        rocket::server:440 WARN    Received SIGINT. Requesting shutdown.
2021-09-30 19:10:09  pearl::storage::observer_worke: 26 WARN    active blob will no longer be updated, shutdown the system
2021-09-30 19:10:09  pearl::storage::observer_worke: 26 WARN    active blob will no longer be updated, shutdown the system
2021-09-30 19:10:09  pearl::storage::observer_worke: 27 WARN    ActiveBlobNotSet
2021-09-30 19:10:09  pearl::storage::observer_worke: 27 WARN    ActiveBlobNotSet
2021-09-30 19:10:09  pearl::storage::observer_worke: 26 WARN    active blob will no longer be updated, shutdown the system
2021-09-30 19:10:09  pearl::storage::observer_worke: 27 WARN    ActiveBlobNotSet
2021-09-30 19:10:09  pearl::storage::observer_worke: 26 WARN    active blob will no longer be updated, shutdown the system
2021-09-30 19:10:09  pearl::storage::observer_worke: 26 WARN    active blob will no longer be updated, shutdown the system
2021-09-30 19:10:09  pearl::storage::observer_worke: 26 WARN    active blob will no longer be updated, shutdown the system
2021-09-30 19:10:09  pearl::storage::observer_worke: 27 WARN    ActiveBlobNotSet
2021-09-30 19:10:09  pearl::storage::observer_worke: 27 WARN    ActiveBlobNotSet
2021-09-30 19:10:09  pearl::storage::observer_worke: 27 WARN    ActiveBlobNotSet
2021-09-30 19:10:09  pearl::storage::observer_worke: 26 WARN    active blob will no longer be updated, shutdown the system
2021-09-30 19:10:09  pearl::storage::observer_worke: 27 WARN    ActiveBlobNotSet
2021-09-30 19:10:09  pearl::storage::observer_worke: 26 WARN    active blob will no longer be updated, shutdown the system
2021-09-30 19:10:09  pearl::storage::observer_worke: 27 WARN    ActiveBlobNotSet
2021-09-30 19:10:09  pearl::storage::observer_worke: 26 WARN    active blob will no longer be updated, shutdown the system
2021-09-30 19:10:09  pearl::storage::observer_worke: 27 WARN    ActiveBlobNotSet
2021-09-30 19:10:12           bob::server:187 DEBUG    PING
2021-09-30 19:10:12     bob::link_manager: 31 ERROR    No connection to second_node:[127.0.0.1:20001] - transport error
2021-09-30 19:10:14        rocket::server:473 WARN    Server failed to shutdown cooperatively.
2021-09-30 19:10:14        rocket::server:474 WARN    Server is executing inside of a custom runtime.
2021-09-30 19:10:14        rocket::server:476 WARN    Refusing to terminate runaway custom runtime.
2021-09-30 19:10:15     bob::link_manager: 31 ERROR    No connection to third_node:[192.168.1.0:20000] - transport error
2021-09-30 19:10:15     bob::link_manager: 43 INFO    Node status: first_node second_node third_node 
2021-09-30 19:10:17           bob::server:187 DEBUG    PING
2021-09-30 19:10:17     bob::link_manager: 31 ERROR    No connection to second_node:[127.0.0.1:20001] - transport error
^C2021-09-30 19:10:17        rocket::server:438 WARN    Received SIGINT. Shutdown already in progress.
2021-09-30 19:10:18     bob::link_manager: 31 ERROR    No connection to third_node:[192.168.1.0:20000] - transport error
2021-09-30 19:10:18     bob::link_manager: 43 INFO    Node status: first_node second_node third_node 
Killed

@ikopylov (Member, Author)

io_uring is the root cause of the problem. @idruzhitskiy, please add more details.

ikopylov changed the title from "Investigate and fix bobd freezing issue" to "bobd freezing inside io_uring call" on Oct 28, 2021

@idruzhitskiy (Contributor)

According to the debugger, at the time of the freeze multiple threads are waiting for the ENTER syscall to return (https://github.com/spacejam/rio/blob/master/src/io_uring/syscall.rs#L57) with the arguments fd=..., to_submit=0, min_complete=1, flags=1, sig=0x0. Disabling AIO removes the problem.
Tested with the latest version of rio.
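
For context, a minimal Rust sketch (not rio's actual code; the function name is made up and the syscall number assumes x86_64) of the call shape reported above. With to_submit=0, min_complete=1 and flags=IORING_ENTER_GETEVENTS (value 1), io_uring_enter asks the kernel to block the caller until at least one completion event is available, so if a submitted operation never completes, the thread never returns from this syscall:

// Hypothetical sketch of the blocked call, not rio's implementation.
// Assumes x86_64 (io_uring_enter = syscall 426) and the libc crate.
const SYS_IO_URING_ENTER: libc::c_long = 426;
const IORING_ENTER_GETEVENTS: libc::c_uint = 1; // flags = 1 in the debugger output

/// Waits until at least one CQE is available on `ring_fd`.
/// With min_complete = 1 and GETEVENTS set, the kernel parks the caller
/// until a completion arrives; if it never does, the thread hangs here.
unsafe fn wait_for_one_completion(ring_fd: libc::c_int) -> libc::c_long {
    libc::syscall(
        SYS_IO_URING_ENTER,
        ring_fd,                            // fd of the io_uring instance
        0 as libc::c_uint,                  // to_submit = 0: nothing new to submit
        1 as libc::c_uint,                  // min_complete = 1: wait for one CQE
        IORING_ENTER_GETEVENTS,             // flags = 1
        std::ptr::null::<libc::sigset_t>(), // sig = 0x0
        0 as libc::size_t,                  // sigsz (sig is NULL)
    )
}

This would also be consistent with the observation that disabling AIO removes the hang: without io_uring the code never reaches this blocking wait.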

@agend (Member) commented Oct 29, 2021

Is it a bug in the kernel?

@ikopylov (Member, Author) commented Nov 5, 2021

@agend Either in the kernel or in rio; we don't know for sure.
