Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Windows a deadlock on nng_close() #1827

Closed
alzix opened this issue Apr 24, 2024 · 13 comments
Closed

Windows a deadlock on nng_close() #1827

alzix opened this issue Apr 24, 2024 · 13 comments

Comments

@alzix
Copy link
Contributor

alzix commented Apr 24, 2024

Describe the bug
When calling nng_close() there is sometimes a deadlock which causes nng_close() to hang.
This happens also when using only sync APIs (no AIOs).

nng_close should finish successfully.

Actual Behavior

nng_close() hangs.

To Reproduce

I created a modified version of the reqrep example code here (I use it with IPC transport): https://gist.github.com/mikisch81/428c4ad87afcc1c8881b282cd5e80eb3

In the modified example in the client code right before calling nng_recv for the reply from the server I start a thread which just calls nng_close, after a couple of successful runs the deadlock happen:

mischwar@tlv-mpawy reqrep % ./reqrep client ipc:///tmp/reqrep_test
1712848018 - CLIENT: SENDING DATE REQUEST
1712848018 - CLIENT: WAITING FOR REPLY
1712848018 - CLIENT: CLOSING SOCKET
nng_recv error - Object closed
...
...
1712848018 - CLIENT: SENDING DATE REQUEST
1712848018 - CLIENT: WAITING FOR REPLY
1712848018 - CLIENT: CLOSING SOCKET
nng_recv error - Object closed
1712848018 - CLIENT: SENDING DATE REQUEST
1712848018 - CLIENT: WAITING FOR REPLY
1712848018 - CLIENT: CLOSING SOCKET. <--- The thread is suck in nng_close here
Environment Details

NNG version: 1.7.3
Operating system and version: MacOS Sonoma 14.4.1 (but also happens on Windows and Linux)
Compiler and language used: C/C++ clang
Shared or static library - static

Additional context
image

it seems that on Windows it is a TOCTOU issue in the nni_sock_shutdown

it is stuck on

// We have to wait for pipes to be removed.
while (!nni_list_empty(&sock->s_pipes)) {
=>	nni_cv_wait(&sock->s_cv);
}

while s_pipes is already empty
image

@leowang552
Copy link

image

SHA-1: aac4dc3

node->**ln_prev** is 0xFFFFFFFFFFFFFFFF

https://gist.github.com/mikisch81/428c4ad87afcc1c8881b282cd5e80eb3
server side
vs 2022 debug version
sdk 10.0.19041.0
64bit win 10

@leowang552
Copy link

void SetThreadsNng()
{
	nng_init_set_parameter(NNG_INIT_MAX_TASK_THREADS, 1);
	nng_init_set_parameter(NNG_INIT_MAX_EXPIRE_THREADS, 1);

	nng_init_set_parameter(NNG_INIT_MAX_POLLER_THREADS, 1);
	nng_init_set_parameter(NNG_INIT_NUM_RESOLVER_THREADS, 1);
}

add above code for easy debugging

@alzix
Copy link
Contributor Author

alzix commented Apr 25, 2024

I also reproduced the crash reported by @leowang552 a couple of times

@gdamore
Copy link
Contributor

gdamore commented Apr 25, 2024

The crash is a different issue... it's a sign that we're trying to remove from a list when already not on the list.

Let's open another ticket for it. (And I'd like to know whether that happens only on IPC on Windows, or does it happen either on other platforms or for non-IPC connections -- because the Windows named pipe IPC is totally unlike the other transports.)

@gdamore
Copy link
Contributor

gdamore commented Apr 25, 2024

As to this bug, I have a PR out that I'd appreciate it if folks could try out.

@leowang552
Copy link

I see the bug not easily reproduced if the computer performance is high.
Still not reproduced after running 1 hour.
This PC:
11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
64G RAM

@alzix
Copy link
Contributor Author

alzix commented Apr 25, 2024

hi @gdamore, i've tested the proposed changes from #1828 92fd162 and it seems to not solve the issue

image
image

@gdamore
Copy link
Contributor

gdamore commented Apr 25, 2024

Does this problem only happen with Windows IPC?

@gdamore
Copy link
Contributor

gdamore commented Apr 27, 2024

@alzix it seems you actually have an item on the pipes list. This is different than the reported issue before that the pipe list was empty.

Again, I am wondering if this is exclusive to the IPC dialer.

@gdamore
Copy link
Contributor

gdamore commented Apr 27, 2024

Ah I see the original problem was reported on macos.

@alzix
Copy link
Contributor Author

alzix commented Apr 27, 2024

In my project I use nng for IPC this I cannot provide any input on other protocols.
I noticed the stuck frame was different on my last check - but the check was based on #1828 pr.
Perhaps it solved once issue but uncovered another.
Please notice that reqrep server was used for test, this the problem is probably in listener and not in the dialer

@gdamore
Copy link
Contributor

gdamore commented Apr 27, 2024

Interesting. Ok, I have pushed another change to the same PR, which attempts to fix a possible issue if the pipes are closed during negotiation (which likely is happening here.)

If you can try that PR branch again, I'd be grateful @alzix .

@alzix
Copy link
Contributor Author

alzix commented Apr 28, 2024

did - left my insights in PR comments

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants