Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added null check to prevent segfaults on disconnects while using websockets #33930

Conversation

JoshLee0915
Copy link
Contributor

While testing a websocket enabled server I started to experience some random crashes when players disconnected similar to what was reported on issue #33279. Looking over the logs the best I could tell is that the _peer_map was being modified between the ERR_FAIL_COND(!has_peer(p_peer_id)) check in disconnect_peer and the ERR_FAIL_COND_V(!has_peer(p_id), NULL) check in get_peer causing it to return a NULL causing the segfault.

To fix this I replaced the ERR_FAIL_COND(!has_peer(p_peer_id)); check in disconnect_peer with a null check similar to the change made on PR #31482. This does not fix the underlying race condition with _peer_map but the server now only logs the issue instead of out right crashing.

It should be noted that there where two areas in websocket_multiplayer_peer that also needed null checks but these where added on PR #31482 and #31483. This issue also occurs in 3.1 so cherry-picking this commit would help solve it there since it looks like PR #31482 and #31483 have already been moved to 3.1 as well.

@JoshLee0915 JoshLee0915 requested a review from a team as a code owner November 27, 2019 03:29
@Faless
Copy link
Collaborator

Faless commented Nov 27, 2019

Looking over the logs the best I could tell is that the _peer_map was being modified between the ERR_FAIL_COND(!has_peer(p_peer_id)) check in disconnect_peer and the ERR_FAIL_COND_V(!has_peer(p_id), NULL) check in get_peer causing it to return a NULL causing the segfault.

There should be no race condition there. Unless you are using threads yourself to poll the server/client.

I don't see how this fixes an issue, can you provide a minimum reproduction project that shows just this issue? And that this patch fixes it?

Note: as stated in the comments, #33279 is most likely related to #33788 , are you sure you are not experiencing that as well?

@JoshLee0915
Copy link
Contributor Author

There should be no race condition there.

I agree but I am just saying what we where seeing. As you can see from the log I posted on issue #33279 our system logs that a player disconnected followed by:

ERROR: get_peer: Condition ' !has_peer(p_id) ' is true. returned: __null
At: modules/websocket/lws_server.cpp:178.

Around the websocket networking code in 3.1 there where 3 locations that where accessing the result from get_peer directly that where not trying to access the server.

https://github.com/JoshLee0915/godot/blob/d51bce47deb9b291cf8b95381d8d191c716cad1b/modules/websocket/websocket_multiplayer_peer.cpp#L269-L271

https://github.com/JoshLee0915/godot/blob/d51bce47deb9b291cf8b95381d8d191c716cad1b/modules/websocket/websocket_multiplayer_peer.cpp#L321-L323

(You fixed both of these on PR #31482 and #31483)

https://github.com/JoshLee0915/godot/blob/d51bce47deb9b291cf8b95381d8d191c716cad1b/modules/websocket/lws_server.cpp#L194-L198

Stability improved when the first two where fixed but we did not fully see the segfaults go away till I added the null check to the disconnect_peer call. If you have a better idea on how it is succeeding that first peer id check but failing the one in get_peer I would be interested in hearing it, but all I can tell you is we where able to run the server for 20hrs nonstop after these fixes, where before it would usually crash between 15mins to 3hrs depending on traffic.

Unless you are using threads yourself to poll the server/client.

We are not using any threads or doing any polls ourselves. We are using the standard Godot High Level Multiplayer API.

Note: as stated in the comments, #33279 is most likely related to #33788 , are you sure you are not experiencing that as well?

I saw that but we are running a headless mono godot server in release and from you comments (correct me if I am wrong though) it only affected debug enabled exports.

I don't see how this fixes an issue, can you provide a minimum reproduction project that shows just this issue? And that this patch fixes it?

Sure, I can work on putting one together later this week, but testing it may be tricky. Like with issue #33279 it fails quite randomly and the frequency of failures depends on the frequency of connections and disconnections. We only found the problem after we put the server up live for people to test out.

I do not think it would be of much help to you but I can provide you our most recent logs as well if you wish to see them.

For Reference Full Log of Segfault I posted on issue #33279

Current player count: 5
Game Resumed
User: 1591479155 disconnected
Current player count: 4
Game Resumed
ERROR: get_peer: Condition ' !has_peer(p_id) ' is true. returned: __null
   At: modules/websocket/lws_server.cpp:178.

=================================================================
        Native Crash Reporting
=================================================================
Got a SIGSEGV while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries
used by your application.
=================================================================
/proc/self/maps:
402a9000-402b9000 rwxp 00000000 00:00 0
414bb000-414cb000 rwxp 00000000 00:00 0
41d21000-41d51000 rwxp 00000000 00:00 0
55c51d9fd000-55c52033c000 r-xp 00000000 fc:01 1553231                    /root/v                                                                                                                                                             sm-godot-server/bin/TestHeadlessServer/vsm_test_server
55c52053b000-55c5205c8000 r--p 0293e000 fc:01 1553231                    /root/v                                                                                                                                                             sm-godot-server/bin/TestHeadlessServer/vsm_test_server
55c5205c8000-55c5205d2000 rw-p 029cb000 fc:01 1553231                    /root/v                                                                                                                                                             sm-godot-server/bin/TestHeadlessServer/vsm_test_server
55c5205d2000-55c520600000 rw-p 00000000 00:00 0
55c520bf1000-55c52364c000 rw-p 00000000 00:00 0                          [heap]
7f5c60000000-7f5c60021000 rw-p 00000000 00:00 0
7f5c60021000-7f5c64000000 ---p 00000000 00:00 0
7f5c667c0000-7f5c66840000 rw-p 00000000 00:00 0
7f5c66844000-7f5c668c4000 rw-p 00000000 00:00 0
7f5c668c8000-7f5c66948000 rw-p 00000000 00:00 0
7f5c6694c000-7f5c669cc000 rw-p 00000000 00:00 0
7f5c669d0000-7f5c66a50000 rw-p 00000000 00:00 0
7f5c66a54000-7f5c66ad4000 rw-p 00000000 00:00 0
7f5c66ad8000-7f5c66b58000 rw-p 00000000 00:00 0
7f5c66b5c000-7f5c66bdc000 rw-p 00000000 00:00 0
7f5c66be0000-7f5c66c60000 rw-p 00000000 00:00 0
7f5c66c64000-7f5c66ce4000 rw-p 00000000 00:00 0
7f5c66ce8000-7f5c66d68000 rw-p 00000000 00:00 0
7f5c66d6c000-7f5c66dec000 rw-p 00000000 00:00 0
7f5c66df0000-7f5c66e70000 rw-p 00000000 00:00 0
7f5c66e74000-7f5c66ef4000 rw-p 00000000 00:00 0
7f5c66ef8000-7f5c66f78000 rw-p 00000000 00:00 0

=================================================================
        Native stacktrace:
=================================================================
        0x55c51e128be5 - ./vsm_test_server : (null)
        0x55c51e128f81 - ./vsm_test_server : (null)
        0x55c51e11b4d1 - ./vsm_test_server : (null)
        0x55c51e099ea1 - ./vsm_test_server : (null)
        0x55c51e4ea302 - ./vsm_test_server : _ZN24WebSocketMultiplayerPeer13_ser                                                                                                                                                             ver_relayEiiPKhj
        0x7ffce2eb2120 - Unknown

=================================================================
        Telemetry Dumper:
=================================================================
Pkilling 0x7f5c7575f700 from 0x7f5c77150780
Entering thread summarizer pause from 0x7f5c77150780
Finished thread summarizer pause from 0x7f5c77150780.

Waiting for dumping threads to resume

=================================================================
        External Debugger Dump:
=================================================================
mono_gdb_render_native_backtraces not supported on this platform, unable to find                                                                                                                                                              gdb or lldb

=================================================================
        Basic Fault Address Reporting
=================================================================
Memory around native instruction pointer (0x55c51e4ea302):0x55c51e4ea2f2  ff 90                                                                                                                                                              38 01 00 00 48 8b 3c 24 44 89 ea 4c 89 e6  ..8...H.<$D..L..
0x55c51e4ea302  48 8b 07 ff 90 b8 00 00 00 48 8b 3c 24 89 c3 48  H........H.<$..                                                                                                                                                             H
0x55c51e4ea312  85 ff 74 5c e8 45 14 3d 01 84 c0 74 53 48 8b 2c  ..t\.E.=...tSH.                                                                                                                                                             ,
0x55c51e4ea322  24 48 89 ef e8 e5 6f 3a 01 84 c0 74 43 48 8b 45  $H....o:...tCH.                                                                                                                                                             E

=================================================================
        Managed Stacktrace:
=================================================================
=================================================================
Aborted (core dumped)

@Faless
Copy link
Collaborator

Faless commented Nov 28, 2019

I saw that but we are running a headless mono godot server in release and from you comments (correct me if I am wrong though) it only affected debug enabled exports.

It's more consistent in debug enabled exports, but memory corruption might still exists in release builds.

I'm still not convinced this is the issue, sorry.
Can you try: #33788 and check that you are not getting this error when running your game:
unreference: Reference is locked and can't be freed. You are leaking memory! This will likely cause crashes in release builds.

@JoshLee0915
Copy link
Contributor Author

Sure I think we could try to test it with that fix. We are in the middle of testing it right now though but I think we can do it in the next couple of days or so.

@JoshLee0915
Copy link
Contributor Author

Ok I was able to run a test tonight since we are planning on bringing the server down for a bit anyway. Before I go into the results here is some background info:

We ran a total of 3 tests on our server and we crashed 3 times. Looking through the logs I did not see a notification that freeing references was failing due to them being locked. Here is the shortest of the 3 logs:

preemptive
Mono: Logfile is: /root/.local/share/godot/app_userdata/vsm-godot-server/mono/mono_logs/2019_12_04 01.30.44 (2594).txt
ERROR: set_locale: Unsupported locale 'C', falling back to 'en'.
   At: core/translation.cpp:944.
Starting Network Manager...
Loading config res://config.json...
res://config.json exists. Reading config...
res://config.json successfully parsed...
res://config.json loaded
Starting server in Web mode...
[2019/12/04 01:30:44:8655] NOTICE: Creating Vhost 'default' port 5050, 1 protocols, IPv6 on
Server started on Port: 5050
Game Paused
User: 1056127552 connected
Current player count: 1
Game Resumed
Syncing candy state for player: 1056127552...
State synced
User: 1461180859 connected
Current player count: 2
Game Resumed
Syncing candy state for player: 1461180859...
State synced
User: 1056127552 disconnected
Current player count: 1
Game Resumed
ERROR: get_peer: Condition ' !has_peer(p_id) ' is true. returned: __null
   At: modules/websocket/lws_server.cpp:178.

=================================================================
        Native Crash Reporting
=================================================================
Got a SIGSEGV while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries
used by your application.
=================================================================
/proc/self/maps:
400a6000-400b6000 rwxp 00000000 00:00 0
40252000-40302000 rwxp 00000000 00:00 0
40d95000-40da5000 rwxp 00000000 00:00 0
55b5a4f8b000-55b5a7b96000 r-xp 00000000 fc:01 1553263                    /root/vsm-godot-server/bin/TestHeadlessServer/vsm_test_server
55b5a7d96000-55b5a7e29000 r--p 02c0b000 fc:01 1553263                    /root/vsm-godot-server/bin/TestHeadlessServer/vsm_test_server
55b5a7e29000-55b5a7e33000 rw-p 02c9e000 fc:01 1553263                    /root/vsm-godot-server/bin/TestHeadlessServer/vsm_test_server
55b5a7e33000-55b5a7e61000 rw-p 00000000 00:00 0
55b5a87cf000-55b5aaaf1000 rw-p 00000000 00:00 0                          [heap]
7f0dcc000000-7f0dcc021000 rw-p 00000000 00:00 0
7f0dcc021000-7f0dd0000000 ---p 00000000 00:00 0
7f0dd4000000-7f0dd4025000 rw-p 00000000 00:00 0
7f0dd4025000-7f0dd8000000 ---p 00000000 00:00 0
7f0dd9dfe000-7f0dd9dff000 ---p 00000000 00:00 0
7f0dd9dff000-7f0dd9e00000 rw-p 00000000 00:00 0
7f0dd9e00000-7f0dd9e08000 ---p 00000000 00:00 0
7f0dd9e08000-7f0ddafff000 rw-p 00000000 00:00 0
7f0ddafff000-7f0ddb000000 ---p 00000000 00:00 0
7f0ddb000000-7f0ddbc00000 rw-p 00000000 00:00 0
7f0ddbdff000-7f0ddbe00000 ---p 00000000 00:00 0
7f0ddbe00000-7f0ddbe01000 rw-p 00000000 00:00 0
7f0ddbe01000-7f0ddbe09000 ---p 00000000 00:00 0
7f0ddbe09000-7f0ddc000000 rw-p 00000000 00:00 0
7f0ddc000000-7f0ddc021000 rw-p 00000000 00:00 0
7f0ddc021000-7f0de0000000 ---p 00000000 00:00 0
7f0de0000000-7f0de0021000 rw-p 00000000 00:00 0

=================================================================
        Native stacktrace:
=================================================================
        0x55b5a5760585 - ./vsm_test_server : (null)
        0x55b5a5760921 - ./vsm_test_server : (null)
        0x55b5a5752e71 - ./vsm_test_server : (null)
        0x55b5a56d1841 - ./vsm_test_server : (null)
        0x55b5a5c61172 - ./vsm_test_server : _ZN24WebSocketMultiplayerPeer13_server_relayEiiPKhj
        0x7ffc5cdf57a0 - Unknown

=================================================================
        Telemetry Dumper:
=================================================================
Pkilling 0x7f0dd9ffe700 from 0x7f0de9fcb780
Pkilling 0x7f0ddbfff700 from 0x7f0de9fcb780
Entering thread summarizer pause from 0x7f0de9fcb780
Finished thread summarizer pause from 0x7f0de9fcb780.
debugger-agent: Unable to listen on 7

Waiting for dumping threads to resume

=================================================================
        External Debugger Dump:
=================================================================
mono_gdb_render_native_backtraces not supported on this platform, unable to find gdb or lldb

=================================================================
        Basic Fault Address Reporting
=================================================================
Memory around native instruction pointer (0x55b5a5c61172):0x55b5a5c61162  ff 90 38 01 00 00 48 8b 3c 24 44 89 ea 4c 89 e6  ..8...H.<$D..L..
0x55b5a5c61172  48 8b 07 ff 90 b8 00 00 00 48 8b 3c 24 89 c3 48  H........H.<$..H
0x55b5a5c61182  85 ff 74 3c e8 55 5a 3f 01 84 c0 74 33 48 8b 3c  ..t<.UZ?...t3H.<
0x55b5a5c61192  24 e8 b8 e6 ff ff eb 28 66 0f 1f 44 00 00 48 8b  $......(f..D..H.

=================================================================
        Managed Stacktrace:
=================================================================
=================================================================
Aborted (core dumped)

We did see these messages starting to popup with debug enabled though:

NOTICE: Creating Vhost 'default' port 5050, 1 protocols, IPv6 on

NOTICE: error on reading from skt : 104

NOTICE: forbidding on uri sanitation

Here are all 3 full logs for you to review if you wish.

crashLog1.txt
crashLog2.txt
crashLog3.txt

@Faless
Copy link
Collaborator

Faless commented Dec 4, 2019

In your log I see:

=================================================================
        Native stacktrace:
=================================================================
        0x55b5a5760585 - ./vsm_test_server : (null)
        0x55b5a5760921 - ./vsm_test_server : (null)
        0x55b5a5752e71 - ./vsm_test_server : (null)
        0x55b5a56d1841 - ./vsm_test_server : (null)
        0x55b5a5c61172 - ./vsm_test_server : _ZN24WebSocketMultiplayerPeer13_server_relayEiiPKhj
        0x7ffc5cdf57a0 - Unknown

WebSocketMultiplayerPeer::_server_relay is the function that was fixed in #31482.
That PR was only cherry picked to the 3.1 branch via 53c2e2d on November 8th.
What commit was your test server based on?

@JoshLee0915
Copy link
Contributor Author

Oh sorry, I was in a rush to test it as I was going on vacation and I think I used an older branch that I forked by mistake. Ill double check it as soon as I get back from it.

@Faless
Copy link
Collaborator

Faless commented Jan 14, 2020

Closing, this does not address the real problem, which is related to #33290 and references being deleted during signal emission.

@Faless Faless closed this Jan 14, 2020
@JoshLee0915
Copy link
Contributor Author

Sorry I was unable to get back to this. I am happy the root issue was found though. Thank you for your time on this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants