Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

engine: android fdsan SIGABRT inside tor #2405

Closed
bassosimone opened this issue Feb 2, 2023 · 4 comments
Closed

engine: android fdsan SIGABRT inside tor #2405

bassosimone opened this issue Feb 2, 2023 · 4 comments
Assignees
Labels
bug Something isn't working interrupt needs investigation This issue needs extra data and investigation ooni/probe-engine ooni/probe-mobile issues related to OONI Probe mobile platform/android priority/high technical task technical tasks e.g. deployment testing

Comments

@bassosimone
Copy link
Contributor

bassosimone commented Feb 2, 2023

After fixing #2404, I noticed a crash inside the tor's codebase. The following is an edited version of the crash that occurred when running OONI Probe Android on my phone:

[notice] Catching signal TERM, exiting cleanly.
fdsan: attempted to close file descriptor 104, expected to be unowned, \
    actually owned by unique_fd 0x70b7e5f19c

[...]

ABI: 'arm64'
Timestamp: 2023-02-02 11:39:31.181519664+0100
Cmdline: org.openobservatory.ooniprobe.experimental
pid: 16472, tid: 16593, name: AsyncTask #1  >>> org.openobservatory.ooniprobe.experimental <<<
signal 6 (SIGABRT), code -1 (SI_QUEUE), fault addr --------
[...]
backtrace:
  #00 pc 0000000000055c48  .../lib64/bionic/libc.so (fdsan_error(char const*, ...)+556) (...)
  #01 pc 0000000000055954  .../lib64/bionic/libc.so (android_fdsan_close_with_tag+732) (...)
  #02 pc 00000000000560a8 .../lib64/bionic/libc.so (close+16) (...)
  #03 pc 00000000012ad08c  [...]/lib/arm64/libgojni.so (tor_main_configuration_free+128)

The crash occurs because fdsan notices that a file descriptor has been closed twice, and hence it aborts. We may want to relax this behavior, considering that we're integrating a large chunk of C code that we don't control directly.

The crash happens quite frequently at the end of vanilla_tor (but not after torsf).

To investigate the crash, I applied this diff:

diff --git a/src/core/mainloop/connection.c b/src/core/mainloop/connection.c
index cf25213cb1..d690de3892 100644
--- a/src/core/mainloop/connection.c
+++ b/src/core/mainloop/connection.c
@@ -149,6 +149,8 @@
 
 #include "core/or/congestion_control_flow.h"
 
+#include <stdio.h>
+
 /**
  * On Windows and Linux we cannot reliably bind() a socket to an
  * address and port if: 1) There's already a socket bound to wildcard
@@ -949,6 +951,7 @@ connection_free_minimal(connection_t *conn)
 
   if (SOCKET_OK(conn->s)) {
     log_debug(LD_NET,"closing fd %d.",(int)conn->s);
+    fprintf(stderr, "SBSDEBUG: connection_free_minimal %lld\n", (long long)conn->s);
     tor_close_socket(conn->s);
     conn->s = TOR_INVALID_SOCKET;
   }
diff --git a/src/feature/api/tor_api.c b/src/feature/api/tor_api.c
index 88e91ebfd5..fb49d92ad7 100644
--- a/src/feature/api/tor_api.c
+++ b/src/feature/api/tor_api.c
@@ -116,6 +116,11 @@ tor_main_configuration_setup_control_socket(tor_main_configuration_t *cfg)
   cfg_add_owned_arg(cfg, "__OwningControllerFD");
   cfg_add_owned_arg(cfg, buf);
 
+  fprintf(
+    stderr, "SBSDEBUG: tor_main_configuration_setup_control_socket %lld %lld\n",
+    (long long)fds[0], (long long)fds[1]
+  );
+
   cfg->owning_controller_socket = fds[1];
   return fds[0];
 }
@@ -132,6 +137,10 @@ tor_main_configuration_free(tor_main_configuration_t *cfg)
     raw_free(cfg->argv_owned);
   }
   if (SOCKET_OK(cfg->owning_controller_socket)) {
+    fprintf(
+      stderr, "SBSDEBUG: tor_main_configuration_free %lld\n",
+      (long long)cfg->owning_controller_socket
+    );
     raw_closesocket(cfg->owning_controller_socket);
   }
   raw_free(cfg);

This is the output in the logcat after running vanilla_tor (in a run in which fdsan did not crash):

SBSDEBUG: tor_main_configuration_setup_control_socket 94 98

[...]

[notice] Catching signal TERM, exiting cleanly.
SBSDEBUG: connection_free_minimal 141
SBSDEBUG: connection_free_minimal 98
SBSDEBUG: connection_free_minimal 116
SBSDEBUG: connection_free_minimal 152
SBSDEBUG: tor_main_configuration_free 98

The above output shows that we create a socketpair consisting of 94 (passed to the OONI engine) and 98 (used internally by tor). We also see how, after OONI closes the control connection, tor emulates receiving a SIGTERM and proceeds to close connections, including 98. Then, we see how 98 is closed again when freeing the config.

I am now wondering whether this issue is also causing issues in production. It's true that I recently rewrote how we compile tor, but this issue seems to be quite independent of the build mechanism we're using. It might be interesting to check whether the version of tor we were previously using also closed the fds when freeing the config.

@bassosimone bassosimone added bug Something isn't working testing ooni/probe-mobile issues related to OONI Probe mobile priority/high technical task technical tasks e.g. deployment platform/android interrupt needs investigation This issue needs extra data and investigation ooni/probe-engine labels Feb 2, 2023
@bassosimone bassosimone self-assigned this Feb 2, 2023
@bassosimone
Copy link
Contributor Author

I opened this issue in core/tor to discuss with the developer and see if I've got it correctly: https://gitlab.torproject.org/tpo/core/tor/-/issues/40747.

bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 2, 2023
bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 3, 2023
Discovered while trying to understand the reason why we
have an abort caused by Android's fdsan.

This is not the reason why we abort, but it seems correct
to fix this issue anyway.

Part of ooni/probe#2405
bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 3, 2023
Discovered while trying to understand the reason why we have an abort caused by Android's fdsan.

This is not the reason why we abort, but it seems correct to fix this issue anyway.

While there, notice that stopping to run libtorlinux for every PR (which we just did in 5ebcca2) was wrong. We need to run this workflow for every PR because it runs tests for the libtor integration.

Part of ooni/probe#2405
bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 3, 2023
This diff patches tor to avoid closing the controller socket twice.

See ooni/probe#2405.
bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 3, 2023
I want to understand why we did not see this error previously. My
initial hypothesis is that the code at ooni/go-libtor behaved behaved
differently than #1052.

To understand whether this is the case, in this patch I have copied
the code at libtor/libtor.go and adapted it such that we can use
it inside the current tree.

Part of ooni/probe#2405.
bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 3, 2023
It turns out previously we were not using the control conn. Using
such a conn is not the default and needs to be configured.

However, the code at #1052
tries to be immutable and hence has a different algorithm than the
one inside go-libtor. In particular, this new code always creates
a control connection and then closes it. So, definitely the new
code I wrote is such that it triggers the tor issue.

On the other end, to hope to see the issue with go-libtor we must
enable using the control conn. Let us do that and see.

Part of ooni/probe#2405.
bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 3, 2023
To observe any unexpected behavior using go-libtor-like code, we
need to stop fixing the issue inside of tor itself.

Let us instead replace such a patch with the testing patch adding
useful debugging statements while tor is running.

Part of ooni/probe#2405.
bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 3, 2023
It seems this diff is not necessary to see the real bug, but it
still seems correct to test with better code

Part of ooni/probe#2405
@bassosimone
Copy link
Contributor Author

bassosimone commented Feb 3, 2023

Prodded by @aanorbel, I investigated why we did not see this crash before. To this end, I authored and tested a draft pull request containing changes I am going to explain in a moment: ooni/probe-cli#1073.

My first attempt was ooni/probe-cli@49e22c8. In this commit, I copied the go-libtor implementation and arranged for using it instead of the implementation we have introduced in ooni/probe-cli#1052. With this change the code was working as intended. So, I tried to investigate why.

It turns out that with https://github.com/cretz/bine, you need to explicitly enable using the embedded control connection and we were not doing that when embedding. Because of how go-libtor is written, we do not create an embedded control connection in such a case. Conversely, when writing the code at ooni/probe-cli#1052, I wanted to write immutable code that did not store the context inside a structure, so my algorithm is different. The end result is that the code at ooni/probe-cli#1052 always creates (and closes) a control connection. This fact explains why we noticed the tor issue when we introduced the code at ooni/probe-cli#1052. Still, even after this commit, the code was working as intended. So, I tried to investigate why.

It turns out the patch for tor introduced at ooni/probe-cli#1070 was probably also mitigating any issue that could arise inside of go-libtor when using the embedded control conn. To be sure about this, I authored ooni/probe-cli@727b8fd, which replaced the fixing patch with another patch useful for debugging this kind of issues. At last, the code was crashing:

2023-02-03 16:33:22.976 32318-32474 GoLog pid-32318 [...] \
    SBSDEBUG: tor_main_configuration_setup_control_socket 141 142

[...]

[...] [notice] Catching signal TERM, exiting cleanly.
[...] SBSDEBUG: connection_free_minimal 142
[...] SBSDEBUG: connection_free_minimal 111
[...] fdsan: attempted to close file descriptor 142, expected to be unowned, \
    actually owned by unique_fd 0x70b7e7fc7c
[...] tor_main_configuration_free 142
[...] fdsan: attempted to close file descriptor 142, expected to be unowned, \
    actually owned by unique_fd 0x70b7e7fc7c

The app crashed and now we have a good explanation of why the previous code was not crashing. It would have sufficed to enable using the embedded control conn (as it ought to be) to experience a crash.

Speaking of this issue, it's interesting to see how the code I introduced in ooni/probe-cli#1052 exposed the crash but is also flexible enough to allow for patching tor when building it. This approach would not have been very easy when using go-libtor. All in all, this seems to me an extra argument in favor of running a traditional build rather than following the approach of go-libtor. Introducing patching inside the go-libtor build script would have been a bit more difficult than it has been with this set of build scripts.

FWIW, I also noticed that the original go-libtor code was leaking a file descriptor not closing the file returned by os.NewFile, so I additionally fixed the issue in ooni/probe-cli@aedceaf. Rerunning the test case in this configuration led to the same crashy outcome as before:

2023-02-03 14:58:31.469 30442-7552 GoLog pid-30442 [...] \
    SBSDEBUG: tor_main_configuration_setup_control_socket 65 111

[...]

[...] Bootstrapped 100% (done): Done
[...] Catching signal TERM, exiting cleanly.
[...] SBSDEBUG: connection_free_minimal 111
[...] SBSDEBUG: connection_free_minimal 105
[...] SBSDEBUG: connection_free_minimal 146
[...] fdsan: attempted to close file descriptor 111, expected to be unowned, \
    actually owned by unique_fd 0x70b7e6301c
[...] SBSDEBUG: tor_main_configuration_free 111
[...] fdsan: attempted to close file descriptor 111, expected to be unowned, \
    actually owned by unique_fd 0x70b7e6301c
[...] fatal signal 6 (SIGABRT), code -1 (SI_QUEUE) in tid 7544 (AsyncTask #1), \
    pid 30442 (be.experimental)

I expected this outcome, but I chose to run the test case nonetheless for completeness.

@bassosimone
Copy link
Contributor Author

Now that we have fully explained what happened, it is time to enable using the control connection for android and test whether this works as intended. I am not going to change the way in which go-libtor code works for iOS, but I am going to add a warning message inside the iOS codebase mentioning this issue.

bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 3, 2023
bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 3, 2023
bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 3, 2023
Discovered while trying to understand the reason why we have an abort caused by Android's fdsan.

This is not the reason why we abort, but it seems correct to fix this issue anyway.

While there, notice that stopping to run libtorlinux for every PR (which we just did in 5ebcca2) was wrong. We need to run this workflow for every PR because it runs tests for the libtor integration.

Part of ooni/probe#2405
bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 3, 2023
This diff patches tor to avoid closing the controller socket twice.

See ooni/probe#2405.
@bassosimone
Copy link
Contributor Author

The current stable version of ooniprobe, v3.17.5, builds Tor for Android with a patch that mitigates this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working interrupt needs investigation This issue needs extra data and investigation ooni/probe-engine ooni/probe-mobile issues related to OONI Probe mobile platform/android priority/high technical task technical tasks e.g. deployment testing
Projects
None yet
Development

No branches or pull requests

1 participant