engine: android fdsan SIGABRT inside tor #2405
I opened this issue in core/tor to discuss with the developers and check whether I have understood the problem correctly: https://gitlab.torproject.org/tpo/core/tor/-/issues/40747.
Discovered while trying to understand the reason why we have an abort caused by Android's fdsan. This is not the reason why we abort, but it seems correct to fix this issue anyway. Part of ooni/probe#2405
Discovered while trying to understand the reason why we have an abort caused by Android's fdsan. This is not the reason why we abort, but it seems correct to fix this issue anyway. While there, notice that we were wrong to stop running libtorlinux for every PR (which we just did in 5ebcca2). We need to run this workflow for every PR because it runs tests for the libtor integration. Part of ooni/probe#2405
This diff patches tor to avoid closing the controller socket twice. See ooni/probe#2405.
I want to understand why we did not see this error previously. My initial hypothesis is that the code at ooni/go-libtor behaved differently than #1052. To understand whether this is the case, in this patch I have copied the code at libtor/libtor.go and adapted it such that we can use it inside the current tree. Part of ooni/probe#2405.
It turns out that previously we were not using the control conn. Using such a conn is not the default and needs to be configured. However, the code at #1052 tries to be immutable and hence has a different algorithm than the one inside go-libtor. In particular, this new code always creates a control connection and then closes it. So the new code I wrote definitely triggers the tor issue. On the other hand, to have a chance of seeing the issue with go-libtor, we must enable using the control conn. Let us do that and see. Part of ooni/probe#2405.
To observe any unexpected behavior using go-libtor-like code, we need to stop fixing the issue inside of tor itself. Let us instead replace such a patch with the testing patch adding useful debugging statements while tor is running. Part of ooni/probe#2405.
It seems this diff is not necessary to see the real bug, but it still seems correct to test with better code. Part of ooni/probe#2405
Prodded by @aanorbel, I investigated why we did not see this crash before. To this end, I authored and tested a draft pull request containing changes I am going to explain in a moment: ooni/probe-cli#1073.

My first attempt was ooni/probe-cli@49e22c8. In this commit, I copied the go-libtor implementation and arranged for using it instead of the implementation we introduced in ooni/probe-cli#1052. With this change the code was working as intended, so I tried to investigate why. It turns out that with https://github.com/cretz/bine, you need to explicitly enable using the embedded control connection, and we were not doing that when embedding. Because of how go-libtor is written, we do not create an embedded control connection in such a case. Conversely, when writing the code at ooni/probe-cli#1052, I wanted to write immutable code that did not store the context inside a structure, so my algorithm is different. The end result is that the code at ooni/probe-cli#1052 always creates (and closes) a control connection. This fact explains why we noticed the tor issue when we introduced the code at ooni/probe-cli#1052.

Still, even after enabling the embedded control connection in the go-libtor-like code, the code was working as intended, so I again tried to investigate why. It turns out the patch for tor introduced at ooni/probe-cli#1070 was probably also mitigating any issue that could arise inside of go-libtor when using the embedded control conn. To be sure about this, I authored ooni/probe-cli@727b8fd, which replaced the fixing patch with another patch useful for debugging this kind of issue. At last, the code was crashing:
The app crashed, and now we have a good explanation of why the previous code was not crashing. It would have sufficed to enable using the embedded control conn (as it ought to be) to experience a crash.

Speaking of this issue, it's interesting to see how the code I introduced in ooni/probe-cli#1052 exposed the crash but is also flexible enough to allow for patching tor when building it. Introducing patching inside the go-libtor build script would have been a bit more difficult than it has been with this set of build scripts. All in all, this seems to me an extra argument in favor of running a traditional build rather than following the approach of go-libtor.

FWIW, I also noticed that the original go-libtor code was leaking a file descriptor by not closing the file returned by os.NewFile, so I additionally fixed the issue in ooni/probe-cli@aedceaf. Rerunning the test case in this configuration led to the same crashy outcome as before:
I expected this outcome, but I chose to run the test case nonetheless for completeness.
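For readers unfamiliar with the os.NewFile leak mentioned above, here is a minimal sketch of the pattern, not the actual go-libtor or probe-cli code. It assumes a Unix-like system; `connFromFD` and `roundTrip` are hypothetical names introduced only for illustration. The key point is that `net.FileConn` works on a duplicate of the descriptor, so the `*os.File` returned by `os.NewFile` must still be closed.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"os"
	"syscall"
)

// connFromFD (hypothetical helper) wraps a raw socket descriptor into a
// net.Conn. net.FileConn operates on a duplicate of the descriptor, so
// the *os.File owning the original fd must be closed here; otherwise
// the original descriptor leaks.
func connFromFD(fd int, name string) (net.Conn, error) {
	f := os.NewFile(uintptr(fd), name)
	if f == nil {
		return nil, fmt.Errorf("invalid file descriptor: %d", fd)
	}
	defer f.Close() // closes the original fd; the conn keeps its own copy
	return net.FileConn(f)
}

// roundTrip (hypothetical helper) creates a socketpair, wraps both ends,
// and sends msg from one end to the other.
func roundTrip(msg string) (string, error) {
	fds, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_STREAM, 0)
	if err != nil {
		return "", err
	}
	left, err := connFromFD(fds[0], "left")
	if err != nil {
		syscall.Close(fds[1]) // do not leak the other end on failure
		return "", err
	}
	defer left.Close()
	right, err := connFromFD(fds[1], "right")
	if err != nil {
		return "", err
	}
	defer right.Close()
	if _, err := left.Write([]byte(msg)); err != nil {
		return "", err
	}
	buf := make([]byte, len(msg))
	if _, err := io.ReadFull(right, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}

func main() {
	got, err := roundTrip("ping")
	fmt.Println(got, err)
}
```

Closing the file inside the helper keeps ownership of each descriptor in a single place, which is the property that the double-close discussion in this issue is about.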
Now that we have fully explained what happened, it is time to enable using the control connection for Android and test whether this works as intended. I am not going to change the way in which the go-libtor code works for iOS, but I am going to add a warning message inside the iOS codebase mentioning this issue.
The current stable version of ooniprobe, v3.17.5, builds Tor for Android with a patch that mitigates this issue.
After fixing #2404, I noticed a crash inside tor's codebase. The following is an edited version of the crash that occurred when running OONI Probe Android on my phone:
The crash occurs because fdsan notices that a file descriptor has been closed twice, and hence it aborts. We may want to relax this behavior, considering that we're integrating a large chunk of C code that we don't control directly.
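As an aside, a double close is a real hazard in C code because the kernel may hand the same descriptor number to another part of the process between the two calls, so the second close can hit an unrelated file. The following minimal sketch is only an analogy and not fdsan itself: it shows how Go's `*os.File` tracks ownership and rejects a second `Close` with an error instead of touching the descriptor again. The names `closeTwice` and `secondCloseRejected` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// closeTwice opens a pipe and closes its read end twice, returning
// the errors produced by the first and the second Close call.
func closeTwice() (first, second error) {
	r, w, err := os.Pipe()
	if err != nil {
		return err, err
	}
	defer w.Close()
	first = r.Close()
	second = r.Close() // the file is already closed, so this returns an error
	return first, second
}

// secondCloseRejected reports whether the first Close succeeded and
// the second one failed with os.ErrClosed.
func secondCloseRejected() bool {
	first, second := closeTwice()
	return first == nil && errors.Is(second, os.ErrClosed)
}

func main() {
	first, second := closeTwice()
	fmt.Println("first close:", first)
	fmt.Println("second close:", second)
	fmt.Println("second close rejected:", secondCloseRejected())
}
```

Where Go reports an error, fdsan aborts the process, which is why the double close shows up as a SIGABRT on Android.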
The crash happens quite frequently at the end of `vanilla_tor` (but not after `torsf`).

To investigate the crash, I applied this diff:

This is the output in the logcat after running `vanilla_tor` (in a run in which `fdsan` did not crash):

The above output shows that we create a socketpair consisting of `94` (passed to the OONI engine) and `98` (used internally by `tor`). We also see how, after OONI closes the control connection, `tor` emulates receiving a `SIGTERM` and proceeds to close connections, including `98`. Then, we see how `98` is closed again when freeing the config.

I am now wondering whether this issue is also causing issues in production. It's true that I recently rewrote how we compile `tor`, but this issue seems to be quite independent of the build mechanism we're using. It might be interesting to check whether the version of `tor` we were previously using also closed the fds when freeing the config.