engineresolver: bootstrap failure b/c of lookup w/ canceled context #2544
Labels: bug (Something isn't working), needs investigation (This issue needs extra data and investigation), ooni/probe-engine, priority/high, releaseBlocker (This issue blocks releasing)
Comments
bassosimone added the bug, priority/high, needs investigation, ooni/probe-engine, and releaseBlocker labels on Sep 29, 2023.
bassosimone changed the title from "engineresolver: don't lookup if context is canceled" to "engineresolver: bootstrap failure because of lookup w/ canceled context" on Sep 29, 2023.
bassosimone changed the title from "engineresolver: bootstrap failure because of lookup w/ canceled context" to "engineresolver: bootstrap failure b/c of lookup w/ canceled context" on Sep 29, 2023.
bassosimone added a commit to ooni/probe-cli that referenced this issue on Oct 9, 2023:

Not logging when an operation is starting complicates reasoning about how much time it took to fail, which is something that caused issues while investigating ooni/probe#2544.
bassosimone added another commit with the same message to ooni/probe-cli, also referencing this issue, on Oct 9, 2023.
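As a rough illustration of the logging point made in those commit messages, here is a minimal Go sketch (a hypothetical helper, not the actual probe-cli code) that logs when an operation starts and, on failure, how long it took to fail:

```go
package main

import (
	"errors"
	"log"
	"time"
)

// runOperation is a hypothetical helper: it logs when an operation starts
// and, if the operation fails, how long it took to fail, so that slow
// failures are easy to reason about from the logs alone.
func runOperation(name string, op func() error) error {
	log.Printf("%s: starting", name)
	start := time.Now()
	if err := op(); err != nil {
		log.Printf("%s: failed after %v: %v", name, time.Since(start), err)
		return err
	}
	log.Printf("%s: done in %v", name, time.Since(start))
	return nil
}

func main() {
	_ = runOperation("iplookup", func() error {
		time.Sleep(100 * time.Millisecond) // stand-in for real work
		return errors.New("generic_timeout_error")
	})
}
```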
bassosimone added a commit to ooni/probe-cli that referenced this issue on Oct 9, 2023:

This diff ensures that the bootstrap is more robust in the scenario described by ooni/probe#2544. Because we're close to a release, I did not want to write a heavy refactor of the bootstrap, and specifically of the iplookup part, but I documented what is needed at ooni/probe#2551. This diff has been tested in a problematic network under two distinct conditions: 1. cold bootstrap (i.e., w/o prior knowledge), where the correct fix is to give the system resolver intermediate priority; 2. hot bootstrap with a messed-up config, where the correct fix is to use a large coarse-grained timeout. The fix that prevents a canceled context from marking a resolver as failed helps in both scenarios. As you can see, I removed a "reliability fix", which actually was more of an optimization. This removal means that the probe bootstrap is now slower than it could be, hence my choice of documenting what I'd do had I had more time to work on this topic.
bassosimone added a commit to ooni/probe-cli that referenced this issue on Oct 9, 2023:

This diff ensures that the bootstrap is more robust in the scenario described by ooni/probe#2544. Because we're close to a release, I did not want to write a heavy refactor of the bootstrap, and specifically of the iplookup part, but I documented what is needed at ooni/probe#2551. This diff has been tested in a problematic network under two distinct conditions: 1. cold bootstrap (i.e., w/o prior knowledge), where the correct fix is to give the system resolver intermediate priority (i.e., 0.5); 2. hot bootstrap with a badly messed-up `sessionresolver.state`, where the correct fix is to use a larger coarse-grained timeout, such that _eventually_ we try the system resolver. The fix that prevents marking a resolver as failed when the external context has been canceled is required in both of these scenarios. As you can see, I removed a "reliability fix", which actually was more of an optimization. This removal means that the probe bootstrap is now slower than it could be, hence my choice of documenting in ooni/probe#2551 what I'd do had I had more time to work on this topic. BTW, I had to create an overall 45-second timeout for IP lookups because we have 7 DNS-over-HTTPS resolvers plus the system resolver. This means that, in the worst case where the system resolver has the least priority, we expect 7*4 = 28 seconds of timeouts before eventually using the system resolver. The rest of the 45s timeout accounts for operations happening after the DNS lookup has completed.
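Conceptually, the fix boils down to not letting a context-driven failure count against a resolver. Below is a minimal sketch of that idea in Go; the `scoreboard` type and `lookupWithResolver` helper are hypothetical illustrations, not the actual engineresolver code:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// scoreboard is a hypothetical stand-in for the structure tracking which
// resolvers appear to be working.
type scoreboard map[string]bool

// lookupWithResolver runs a lookup and only updates the scoreboard when a
// failure is attributable to the resolver itself rather than to the parent
// context being canceled or past its deadline.
func lookupWithResolver(
	ctx context.Context,
	sb scoreboard,
	resolver string,
	lookupHost func(ctx context.Context, domain string) ([]string, error),
) ([]string, error) {
	// If the parent context is already done, don't even start the lookup:
	// it would fail for reasons unrelated to the resolver's health.
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	addrs, err := lookupHost(ctx, "example.com")
	if err != nil {
		// A context-driven failure says nothing about the resolver.
		if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
			return nil, err
		}
		sb[resolver] = false // genuine failure: mark the resolver as broken
		return nil, err
	}
	sb[resolver] = true
	return addrs, nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	sb := scoreboard{}
	_, err := lookupWithResolver(ctx, sb, "system",
		func(ctx context.Context, domain string) ([]string, error) {
			<-ctx.Done() // simulate a lookup cut short by the parent timeout
			return nil, ctx.Err()
		})
	fmt.Println(err, sb) // scoreboard stays empty: the resolver is not penalized
}
```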
I committed a hotfix in ooni/probe-cli#1351 and documented a more comprehensive fix in #2551. We can now mark this issue as closed.
Murphy-OrangeMud pushed the commits above to Murphy-OrangeMud/probe-cli, referencing this issue, on Feb 13, 2024.
TL;DR This issue is about a bug where the `engineresolver` performs a lookup even if the context is canceled. This leads to the component believing that a given resolver is broken when instead a parent-enforced timeout just occurred. This is the root cause preventing ooniprobe from bootstrapping in certain scenarios, including the one described below. However, there is also a secondary issue in the code organization (discussed at the end).

Scenario

Consider the case where OONI is doing a cold bootstrap. In such a case, the `$HOME/.ooniprobe/engine/` directory is empty, so we do not have information on which resolvers are WAI (working as intended). Let us also assume that all the DNS-over-HTTPS resolvers we may want to use are censored. In such a case, the bootstrap would fail during the iplookup.

In this fictional scenario, you would see the following logs. The logs have been reordered for readability and reorganized to explain what happens. I will guide you (as well as myself) through the logs.
We start by attempting a lookup using the `cloudflare` method. Below, I have reordered the logs (which are out of order because different goroutines may be scheduled at different times) to group related lookups together, and I have removed irrelevant logs that would hurt readability and the ability to understand what happens:

So, overall, the lookup lasts 7 seconds. This is because we have a 7-second timeout, which was included as a reliability fix according to a comment, and which was introduced by ooni/probe-cli#678, addressing issues #1988 and #2013.
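To make the timeout layering concrete, here is a minimal Go sketch, not the actual probe-cli code: an outer 7-second context bounds the whole iplookup, while (as discussed below) each individual lookup also gets its own four-second timeout. The real code runs lookups concurrently; this serial sketch only illustrates how the outer deadline cuts later lookups short:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// doLookup stands in for a single resolver lookup bounded by a four-second
// per-lookup timeout derived from the parent context.
func doLookup(parent context.Context, name string) error {
	ctx, cancel := context.WithTimeout(parent, 4*time.Second)
	defer cancel()
	select {
	case <-time.After(10 * time.Second): // pretend the resolver never answers
		return nil
	case <-ctx.Done():
		// The error may come from the 4s per-lookup timeout OR from the
		// outer 7s iplookup timeout expiring first: two competing policies.
		return ctx.Err()
	}
}

func main() {
	// Overall timeout bounding the whole iplookup operation (the
	// "reliability fix" mentioned above).
	ctx, cancel := context.WithTimeout(context.Background(), 7*time.Second)
	defer cancel()
	for _, name := range []string{"lookup#1", "lookup#2", "lookup#3"} {
		start := time.Now()
		err := doLookup(ctx, name)
		fmt.Printf("%s: err=%v elapsed=%v\n", name, err, time.Since(start))
	}
}
```

In this sketch the first lookup fails after four seconds, the second fails after only three because the outer deadline expires, and the third fails immediately, which mirrors the pattern analyzed next.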
You see that `lookup#1`, `lookup#2`, and `lookup#3` all fail with `connection_refused`. Instead, `lookup#4` fails with a `generic_timeout_error` caused by a four-second per-lookup timeout.

But `lookup#5` and `lookup#6` fail before we hit the four-second mark, because of the overall 7-second timeout that bounds the total runtime of the whole iplookup operation. Here we're clearly witnessing a conflict between different timeout policies. (See the end for more thoughts about the timeout conflict.)

Also, `lookup#8` and `lookup#9` have timestamps after the log where the `iplookup` component tells us it has given up, which is an instance of logs being slightly out of order due to the scheduling of goroutines that I mentioned above.

However, and this is particularly important, `lookup#9` happens to be the system resolver, which I manually tested and know to be working in this fictional network. Unfortunately, because the `engineresolver` component schedules lookups even when its context deadline has expired, the `engineresolver` believes that the system resolver is not working as intended and marks it as broken. Had that not happened, we would at least have been able to proceed with the bootstrap. This reckoning is what led me to label this issue as the root cause of the bootstrap failing in this scenario.

If you're curious, the bootstrap ends like this:
Also, this is the content of `sessionresolver.state` after this run:

Code organization considerations

We have several blocks of code that are synchronous and return a result to the caller or time out. The `model.Resolver` is a classical example. However, because of this code organization (which, in my defense, looks natural in Go and works fine as long as the network is not working against you), we need to enforce timeouts inside several code blocks to keep the overall runtime bounded. It should be investigated whether a code organization based on pipelines returning information to callers through channels could be more robust and require fewer timeouts. I am leaving this comment here just to document my thoughts while analyzing this issue, but this is probably a different task that we should open and discuss in a separate issue.

I covered this issue in more detail in #2545.
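For completeness, here is a rough sketch of the channel-based pipeline idea mentioned in the code organization considerations above; it is only an illustration using assumed names, not OONI code. The point is that the caller owns a single coarse-grained deadline, so the inner lookup code needs no layered timeouts of its own:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// lookupResult carries either addresses or an error from a background lookup.
type lookupResult struct {
	resolver string
	addrs    []string
	err      error
}

// startLookups launches one goroutine per resolver and streams results back
// over a channel; the caller decides how long to wait for results.
func startLookups(
	ctx context.Context,
	resolvers []string,
	lookup func(ctx context.Context, resolver string) ([]string, error),
) <-chan lookupResult {
	out := make(chan lookupResult, len(resolvers))
	for _, r := range resolvers {
		go func(r string) {
			addrs, err := lookup(ctx, r)
			out <- lookupResult{resolver: r, addrs: addrs, err: err}
		}(r)
	}
	return out
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	resolvers := []string{"doh-a", "doh-b", "system"}
	results := startLookups(ctx, resolvers,
		func(ctx context.Context, resolver string) ([]string, error) {
			time.Sleep(100 * time.Millisecond) // pretend to resolve
			return []string{"192.0.2.1"}, nil  // documentation-range address
		})
	deadline := time.After(7 * time.Second) // one coarse-grained caller-side deadline
	for range resolvers {
		select {
		case res := <-results:
			fmt.Printf("%s: %v %v\n", res.resolver, res.addrs, res.err)
		case <-deadline:
			fmt.Println("giving up")
			return
		}
	}
}
```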