engineresolver: bootstrap failure b/c of lookup w/ canceled context #2544
Labels: bug (Something isn't working), needs investigation (This issue needs extra data and investigation), ooni/probe-engine, priority/high, releaseBlocker (This issue blocks releasing)
Comments
bassosimone added the bug, priority/high, needs investigation, ooni/probe-engine, and releaseBlocker labels on Sep 29, 2023.
bassosimone changed the title from "engineresolver: don't lookup if context is canceled" to "engineresolver: bootstrap failure because of lookup w/ canceled context" on Sep 29, 2023.
bassosimone changed the title from "engineresolver: bootstrap failure because of lookup w/ canceled context" to "engineresolver: bootstrap failure b/c of lookup w/ canceled context" on Sep 29, 2023.
bassosimone added a commit to ooni/probe-cli that referenced this issue on Oct 9, 2023:

Not logging when an operation is starting complicates reasoning about how much time it took to fail, which is something that caused issues while investigating ooni/probe#2544.
bassosimone added another commit with the same message to ooni/probe-cli, also referencing this issue, on Oct 9, 2023.
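As a rough illustration of the logging point made in those commit messages, here is a minimal Go sketch (a hypothetical helper, not the actual probe-cli code) that logs when an operation starts and, on failure, how long it took to fail:

```go
package main

import (
	"errors"
	"log"
	"time"
)

// runOperation is a hypothetical helper: it logs when an operation starts
// and, if the operation fails, how long it took to fail, so that slow
// failures are easy to reason about from the logs alone.
func runOperation(name string, op func() error) error {
	log.Printf("%s: starting", name)
	start := time.Now()
	if err := op(); err != nil {
		log.Printf("%s: failed after %v: %v", name, time.Since(start), err)
		return err
	}
	log.Printf("%s: done in %v", name, time.Since(start))
	return nil
}

func main() {
	_ = runOperation("iplookup", func() error {
		time.Sleep(100 * time.Millisecond) // stand-in for real work
		return errors.New("generic_timeout_error")
	})
}
```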
bassosimone added a commit to ooni/probe-cli that referenced this issue on Oct 9, 2023:

This diff ensures that the bootstrap is more robust in the scenario described by ooni/probe#2544. Because we're close to a release, I did not want to write a heavy refactor of the bootstrap, and specifically of the iplookup part, but I documented what is needed at ooni/probe#2551. This diff has been tested in a problematic network under two distinct conditions: 1. cold bootstrap (i.e., w/o prior knowledge), where the correct fix is to give the system resolver intermediate priority; 2. hot bootstrap with a messed-up config, where the correct fix is to use a large coarse-grained timeout. The fix that prevents a canceled context from marking a resolver as failed helps in both scenarios. As you can see, I removed a "reliability fix", which actually was more of an optimization. This removal means that the probe bootstrap is now slower than it could be, hence my choice of documenting what I'd do had I had more time to work on this topic.
bassosimone added a commit to ooni/probe-cli that referenced this issue on Oct 9, 2023:

This diff ensures that the bootstrap is more robust in the scenario described by ooni/probe#2544. Because we're close to a release, I did not want to write a heavy refactor of the bootstrap, and specifically of the iplookup part, but I documented what is needed at ooni/probe#2551. This diff has been tested in a problematic network under two distinct conditions: 1. cold bootstrap (i.e., w/o prior knowledge), where the correct fix is to give the system resolver intermediate priority (i.e., 0.5); 2. hot bootstrap with a badly messed-up `sessionresolver.state`, where the correct fix is to use a larger coarse-grained timeout, such that _eventually_ we try the system resolver. The fix that prevents marking a resolver as failed when the external context has been canceled is required in both of these scenarios. As you can see, I removed a "reliability fix", which actually was more of an optimization. This removal means that the probe bootstrap is now slower than it could be, hence my choice of documenting in ooni/probe#2551 what I'd do had I had more time to work on this topic. BTW, I had to create an overall 45-second timeout for IP lookups because we have 7 DNS-over-HTTPS resolvers plus the system resolver. This means that, in the worst case where the system resolver has the least priority, we expect 7*4 = 28 seconds of timeouts before eventually using the system resolver. The rest of the 45s timeout accounts for operations happening after the DNS lookup has completed.
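Conceptually, the fix boils down to not letting a context-driven failure count against a resolver. Below is a minimal sketch of that idea in Go; the `scoreboard` type and `lookupWithResolver` helper are hypothetical illustrations, not the actual engineresolver code:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// scoreboard is a hypothetical stand-in for the structure tracking which
// resolvers appear to be working.
type scoreboard map[string]bool

// lookupWithResolver runs a lookup and only updates the scoreboard when a
// failure is attributable to the resolver itself rather than to the parent
// context being canceled or past its deadline.
func lookupWithResolver(
	ctx context.Context,
	sb scoreboard,
	resolver string,
	lookupHost func(ctx context.Context, domain string) ([]string, error),
) ([]string, error) {
	// If the parent context is already done, don't even start the lookup:
	// it would fail for reasons unrelated to the resolver's health.
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	addrs, err := lookupHost(ctx, "example.com")
	if err != nil {
		// A context-driven failure says nothing about the resolver.
		if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
			return nil, err
		}
		sb[resolver] = false // genuine failure: mark the resolver as broken
		return nil, err
	}
	sb[resolver] = true
	return addrs, nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	sb := scoreboard{}
	_, err := lookupWithResolver(ctx, sb, "system",
		func(ctx context.Context, domain string) ([]string, error) {
			<-ctx.Done() // simulate a lookup cut short by the parent timeout
			return nil, ctx.Err()
		})
	fmt.Println(err, sb) // scoreboard stays empty: the resolver is not penalized
}
```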
I committed a hotfix in ooni/probe-cli#1351 and documented a more comprehensive fix in #2551. We can now mark this issue as closed.
Murphy-OrangeMud pushed the commits above to Murphy-OrangeMud/probe-cli, referencing this issue, on Feb 13, 2024.
TL;DR This issue is about a bug where the `engineresolver` performs a lookup even if the context is canceled. This leads to the component believing that a given resolver is broken when instead a parent-enforced timeout just occurred. This is the root cause preventing ooniprobe from bootstrapping in certain scenarios, including the one described below. However, there is also a secondary issue in the code organization (discussed at the end).

Scenario

Consider the case where OONI is doing a cold bootstrap. In such a case, the `$HOME/.ooniprobe/engine/` directory is empty, so we do not have information on which resolvers are WAI (working as intended). Let us also assume that all the DNS-over-HTTPS resolvers we may want to use are censored. In such a case, the bootstrap would fail during the iplookup.

In this fictional scenario, you would see the following logs. The logs have been reordered for readability and reorganized to explain what happens. I will guide you (as well as myself) through the logs.
We start by attempting a lookup using the `cloudflare` method. Below, I have reordered the logs (which are out of order because different goroutines may be scheduled at different times) to group related lookups together, and I have removed irrelevant logs that would hurt readability and the ability to understand what happens:

So, overall, the lookup lasts 7 seconds. This is because we have a 7-second timeout, which was included as a reliability fix according to a comment, and which was introduced by ooni/probe-cli#678, addressing issues #1988 and #2013.
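To make the timeout layering concrete, here is a minimal Go sketch, not the actual probe-cli code: an outer 7-second context bounds the whole iplookup, while (as discussed below) each individual lookup also gets its own four-second timeout. The real code runs lookups concurrently; this serial sketch only illustrates how the outer deadline cuts later lookups short:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// doLookup stands in for a single resolver lookup bounded by a four-second
// per-lookup timeout derived from the parent context.
func doLookup(parent context.Context, name string) error {
	ctx, cancel := context.WithTimeout(parent, 4*time.Second)
	defer cancel()
	select {
	case <-time.After(10 * time.Second): // pretend the resolver never answers
		return nil
	case <-ctx.Done():
		// The error may come from the 4s per-lookup timeout OR from the
		// outer 7s iplookup timeout expiring first: two competing policies.
		return ctx.Err()
	}
}

func main() {
	// Overall timeout bounding the whole iplookup operation (the
	// "reliability fix" mentioned above).
	ctx, cancel := context.WithTimeout(context.Background(), 7*time.Second)
	defer cancel()
	for _, name := range []string{"lookup#1", "lookup#2", "lookup#3"} {
		start := time.Now()
		err := doLookup(ctx, name)
		fmt.Printf("%s: err=%v elapsed=%v\n", name, err, time.Since(start))
	}
}
```

In this sketch the first lookup fails after four seconds, the second fails after only three because the outer deadline expires, and the third fails immediately, which mirrors the pattern analyzed next.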
You see that `lookup#1`, `lookup#2`, and `lookup#3` all fail with `connection_refused`. Instead, `lookup#4` fails with a `generic_timeout_error` caused by a four-second per-lookup timeout.

But `lookup#5` and `lookup#6` fail before we hit the four-second mark, because of the overall 7-second timeout that bounds the total runtime of the whole iplookup operation. Here we're clearly witnessing a conflict between different timeout policies. (See the end for more thoughts about the timeout conflict.)

Also, `lookup#8` and `lookup#9` have timestamps after the log where the `iplookup` component tells us it has given up, which is an instance of logs being slightly out of order due to the scheduling of goroutines that I mentioned above.

However, and this is particularly important, `lookup#9` happens to be the system resolver, which I manually tested and know to be working in this fictional network. Unfortunately, because the `engineresolver` component schedules lookups even when its context deadline has expired, the `engineresolver` believes that the system resolver is not working as intended and marks it as broken. Had that not happened, we would at least have been able to proceed with the bootstrap. This reckoning is what led me to label this issue as the root cause of the bootstrap failing in this scenario.

If you're curious, the bootstrap ends like this:
Also, this is the content of `sessionresolver.state` after this run:

Code organization considerations

We have several blocks of code that are synchronous and return a result to the caller or time out. The `model.Resolver` is a classical example. However, because of this code organization (which, in my defense, looks natural in Go and works fine as long as the network is not working against you), we need to enforce timeouts inside several code blocks to keep the overall runtime bounded. It should be investigated whether a code organization based on pipelines returning information to callers through channels could be more robust and require fewer timeouts. I am leaving this comment here just to document my thoughts while analyzing this issue, but this is probably a different task that we should open and discuss in a separate issue.

I covered this issue in more detail in #2545.
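For completeness, here is a rough sketch of the channel-based pipeline idea mentioned in the code organization considerations above; it is only an illustration using assumed names, not OONI code. The point is that the caller owns a single coarse-grained deadline, so the inner lookup code needs no layered timeouts of its own:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// lookupResult carries either addresses or an error from a background lookup.
type lookupResult struct {
	resolver string
	addrs    []string
	err      error
}

// startLookups launches one goroutine per resolver and streams results back
// over a channel; the caller decides how long to wait for results.
func startLookups(
	ctx context.Context,
	resolvers []string,
	lookup func(ctx context.Context, resolver string) ([]string, error),
) <-chan lookupResult {
	out := make(chan lookupResult, len(resolvers))
	for _, r := range resolvers {
		go func(r string) {
			addrs, err := lookup(ctx, r)
			out <- lookupResult{resolver: r, addrs: addrs, err: err}
		}(r)
	}
	return out
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	resolvers := []string{"doh-a", "doh-b", "system"}
	results := startLookups(ctx, resolvers,
		func(ctx context.Context, resolver string) ([]string, error) {
			time.Sleep(100 * time.Millisecond) // pretend to resolve
			return []string{"192.0.2.1"}, nil  // documentation-range address
		})
	deadline := time.After(7 * time.Second) // one coarse-grained caller-side deadline
	for range resolvers {
		select {
		case res := <-results:
			fmt.Printf("%s: %v %v\n", res.resolver, res.addrs, res.err)
		case <-deadline:
			fmt.Println("giving up")
			return
		}
	}
}
```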