Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Heartbeat] Refactor TCP Monitor #17549

Merged
merged 37 commits into from
Apr 20, 2020
Merged

Conversation

andrewvc
Copy link
Contributor

@andrewvc andrewvc commented Apr 6, 2020

What does this PR do?

Refactors the TCP monitor to make the code easier to follow, more testable, and fixes #17123 where TLS server name was not correctly sent. This is important because the code had accrued a lot of cruft and become very hard to follow. There were many wrappers and intermediate variable names that often subtly changed names as they crossed various functions. When debugging #17123 I frequently found myself lost tracing the execution.

This new code should be simpler to understand for a few reasons:

  1. Less code (almost a 2x reduction) and fewer, simpler, better organized functions
  2. Less variable passing/renaming due to use of struct for key config variables
  3. More consistent and descriptive variable names
  4. Creation of the dialer as late as possible, to remove the confusing partial states, and clarity as to when which dialer layers are used.
  5. Adds (frustratingly tricky) integration tests for Heartbeat TLS servername set incorrectly #17123 using mismatched TLS certs, and also against a real SOCKS5 proxy
  6. Adds, for testing only, the ability to override the real network resolver for TCP checks which is invaluable in debugging TLS checks that depend on setting hostnames correctly. In the future if we decide to let users use a custom DNS resolver this will be nice.
  7. Reorganized giant TCP test file into multiple files

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

Run the sample config in #17123

@andrewvc andrewvc added bug in progress Pull request is currently in progress. Heartbeat Team:obs-ds-hosted-services Label for the Observability Hosted Services team technical debt labels Apr 6, 2020
@andrewvc andrewvc self-assigned this Apr 6, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/uptime (:uptime)

@andrewvc andrewvc changed the title [Heartbeat] Refactor tcp [Heartbeat] Refactor TCP Monitor Apr 6, 2020
@andrewvc andrewvc removed the in progress Pull request is currently in progress. label Apr 6, 2020
@urso
Copy link

urso commented Apr 14, 2020

Loads of complexity comes from the amount of jobs (permutations) created per monitor configuration. Settings contributing to this are mode: all, hosts, and ports. For 8.0 we might consider to deprecate some?

  • mode:all: Most complexity and dialer chain have been introduced thanks to this setting. If all is set a DNS is request is made to collect all known IPs for a hostname, and create a dialer and follow up task per IP. Initially this has been a hard requirement. Is this still the case?

  • hosts: we allow multiple hosts to be configured per monitor. hosts instead of host was introduced for consistency with other Beats. Users have only one config, at the cost of some more 'complexity' in code on the initial setup. I think that's ok and I would keep the setting.

  • ports: This is the one I regret the most. If one only wants to check if TCP connectivity is there, ports might be convenient for users. But once you introduce 'contents' checks to your configuration, you might prefer to test one port anyways. Maybe we can deprecate this in favor of port, and move the responsibility to the user/UI/configuration template to create multiple monitoring configs if this is required.

if err != nil {
return nil, err
}
jobs = append(jobs, wrappers.WithURLField(url, endpointJob))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this is a change in semantics introduced here or a while ago already. By creating a job per complete URL, we will have each Job do it's own DNS request. Originally only one job was created to do the initial DNS lookup, with follow up tasks per IP:Port combination. On Windows/MacOS DNS is cached automatically, but we have had incidents on Linux overloading DNS servers, due to missing caching.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a look at the code here, and I do think we can fix this so we restore the behavior before this patch. However, I agree with you that it's not a clear win for ports because that code is tricky. I think for mode: all it still makes sense to keep the behavior, because more people use that (and it provides benefits for icmp/tcp/http not just tcp).

However, for TCP, given that this is a very rarely used feature AFAICT I don't think the code complexity is worth maintaining the single DNS lookup. Additionally, there are workarounds in the rare case where DNS load is too high. One could always install a caching resolver on the same box as heartbeat in the worst case.

mapping map[string][]net.IP
}

func CreateStaticResolver() staticResolver {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not return private types from an exported function/method. Why do you need CreateStaticResolver to be exported? If it is only for testing, how about having a special package with testing utilities, to make the intent more clear?

When introducing something like this resolver (so we don't have to modify the /etc/hosts file), then the resolver should become 'stackable', so we can fallback to another resolver if the hostname has not been configured.

If we use this for testing, how about accepting a map[string][]net.IP as parameter to CreateStaticResolver?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't need to be exported, just an oversight :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I take it back, it does need to be exported because it's used by the tcp monitor package tests. I could move it there, but I think it makes more sense to keep it here since we'll eventually use it in tests for http/icmp.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively I could extract all this resolver stuff to its own package, but I'm indifferent on that matter.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's purpose is to be used for testing only, then we should at least have this reflected in the package name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, StdResolver is used for production, StaticResolver is used in another package for test purposes. I could break it out to its own package, but that seems a bit excessive, since that package would have to export it.

We could put StaticResolver in a test file in the tcp package, but in the future we'll probably have tests that want to reuse it from the http and icmp monitors.

What do you think the best way forward is?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we think of other testing helpers? E.g. some helper that would drive a 'job' with it's followup tasks?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been moved to the tcp test package for now.

@andrewvc
Copy link
Contributor Author

@urso thanks for the review. I'd rather not make too many changes in this PR re mode/ports/hosts . The original purpose here was fixing the TLS hostname bug, and we've already gone far past that point, and cleaned up the code substantially.

I agree that it's worth considering deprecating ports, but I'd ask that we have that discussion in a separate issue since we'd be waiting till 8.0 to remove the code.

Copy link
Contributor

@mtojek mtojek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only two nit-picks. Ship it!

@@ -45,7 +45,7 @@ import (
// }
// }
func TCPDialer(to time.Duration) NetDialer {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I see that the file net.go contains only dialers. Do you think we can rename it to dialers.go?
Also, there are some functions in util.go that refer to transport.Dialer, I wonder if we can divide the stuff in these two files into dialers.go and layers.go. We could get rid of utils ;)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, good call.

"strconv"
"testing"
"time"

"github.com/elastic/go-lookslike/validator"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: please sort imports

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++, I reset my IDE and lost my auto-sorter.

@andrewvc andrewvc merged commit dfe8c4b into elastic:master Apr 20, 2020
@andrewvc andrewvc deleted the refactor-tcp branch April 20, 2020 21:17
andrewvc added a commit to andrewvc/beats that referenced this pull request Apr 20, 2020
Refactors the TCP monitor to make the code easier to follow, more testable, and fixes elastic#17123 where TLS server name was not correctly sent. This is important because the code had accrued a lot of cruft and become very hard to follow. There were many wrappers and intermediate variable names that often subtly changed names as they crossed various functions. When debugging elastic#17123 I frequently found myself lost tracing the execution.

This new code should be simpler to understand for a few reasons:

    Less code (almost a 2x reduction) and fewer, simpler, better organized functions
    Less variable passing/renaming due to use of struct for key config variables
    More consistent and descriptive variable names
    Creation of the dialer as late as possible, to remove the confusing partial states, and clarity as to when which dialer layers are used.
    Adds (frustratingly tricky) integration tests for elastic#17123 using mismatched TLS certs, and also against a real SOCKS5 proxy
    Adds, for testing only, the ability to override the real network resolver for TCP checks which is invaluable in debugging TLS checks that depend on setting hostnames correctly. In the future if we decide to let users use a custom DNS resolver this will be nice.
    Reorganized giant TCP test file into multiple files

(cherry picked from commit dfe8c4b)
andrewvc added a commit that referenced this pull request Apr 21, 2020
Refactors the TCP monitor to make the code easier to follow, more testable, and fixes #17123 where TLS server name was not correctly sent. This is important because the code had accrued a lot of cruft and become very hard to follow. There were many wrappers and intermediate variable names that often subtly changed names as they crossed various functions. When debugging #17123 I frequently found myself lost tracing the execution.

This new code should be simpler to understand for a few reasons:

    Less code (almost a 2x reduction) and fewer, simpler, better organized functions
    Less variable passing/renaming due to use of struct for key config variables
    More consistent and descriptive variable names
    Creation of the dialer as late as possible, to remove the confusing partial states, and clarity as to when which dialer layers are used.
    Adds (frustratingly tricky) integration tests for #17123 using mismatched TLS certs, and also against a real SOCKS5 proxy
    Adds, for testing only, the ability to override the real network resolver for TCP checks which is invaluable in debugging TLS checks that depend on setting hostnames correctly. In the future if we decide to let users use a custom DNS resolver this will be nice.
    Reorganized giant TCP test file into multiple files

(cherry picked from commit dfe8c4b)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Heartbeat TLS servername set incorrectly
6 participants