
A Proposal For Terraform Logging #695

Closed
paddycarver opened this issue Feb 5, 2021 · 9 comments
Labels
enhancement New feature or request proposal subsystem/observability Issues and feature requests related to observability (traces, logging, etc.) inside of providers.

Comments

@paddycarver
Contributor

paddycarver commented Feb 5, 2021

Hey everyone. This is going to be a long proposal; we're going to try and unpack a thorny problem and propose a solution to it all in one. But we're incredibly interested in feedback from our ecosystem on this proposal, and would love to hear more from you all.

The Problem Description

Logging in Terraform presents a problem at the moment. Logs are currently represented as a stream of text, in which the provider's logs, the SDK's logs, and Terraform's logs all get intermingled. Unfortunately, once this happens, separating them is difficult. You get information about graph walks next to information about shims next to warnings about deprecated behavior next to HTTP request logs from the provider next to log lines printed in the provider's code. Multiple RPC calls are happening and logging in parallel, and aren't distinguished from each other at all. For a lot of users, grepping through the log for unique strings is the best they can do. Reading logs to get an accurate picture of what's happening is a difficult or impossible task, even for seasoned Terraform provider developers. Worse, this situation actively discourages logging information that could be helpful in resolving bugs because subsystems can easily drown out more relevant logs if they log too verbosely.

How Logging Works Today

Terraform's logging subsystem works today as a cooperation between several different pieces of Terraform. It all starts with go-plugin, the library that Terraform uses to start provider processes. go-plugin sends anything written to os.Stderr or os.Stdout to Terraform. This moves the output somewhere that Terraform can process it.

One level up, in the SDK, we complicate the abstraction a little more. We call log.SetOutput to override the default output of the log package, so calls to log.Println get sent somewhere else. That "somewhere else" ends up being go-hclog, which transforms the unstructured log into a structured log by inferring its level: prefixing a message with [TRACE] sets the level to TRACE, and so on. If no known prefix is matched, the log is assumed to be at the INFO level. This structured output is then written to os.Stderr as JSON, and os.Stderr is redirected to Terraform by go-plugin.

Finally, one more level up, the provider developer writes something like log.Println("[DEBUG] hello world") in their provider. At runtime, the SDK and go-plugin cooperate to transform this into a structured JSON log message sent to Terraform:

{"@level":"debug","@message":"hello world","@timestamp":"2009-11-10T23:00:00.000000Z"}

Terraform now has this JSON, and can parse it and use it appropriately. In 0.15, Terraform core will be able to separate provider logs from core logs, and will use structured logging throughout, making the filtering based on log level much more reliable.

Unfortunately, this still leaves too many subsystems interleaving their logs together, making it difficult to sort them out.
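
For illustration only (this is a minimal sketch, not the SDK's actual implementation), the prefix-based level inference described above can be approximated with a small io.Writer handed to log.SetOutput that re-emits each line through a JSON-formatted go-hclog logger:

package main

import (
	"log"
	"os"
	"strings"

	"github.com/hashicorp/go-hclog"
)

// levelInferringWriter is a hypothetical io.Writer that guesses the level
// from a "[TRACE]"-style prefix and re-emits the line through go-hclog.
type levelInferringWriter struct {
	logger hclog.Logger
}

func (w levelInferringWriter) Write(p []byte) (int, error) {
	msg := strings.TrimSpace(string(p))
	level := hclog.Info // no recognized prefix defaults to INFO
	for _, name := range []string{"TRACE", "DEBUG", "INFO", "WARN", "ERROR"} {
		if strings.HasPrefix(msg, "["+name+"] ") {
			level = hclog.LevelFromString(name)
			msg = strings.TrimPrefix(msg, "["+name+"] ")
			break
		}
	}
	w.logger.Log(level, msg)
	return len(p), nil
}

func main() {
	logger := hclog.New(&hclog.LoggerOptions{
		Output:     os.Stderr, // go-plugin redirects stderr to Terraform
		JSONFormat: true,
		Level:      hclog.Trace,
	})
	log.SetFlags(0) // drop the date/time prefix so the level prefix comes first
	log.SetOutput(levelInferringWriter{logger: logger})

	// a provider-style log line, re-emitted as structured JSON on stderr
	log.Println("[DEBUG] hello world")
}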

Involved Parties

As the above section hopefully makes clear, there are multiple interested parties in how logging in Terraform providers works.

  • The go-plugin maintainers have their own compatibility and general-applicability requirements to uphold.
  • The Terraform Core team is the final barrier between logs and the user, but also has their own initiatives and responsibilities.
  • The Plugin SDK team has a responsibility to the existing provider codebases and interfaces, to keep existing logging working, and to implement a logging solution that works for all providers, not just some or even most.
  • The provider developer has their own deadlines to meet and work to do, and needs tools and information to help them do their jobs.
  • The provider user ("practitioner") needs a straightforward way to surface information to provider developers that will help provider developers understand and resolve an issue the practitioner is having.

Challenges

A number of challenges stand in the way of us shipping a better experience for debugging providers using logs:

  • The number of people and teams the issue touches, each with different needs and priorities, means it needs careful consideration and design, and investment from people who have their own roadmaps to execute on.
  • The large existing body of code in the ecosystem means breaking existing code is an untenable proposition.
  • The large number of subsystems at play and the level of complexity in them (providers themselves, the different pieces of the SDK, Terraform itself) generate a massive amount of debug logging, but provider developers are only interested in some of it and need control over what complexity is surfaced.
  • The large number of different combinations of output means configuring what complexity is surfaced is, in itself, a complex problem.
  • Practitioners shouldn't be expected to read logs, but do surface them in issue reports. Provider developers need a way to filter the complexity of the logs surfaced in these issue reports, so the filtering can't happen only at runtime.
  • The Terraform provider development experience is evolving beyond a monolithic SDK, and the solution should work whether you're using terraform-plugin-mux, terraform-plugin-go, terraform-plugin-sdk, some as-yet unreleased other tool, or a combination of them.
  • Most provider developers will reach for fmt or log as their first instinct for logging from providers, both because that's a standard solution within Go and because that is how providers have traditionally handled logging. Any solution that doesn't use those two packages as the entrypoint needs to address how the different approach will be discovered by users, rather than failing silently.

The Proposed Solution

We've spent some time researching this problem and have a proposal for a path forward. We think it addresses the shortcomings of the current logging situation, navigates the challenges well, and is achievable within the timeframe and bandwidth that can be devoted to the project right now, given the other constraints everyone is working under and the projects that are in motion.

It is important to note that this is a proposal and design notes. What ends up being shipped may change as new constraints or goals are discovered, as limitations prove the desired user experience impossible, or as research, prototyping, and experimentation prove a designed user experience isn't as intuitive or useful as expected and an alternative that better achieves the goals is used. This is also not a guarantee that anything will be shipped.

terraform-plugin-log

The proposal centers around a new Go module, github.com/hashicorp/terraform-plugin-log. The package name when imported will be tflog and will provide a logging subsystem that is native to and heavily opinionated for Terraform development. The use of a new module instead of a package in the SDK allows logging to be used in terraform-plugin-go and terraform-plugin-mux without needing to import the SDK, potentially unnecessarily. It also allows us to iterate on and version the logger separately from the SDK. This does, however, mean that discovering the module will be more difficult for provider developers.

As an implementation detail, terraform-plugin-log will be implemented as a wrapper around the go-hclog library, tailoring it to Terraform's specific use cases. This could change in the future, but it is a useful shorthand for illustrating how terraform-plugin-log will work on day one. The use of a structured logger is a departure from the unstructured logging of the past, but offers a richer logging experience that lets the SDK enrich log output for users. It also matches Terraform's now-consistent internal use of structured logs.

Unlike the current situation, terraform-plugin-log will not use a process-global logger. This has caused problems in the past: other modules overwriting that logger, multiple provider servers being started in the same process for testing and other non-standard operation modes, etc. Rather than relying on this global mutable state, provider loggers will be injected into an RPC call's context.Context. Provider developers will be able to write logs as follows:

func resourceFooCreate(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
    someVar := 123
    // "hello, world" will be used as the message;
    // includedValue=123 will be included as a value on the log line
    tflog.Info(ctx, "hello, world", "includedValue", someVar)
    return nil
}

This does mean that the deprecated non-context.Context-aware functions will not be able to take advantage of terraform-plugin-log.

Because we're using structured logging and passing it using context.Context, provider developers can set request-scoped values to be used in every log line following that point for the RPC call:

ctx = tflog.With(ctx, "user_id", "[email protected]")
makeMyClientCall(ctx)

The SDK can also do this on behalf of users, injecting useful variables as part of the SDK. For example, if the SDK injects a request ID at the gRPC boundary:

ctx = tflog.With(ctx, "request_id", "54f65188-0d10-425d-9427-c778edeffcf9")
provider.CreateContextFunc(ctx, d, meta)

Now when the provider calls tflog.Info(ctx, "hello, world") it'll automatically include request_id=54f65188-0d10-425d-9427-c778edeffcf9, too, helping provider developers understand what is happening in RPC calls, even when calls are happening in parallel:

{"@level":"info","@message":"hello, world","@timestamp":"2009-11-10T23:00:00.000000Z","request_id":"54f65188-0d10-425d-9427-c778edeffcf9"}

context.Contexts that do not have a logger in them (most likely to come up in unit tests or terraform-plugin-go usage, as the framework will be injecting loggers for provider developers) will by default use a null logger, which does not write any output. We could also take the same approach covered below in Encouraging Discovery to make this behavior more obvious to the user.
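
As a minimal sketch of that fallback behavior (the context key and function name are illustrative, not the module's actual internals), the logger lookup could default to go-hclog's null logger:

type loggerKeyType struct{}

var loggerKey loggerKeyType

func loggerFromContext(ctx context.Context) hclog.Logger {
	if logger, ok := ctx.Value(loggerKey).(hclog.Logger); ok {
		return logger
	}
	// no logger was injected (for example, in unit tests): fall back to a
	// null logger that discards output instead of writing or panicking
	return hclog.NewNullLogger()
}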

Named Loggers

go-hclog offers the ability to create named loggers as sub-loggers of an existing logger. These named loggers allow subsystems to have their logs partitioned. We propose to have named loggers for the different subsystems at play in the provider:

  • The provider developer's code would get its own logger, and the ability to generate sub-loggers of that logger.
  • The SDK would get its own sub-logger, with sub-loggers for:
    • terraform-plugin-go, which the SDK relies on to handle the interaction with and parsing of Terraform's protocol
    • The SDK itself
    • The test framework
    • terraform-plugin-mux, which may not be used in every provider, but will additively latch on to this logging approach when it is detected.
  • The http.Transport implementation that logs HTTP requests and responses would get its own sub-logger.

By separating out loggers in this manner, provider developers will be able to filter their output and separate out the subsystems that are not under investigation. This frees up the subsystems in question to log more verbosely, knowing they won't overwhelm other, more important logs.
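
As a rough sketch of how one of these sub-loggers could be wired up (the names here are illustrative, not a final API), the HTTP sub-logger could be attached to an http.RoundTripper created from a named go-hclog logger:

// loggingTransport is an illustrative http.RoundTripper that sends request
// and response logs through its own named sub-logger.
type loggingTransport struct {
	logger hclog.Logger      // e.g. providerLogger.Named("http")
	inner  http.RoundTripper // e.g. http.DefaultTransport
}

func (t loggingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	t.logger.Debug("sending request", "method", req.Method, "url", req.URL.String())
	resp, err := t.inner.RoundTrip(req)
	if err != nil {
		t.logger.Error("request failed", "error", err)
		return nil, err
	}
	t.logger.Debug("received response", "status", resp.StatusCode)
	return resp, nil
}

A provider could then construct its HTTP client with a transport like loggingTransport{logger: providerLogger.Named("http"), inner: http.DefaultTransport}, so request and response logs can be filtered independently of the rest of the provider's output.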

Encouraging Discovery

The use of a separate module makes it harder for provider developers, especially those used to reaching for log or fmt, to discover the package. As a solution to that, we propose to use log.SetOutput to override the log package's default Logger with one that will panic when it's used, including a message pointing developers to the terraform-plugin-log module. This is an aggressive approach, but we believe that failing loudly is better than failing silently.

Not all providers may want this behavior, and it is meant to be informative, not coercive. For example, many providers using the SDK have many existing log lines and may not wish to upgrade them. Similarly, providers may wish to use their own logger implementations instead, and while this isn't recommended, that is their choice to make. To help with this, the new panic will only be introduced in new projects like terraform-plugin-go and any new tools we build from here on out. The SDK will disable the panic, even when combined with a terraform-plugin-go provider using terraform-plugin-mux. The same method the SDK uses to disable the panic will be available for provider developers to call themselves in their main function if they want to use their own loggers, and instructions on how to disable the panic will be included in the panic message itself.
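
As a minimal sketch of that override (the message wording and the opt-out hook are illustrative only), the framework could install a writer that panics with a pointer to the new module:

// discouragementWriter is an illustrative io.Writer that panics instead of
// silently dropping standard library log output, pointing developers at
// terraform-plugin-log.
type discouragementWriter struct{}

func (discouragementWriter) Write(p []byte) (int, error) {
	panic("attempted to log using the standard library log package; provider " +
		"logs need to go through terraform-plugin-log (tflog) to reach " +
		"Terraform. If you intend to use your own logger, disable this check " +
		"in your provider's main function.")
}

// installed once at startup by the framework
func init() {
	log.SetOutput(discouragementWriter{})
}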

We will also update terraform-provider-scaffolding, terraform-provider-hashicups, and the Learn guides to use terraform-plugin-log, as well as the relevant documentation on terraform.io.

Configuration

With this many subsystems logging, and each able to be set to its own log level for filtering, the number of knobs starts to add up. When you consider that each provider in use will have its own set of these, the number of knobs becomes overwhelming.

Fortunately, the job here isn't to surface all the complexity the system enables; it's to surface the useful complexity that the system enables.

To do so, it's worth thinking about what subsystem owns which parts of the logs:

  • Terraform Core has the last word on logs. It stands between the provider and the user's stdout and stderr, so it is the gatekeeper of the final log output. But it really only cares about two things: do we want logs from core? And do we want logs from providers? The specific subsystems the provider has aren't really any of Core's concern, so its existing TF_LOG_CORE and TF_LOG_PROVIDER environment variables are sufficient to control what it cares about. Ideally, core would be able to control which providers can send logs, but specifying which levels apply to which providers at that point gets complicated. Nobody wants to type TF_LOG_PROVIDER=registry.terraform.io/hashicorp/random=debug,registry.terraform.io/hashicorp/null=off. To keep this simple, we'll limit core's configuration to just the configuration that exists in Terraform 0.15.
  • The Terraform SDK knows the most about the logs for the SDK itself, and so should decide what environment variables control them. TF_LOG_SDK_PLUGIN=trace could be used to control terraform-plugin-go. TF_LOG_SDK=off could be used to turn off all SDK logging.
  • The Terraform provider developer knows the most about their provider, and therefore can control which environment variables determine its logging level. TF_LOG_PROVIDER_RANDOM=off, for example, could turn the logs for the random provider off. There is a risk that provider developers will choose non-unique environment variables, but since the environment variable is mostly only useful to the developers of that provider, the people most impacted by collisions are the people best suited to resolve them. The SDK plans to surface helpers that register common environment variables, to help drive consistency here.

As an illustration of which layer reads which variable:

  • TF_LOG=off (read by Terraform core): disables all logging.
  • TF_LOG_PROVIDER=WARN (read by Terraform core): outputs any warning or error logs that providers decide to send, even if the providers are configured at a more verbose logging setting. If providers are configured at a less verbose logging setting, the less verbose setting is used.
  • TF_LOG_SDK=WARN (read by terraform-plugin-sdk): outputs warning or error logs for the SDK. If TF_LOG_PROVIDER or TF_LOG is set to a less verbose logging setting, the less verbose setting is used.
  • TF_LOG_PROVIDER_RANDOM=TRACE (read by terraform-provider-random): outputs all logs from the random provider. If TF_LOG_PROVIDER or TF_LOG is set to a less verbose logging setting, the less verbose setting is used.
  • TF_LOG_CORE=off (read by Terraform core): turns off all core logging.

This also nicely mimics how environment variables are used today for, for example, authentication: the level of abstraction that is concerned with it controls which environment variable is used.

The other nice part about this proposal is that the behavior users have come to rely on--that TF_LOG is the ultimate arbiter--holds true. If a user sets TF_LOG or TF_LOG_PROVIDER to WARN, it doesn't matter what the provider logger is configured to send; Terraform will still only show WARN and ERROR messages. These new environment variables can only further limit the information being exposed; they can never expand it. And users (and provider developers) don't need to know anything about them to retain the behavior they're getting today. They're just an extra tool for developers who need that enhanced complexity.
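
As a hedged sketch of how the SDK-provided helpers mentioned above might work (the environment variable name and helper function are purely illustrative), a provider-specific level could be parsed with go-hclog and applied to that provider's logger:

// illustrative only: map a provider-specific environment variable onto a
// go-hclog level, defaulting to off when the variable is unset
func providerLogLevel() hclog.Level {
	raw := os.Getenv("TF_LOG_PROVIDER_RANDOM")
	if raw == "" {
		return hclog.Off
	}
	// LevelFromString understands the usual level names; unrecognized
	// values map to hclog.NoLevel
	return hclog.LevelFromString(raw)
}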

Logs as Artifacts

Provider developers often need to debug issues that happen on systems they don't have access to, with configurations and states they don't have access to. The available information is limited, and largely provider developers need to rely on logs and reproductions to understand and get to the bottom of issues. As such, the experience of capturing a log to pass to a provider developer needs to be simple and straightforward, and the experience of working with a captured log should offer as much power as debugging a provider running on your own system.

We propose to add JSON as an acceptable value for TF_LOG. When set, it would capture any and all information the provider sends (effectively setting TF_LOG=trace) but also would output the log lines in their JSON, machine-readable format. The recommendation would then be for provider developers to ask users for the TF_LOG=JSON output, not the TF_LOG=trace output.

We could then provide a binary tool that could consume that JSON output, filter it using the same tools and environment variables as if the log lines were being generated instead of read, and output a new human-readable version of the log file filtered to the provider developer's specifications.
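
As a rough sketch of the kind of filtering such a tool could do (not a committed design), it could read the TF_LOG=JSON output line by line and keep only the entries matching a requested level:

// a rough sketch: read TF_LOG=JSON output line by line and keep only
// warning and error entries, re-rendered for human reading
func filterLogs(r io.Reader, w io.Writer) error {
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		var entry map[string]interface{}
		if err := json.Unmarshal(scanner.Bytes(), &entry); err != nil {
			continue // skip lines that aren't JSON log entries
		}
		level, _ := entry["@level"].(string)
		if level != "warn" && level != "error" {
			continue
		}
		fmt.Fprintf(w, "[%s] %v\n", strings.ToUpper(level), entry["@message"])
	}
	return scanner.Err()
}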

Design and implementation of this tooling is a future project and not considered part of shipping this improvement. The important part is that the functionality to support it will be available when the tool is ready, and that, until then, bug reports are no less useful than they are today.

Future Directions

A number of possibilities were discussed as we researched this proposal, and while they don't fit into the current scope of the proposal, they are things we have discussed and would like to implement when the prerequisites are met or we have the bandwidth to implement them productively.

Log Filtering

A lot of people are very interested in keeping sensitive information out of log files. As part of creating structured logging and taking more control over how logging happens, we gain the ability to do some more log filtering. The idea of a function call like tflog.RegisterFilter(ctx, func(msg string, args []interface{}) (string, []interface{})) was proposed, which would allow provider developers to register a callback that would be called before each log line was output and could modify the message or values of the line. This would allow providers to register callbacks that did things like remove API keys, resource-specific passwords, and other sensitive information from provider log output automatically. It would not impact the log output of core, but would help to address the concerns about potentially exposing sensitive information in logs.
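
To make the idea concrete, a provider could register a callback along these lines (the RegisterFilter signature here is the proposed one above, not an existing API, and apiKey stands in for whatever secret the provider wants scrubbed):

// hypothetical usage of the proposed API: scrub an API key (apiKey is a
// stand-in for whatever secret the provider holds) from every log line
tflog.RegisterFilter(ctx, func(msg string, args []interface{}) (string, []interface{}) {
	msg = strings.ReplaceAll(msg, apiKey, "***")
	for i, arg := range args {
		if s, ok := arg.(string); ok {
			args[i] = strings.ReplaceAll(s, apiKey, "***")
		}
	}
	return msg, args
})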

We are still interested in doing this, but given the current period of flux the SDK and Terraform provider development in general is undergoing, we wanted to introduce this in a thoughtful and ergonomic way, and didn't want to block the other improvements in this project on that implementation.

Open Telemetry

We're very interested in integrating with OpenTelemetry and considered building terraform-plugin-log on top of the opentelemetry-go library, or just using the opentelemetry-go library itself. Unfortunately, the opentelemetry-go library does not yet support logging. As the project develops, we're interested in keeping an eye on it and seeing if we can integrate better with existing standards.

Next Steps

Now that we've published this proposal, we're seeking input, comments, and feedback from our ecosystem and community. Would this help you? Are there things you need to do that we didn't take into consideration?

We're going to leave this issue open to collect comments until the end of February. On March 1st, we will close this issue, and the proposal will graduate to an internal HashiCorp RFC. At that point, the rest of HashiCorp will have a chance to offer their feedback and expertise, which may change the details of the proposal as we incorporate new information.

Once the RFC process is complete, we will begin implementing and integrating this work into our existing and future codebases, allowing provider developers to start taking advantage of it.

@paddycarver paddycarver added enhancement New feature or request proposal subsystem/observability Issues and feature requests related to observability (traces, logging, etc.) inside of providers. labels Feb 5, 2021
@paddycarver paddycarver pinned this issue Feb 5, 2021
@jacobbednarz

This proposal looks really nice @paddycarver, thanks for putting it up ❤️

As a provider maintainer, this covers all the use cases I've hit and wraps it all up in a nice interface for use. Not only that, I think the configuration guidelines you've outlined hit the mark on flexibility for all parties involved, avoiding fixing the problem for some while making it worse for others, so big kudos there.

I only have a single point for discussion (which may be moot after some of these improvements roll out): a request to be able to filter on the "layer" of log entry generated. For instance, providing a way to filter in/out HTTP traffic as a separate stream while still seeing the {core,sdk,provider} logging would be handy. Right now, I use a third-party proxy to grab this data when it's needed, as I don't often need the logging to include anything but a subset of the provider communication. Depending on where you'd take this, the HTTP filtering could be a top-level environment variable (a TF_LOG_HTTP=off type of thing) used in combination with the others you've proposed, or you could include it in the method signature for the log entries themselves.

@paddycarver
Contributor Author

Hey @jacobbednarz!

So I tried to address that in the proposal, but may not have been clear or may have left out key elements--sorry, drafting it was a bit like herding cats. :)

It sounds like this bit would address your ask:

The http.Transport implementation that logs HTTP requests and responses would get its own sub-logger.

Though perhaps the part that needs to be called out here is that each of these sub-loggers is intended to be individually configurable with environment variables. So you should be able to set the log level of each of those sub-loggers, including turning them off, on a logger-by-logger basis.

I think that gets you what you're looking for? But if not, I'd love to hear more about what it is you're after!

@jacobbednarz

Oh awesome! I must not have fully taken that in on the first pass so thanks for the highlight 🙇

@ryanking
Contributor

I have one note about this, and maybe more after I digest it more.

I find that the pattern illustrated by this line leads to poor discoverability:

    tflog.Info(ctx, "hello, world", "includedValue", someVar)

I really want that context.Context or some other interface to have something on it (func, struct member, etc) that indicates that it can be used for logging. Unfortunately this pattern is common in Go, which means that other people have probably encountered this situation and failed to come up with a better idea.

@paddycarver
Contributor Author

Hey @ryanking, thanks for the feedback. When you say

I want that context.Context or some other interface to have something on it ([...]) that indicates it can be used for logging.

Do you have any examples of what you mean? I'm having trouble visualizing what you're getting at.

I think the proposal is using a method of exposing this that's not incredibly foreign to Go, and is a common pattern for request-scoped things, but that's definitely not to say it's the only way to do it, and I'm interested in exploring alternatives that people think may be better.

@ryanking
Contributor

@paddycarver thinking about this more I think it would be best to do what I would want in the provider, rather than in the SDK

@paddycarver
Contributor Author

@ryanking that's my line! 😁 I'd be really interested to hear what you would want, though, to see if we could make that easier or if it's worth generalizing and offering a more paved path to achieving it. No pressure if you don't want to share, but I am interested if you do.

@paddycarver
Contributor Author

Thank you all for all your feedback and input on this! We're going to move forward with the design as outlined, and hopefully will have more news and updates to share with you soon.

@paddycarver paddycarver unpinned this issue Mar 3, 2021
@ghost

ghost commented Apr 3, 2021

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked as resolved and limited conversation to collaborators Apr 3, 2021