Sensitive Data Handling #100

johnbley · 2020-04-27T23:59:52Z

I present a proposal for a design philosophy around handling potentially-sensitive data in our libraries, using SQL as an example throughout.

text/0100-sensitive-data-handling.md

carlosalberto · 2020-04-28T13:23:27Z

Overall great. This feels like a good start 👍

arminru · 2020-04-28T13:38:12Z

text/0100-sensitive-data-handling.md

@@ -0,0 +1,103 @@
+#  Sensitive Data Handling
+
+By default, OpenTelemetry libraries never capture potentially-sensitive data, except for the full URL.


That's a pretty bold statement. Are we sure we will be able to ensure this 100%?

It's a design goal. Bugs are probably inevitable 😄

I think armin was asking if this is a design goal we want. The term libraries could mean anything from API to SDK to Instrumentation Libraries. I think the strongest statement we could make would be:

Suggested change

By default, OpenTelemetry libraries never capture potentially-sensitive data, except for the full URL.

By default, instrumentation libraries provided by OpenTelemetry make a best effort to never capture potentially-sensitive data, except in cases like the full URL where the data is necessary to the observability of the system.

arminru · 2020-04-28T13:39:42Z

text/0100-sensitive-data-handling.md

@@ -0,0 +1,103 @@
+#  Sensitive Data Handling
+
+By default, OpenTelemetry libraries never capture potentially-sensitive data, except for the full URL.


What are "OpenTelemetry libraries"? Are these the API + SDK implementations or instrumentations/integrations that are provided by the OpenTelemetry project?

This would include any current and future auto-instrumenter (like Java, .NET, node.js) as well as any manually wrapped libraries (e.g., in a contrib package).

I see. It would make sense to explicitly mention in the document the scope to which this OTEP and its design goals apply.

arminru · 2020-04-28T13:40:34Z

text/0100-sensitive-data-handling.md

+
+## Explanation
+
+By default, our java auto-instrumentation library does not capture your app's credit card numbers, taxpayer IDs, 


Is the Java auto-instrumentation mentioned as an example or does this only apply to the Java auto-instrumentation? Please clarify.
If this is about all kinds of instrumentations, do you intend to redact all set attributes by default?

Sticking with the sql example, it would also apply to the node.js instrumentation in opentelemetry-js, including the opentelemetry-plugin-mysql and opentelemetry-plugin-postgres packages. It doesn't look like the nascent .NET auto agent would be affected (yet).
No, this wouldn't apply a single rule to all set attributes. We assume manual instrumentation knows what it is doing. Since we are writing the instrumentation for auto-instrumenters, we are aware of the semantics of the attribute values and can apply appropriate rules (e.g., we know that the numbers in http.status_code are not risks, but that a database connection string might contain a password).
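A minimal sketch of what such per-attribute rules could look like, assuming attributes are held in a plain dictionary. The rule table, the function names, and the `db.connection_string` key are illustrative assumptions, not part of the proposal or of any OpenTelemetry API.

```python
import re

# `password=...` or `pwd=...` in key=value style connection strings.
_KV_PASSWORD = re.compile(r"(?i)\b(password|pwd)=[^;&\s]*")
# The `:secret@` part of a URL-style connection string.
_URL_PASSWORD = re.compile(r"(://[^:/@\s]+):[^@/\s]*@")


def redact_connection_string(value: str) -> str:
    """Replace any embedded password with the literal text REDACTED."""
    value = _KV_PASSWORD.sub(lambda m: f"{m.group(1)}=REDACTED", value)
    return _URL_PASSWORD.sub(r"\1:REDACTED@", value)


# Hypothetical rule table: attribute key -> scrubbing function.
# Keys without a rule (such as http.status_code) pass through unchanged.
ATTRIBUTE_RULES = {
    "db.connection_string": redact_connection_string,
}


def apply_rules(attributes: dict) -> dict:
    """Return a copy of `attributes` with the matching rule applied per key."""
    scrubbed = {}
    for key, value in attributes.items():
        rule = ATTRIBUTE_RULES.get(key)
        scrubbed[key] = rule(value) if rule and isinstance(value, str) else value
    return scrubbed
```

With these rules, `apply_rules({"http.status_code": 200, "db.connection_string": "Server=db;Password=hunter2;"})` keeps the status code and turns the connection string into `Server=db;Password=REDACTED;`.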

arminru · 2020-04-28T13:43:03Z

text/0100-sensitive-data-handling.md

+sensitive, and some of it might not.  What we do is remove all 
+*potentially* sensitive values.  This means that sql like `SELECT * FROM USERS WHERE ID=123456789` gets turned into
+`SELECT * FROM USERS WHERE ID=?` when it gets put into our `db.statement` attribute.  Since our library doesn't know anything about the 
+meaning of `ID` values, we decided to be "safe by default" and drop all numeric and string values entirely from all SQL.  So what gets 


Does this work for all sorts of SQL dialects out there?

Yes, a SQL lexer can be applied broadly across many vendors' dialects, because it is only interested in tokenization and not in syntax/grammar.
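A minimal sketch of the tokenization idea, assuming a regex-based lexer is good enough for illustration. The names `scrub_sql` and `_TOKEN` are hypothetical, and this is not the implementation used by any instrumentation library. It does not handle comments, double-quoted identifiers, hex literals, or vendor-specific quoting.

```python
import re

# One alternation, tried left to right at each position. Identifiers are
# matched (and kept) so that digits inside names such as `table1` are not
# mistaken for numeric literals.
_TOKEN = re.compile(
    r"""
    (?P<string>'(?:[^']|'')*')                    # single-quoted string, '' is an escaped quote
  | (?P<ident>[A-Za-z_][A-Za-z0-9_$]*)            # keyword or identifier
  | (?P<number>\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)    # integer, decimal, or exponent form
    """,
    re.VERBOSE,
)


def scrub_sql(statement: str) -> str:
    """Replace every string and numeric literal with `?`."""

    def _replace(match: re.Match) -> str:
        if match.lastgroup in ("string", "number"):
            return "?"
        return match.group(0)  # keywords and identifiers are kept as-is

    return _TOKEN.sub(_replace, statement)
```

For example, `scrub_sql("SELECT * FROM USERS WHERE ID=123456789")` returns `SELECT * FROM USERS WHERE ID=?`, and an `IN (1, 2, 3)` list becomes `IN (?, ?, ?)`, so the number of values is still visible.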

This means that sql like SELECT * FROM USERS WHERE ID=123456789 gets turned into
SELECT * FROM USERS WHERE ID=? when it gets put into our db.statement attribute.

I don't see how this can possibly be done efficiently inside the language SDKs, as opposed to scrubbing this in the ingestion pipeline.

Only seeing this otep now as it was linked to from another newer otep -- I'd like to see something like this added so going over this one now.

I don't think this is a good example since the db statement should already be parameterized, like an extended query in postgres. Are there many databases that don't support a form of query + args like found in SQL databases?

So for db instrumentation we should document that users should not manually construct queries with dynamic data in general and, most importantly for instrumentation, should never do so with sensitive data.

And instead the query parameter attributes need to be either not included at all or have the ability to be filtered by key.

This would require no SQL parsing library and be doable in all of the implementations.
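A minimal sketch of that approach, assuming the application already uses parameterized queries (shown here with the standard-library `sqlite3` driver). The denylist, the `span_attributes` helper, and the `db.statement.param.<name>` key are hypothetical names chosen for illustration.

```python
import sqlite3

# Hypothetical denylist of parameter names that must never be recorded.
SENSITIVE_PARAMS = {"card_number", "taxpayer_id"}


def span_attributes(statement: str, params: dict, capture_params: bool = False) -> dict:
    """Build span attributes for a parameterized query.

    The statement text contains only placeholders, so it is recorded as-is.
    Parameter values are dropped unless capture is enabled, and even then
    any parameter on the denylist is skipped.
    """
    attributes = {"db.statement": statement}
    if capture_params:
        for name, value in params.items():
            if name not in SENSITIVE_PARAMS:
                # Illustrative key only, not an agreed semantic convention.
                attributes[f"db.statement.param.{name}"] = value
    return attributes


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, card_number TEXT)")
conn.execute("INSERT INTO users VALUES (?, ?)", (1, "4111111111111111"))

# The query text never contains the values themselves.
statement = "SELECT id FROM users WHERE id = :id AND card_number = :card_number"
params = {"id": 1, "card_number": "4111111111111111"}
rows = conn.execute(statement, params).fetchall()

default_attrs = span_attributes(statement, params)
verbose_attrs = span_attributes(statement, params, capture_params=True)
```

By default only the placeholder form of the statement is recorded. With capture enabled, `id` is recorded but `card_number` is still filtered out by key.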

Of course best practices aren't always an option so having the SQL parsing that can scrub all predicates in the collector would be a must have for some.

I also think this relates to span processors and how they need updating to provide functionality like this. I planned for that to be a follow up that builds on my OTEP #186 but since it has not been met with a positive reception so far I may have to rethink that :)

This OTEP is about a "broad agreement" and I agree with that but put the rest here because I think the broad agreement should include span transformations in the export pipeline that doesn't yet exist, and that one of those transformations should be a required configurable attribute scrubber.

arminru · 2020-04-28T13:45:09Z

text/0100-sensitive-data-handling.md

+we offer configuration options to work around this.
+
+If your app doesn't process any sensitive data at all and you want to see the raw sql values, you can configure that.  The documentation
+will have some bold/red warnings telling you to be sure you know what you're doing.


Is this something to be configured in the UI of a backend or at the place where the instrumentation is set up? Are these bold red warnings supposed to show up in a log produced by the instrumentation or in the tracing backend's UI?

Config would remain local to each library (a manual wrapper might provide a dictionary of options as an API parameter, an auto agent might use an environment variable or config file). The warning would appear in the documentation of this config knob.
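A minimal sketch of such a local config knob, assuming a safe default of "do not capture raw SQL". The environment variable name and the `capture_raw_sql` option key are hypothetical; neither is defined by the proposal.

```python
import os

# Hypothetical variable name, used here only for illustration.
RAW_SQL_ENV_VAR = "EXAMPLE_INSTRUMENTATION_CAPTURE_RAW_SQL"


def capture_raw_sql(options=None, environ=os.environ) -> bool:
    """Return True only when raw SQL capture was explicitly enabled.

    An explicit `capture_raw_sql` entry in `options` (manual wrapper style)
    takes precedence over the environment variable (auto-agent style).
    With neither set, the answer is False, i.e. values are scrubbed.
    """
    if options and "capture_raw_sql" in options:
        return bool(options["capture_raw_sql"])
    return environ.get(RAW_SQL_ENV_VAR, "").strip().lower() in ("1", "true", "yes")
```

Letting the explicit option win over the environment is a design choice made for this sketch; a real library could reasonably pick the opposite precedence.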

dyladan · 2020-04-28T16:05:45Z

In general, the use of "we" here is confusing. Is this describing the approach you've taken at your company, or retroactively describing what you would like to see OpenTelemetry do?

johnbley · 2020-04-28T22:54:05Z

In general, the use of "we" here is confusing. Is this describing the approach you've taken at your company, or retroactively describing what you would like to see OpenTelemetry do?

I was following the recommendation in the template: "Explain the proposed change as though it was already implemented and you were explaining it to a user" and, yes, using "we" to mean "the whole OpenTelemetry community".

jmacd · 2020-04-29T17:47:57Z

text/0100-sensitive-data-handling.md

+the number of values in an `IN` list, presence of an index on fields, etc.
+
+Some might consider doing most of this work at the collector, rather than in-process with the app.  For users who are already exposed to this 
+issue, I don't think the statement "yes, by default we scrub the credit card number immediately after it is sent over the wire to our collector" is a very reassuring one.


I'm somewhat skeptical about the performance impact on the host application, but I'm even less convinced that we're going to build out scrubbers for PII in every client language.

This seems to suggest a configuration where an OTel collector is co-located in the side-car configuration, so that it resides in the same security domain. Then the statement is "yes, by default we scrub the credit card number immediately after it crosses outside your process into a locally running sidecar."

+1. Making the promise that scrubbing will be done in-process in every language is a very large scope creep. Doing it once in a sidecar or central tier is much more sustainable and manageable.

yurishkuro

There is no consensus that we should be making the promise that scrubbing will be done in-process in every language. It's a significant scope creep. Doing it once in a sidecar or central tier is much more sustainable and manageable.

bogdandrutu · 2020-04-30T01:27:29Z

@yurishkuro +1 I think we should promise that we have it in the collector, but not in all the libraries.

johnbley · 2020-05-01T12:10:55Z

I hear and accept the concern around the cost of owning this in each language. I will rework the proposal so that the default system (including the collector) preserves the desired behavior, allowing but not requiring instrumentation libraries to do their own scrubbing. One design concern I have for this, though, is that the collector loses some semantic information that the instrumented process has. For example, we currently use db.statement for "plain" sql and also Mongo, Redis, Geode, Couchbase, etc. queries. Instrumentation libraries of course know which thing they're instrumenting and can apply appropriate logic. Under this design, how will the collector know which semantic transformations to apply?

yurishkuro · 2020-05-02T01:33:58Z

how will the collector know which semantic transformations to apply?

I think that is a matter for semantic data conventions. One of the other db.*** attributes should provide this clarification.

arminru · 2020-05-04T14:45:14Z

how will the collector know which semantic transformations to apply?

I think that is a matter for semantic data conventions. One of the other db.*** attributes should provide this clarification.

Exactly. The current spec has a required attribute db.type, which specifies the type of database being called. This attribute will be passed on to the collector untouched.
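A minimal sketch of how a downstream processor could select a scrubber from `db.type`, assuming span attributes arrive as a plain dictionary. This is Python for illustration, not collector code; the scrubbers are deliberately simplistic, and the `db.type` values `sql` and `redis` are assumed examples.

```python
import re


def scrub_sql(statement: str) -> str:
    # Simplified: replace quoted strings, then standalone numbers, with `?`.
    statement = re.sub(r"'(?:[^']|'')*'", "?", statement)
    return re.sub(r"\b\d+(?:\.\d+)?\b", "?", statement)


def scrub_redis(statement: str) -> str:
    # Keep the command name and replace every argument.
    parts = statement.split()
    return " ".join(parts[:1] + ["?"] * (len(parts) - 1))


# Hypothetical mapping from db.type to the transformation to apply.
SCRUBBERS = {"sql": scrub_sql, "redis": scrub_redis}


def scrub_span_attributes(attributes: dict) -> dict:
    """Return a copy of `attributes` with db.statement scrubbed per db.type."""
    statement = attributes.get("db.statement")
    if statement is None:
        return attributes
    scrubber = SCRUBBERS.get(attributes.get("db.type"))
    scrubbed = dict(attributes)
    # Unknown or missing db.type: drop the statement rather than guess.
    scrubbed["db.statement"] = scrubber(statement) if scrubber else "?"
    return scrubbed
```

Replacing the whole statement when `db.type` is unrecognized keeps the "safe by default" behavior at the cost of losing the query shape for unsupported databases.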

lizthegrey · 2020-05-05T18:22:19Z

+1 that data scrubbing, encryption/sanitization (a la https://docs.honeycomb.io/authentication-and-security/secure-tenancy/ or Lightstep's satellites) should be done on client's premises via a collector or satellite, but not necessarily in-process for every telemetry generating SDK.

MovieStoreGuy · 2021-12-02T22:43:44Z

@lizthegrey,

In some scenarios, data scrubbing or data validation would need to be done in the application itself due to organisation policies that don't allow for the data to be processed by a third party application.

(Sorry, I have a similar OTEP that I am trying to advocate for)

tedsuo · 2023-07-31T16:21:52Z

@johnbley we are cleaning up stale OTEP PRs. If there is no further action at this time, we will close this PR in one week. Feel free to open it again when it is time to pick it back up.

johnbley added 2 commits April 27, 2020 19:47

Sensitive Data Handling OTEP

bc78353

Verbiage tweaks based on another read through.

b95e94c

johnbley requested review from arminru, bogdandrutu, c24t, carlosalberto, iredelmeier, jmacd, reyang, SergeyKanzhelev, tedsuo, tigrannajaryan and yurishkuro as code owners April 27, 2020 23:59

Rename OTEP per PR id (100).

509e83b

Oberon00 reviewed Apr 28, 2020

carlosalberto approved these changes Apr 28, 2020

Explicitly call out URL capturing in the headline.

7349386

arminru reviewed Apr 28, 2020

jmacd reviewed Apr 29, 2020

yurishkuro requested changes Apr 29, 2020

Alter section about the collector providing a default backstop for in…

f142c93

…dividual languages/libraries.

fbogsany mentioned this pull request Aug 5, 2020

Obfuscate query values in db.statement for mysql2 instrumentation. open-telemetry/opentelemetry-ruby-contrib#19

Closed

Base automatically changed from master to main January 27, 2021 20:37

bogdandrutu requested a review from a team January 27, 2021 20:37

trask mentioned this pull request Feb 18, 2021

Enable HttpServer / HttpClient instrumentation to exclude query strings open-telemetry/opentelemetry-java-instrumentation#2302

Open

remram44 mentioned this pull request Feb 7, 2024

Semantic conventions for database should be explicit about parameters / placeholder values open-telemetry/semantic-conventions#711

Closed

MovieStoreGuy mentioned this pull request Nov 29, 2021

Data Classifications for resources and attributes #187

Closed

tedsuo added the triaged label Feb 13, 2023

lmolkova mentioned this pull request Apr 3, 2024

Guidelines for redacting sensitive information open-telemetry/semantic-conventions#877

Open

tedsuo added the stale label Jul 31, 2023

pyohannes mentioned this pull request Apr 29, 2024

Specific URL query string values should be redacted open-telemetry/semantic-conventions#971

Open

lmolkova mentioned this pull request Apr 30, 2024

Sensitive Data Redaction #255

Draft


