DB sanitization uniform format #717

avzis · 2023-01-26T15:25:29Z

What are you trying to achieve?

According to the db spec, the db.statement value can be sanitized, but it is not defined how to do so.

Currently, the sanitization is being dealt with differently in few places.
I suggest to add a uniform format that will describe how to do the sanitization.
(It will be best if this format will apply to all different DB's and syntaxes)

Different implementations examples:

JS mongo db - implements the sanitization by replacing the information with question marks
Python pymongo - implements the sanitization by deleting the information completely, and leaving the query method name only
Python elasticsearch (WIP) - suggests to replace the information with a string that will explain that the data is sanitized

I suggest a few options to replace the value with:

Keep the method name, and add a sanitized text.
for example: db.statement = "SELECT {query information is sanitized}"
advantages: quite easy to implement, easy to keep consistent across libraries.
dis-advantages: will require to research whether all different libraries can handle this format effectively.
Simple text that will describe that the value is sanitized.
for example: db.statement = "query information is sanitized"
advantages: easy to implement, easy to keep consistent across different libraries.
dis-advantages: doesn't supply basic information about the query that could be useful.
Replace the values with question marks.
for example: db.statement = "SELECT ? FROM ?"
advantages: keeps more amount of information, while still not exposing sensitive or private data.
dis-advantages: harder to implement, harder to keep consistent across libraries.

I would like to hear opinions about the suggested solutions, or hear different ideas.

Additional context.

open-telemetry/opentelemetry-specification#3104 - Issue regarding changing the recommendation to sanitize the information by default.
#708 - Issue about missing examples for sanitization in specs.

The text was updated successfully, but these errors were encountered:

mateuszrzeszutek · 2023-02-01T08:15:44Z

FWIW Java implements the 3rd approach in all DB clients that we currently instrument (SQL and SQL-like, Redis, MongoDb) - enabled by default, disabled by a configuration setting. We attempt to sanitize all variables, while preserving the database schema if possible.

I think that the first two approaches do not provide any useful information (the first one simply contains db.operation, the second one adds no info at all) and do not differ from the "delete db.statement" option.

blumamir · 2023-02-12T11:14:11Z

If the data is sanitzed, and we write some text like "SELECT ? FROM ?" in the attribute value, it could be hard for backends to correctly label this value and process it.

For example, if processors try to extract the table name from the statement, they might try to parse this text and decide the table name is a valid "?" instead of "missing"/"sanitized". If the value is "SELECT {query information is sanitized}" it will probably fail parsing the SQL statement which could be either invalid behavior of the app (requires attention), or a valid "magic text" to signal lack of data.

Another example - SET foo ? - is it setting the value for key foo to ? or was it sanitized? it's impossible to tell for sure. It's another mental step for users to do in order to understand what they are looking at.

I would love to see some deterministic rules that allow the processors to differentiate between sanitized statements and actual statements. If it involve parsing or looking for magic text, that should be precisely speced.
Do you think it make sense to set the db.statement attribute with an empty string "" to signal it was not collected intentionally?

trask · 2024-02-07T16:29:04Z

@open-telemetry/technical-committee can you move this to https://github.com/open-telemetry/semantic-conventions?

trask · 2024-04-24T16:46:35Z

possible language for SQL queries:

replace all literal values with ?

trask · 2024-04-26T15:42:25Z

should normalization be part of this or not? e.g.

normalize in (?, ?, ?, ?, ...) to in (?)

or

normalize in (?, ?, ?, ?, ...) to in (_more10_50_) to give an idea of number of items in the inlist

maryliag · 2024-04-26T16:35:01Z

Keep the method name, and add a sanitized text.
for example: db.statement = "SELECT {query information is sanitized}"
advantages: quite easy to implement, easy to keep consistent across libraries.
dis-advantages: will require to research whether all different libraries can handle this format effectively.

I wouldn't say is necessarily easy, because how do you decide the "main" method on a statement that have multiple methods such as update and select. Or statements that start with WITH, how the method is decided.

Replace the values with question marks.
for example: db.statement = "SELECT ? FROM ?"
advantages: keeps more amount of information, while still not exposing sensitive or private data.
dis-advantages: harder to implement, harder to keep consistent across libraries.

For this one I think is more helpful for the user but agree it will required a harder level to implement and keep consistency. For example, on CockroachDB the replacements were something as follows:
placeholders: $1
strings: '_'
anything else: _

When there was IN clause, it would be replaced by one of the values: __more1_10__, __more10_100__, _moreX__ (with X the between 100 and 1000, such as 100, 200, 300,...) or __more1000_plus__. That created a nice balance of separating groups that would use different plan executions, but at the same time keeping cardinality lower of different possible final strings, since the list can be quite big (I saw cases with 20k+ values in a list)

Another thing to consider on this option is level of sanitization, because is more common for the values to be hidden, but not the columns, so it would need options for sanitazing just the value or the value + columns.

trask · 2024-05-08T16:42:57Z

thoughts from SIG meeting:

general statement for all dbs - replace all string and numeric literals with some placeholder (maybe ?)

trask · 2024-05-17T16:00:07Z

from SIG meeting:

don't sanitize placeholders themselves (e.g. Improve r2dbc statement sanitization opentelemetry-java-instrumentation#11379)
the placeholder ('?') could be different for different database systems
note: parameterized queries aren't sanitized (Parameterized query text does not need to be sanitized by default #976)

github-actions bot assigned SergeyKanzhelev Jan 26, 2023

avzis mentioned this issue Jan 29, 2023

Change db.statement to only be collected if there is sanitization open-telemetry/opentelemetry-specification#3127

Merged

arielvalentin mentioned this issue Feb 1, 2023

SQL Instrumentation Improvements open-telemetry/opentelemetry-ruby-contrib#32

Closed

haddasbronfman mentioned this issue Feb 9, 2023

feat(db.values): add db.bind.values to yaml open-telemetry/opentelemetry-specification#3092

Closed

SergeyKanzhelev removed their assignment Feb 18, 2023

nemoshlag mentioned this issue Feb 22, 2023

Add botocore db.statement sanitization for dynamoDB open-telemetry/opentelemetry-python-contrib#1662

Open

9 tasks

lmolkova mentioned this issue Apr 3, 2024

Guidelines for redacting sensitive information #877

Open

jack-berg transferred this issue from open-telemetry/opentelemetry-specification Feb 7, 2024

github-actions bot assigned reyang Feb 7, 2024

jack-berg mentioned this issue Feb 16, 2024

Evaluate requirement level of db.statement #754

Open

trask mentioned this issue Apr 24, 2024

Add db.statement sanitization/masking examples #708

Closed

trask assigned trask and unassigned reyang May 8, 2024

maryliag mentioned this issue May 21, 2024

DB Grouping variables from statement #1053

Open

trask assigned gregkalapos and unassigned trask May 22, 2024

gregkalapos mentioned this issue May 31, 2024

Define sanitization for db.query.text #1100

Merged

3 tasks

lmolkova closed this as completed in #1100 Jun 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DB sanitization uniform format #717

DB sanitization uniform format #717

avzis commented Jan 26, 2023 •

edited

Loading

mateuszrzeszutek commented Feb 1, 2023

blumamir commented Feb 12, 2023

trask commented Feb 7, 2024

trask commented Apr 24, 2024

trask commented Apr 26, 2024

maryliag commented Apr 26, 2024

trask commented May 8, 2024

trask commented May 17, 2024 •

edited

Loading

DB sanitization uniform format #717

DB sanitization uniform format #717

Comments

avzis commented Jan 26, 2023 • edited Loading

mateuszrzeszutek commented Feb 1, 2023

blumamir commented Feb 12, 2023

trask commented Feb 7, 2024

trask commented Apr 24, 2024

trask commented Apr 26, 2024

maryliag commented Apr 26, 2024

trask commented May 8, 2024

trask commented May 17, 2024 • edited Loading

avzis commented Jan 26, 2023 •

edited

Loading

trask commented May 17, 2024 •

edited

Loading