Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[OTTL] Supports hashing multiple capture groups in a replacement string with prefix/suffix #30403

Closed

Conversation

rnishtala-sumo
Copy link
Contributor

@rnishtala-sumo rnishtala-sumo commented Jan 10, 2024

Description:
Supports hashing multiple capture groups in a replacement string, with a prefix/suffix. Offers UX consistent with masking.

The following formats are now supported

  • prefix-$1-suffix
  • prefix-$1-suffix-$2
  • prefix-$1-$2-suffix

The function is applied as follows

  • prefix-hash($1)-suffix
  • prefix-hash($1)-suffix-hash($2)
  • prefix-hash($1)-hash($2)-suffix

Link to tracking Issue:
#27820
Implements Option 1 from this comment

Testing: Unit tests

Documentation:
Potentially a breaking change since the prefix/suffix of a capture group is no longer a part of the hash. Marked this as an enhancement at present.

The following hashing example is now consistent with masking behavior, where the k8s prefix is preserved and not included in the hash
replace_all_patterns(attributes, "key", "^kube_([0-9A-Za-z]+_)", "k8s.$$1.")

@rnishtala-sumo rnishtala-sumo force-pushed the hashed-repl-fmt branch 2 times, most recently from bcfead8 to 06f6ecc Compare January 16, 2024 23:07
@rnishtala-sumo rnishtala-sumo marked this pull request as ready for review January 16, 2024 23:07
@rnishtala-sumo rnishtala-sumo requested a review from a team January 16, 2024 23:07
@rnishtala-sumo rnishtala-sumo changed the title Only hash the capture group in the replacement string if specified Supports hashing multiple capture groups in a replacement string Jan 17, 2024
@rnishtala-sumo rnishtala-sumo changed the title Supports hashing multiple capture groups in a replacement string Supports hashing multiple capture groups in a replacement string with prefix/suffix Jan 17, 2024
@github-actions github-actions bot requested a review from kentquirk January 17, 2024 13:54
@djaglowski djaglowski changed the title Supports hashing multiple capture groups in a replacement string with prefix/suffix [OTTL] Supports hashing multiple capture groups in a replacement string with prefix/suffix Jan 17, 2024
@rnishtala-sumo
Copy link
Contributor Author

rnishtala-sumo commented Jan 17, 2024

@evan-bradley it actually looks like we don't have masking examples in our contrib repository, but rather there are examples in distribution repos like this one - https://github.com/SumoLogic/sumologic-kubernetes-collection/blob/main/docs/collecting-container-logs.md#modifying-log-records

This is how some customers currently use masking at the moment.

Also, our existing tests use the same pattern for masking

.chloggen/hashed-repl-fmt.yaml Outdated Show resolved Hide resolved
Comment on lines 71 to 79
// Extract capture group from replacement value to apply the function on it
r := regexp.MustCompile(`(\$[0-9])`)
matches := r.FindAllStringSubmatch(replacementVal, -1)
if len(matches) > 0 {
for _, match := range matches {
multipleCaptureGroups = append(multipleCaptureGroups, match[0])
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A potential performance improvement for later: check if the value inside StringGetter is actually a literal and do this work during startup instead. I think we'd need some interface changes in OTTL.

pkg/ottl/ottlfuncs/func_replace_pattern.go Outdated Show resolved Hide resolved
Comment on lines 81 to 109
if len(multipleCaptureGroups) > 0 {
for _, captureGroup := range multipleCaptureGroups {
replacementValStr, err = executeFunction(ctx, tCtx, compiledPattern, fn, originalValStr, captureGroup, submatch)
if err != nil {
return "", err
}
captureGroupMap[captureGroup] = replacementValStr
}
switch {
case len(captureGroupMap) > 1:
for key, value := range captureGroupMap {
replacementVal = strings.ReplaceAll(replacementVal, key, value)
}
updatedString = strings.ReplaceAll(updatedString, fullMatch, replacementVal)

case len(captureGroupMap) == 1:
for key, value := range captureGroupMap {
replacementValStr = strings.ReplaceAll(replacementVal, key, value)
}
updatedString = strings.ReplaceAll(updatedString, fullMatch, replacementValStr)

default:
updatedString = strings.ReplaceAll(updatedString, fullMatch, replacementValStr)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a lot going on here, can you add describe why we need captureGroupMap and multipleCaptureGroups?

Copy link
Contributor Author

@rnishtala-sumo rnishtala-sumo Jan 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, the steps are as below:

  • Add all the unresolved capture groups in the replacement string to a list [$1, $2...]

  • Then as you resolve all the capture groups in the original string, build a map of all the capture groups like so {$1: hash(world1), $2: hash(world2)}

  • If there are entries in the map, it means that there are references to capture groups in the replacement string that need to be resolved before updating the original string

  • The outer loop that loops through all the submatch indices from the original string ensures that all occurrences of capture groups are replaced in the original string

Copy link
Contributor Author

@rnishtala-sumo rnishtala-sumo Jan 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

multipleCaptureGroups keeps track of all the unresolved capture groups in the replacement string. Then as we resolve the capture groups we build a captureGroupMap of unresolved to resolved capture groups. This helps us replace all references to unresolved capture groups with the resolved values in the replacement string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TylerHelmuth does the above description make sense? Please let me know if you have any concerns with this approach.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still find this pretty hard to follow, can you add comments in the code.

Copy link
Contributor Author

@rnishtala-sumo rnishtala-sumo Jan 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some comments to this part of the code, please let me know if you have questions on it.

@evan-bradley
Copy link
Contributor

@rnishtala-sumo Thanks for linking to the masking docs. There may be a way around this even with only passing the capture groups, but I'm missing how we can address my concern here: #27820 (comment). A user may want to control the input to the hash function so it includes multiple parts from the input string or some static text, likely so the hash matches with hashes produced in other parts of their system.

Using our test cases, with input string goodbye world1 and world2, pattern (world1) and (world2), and some replacement string equivalent to blue-$1 and blue-$2, it's not clear to me how we could get goodbye hash(blue-world1 and blue-world2). This test case would work with the current implementation of the function.

@rnishtala-sumo rnishtala-sumo force-pushed the hashed-repl-fmt branch 2 times, most recently from 7178973 to fc72d6f Compare January 18, 2024 22:50
@rnishtala-sumo
Copy link
Contributor Author

rnishtala-sumo commented Jan 18, 2024

@evan-bradley how about giving the user an option to choose the hashing behavior, i've written a test for it.

Since we're calling this a breaking change, the default behavior would be to hash only the capture groups. The user will be able to include the prefix/suffix in the hash by passing in a boolean (includeNonCapture) to mimic the current hashing behavior.

I want to emphasize that if the user doesn't use a prefix/suffix with a capture group then there isn't any change in hashing behavior. All the existing test cases around this do pass.

@rnishtala-sumo
Copy link
Contributor Author

rnishtala-sumo commented Jan 23, 2024

@evan-bradley regarding this comment

I don't have any real-world cases to point to here, but I'm a little worried we would lose too much control over how the hash is created if we do this.

I attempted to address this in the latest commit by giving the user a choice to control how they would hash the string goodbye world1 and world2 with the regex (world1) and (world2) and replacement - blue-$1 and blue-$2

  • Including non capture groups will result in goodbye hash(blue-world1 and blue-world2)
  • The default behavior would be goodbye blue-hash(world1) and blue-hash(world2)

A simpler example for this would be if a user decides to hash a string password=1234, both of the following would be supported with the optional boolean param (IncludeNonCapture)

  • hash(password=1234)
  • [default] password=hash(1234)

@rnishtala-sumo rnishtala-sumo force-pushed the hashed-repl-fmt branch 3 times, most recently from 8a8a620 to 7280a1c Compare January 23, 2024 22:57
@rnishtala-sumo
Copy link
Contributor Author

rnishtala-sumo commented Jan 26, 2024

@TylerHelmuth @evan-bradley. Do we think the approach mentioned here is reasonable for this issue? The motivation for this approach was this comment. Or do you think we should look at the approach mentioned here to support curried functions in OTTL?

@TylerHelmuth
Copy link
Member

@rnishtala-sumo thanks for being persistent on this issue and sorry it is taking so long. This is definitely the most complex function issue we've had in OTTL and its definitely pushing the language.

I believe this PR is a solution for the problem, but I find it quite complex. Requiring an extra Optional param and the need to keep track of all the different situations within the business logic is not simple and the code reflects that. I am worried about the upkeep of this function and the trouble we'll have adding any future features or performing any future refactors. From an end user perspective the Optional parameters wont be used by most, but for those that do want to use the optional params I worry about understanding exactly how everything is applied.

I think it would be good to do some comparisons with the other options. As I understand it, this PR implements Option 1 from #27820 (comment). I'd like to see Option 2 (format param) implemented as well. Since Option 3 (currying) takes a lot of language changes, I vote we start by looking at Option 2 and if we like it we could discuss taking it a step further with Option 3, which is a more powerful Option 2.

How does that sound?

Again, sorry for all the delays with this issue and thanks for sticking with it. This is an important use case for OTTL/transformprocessor and I want to make sure we get it right.

@evan-bradley
Copy link
Contributor

evan-bradley commented Jan 26, 2024

@rnishtala-sumo Thanks for adding the extra parameter. I think the issue I still see here is that we aren't able to format the output of the optional function. At the end of the day I think we will need to allow the user to format the input to the optional function and how the output of the optional function is formatted before it is used as a replacement in the string.

I also realize that the current implementation optimizes the UX for masking use cases, which may in fact be sufficient for a large majority of users. Without others chiming in to say they also expect the function to be used like this it's a little hard to independently determine whether this is the case. Could you detail more about how you expect end users to use this functionality? I think this could help inform what UX will be best, and when we can optimize for certain cases and suggest more complicated solutions for others.

I agree with what @TylerHelmuth said about the implementation: I think it is great that you've made this work, but it is also complex. I think option 2 would be a good first exploration unless you think option 3 would be better. If we attempt option 3 I think we would want to know that it could be helpful in other places in OTTL just so we're not implementing something for a specific use case.

I also want to echo what @TylerHelmuth said and thank you for your patience on this. I've spent the past few days seriously considering this and still haven't been able to come to a solid conclusion on what the best solution is here. It's certainly an important problem, but it's also a hard one.

@rnishtala-sumo
Copy link
Contributor Author

rnishtala-sumo commented Jan 26, 2024

@TylerHelmuth @evan-bradley , no problem, thank you for the suggestions. Just to refresh our memories on this, Option 2 from this comment was what I started with. This was the PR that was closed, because of this comment which gained traction and led to option 1 🙂

The only reason I added the additional optional param in Option 1 was because we didn't want to lose an existing [use case].(#30403 (comment)). The original intent of option 1 was very much to change the existing hash behavior only when capture groups are used in the replacement string. Other use cases (not involving capture groups) wouldn't change.

Having said that, I don't mind pivoting back to option 2 which is the simplest of all options for this issue and maybe open a separate issue for supporting option 3 where we could track all the use cases for it.

@rnishtala-sumo
Copy link
Contributor Author

Could you detail more about how you expect end users to use this functionality? I think this could help inform what UX will be best

I think the example below is a good starting point. Whether the user would like to mask or hash a card number in the log, they would have the same UX, if we stick to option 1.

https://github.com/SumoLogic/sumologic-kubernetes-collection/blob/main/docs/collecting-container-logs.md#modifying-log-records

@rnishtala-sumo
Copy link
Contributor Author

rnishtala-sumo commented Jan 26, 2024

Another example

2012-05-16 09:43:39,607 -0700 DEBUG [hostId=prod-cass-raw-8] [module=RAW] [logger=scala.raw.InboundRawProtocolHandler] [[email protected]] [remote_ip=98.248.40.103] [local_ip=192.168.0.1]

Hash the remote IP
With option 1:
replace_pattern(body, "remote_ip=(^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$)", "remote_ip=$$1", SHA256)

With option 2:
replace_pattern(body, "remote_ip=(^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$)", "$$1", SHA256, "remote_ip=%s")

To Mask the remote IP
replace_pattern(body, "remote_ip=(^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$)", "remote_ip=***")

Option 1 is closer to what masking does today and requires fewer params/user inputs

@TylerHelmuth
Copy link
Member

TylerHelmuth commented Jan 26, 2024

@rnishtala-sumo I totally forgot about #27686, thanks for bringing that up. Let's revive that PR with all the other changes/tests that have been made to the functions and see what it looks like. I'm really curious to see how that solution lets us format the output of the optional function.

Also looking at this scenario, I am totally more in favor of Option 2 over Option 1. Option 2 is definitely easier for me to comprehend than Option 1 since I don't have to do any mental math on what is being hashed. @evan-bradley I can also see how currying functions together is maybe even clearer.

@rnishtala-sumo
Copy link
Contributor Author

@TylerHelmuth @evan-bradley Couldn't reopen the same PR, so created a new one for option 2 using the same feature branch.

#30837

Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Feb 13, 2024
Copy link
Contributor

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Feb 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants