feat: Generic logline placeholder replacement and tokenization #12799
Conversation
This code allows us to preprocess generic logs and replace highly variable dynamic data (timestamps, IPs, numbers, UUIDs, hex values, bytesizes and durations) with static placeholders for easier pattern extraction and more efficient and user-friendly matching by the Drain algorithm.

Additionally, there is logic that splits generic log lines into discrete tokens that can be used with Drain for better results than just naively splitting the logs on every whitespace. The tokenization here handles quote counting and emits quoted strings as part of the same token. On the other side, it also handles likely JSON logs without any whitespace in them better, by trying to split `{"key":value}` pairs (without actually parsing the JSON).

All of this is done without using regular expressions and without actually parsing the log lines in any specific format. That's why it works very efficiently in terms of CPU usage and allocations, and should handle all log formats and unformatted logs equally well.
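As a rough illustration of the intended usage, here's a minimal sketch. The entry point name PreprocessAndTokenize, the import path, and the placeholder spellings in the comments are assumptions for illustration, not necessarily the final exported API:

package main

import (
	"fmt"

	"github.com/grafana/loki/v3/pkg/pattern/tokenization" // import path may differ
)

func main() {
	line := []byte(`ts=2024-04-25T10:32:01Z msg="request done" ip=10.0.0.1 duration=154ms`)

	// Hypothetical entry point: replaces the dynamic values with placeholders
	// and splits the line into Drain-friendly tokens, keeping the quoted
	// msg="..." value together as a single token.
	tokens := tokenization.PreprocessAndTokenize(line)
	fmt.Println(tokens)
	// Illustrative output, with assumed placeholder names:
	//   [ts=<TIMESTAMP> msg="request done" ip=<IP> duration=<DURATION>]
}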
Tough to review but as far as I can tell you've already covered the vast majority of test cases and it seems to work well. I have some concerns about maintainability but I could follow along & it is well commented so I think it's as good as it can be.
The rest of my comments are relatively minor so I'm happy to approve this. Nice job, very impressive implementation!
var boundaryChars = [256]bool{}

func init() {
	for i := 0; i < 128; i++ { // for now, we don't consider non-ASCII chars as boundary
The array has length 256, but the loop only goes up to 128. Is that intentional?
After reading the rest of the code, it looks like it is intentional. I assume this is a performance optimization?
Yeah, I commented with part of my reasoning on the line here, that "for now, we don't consider non-ASCII chars as boundary". And the reason for that, as you've surmised, is mostly performance, but also avoiding extra complexity for arguably very little benefit.

Processing the string rune by rune would probably be significantly more computationally expensive than going byte by byte, so I think we should avoid it if we can. A lot of Go string processing functions do the same: they have a happy path that works with bytes if they can. Just search for utf8.RuneSelf (which is 128) in the strings.go file from the Go stdlib 😅

For our use case, while there are quite a few non-ASCII whitespace and non-word characters in Unicode we could consider as "boundary" characters, it's highly unlikely for any of them to be used in logs in a way that matters for the purposes of this code 🤞 If they are used, it's likely going to be in some quoted strings like msg="unicode “съобщение” 😊", not in the actual log message structure.

And in the vanishingly few cases where that assumption doesn't hold, it's unlikely that this code would result in a worse outcome than if we just used Drain + whitespace tokenization without it, so it shouldn't hurt to have it 😅

In other words, I'd prefer to avoid dealing with that mess unless we see that we really need to 😅
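To make the byte-level fast path concrete, here's a minimal sketch of the stdlib-style pattern being described (illustrative only, not code from this PR):

package tokenization

import "unicode/utf8"

// hasBoundary reports whether s contains an ASCII boundary character. Bytes
// below utf8.RuneSelf (128) are complete runes on their own, so they can be
// classified directly; multi-byte UTF-8 sequences are skipped without
// decoding, mirroring the "non-ASCII is never a boundary" decision above.
func hasBoundary(s string) bool {
	for i := 0; i < len(s); i++ {
		if s[i] < utf8.RuneSelf && boundaryChars[s[i]] {
			return true
		}
	}
	return false
}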
} else if n[1] == 's' && (n[0] == 'm' || n[0] == 'n' || n[0] == 'u') && boundaryChars[n[2]] {
	secsLen = 2
} else if n[2] == 's' && ((n[0] == 0xC2 && n[1] == 0xB5) || (n[0] == 0xCE && n[1] == 0xBC)) && boundaryChars[n[3]] {
	// This checks for the unicode "µs" (U+00B5 = micro symbol) and "μs" (U+03BC = Greek letter mu)
According to https://pkg.go.dev/time#ParseDuration, only one of these is actually valid. But they look identical so I guess it's good to support both.
That's where I actually got both from, see this 😅
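For the record, Go's time.ParseDuration does accept both µ variants (plus plain ASCII "us"), which is easy to verify:

package main

import (
	"fmt"
	"time"
)

func main() {
	// U+00B5 (micro sign), U+03BC (Greek small letter mu), and plain ASCII:
	for _, s := range []string{"15µs", "15μs", "15us"} {
		d, err := time.ParseDuration(s)
		fmt.Println(s, d, err) // all three parse to 15µs with a nil error
	}
}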
[]byte("Saturday"), | ||
} | ||
|
||
func (r *replacer) handleWeirdTimestamp() (endsWithBoundary bool) { |
Not useful: I love this method name, but I was secretly hoping for handleInsaneTimestamp!
// - "Mon Jan 2 15:04:05 2006" | ||
// - "Mon Jan 2 15:04:05 MST 2006" | ||
// - "Mon Jan 02 15:04:05 -0700 2006" | ||
if r.advanceChar(' ') && r.advanceMonthName() && r.advanceChar(' ') && (r.advanceChar(' ') || true) && r.advanceUintUpTo(31, 2) && r.advanceChar(' ') && r.advanceTime(false) && r.advanceStringOrNumericTimeZone(true) && r.advanceChar(' ') && r.advanceYear() { |
&& (r.advanceChar(' ') || true)
this is clever 😻
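For anyone following along: the || true makes the second space optional rather than required, which is what lets both day paddings of these formats match:

Mon Jan  2 15:04:05 2006   (single-digit day padded with an extra space)
Mon Jan 12 15:04:05 2006   (double-digit day after a single space)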
One of the potential minor performance improvements I want to benchmark at some future point is actually getting rid of advanceOneOf. We only use it in a couple of places, here:

loki/pkg/pattern/tokenization/replacer.go (line 191 in c62c6cf):

return r.advanceOneOf('+', '-') && r.advanceUintUpTo(14000, 5)

and here:

loki/pkg/pattern/tokenization/replacer.go (line 535 in c62c6cf):

if r.advanceUintUpTo(12, 2) && r.advanceChar(delim) && r.advanceUintUpTo(31, 2) && r.advanceOneOf('T', ' ') && r.advanceTime(false) {

And both of these can be easily rewritten like this:

(r.advanceChar('+') || r.advanceChar('-'))
(r.advanceChar('T') || r.advanceChar(' '))

I'll actually comment on a few other potential minor performance improvements I've thought of but haven't had time to benchmark, just so I don't forget them in the future 😅
Sorry if this makes the code review harder, but I decided that now was probably the best moment to split the code up into multiple files. (No code was actually changed in that commit, just moved around.)
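For context, a sketch of what the two compared forms might look like. The advanceOneOf body below is my assumption of its shape based on how it's called, not the actual PR code:

// Hypothetical shape of the generic helper: loops over candidate bytes.
func (r *replacer) advanceOneOf(chars ...byte) bool {
	if r.cur >= len(r.source) {
		return false
	}
	for _, c := range chars {
		if r.source[r.cur] == c {
			r.cur++
			return true
		}
	}
	return false
}

The chained form, e.g. (r.advanceChar('+') || r.advanceChar('-')), trades the loop and the variadic slice for straight-line comparisons, which is why it might benchmark faster.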
	break
}
length++
val = val*10 + uint(c-'0') // TODO: use bitwise trick?
This is what I meant by "bitwise trick": https://www.guidodiepen.nl/2017/03/clever-bit-tricks-with-ascii/#integervaluesinascii
I doubt it's actually faster on modern CPU architectures (and it's certainly less readable), but worth benchmarking to confirm, just in case 😅
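A pair of throwaway benchmarks along these lines would settle it (a sketch; both forms extract the digit value because ASCII '0'..'9' are 0x30..0x39):

package tokenization

import "testing"

func BenchmarkDigitSubtract(b *testing.B) {
	var val uint
	for i := 0; i < b.N; i++ {
		for c := byte('0'); c <= '9'; c++ {
			val = val*10 + uint(c-'0') // current form
		}
	}
	_ = val
}

func BenchmarkDigitMask(b *testing.B) {
	var val uint
	for i := 0; i < b.N; i++ {
		for c := byte('0'); c <= '9'; c++ {
			val = val*10 + uint(c&0x0F) // the "bitwise trick": mask the low nibble
		}
	}
	_ = val
}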
if pPos != -1 && !boundaryChars[r.source[pPos]] {
	continue
}
if bytes.HasPrefix(r.source[pPos+1:], key) {
This probably can be optimized: the numPreservingKeys values can be of a string type, and if this check is something like string(r.source[pPos+1:pPos+1+len(key)]) == key, it might be faster. This should not result in an actual string allocation because of some Go compiler optimizations for cases like this. Besides, this is basically what bytes.HasPrefix does internally as well, just with an extra bounds check and function calls... 😅 Which are probably inlined, but it can't hurt to compare the values directly 🤷‍♂️
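Something like this hypothetical helper is what I mean (the string(...) conversion inside a comparison is a pattern the Go compiler recognizes, so it doesn't allocate):

// hasPrefixString reports whether b starts with key; a sketch of the
// suggested alternative to bytes.HasPrefix for string-typed keys.
func hasPrefixString(b []byte, key string) bool {
	return len(b) >= len(key) && string(b[:len(key)]) == key
}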
if bytes.HasPrefix(backtrackedSlice, dw) {
	matchedDay = dw
	break
}
If longDayNames was a map[[3]byte] or map[string], the lookup might be a bit faster if we can do it without an allocation, though not sure if that matters a lot for a slice of 7 elements 😅 Still, while this specific part of the code is probably executed fairly rarely, it couldn't hurt to benchmark 😅
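A sketch of the map variant, assuming the backtracked slice can be bounded to an exact candidate length before the lookup (a map index with string(b) is also a compiler-recognized pattern and doesn't allocate):

var longDayNamesSet = map[string]struct{}{
	"Sunday": {}, "Monday": {}, "Tuesday": {}, "Wednesday": {},
	"Thursday": {}, "Friday": {}, "Saturday": {},
}

// isLongDayName reports whether b is exactly one of the long day names.
func isLongDayName(b []byte) bool {
	_, ok := longDayNamesSet[string(b)] // no allocation here
	return ok
}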
if i < '0' || (i > '9' && i < 'A') || (i > 'Z' && i < 'a') || i > 'z' {
	boundaryChars[i] = true
I am not certain if the slice lookups like if boundaryChars[c] { /* do something */ } in the code below are actually faster than if we had a simple function like this:

func isBoundary(c byte) bool {
	return c < '0' || (c > '9' && c < 'A') || (c > 'Z' && c < 'a') || (c > 'z' && c < 128)
}

I suspect the function will be faster because of how CPU caching works, but I haven't actually benchmarked it. I adopted this approach originally because I had a couple of exceptions here (e.g. I didn't consider _ to be a boundary character initially).
// We weren't at a boundary and now we are, so check if we match one
// of the special keywords that will preserve numbers
if onBoundary {
Could we also add another check here to avoid the extra processing? r.cur - r.tail > minNumPreservingKey
Hmm 🤔 Maybe, though I am not sure if this will always hold. For example, the _ in status_code= is technically a boundary character, so it's not inconceivable for status to already be "flushed" when we reach the =.
Yeah, just stepped through the status_code=123 example with a debugger, and that's exactly what happens: because _ is a boundary and c is a potential start of a hex string, we encounter this branch and copy status_ to the r.dest slice (so r.cur - r.tail becomes 0) 😞

Let's keep this thread open though, as a potential performance improvement along with the other ones I opened, because that's an implementation detail that's fixable. If profiling proves that this would make a big difference, we can change that behavior to be less aggressive in copying the source string.
LGTM 🚀 thanks again for this @na--
Not for this PR, but it might be worth having separate tests for the replacer and tokeniser; it would help narrow down the problem if we see any bugs down the line.
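For example, a replacer-only table test could look something like this (the replace entry point and the placeholder spellings are hypothetical):

func TestReplacerPlaceholders(t *testing.T) {
	cases := []struct{ in, want string }{
		{`took 154ms`, `took <DURATION>`}, // duration placeholder (assumed spelling)
		{`from 10.0.0.1`, `from <IP>`},    // IP placeholder (assumed spelling)
	}
	for _, c := range cases {
		if got := replace(c.in); got != c.want { // hypothetical entry point
			t.Errorf("replace(%q) = %q, want %q", c.in, got, c.want)
		}
	}
}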
Checklist

- Reviewed the CONTRIBUTING.md guide (required)
- Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
- For Helm chart changes: bump the Chart version in production/helm/loki/Chart.yaml and update production/helm/loki/CHANGELOG.md and production/helm/loki/README.md. Example PR
- If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR