-
Notifications
You must be signed in to change notification settings - Fork 454
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[query] Change Tag ID generation to avoid possible collisions #1286
Conversation
// of the tags into the provided buffer. | ||
func (t Tags) IDMarshalTo(b []byte) []byte { | ||
// ID returns a byte slice representation of the tags. | ||
func (t Tags) ID() []byte { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also fix this ID so that it doesn't generate potentially colliding metric IDs?
i.e.
series 1, tags={"set_values":"something=foo,someother=bar"}
# generates this ID:
set_values=something=foo,someother=bar,
series 2, tags={"set_values":"something=foo","someother":"bar"}
# generates same ID:
set_values=something=foo,someother=bar,
In Prometheus they use strconv.Quote(...)
to avoid this:
https://github.com/prometheus/tsdb/blob/master/labels/labels.go#L43-L58
You could refactor this method from the Go std library to use just []byte as well:
https://golang.org/src/strconv/quote.go
There's some sub-functions but it wouldn't be too difficult to do this properly by using a slightly modified version of it:
func appendQuotedWith(buf []byte, s string, quote byte, ASCIIonly, graphicOnly bool) []byte {
buf = append(buf, quote)
for width := 0; len(s) > 0; s = s[width:] {
r := rune(s[0])
width = 1
if r >= utf8.RuneSelf {
r, width = utf8.DecodeRuneInString(s)
}
if width == 1 && r == utf8.RuneError {
buf = append(buf, `\x`...)
buf = append(buf, lowerhex[s[0]>>4])
buf = append(buf, lowerhex[s[0]&0xF])
continue
}
buf = appendEscapedRune(buf, r, quote, ASCIIonly, graphicOnly)
}
buf = append(buf, quote)
return buf
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call. Would you recommend going this way or maybe doing something like pre-processing
tag lengths and then appending it to the start of the ID, so it would look like:
series 1
"set_values":"something=foo,someother=bar" -> 0.set_values:something=foo,someother=bar
series 2
"set_values":"something=foo","someother=bar" -> 0,24.set_values:something=foosomeother:bar
Codecov Report
@@ Coverage Diff @@
## master #1286 +/- ##
========================================
- Coverage 70.6% 64.5% -6.1%
========================================
Files 811 814 +3
Lines 68960 69248 +288
========================================
- Hits 48698 44733 -3965
- Misses 17126 21545 +4419
+ Partials 3136 2970 -166
Continue to review full report at Codecov.
|
@@ -1,4 +1,4 @@ | |||
// Copyright (c) 2018 Uber Technologies, Inc. | |||
// Copyright (c) 2019 Uber Technologies, Inc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how this got detected as a moved file, gut I deleted the resolver and added this file instead
Nice! Mind posting some benchmark results to get a sense of the improvement here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks decent!
Couple high level things:
- The code's a bit hard to read--left some potential suggestions on that.
- How much of a perf boost is this?
- Can we separate out refactored stdlib
strconv
code from our custom functions at all? Would make reviewing easier I think (I hardly looked at thestrconv
stuff, but if I know which bits we wrote, I can take a closer look at it.
takeFunc takeFunc | ||
params NodeParams | ||
opType string | ||
opTypeBytes []byte |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: do we really need both opTypeBytes
and opType
, or could just store one and convert as needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah damn, thought I'd reverted these, good catch.
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN | ||
// THE SOFTWARE. | ||
|
||
package strconv |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the story with this package? Is it related to Rob's comment about refactoring the stdlib package to use []byte
? Mind adding a package comment describing that?
Also from a perf perspective, how necessary is this compared to using built-in strconv
and converting back to bytes? How hot of a path is this in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this refactors the strconv
quote.go
approach in two main ways;
- changes string functions to bytes
- works on a pre-set byte slice and inserts at indices rather than working on
append
The ID function is called on each incoming write, so it's fairly hot.
src/query/models/tags.go
Outdated
return id | ||
} | ||
|
||
func prependTagLengthMeta(dst []byte, lengths []int) int { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: is prepend the right term here? prepend to me implies adding to the beginning of the slice, and shifting it); I'd call these either append*
or write*
. write
is probably best, since append can grow the array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, sounds good. Was thinking prefix
or something as well, but will probably go with write
src/query/models/tags.go
Outdated
return idx | ||
} | ||
|
||
// idLen returns the length of the ID that would be generated from the tags. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Comment needs updating
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch
} | ||
|
||
// ID returns a byte slice representation of the tags. | ||
func (t Tags) prependMetaID() []byte { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: comment needs updating
src/query/models/tags.go
Outdated
return id | ||
} | ||
|
||
func (t Tags) escapingAndLength() ([]bool, int) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment on this? Not immediately clear what the first return value is, which makes it kind of hard to parse this function in general.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I'm assuming this approach of figuring out where needs to be escaped and then allocating once is faster in benchmarks (since it does make for more complex code)--is that correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One potential way to make this signature more clear without compromising perf (I think):
type tagEscaping struct {
EscapeName bool
EscapeValue bool
}
func (t Tags) escapingAndLength() ([]tagEscaping, int) {
...
https://play.golang.org/p/ctnDxYvBxyK shows that the size of the two is the same (i.e. an array of 2 of the struct takes up the same space as 4 bools)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, hitting append
a lot can hurt performance since it's constantly re-allocating and copying slices around, and you end up with a slice that's too big.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately we need to calculate the length for the whole tag set, otherwise we're allocating a bunch of temporary buffers per-tag, then putting them together. If Tags
handles this itself, it can determine the total buffer size for the whole ID
|
||
// NeedToEscape returns true if the byte slice contains characters that will | ||
// need to be escaped when quoting the slice. | ||
func NeedToEscape(bb []byte) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these *Escape
functions come from strconv
? If not, should we move them to a separate package? It'd be easier to reason about this package given a clear separation of our modifications and copied code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a super cut down version of utf8.DecodeRune
src/query/models/tags.go
Outdated
// TODO: pool these bytes. | ||
id := make([]byte, length) | ||
idx := 0 | ||
for i, tag := range t.Tags { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A potential readability refactor for your consideration: could we build up this function in terms of a method to write out a single tag to a []byte
? Something like:
length := 0
for _, t :=` range t.Tags {
sliceLen += t.SerializedLength() // this would take into account NeedsEscaping, etc
}
id := make([]byte, length)
for _, t := range t.Tags {
t.WriteTo(id)
}
src/query/models/tags.go
Outdated
if needEscaping[i*2+1] { | ||
idx = strconv.Quote(id, tag.Value, idx) | ||
} else { | ||
idx = strconv.QuoteSimple(id, tag.Value, idx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How much benefit does having Quote
and QuoteSimple
give you? Code would definitely be simpler without the distinction (though I could see how this would be faster).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a really large difference;
10_tags,_2_len_easy-8 5000000 328 ns/op 64 B/op 1 allocs/op
10_tags,_2_len_hard-8 3000000 583 ns/op 112 B/op 2 allocs/op
100_tags,_2_len_easy-8 500000 3207 ns/op 704 B/op 1 allocs/op
100_tags,_2_len_hard-8 300000 5511 ns/op 1104 B/op 2 allocs/op
1000_tags,_2_len_easy-8 50000 31613 ns/op 8194 B/op 1 allocs/op
1000_tags,_2_len_hard-8 30000 59363 ns/op 12292 B/op 2 allocs/op
10_tags,_10_len_easy-8 3000000 397 ns/op 144 B/op 1 allocs/op
10_tags,_10_len_hard-8 1000000 1257 ns/op 272 B/op 2 allocs/op
100_tags,_10_len_easy-8 500000 4007 ns/op 1536 B/op 1 allocs/op
100_tags,_10_len_hard-8 100000 13479 ns/op 2896 B/op 2 allocs/op
1000_tags,_10_len_easy-8 30000 41068 ns/op 16388 B/op 1 allocs/op
1000_tags,_10_len_hard-8 10000 155339 ns/op 29325 B/op 2 allocs/op
10_tags,_100_len_easy-8 1000000 1176 ns/op 1152 B/op 1 allocs/op
10_tags,_100_len_hard-8 200000 9091 ns/op 2080 B/op 2 allocs/op
100_tags,_100_len_easy-8 200000 10434 ns/op 10880 B/op 1 allocs/op
100_tags,_100_len_hard-8 20000 90329 ns/op 21969 B/op 2 allocs/op
1000_tags,_100_len_easy-8 10000 108446 ns/op 106519 B/op 1 allocs/op
1000_tags,_100_len_hard-8 2000 908084 ns/op 215157 B/op 4 allocs/op
10_tags,_1000_len_easy-8 200000 6704 ns/op 10240 B/op 1 allocs/op
10_tags,_1000_len_hard-8 20000 89744 ns/op 20512 B/op 2 allocs/op
100_tags,_1000_len_easy-8 20000 63663 ns/op 106501 B/op 1 allocs/op
100_tags,_1000_len_hard-8 2000 850395 ns/op 205065 B/op 2 allocs/op
1000_tags,_1000_len_easy-8 2000 650311 ns/op 1008189 B/op 3 allocs/op
1000_tags,_1000_len_hard-8 200 8405176 ns/op 2014826 B/op 22 allocs/op
For most likely cases, there's a 3-10x increase, and this is just with quote characters; if using any long unicode this difference would get even worse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok, good to know. Which benchmark are these from? Do they compare preprocessing to determine which Quote
method is needed followed by Quote
/QuoteSimple
to just running Quote
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is benchmarking the entire ID()
method; easy
has just valid characters, whereas hard
has just "
. The problem is that utf8.DecodeRune
is quite a slow operation, which is called once per byte, and incurs a fair cost.
|
||
// EscapedLength computes the length required for a byte slice to hold | ||
// a quoted byte slice. | ||
func EscapedLength(src []byte) int { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this copied code or new?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is basically the strconv Quote
method as a dry-run, which follows the same path but calculates the final length instead.
Benchmark results for relevant tag number/lengths:
Basically, the prepended byte is similar in performance to the original |
Broader question about the schemes--which of the schemes are we recommending people to use? Seems like the choice has significant implications, since it changes the data that gets written (I think). Is this something that we actually need to make a config choice? We definitely need to fix the current scheme, but it seems like we should maybe just pick a correct new format and roll with that unless there are certain scenarios under which one scheme would be significantly better than the other. Thoughts? |
I may be just playing the devil's advocate - but why not just use a simple binary encoding (uint16 len+raw bytes per field) instead, and just keep the quoting for debugging/human-readable representation? |
Unless I'm misunderstanding you, that seems pretty close to the TypePrependMeta scheme, other than it using written out ints rather than bytes (Rob was a bit iffy about writing bytes representing weird control characters to the DB) |
Yep, just that just without the int->str conversion. |
@vdarulis so this is so you can use the APIs (especially the human readable ones like HTTP+JSON) without getting garbage binary in your terminal. It's also important to have readable IDs for a ton of other reasons too. |
src/query/models/config.go
Outdated
TypePrependMeta, | ||
} | ||
|
||
func (t IDSchemeType) validateIDSchemeType() error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe worth just make this an exported method? i.e. Validate() error
so that if we use it outside of YAML parsing somewhere we can call validate on it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, sounds good.
More of a general style question, but is it better to leave functions unexported initially, then export when needed, or to have "expected" behaviour like Validate()
public even if it's not used anywhere? I guess leaving it private can lead to accidentally rewriting functionality considering how go has functions floating around everywhere in a package :p
Makes sense, JSON is nasty to deal with if you have binary data. |
src/query/models/config.go
Outdated
case TypeQuoted: | ||
return "quoted" | ||
case TypePrependMeta: | ||
return "prependMeta" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe make this underscore separated, to match our other String()
methods on enum like values?
For example:
https://github.com/m3db/m3/blob/master/src/dbnode/storage/series/policy.go#L60-L72
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sgtm 👍 should've checked for existing conventions
Makes sense to me. @arnikola If readability is one of the goals, which scheme should we go for? Seems like |
idLen := 0 | ||
tagLengths := make([]int, len(t.Tags)) | ||
idLen := 1 // account for separator | ||
tagLengths := make([]int, len(t.Tags)*2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same suggestion as with the other scheme--small pair struct would help with readability here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would rather just have a float array here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Floats? But sure, this is ok as is.
@@ -199,9 +199,16 @@ type RPCConfiguration struct { | |||
// Currently only name, but can expand to cover deduplication settings, or other | |||
// relevant options. | |||
type TagOptionsConfiguration struct { | |||
// Version specifies the version number of tag options. Defaults to 0. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How much value does having Version
in addition to Scheme
add? It seems like just changing Scheme
to TypeQuoted
would be a more straightforward way of configuring us to use TypeQuoted
than changing the version number and assuming the default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I'm assuming we'll make csv
the actual default when we do a major version bump?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, we will have a point in time where we do based on a "YAML" config version change which type (i.e. we'll use quoted by default), but we don't have that mechanism right now.
So removing version makes sense.
{"100 Tags 100 Length", 100, 100}, | ||
} | ||
|
||
const typeQuotedEscaped = IDSchemeType(100) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's with the magic number here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a big number to use as an "invalid" id scheme type
src/query/models/tags.go
Outdated
for i, tt := range t.Tags[:lastIndex] { | ||
idx = tt.writeAtIndex(id, needEscaping[i], idx) | ||
id[idx] = sep | ||
fmt.Println(string(id)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove printlns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Big oops
src/query/models/tags.go
Outdated
fmt.Println(string(id)) | ||
idx++ | ||
} | ||
fmt.Println(string(id)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove printlns?
src/query/models/tags.go
Outdated
fmt.Println(string(id)) | ||
|
||
t.Tags[lastIndex].writeAtIndex(id, needEscaping[lastIndex], idx) | ||
fmt.Println(string(id)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove printlns?
src/query/models/tags_test.go
Outdated
func TestLongTagNewIDOutOfOrderQuoted(t *testing.T) { | ||
tags := testLongTagIDOutOfOrder(t, TypeQuoted) | ||
actual := tags.ID() | ||
assert.Equal(t, []byte(`t1="v1",t2="v2",t3="v3",t4="v4"`), actual) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Just as discussed, to exact match the Prometheus ID construction, adding the {
and }
should get us 100% matching.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to 100% matching. If it is 100% matching, should we go ahead and name it TypePrometheus
as proposed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the only reason we wouldn't want to do that, is if M3QL ends up using the same ID format. Which it probably should TBH.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be fixed 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Changes Tags.ID() method to return a byte slice representing the ID rather than allocating strings.