-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[pkg/stanza/fileconsumer] Add trie and test cases #23665
[pkg/stanza/fileconsumer] Add trie and test cases #23665
Conversation
27a30ab
to
d8fe374
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm excited to see this moving forward and glad we can look at the trie in isolation. I think we need to define very carefully what our intended use of "value" is in the nodes, and ultimately how it supports the requirements of fingerprint matching.
pkg/stanza/fileconsumer/trie.go
Outdated
} | ||
|
||
// Delete removes the value associated with the given key. Returns true if a | ||
// node was found for the given key. If the node or any of its ancestors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be precise, can the node itself ever become childless? We are deleting its value but not modifying its children. Would it be correct to say "if deleting a value on a childless node, the node itself will be deleted."?
It's not clear to me when we should delete ancestor nodes. Currently, we are saving a value on every node, so how do we assess if the ancestor node represents a fingerprint or just part of the path to the deleted node?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if deleting a value on a childless node, the node itself will be deleted
yes, that's what happens. let me give you an example. Consider following TRIE
A
|
B
/ \
C D
/
E
If we call Delete()
on ABCE
.
E
will be removed from nodeC
's children map- As node
C
is childless now, it's entire children map is nullified. C
will be removed from nodeB
's children map.- Now,
B
is not childless. It hasD
as it's child. So, we will break out from here and won't proceed further.
Trie after above operation:
A
|
B
|
D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment appears to be out of date if we are no longer saving values on the nodes.
I was just going through the code, and I don't think we actually require the value field. Currently, we just check if fingerprint is present in the trie. We can just return true/false. |
d8fe374
to
057c6fc
Compare
Ok, let's say we remove the value. I still think we need to define exactly how this trie supports our needs. Let's say we have a trie: t := NewTrie()
t.Put("AB")
t.Put("XYZ") Of course, if we call If we call However, if we call |
Hmm, it might make more sense to make t.HasFingerprint("AB") and not t.HasFingerprint("XY"). You're right |
a54ab7f
to
c1e4d96
Compare
c1e4d96
to
7f177f9
Compare
@djaglowski after some careful consideration, I think there's no need to match "XY". |
@djaglowski can you review this one as your convenience? |
@VihasMakwana, I agree with your conclusion here, that we should change the matching behavior or the trie. |
yeah i did that already I guess. https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/23665/files#r1258686830 |
pkg/stanza/fileconsumer/trie.go
Outdated
} | ||
|
||
// Delete removes the value associated with the given key. Returns true if a | ||
// node was found for the given key. If the node or any of its ancestors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment appears to be out of date if we are no longer saving values on the nodes.
pkg/stanza/fileconsumer/trie.go
Outdated
// Delete removes the value associated with the given key. Returns true if a | ||
// node was found for the given key. If the node or any of its ancestors | ||
// becomes childless as a result, it is removed from the trie. | ||
func (trie *Trie) Delete(key []byte) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When is it intended that we will call Delete
? Specifically I'm trying to understand how it relates to the lifecycle of a given file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We call delete
, only after the file is successfully read and all the logs are emitted and it is added to the list of knownReaders.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's a case that doesn't make sense to me with this design:
- First poll finds a file w/ "ABCDEF", add it to the trie, and start reading.
- Second poll finds "ABCDEF" unchanged, but also finds another file w/ "ABC".
HasKey("ABC")
tells us that this file is not currently being read, so we callPut("ABC")
but this has no effect on the trie because "ABCDEF" is already there. - Then "ABCDEF" is fully read, so we add it to
knownReaders
and callDelete("ABCDEF")
. This effectively deletes both "ABCDEF" and "ABC". - A third poll now finds "ABC" again and
HasKey("ABC")
tells us it is not currently being read, so we start reading it again. Now we have two active readers for the same file resulting in duplicated logs.
This seems to indicate that we need to make use of values on nodes:
Put("ABC")
should store a value on "C"HasKey("ABC")
should return true if "C" exists and has a value (regardless of whether it is a leaf)
However, this still may not be enough. Another scenario:
- First poll finds both "ABC" and "ABCDEF".
- The order in which we process these files changes the result:
HasKey("ABC") => false
=>Put("ABC")
=>HasKey("ABCDEF") => true
- Results in trie that contains "ABC" only
HasKey("ABCDEF") => false
=>Put("ABCDEF")
=>HasKey("ABC") => false
=>Put("ABC")
- Results in trie that contains "ABC" and "ABCDEF"
I think the correct behavior depends on the context but it's not clear to me how best to model this. If we find "ABC" in the first poll and then "ABCDEF" in the next, the trie should have only "ABCDEF". However, if find them in the opposite order or within the same poll, I think we should treat these as separate files. Do you have any ideas how to address this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, this one is a unique situation.
Let us discuss how current file reading works.
So what happens in the current implementation:
- It will loop through all the file paths.
- It will create fingerprints, then check for duplicates.
While checking duplicates, it will match both ways.
It will match;"ABCDEF" and "ABCD"; and; "ABCD" and "ABCDEF" if that's true, it will treat them as duplicates.
https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/pkg/stanza/fileconsumer/file.go#L228
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now to handle your scenario, we can use values in trie or we can go as per current implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have one scenario to confirm:
if we have A - B - C (true) - D - E (true) - F - X -Y (true)
- if we try to add,
A - B - C - D - X
to the above trie, we shouldn't add a duplicate entry but push the value atC
down. Trie would become
A - B - C - D - E (true) - F - X -Y (true)
|
X (true)
- if we try to add
A - B - C - D
to the above trie, we shouldn't add a duplicate entry but push the value atC
down. Trie would becomeA - B - C - D (true) - E (true) - F - X -Y (true)
- if we try to add,
A - B - C - D - E - F - X - Y- Z
to the above trie, we shouldn't add a duplicate entry, but push the value atE
down. Trie would becomeA - B - C (true) - D - E(true) - F - X - Y - Z(true)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In other words, we are pushing the last value encountered while traversing the current path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that makes sense. At least I can't think of a scenario where that ends up giving us the wrong result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I pushed the changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@VihasMakwana, I had a hard time correlating the existing tests with the real world scenarios we've discussed, so I ended up writing an additional framework for validating operation sequences that may need to run on the trie. I opened a PR against your branch here. I found a few failures in the basic operations. I also found another aspect of the design which is unclear: Now that we've updated the |
Okay, I feel like we're finally on same page. Just need to think about this 'push' thing. |
Add tree-based test suite that explores many possible sequences of operations
@djaglowski We currently can push down the values, but the way we handled deletion in the thread pool was as follows:
In our current scenario, we updated the trie and pushed the values. How do we handle deletion here? Since we have pushed the value down. Deleting the fingerprint from trie occurred in the same goroutine in threadpool PR |
Maybe we can use something like this, with some extra bookkeeping: An important assumption:
Instead of storing a mere boolean in value, we can assign each open reader a unique Do you get what I'm saying? |
I understand this is complex, but as pushing the values is required to avoid data loss, this or something like this is required. |
fd12e70
to
cfb85ff
Compare
@djaglowski I was able to fix your test case failures in almost all of your test cases. A couple of questions were answered, let's focus on fixing them once we agree on the design. I tweaked the |
cfb85ff
to
1df62b3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of storing a mere boolean in value, we can assign each open reader a unique ID. And store that in the trie.
So when we push the value down the trie, we can actually know to which reader it belongs.
I think it would be ideal if the trie does not need to have additional information within it, so we can test/maintain it in isolation.
WDYT of my idea to switch to Get() []byte
(or Get() int
) and store the length of the registered fingerprint on the reader? This is a tiny amount of additional info, only added to the reader, and should be enough to properly call Delete
when the reader is done.
"Found:ABCDEF": opTree{ // Next poll finds ABCDEF | ||
ops: []testOp{ | ||
has("ABCDEF", true, "recognize ABC w/ DEF appended"), | ||
put("ABCDEF"), // TODO HasKey returns true, so how do we know to call this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding this TODO, do we need a way to differentiate prefix matches from exact matches? Otherwise I'm not sure we we can take advantage of the "push" functionality in Put
.
We could switch back to Get
and return the value that "matched" the key. The caller can interpret whether or not it was what they passed as a parameter or just a prefix. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah we can do that.
Also, If we push the exact matches then it would make no difference to trie.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we push the exact matches then it would make no difference to trie.
In most situations I'd be ok with duplicating the Put
but this package is optimized wherever possible so I'd prefer to err on the side of performance here.
"DeleteAs:ABC": opTree{ // Done reading the file, remove it as ABC | ||
ops: []testOp{ | ||
del("ABC", false, "should have been pushed down when ABCDEF was added"), | ||
has("ABCDEF", true, "should not have been deleted"), | ||
}, | ||
continuations: map[string]opTree{ | ||
"DeleteAs:ABCDEF": opTree{ // Also remove it as ABCDEF | ||
ops: []testOp{ | ||
del("ABCDEF", true, "just confirmed it exists"), // TODO this fails to delete | ||
}, | ||
}, | ||
}, | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree w/ your assessment here that we only need to delete ABCDEF
, so we can remove this case.
The main thing here is that we need to know what to pass to Delete. I think my suggestion of switching back to Get
could work here because we can save a value on the reader that indicates what to pass to Delete when the time comes.
// Below won't necessarily be false. | ||
// A - B(true) - C - D (true) | ||
// If we delete ABCD in above trie, the trie becomes A - B(true) | ||
// ABCD would still match, as AB \w CD appended | ||
// assert.Falsef(t, trie.HasKey([]byte(key)), "called Delete(%s) but HasKey(%s) is still true", key, key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point.
Not that we would do it just for tests, but I think this is another benefit of switching back to Get() []byte
, because we can ensure there is no longer an exact match.
Would length be sufficient enough? Can you please elaborate. |
I'm figuring we would store this length value on the reader, which also has the fingerprint stored, so we can do roughly: |
I see. |
But still, when we push the value, we have to update the new length in registered reader. How would we do that? Ig still need a way to determine the reader that the fingerprint belongs to. |
@djaglowski
In both of the above cases, I think that storing the reader info is required. Just to be flexible, I have used type generics. All your test cases are also passing. |
@VihasMakwana, in the interest of simplicity and to hopefully regain some momentum on this effort, should we reconsider the need to "push" fingerprints down? I believe there is some benefit to the behavior, because our understanding of small files is updated more data is appended. However, small files should be read quickly and removed from the trie anyways, so perhaps it is not worth the additional complexity. Perhaps we can remove this "push" functionality and revisit it later. The question in my mind is whether we have identified any specific cases which rely on this behavior. WDYT? |
@djaglowski that is good by me. I can revert that part and push, so you can take a look? |
Closed in favor of #24982 |
Description: Add Trie data structure and keep it separate from PR #23056
Testing: Relevant test cases added