Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pkg/stanza/fileconsumer] Add trie and test cases #23665

Closed

Conversation

VihasMakwana
Copy link
Contributor

Description: Add Trie data structure and keep it separate from PR #23056

Testing: Relevant test cases added

Copy link
Member

@djaglowski djaglowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm excited to see this moving forward and glad we can look at the trie in isolation. I think we need to define very carefully what our intended use of "value" is in the nodes, and ultimately how it supports the requirements of fingerprint matching.

pkg/stanza/fileconsumer/trie.go Outdated Show resolved Hide resolved
pkg/stanza/fileconsumer/trie.go Outdated Show resolved Hide resolved
}

// Delete removes the value associated with the given key. Returns true if a
// node was found for the given key. If the node or any of its ancestors
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be precise, can the node itself ever become childless? We are deleting its value but not modifying its children. Would it be correct to say "if deleting a value on a childless node, the node itself will be deleted."?

It's not clear to me when we should delete ancestor nodes. Currently, we are saving a value on every node, so how do we assess if the ancestor node represents a fingerprint or just part of the path to the deleted node?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if deleting a value on a childless node, the node itself will be deleted

yes, that's what happens. let me give you an example. Consider following TRIE


        A
        |
        B
      /   \
    C      D
   /
  E

If we call Delete() on ABCE.

  • E will be removed from nodeC's children map
  • As node C is childless now, it's entire children map is nullified.
  • C will be removed from node B's children map.
  • Now, B is not childless. It has D as it's child. So, we will break out from here and won't proceed further.

Trie after above operation:


   A
   |
   B
   |
   D

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment appears to be out of date if we are no longer saving values on the nodes.

pkg/stanza/fileconsumer/trie_test.go Outdated Show resolved Hide resolved
pkg/stanza/fileconsumer/trie.go Outdated Show resolved Hide resolved
@djaglowski djaglowski added the Skip Changelog PRs that do not require a CHANGELOG.md entry label Jun 26, 2023
@djaglowski djaglowski changed the title Add trie and test cases [pkg/stanza/fileconsumer] Add trie and test cases Jun 26, 2023
@VihasMakwana
Copy link
Contributor Author

I'm excited to see this moving forward and glad we can look at the trie in isolation. I think we need to define very carefully what our intended use of "value" is in the nodes, and ultimately how it supports the requirements of fingerprint matching.

I was just going through the code, and I don't think we actually require the value field. Currently, we just check if fingerprint is present in the trie. We can just return true/false.
I ran the same tests after removing the value assignments. and it worked the same way.

@VihasMakwana VihasMakwana force-pushed the add-trie-fingerprints branch from d8fe374 to 057c6fc Compare June 28, 2023 08:13
@djaglowski
Copy link
Member

I'm excited to see this moving forward and glad we can look at the trie in isolation. I think we need to define very carefully what our intended use of "value" is in the nodes, and ultimately how it supports the requirements of fingerprint matching.

I was just going through the code, and I don't think we actually require the value field. Currently, we just check if fingerprint is present in the trie. We can just return true/false. I ran the same tests after removing the value assignments. and it worked the same way.

Ok, let's say we remove the value. Get doesn't return anything, so let's rename it HasFingerprint and return a bool.

I still think we need to define exactly how this trie supports our needs.

Let's say we have a trie:

t := NewTrie()
t.Put("AB")
t.Put("XYZ")

Of course, if we call t.HasFingerprint("AB") or t.HasFingerprint("XYZ"), it will return true because these are exact matches.

If we call t.HasFingerprint("ABC"), we've chosen to return true in order to support the use case where a file was observed once when it contained "AB", and again after "C" has been appended.

However, if we call t.HasFingerprint("XY"), the current implementation would return true but it's not clear to me what use case this supports.

@VihasMakwana
Copy link
Contributor Author

Ok, let's say we remove the value. Get doesn't return anything, so let's rename it HasFingerprint and return a bool.

I still think we need to define exactly how this trie supports our needs.

Let's say we have a trie:

t := NewTrie()
t.Put("AB")
t.Put("XYZ")
Of course, if we call t.HasFingerprint("AB") or t.HasFingerprint("XYZ"), it will return true because these are exact matches.

If we call t.HasFingerprint("ABC"), we've chosen to return true in order to support the use case where a file was observed once when it contained "AB", and again after "C" has been appended.

However, if we call t.HasFingerprint("XY"), the current implementation would return true but it's not clear to me what use case this supports.

Hmm, it might make more sense to make t.HasFingerprint("AB") and not t.HasFingerprint("XY"). You're right

@VihasMakwana VihasMakwana requested a review from djaglowski June 30, 2023 10:25
@VihasMakwana VihasMakwana force-pushed the add-trie-fingerprints branch 2 times, most recently from a54ab7f to c1e4d96 Compare June 30, 2023 10:35
@VihasMakwana VihasMakwana force-pushed the add-trie-fingerprints branch from c1e4d96 to 7f177f9 Compare June 30, 2023 11:45
@VihasMakwana
Copy link
Contributor Author

Let's say we have a trie:

t := NewTrie()
t.Put("AB")
t.Put("XYZ")
Of course, if we call t.HasFingerprint("AB") or t.HasFingerprint("XYZ"), it will return true because these are exact matches.

If we call t.HasFingerprint("ABC"), we've chosen to return true in order to support the use case where a file was observed once when it contained "AB", and again after "C" has been appended.

However, if we call t.HasFingerprint("XY"), the current implementation would return true but it's not clear to me what use case this supports.

@djaglowski after some careful consideration, I think there's no need to match "XY".

@VihasMakwana
Copy link
Contributor Author

@djaglowski can you review this one as your convenience?

@djaglowski
Copy link
Member

@VihasMakwana, I agree with your conclusion here, that we should change the matching behavior or the trie.

@VihasMakwana
Copy link
Contributor Author

pkg/stanza/fileconsumer/trie.go Outdated Show resolved Hide resolved
pkg/stanza/fileconsumer/trie.go Outdated Show resolved Hide resolved
}

// Delete removes the value associated with the given key. Returns true if a
// node was found for the given key. If the node or any of its ancestors
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment appears to be out of date if we are no longer saving values on the nodes.

// Delete removes the value associated with the given key. Returns true if a
// node was found for the given key. If the node or any of its ancestors
// becomes childless as a result, it is removed from the trie.
func (trie *Trie) Delete(key []byte) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When is it intended that we will call Delete? Specifically I'm trying to understand how it relates to the lifecycle of a given file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We call delete, only after the file is successfully read and all the logs are emitted and it is added to the list of knownReaders.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a case that doesn't make sense to me with this design:

  • First poll finds a file w/ "ABCDEF", add it to the trie, and start reading.
  • Second poll finds "ABCDEF" unchanged, but also finds another file w/ "ABC". HasKey("ABC") tells us that this file is not currently being read, so we call Put("ABC") but this has no effect on the trie because "ABCDEF" is already there.
  • Then "ABCDEF" is fully read, so we add it to knownReaders and call Delete("ABCDEF"). This effectively deletes both "ABCDEF" and "ABC".
  • A third poll now finds "ABC" again and HasKey("ABC") tells us it is not currently being read, so we start reading it again. Now we have two active readers for the same file resulting in duplicated logs.

This seems to indicate that we need to make use of values on nodes:

  • Put("ABC") should store a value on "C"
  • HasKey("ABC") should return true if "C" exists and has a value (regardless of whether it is a leaf)

However, this still may not be enough. Another scenario:

  • First poll finds both "ABC" and "ABCDEF".
  • The order in which we process these files changes the result:
    • HasKey("ABC") => false => Put("ABC") => HasKey("ABCDEF") => true
      • Results in trie that contains "ABC" only
    • HasKey("ABCDEF") => false => Put("ABCDEF") => HasKey("ABC") => false => Put("ABC")
      • Results in trie that contains "ABC" and "ABCDEF"

I think the correct behavior depends on the context but it's not clear to me how best to model this. If we find "ABC" in the first poll and then "ABCDEF" in the next, the trie should have only "ABCDEF". However, if find them in the opposite order or within the same poll, I think we should treat these as separate files. Do you have any ideas how to address this?

Copy link
Contributor Author

@VihasMakwana VihasMakwana Jul 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this one is a unique situation.
Let us discuss how current file reading works.
So what happens in the current implementation:

  • It will loop through all the file paths.
  • It will create fingerprints, then check for duplicates.

While checking duplicates, it will match both ways.
It will match;"ABCDEF" and "ABCD"; and; "ABCD" and "ABCDEF" if that's true, it will treat them as duplicates.
https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/pkg/stanza/fileconsumer/file.go#L228

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now to handle your scenario, we can use values in trie or we can go as per current implementation.

Copy link
Contributor Author

@VihasMakwana VihasMakwana Jul 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have one scenario to confirm:
if we have A - B - C (true) - D - E (true) - F - X -Y (true)

  • if we try to add, A - B - C - D - X to the above trie, we shouldn't add a duplicate entry but push the value at C down. Trie would become
A - B - C - D - E (true) - F - X -Y (true)
            |
           X (true)
  • if we try to add A - B - C - D to the above trie, we shouldn't add a duplicate entry but push the value at C down. Trie would become A - B - C - D (true) - E (true) - F - X -Y (true)
  • if we try to add, A - B - C - D - E - F - X - Y- Z to the above trie, we shouldn't add a duplicate entry, but push the value at E down. Trie would become A - B - C (true) - D - E(true) - F - X - Y - Z(true)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other words, we are pushing the last value encountered while traversing the current path?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that makes sense. At least I can't think of a scenario where that ends up giving us the wrong result.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I pushed the changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@VihasMakwana VihasMakwana requested a review from djaglowski July 11, 2023 07:01
@djaglowski
Copy link
Member

@VihasMakwana, I had a hard time correlating the existing tests with the real world scenarios we've discussed, so I ended up writing an additional framework for validating operation sequences that may need to run on the trie. I opened a PR against your branch here.

I found a few failures in the basic operations.

I also found another aspect of the design which is unclear: Now that we've updated the Put function to update or "push" a fingerprint further into the trie, it's unclear when we would call it. I'd appreciate your thoughts on this.

@VihasMakwana
Copy link
Contributor Author

Okay, I feel like we're finally on same page. Just need to think about this 'push' thing.
Will go through your pr.

Add tree-based test suite that explores many possible sequences of operations
@VihasMakwana
Copy link
Contributor Author

VihasMakwana commented Jul 28, 2023

@djaglowski We currently can push down the values, but the way we handled deletion in the thread pool was as follows:

  • Add a fingerprint to trie
  • Send reader and fingerprint to the threadpool
  • Delete fingerprint after emitting.

In our current scenario, we updated the trie and pushed the values. How do we handle deletion here? Since we have pushed the value down.
If we delete the fingerprint that we sent earlier in the threadpool, it will be a no-op.
The pushed value might as well stay in trie forever.

Deleting the fingerprint from trie occurred in the same goroutine in threadpool PR

@VihasMakwana
Copy link
Contributor Author

Maybe we can use something like this, with some extra bookkeeping:

An important assumption:

  • There is a 1-1 relationship between readers and open fingerprints.

Instead of storing a mere boolean in value, we can assign each open reader a unique ID. And store that in the trie.
So when we push the value down the trie, we can actually know to which reader it belongs.
We can add one more field to the trie nodes, parent. And then we can delete nodes iteratively.
We would need hashmaps for this.

Do you get what I'm saying?

@VihasMakwana
Copy link
Contributor Author

I understand this is complex, but as pushing the values is required to avoid data loss, this or something like this is required.

@VihasMakwana VihasMakwana force-pushed the add-trie-fingerprints branch from fd12e70 to cfb85ff Compare July 30, 2023 17:11
@VihasMakwana
Copy link
Contributor Author

@djaglowski I was able to fix your test case failures in almost all of your test cases. A couple of questions were answered, let's focus on fixing them once we agree on the design.

I tweaked the Delete and added a new helper.
The helper allows use to delete the node and travel upwards in the trie. We just provide the node we're interested in deleting and it does the work. This will be the most crucial one for the threadpool PR.

@VihasMakwana VihasMakwana force-pushed the add-trie-fingerprints branch from cfb85ff to 1df62b3 Compare July 30, 2023 17:13
Copy link
Member

@djaglowski djaglowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of storing a mere boolean in value, we can assign each open reader a unique ID. And store that in the trie.
So when we push the value down the trie, we can actually know to which reader it belongs.

I think it would be ideal if the trie does not need to have additional information within it, so we can test/maintain it in isolation.

WDYT of my idea to switch to Get() []byte (or Get() int) and store the length of the registered fingerprint on the reader? This is a tiny amount of additional info, only added to the reader, and should be enough to properly call Delete when the reader is done.

"Found:ABCDEF": opTree{ // Next poll finds ABCDEF
ops: []testOp{
has("ABCDEF", true, "recognize ABC w/ DEF appended"),
put("ABCDEF"), // TODO HasKey returns true, so how do we know to call this?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding this TODO, do we need a way to differentiate prefix matches from exact matches? Otherwise I'm not sure we we can take advantage of the "push" functionality in Put.

We could switch back to Get and return the value that "matched" the key. The caller can interpret whether or not it was what they passed as a parameter or just a prefix. WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we can do that.
Also, If we push the exact matches then it would make no difference to trie.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we push the exact matches then it would make no difference to trie.

In most situations I'd be ok with duplicating the Put but this package is optimized wherever possible so I'd prefer to err on the side of performance here.

Comment on lines +226 to +238
"DeleteAs:ABC": opTree{ // Done reading the file, remove it as ABC
ops: []testOp{
del("ABC", false, "should have been pushed down when ABCDEF was added"),
has("ABCDEF", true, "should not have been deleted"),
},
continuations: map[string]opTree{
"DeleteAs:ABCDEF": opTree{ // Also remove it as ABCDEF
ops: []testOp{
del("ABCDEF", true, "just confirmed it exists"), // TODO this fails to delete
},
},
},
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree w/ your assessment here that we only need to delete ABCDEF, so we can remove this case.

The main thing here is that we need to know what to pass to Delete. I think my suggestion of switching back to Get could work here because we can save a value on the reader that indicates what to pass to Delete when the time comes.

Comment on lines +529 to +533
// Below won't necessarily be false.
// A - B(true) - C - D (true)
// If we delete ABCD in above trie, the trie becomes A - B(true)
// ABCD would still match, as AB \w CD appended
// assert.Falsef(t, trie.HasKey([]byte(key)), "called Delete(%s) but HasKey(%s) is still true", key, key)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.

Not that we would do it just for tests, but I think this is another benefit of switching back to Get() []byte, because we can ensure there is no longer an exact match.

@VihasMakwana
Copy link
Contributor Author

Instead of storing a mere boolean in value, we can assign each open reader a unique ID. And store that in the trie.
So when we push the value down the trie, we can actually know to which reader it belongs.

I think it would be ideal if the trie does not need to have additional information within it, so we can test/maintain it in isolation.

WDYT of my idea to switch to Get() []byte (or Get() int) and store the length of the registered fingerprint on the reader? This is a tiny amount of additional info, only added to the reader, and should be enough to properly call Delete when the reader is done.

Would length be sufficient enough? Can you please elaborate.
If we have the fingerprint, it would be sufficient I guess

@djaglowski
Copy link
Member

djaglowski commented Aug 1, 2023

Would length be sufficient enough? Can you please elaborate.
If we have the fingerprint, it would be sufficient I guess

I'm figuring we would store this length value on the reader, which also has the fingerprint stored, so we can do roughly: Delete(reader.Fingerprint.FirstBytes[0:reader.registeredLength])

@VihasMakwana
Copy link
Contributor Author

I see.
So, a combination of both.
I will revert the current changes I made, adding a parent and so on.
Let's go with above approach and keep trie as Isolate as possible

@VihasMakwana
Copy link
Contributor Author

But still, when we push the value, we have to update the new length in registered reader. How would we do that? Ig still need a way to determine the reader that the fingerprint belongs to.

@VihasMakwana
Copy link
Contributor Author

VihasMakwana commented Aug 7, 2023

@djaglowski
I think of two ways from here:

  • Go with the Delete(reader.Fingerprint.FirstBytes[0:reader.registeredLength]).

    • But, when we push some value down the TRIE, we must also update registeredLength stored on the reader.
    • In other words, we need to identify the reader to which the fingerprint belongs.
  • Go with my DeleteNode(node).

    • In this approach, we store the last node in our fingerprint path on the reader
      • For eg. ReaderA has ABCDE as its fingerprint, so we would store Node - E on our reader.
    • So, when we want to delete the field, we just call DeleteNode(reader.trieNode)

In both of the above cases, I think that storing the reader info is required.
As a result, I have created a PR.

Just to be flexible, I have used type generics. All your test cases are also passing.
If you can take a look ;) Any suggestions are welcomed.

@djaglowski
Copy link
Member

@VihasMakwana, in the interest of simplicity and to hopefully regain some momentum on this effort, should we reconsider the need to "push" fingerprints down?

I believe there is some benefit to the behavior, because our understanding of small files is updated more data is appended. However, small files should be read quickly and removed from the trie anyways, so perhaps it is not worth the additional complexity.

Perhaps we can remove this "push" functionality and revisit it later. The question in my mind is whether we have identified any specific cases which rely on this behavior. WDYT?

@VihasMakwana
Copy link
Contributor Author

@djaglowski that is good by me. I can revert that part and push, so you can take a look?

@djaglowski
Copy link
Member

Closed in favor of #24982

@djaglowski djaglowski closed this Aug 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pkg/stanza Skip Changelog PRs that do not require a CHANGELOG.md entry
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants