Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Previously trained messages cannot always be matched #48

Closed
Impelon opened this issue Dec 26, 2021 · 7 comments
Closed

Previously trained messages cannot always be matched #48

Impelon opened this issue Dec 26, 2021 · 7 comments
Assignees
Labels
bug Something isn't working

Comments

@Impelon
Copy link
Contributor

Impelon commented Dec 26, 2021

Observe the following MWE (using the current version 0.9.7):

>>> from drain3 import *
>>> parser = TemplateMiner()
>>> parser.add_log_message("training4Model start")
{'change_type': 'cluster_created', 'cluster_id': 1, 'cluster_size': 1, 'template_mined': 'training4Model start', 'cluster_count': 1}
>>> parser.add_log_message("loadModel start")
{'change_type': 'cluster_template_changed', 'cluster_id': 1, 'cluster_size': 2, 'template_mined': '<*> start', 'cluster_count': 1}
>>> parser.add_log_message("loadModel stop")
{'change_type': 'cluster_created', 'cluster_id': 2, 'cluster_size': 1, 'template_mined': 'loadModel stop', 'cluster_count': 2}
>>> parser.match("loadModel start") is not None # This message was seen during training, so it should match.
False

Since loadModel start was previously passed in to parser.add_log_message, it should be possible to match it later in the "interference" mode.
I would expect previously trained messages to always match later on.
As illustrated above, this cannot be guaranteed for arbitrary messages.


I've analyzed what is happening by looking through the source code and came to the following explanation:

  • When parser.match("loadModel start") is called, there are 2 clusters present with the following templates:
    1. <*> start: This is the cluster that includes the first occurrence of loadModel start.
    2. loadModel stop
  • Thus when tree_search is called, loadModel matches the prefix of 2nd template and 1st template is completely disregarded. (line 131)
  • Obviously loadModel start does NOT exactly match loadModel stop and as such no match is found. (line 134)
    https://github.com/IBM/Drain3/blob/6b724e7651bca2ac418dcbad01602a523a70b678/drain3/drain.py#L121-L135

The reason the first template was modified to include a wildcard (which later prevents the log message from matching), is that training4Model includes a number and thus will replaced by a wildcard if no other prefix matches.
See: https://github.com/IBM/Drain3/blob/6b724e7651bca2ac418dcbad01602a523a70b678/drain3/drain.py#L196-L199
The following example works perfectly fine therefore:

>>> from drain3 import *
>>> parser = TemplateMiner()
>>> parser.add_log_message("trainingModel start")
{'change_type': 'cluster_created', 'cluster_id': 1, 'cluster_size': 1, 'template_mined': 'trainingModel start', 'cluster_count': 1}
>>> parser.add_log_message("loadModel start")
{'change_type': 'cluster_created', 'cluster_id': 2, 'cluster_size': 1, 'template_mined': 'loadModel start', 'cluster_count': 2}
>>> parser.add_log_message("loadModel stop")
{'change_type': 'cluster_template_changed', 'cluster_id': 2, 'cluster_size': 2, 'template_mined': 'loadModel <*>', 'cluster_count': 2}
>>> parser.match("loadModel start") is not None # This message was seen during training, so it should match.
True

I'm not sure how this should be optimally handled.

@davidohana davidohana added the bug Something isn't working label Dec 28, 2021
@davidohana
Copy link
Collaborator

Indeed an interesting edge case. I am not sure how acute is it, since another call to parser.add_log_message("loadModel start") would fix it too, and mostly you get many logs of the same type so you can say that match() is eventually consistent. But I guess it depends on the use case.

I committed two possible ways to solve it:

  1. Option to disable treating numeric tokens as parameters. This will make Drain treat training4Model as a non-parametric token like trainingModel. But this is only a patch because it might also occur if the masker decides to parameterize some token. Nevertheless, I think parametrize_numeric_tokens should be disabled if you have good masking instruction coverage.
  2. A more complete solution - added strategy to match() function, to fallback to full cluster search in case of no match. This may cost in performance of the match() but is good for a case in which match accuracy is super important. It's possible to implement a recursive tree_search (first try exact token match and fallback to wildcard match). This will be both accurate and more efficient than the linear full search, but I am unable to implement if now due to lack of time.

WDYT?

@davidohana davidohana self-assigned this Dec 28, 2021
@Impelon
Copy link
Contributor Author

Impelon commented Dec 28, 2021

Thank you for the quick response and improvements made!
It's great to have more flexibility.

  1. Option to disable treating numeric tokens as parameters. [...] Nevertheless, I think parametrize_numeric_tokens should be disabled if you have good masking instruction coverage.

I believe this would indeed solve the problem in my case.

  1. A more complete solution - added strategy to match() function, to fallback to full cluster search in case of no match. This may cost in performance of the match() but is good for a case in which match accuracy is super important.

In my case I it is important that I don't modify existing templates after the training phase. The most I could do in case match() fails is to add a new template. For me the new fallback strategy would be a good fit, just as an extra precaution in case setting parametrize_numeric_tokens = False does not resolve everything.

Just as an extra note, I think the current documentation for the always strategy is a bit misleading:
https://github.com/IBM/Drain3/blob/8c1e94d74b682b38d18b8091480c58f762acd491/drain3/drain.py#L357-L359
As far as I understand it suffers from the same problem as the fallback-strategy in the sense that a

non-optimal match with more wildcard parameters than necessary

can be found, and this is even more likely here since a restricted search was not run before.

Another thing to note is that fast_match does not necessarily require a list of IDs, just a sequence.
So just passing self.id_to_cluster.keys() will be more memory efficient and slightly faster because no list is created, but I suspect that there are few situations where this will actually makes a difference.
https://github.com/IBM/Drain3/blob/8c1e94d74b682b38d18b8091480c58f762acd491/drain3/drain.py#L373-L376

@davidohana
Copy link
Collaborator

davidohana commented Dec 28, 2021

always should always return the best match as opposed to fallback, because fallback will return the best match only among the clusters in the same leaf node, in case tree search found a match. always compares all clusters with the same sequence length, starting from the root. Tried to clarify that in the docs.
Thanks for the comment regarding list cast - removed it.

@davidohana
Copy link
Collaborator

I see now I have a bug in always, working on a fix.

@davidohana
Copy link
Collaborator

Fixed.

@Impelon
Copy link
Contributor Author

Impelon commented Dec 28, 2021

always should always return the best match as opposed to fallback, because fallback will return the best match only among the clusters in the same leaf node, in case tree search found a match. always compares all clusters with the same sequence length, starting from the root. Tried to clarify that in the docs.

Got it now! Thanks again for the quick fixes.

@davidohana
Copy link
Collaborator

Changes are available at v0.9.8, released today.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants