Introduce simple string matcher #2433

Merged: 1 commit, Jan 20, 2017

@urso urso commented Sep 1, 2016

Match optimizes certain regular expressions by replacing common
special-case patterns with faster plain string searches.
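
As a rough illustration of the idea (not the code from this PR), a prefix pattern can be answered by a plain string comparison instead of the regexp engine. The pattern `^PATTERN` and the function names are assumptions made for this sketch; the patterns behind the benchmarks below are not shown in this thread.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// prefixRegex is a hypothetical pattern chosen for illustration only.
var prefixRegex = regexp.MustCompile(`^PATTERN`)

// matchWithRegex answers the question using the regexp engine.
func matchWithRegex(line string) bool {
	return prefixRegex.MatchString(line)
}

// matchWithPrefix gives the same answer for this particular pattern
// using a plain string comparison.
func matchWithPrefix(line string) bool {
	return strings.HasPrefix(line, "PATTERN")
}

func main() {
	for _, line := range []string{"PATTERN at start", "no PATTERN at start", ""} {
		fmt.Printf("%q regex=%v prefix=%v\n", line, matchWithRegex(line), matchWithPrefix(line))
	}
}
```

Both functions agree on every input for this pattern; only the cost per call differs.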

benchmark results:

run 1:

BenchmarkBeginning/Matcher=Regex-4       1000000              1521 ns/op
BenchmarkBeginning/Matcher=Match-4      10000000               132 ns/op
BenchmarkBeginningSpace/Matcher=Regex-4                  1000000              1368 ns/op
BenchmarkBeginningSpace/Matcher=Match-4                 10000000               135 ns/op
BenchmarkBeginningDate/Matcher=Regex-4                   1000000              1757 ns/op
BenchmarkBeginningDate/Matcher=Match-4                   1000000              1889 ns/op
BenchmarkStringPatternRegex/Matcher=Regex-4              1000000              1314 ns/op
BenchmarkStringPatternRegex/Matcher=Match-4              5000000               266 ns/op
BenchmarkStringPatternDotStarRegex/Matcher=Regex-4                 30000             37084 ns/op
BenchmarkStringPatternDotStarRegex/Matcher=Match-4               5000000               257 ns/op

run 2:

BenchmarkBeginning/Matcher=Regex-4       1000000              1413 ns/op
BenchmarkBeginning/Matcher=Match-4      10000000               135 ns/op
BenchmarkBeginningSpace/Matcher=Regex-4                  1000000              1300 ns/op
BenchmarkBeginningSpace/Matcher=Match-4                 10000000               144 ns/op
BenchmarkBeginningDate/Matcher=Regex-4                   1000000              1918 ns/op
BenchmarkBeginningDate/Matcher=Match-4                   1000000              1727 ns/op
BenchmarkStringPatternRegex/Matcher=Regex-4              1000000              1300 ns/op
BenchmarkStringPatternRegex/Matcher=Match-4              5000000               262 ns/op
BenchmarkStringPatternDotStarRegex/Matcher=Regex-4                 50000             36425 ns/op
BenchmarkStringPatternDotStarRegex/Matcher=Match-4               5000000               254 ns/op

In the BenchmarkBeginningDate benchmark the regular expression cannot be optimized (yet?), so both the Regex and the Match tests execute the full regular expression.

@urso urso added the in progress and discuss labels Sep 1, 2016
@urso urso force-pushed the enh/string-matcher branch from 79a0120 to ae31a99 Compare January 10, 2017 12:59

urso commented Jan 10, 2017

Updated matcher with support for handling dates (only numeric) at beginning of logs and alternative prefixes.
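
A minimal sketch of how a numeric date prefix might be checked without the regexp engine. The layout `dddd-dd-dd` and the pattern `^\d{4}-\d{2}-\d{2}` are assumptions for illustration; the actual date patterns supported by the matcher are not listed in this thread.

```go
package main

import (
	"fmt"
	"regexp"
)

// dateRegex is a hypothetical "starts with a numeric date" pattern.
var dateRegex = regexp.MustCompile(`^\d{4}-\d{2}-\d{2}`)

// matchDateRegex is the regexp-based reference implementation.
func matchDateRegex(line string) bool { return dateRegex.MatchString(line) }

// startsWithNumericDate checks the same layout byte by byte.
// In the layout, 'd' stands for one ASCII digit; any other byte must
// match literally. Go's `\d` only matches ASCII digits, so both
// functions accept the same inputs.
func startsWithNumericDate(line string) bool {
	const layout = "dddd-dd-dd"
	if len(line) < len(layout) {
		return false
	}
	for i := 0; i < len(layout); i++ {
		c := line[i]
		if layout[i] == 'd' {
			if c < '0' || c > '9' {
				return false
			}
		} else if c != layout[i] {
			return false
		}
	}
	return true
}

func main() {
	for _, line := range []string{"2017-01-10 12:59:00 started", "INFO 2017-01-10 started", "2017/01/10"} {
		fmt.Printf("%q regex=%v manual=%v\n", line, matchDateRegex(line), startsWithNumericDate(line))
	}
}
```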

Updated benchmarks:

BenchmarkPatterns/Name=match_any_1,_Matcher=Regex,_Content=mixed-4         	   50000	     25800 ns/op
BenchmarkPatterns/Name=match_any_1,_Matcher=Match,_Content=mixed-4         	20000000	        74.7 ns/op
BenchmarkPatterns/Name=match_any_1,_Matcher=Regex,_Content=simple_log-4    	   50000	     32951 ns/op
BenchmarkPatterns/Name=match_any_1,_Matcher=Match,_Content=simple_log-4    	20000000	        60.3 ns/op
BenchmarkPatterns/Name=match_any_1,_Matcher=Regex,_Content=simple_log2-4   	   50000	     32952 ns/op
BenchmarkPatterns/Name=match_any_1,_Matcher=Match,_Content=simple_log2-4   	20000000	        59.9 ns/op
BenchmarkPatterns/Name=match_any_1,_Matcher=Regex,_Content=simple_log_with_level-4         	   50000	     34731 ns/op
BenchmarkPatterns/Name=match_any_1,_Matcher=Match,_Content=simple_log_with_level-4         	20000000	        59.2 ns/op
BenchmarkPatterns/Name=match_any_2,_Matcher=Regex,_Content=mixed-4                         	  100000	     16937 ns/op
BenchmarkPatterns/Name=match_any_2,_Matcher=Match,_Content=mixed-4                         	20000000	        75.9 ns/op
BenchmarkPatterns/Name=match_any_2,_Matcher=Regex,_Content=simple_log-4                    	  100000	     20728 ns/op
BenchmarkPatterns/Name=match_any_2,_Matcher=Match,_Content=simple_log-4                    	30000000	        59.9 ns/op
BenchmarkPatterns/Name=match_any_2,_Matcher=Regex,_Content=simple_log2-4                   	  100000	     20737 ns/op
BenchmarkPatterns/Name=match_any_2,_Matcher=Match,_Content=simple_log2-4                   	30000000	        60.9 ns/op
BenchmarkPatterns/Name=match_any_2,_Matcher=Regex,_Content=simple_log_with_level-4         	  100000	     22211 ns/op
BenchmarkPatterns/Name=match_any_2,_Matcher=Match,_Content=simple_log_with_level-4         	20000000	        60.2 ns/op
BenchmarkPatterns/Name=startsWith_'PATTERN',_Matcher=Regex,_Content=mixed-4                	 1000000	      1944 ns/op
BenchmarkPatterns/Name=startsWith_'PATTERN',_Matcher=Match,_Content=mixed-4                	10000000	       134 ns/op
BenchmarkPatterns/Name=startsWith_'PATTERN',_Matcher=Regex,_Content=simple_log-4           	 1000000	      1390 ns/op
BenchmarkPatterns/Name=startsWith_'PATTERN',_Matcher=Match,_Content=simple_log-4           	10000000	       119 ns/op
BenchmarkPatterns/Name=startsWith_'PATTERN',_Matcher=Regex,_Content=simple_log2-4          	 1000000	      1481 ns/op
BenchmarkPatterns/Name=startsWith_'PATTERN',_Matcher=Match,_Content=simple_log2-4          	10000000	       118 ns/op
BenchmarkPatterns/Name=startsWith_'PATTERN',_Matcher=Regex,_Content=simple_log_with_level-4         	 1000000	      1676 ns/op
BenchmarkPatterns/Name=startsWith_'PATTERN',_Matcher=Match,_Content=simple_log_with_level-4         	10000000	       121 ns/op
BenchmarkPatterns/Name=startsWith_'_',_Matcher=Regex,_Content=mixed-4                               	 1000000	      1926 ns/op
BenchmarkPatterns/Name=startsWith_'_',_Matcher=Match,_Content=mixed-4                               	10000000	       148 ns/op
BenchmarkPatterns/Name=startsWith_'_',_Matcher=Regex,_Content=simple_log-4                          	 1000000	      1461 ns/op
BenchmarkPatterns/Name=startsWith_'_',_Matcher=Match,_Content=simple_log-4                          	10000000	       122 ns/op
BenchmarkPatterns/Name=startsWith_'_',_Matcher=Regex,_Content=simple_log2-4                         	 1000000	      1366 ns/op
BenchmarkPatterns/Name=startsWith_'_',_Matcher=Match,_Content=simple_log2-4                         	10000000	       120 ns/op
BenchmarkPatterns/Name=startsWith_'_',_Matcher=Regex,_Content=simple_log_with_level-4               	 1000000	      1526 ns/op
BenchmarkPatterns/Name=startsWith_'_',_Matcher=Match,_Content=simple_log_with_level-4               	10000000	       124 ns/op
BenchmarkPatterns/Name=startsWithDate,_Matcher=Regex,_Content=mixed-4                               	 1000000	      2248 ns/op
BenchmarkPatterns/Name=startsWithDate,_Matcher=Match,_Content=mixed-4                               	10000000	       148 ns/op
BenchmarkPatterns/Name=startsWithDate,_Matcher=Regex,_Content=simple_log-4                          	  500000	      2729 ns/op
BenchmarkPatterns/Name=startsWithDate,_Matcher=Match,_Content=simple_log-4                          	 5000000	       355 ns/op
BenchmarkPatterns/Name=startsWithDate,_Matcher=Regex,_Content=simple_log2-4                         	 1000000	      1745 ns/op
BenchmarkPatterns/Name=startsWithDate,_Matcher=Match,_Content=simple_log2-4                         	10000000	       168 ns/op
BenchmarkPatterns/Name=startsWithDate,_Matcher=Regex,_Content=simple_log_with_level-4               	 1000000	      1575 ns/op
BenchmarkPatterns/Name=startsWithDate,_Matcher=Match,_Content=simple_log_with_level-4               	20000000	       100 ns/op
BenchmarkPatterns/Name=startsWithDate2,_Matcher=Regex,_Content=mixed-4                              	 1000000	      2021 ns/op
BenchmarkPatterns/Name=startsWithDate2,_Matcher=Match,_Content=mixed-4                              	10000000	       117 ns/op
BenchmarkPatterns/Name=startsWithDate2,_Matcher=Regex,_Content=simple_log-4                         	 1000000	      1879 ns/op
BenchmarkPatterns/Name=startsWithDate2,_Matcher=Match,_Content=simple_log-4                         	20000000	       110 ns/op
BenchmarkPatterns/Name=startsWithDate2,_Matcher=Regex,_Content=simple_log2-4                        	  500000	      2732 ns/op
BenchmarkPatterns/Name=startsWithDate2,_Matcher=Match,_Content=simple_log2-4                        	 5000000	       277 ns/op
BenchmarkPatterns/Name=startsWithDate2,_Matcher=Regex,_Content=simple_log_with_level-4              	 1000000	      1487 ns/op
BenchmarkPatterns/Name=startsWithDate2,_Matcher=Match,_Content=simple_log_with_level-4              	20000000	        99.5 ns/op
BenchmarkPatterns/Name=startsWithDate3,_Matcher=Regex,_Content=mixed-4                              	 1000000	      1980 ns/op
BenchmarkPatterns/Name=startsWithDate3,_Matcher=Match,_Content=mixed-4                              	10000000	       153 ns/op
BenchmarkPatterns/Name=startsWithDate3,_Matcher=Regex,_Content=simple_log-4                         	 1000000	      1500 ns/op
BenchmarkPatterns/Name=startsWithDate3,_Matcher=Match,_Content=simple_log-4                         	10000000	       124 ns/op
BenchmarkPatterns/Name=startsWithDate3,_Matcher=Regex,_Content=simple_log2-4                        	  500000	      2788 ns/op
BenchmarkPatterns/Name=startsWithDate3,_Matcher=Match,_Content=simple_log2-4                        	 5000000	       302 ns/op
BenchmarkPatterns/Name=startsWithDate3,_Matcher=Regex,_Content=simple_log_with_level-4              	 1000000	      1384 ns/op
BenchmarkPatterns/Name=startsWithDate3,_Matcher=Match,_Content=simple_log_with_level-4              	10000000	       134 ns/op
BenchmarkPatterns/Name=startsWithLevel,_Matcher=Regex,_Content=mixed-4                              	  500000	      3465 ns/op
BenchmarkPatterns/Name=startsWithLevel,_Matcher=Match,_Content=mixed-4                              	 3000000	       411 ns/op
BenchmarkPatterns/Name=startsWithLevel,_Matcher=Regex,_Content=simple_log-4                         	  500000	      2704 ns/op
BenchmarkPatterns/Name=startsWithLevel,_Matcher=Match,_Content=simple_log-4                         	 5000000	       368 ns/op
BenchmarkPatterns/Name=startsWithLevel,_Matcher=Regex,_Content=simple_log2-4                        	  500000	      2722 ns/op
BenchmarkPatterns/Name=startsWithLevel,_Matcher=Match,_Content=simple_log2-4                        	 5000000	       360 ns/op
BenchmarkPatterns/Name=startsWithLevel,_Matcher=Regex,_Content=simple_log_with_level-4              	  500000	      2870 ns/op
BenchmarkPatterns/Name=startsWithLevel,_Matcher=Match,_Content=simple_log_with_level-4              	10000000	       221 ns/op
BenchmarkPatterns/Name=contains_'PATTERN',_Matcher=Regex,_Content=mixed-4                           	 1000000	      1660 ns/op
BenchmarkPatterns/Name=contains_'PATTERN',_Matcher=Match,_Content=mixed-4                           	 5000000	       287 ns/op
BenchmarkPatterns/Name=contains_'PATTERN',_Matcher=Regex,_Content=simple_log-4                      	 1000000	      1381 ns/op
BenchmarkPatterns/Name=contains_'PATTERN',_Matcher=Match,_Content=simple_log-4                      	 5000000	       261 ns/op
BenchmarkPatterns/Name=contains_'PATTERN',_Matcher=Regex,_Content=simple_log2-4                     	 1000000	      1362 ns/op
BenchmarkPatterns/Name=contains_'PATTERN',_Matcher=Match,_Content=simple_log2-4                     	 5000000	       259 ns/op
BenchmarkPatterns/Name=contains_'PATTERN',_Matcher=Regex,_Content=simple_log_with_level-4           	 1000000	      1388 ns/op
BenchmarkPatterns/Name=contains_'PATTERN',_Matcher=Match,_Content=simple_log_with_level-4           	 5000000	       260 ns/op
BenchmarkPatterns/Name=contains_'PATTERN'_with_'.*,_Matcher=Regex,_Content=mixed-4                  	   50000	     39312 ns/op
BenchmarkPatterns/Name=contains_'PATTERN'_with_'.*,_Matcher=Match,_Content=mixed-4                  	 5000000	       287 ns/op
BenchmarkPatterns/Name=contains_'PATTERN'_with_'.*,_Matcher=Regex,_Content=simple_log-4             	   30000	     46424 ns/op
BenchmarkPatterns/Name=contains_'PATTERN'_with_'.*,_Matcher=Match,_Content=simple_log-4             	 5000000	       261 ns/op
BenchmarkPatterns/Name=contains_'PATTERN'_with_'.*,_Matcher=Regex,_Content=simple_log2-4            	   30000	     46874 ns/op
BenchmarkPatterns/Name=contains_'PATTERN'_with_'.*,_Matcher=Match,_Content=simple_log2-4            	 5000000	       256 ns/op
BenchmarkPatterns/Name=contains_'PATTERN'_with_'.*,_Matcher=Regex,_Content=simple_log_with_level-4  	   30000	     49857 ns/op
BenchmarkPatterns/Name=contains_'PATTERN'_with_'.*,_Matcher=Match,_Content=simple_log_with_level-4  	 5000000	       255 ns/op
BenchmarkPatterns/Name=empty_line,_Matcher=Regex,_Content=mixed-4                                   	 1000000	      1409 ns/op
BenchmarkPatterns/Name=empty_line,_Matcher=Match,_Content=mixed-4                                   	20000000	        82.0 ns/op
BenchmarkPatterns/Name=empty_line,_Matcher=Regex,_Content=simple_log-4                              	 1000000	      1079 ns/op
BenchmarkPatterns/Name=empty_line,_Matcher=Match,_Content=simple_log-4                              	20000000	        65.3 ns/op
BenchmarkPatterns/Name=empty_line,_Matcher=Regex,_Content=simple_log2-4                             	 1000000	      1080 ns/op
BenchmarkPatterns/Name=empty_line,_Matcher=Match,_Content=simple_log2-4                             	20000000	        65.6 ns/op
BenchmarkPatterns/Name=empty_line,_Matcher=Regex,_Content=simple_log_with_level-4                   	 1000000	      1043 ns/op
BenchmarkPatterns/Name=empty_line,_Matcher=Match,_Content=simple_log_with_level-4                   	20000000	        65.1 ns/op
BenchmarkPatterns/Name=empty_line_with_optional_whitespace,_Matcher=Regex,_Content=mixed-4          	 1000000	      1991 ns/op
BenchmarkPatterns/Name=empty_line_with_optional_whitespace,_Matcher=Match,_Content=mixed-4          	10000000	       200 ns/op
BenchmarkPatterns/Name=empty_line_with_optional_whitespace,_Matcher=Regex,_Content=simple_log-4     	 1000000	      1242 ns/op
BenchmarkPatterns/Name=empty_line_with_optional_whitespace,_Matcher=Match,_Content=simple_log-4     	10000000	       148 ns/op
BenchmarkPatterns/Name=empty_line_with_optional_whitespace,_Matcher=Regex,_Content=simple_log2-4    	 1000000	      1248 ns/op
BenchmarkPatterns/Name=empty_line_with_optional_whitespace,_Matcher=Match,_Content=simple_log2-4    	10000000	       146 ns/op
BenchmarkPatterns/Name=empty_line_with_optional_whitespace,_Matcher=Regex,_Content=simple_log_with_level-4         	 1000000	      1214 ns/op
BenchmarkPatterns/Name=empty_line_with_optional_whitespace,_Matcher=Match,_Content=simple_log_with_level-4         	10000000	       148 ns/op
PASS
ok  	github.com/elastic/beats/libbeat/common/match	158.735s

@urso urso changed the title from "[WIP] Introduce simple string matcher" to "Introduce simple string matcher" Jan 10, 2017
@urso urso added the review label and removed the in progress label Jan 10, 2017

ruflin commented Jan 11, 2017

@urso make fmt is not happy ;-)

@ruflin ruflin left a comment

Left some minor comments.

logContentLevel,
}

var mixedContent = makeContent("mixed", `Lorem ipsum dolor sit amet,

Very nice to see this extensive test suite. I kind of expect that we will hit some cases we haven't thought of, but those should be easy to fix by adding a test for the failing case and tracking down the issue.

{
`^\s*$`,
typeOf((*emptyWhiteStringMatcher)(nil)),
// []string{"", " ", "\t"},

Can this line be removed?

@@ -0,0 +1,204 @@
package match

There are roughly three options here:

  • We fix it automatically for the user, which is what this code is doing
  • We log a warning telling the user that the regexp is not optimal and how it could be optimized
  • A combination of both

I would opt for option 3. But will any user ever look at the warning and optimize the regexp if we say it is not optimal? The main concern with option 1 is that it could introduce errors: the regexp we use internally is no longer the regexp someone has in the config file. So it would be good to at least have a debug message showing the regexp that is actually used.
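
The debug message suggested here could look roughly like the following sketch. The function name `rewriteNotice` and the message wording are made up for illustration and are not part of the PR; how the rewritten pattern is obtained is left out on purpose.

```go
package main

import "fmt"

// rewriteNotice is a hypothetical helper: it returns a debug message
// when the pattern that will actually be executed differs from the one
// found in the config file, and false when nothing was rewritten.
func rewriteNotice(configured, used string) (string, bool) {
	if configured == used {
		return "", false
	}
	return fmt.Sprintf("regexp %q from config is executed as %q", configured, used), true
}

func main() {
	// The rewritten form below is only an example input.
	if msg, changed := rewriteNotice(`.*PATTERN.*`, `PATTERN`); changed {
		fmt.Println("DEBUG:", msg)
	}
}
```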


urso commented Jan 11, 2017

> @urso make fmt is not happy ;-)

Hm... my editor is using goimports on every save. Let me check.


urso commented Jan 11, 2017

Added alternative literals use case (e.g. `(DEBUG|INFO|ERROR|CRITICAL)`):

BenchmarkPatterns/Name=hasLevel,_Matcher=Regex,_Content=mixed-4         	   10000	    105611 ns/op
BenchmarkPatterns/Name=hasLevel,_Matcher=Match,_Content=mixed-4         	 1000000	      1057 ns/op
BenchmarkPatterns/Name=hasLevel,_Matcher=Regex,_Content=simple_log-4    	   10000	    128666 ns/op
BenchmarkPatterns/Name=hasLevel,_Matcher=Match,_Content=simple_log-4    	 2000000	       955 ns/op
BenchmarkPatterns/Name=hasLevel,_Matcher=Regex,_Content=simple_log2-4   	   50000	     30134 ns/op
BenchmarkPatterns/Name=hasLevel,_Matcher=Match,_Content=simple_log2-4   	 3000000	       550 ns/op
BenchmarkPatterns/Name=hasLevel,_Matcher=Regex,_Content=simple_log_with_level-4         	  200000	      7214 ns/op
BenchmarkPatterns/Name=hasLevel,_Matcher=Match,_Content=simple_log_with_level-4         	 3000000	       545 ns/op
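
For the alternative-literals case above, a minimal sketch of the idea: an unanchored alternation of plain literals matches exactly when the input contains at least one of them. The names `containsAny` and `matchLevelRegex` are assumptions for this sketch, not identifiers from the PR.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// levelRegex is the example alternation from the comment above.
var levelRegex = regexp.MustCompile(`(DEBUG|INFO|ERROR|CRITICAL)`)

// matchLevelRegex is the regexp-based reference implementation.
func matchLevelRegex(line string) bool { return levelRegex.MatchString(line) }

// containsAny reports whether line contains at least one of the literals.
func containsAny(line string, literals []string) bool {
	for _, lit := range literals {
		if strings.Contains(line, lit) {
			return true
		}
	}
	return false
}

func main() {
	levels := []string{"DEBUG", "INFO", "ERROR", "CRITICAL"}
	for _, line := range []string{"2017-01-11 INFO started", "2017-01-11 WARN slow", ""} {
		fmt.Printf("%q regex=%v literals=%v\n", line, matchLevelRegex(line), containsAny(line, levels))
	}
}
```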

monicasarbu commented

@urso do you prefer squash and merge or to do rebase yourself?


urso commented Jan 12, 2017

let me rebase first and get a proper commit message.

Provide match.Matcher and match.ExactMatcher using regular expressions for
matching use-case only.

The matchers compile a regular expression into a Matcher, which only provides
the Match functionality. This gives us a chance to optimize/replace some common
cases used for matching:
- replace capture-groups by non-capturing groups
- remove leading/trailing `.*` expressions (Match already searches for
  sub-string matching the regex)
- replace simple literal searches with `==`, `strings.Contains` and
  `strings.HasPrefix`
- replace regex for alternative literals (e.g. `DEBUG|INFO|ERROR`) with
  `strings.Contains` over a set of literals
- optimize empty-line checks

If the input regular expression cannot be mapped to one of these simple cases,
regexp.Regexp will be used.

The `ExactMatcher` will embed `<regex>` into `^<regex>$` by default.
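
A sketch of the wrapping step, under the assumption that the pattern is placed in a non-capturing group before anchoring. The group is an addition made for this example (without it, an alternation such as `a|b` would become `^a|b$`, which anchors each branch on one side only); whether the PR does the same is not stated here. `exactPattern` and `exactMatch` are hypothetical names.

```go
package main

import (
	"fmt"
	"regexp"
)

// exactPattern wraps pattern so that it has to match the whole input.
// The non-capturing group keeps alternations inside the anchors.
func exactPattern(pattern string) string {
	return "^(?:" + pattern + ")$"
}

// exactMatch compiles the wrapped pattern and matches s against it.
func exactMatch(pattern, s string) (bool, error) {
	re, err := regexp.Compile(exactPattern(pattern))
	if err != nil {
		return false, err
	}
	return re.MatchString(s), nil
}

func main() {
	for _, s := range []string{"a", "b", "ab"} {
		ok, err := exactMatch("a|b", s)
		fmt.Printf("%q exact=%v err=%v\n", s, ok, err)
	}
}
```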

Note: Matcher currently does not split simple cases, e.g. `abc.*def` or
`abc.def` will still fall back to regexp.Regexp.
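
The approach described in the commit message above can be sketched with the standard `regexp/syntax` package. This is a simplified illustration under stated assumptions, not the code merged in this PR: the names `Matcher`, `compile` and the returned kind strings are hypothetical, and only the "contains" and "prefix" cases are handled before falling back to `regexp.Regexp`.

```go
package main

import (
	"fmt"
	"regexp"
	"regexp/syntax"
	"strings"
)

// Matcher is a hypothetical stand-in: it only answers "does s match?".
type Matcher func(s string) bool

// stripCapture unwraps capture groups, which are irrelevant for matching.
func stripCapture(r *syntax.Regexp) *syntax.Regexp {
	for r.Op == syntax.OpCapture {
		r = r.Sub[0]
	}
	return r
}

// isDotStar reports whether r is `.*`, optionally inside a capture group.
func isDotStar(r *syntax.Regexp) bool {
	r = stripCapture(r)
	if r.Op != syntax.OpStar {
		return false
	}
	op := r.Sub[0].Op
	return op == syntax.OpAnyChar || op == syntax.OpAnyCharNotNL
}

// literal returns the text of a case-sensitive literal node.
func literal(r *syntax.Regexp) (string, bool) {
	r = stripCapture(r)
	if r.Op != syntax.OpLiteral || r.Flags&syntax.FoldCase != 0 {
		return "", false
	}
	return string(r.Rune), true
}

// compile returns a Matcher plus a label naming the strategy chosen.
func compile(pattern string) (Matcher, string, error) {
	// The compiled regexp validates the pattern and serves as fallback.
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, "", err
	}
	tree, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil {
		return nil, "", err
	}
	tree = stripCapture(tree)
	parts := []*syntax.Regexp{tree}
	if tree.Op == syntax.OpConcat {
		parts = tree.Sub
	}
	// Leading/trailing `.*` do not change the result of an unanchored match.
	for len(parts) > 0 && isDotStar(parts[0]) {
		parts = parts[1:]
	}
	for len(parts) > 0 && isDotStar(parts[len(parts)-1]) {
		parts = parts[:len(parts)-1]
	}
	if len(parts) == 0 { // the pattern consisted of `.*` only
		return func(string) bool { return true }, "always", nil
	}
	if len(parts) == 1 {
		if lit, ok := literal(parts[0]); ok {
			return func(s string) bool { return strings.Contains(s, lit) }, "contains", nil
		}
	}
	if len(parts) == 2 && parts[0].Op == syntax.OpBeginText {
		if lit, ok := literal(parts[1]); ok {
			return func(s string) bool { return strings.HasPrefix(s, lit) }, "prefix", nil
		}
	}
	return re.MatchString, "regexp", nil
}

func main() {
	for _, p := range []string{`.*PATTERN.*`, `^PATTERN`, `(PATTERN)`, `abc.*def`} {
		_, kind, err := compile(p)
		fmt.Printf("%-14s -> %s (err=%v)\n", p, kind, err)
	}
}
```

Working on the parsed syntax tree rather than on the pattern string avoids mistakes with escaped characters such as `\.*`.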
@urso urso force-pushed the enh/string-matcher branch from f7f6b7c to c683077 Compare January 12, 2017 10:41
@andrewkroh andrewkroh merged commit 0a8ca7d into elastic:master Jan 20, 2017
@urso urso deleted the enh/string-matcher branch February 19, 2019 18:33
Labels: discuss, review
4 participants