
Use DFA algorithm for sequenceMatch without (?t). #4004

Merged
merged 3 commits into from
Jan 16, 2019

Conversation

LeoErcolanelli
Contributor

@LeoErcolanelli commented Jan 8, 2019

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

For the changelog. Remove if this is a non-significant change.

Category (leave one):

  • Improvement

Short description (up to a few sentences):

Add an optional parameter to sequenceMatch and sequenceCount in order to temporarily lift the hard-coded limit on the number of iterations.

Detailed description (optional):

As of today, the constant sequence_match_max_iterations sets an arbitrary limit on the number of iterations supported by sequenceMatch and sequenceCount. In certain cases, making this limit configurable may be desirable (see the currently open issue).

In order to do so, a new parameter has been added to those two functions. The fact that the parameter is optional and defaults to the previous hard-coded value makes this change backward compatible.

I think going for a query setting would have been cleaner, but I've found no way to access the query settings from within the implementation of a function. If such a way exists, I'll gladly modify the PR to interact with the settings rather than use an optional parameter.

Let me know if anything needs to be changed!

@filimonov
Contributor

filimonov commented Jan 8, 2019

Maybe it would be better just to add that to the user-level settings instead of passing it back and forth with every function call?

Also: letting users increase that without any limit can be quite dangerous; for example, I can increase it to 9223372036854775807, and even if the CPU is able to do 10^12 iterations/sec, that is about 9.2·10^6 seconds, i.e. more than 100 days of iterating for one particular group.

So generally, if you hit that limit, most probably something is not going well: your groups are too big or your conditions are too complicated, and maybe you need to try a different approach instead of forcing max_iterations for sequenceMatch.

Mariya Mansurova mentioned some approaches in this presentation: https://youtu.be/YpurT78U2qA?t=433

Also, there were some discussions there: #2096

And there is also a function called windowFunnel: #2352

@LeoErcolanelli
Contributor Author

@filimonov Thanks a lot for the quick answer!

I will have a look at all the resources you linked.
In the meantime, regarding the use of a user-level setting: how do you go about passing the value of the setting down to the implementation of the function? It looks like none of the functions do that today.

@filimonov
Contributor

filimonov commented Jan 9, 2019

Yep, looks like the context is not passed to aggregate functions (while being available in regular functions). Maybe it was done like that intentionally? @alexey-milovidov ?

@LeoErcolanelli
Contributor Author

@filimonov I had a go at passing the context down to the aggregate functions; however, I ran into one issue where your greater knowledge of the code base will be of help.

Changing the definition of AggregateFunctionCreator in order to accept an additional Context argument is pretty straightforward. The issue starts right after:

  • DataTypeAggregateFunction's creator function uses AggregateFunctionFactory::get() in order to return the DataType.

  • Because AggregateFunctionFactory::get() now takes a Context, DataTypeFactory also has to, so that DataTypeAggregateFunction has access to the context.

  • This bubbles up everywhere, as anything using DataTypeFactory now has to have access to a Context.

I started passing a Context down wherever it's needed, but the diff will be quite big, and I wonder if there isn't a cleverer way of doing it.

What I observed is that DataTypeFunction doesn't actually register itself with the DataTypeFactory; argument type checking is done mostly in FunctionArrayMapped.h and ActionsVisitor.cpp. This would effectively break the dependency between AggregateFunctionFactory and DataTypeFactory. Do you think this is feasible?
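
To make the bubbling concrete, here is a toy model of the dependency chain (schematic names modeled on the real classes; these are not the actual ClickHouse signatures): once AggregateFunctionFactory::get() requires a Context, DataTypeFactory::get() must take one too, and so must every caller that creates data types by name.

#include <memory>
#include <string>

struct Context {};                 // stand-in for DB::Context
struct IAggregateFunction {};
struct IDataType {};
using AggregateFunctionPtr = std::shared_ptr<IAggregateFunction>;
using DataTypePtr = std::shared_ptr<IDataType>;

struct AggregateFunctionFactory
{
    // After the change: resolving an aggregate function needs a Context.
    AggregateFunctionPtr get(const std::string & name, const Context & context) const
    {
        (void)name; (void)context;
        return std::make_shared<IAggregateFunction>();
    }
};

struct DataTypeFactory
{
    AggregateFunctionFactory agg_factory;

    // DataTypeAggregateFunction's creator calls AggregateFunctionFactory::get(),
    // so DataTypeFactory::get() now needs a Context as well, and the requirement
    // spreads to every call site that uses DataTypeFactory.
    DataTypePtr get(const std::string & type_name, const Context & context) const
    {
        if (type_name.rfind("AggregateFunction", 0) == 0)
            agg_factory.get(type_name, context);
        return std::make_shared<IDataType>();
    }
};

int main()
{
    Context context;
    DataTypeFactory factory;
    factory.get("AggregateFunction(sequenceMatch, DateTime, UInt8)", context);
}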

@alexey-milovidov
Member

alexey-milovidov commented Jan 14, 2019

Yep, looks like context is not passed to Aggregate functions (while being available in regular functions). Maybe it was done like that by intent? @alexey-milovidov ?

No, it just was not needed yet.

@alexey-milovidov
Member

@ercolanelli-leo

DataTypeAggregateFunction is a full-featured data type that can be stored in tables and manipulated in various ways. In contrast, DataTypeFunction is used only to represent a temporary value for a lambda expression.

You see that adding a dependency on Context for AggregateFunction is difficult... so, better not to do it.

@alexey-milovidov
Member

This PR looks almost OK, but a few considerations exist:

  1. We also need to set a hard limit on the number of iterations to avoid (almost) infinite loops that cannot even be cancelled.

  2. The regexp matching algorithm in sequenceMatch is suboptimal. The number of iterations usually grows exponentially with regexp complexity (the number of components separated by .*). With exponential growth there is not much difference between the default 1000000 and, for example, 1000000000. But one billion iterations per single aggregation state is definitely too much. As a consequence, this parameter will be useful only in some marginal cases.

But it is possible to implement a more efficient algorithm for the subset of regexps that are used in most cases (like (?1).*(?2).*(?3)...). The current implementation of sequenceMatch is too generic.

It is also possible to make sequenceMatch more efficient even in the generic case (switch from backtracking to an FSM).
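
For illustration, a minimal sketch of the FSM idea for the common pattern shape (?1).*(?2).*(?3) (the Event and Condition types here are hypothetical, not the code from this PR). For this shape, greedily advancing to the next pattern position at the earliest matching event is always safe, so one pass over the time-ordered events decides the match with no backtracking:

#include <cstdint>
#include <functional>
#include <vector>

struct Event { uint64_t time; char type; };
using Condition = std::function<bool(const Event &)>;

// One linear pass: `state` is the index of the next condition to satisfy.
bool dfaMatch(const std::vector<Event> & events, const std::vector<Condition> & conditions)
{
    size_t state = 0;
    for (const auto & event : events)
    {
        if (state == conditions.size())
            return true;
        if (conditions[state](event))
            ++state; // taking the earliest match only leaves more room for the rest
    }
    return state == conditions.size();
}

int main()
{
    // Pattern (?1).*(?2).*(?3) with conditions A, B, A over events A@0, B@1, C@2.
    std::vector<Event> events = {{0, 'A'}, {1, 'B'}, {2, 'C'}};
    std::vector<Condition> conditions = {
        [](const Event & e) { return e.type == 'A'; },
        [](const Event & e) { return e.type == 'B'; },
        [](const Event & e) { return e.type == 'A'; },
    };
    return dfaMatch(events, conditions) ? 0 : 1; // no second 'A': correctly no match
}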

@LeoErcolanelli
Contributor Author

@alexey-milovidov Thanks a lot for the detailed reply!

Regarding the upper bound, I understand from your message that 1 billion would be suitable. I'll add a check for this upper bound to my PR.

Regarding the algorithm itself: the suboptimality of the backtracking is also something I had thought about, which is why I tried to implement a DFA approach.
Even though the supported patterns are really simple, the (?t) construct makes it non-trivial to implement the algorithm with an FSM approach. There may be a way to support this construct while keeping linear complexity, but I didn't manage to find it; if you do, I'll be happy to work on the implementation!
What I can do, though, is implement the DFA approach when no (?t) constructs are present in the pattern. Would that be something that interests you?

@filimonov
Contributor

I still have doubts whether such a change brings anything. If we allow parametrizing max_iterations in a range (1 million...1 billion), some users will still have the impression that this limit is synthetic (and that's true), and they will be sure that 'for their one particular case' it should be extended to N+1 billion, and that they 'are ok to wait'.

Maybe it's better to consider an option to make it configurable in the server config (it's OK to expect that all servers in a cluster will have the same settings).

I agree that the best option would be to improve the algorithm.

@LeoErcolanelli
Contributor Author

If we want to pass the Context down to the aggregate functions, I actually have a commit ready. It's a rather big change and I am not sure whether I went with the best solution. If you wish, I can push the commit to this branch for you to see.

@filimonov
Contributor

filimonov commented Jan 15, 2019

If we want to pass the Context down to the aggregating function I actually have a commit ready.

No. As @alexey-milovidov said, it's not the best idea.
But besides the user-level config, there is also the server config, config.xml (from Poco's config()).

But I think it's worth trying to improve the performance first; maybe max_iterations will not be needed at all.

Even though the supported patterns are really simple, the (?t) make it non-trivial for the algorithm to be implemented with a FSM approach. There is maybe a way to support this construct while keeping a linear complexity but I didn't get to find it; if you do I'll be happy to work on the implementation!

I didn't dig into that, but it looks like you can preprocess the sequence of events with a simple DFA just to collect / filter out only the needed events, ignoring the time conditions at the beginning, and each time the last condition is satisfied, check the time conditions (if any) back inside the filtered sequence (with backtracking).

So your DFA can collect something like: all events matching the first condition and their times, all events matching the second condition that appear AFTER the first match of the first condition and their times, etc. When you get to the last condition, you need to check the times inside that filtered sequence.

That should just work in linear time for sequences without time conditions; for sequences with time conditions it will still have an exponential component (you will need to backtrack), but exponential in the number of submatches, not in the whole sequence length, which is much better.

Maybe if the matches are collected in a good structure it would be much better than exponential: for example, collect the timestamps per condition in an ordered list/array. In that case, when you come to the last condition, you just need to filter, within each pair, the timestamps which satisfy the conditions, and check whether the sequence is still possible with the filtered matches.
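
A rough sketch of this prefilter-then-backtrack idea (hypothetical Event, Condition and TimeConstraint types; this is not code from the PR). A linear pass collects, per pattern position, the timestamps of matching events, ignoring events that occur before the previous condition has matched at least once; the backtracking over (?t) constraints then runs on these much smaller lists:

#include <cstdint>
#include <functional>
#include <vector>

struct Event { uint64_t time; char type; };
using Condition = std::function<bool(const Event &)>;
// Hypothetical check for the (?t) constraint between positions pos-1 and pos.
using TimeConstraint = std::function<bool(size_t pos, uint64_t prev, uint64_t cur)>;

std::vector<std::vector<uint64_t>> collectCandidates(
    const std::vector<Event> & events, const std::vector<Condition> & conditions)
{
    std::vector<std::vector<uint64_t>> candidates(conditions.size());
    size_t open = conditions.empty() ? 0 : 1; // positions worth tracking so far
    for (const auto & event : events)
    {
        for (size_t i = 0; i < open; ++i)
            if (conditions[i](event))
                candidates[i].push_back(event.time);
        // Start tracking the next position once the current last one has matched.
        if (open < conditions.size() && !candidates[open - 1].empty())
            ++open;
    }
    return candidates;
}

// Backtracking over the filtered timestamps only: the exponential part now
// depends on the number of collected matches per position, not on the length
// of the whole event sequence.
bool matchFrom(const std::vector<std::vector<uint64_t>> & candidates,
               const TimeConstraint & time_ok, size_t pos, uint64_t prev)
{
    if (pos == candidates.size())
        return true;
    for (uint64_t t : candidates[pos])
        if ((pos == 0 || (t > prev && time_ok(pos, prev, t)))
            && matchFrom(candidates, time_ok, pos + 1, t))
            return true;
    return false;
}

int main()
{
    std::vector<Event> events = {{0, 'A'}, {5, 'B'}, {100, 'B'}};
    std::vector<Condition> conditions = {
        [](const Event & e) { return e.type == 'A'; },
        [](const Event & e) { return e.type == 'B'; },
    };
    // Example (?t)-style constraint: each step must happen within 10 time units.
    TimeConstraint within_10 = [](size_t, uint64_t prev, uint64_t cur) { return cur - prev <= 10; };
    return matchFrom(collectCandidates(events, conditions), within_10, 0, 0) ? 0 : 1;
}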

@LeoErcolanelli changed the title add optional max_iterations parameter to sequence(Match|Count) to Use DFA algorithm for sequenceMatch without (?t). Jan 16, 2019
@LeoErcolanelli
Contributor Author

@filimonov Thanks a lot for the insights; filtering the events before feeding them to the backtracking algorithm looks like a great idea!

For the moment, I have implemented the DFA-based algorithm for the case where there is no (?t) in the pattern. I ran a (very) naive test to get a feel for the improvement (you can find the results below), and the results are what is to be expected:

  • In a scenario without backtracking, performance is within the same range (the DFA slightly faster)
  • With backtracking, the DFA outperforms the old algorithm by at least 30x (the backtracking version doesn't terminate; it hits the iteration limit)

The table:

CREATE TABLE test.sequence
(
    userID UInt64,
    eventType Enum8('A' = 1, 'B' = 2, 'C' = 3),
    EventTime UInt64
)
ENGINE = Memory

The data:

1,'A',0
1,'B',1
...
1,'B',10000
1,'C',10001

On master

Query 1:

SELECT userID
FROM test.sequence
GROUP BY userID
HAVING sequenceMatch('(?1).*(?2).*(?3)')(toDateTime(EventTime), eventType = 'A', eventType = 'B', eventType = 'C')

┌─userID─┐
│      1 │
└────────┘

1 rows in set. Elapsed: 0.040 sec. Processed 10.00 thousand rows, 170.03 KB (252.48 thousand rows/s., 4.29 MB/s.)

Query 2:

SELECT userID
FROM test.sequence
GROUP BY userID
HAVING sequenceMatch('(?1).*(?2).*(?3)')(toDateTime(EventTime), eventType = 'A', eventType = 'B', eventType = 'A')

↘ Progress: 10.00 thousand rows, 170.03 KB (84.45 thousand rows/s., 1.44 MB/s.) Received exception from server (version 19.1.0):
Code: 160. DB::Exception: Received from localhost:9000, ::1. DB::Exception: Pattern application proves too difficult, exceeding max iterations (1000000).

0 rows in set. Elapsed: 0.966 sec. Processed 10.00 thousand rows, 170.03 KB (10.36 thousand rows/s., 176.05 KB/s.)

On this branch

Query 1:

SELECT userID
FROM test.sequence
GROUP BY userID
HAVING sequenceMatch('(?1).*(?2).*(?3)')(toDateTime(EventTime), eventType = 'A', eventType = 'B', eventType = 'C')

┌─userID─┐
│      1 │
└────────┘

1 rows in set. Elapsed: 0.027 sec. Processed 10.00 thousand rows, 170.03 KB (370.44 thousand rows/s., 6.30 MB/s.)

Query 2:

SELECT userID
FROM test.sequence
GROUP BY userID
HAVING sequenceMatch('(?1).*(?2).*(?3)')(toDateTime(EventTime), eventType = 'A', eventType = 'B', eventType = 'A')

Ok.

0 rows in set. Elapsed: 0.032 sec. Processed 10.00 thousand rows, 170.03 KB (311.08 thousand rows/s., 5.29 MB/s.)

@LeoErcolanelli
Contributor Author

Should I go ahead and implement the approach where the DFA filters the events before feeding them to the backtracking algorithm (for patterns containing (?t)), or should this PR be merged first so we keep the set of changes small?

@alexey-milovidov
Member

Let's merge this PR first.

@alexey-milovidov
Member

Can you also use dfa_match if possible for sequenceCount?

@alexey-milovidov
Member

PS. If the number of states is small (for example, 16), we can use an extremely efficient implementation based on table lookup (with an entry for every combination of active states).
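
For illustration, a toy sketch of that table-lookup idea (hypothetical code, not part of this PR): with at most 16 pattern states, the set of simultaneously active states fits in a uint16_t, the transition table is built once per pattern with one entry for every combination of active states, and matching then costs a single lookup per event.

#include <array>
#include <cstdint>
#include <vector>

int main()
{
    // 4 states for (?1).*(?2).*(?3): start, after (?1), after (?2), accept.
    constexpr size_t num_states = 4;
    // An event's "class" is the bitmask of which of the 3 conditions it satisfies.
    constexpr size_t num_classes = 1 << (num_states - 1);

    // table[mask][cls] = set of states active after an event of class `cls`
    // while the states in `mask` are active. Built once per pattern.
    std::vector<std::array<uint16_t, num_classes>> table(1 << num_states);
    for (uint32_t mask = 0; mask < (1u << num_states); ++mask)
        for (uint32_t cls = 0; cls < num_classes; ++cls)
        {
            uint16_t next = static_cast<uint16_t>(mask); // '.*' lets every state persist
            for (size_t s = 0; s + 1 < num_states; ++s)
                if ((mask & (1u << s)) && (cls & (1u << s)))
                    next |= static_cast<uint16_t>(1u << (s + 1)); // condition s+1 advances state s
            table[mask][cls] = next;
        }

    // One lookup per event: here the events satisfy conditions 1, 2, 3 in order.
    const uint32_t event_classes[] = {0b001, 0b010, 0b100};
    uint16_t active = 1; // only the start state is active initially
    for (uint32_t cls : event_classes)
        active = table[active][cls];
    return (active & (1u << (num_states - 1))) ? 0 : 1; // accept state reached -> match
}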
