
Use DFA algorithm for sequenceMatch without (?t). #4004

Merged
merged 3 commits into from
Jan 16, 2019

Conversation

LeoErcolanelli
Contributor

@LeoErcolanelli commented Jan 8, 2019

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

For the changelog. Remove if this is a non-significant change.

Category (leave one):

  • Improvement

Short description (up to a few sentences):

Add an optional parameter to sequenceMatch and sequenceCount in order to temporarily lift the hard-coded limit on the number of iterations.

Detailed description (optional):

As of today, the constant sequence_match_max_iterations sets an arbitrary limit on the number of iterations supported by sequenceMatch and sequenceCount. In certain cases, making this limit configurable may be desirable (see the currently open issue).

In order to do so, a new parameter has been added to those two functions. The fact that the parameter is optional and defaults to the previous hard-coded value makes this change backward compatible.

I think going for a query setting would have been cleaner, but I've found no way to access the query settings from within the implementation of a function. If such a way exists, I'll gladly modify the PR to interact with the settings rather than use an optional parameter.

Let me know if anything needs to be changed!

@filimonov
Contributor

filimonov commented Jan 8, 2019

Maybe it would be better just to add that to the user-level settings instead of passing it back and forth with every function call?

Also: letting users increase that without any limit can be quite dangerous; for example, I can increase it to 9223372036854775807, and even if the CPU is able to do 10^12 iterations/sec, that is about 9.2·10^6 seconds, i.e. more than 100 days of iterating for one particular group.

So generally, if you hit that limit, most probably something is not going well: your groups are too big or your conditions are too complicated, and maybe you need to try a different approach instead of forcing max_iterations for sequenceMatch.

Mariya Mansurova mentioned some approaches in this presentation: https://youtu.be/YpurT78U2qA?t=433

Also, there were some discussions there: #2096

And there is also a function called windowFunnel: #2352

@LeoErcolanelli
Contributor Author

@filimonov Thanks a lot for the quick answer!

I will have a look at all the resources you linked.
In the meantime, regarding the use of a user-level setting: how do you go about passing the value of the setting down to the implementation of the function? It looks like none of the functions do that today.

@filimonov
Contributor

filimonov commented Jan 9, 2019

Yep, looks like the context is not passed to aggregate functions (while being available in regular functions). Maybe it was done like that intentionally? @alexey-milovidov ?

@LeoErcolanelli
Contributor Author

@filimonov I had a go at passing the context down to the aggregate functions; however, I ran into one issue where your greater knowledge of the code base will be of help.

Changing the definition of AggregateFunctionCreator in order to accept an additional Context argument is pretty straightforward. The issue starts right after:

  • DataTypeAggregateFunction's creator function uses AggregateFunctionFactory::get() in order to return the DataType.

  • Because AggregateFunctionFactory::get() now takes a Context, DataTypeFactory also has to, so that DataTypeAggregateFunction has access to the context.

  • This bubbles up everywhere, as anything using DataTypeFactory now has to have access to a Context.

I started passing a Context down wherever it's needed, but the diff will be quite big, and I wonder if there isn't a cleverer way of doing it.

What I observed is that DataTypeFunction doesn't actually register itself with the DataTypeFactory; argument type checking is done mostly in FunctionArrayMapped.h and ActionsVisitor.cpp. This would effectively break the dependency between AggregateFunctionFactory and DataTypeFactory. Do you think this is feasible?
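
To make the bubbling concrete, here is a toy model of the dependency chain (schematic names modeled on the real classes; these are not the actual ClickHouse signatures): once AggregateFunctionFactory::get() requires a Context, DataTypeFactory::get() must take one too, and so must every caller that creates data types by name.

#include <memory>
#include <string>

struct Context {};                 // stand-in for DB::Context
struct IAggregateFunction {};
struct IDataType {};
using AggregateFunctionPtr = std::shared_ptr<IAggregateFunction>;
using DataTypePtr = std::shared_ptr<IDataType>;

struct AggregateFunctionFactory
{
    // After the change: resolving an aggregate function needs a Context.
    AggregateFunctionPtr get(const std::string & name, const Context & context) const
    {
        (void)name; (void)context;
        return std::make_shared<IAggregateFunction>();
    }
};

struct DataTypeFactory
{
    AggregateFunctionFactory agg_factory;

    // DataTypeAggregateFunction's creator calls AggregateFunctionFactory::get(),
    // so DataTypeFactory::get() now needs a Context as well, and the requirement
    // spreads to every call site that uses DataTypeFactory.
    DataTypePtr get(const std::string & type_name, const Context & context) const
    {
        if (type_name.rfind("AggregateFunction", 0) == 0)
            agg_factory.get(type_name, context);
        return std::make_shared<IDataType>();
    }
};

int main()
{
    Context context;
    DataTypeFactory factory;
    factory.get("AggregateFunction(sequenceMatch, DateTime, UInt8)", context);
}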

@alexey-milovidov
Member

alexey-milovidov commented Jan 14, 2019

Yep, looks like context is not passed to Aggregate functions (while being available in regular functions). Maybe it was done like that by intent? @alexey-milovidov ?

No, it just was not needed yet.

@alexey-milovidov
Member

@ercolanelli-leo

DataTypeAggregateFunction is a full-featured data type that can be stored in tables and manipulated in various ways. In contrast, DataTypeFunction is used only to represent a temporary value for a lambda expression.

You see that adding a dependency on Context for AggregateFunction is difficult... so, better not to do it.

@alexey-milovidov
Member

This PR looks almost OK, but a few considerations exist:

  1. We also need to set a hard limit on the number of iterations to avoid (almost) infinite loops that cannot even be cancelled.

  2. The regexp matching algorithm in sequenceMatch is suboptimal. The number of iterations usually grows exponentially with regexp complexity (the number of components separated by .*). With exponential growth there is not much difference between the default 1000000 and, for example, 1000000000. But one billion iterations per single aggregation state is definitely too much. As a consequence, this parameter will be useful only in some marginal cases.

But it is possible to implement a more efficient algorithm for the subset of regexps that are used in most cases (like (?1).*(?2).*(?3)...). The current implementation of sequenceMatch is too generic.

It is also possible to make sequenceMatch more efficient even in the generic case (switch from backtracking to an FSM).
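
For illustration, a minimal sketch of the FSM idea for the common pattern shape (?1).*(?2).*(?3) (the Event and Condition types here are hypothetical, not the code from this PR). For this shape, greedily advancing to the next pattern position at the earliest matching event is always safe, so one pass over the time-ordered events decides the match with no backtracking:

#include <cstdint>
#include <functional>
#include <vector>

struct Event { uint64_t time; char type; };
using Condition = std::function<bool(const Event &)>;

// One linear pass: `state` is the index of the next condition to satisfy.
bool dfaMatch(const std::vector<Event> & events, const std::vector<Condition> & conditions)
{
    size_t state = 0;
    for (const auto & event : events)
    {
        if (state == conditions.size())
            return true;
        if (conditions[state](event))
            ++state; // taking the earliest match only leaves more room for the rest
    }
    return state == conditions.size();
}

int main()
{
    // Pattern (?1).*(?2).*(?3) with conditions A, B, A over events A@0, B@1, C@2.
    std::vector<Event> events = {{0, 'A'}, {1, 'B'}, {2, 'C'}};
    std::vector<Condition> conditions = {
        [](const Event & e) { return e.type == 'A'; },
        [](const Event & e) { return e.type == 'B'; },
        [](const Event & e) { return e.type == 'A'; },
    };
    return dfaMatch(events, conditions) ? 0 : 1; // no second 'A': correctly no match
}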

@LeoErcolanelli
Contributor Author

@alexey-milovidov Thanks a lot for the detailed reply!

Regarding the upper bound, I understand from your message that 1 billion would be suitable. I'll add a check for this upper bound to my PR.

Regarding the algorithm itself: the suboptimality of the backtracking is also something I had thought about, which is why I tried to implement a DFA approach.
Even though the supported patterns are really simple, the (?t) construct makes it non-trivial to implement the algorithm with an FSM approach. There may be a way to support this construct while keeping linear complexity, but I didn't manage to find it; if you do, I'll be happy to work on the implementation!
What I can do, though, is implement the DFA approach when no (?t) constructs are present in the pattern. Would that be something that interests you?

@filimonov
Contributor

I still have doubts whether such a change brings anything. If we allow parametrizing max_iterations in a range (1 million...1 billion), some users will still have the impression that this limit is synthetic (and that's true), and they will be sure that 'for their one particular case' it should be extended to N+1 billion, and that they 'are ok to wait'.

Maybe it's better to consider an option to make it configurable in the server config (it's OK to expect that all servers in a cluster will have the same settings).

I agree that the best option would be to improve the algorithm.

@LeoErcolanelli
Contributor Author

If we want to pass the Context down to the aggregate functions, I actually have a commit ready. It's a rather big change and I am not sure whether I went with the best solution. If you wish, I can push the commit to this branch for you to see.

@filimonov
Contributor

filimonov commented Jan 15, 2019

If we want to pass the Context down to the aggregating function I actually have a commit ready.

No. As @alexey-milovidov said, it's not the best idea.
But besides the user-level config, there is also the server config, config.xml (from Poco's config()).

But I think it's worth trying to improve the performance first; maybe max_iterations will not be needed at all.

Even though the supported patterns are really simple, the (?t) make it non-trivial for the algorithm to be implemented with a FSM approach. There is maybe a way to support this construct while keeping a linear complexity but I didn't get to find it; if you do I'll be happy to work on the implementation!

I didn't dig into that, but it looks like you can preprocess the sequence of events with a simple DFA just to collect / filter out only the needed events, ignoring the time conditions at the beginning, and each time the last condition is satisfied, check the time conditions (if any) back inside the filtered sequence (with backtracking).

So your DFA can collect something like: all events matching the first condition and their times, all events matching the second condition that appear AFTER the first match of the first condition and their times, etc. When you get to the last condition, you need to check the times inside that filtered sequence.

That should just work in linear time for sequences without time conditions; for sequences with time conditions it will still have an exponential component (you will need to backtrack), but exponential in the number of submatches, not in the whole sequence length, which is much better.

Maybe if the matches are collected in a good structure it would be much better than exponential: for example, collect the timestamps per condition in an ordered list/array. In that case, when you come to the last condition, you just need to filter, within each pair, the timestamps which satisfy the conditions, and check whether the sequence is still possible with the filtered matches.
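
A rough sketch of this prefilter-then-backtrack idea (hypothetical Event, Condition and TimeConstraint types; this is not code from the PR). A linear pass collects, per pattern position, the timestamps of matching events, ignoring events that occur before the previous condition has matched at least once; the backtracking over (?t) constraints then runs on these much smaller lists:

#include <cstdint>
#include <functional>
#include <vector>

struct Event { uint64_t time; char type; };
using Condition = std::function<bool(const Event &)>;
// Hypothetical check for the (?t) constraint between positions pos-1 and pos.
using TimeConstraint = std::function<bool(size_t pos, uint64_t prev, uint64_t cur)>;

std::vector<std::vector<uint64_t>> collectCandidates(
    const std::vector<Event> & events, const std::vector<Condition> & conditions)
{
    std::vector<std::vector<uint64_t>> candidates(conditions.size());
    size_t open = conditions.empty() ? 0 : 1; // positions worth tracking so far
    for (const auto & event : events)
    {
        for (size_t i = 0; i < open; ++i)
            if (conditions[i](event))
                candidates[i].push_back(event.time);
        // Start tracking the next position once the current last one has matched.
        if (open < conditions.size() && !candidates[open - 1].empty())
            ++open;
    }
    return candidates;
}

// Backtracking over the filtered timestamps only: the exponential part now
// depends on the number of collected matches per position, not on the length
// of the whole event sequence.
bool matchFrom(const std::vector<std::vector<uint64_t>> & candidates,
               const TimeConstraint & time_ok, size_t pos, uint64_t prev)
{
    if (pos == candidates.size())
        return true;
    for (uint64_t t : candidates[pos])
        if ((pos == 0 || (t > prev && time_ok(pos, prev, t)))
            && matchFrom(candidates, time_ok, pos + 1, t))
            return true;
    return false;
}

int main()
{
    std::vector<Event> events = {{0, 'A'}, {5, 'B'}, {100, 'B'}};
    std::vector<Condition> conditions = {
        [](const Event & e) { return e.type == 'A'; },
        [](const Event & e) { return e.type == 'B'; },
    };
    // Example (?t)-style constraint: each step must happen within 10 time units.
    TimeConstraint within_10 = [](size_t, uint64_t prev, uint64_t cur) { return cur - prev <= 10; };
    return matchFrom(collectCandidates(events, conditions), within_10, 0, 0) ? 0 : 1;
}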

@LeoErcolanelli changed the title add optional max_iterations parameter to sequence(Match|Count) to Use DFA algorithm for sequenceMatch without (?t). Jan 16, 2019
@LeoErcolanelli
Contributor Author

@filimonov Thanks a lot for the insights; filtering the events before feeding them to the backtracking algorithm looks like a great idea!

For the moment, I have implemented the DFA-based algorithm for the case where there is no (?t) in the pattern. I ran a (very) naive test to get a feel for the improvement (you can find the results below), and the results are what is to be expected:

  • In a scenario without backtracking, performance is within the same range (the DFA slightly faster)
  • With backtracking, the DFA outperforms the old algorithm by at least 30x (the backtracking version doesn't terminate; it hits the iteration limit)

The table:

CREATE TABLE test.sequence
(
    userID UInt64,
    eventType Enum8('A' = 1, 'B' = 2, 'C' = 3),
    EventTime UInt64
)
ENGINE = Memory

The data:

1,'A',0
1,'B',1
...
1,'B',10000
1,'C',10001

On master

Query 1:

SELECT userID
FROM test.sequence
GROUP BY userID
HAVING sequenceMatch('(?1).*(?2).*(?3)')(toDateTime(EventTime), eventType = 'A', eventType = 'B', eventType = 'C')

┌─userID─┐
│      1 │
└────────┘

1 rows in set. Elapsed: 0.040 sec. Processed 10.00 thousand rows, 170.03 KB (252.48 thousand rows/s., 4.29 MB/s.)

Query 2:

SELECT userID
FROM test.sequence
GROUP BY userID
HAVING sequenceMatch('(?1).*(?2).*(?3)')(toDateTime(EventTime), eventType = 'A', eventType = 'B', eventType = 'A')

↘ Progress: 10.00 thousand rows, 170.03 KB (84.45 thousand rows/s., 1.44 MB/s.) Received exception from server (version 19.1.0):
Code: 160. DB::Exception: Received from localhost:9000, ::1. DB::Exception: Pattern application proves too difficult, exceeding max iterations (1000000).

0 rows in set. Elapsed: 0.966 sec. Processed 10.00 thousand rows, 170.03 KB (10.36 thousand rows/s., 176.05 KB/s.)

On this branch

Query 1:

SELECT userID
FROM test.sequence
GROUP BY userID
HAVING sequenceMatch('(?1).*(?2).*(?3)')(toDateTime(EventTime), eventType = 'A', eventType = 'B', eventType = 'C')

┌─userID─┐
│      1 │
└────────┘

1 rows in set. Elapsed: 0.027 sec. Processed 10.00 thousand rows, 170.03 KB (370.44 thousand rows/s., 6.30 MB/s.)

Query 2:

SELECT userID
FROM test.sequence
GROUP BY userID
HAVING sequenceMatch('(?1).*(?2).*(?3)')(toDateTime(EventTime), eventType = 'A', eventType = 'B', eventType = 'A')

Ok.

0 rows in set. Elapsed: 0.032 sec. Processed 10.00 thousand rows, 170.03 KB (311.08 thousand rows/s., 5.29 MB/s.)

@LeoErcolanelli
Contributor Author

Should I go ahead and implement the approach where the DFA filters the events before feeding them to the backtracking algorithm (for patterns containing (?t)), or should this PR be merged first so we keep the set of changes small?

@alexey-milovidov
Member

Let's merge this PR first.

@alexey-milovidov
Member

Can you also use dfa_match if possible for sequenceCount?

@alexey-milovidov
Member

PS. If the number of states is small (for example, 16), we can use an extremely efficient implementation based on table lookup (with an entry for every combination of active states).
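
For illustration, a toy sketch of that table-lookup idea (hypothetical code, not part of this PR): with at most 16 pattern states, the set of simultaneously active states fits in a uint16_t, the transition table is built once per pattern with one entry for every combination of active states, and matching then costs a single lookup per event.

#include <array>
#include <cstdint>
#include <vector>

int main()
{
    // 4 states for (?1).*(?2).*(?3): start, after (?1), after (?2), accept.
    constexpr size_t num_states = 4;
    // An event's "class" is the bitmask of which of the 3 conditions it satisfies.
    constexpr size_t num_classes = 1 << (num_states - 1);

    // table[mask][cls] = set of states active after an event of class `cls`
    // while the states in `mask` are active. Built once per pattern.
    std::vector<std::array<uint16_t, num_classes>> table(1 << num_states);
    for (uint32_t mask = 0; mask < (1u << num_states); ++mask)
        for (uint32_t cls = 0; cls < num_classes; ++cls)
        {
            uint16_t next = static_cast<uint16_t>(mask); // '.*' lets every state persist
            for (size_t s = 0; s + 1 < num_states; ++s)
                if ((mask & (1u << s)) && (cls & (1u << s)))
                    next |= static_cast<uint16_t>(1u << (s + 1)); // condition s+1 advances state s
            table[mask][cls] = next;
        }

    // One lookup per event: here the events satisfy conditions 1, 2, 3 in order.
    const uint32_t event_classes[] = {0b001, 0b010, 0b100};
    uint16_t active = 1; // only the start state is active initially
    for (uint32_t cls : event_classes)
        active = table[active][cls];
    return (active & (1u << (num_states - 1))) ? 0 : 1; // accept state reached -> match
}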
