Use DFA algorithm for sequenceMatch without (?t) #4004
Conversation
Maybe it would be better just to add that to user-level settings instead of passing it back and forth with every function call?

Also: letting users increase that without any limit can be quite dangerous, i.e. I can increase it to 9223372036854775807, and even if the CPU is able to do 10^12 iterations/sec, it will take something like half a year to do all the iterations for one particular group. So generally, if you hit that limit, most probably something is not going well: your groups are too big or your conditions are too complicated, and maybe you need to try a different approach instead of forcing max_iterations for sequenceMatch.

Mariya Mansurova mentioned some approaches in this presentation: https://youtu.be/YpurT78U2qA?t=433
There was also some discussion in #2096. And there is also a function called windowFunnel: #2352
@filimonov Thanks a lot for the quick answer! I will have a look at all the resources you linked.
Yep, looks like the context is not passed to aggregate functions (while being available in regular functions). Maybe it was done like that intentionally? @alexey-milovidov ?
@filimonov I gave a try at passing the context down to the aggregate functions. Changing the definition of […]
I started implementing passing down a Context wherever it's needed, but the diff will be quite big, and I wonder if there is not a cleverer way of doing it. What I observed is that […]
No, it just was not needed yet.
@ercolanelli-leo You see that adding a dependency on […]
This PR looks almost OK, but a few considerations exist: […]
But it is possible to implement a more efficient algorithm for the subset of regexps that are used in most cases (like […]). Also it is possible to make […]
@alexey-milovidov Thanks a lot for the detailed reply! Regarding the upper bound: I understand from your message that 1 billion would be a suitable upper bound, so I'll add a check for it in my PR. Regarding the algorithm itself: the fact that the backtracking is sub-optimal is also something I had thought about, which is why I tried to implement a DFA approach.
I still have doubts whether such a change introduces anything. If we allow parametrizing max_iterations in a range (1 million...1 billion), some users will still have the impression that this limit is synthetic (and that's true), and they will be sure that 'for their one particular case' it should be extended to N+1 billion, and that they 'are ok to wait'. Maybe it's better to consider making it configurable in the server config (it's ok to expect that all servers in a cluster have the same settings). I agree that the best option would be to improve the algorithm.
If we want to pass the Context down to the aggregating function, I actually have a commit ready. It's a rather big change and I am not sure I went with the best solution. If you wish, I can push the commit on this branch for you to see.
No. As @alexey-milovidov said, it's not the best idea. But I think it's worth trying to improve the performance first, and maybe max_iterations will not be needed at all.
I didn't dig into that, but it looks like you can just try to preprocess the sequence of events with a simple DFA, just to collect / filter out only the needed events, ignoring the time conditions at the beginning, and each time the last condition is satisfied, check the time conditions (if any) back inside the filtered sequence (with backtracking). So your DFA can collect something like: all events matching the first condition and their times; all events matching the second condition which appear AFTER the first match of the first condition, and their times; etc. When you get to the last condition, you need to check the times inside that filtered sequence.

That should work in linear time for sequences without time conditions. For sequences with time conditions it will still have an exponential component (you will need to backtrack), but exponential in the number of submatches, not in the whole sequence length, which is much better. Maybe, if the matches are collected in a good structure, it would be much better than exponential: for example, collect the timestamps per condition in an ordered list/array. In that case, when you come to the last condition, you just need to filter, in each pair, the timestamps which satisfy the conditions, and check whether the sequence is still possible with the filtered matches.
max_iterations parameter to sequence(Match|Count)(?t).
@filimonov Thanks a lot for the insights, the filtering of the states before feeding them to the backtracking algorithm looks like a great idea! At the moment I have implemented the DFA-based algorithm for the case where there are no (?t) time conditions in the pattern.
The table:

```sql
CREATE TABLE test.sequence
(
    userID UInt64,
    eventType Enum8('A' = 1, 'B' = 2, 'C' = 3),
    EventTime UInt64
)
ENGINE = Memory
```

The data: […]
On master:
Query 1: […]
Query 2: […]

On this branch:
Query 1: […]
Query 2: […]
Should I go ahead and implement the approach with the discrimination of the states by the DFA before feeding them to the backtracking algorithm (for the patterns containing (?t))?
Let's merge this PR first.
Can you also use […]
PS. If the number of states is small (for example, 16), we can use an extremely efficient implementation based on table lookup (for every combination of active states).
I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en
For changelog. Remove if this is a non-significant change.
Category (leave one):
Short description (up to a few sentences):
Add an optional parameter to sequenceMatch and sequenceCount in order to temporarily lift the hard-coded limit on the number of iterations.

Detailed description (optional):
As of today, the constant sequence_match_max_iterations sets an arbitrary limit on the number of iterations supported by sequenceMatch and sequenceCount. In certain cases, making this limit configurable may be desirable (see the currently open issue). In order to do so, a new parameter has been added to those two functions. The fact that the parameter is optional and defaults to the previous hard-coded value makes this change backward compatible.
I think going for a query setting would have been cleaner, but I've found no way to access the query settings from within the implementation of a function. If such a way exists, I'll gladly modify the PR to interact with the settings rather than use an optional parameter.
Let me know if anything needs to be changed!