Skip to content

Motivations

Olivier Lartillot edited this page Oct 24, 2017 · 1 revision

Through this method of constructing patterns of various lengths, along multiple levels of magnitude, from very short patterns to very long patterns, the aim is to grossly cover a large range of types of pattern structures that could be considered in theory.

For instance, filling a sequence using random symbols from a fixed alphabet would correspond to the use solely of the smallest level. In this framework, it is also possible to avoid constructing patterns of this smallest level.

This method is not aimed at covering in details all possible types of pattern structures. It is just a very rough approximation of various types of configurations. For instance, pattern lengths can only be set to the values 3, 30, 300, etc.

As a result of the process, some occurrences of patterns are truncated at their start, middle or end, leading to the creation of subpatterns. As such, the actual content of each sequence does not exactly correspond to the specifications. But since the same type of overlap happens for each sequence length, the overall distribution of pattern lengths, normalised by the sequence length, should remain stable and similar when comparing different sequence lengths. Experimental results show indeed that the average runtime and memory usage for a given configuration and sequence length remains stable, and that for an algorithm exhibiting a linear complexity, both runtime and memory usage grow proportionally to the sequence length.

Parts of the generated sequences that have not been specified are filled with don't care character (usually written $, here 0). If those parts were filled instead with the successive repetition of a same character from the alphabet, this would produce a large amount of periodic sequences. In this first version of PatMakr, we decided to avoid generating periodic sequences, as they lead to particular structural characteristics that require particular processes in the pattern mining algorithm that need to be taken into consideration efficiently.

This pattern generation algorithm is used as part of an ongoing research (publication submitted to a journal), for the evaluation of the results of various pattern mining methods. The taking into consideration of periodic pattern is planned for a subsequent publication.

Clone this wiki locally