Neural networks are non-deterministic with respect to their runtime, predictive performance and learned motifs; that is, even with fixed input sequences and hyper-parameters, you will often get different results when training a network multiple times. This is due to the random initialization of the network weights and the stochastic nature of gradient-descent-based optimizers (e.g. Adam in our case) combined with mini-batch training.
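If you want to reduce this run-to-run variation, fixing the random seeds of the libraries involved helps. Below is a minimal sketch assuming a Keras/TensorFlow backend (which pysster builds on); the exact seeding call depends on your TensorFlow version, and GPU kernels can still introduce some non-determinism.

```python
# Sketch: reducing (not eliminating) run-to-run variation by fixing random seeds.
# Assumes a Keras/TensorFlow backend; adapt the TF call to your version.
import os
import random

import numpy as np
import tensorflow as tf

SEED = 42
os.environ["PYTHONHASHSEED"] = str(SEED)  # hash-based randomness in Python
random.seed(SEED)                         # Python's built-in RNG
np.random.seed(SEED)                      # NumPy RNG (shuffling, some initializers)
tf.random.set_seed(SEED)                  # TensorFlow 2.x; use tf.set_random_seed(SEED) on 1.x

# Note: cuDNN/GPU operations can still be non-deterministic, so results may
# vary slightly across runs even with all seeds fixed.
```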
While the predictive performance is usually very stable and high (as shown in the Supplementary Table of the pysster paper), the runtime and the learned motifs might differ to some degree. Neural networks are first and foremost classifiers, not motif finders. They just happen to learn motifs (or something we interpret as motifs) in order to perform the classification task. When training a network with only a few convolutional kernels (== motifs), the network will usually still do a decent job at classifying the sequences, but it might not learn all of the desired motifs. This is especially true for co-occurring motifs, because in that case a single motif is enough to do the classification; the co-occurring motifs are just redundant information not needed for the task. This description is a little exaggerated, because biological data are noisy and perfect co-occurrence rarely happens in practice.
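To make the redundancy argument concrete, here is a toy illustration with hypothetical data (scikit-learn used only for brevity): when two motifs always co-occur in the positive class, a classifier that detects just one of them already separates the classes perfectly, so it has no incentive to learn the second.

```python
# Toy example: perfectly co-occurring motifs are redundant for classification.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
labels = rng.integers(0, 2, n)   # 1 = positive class, 0 = negative class
has_motif_a = labels.copy()      # motif A present in every positive sequence
has_motif_b = labels.copy()      # motif B perfectly co-occurs with motif A

# A classifier using motif A alone already achieves perfect accuracy,
# so motif B carries no additional information for this task.
clf = LogisticRegression().fit(has_motif_a.reshape(-1, 1), labels)
print(clf.score(has_motif_a.reshape(-1, 1), labels))  # 1.0
```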
Nevertheless, in our experience, training a network with only a few kernels, as done in the RNA A-to-I editing tutorial, will give you stable predictive performance, but it might also give you different motifs from run to run, particularly on such a big data set. Here, we used only 20 kernels for the sake of the tutorial. In practice, we recommend 50 or 100 kernels for this number of sequences (and this degree of diversity in the data) to get more consistent motifs and a higher chance of capturing many different motifs in a single run. Of course, this in turn increases the runtime and the difficulty of interpretation, because there are now more motifs to look at. For less diverse data (e.g. CLIP-seq binding sites), for which we would expect fewer motifs, a lower number of kernels is appropriate.
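As a sketch of what this looks like in practice, the snippet below increases the kernel number via the params dict, following the pattern used in the pysster tutorials; the file names are placeholders, and you should double-check the parameter names against the documentation of your installed pysster version.

```python
# Sketch (placeholder file names, parameter names as in the pysster tutorials):
# train with 50 kernels instead of 20 on a large, diverse data set.
from pysster.Data import Data
from pysster.Model import Model

data = Data(["positives.fasta", "negatives.fasta"], "ACGT")

params = {"kernel_num": 50,   # more kernels -> more consistent, more diverse motifs
          "kernel_len": 25}   # kernel length, i.e. the maximum motif length
model = Model(params, data)
model.train(data)
```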
Please keep in mind that there is no silver bullet in machine learning and everything depends on the individual data set in question. You should always bring your domain knowledge to bear when tuning the parameters of a method (which includes the possibility of switching to another machine learning technique). If you have questions about how to pick pysster parameters for a particular data set, feel free to ask us.