We have reconsidered the evaluation criteria. Instead of 100 random citations overall, we now select 10 random citations for each regular expression. We chose to examine a fixed number of citations per regular expression, rather than a number proportional to its match count, to remove the bias introduced by the size of the population under consideration.
The function that generates the evaluation data is get_random_results in evaluation.py.
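The per-pattern sampling described above can be sketched as follows. This is only an illustration of the approach, not the actual implementation of get_random_results: the signature, parameter names, and the dict-based input format are assumptions, since the source does not show the function body.

```python
import random


def get_random_results(matches_by_regex, sample_size=10, seed=None):
    """Sample a fixed number of matched citations per regular expression.

    matches_by_regex: dict mapping each regex pattern (str) to the list
    of citations it matched. Using a fixed sample size per pattern,
    instead of a size proportional to the match count, keeps patterns
    with many matches from dominating the evaluation sample.
    NOTE: signature and input format are hypothetical.
    """
    rng = random.Random(seed)  # seedable for reproducible evaluation data
    sample = {}
    for pattern, citations in matches_by_regex.items():
        # A pattern may have fewer matches than the requested sample size.
        k = min(sample_size, len(citations))
        sample[pattern] = rng.sample(citations, k)
    return sample
```

Sampling without replacement (random.sample) ensures each citation is reviewed at most once per pattern, and capping at the available match count handles rare patterns gracefully.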