Basic Collecting

Simon Proctor edited this page Oct 1, 2013 · 5 revisions

Lucene allows you to change the way results are collected. This has a number of practical uses, and we'll use a real-world sample here to describe the current API.

In a number of news- and blog-related applications we've worked on, we have needed to provide summary views and dynamic navigation based on the data that is currently present. Depending on the project, we don't always have access to a simple relational database and have to aggregate the data differently. One example is Sitecore, where we had to generate the site navigation from the news stories available: if there were no stories for a month or year, that month or year was left out of the navigation.

The sample below, taken from the test suite, is a simple version of this. It shows how to count the documents that match a search term and group them according to their day of the week. This is very similar to a COUNT with GROUP BY in SQL, but with the power of Lucene:

		IQueryBuilder queryBuilder = new QueryBuilder();
		queryBuilder.Setup
			(
				x => x.WildCard(BBCFields.Description, "food"),
				// Explicit dates avoid culture-dependent parsing of strings such as "28/02/2013"
				x => x.Filter(DateRangeFilter.Filter(BBCFields.PublishDateObject, new DateTime(2013, 2, 1), new DateTime(2013, 2, 28)))
			);

		DateCollector collector = new DateCollector();
		luceneSearch.Collect(queryBuilder.Build(), collector);

		Assert.Greater(collector.DailyCount.Keys.Count, 0);
		foreach (String day in collector.DailyCount.Keys)
		{
			Console.Error.WriteLine("Day: {0} had {1} documents", day, collector.DailyCount[day]);
		}

		Console.WriteLine();

Here we search for all documents that contain 'food' in the description field and use a filter to restrict the date range to February 2013. The collector then counts the matching documents by day of the week, so we find how many documents mentioning food were published on each weekday during that month.

When run, depending on the contents of the index, the output is similar to:

Day: Wednesday had 3 documents
Day: Tuesday had 2 documents
Day: Friday had 2 documents
Day: Thursday had 3 documents
Day: Sunday had 1 documents
Day: Monday had 5 documents
Day: Saturday had 1 documents
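
To make the SQL comparison concrete, the grouping that produces these counts can be sketched with plain LINQ over in-memory values. This is only an illustration under stated assumptions: `DayOfWeekSummary` and `CountByDay` are hypothetical names, not part of Lucinq or Lucene.Net, and the sketch assumes the publish dates are available as DateTime ticks.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DayOfWeekSummary
{
	// Hypothetical helper, not part of Lucinq or Lucene.Net.
	// Groups publish dates (held as DateTime ticks) by day of the week and
	// counts each group, much like COUNT(*) with GROUP BY in SQL.
	public static Dictionary<string, int> CountByDay(IEnumerable<long> publishTicks)
	{
		return publishTicks
			// Convert each tick value to the name of its weekday, e.g. "Monday"
			.Select(ticks => new DateTime(ticks).DayOfWeek.ToString())
			.GroupBy(day => day)
			.ToDictionary(group => group.Key, group => group.Count());
	}
}
```

Passing the ticks for 4, 5 and 11 February 2013 would give a count of 2 for Monday and 1 for Tuesday. A custom collector arrives at the same kind of result without first loading every matching document into memory.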

The collector, also included in the test suite, looks like this:

public class DateCollector : Collector
{
	public int Count { get; private set; }

	// Publish dates (as DateTime ticks) for the index segment currently being collected
	private long[] dates;

	public Dictionary<String, int> DailyCount { get; set; }

	public DateCollector()
	{
		dates = new long[10];
		DailyCount = new Dictionary<String, int>();
	}

	public void Reset()
	{
		// Clear both the total and the per-day counts so the collector can be reused
		Count = 0;
		DailyCount.Clear();
	}

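	/// <summary>
	/// Called once for every matching document. Looks up the document's publish
	/// date and increments the count for its day of the week.
	/// </summary>
	/// <param name="docId">The document id, relative to the reader passed to SetNextReader.</param>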
	public override void Collect(int docId)
	{
		Count = Count + 1;

		try
		{
			long temp = dates[docId];

			DateTime date = new DateTime(temp);
			String day = date.DayOfWeek.ToString();

			if (!DailyCount.ContainsKey(day))
			{
				DailyCount[day] = 1;
			}
			else
			{
				DailyCount[day]++;
			}
		}
		catch (Exception ex)
		{
			Console.Error.WriteLine(ex.ToString());
		}
		
	}

	public override void SetScorer(Scorer scorer) { }

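	public override void SetNextReader(IndexReader reader, int docBase)
	{
		// Called for each index segment: load that segment's publish dates from the field cache
		dates = FieldCache_Fields.DEFAULT.GetLongs(reader, BBCFields.PublishDateObject);
	}

	// The counts do not depend on document order, so out-of-order collection is fine
	public override bool AcceptsDocsOutOfOrder
	{
		get { return true; }
	}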
}

This is a very simple implementation. It relies on the dates having been indexed with the Lucinq extensions that convert a DateTime into a numeric (long) field, so that we can read them back through the field cache without any string-based parsing. We also accept documents out of order: the counts do not depend on the order in which documents arrive and we aren't interested in scoring by relevancy, so this may let Lucene optimise the call as well.
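
The numeric round trip can be sketched in plain .NET. This is a minimal sketch, assuming the stored long is the DateTime's tick count, which is how the collector reads it back with `new DateTime(temp)`. `DateTicks`, `ToIndexValue` and `FromIndexValue` are hypothetical names for illustration; the actual Lucinq extension methods may differ.

```csharp
using System;

public static class DateTicks
{
	// Hypothetical helper: the long value that would be written to the numeric field.
	public static long ToIndexValue(DateTime date)
	{
		return date.Ticks;
	}

	// Hypothetical helper: rebuilds the DateTime from the stored long.
	// Note that DateTimeKind is not preserved; the result is Unspecified.
	public static DateTime FromIndexValue(long ticks)
	{
		return new DateTime(ticks);
	}
}
```

Because the value is a plain long, recovering the day of the week needs no string parsing and no culture information: `DateTicks.FromIndexValue(ticks).DayOfWeek` is enough.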