Skip to content

Deduplicate rows filter

David Megginson edited this page May 12, 2017 · 5 revisions
Deduplicate rows filter form

The Deduplicate rows filter removes duplicated rows from a HXL dataset. By default, the filter looks for rows where all tagged columns have identical values, but you can specify a specific set of columns to look at instead (ignoring the rest). The filter will keep the first instance of each duplicated column (use a Sort rows filter first to change the order of the rows if necessary, e.g. sorting by date).

Options

Tags to look at for deduplication: a list of tag patterns for the columns that make a row unique (e.g. "activity+code,date"). If omitted, then all tagged columns must match.

Deduplicate only rows matching this query: a row query matching the rows that should be considered. The HXL Proxy will ignore any row not matching the query during the deduplication process.

Example

Before

#org #sector #adm1 #date
UNICEF WASH Coast 2016-11-15
TdH Health Coast 2016-11-12
TdH Health Coast 2016-12-01
UNICEF WASH Coast 2016-11-15

After

#org #sector #adm1 #date
UNICEF WASH Coast 2016-11-15
TdH Health Coast 2016-11-12
TdH Health Coast 2016-12-01

Use cases

This filter is especially useful after using the Append datasets filter on two or more datasets that might contain duplicate rows (possibly after using the Clean data filter first to normalise dates, numbers, and whitespace).

Where records contain one or more columns that make up a unique identifier, using only those for deduplication can help ignore superficial differences in other columns.

To pick only the most recent version of each record, you could first use a Sort rows filter to sort the data by the #date column in descending order, then use this filter with the relevant columns (e.g. #org, #sector, #adm1) to pick the latest activity that matches each set of criteria.

Clone this wiki locally