Deduplicate rows filter

The Deduplicate rows filter removes duplicated rows from a HXL dataset. By default, the filter looks for rows where all tagged columns have identical values, but you can specify a specific set of columns to look at instead (ignoring the rest). The filter will keep the first instance of each duplicated column (use a Sort rows filter first to change the order of the rows if necessary, e.g. sorting by date).

Options

Tags to look at for deduplication: a list of tag patterns for the columns that make a row unique (e.g. "activity+code,date"). If omitted, then all tagged columns must match.

Deduplicate only rows matching this query: a row query matching the rows that should be considered. The HXL Proxy will ignore any row not matching the query during the deduplication process.

Example

Before

#org	#sector	#adm1	#date
UNICEF	WASH	Coast	2016-11-15
TdH	Health	Coast	2016-11-12
TdH	Health	Coast	2016-12-01
UNICEF	WASH	Coast	2016-11-15

After

#org	#sector	#adm1	#date
UNICEF	WASH	Coast	2016-11-15
TdH	Health	Coast	2016-11-12
TdH	Health	Coast	2016-12-01

Use cases

This filter is especially useful after using the Append datasets filter on two or more datasets that might contain duplicate rows (possibly after using the Clean data filter first to normalise dates, numbers, and whitespace).

Where records contain one or more columns that make up a unique identifier, using only those for deduplication can help ignore superficial differences in other columns.

To pick only the most recent version of each record, you could first use a Sort rows filter to sort the data by the #date column in descending order, then use this filter with the relevant columns (e.g. #org, #sector, #adm1) to pick the latest activity that matches each set of criteria.

Learn more about the HXL standard at http://hxlstandard.org

Home
Use cases
Core workflow
- Source page
- Tagger page
- Recipe page
  - Filters
- View page
Special topics
For coders
Additional API services

Provide feedback

Saved searches