-
Notifications
You must be signed in to change notification settings - Fork 4
Deduplicate rows filter
The Deduplicate rows filter removes duplicated rows from a HXL dataset. By default, the filter looks for rows where all tagged columns have identical values, but you can specify a specific set of columns to look at instead (ignoring the rest). The filter will keep the first instance of each duplicated column (use a Sort rows filter first to change the order of the rows if necessary, e.g. sorting by date).
Tags to look at for deduplication: a list of tag patterns for the columns that make a row unique (e.g. "activity+code,date"). If omitted, then all tagged columns must match.
Deduplicate only rows matching this query: a row query matching the rows that should be considered. The HXL Proxy will ignore any row not matching the query during the deduplication process.
#org | #sector | #adm1 | #date |
---|---|---|---|
UNICEF | WASH | Coast | 2016-11-15 |
TdH | Health | Coast | 2016-11-12 |
TdH | Health | Coast | 2016-12-01 |
UNICEF | WASH | Coast | 2016-11-15 |
#org | #sector | #adm1 | #date |
---|---|---|---|
UNICEF | WASH | Coast | 2016-11-15 |
TdH | Health | Coast | 2016-11-12 |
TdH | Health | Coast | 2016-12-01 |
This filter is especially useful after using the Append datasets filter on two or more datasets that might contain duplicate rows (possibly after using the Clean data filter first to normalise dates, numbers, and whitespace).
Where records contain one or more columns that make up a unique identifier, using only those for deduplication can help ignore superficial differences in other columns.
To pick only the most recent version of each record, you could first use a Sort rows filter to sort the data by the #date column in descending order, then use this filter with the relevant columns (e.g. #org, #sector, #adm1) to pick the latest activity that matches each set of criteria.
Learn more about the HXL standard at http://hxlstandard.org