Add duplicates component (#10323)
* Update existing behaviour to match new

* Add signatures

* Red test

* First test green

* sbt javafmtAll

* In-Memory working

* Not implemented for In-Db

* Docs

* Disable tests for in-db

* Changelog

* Code review changes

* Fix

* Fix

* Fix tests
AdRiley authored Jun 24, 2024
1 parent 791dba6 commit c324c78
Showing 12 changed files with 241 additions and 43 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -42,6 +42,7 @@
 - [Implemented `.cast` to and from `Decimal` columns for the in-memory
   database.][10206]
 - [Implemented fallback to Windows-1252 encoding for `Encoding.Default`.][10190]
+- [Added Table.duplicates component][10323]

 [debug-shortcuts]:

@@ -50,6 +51,7 @@
 [10130]: https://github.com/enso-org/enso/pull/10130
 [10206]: https://github.com/enso-org/enso/pull/10206
 [10190]: https://github.com/enso-org/enso/pull/10190
+[10323]: https://github.com/enso-org/enso/pull/10323

 <br/>![Release Notes](/docs/assets/tags/release_notes.svg)

4 changes: 2 additions & 2 deletions distribution/lib/Standard/Base/0.0.0-dev/src/Data/Array.enso
@@ -357,9 +357,9 @@ type Array
        first duplicate appeared in the input.

        > Example
-         Removing repeating entries.
+         Removing unique entries.

-             [1, 3, 1, 2, 2, 1].to_array . duplicates == [1, 2].to_array
+             [1, 3, 1, 2, 2, 1].to_array . duplicates == [1, 1, 2, 2, 1].to_array
     duplicates : (Any -> Any) -> Vector Any
     duplicates self (on = x->x) =
         Array_Like_Helpers.duplicates self on
4 changes: 2 additions & 2 deletions distribution/lib/Standard/Base/0.0.0-dev/src/Data/Vector.enso
@@ -1227,9 +1227,9 @@ type Vector a
        first duplicate appeared in the input.

        > Example
-         Removing repeating entries.
+         Removing unique entries.

-             [1, 3, 1, 2, 2, 1] . duplicates == [1, 2]
+             [1, 3, 1, 2, 2, 1] . duplicates == [1, 1, 2, 2, 1]
     duplicates : (Any -> Any) -> Vector Any
     duplicates self (on = x->x) =
         Array_Like_Helpers.duplicates self on

@@ -165,11 +165,14 @@ distinct vector on =
         existing.insert key True

 duplicates vector on = Vector.build builder->
-    vector.fold Map.empty current-> item->
+    counts = vector.fold Map.empty current-> item->
         key = on item
         count = current.get key 0
-        if count == 1 then builder.append item
         current.insert key count+1
+    vector.map item->
+        key = on item
+        count = counts.get key 0
+        if count != 1 then builder.append item

 take vector range = case range of
     ## We are using a specialized implementation for `take Sample`, because
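
The reworked helper above changes `duplicates` from "one representative per repeated key" to "every element whose key occurs more than once". A minimal Java sketch of the same two-pass idea follows; the class and method names (`DuplicatesSketch`, `duplicates`) are hypothetical and not part of this commit, and Java's `equals` stands in for Enso equality (so, unlike the Enso tests, `1` and `1.0` would not be treated as equal here).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration only; not code from the commit.
final class DuplicatesSketch {
  /** Returns every item whose key occurs more than once, preserving input order. */
  static <T, K> List<T> duplicates(List<T> items, Function<T, K> on) {
    // Pass 1: count how many times each key occurs.
    Map<K, Integer> counts = new HashMap<>();
    for (T item : items) {
      counts.merge(on.apply(item), 1, Integer::sum);
    }
    // Pass 2: keep only the items whose key was counted more than once.
    List<T> result = new ArrayList<>();
    for (T item : items) {
      if (counts.get(on.apply(item)) > 1) {
        result.add(item);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Prints [1, 1, 2, 2, 1], matching the updated documentation example.
    System.out.println(duplicates(List.of(1, 3, 1, 2, 2, 1), x -> x));
  }
}
```

Two passes are needed because the first occurrence of a key cannot be classified as a duplicate until a later occurrence has been seen.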
41 changes: 36 additions & 5 deletions distribution/lib/Standard/Database/0.0.0-dev/src/DB_Table.enso
@@ -1323,23 +1323,54 @@ type DB_Table
          raised as an error regardless of the problem behavior, because it is
          not possible to create a table without any columns.
        - If a column in `columns` is not in the input table, a
-         `Missing_Input_Columns` is raised as an error, unless
-         `error_on_missing_columns` is set to `False`, in which case the
-         problem is reported according to the `on_problems` setting.
+         `Missing_Input_Columns` is raised as an error.
        - If no valid columns are selected, a `No_Input_Columns_Selected` is
          reported as a dataflow error regardless of setting.
        - If floating point values are present in the distinct columns, a
          `Floating_Point_Equality` is reported according to the `on_problems`
          setting.
     @columns Widget_Helpers.make_column_name_multi_selector
     distinct : Vector (Integer | Text | Regex) | Text | Integer | Regex -> Case_Sensitivity -> Boolean -> Problem_Behavior -> DB_Table ! No_Output_Columns | Missing_Input_Columns | No_Input_Columns_Selected | Floating_Point_Equality
-    distinct self columns=self.column_names case_sensitivity:Case_Sensitivity=..Default error_on_missing_columns:Boolean=True on_problems:Problem_Behavior=Report_Warning =
-        key_columns = self.columns_helper.select_columns columns Case_Sensitivity.Default reorder=True error_on_missing_columns=error_on_missing_columns on_problems=on_problems . catch No_Output_Columns _->
+    distinct self columns=self.column_names case_sensitivity:Case_Sensitivity=..Default on_problems:Problem_Behavior=Report_Warning =
+        key_columns = self.columns_helper.select_columns columns Case_Sensitivity.Default reorder=True error_on_missing_columns=True on_problems=on_problems . catch No_Output_Columns _->
             Error.throw No_Input_Columns_Selected
         problem_builder = Problem_Builder.new
         new_table = self.connection.dialect.prepare_distinct self key_columns case_sensitivity problem_builder
         problem_builder.attach_problems_before on_problems new_table

+    ## GROUP Standard.Base.Selections
+       ICON preparation
+       Returns the set of rows which are duplicated within the specified columns from the
+       input table.
+
+       When multiple rows have the same values within the specified columns, all of those rows
+       are returned. Rows which are unique within the specified columns are removed.
+
+       Arguments:
+       - columns: The columns of the table to use for distinguishing the rows.
+       - case_sensitivity: Specifies if the text values should be compared case
+         sensitively.
+       - on_problems: Specifies how to handle problems if they occur, reporting
+         them as warnings by default.
+
+       ! Error Conditions
+
+       - If there are no columns in the output table, a `No_Output_Columns` is
+         raised as an error regardless of the problem behavior, because it is
+         not possible to create a table without any columns.
+       - If a column in `columns` is not in the input table, a
+         `Missing_Input_Columns` is raised as an error.
+       - If no valid columns are selected, a `No_Input_Columns_Selected` is
+         reported as a dataflow error regardless of setting.
+       - If floating point values are present in the selected columns, a
+         `Floating_Point_Equality` is reported according to the `on_problems`
+         setting.
+    @columns Widget_Helpers.make_column_name_multi_selector
+    duplicates : Vector (Integer | Text | Regex) | Text | Integer | Regex -> Case_Sensitivity -> Problem_Behavior -> DB_Table ! No_Output_Columns | Missing_Input_Columns | No_Input_Columns_Selected | Floating_Point_Equality
+    duplicates self columns=self.column_names case_sensitivity:Case_Sensitivity=..Default on_problems:Problem_Behavior=..Report_Warning =
+        _ = [columns, case_sensitivity, on_problems]
+        Error.throw (Unsupported_Database_Operation.Error "DB_Table.duplicates is not implemented yet for the Database backends.")
+
     ## GROUP Standard.Base.Calculations
        ICON join
        Joins two tables according to the specified join conditions.

@@ -336,7 +336,7 @@ rename_columns (naming_helper : Column_Naming_Helper) (internal_columns:Vector)
         ## Attempt to treat as Map
         map = Map.from_vector mapping error_on_duplicates=False
         if map.length == mapping.length then rename_columns naming_helper internal_columns map case_sensitivity error_on_missing_columns on_problems else
-            duplicates = mapping.duplicates on=_.first . map p->p.first.to_text
+            duplicates = mapping.duplicates on=_.first . map p->p.first.to_text . distinct
             duplicate_text = if duplicates.length < 5 then duplicates.to_vector . join ", " else
                 duplicates.take 3 . to_vector . join ", " + (", ... " + (duplicates.length - 3).to_text + " others")
             Error.throw (Illegal_Argument.Error "duplicate old name mappings ("+duplicate_text+").")
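
The added `. distinct` in the hunk above appears to be needed because `duplicates` now yields every occurrence of a repeated name, so the error text would otherwise list the same name several times. A rough Java sketch of that message-building step, assuming plain string names; `DuplicateNamesSketch` and `describe` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical illustration only; not code from the commit.
final class DuplicateNamesSketch {
  /** Summarises duplicated names: all of them if fewer than five, else the first three. */
  static String describe(List<String> duplicatedNames) {
    // Collapse repeated occurrences while keeping first-seen order.
    List<String> unique = new ArrayList<>(new LinkedHashSet<>(duplicatedNames));
    if (unique.size() < 5) {
      return String.join(", ", unique);
    }
    String head = String.join(", ", unique.subList(0, 3));
    return head + ", ... " + (unique.size() - 3) + " others";
  }

  public static void main(String[] args) {
    // Prints: a, b
    System.out.println(describe(List.of("a", "a", "b", "b", "a")));
  }
}
```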
47 changes: 42 additions & 5 deletions distribution/lib/Standard/Table/0.0.0-dev/src/Table.enso
@@ -939,18 +939,16 @@ type Table
          raised as an error regardless of the problem behavior, because it is
          not possible to create a table without any columns.
        - If a column in `columns` is not in the input table, a
-         `Missing_Input_Columns` is raised as an error, unless
-         `error_on_missing_columns` is set to `False`, in which case the
-         problem is reported according to the `on_problems` setting.
+         `Missing_Input_Columns` is raised as an error.
        - If no valid columns are selected, a `No_Input_Columns_Selected` is
          reported as a dataflow error regardless of setting.
        - If floating point values are present in the distinct columns, a
          `Floating_Point_Equality` is reported according to the `on_problems`
          setting.
     @columns Widget_Helpers.make_column_name_multi_selector
     distinct : Vector (Integer | Text | Regex) | Text | Integer | Regex -> Case_Sensitivity -> Boolean -> Problem_Behavior -> Table ! No_Output_Columns | Missing_Input_Columns | No_Input_Columns_Selected | Floating_Point_Equality
-    distinct self (columns = self.column_names) case_sensitivity:Case_Sensitivity=Case_Sensitivity.Default error_on_missing_columns:Boolean=True on_problems:Problem_Behavior=..Report_Warning =
-        key_columns = self.columns_helper.select_columns columns Case_Sensitivity.Default reorder=True error_on_missing_columns=error_on_missing_columns on_problems=on_problems . catch No_Output_Columns _->
+    distinct self (columns = self.column_names) case_sensitivity:Case_Sensitivity=Case_Sensitivity.Default on_problems:Problem_Behavior=..Report_Warning =
+        key_columns = self.columns_helper.select_columns columns Case_Sensitivity.Default reorder=True error_on_missing_columns=True on_problems=on_problems . catch No_Output_Columns _->
             Error.throw No_Input_Columns_Selected
         java_columns = key_columns.map c->c.java_column
         text_folding_strategy = Case_Sensitivity.folding_strategy case_sensitivity
@@ -959,6 +957,45 @@ type Table
                 self.java_table.distinct java_columns text_folding_strategy java_aggregator
         Table.Value java_table

+    ## GROUP Standard.Base.Selections
+       ICON preparation
+       Returns the set of rows which are duplicated within the specified columns from the
+       input table.
+
+       When multiple rows have the same values within the specified columns, all of those rows
+       are returned. Rows which are unique within the specified columns are removed.
+
+       Arguments:
+       - columns: The columns of the table to use for distinguishing the rows.
+       - case_sensitivity: Specifies if the text values should be compared case
+         sensitively.
+       - on_problems: Specifies how to handle problems if they occur, reporting
+         them as warnings by default.
+
+       ! Error Conditions
+
+       - If there are no columns in the output table, a `No_Output_Columns` is
+         raised as an error regardless of the problem behavior, because it is
+         not possible to create a table without any columns.
+       - If a column in `columns` is not in the input table, a
+         `Missing_Input_Columns` is raised as an error.
+       - If no valid columns are selected, a `No_Input_Columns_Selected` is
+         reported as a dataflow error regardless of setting.
+       - If floating point values are present in the selected columns, a
+         `Floating_Point_Equality` is reported according to the `on_problems`
+         setting.
+    @columns Widget_Helpers.make_column_name_multi_selector
+    duplicates : Vector (Integer | Text | Regex) | Text | Integer | Regex -> Case_Sensitivity -> Problem_Behavior -> Table ! No_Output_Columns | Missing_Input_Columns | No_Input_Columns_Selected | Floating_Point_Equality
+    duplicates self (columns = self.column_names) case_sensitivity:Case_Sensitivity=..Default on_problems:Problem_Behavior=..Report_Warning =
+        key_columns = self.columns_helper.select_columns columns Case_Sensitivity.Default reorder=True error_on_missing_columns=True on_problems=on_problems . catch No_Output_Columns _->
+            Error.throw No_Input_Columns_Selected
+        java_columns = key_columns.map c->c.java_column
+        text_folding_strategy = Case_Sensitivity.folding_strategy case_sensitivity
+        java_table = Illegal_Argument.handle_java_exception <|
+            Java_Problems.with_problem_aggregator on_problems java_aggregator->
+                self.java_table.duplicates java_columns text_folding_strategy java_aggregator
+        Table.Value java_table
+
     ## GROUP Standard.Base.Conversions
        ICON convert
        Parses columns within a `Table` to a specific value type.
24 changes: 24 additions & 0 deletions std-bits/table/src/main/java/org/enso/table/data/table/Table.java
@@ -239,6 +239,30 @@ public Table distinct(
     return new Table(newColumns);
   }

+  /**
+   * Creates a new table keeping only the rows whose key columns match those of another row.
+   *
+   * @param keyColumns set of columns to use as an index
+   * @param textFoldingStrategy a strategy for folding text columns
+   * @param problemAggregator an aggregator for problems
+   * @return a table containing only the rows whose key occurs more than once
+   */
+  public Table duplicates(
+      Column[] keyColumns,
+      TextFoldingStrategy textFoldingStrategy,
+      ProblemAggregator problemAggregator) {
+    var rowsToKeep =
+        Distinct.buildDuplicatesRowsMask(
+            rowCount(), keyColumns, textFoldingStrategy, problemAggregator);
+    int cardinality = rowsToKeep.cardinality();
+    Column[] newColumns = new Column[this.columns.length];
+    for (int i = 0; i < this.columns.length; i++) {
+      newColumns[i] = this.columns[i].applyFilter(rowsToKeep, cardinality);
+    }
+
+    return new Table(newColumns);
+  }
+
   /**
    * Selects a subset of columns of this table, by names.
    *
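
`Table.duplicates` above builds one row mask and then applies it to every column, passing the mask's cardinality along. As a sketch of what such a filtering step could look like on a plain list (the real `Column.applyFilter` is not shown in this diff, so `MaskFilterSketch` and `applyMask` are hypothetical stand-ins):

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical illustration only; not the actual Column.applyFilter implementation.
final class MaskFilterSketch {
  /** Keeps the entries of {@code column} whose index is set in {@code rowsToKeep}. */
  static <T> List<T> applyMask(List<T> column, BitSet rowsToKeep) {
    // The cardinality is the number of kept rows, so the result can be pre-sized.
    List<T> kept = new ArrayList<>(rowsToKeep.cardinality());
    for (int i = rowsToKeep.nextSetBit(0); i >= 0; i = rowsToKeep.nextSetBit(i + 1)) {
      kept.add(column.get(i));
    }
    return kept;
  }

  public static void main(String[] args) {
    BitSet mask = new BitSet();
    mask.set(0);
    mask.set(2);
    // Prints [x, z]
    System.out.println(applyMask(List.of("x", "y", "z"), mask));
  }
}
```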

@@ -2,8 +2,10 @@

 import java.util.Arrays;
 import java.util.BitSet;
+import java.util.HashMap;
 import java.util.HashSet;
 import java.util.List;
+import java.util.Map;
 import org.enso.base.text.TextFoldingStrategy;
 import org.enso.table.data.column.storage.Storage;
 import org.enso.table.data.index.MultiValueKeyBase;
@@ -15,6 +17,7 @@
 import org.graalvm.polyglot.Context;

 public class Distinct {
+
   /** Creates a row mask containing only the first row from sets of rows grouped by key columns. */
   public static BitSet buildDistinctRowsMask(
       int tableSize,
@@ -50,4 +53,42 @@ public static BitSet buildDistinctRowsMask(

     return mask;
   }
+
+  public static BitSet buildDuplicatesRowsMask(
+      int tableSize,
+      Column[] keyColumns,
+      TextFoldingStrategy textFoldingStrategy,
+      ProblemAggregator problemAggregator) {
+    ColumnAggregatedProblemAggregator groupingProblemAggregator =
+        new ColumnAggregatedProblemAggregator(problemAggregator);
+    Context context = Context.getCurrent();
+    var mask = new BitSet();
+    if (keyColumns.length != 0) {
+      Map<MultiValueKeyBase, Integer> visitedRows = new HashMap<>();
+      int size = keyColumns[0].getSize();
+      Storage<?>[] storage =
+          Arrays.stream(keyColumns).map(Column::getStorage).toArray(Storage[]::new);
+      List<TextFoldingStrategy> strategies = ConstantList.make(textFoldingStrategy, storage.length);
+      for (int i = 0; i < size; i++) {
+        UnorderedMultiValueKey key = new UnorderedMultiValueKey(storage, i, strategies);
+        key.checkAndReportFloatingEquality(
+            groupingProblemAggregator, columnIx -> keyColumns[columnIx].getName());
+
+        var keyIndex = visitedRows.get(key);
+        if (keyIndex == null) {
+          visitedRows.put(key, i);
+        } else {
+          mask.set(i);
+          mask.set(keyIndex);
+        }
+
+        context.safepoint();
+      }
+    } else {
+      // If there are no columns to distinct-by we just return the whole table.
+      mask.set(0, tableSize);
+    }
+
+    return mask;
+  }
 }
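
`buildDuplicatesRowsMask` above does its work in a single pass: it remembers the first row index for each key and, on any later hit, marks both the current row and that first row. A stripped-down sketch of the same idea over simple keys, leaving out text folding, floating-point checks and safepoints; `DuplicatesMaskSketch` and its method are hypothetical names.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; simplified from the approach shown in the diff.
final class DuplicatesMaskSketch {
  /** Sets bit i for every row i whose key also appears in some other row. */
  static <K> BitSet buildDuplicatesMask(List<K> keys) {
    BitSet mask = new BitSet();
    Map<K, Integer> firstSeen = new HashMap<>();
    for (int i = 0; i < keys.size(); i++) {
      // putIfAbsent returns the earlier row index if the key was already recorded.
      Integer first = firstSeen.putIfAbsent(keys.get(i), i);
      if (first != null) {
        mask.set(first); // retroactively mark the first occurrence
        mask.set(i); // mark the current occurrence
      }
    }
    return mask;
  }

  public static void main(String[] args) {
    // Prints {0, 1, 2, 4, 5}: row 3 ("c") is the only unique key.
    System.out.println(buildDuplicatesMask(List.of("a", "b", "a", "c", "b", "a")));
  }
}
```

Storing the first index per key, rather than a count, is what lets one pass suffice: the first occurrence can be marked retroactively, and setting an already-set bit is harmless.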
10 changes: 5 additions & 5 deletions test/Base_Tests/src/Data/Vector_Spec.enso
@@ -858,11 +858,11 @@ type_spec suite_builder name alter = suite_builder.group name group_builder->
         alter [1, 1.0, 2, 2.0] . distinct . should_equal [1, 2]
         alter [] . distinct . should_equal []

-    group_builder.specify "should return a vector containing only duplicate elements" <|
-        alter [1, 3, 1, 2, 2, 1] . duplicates . should_equal [1, 2]
-        alter ["a", "a", "a"] . duplicates . should_equal ["a"]
-        alter ['ś', 's', 's\u0301'] . duplicates . should_equal ['s\u0301']
-        alter [1, 1.0, 2, 2.0] . duplicates . should_equal [1.0, 2.0]
+    group_builder.specify "should return a vector containing duplicate elements" <|
+        alter [1, 3, 1, 2, 2, 1] . duplicates . should_equal [1, 1, 2, 2, 1]
+        alter ["a", "a", "a"] . duplicates . should_equal ["a", "a", "a"]
+        alter ['ś', 's', 's\u0301'] . duplicates . should_equal ['ś', 's\u0301']
+        alter [1, 1.0, 2, 2.0] . duplicates . should_equal [1, 1.0, 2, 2.0]
         alter [] . duplicates . should_equal []

     group_builder.specify "should be able to handle distinct on different primitive values" <|
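
The third updated assertion expects `'ś'` and `'s\u0301'` to count as duplicates of each other while plain `'s'` does not. That is consistent with text being compared under Unicode canonical equivalence. A small Java sketch of such a comparison, assuming NFD normalization as the mechanism (the diff does not show how Enso implements text equality); `CanonicalEqualitySketch` is a hypothetical name.

```java
import java.text.Normalizer;

// Hypothetical illustration only; assumes canonical equivalence via NFD normalization.
final class CanonicalEqualitySketch {
  /** Compares two strings after decomposing both to normalization form NFD. */
  static boolean canonicallyEqual(String a, String b) {
    return Normalizer.normalize(a, Normalizer.Form.NFD)
        .equals(Normalizer.normalize(b, Normalizer.Form.NFD));
  }

  public static void main(String[] args) {
    // Precomposed U+015B versus 's' followed by combining acute U+0301: prints true.
    System.out.println(canonicallyEqual("\u015B", "s\u0301"));
    // Plain 's' versus the accented form: prints false.
    System.out.println(canonicallyEqual("s", "s\u0301"));
  }
}
```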
