Add Table.distinct function to In-Memory table (#3684)
Implements https://www.pivotaltracker.com/story/show/182307143

# Important Notes
- Changed the dependencies of the standard library Java helpers so that the `std-table` module depends on `std-base` as a _provided_ dependency. This is safe because `std-table` is used by the `Standard.Table` Enso module, which depends on `Standard.Base`, and loading `Standard.Base` puts `std-base` on the classpath. Whenever `Standard.Table` loads `std-table`, `std-base` is therefore loaded as well, so classes from `std-base` and its dependencies can be relied on as _provided_ on the classpath. This lets `std-table` reuse utilities such as `Text_Utils` instead of duplicating code. A further advantage is that ICU4J no longer has to be declared as a separate bundled dependency of `std-table`: it is supplied by `std-base`, so it is not included in the build packages twice.
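
The `Table.distinct` behaviour documented in this commit (the first row of each group of equal keys is kept, and the in-memory backend preserves input order) can be illustrated with a minimal Python sketch. This is not the Enso or Java implementation: the names `fold_key` and `distinct` are hypothetical, rows are modelled as plain dictionaries, and the locale-aware ICU folding selected by `TextFoldingStrategy` is approximated here with Unicode normalization plus the locale-independent `str.casefold`.

```python
import unicodedata


def fold_key(value, case_sensitive=True):
    """Return a comparison key for one cell (hypothetical helper).

    Text is Unicode-normalized so that canonically equivalent strings
    compare equal; with case_sensitive=False it is also case-folded.
    Assumption: this only approximates locale-aware ICU folding.
    """
    if not isinstance(value, str):
        return value  # non-text values are compared as-is
    normalized = unicodedata.normalize("NFD", value)
    if case_sensitive:
        return normalized
    # Case folding can denormalize text, so normalize once more.
    return unicodedata.normalize("NFD", normalized.casefold())


def distinct(rows, key_columns, case_sensitive=True):
    """Keep the first row for each distinct key, preserving input order."""
    seen = set()
    result = []
    for row in rows:
        key = tuple(fold_key(row[c], case_sensitive) for c in key_columns)
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result
```

With rows named `Alice`, `alice`, `Bob`, `Alice`, the case-sensitive call keeps the first three rows, while `case_sensitive=False` keeps only the first `Alice` row and the `Bob` row.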
radeusgd authored Sep 7, 2022
1 parent 9967dd3 commit 551100a
Showing 35 changed files with 620 additions and 919 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -190,6 +190,7 @@
- [Added various date part functions to `Date` and `Date_Time`.][3669]
- [Implemented `Table.take` and `Table.drop` for the in-memory backend.][3647]
- [Implemented specialized storage for the in-memory Table.][3673]
- [Implemented `Table.distinct` for the in-memory backend.][3684]

[debug-shortcuts]:
https://github.com/enso-org/enso/blob/develop/app/gui/docs/product/shortcuts.md#debug
@@ -303,6 +304,7 @@
[3669]: https://github.com/enso-org/enso/pull/3669
[3647]: https://github.com/enso-org/enso/pull/3647
[3673]: https://github.com/enso-org/enso/pull/3673
[3684]: https://github.com/enso-org/enso/pull/3684

#### Enso Compiler

3 changes: 2 additions & 1 deletion build.sbt
@@ -1767,7 +1767,7 @@ lazy val `std-table` = project
Compile / packageBin / artifactPath :=
`table-polyglot-root` / "std-table.jar",
libraryDependencies ++= Seq(
"com.ibm.icu" % "icu4j" % icuVersion,
"com.ibm.icu" % "icu4j" % icuVersion % "provided",
"com.univocity" % "univocity-parsers" % "2.9.1",
"org.apache.poi" % "poi-ooxml" % "5.2.2",
"org.apache.xmlbeans" % "xmlbeans" % "5.1.0",
@@ -1786,6 +1786,7 @@ lazy val `std-table` = project
result
}.value
)
.dependsOn(`std-base` % "provided")

lazy val `std-image` = project
.in(file("std-bits") / "image")
20 changes: 20 additions & 0 deletions distribution/lib/Standard/Base/0.0.0-dev/src/Data/Text/Case.enso
@@ -1,3 +1,7 @@
from Standard.Base import all

polyglot java import org.enso.base.text.TextFoldingStrategy

## Specifies the casing options for text conversion.
type Case
## All letters in lower case.
@@ -8,3 +12,19 @@ type Case

## First letter of each word in upper case, rest in lower case.
Title

## Represents case-insensitive comparison mode.

Arguments:
- locale: The locale used for the comparison.
type Case_Insensitive
Case_Insensitive_Data locale=Locale.default

## PRIVATE
Creates a Java `TextFoldingStrategy` from the case sensitivity setting.
folding_strategy : (True|Case_Insensitive) -> TextFoldingStrategy
folding_strategy case_sensitive = case case_sensitive of
True -> TextFoldingStrategy.unicodeNormalizedFold
Case_Insensitive_Data locale ->
TextFoldingStrategy.caseInsensitiveFold locale.java_locale

@@ -1,5 +1,6 @@
from Standard.Base import all

from Standard.Base.Data.Text.Case import Case_Insensitive, Case_Insensitive_Data
from Standard.Base.Error.Problem_Behavior import Report_Warning
from Standard.Base.Error.Common import Wrapped_Dataflow_Error_Data

@@ -13,13 +14,6 @@ No_Matches_Found.to_display_text self =
"The criteria "+self.criteria.to_text+" did not match any names in the input."


## Represents case-insensitive comparison mode.

Arguments:
- locale: The locale used for the comparison.
type Case_Insensitive
Case_Insensitive_Data locale=Locale.default

## Represents exact text matching mode.

Arguments:
4 changes: 3 additions & 1 deletion distribution/lib/Standard/Base/0.0.0-dev/src/Main.enso
@@ -19,6 +19,7 @@ import project.Data.Regression
import project.Data.Statistics
import project.Data.Statistics.Rank_Method
import project.Data.Text
import project.Data.Text.Case
import project.Data.Text.Encoding
import project.Data.Text.Extensions
import project.Data.Text.Matching
@@ -97,7 +98,8 @@ from project.Data.Range export all
https://www.pivotaltracker.com/story/show/181403340
https://www.pivotaltracker.com/story/show/181309938
from project.Data.Text.Extensions export Text, Line_Ending_Style, Case, Location, Matching_Mode
from project.Data.Text.Matching export Case_Insensitive_Data, Text_Matcher_Data, Regex_Matcher_Data, No_Matches_Found_Data
from project.Data.Text.Matching export Text_Matcher_Data, Regex_Matcher_Data, No_Matches_Found_Data
from project.Data.Text.Case export Case_Insensitive_Data, Text_Matcher_Data, Regex_Matcher_Data, No_Matches_Found_Data
from project.Data.Text export all hiding Encoding, Span, Text_Ordering
from project.Data.Text.Encoding export Encoding, Encoding_Error, Encoding_Error_Data
from project.Data.Text.Text_Ordering export all
34 changes: 33 additions & 1 deletion distribution/lib/Standard/Database/0.0.0-dev/src/Data/Table.enso
@@ -19,7 +19,7 @@ import Standard.Table.Internal.Aggregate_Column_Helper
from Standard.Database.Data.Column import Column, Aggregate_Column_Builder, Column_Data
from Standard.Database.Data.Internal.IR import Internal_Column, Internal_Column_Data
from Standard.Table.Errors import No_Such_Column_Error, No_Such_Column_Error_Data
from Standard.Table.Data.Column_Selector import Column_Selector, By_Index
from Standard.Table.Data.Column_Selector import Column_Selector, By_Index, By_Name
from Standard.Table.Data.Data_Formatter import Data_Formatter
from Standard.Database.Error import Unsupported_Database_Operation_Error_Data
import Standard.Table.Data.Column_Name_Mapping
@@ -547,6 +547,38 @@ type Table
new_ctx = self.context.add_orders new_order_descriptors
self.updated_context new_ctx

## Returns the distinct set of rows within the specified columns from the
input table.

When multiple rows have the same values within the specified columns, the
first row of each such set is returned.

For the in-memory table, the unique rows will be in the order they
occurred in the input (this is not guaranteed for database operations).

Arguments:
- columns: The columns of the table to use for distinguishing the rows.
- case_sensitive: Specifies if the text values should be compared case
sensitively.
- on_problems: Specifies how to handle if a problem occurs, raising as a
warning by default.

The following problems can occur:
- If a column in columns is not in the input table, a
`Missing_Input_Columns`.
- If duplicate columns, names or indices are provided, a
`Duplicate_Column_Selectors`.
- If a column index is out of range, a `Column_Indexes_Out_Of_Range`.
- If two distinct indices refer to the same column, an
`Input_Indices_Already_Matched`.
- If no valid columns are selected, a `No_Input_Columns_Selected`.
- If floating point values are present in the distinct columns, a
`Floating_Point_Grouping` warning.
distinct : Column_Selector -> (True|Case_Insensitive) -> Problem_Behavior -> Table
distinct self (columns = By_Name (self.columns.map .name)) case_sensitive=True on_problems=Report_Warning =
_ = [columns, case_sensitive, on_problems]
Error.throw (Unsupported_Database_Operation_Error_Data "`Table.distinct` is not yet implemented for the database backend.")

## UNSTABLE

Efficiently joins two tables based on either the index or a key column.
5 changes: 0 additions & 5 deletions distribution/lib/Standard/Table/0.0.0-dev/THIRD-PARTY/NOTICE
@@ -6,11 +6,6 @@ The license file can be found at `licenses/BSD-3-Clause`.
Copyright notices related to this dependency can be found in the directory `com.github.virtuald.curvesapi-1.07`.


'icu4j', licensed under the Unicode/ICU License, is distributed with the Table.
The license information can be found along with the copyright notices.
Copyright notices related to this dependency can be found in the directory `com.ibm.icu.icu4j-71.1`.


'univocity-parsers', licensed under the Apache 2, is distributed with the Table.
The license file can be found at `licenses/APACHE2.0`.
Copyright notices related to this dependency can be found in the directory `com.univocity.univocity-parsers-2.9.1`.