Add split and tokenize to the Table. #6233

Merged
merged 73 commits into from
Apr 14, 2023
73 commits
a4f4efb
tests
GregoryTravis Apr 5, 2023
c63861b
stubs and split to col
GregoryTravis Apr 5, 2023
6ef0e20
links
GregoryTravis Apr 6, 2023
c1eb857
1 test
GregoryTravis Apr 6, 2023
69d3fc2
tables_should_be_equal
GregoryTravis Apr 6, 2023
ad92d85
cleanup
GregoryTravis Apr 6, 2023
f02648f
comment
GregoryTravis Apr 6, 2023
e0a6821
replace_column_with_columns
GregoryTravis Apr 6, 2023
f6371ab
test 2
GregoryTravis Apr 6, 2023
351539f
to_cols
GregoryTravis Apr 6, 2023
f6bc31c
transform_table_column_to_columns
GregoryTravis Apr 6, 2023
d9b79df
fin
GregoryTravis Apr 6, 2023
7663d27
split_to_rows
GregoryTravis Apr 6, 2023
345ac27
tok rows
GregoryTravis Apr 6, 2023
a97f2c0
_. form
GregoryTravis Apr 6, 2023
96ddaaa
rearrange
GregoryTravis Apr 6, 2023
a8522a1
comments
GregoryTravis Apr 6, 2023
95f3adc
rename
GregoryTravis Apr 6, 2023
7f6af1d
must be text
GregoryTravis Apr 7, 2023
3c00b5a
problems
GregoryTravis Apr 7, 2023
90b27e5
unique names
GregoryTravis Apr 10, 2023
3d2a0af
unimplementeds
GregoryTravis Apr 10, 2023
321d245
map_column_vector_to_multiple
GregoryTravis Apr 10, 2023
3c47777
problems, better
GregoryTravis Apr 10, 2023
b439e88
cleanup
GregoryTravis Apr 10, 2023
896d47a
tmp
GregoryTravis Apr 10, 2023
7ff4af3
merge
GregoryTravis Apr 10, 2023
e09229f
Merge branch 'develop' into wip/gmt/5125-split-tok
GregoryTravis Apr 11, 2023
2a3b42b
expect_text throws
GregoryTravis Apr 11, 2023
a838918
tests for cols
GregoryTravis Apr 11, 2023
9e7356a
transpose with less copying
GregoryTravis Apr 11, 2023
e48e901
column order
GregoryTravis Apr 11, 2023
6fcb6ca
reverse jagged
GregoryTravis Apr 11, 2023
2d747ed
remove old code
GregoryTravis Apr 11, 2023
e5426a5
Column_Count_Exceeded members private
GregoryTravis Apr 11, 2023
07ae867
changelog
GregoryTravis Apr 11, 2023
17fe3c1
mixed column
GregoryTravis Apr 11, 2023
21c484c
comment
GregoryTravis Apr 11, 2023
f1bfb5f
merge
GregoryTravis Apr 11, 2023
ee80885
unused import
GregoryTravis Apr 11, 2023
d23f163
restored accidentally-deleted import
GregoryTravis Apr 11, 2023
9539448
unused import
GregoryTravis Apr 12, 2023
4f754b9
fan_out_to_rows use storage
GregoryTravis Apr 12, 2023
f32ebb5
repeat
GregoryTravis Apr 12, 2023
414c713
naming
GregoryTravis Apr 12, 2023
7acdd00
tok case-sens
GregoryTravis Apr 12, 2023
e372fa0
Merge branch 'develop' into wip/gmt/5125-split-tok
GregoryTravis Apr 12, 2023
1362a63
Merge branch 'develop' into wip/gmt/5125-split-tok
GregoryTravis Apr 13, 2023
f06db78
Update distribution/lib/Standard/Database/0.0.0-dev/src/Data/Table.enso
GregoryTravis Apr 13, 2023
1e15e9f
review
GregoryTravis Apr 13, 2023
b6db6d8
Auto to Nothing
GregoryTravis Apr 13, 2023
b57666c
review
GregoryTravis Apr 13, 2023
5c0d66b
review
GregoryTravis Apr 13, 2023
4f66416
max
GregoryTravis Apr 13, 2023
ab8318c
should_equal_verbose
GregoryTravis Apr 13, 2023
851709b
double quotes
GregoryTravis Apr 13, 2023
4e0abe5
check empty column
GregoryTravis Apr 13, 2023
5540c34
group names
GregoryTravis Apr 13, 2023
0d00b56
tok name conflict test
GregoryTravis Apr 13, 2023
aee7c27
unused on_problems
GregoryTravis Apr 13, 2023
94057a1
comment
GregoryTravis Apr 13, 2023
4e75f7c
each
GregoryTravis Apr 13, 2023
e1cbf16
should_equal_verbose text
GregoryTravis Apr 13, 2023
ba3dba9
wip
GregoryTravis Apr 13, 2023
4701650
tokenize spec with no matches
GregoryTravis Apr 13, 2023
2ab6777
tests for nothing
GregoryTravis Apr 13, 2023
de16f7c
expect_text
GregoryTravis Apr 14, 2023
451714e
docs consistent
GregoryTravis Apr 14, 2023
b5740d5
Merge branch 'develop' into wip/gmt/5125-split-tok
GregoryTravis Apr 14, 2023
85c2518
Merge branch 'develop' into wip/gmt/5125-split-tok
mergify[bot] Apr 14, 2023
a00664e
Merge branch 'develop' into wip/gmt/5125-split-tok
GregoryTravis Apr 14, 2023
150149b
removed param
GregoryTravis Apr 14, 2023
141365f
Merge branch 'wip/gmt/5125-split-tok' of github.com:enso-org/enso int…
GregoryTravis Apr 14, 2023
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -381,6 +381,7 @@
methods.][6176]
- [Implemented `Table.union` for the Database backend.][6204]
- [Array & Vector have the same methods & behavior][6218]
- [Implemented `Table.split` and `Table.tokenize` for in-memory tables.][6233]

[debug-shortcuts]:
https://github.com/enso-org/enso/blob/develop/app/gui/docs/product/shortcuts.md#debug
@@ -578,6 +579,7 @@
[6204]: https://github.com/enso-org/enso/pull/6204
[6077]: https://github.com/enso-org/enso/pull/6077
[6218]: https://github.com/enso-org/enso/pull/6218
[6233]: https://github.com/enso-org/enso/pull/6233

#### Enso Compiler

@@ -1392,6 +1392,80 @@ type Table
msg = "Parsing values is not supported in database tables, the table has to be materialized first with `read`."
Error.throw (Unsupported_Database_Operation.Error msg)

## Splits a column of text into a set of new columns.
The original column will be removed from the table.
The new columns will be named with the name of the input column with an
incrementing number after.

Arguments:
- column: The name or index of the column to split the text of.
- delimiter: The term or terms used to split the text.
- column_count: The number of columns to split to.
If `Nothing` then columns will be added to fit all data.
- on_problems: Specifies the behavior when a problem occurs.

! Error Conditions
If the data exceeds the `column_count`, a `Column_Count_Exceeded` will
be reported according to the `on_problems` behavior.
split_to_columns : Text | Integer -> Text -> Integer | Nothing -> Problem_Behavior -> Table
split_to_columns self column delimiter="," column_count=Nothing on_problems=Report_Error =
_ = [column delimiter column_count on_problems]
Error.throw (Unsupported_Database_Operation.Error "Table.split_to_columns is not implemented yet for the Database backends.")

## Splits a column of text into a set of new rows.
The values of other columns are repeated for the new rows.

Arguments:
- column: The name or index of the column to split the text of.
- delimiter: The term or terms used to split the text.
split_to_rows : Text | Integer -> Text -> Table
split_to_rows self column delimiter="," =
_ = [column delimiter]
Error.throw (Unsupported_Database_Operation.Error "Table.split_to_rows is not implemented yet for the Database backends.")

## Tokenizes a column of text into a set of new columns using a regular
expression.
If the pattern contains marked groups, the values are concatenated
together; otherwise the whole match is returned.
The original column will be removed from the table.
The new columns will be named with the name of the input column with an
incrementing number after.

Arguments:
- column: The name or index of the column to tokenize the text of.
- pattern: The regular expression pattern to search for within the text.
- case_sensitivity: Specifies if the text values should be compared case
sensitively.
- column_count: The number of columns to split to.
If `Nothing` then columns will be added to fit all data.
- on_problems: Specifies the behavior when a problem occurs.

! Error Conditions
If the data exceeds the `column_count`, a `Column_Count_Exceeded` will
be reported according to the `on_problems` behavior.
tokenize_to_columns : Text | Integer -> Text -> Case_Sensitivity -> Integer | Nothing -> Problem_Behavior -> Table
tokenize_to_columns self column pattern="." case_sensitivity=Case_Sensitivity.Sensitive column_count=Nothing on_problems=Report_Error =
_ = [column pattern case_sensitivity column_count on_problems]
Error.throw (Unsupported_Database_Operation.Error "Table.tokenize_to_columns is not implemented yet for the Database backends.")

## Tokenizes a column of text into a set of new rows using a regular
expression.
If the pattern contains marked groups, the values are concatenated
together; otherwise the whole match is returned.
The values of other columns are repeated for the new rows.

Arguments:
- column: The name or index of the column to tokenize the text of.
- pattern: The regular expression pattern to search for within the text.
- case_sensitivity: Specifies if the text values should be compared case
sensitively.
tokenize_to_rows : Text | Integer -> Text -> Case_Sensitivity -> Table
tokenize_to_rows self column pattern="." case_sensitivity=Case_Sensitivity.Sensitive =
_ = [column pattern case_sensitivity]
Error.throw (Unsupported_Database_Operation.Error "Table.tokenize_to_rows is not implemented yet for the Database backends.")

## PRIVATE
UNSTABLE
Cast the selected columns to a specific type.
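
To make the documented `split_to_columns` behaviour concrete, here is a minimal Python model of it on plain lists. It is not the Enso implementation in `Split_Tokenize`, only a sketch of the doc comment above. The `<name>_<i>` naming with a zero-based counter, the `None` padding for short rows, and keeping the new columns at the original column's position are assumptions made for illustration; the doc comment only says the names carry an incrementing number.

```python
from typing import Optional


def split_to_columns(table: dict[str, list], column: str, delimiter: str = ",",
                     column_count: Optional[int] = None) -> dict[str, list]:
    """Model of the documented behaviour on a dict of equal-length lists."""
    parts = [value.split(delimiter) for value in table[column]]
    widest = max((len(p) for p in parts), default=0)
    # column_count=None means "add as many columns as the data needs".
    width = widest if column_count is None else column_count
    if widest > width:
        # Stand-in for Column_Count_Exceeded under the default Report_Error.
        raise ValueError(f"limit {width}, new columns {widest}")
    result = {}
    for name, values in table.items():
        if name != column:
            result[name] = values
            continue
        # The original column is removed and replaced by the numbered columns.
        for i in range(width):
            # ASSUMPTION: "<name>_<i>" naming from 0; short rows padded with None.
            result[f"{name}_{i}"] = [p[i] if i < len(p) else None for p in parts]
    return result


table = {"foo": [0, 1, 2], "bar": ["a,b", "c", "d,e,f"]}
result = split_to_columns(table, "bar")
```

With `column_count` left as `None`, the widest row decides how many columns are created, which is why `bar` fans out into three columns here.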
71 changes: 71 additions & 0 deletions distribution/lib/Standard/Table/0.0.0-dev/src/Data/Table.enso
@@ -29,6 +29,7 @@ import project.Internal.Join_Helpers
import project.Internal.Naming_Helpers.Naming_Helpers
import project.Internal.Parse_Values_Helper
import project.Internal.Problem_Builder.Problem_Builder
import project.Internal.Split_Tokenize
import project.Internal.Table_Helpers
import project.Internal.Table_Helpers.Table_Column_Helper
import project.Internal.Unique_Name_Strategy.Unique_Name_Strategy
@@ -918,6 +919,76 @@ type Table
result = Table.new new_columns
problem_builder.attach_problems_after on_problems result

## Splits a column of text into a set of new columns.
The original column will be removed from the table.
The new columns will be named with the name of the input column with an
incrementing number after.

Arguments:
- column: The name or index of the column to split the text of.
- delimiter: The term or terms used to split the text.
- column_count: The number of columns to split to.
If `Nothing` then columns will be added to fit all data.
- on_problems: Specifies the behavior when a problem occurs.

! Error Conditions
If the data exceeds the `column_count`, a `Column_Count_Exceeded` will
be reported according to the `on_problems` behavior.
split_to_columns : Text | Integer -> Text -> Integer | Nothing -> Problem_Behavior -> Table
split_to_columns self column delimiter="," column_count=Nothing on_problems=Report_Error =
Split_Tokenize.split_to_columns self column delimiter column_count on_problems

## Splits a column of text into a set of new rows.
The values of other columns are repeated for the new rows.

Arguments:
- column: The name or index of the column to split the text of.
- delimiter: The term or terms used to split the text.
split_to_rows : Text | Integer -> Text -> Table
split_to_rows self column delimiter="," =
Split_Tokenize.split_to_rows self column delimiter

## Tokenizes a column of text into a set of new columns using a regular
expression.
If the pattern contains marked groups, the values are concatenated
together; otherwise the whole match is returned.
The original column will be removed from the table.
The new columns will be named with the name of the input column with an
incrementing number after.

Arguments:
- column: The name or index of the column to tokenize the text of.
- pattern: The regular expression pattern to search for within the text.
- case_sensitivity: Specifies if the text values should be compared case
sensitively.
- column_count: The number of columns to split to.
If `Nothing` then columns will be added to fit all data.
- on_problems: Specifies the behavior when a problem occurs.

! Error Conditions
If the data exceeds the `column_count`, a `Column_Count_Exceeded` will
be reported according to the `on_problems` behavior.
tokenize_to_columns : Text | Integer -> Text -> Case_Sensitivity -> Integer | Nothing -> Problem_Behavior -> Table
tokenize_to_columns self column pattern="." case_sensitivity=Case_Sensitivity.Sensitive column_count=Nothing on_problems=Report_Error =
Split_Tokenize.tokenize_to_columns self column pattern case_sensitivity column_count on_problems

## Tokenizes a column of text into a set of new rows using a regular
expression.
If the pattern contains marked groups, the values are concatenated
together; otherwise the whole match is returned.
The values of other columns are repeated for the new rows.

Arguments:
- column: The name or index of the column to tokenize the text of.
- pattern: The regular expression pattern to search for within the text.
- case_sensitivity: Specifies if the text values should be compared case
sensitively.
tokenize_to_rows : Text | Integer -> Text -> Case_Sensitivity -> Table
tokenize_to_rows self column pattern="." case_sensitivity=Case_Sensitivity.Sensitive =
Split_Tokenize.tokenize_to_rows self column pattern case_sensitivity

## ALIAS Filter Rows

Selects only the rows of this table that correspond to `True` values of
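
The tokenize docs above state two rules: marked groups are concatenated (otherwise the whole match is used), and in the `_to_rows` variants the values of the other columns are repeated. The Python sketch below models both on plain lists. It is an illustration, not the code in `Split_Tokenize`. Python's `re` dialect stands in for Enso's regex engine, the function names are hypothetical, and dropping rows that produce no tokens is an assumption that the docs do not confirm.

```python
import re


def tokenize(text: str, pattern: str, case_sensitive: bool = True) -> list[str]:
    """Return one token per regex match in `text`."""
    flags = 0 if case_sensitive else re.IGNORECASE
    tokens = []
    for match in re.finditer(pattern, text, flags):
        groups = match.groups()
        if groups:
            # Marked groups: concatenate them (unmatched groups count as "").
            tokens.append("".join(g or "" for g in groups))
        else:
            # No groups: the whole match is the token.
            tokens.append(match.group(0))
    return tokens


def tokenize_to_rows(rows: list[list], column_index: int, pattern: str,
                     case_sensitive: bool = True) -> list[list]:
    """Emit one output row per token, repeating the other column values."""
    out = []
    for row in rows:
        # ASSUMPTION: a row with no matches contributes no output rows.
        for token in tokenize(row[column_index], pattern, case_sensitive):
            new_row = list(row)
            new_row[column_index] = token
            out.append(new_row)
    return out


rows = [[0, "a1b2"], [1, "c3"]]
expanded = tokenize_to_rows(rows, 1, r"\d")
```

`split_to_rows` follows the same fan-out shape, with `text.split(delimiter)` in place of `tokenize`.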
13 changes: 13 additions & 0 deletions distribution/lib/Standard/Table/0.0.0-dev/src/Errors.enso
@@ -552,3 +552,16 @@ type Invalid_Value_For_Type
to_display_text : Text
to_display_text self =
"The value ["+self.value.to_text+"] is not valid for the column type ["+self.value_type.to_text+"]."

type Column_Count_Exceeded
## PRIVATE
Indicates that an operation generating new columns produced more columns
than allowed by the limit.
Error (limit : Integer) (column_count : Integer)

## PRIVATE

Create a human-readable version of the error.
to_display_text : Text
to_display_text self =
"The operation produced more columns than the specified limit. The limit is "+self.limit.to_text+" and the number of new columns was "+self.column_count.to_text+". The limit may be turned off by setting the `column_count` option to `Nothing`."
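
As a rough picture of how `Column_Count_Exceeded` interacts with `on_problems`, the following Python sketch models the three usual problem behaviours (ignore, warn, error). It is not Enso code. The `apply_limit` helper and the string values for `on_problems` are hypothetical, and truncating the extra values when the problem is not raised as an error is an assumption, since the diff above does not show what happens to them.

```python
import warnings


class ColumnCountExceeded(Exception):
    """Python stand-in for the Enso `Column_Count_Exceeded.Error` value."""

    def __init__(self, limit: int, column_count: int):
        super().__init__(limit, column_count)
        self.limit = limit
        self.column_count = column_count

    def to_display_text(self) -> str:
        return ("The operation produced more columns than the specified limit. "
                f"The limit is {self.limit} and the number of new columns was "
                f"{self.column_count}.")


def apply_limit(parts: list[list], limit, on_problems: str = "report_error") -> list[list]:
    """Check per-row pieces against `limit` (None means no limit)."""
    widest = max((len(p) for p in parts), default=0)
    if limit is None or widest <= limit:
        return parts
    problem = ColumnCountExceeded(limit, widest)
    if on_problems == "report_error":
        raise problem
    if on_problems == "report_warning":
        warnings.warn(problem.to_display_text())
    # ASSUMPTION: when not raising, values beyond the limit are dropped.
    return [p[:limit] for p in parts]


kept = apply_limit([["a", "b", "c"], ["d"]], 2, on_problems="ignore")
```

Passing `limit=None` skips the check entirely, which mirrors the documented meaning of `column_count=Nothing`.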