
Add split and tokenize to the Table. #5125

Closed
wdanilo opened this issue Feb 5, 2023 · 10 comments · Fixed by #6233
Labels: -libs (Libraries: New libraries to be implemented), l-regex, p-low (Low priority), x-new-feature (Type: new feature request)

Comments

wdanilo (Member) commented Feb 5, 2023

This task was automatically imported from the old Task Issue Board; it was originally created by James Dunkerley.
The original issue is here.


  • Table variants of Text.split and Text.tokenize.
  • Split to new rows or to new columns.
  • In-memory only for v1; the Database backend should report the operation as unsupported.
## Splits a column of text into a set of new columns.
   The original column will be removed from the table.
   The new columns will be named with the name of the input column followed by an incrementing number.

   Arguments:
   - column: The column to split the text of.
   - delimiter: The term or terms used to split the text.
   - column_count: The number of columns to split to. 
     If `Auto` then columns will be added to fit all data.
     If the data exceeds the number of columns, a `Column_Count_Exceeded` error 
     will follow the `on_problems` behavior.
   - on_problems: Specifies the behavior when a problem occurs.
Table.split_to_columns : Text | Integer -> Text -> Auto | Integer -> Problem_Behavior -> Table
Table.split_to_columns self column delimiter="," column_count=Auto on_problems=Report_Error = ...
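
For illustration, a minimal usage sketch of the proposed method on the in-memory backend; the Table.new construction and the example data are invented for this sketch:

from Standard.Table import all

example_split_to_columns =
    table = Table.new [["tags", ["a,b", "c,d,e"]], ["id", [1, 2]]]
    # Replaces the "tags" column with new columns named after it, each holding
    # one delimited part of the original text, per the doc comment above.
    table.split_to_columns "tags" delimiter=","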

## Splits a column of text into a set of new rows.
   The values of other columns are repeated for the new rows.

   Arguments:
   - column: The column to split the text of.
   - delimiter: The term or terms used to split the text.
   - on_problems: Specifies the behavior when a problem occurs.
Table.split_to_rows : Text | Integer -> Text -> Problem_Behavior -> Table
Table.split_to_rows self column delimiter="," on_problems=Report_Error = ...
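
A similar sketch for the row-wise variant (again with invented example data):

from Standard.Table import all

example_split_to_rows =
    table = Table.new [["tags", ["a,b", "c"]], ["id", [1, 2]]]
    # Each delimited part becomes its own row; the "id" value is repeated
    # for every row produced from the original row.
    table.split_to_rows "tags" delimiter=","
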
## Splits a column of text into a set of new columns using a regular expression.
   If the pattern contains marked groups, the group values are concatenated
   together; otherwise the whole match is returned.
   The original column will be removed from the table.
   The new columns will be named with the name of the input column followed by an incrementing number.

   Arguments:
   - column: The column to tokenize the text of.
   - pattern: The regular expression pattern to search for within the text.
   - case_sensitivity: Specifies if the text values should be compared case
     sensitively.
   - column_count: The number of columns to split to. 
     If `Auto` then columns will be added to fit all data.
     If the data exceeds the number of columns, a `Column_Count_Exceeded` error 
     will follow the `on_problems` behavior.
   - on_problems: Specifies the behavior when a problem occurs.
Table.tokenize_to_columns : Text | Integer -> Text -> Case_Sensitivity -> Auto | Integer -> Problem_Behavior -> Table
Table.tokenize_to_columns self column pattern="." case_sensitivity=Case_Sensitivity.Sensitive column_count=Auto on_problems=Report_Error = ...
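
A usage sketch for the column-wise tokenize variant; the pattern and example data are invented for illustration:

from Standard.Table import all

example_tokenize_to_columns =
    table = Table.new [["codes", ["a1 b2", "c3"]], ["id", [1, 2]]]
    # Each match of the pattern becomes a new column named after "codes"
    # with an incrementing number.
    table.tokenize_to_columns "codes" pattern="[a-z]\d"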

## Takes a regular expression pattern and returns all the matches as new rows.
   If the pattern contains marked groups, the group values are concatenated
   together; otherwise the whole match is returned.
   The values of other columns are repeated for the new rows.

   Arguments:
   - column: The column to tokenize the text of.
   - pattern: The regular expression pattern to search for within the text.
   - case_sensitivity: Specifies if the text values should be compared case
     sensitively.
   - on_problems: Specifies the behavior when a problem occurs.
Table.tokenize_to_rows : Text | Integer -> Text -> Case_Sensitivity -> Problem_Behavior -> Table
Table.tokenize_to_rows self column pattern="." case_sensitivity=Case_Sensitivity.Sensitive on_problems=Report_Error = ...
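
And a sketch for the row-wise tokenize variant, with invented example data:

from Standard.Table import all

example_tokenize_to_rows =
    table = Table.new [["codes", ["a1 b2", "c3"]], ["id", [1, 2]]]
    # Each match of the pattern becomes its own row; the other column values
    # are repeated for the rows produced from the original row.
    table.tokenize_to_rows "codes" pattern="[a-z]\d"
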
wdanilo added this to the Beta Release milestone Feb 6, 2023
jdunkerley moved this to ❓New in Issues Board Feb 7, 2023
jdunkerley moved this from ❓New to 📤 Backlog in Issues Board Feb 14, 2023
github-project-automation bot moved this from 📤 Backlog to 🟢 Accepted in Issues Board Apr 4, 2023
GregoryTravis reopened this Apr 4, 2023
github-project-automation bot moved this from 🟢 Accepted to ❓New in Issues Board Apr 4, 2023
GregoryTravis moved this from ❓New to 🔧 Implementation in Issues Board Apr 4, 2023
GregoryTravis moved this from 🔧 Implementation to 📤 Backlog in Issues Board Apr 4, 2023
GregoryTravis moved this from 📤 Backlog to 🔧 Implementation in Issues Board Apr 5, 2023
enso-bot commented Apr 5, 2023

Greg Travis reports a new STANDUP for today (2023-04-05):

Progress: tests, research, started implementation of #5125. It should be finished by 2023-04-10.

Next Day: split to rows

enso-bot commented Apr 6, 2023

Greg Travis reports a new STANDUP for today (2023-04-06):

Progress: implemented and factored Table.split and .tokenize (basic functionality only) and tests. It should be finished by 2023-04-10.

Next Day: column max

enso-bot commented Apr 7, 2023

Greg Travis reports a new STANDUP for today (2023-04-07):

Progress: problem handling for split and tokenize. It should be finished by 2023-04-10.

Next Day: more

enso-bot commented Apr 10, 2023

Greg Travis reports a new STANDUP for today (2023-04-10):

Progress: error handling and tests. It should be finished by 2023-04-10.

Next Day: more

jdunkerley linked a pull request Apr 11, 2023 that will close this issue
enso-bot commented Apr 11, 2023

Greg Travis reports a new 🔴 DELAY for today (2023-04-11):

Summary: There is a 1-day delay in the implementation of the Add split and tokenize to the Table (#5125) task.
It will cause 0 days of delay for the delivery of this weekly plan.

Delay Cause: optimize *_to_rows

enso-bot commented Apr 11, 2023

Greg Travis reports a new STANDUP for today (2023-04-11):

Progress: finish features, review, more tests, optimize *_to_rows. It should be finished by 2023-04-11.

Next Day: optimize *_to_columns

@GregoryTravis GregoryTravis moved this from 🔧 Implementation to 👁️ Code review in Issues Board Apr 12, 2023
enso-bot commented Apr 12, 2023

Greg Travis reports a new 🔴 DELAY for today (2023-04-12):

Summary: There is a 1-day delay in the implementation of the Add split and tokenize to the Table (#5125) task.
It will cause 0 days of delay for the delivery of this weekly plan.

Delay Cause: optimize *_to_cols

enso-bot commented Apr 12, 2023

Greg Travis reports a new STANDUP for today (2023-04-12):

Progress: finish features, review, more tests, optimize *_to_rows. It should be finished by 2023-04-12.

Next Day: review, 5126

enso-bot commented Apr 13, 2023

Greg Travis reports a new 🔴 DELAY for today (2023-04-13):

Summary: There is a 1-day delay in the implementation of the Add split and tokenize to the Table (#5125) task.
It will cause 0 days of delay for the delivery of this weekly plan.

Delay Cause: review

enso-bot commented Apr 13, 2023

Greg Travis reports a new STANDUP for today (2023-04-13):

Progress: split/tokenize review; getting a head start on text_to_table. It should be finished by 2023-04-13.

Next Day: review, 5126

mergify bot closed this as completed in #6233 Apr 14, 2023
github-project-automation bot moved this from 👁️ Code review to 🟢 Accepted in Issues Board Apr 14, 2023