Data analysts should be able to use `Text.split`, `Text.lines` and `Text.words` to break up strings (#3415)

Implements https://www.pivotaltracker.com/story/show/181266184

### Important Notes

Changed the example image download to proceed only if the file does not already exist, which cuts build time. The build used to download it _every_ time, which failed the build completely whenever the network was down. A redownload can be forced by performing a fresh repository checkout.
enso-org · Apr 26, 2022 · 14257d0
1 parent 69b5e2a
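
The download guard described in the notes is not part of the hunks shown on this page, which only show `downloadFileToLocation` and `getStdlibDataFiles` being removed from `DistributionPackage.scala`. As a minimal sketch of the idea, assuming a hypothetical helper named `downloadIfAbsent` (the name, the `ExampleData` object and the use of `java.nio` are illustrative, not taken from the commit), a download that is skipped when the target already exists could look like this:

```scala
import java.io.File
import java.net.URI
import java.nio.file.{Files, StandardCopyOption}

// Hypothetical helper, not the code from the commit.
object ExampleData {

  /** Downloads `address` to `location` only when `location` does not exist.
    * Returns true if a download was performed and false if it was skipped.
    */
  def downloadIfAbsent(address: String, location: File): Boolean =
    if (location.exists()) {
      // The file is already present, so no network access is attempted.
      false
    } else {
      // Create the parent directory if needed, then stream the content.
      Option(location.getParentFile).foreach(_.mkdirs())
      val in = URI.create(address).toURL.openStream()
      try Files.copy(in, location.toPath, StandardCopyOption.REPLACE_EXISTING)
      finally in.close()
      true
    }
}
```

With a guard of this shape, deleting the file (or checking out the repository afresh) is what triggers the next download, which matches the behaviour described in the notes.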
Showing 11 changed files with 190 additions and 178 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -113,6 +113,7 @@
   encoding to `File.read_text`. New `File.read` API.][3390]
 - [Improved the `Range` type. Added a `down_to` counterpart to `up_to` and
   `with_step` allowing to change the range step.][3408]
+- [Aligned `Text.split` API with other methods and added `Text.lines`.][3415]
 
 [debug-shortcuts]:
   https://github.com/enso-org/enso/blob/develop/app/gui/docs/product/shortcuts.md#debug
@@ -173,6 +174,7 @@
 [3393]: https://github.com/enso-org/enso/pull/3393
 [3390]: https://github.com/enso-org/enso/pull/3390
 [3408]: https://github.com/enso-org/enso/pull/3408
+[3415]: https://github.com/enso-org/enso/pull/3415
 
 #### Enso Compiler
 

diff --git a/distribution/lib/Standard/Base/0.0.0-dev/src/Data/Text/Extensions.enso b/distribution/lib/Standard/Base/0.0.0-dev/src/Data/Text/Extensions.enso
@@ -10,7 +10,6 @@ import Standard.Base.Data.Text.Case
 import Standard.Base.Data.Text.Location
 import Standard.Base.Data.Text.Line_Ending_Style
 from Standard.Base.Data.Text.Span as Span_Module import Span
-import Standard.Base.Data.Text.Split_Kind
 import Standard.Base.Data.Text.Text_Sub_Range
 from Standard.Base.Data.Text.Encoding as Encoding_Module import Encoding, Encoding_Error
 from Standard.Base.Error.Problem_Behavior as Problem_Behavior_Module import Problem_Behavior, Report_Warning
@@ -22,7 +21,6 @@ from Standard.Builtins export Text
 export Standard.Base.Data.Text.Matching_Mode
 export Standard.Base.Data.Text.Case
 export Standard.Base.Data.Text.Location
-export Standard.Base.Data.Text.Split_Kind
 export Standard.Base.Data.Text.Line_Ending_Style
 
 polyglot java import com.ibm.icu.lang.UCharacter
@@ -348,82 +346,49 @@ Text.find pattern mode=Mode.All match_ascii=Nothing case_insensitive=Nothing dot
 
 ## ALIAS Split Text
 
-   Takes a separator and returns the vector that results from splitting `this`
-   on the configured number of occurrences of `separator`.
+   Takes a delimiter and returns the vector that results from splitting `this`
+   on each of its occurrences.
 
    Arguments:
-   - separator: The pattern used to split the text.
-   - mode: This argument specifies how many matches the engine will try and
-     find. When mode is set to either `Mode.First` or `Mode.Full`, this method
-     will return either a single `Text` or `Nothing`. If set to an `Integer` or
-     `Mode.All`, this method will return either a `Vector Text` or `Nothing`.
-   - match_ascii: Enables or disables pure-ASCII matching for the regex. If you
-     know your data only contains ASCII then you can enable this for a
-     performance boost on some regex engines.
-   - case_insensitive: Enables or disables case-insensitive matching. Case
-     insensitive matching behaves as if it normalises the case of all input
-     text before matching on it.
-   - dot_matches_newline: Enables or disables the dot matches newline option.
-     This specifies that the `.` special character should match everything
-     _including_ newline characters. Without this flag, it will match all
-     characters _except_ newlines.
-   - multiline: Enables or disables the multiline option. Multiline specifies
-     that the `^` and `$` pattern characters match the start and end of lines,
-     as well as the start and end of the input respectively.
-   - verbose: Enables or disables the verbose mode for the regular expression.
-     In verbose mode, the following changes apply:
-     - Whitespace within the pattern is ignored, except when within a
-       character class or when preceeded by an unescaped backslash, or within
-       grouping constructs (e.g. `(?...)`).
-     - When a line contains a `#`, that is not in a character class and is not
-       preceeded by an unescaped backslash, all characters from the leftmost
-       such `#` to the end of the line are ignored. That is to say, they act
-       as _comments_ in the regex.
-   - extra_opts: Specifies additional options in a vector. This allows options
-     to be supplied and computed without having to break them out into arguments
-     to the function. Where these overlap with one of the flags (`match_ascii`,
-     `case_insensitive`, `dot_matches_newline`, `multiline` and `verbose`), the
-     flags take precedence.
-
-   ! Boolean Flags and Extra Options
-     This function contains a number of arguments that are boolean flags that
-     enable or disable common options for the regex. At the same time, it also
-     provides the ability to specify options in the `extra_opts` argument.
-
-     Where one of the flags is _set_ (has the value `True` or `False`), the
-     value of the flag takes precedence over the value in `extra_opts` when
-     merging the options to the engine. The flags are _unset_ (have value
-     `Nothing`) by default.
+   - delimiter: The pattern used to split the text.
+   - matcher: If a `Text_Matcher`, the text is compared using case-sensitivity
+     rules specified in the matcher. If a `Regex_Matcher`, the term is used as a
+     regular expression and matched using the associated options.
 
    > Example
-     Split the comma-separated text into a vector of items.
+     Split the text on any occurrence of the separator `"::"`.
 
-         "ham,eggs,cheese,tomatoes".split ","
+         example_split =
+             text = "Namespace::package::package::Type"
+             text.split "::" == ["Namespace", "package", "package", "Type"]
 
    > Example
-     Split the text on whitespace into a vector of items.
+     Split the text on a regex pattern.
 
-         "ham eggs cheese tomatoes".split Split_Kind.Whitespace
+         "abc--def==>ghi".split "[-=>]+" Regex_Matcher == ["abc", "def", "ghi"]
 
    > Example
-     Split the text on any occurrence of the separator `"::"`.
+     Split the text on any whitespace.
 
-         example_split =
-             text = "Namespace::package::package::Type"
-             text.split ":::"
-Text.split : Split_Kind -> Mode.Mode -> Boolean | Nothing -> Boolean | Nothing -> Boolean | Nothing -> Boolean | Nothing -> Boolean | Nothing -> Vector.Vector Option.Option -> Vector.Vector Text
-Text.split separator=Split_Kind.Whitespace mode=Mode.All match_ascii=Nothing case_insensitive=Nothing dot_matches_newline=Nothing multiline=Nothing comments=Nothing extra_opts=[] =
-    case separator of
-        Split_Kind.Words -> Vector.Vector this.words
-        Split_Kind.Whitespace ->
-            pattern = Regex.compile "\s+" match_ascii=match_ascii case_insensitive=case_insensitive dot_matches_newline=dot_matches_newline multiline=multiline comments=comments extra_opts=extra_opts
-            pattern.split this mode=mode
-        Split_Kind.Lines ->
-            pattern = Regex.compile "\v+" match_ascii=match_ascii case_insensitive=case_insensitive dot_matches_newline=dot_matches_newline multiline=multiline comments=comments extra_opts=extra_opts
-            pattern.split this mode=mode
-        Text ->
-            pattern = Regex.compile separator match_ascii=match_ascii case_insensitive=case_insensitive dot_matches_newline=dot_matches_newline multiline=multiline comments=comments extra_opts=extra_opts
-            pattern.split this mode=mode
+         'abc  def\tghi'.split '\\s+' Regex_Matcher == ["abc", "def", "ghi"]
+Text.split : Text -> (Text_Matcher | Regex_Matcher) -> Vector.Vector Text
+Text.split delimiter="," matcher=Text_Matcher = if delimiter.is_empty then Error.throw (Illegal_Argument_Error "The delimiter cannot be empty.") else
+    case matcher of
+        Text_Matcher case_sensitivity ->
+            delimiters = Vector.Vector <| case case_sensitivity of
+                True ->
+                    Text_Utils.span_of_all this delimiter
+                Case_Insensitive locale ->
+                    Text_Utils.span_of_all_case_insensitive this delimiter locale.java_locale
+            Vector.new delimiters.length+1 i->
+                start = if i == 0 then 0 else
+                    delimiters.at i-1 . codeunit_end
+                end = if i == delimiters.length then (Text_Utils.char_length this) else
+                    delimiters.at i . codeunit_start
+                Text_Utils.substring this start end
+        Regex_Matcher _ _ _ _ _ ->
+            compiled_pattern = matcher.compile delimiter
+            compiled_pattern.split this mode=Mode.All
 
 ## ALIAS Replace Text
    Replaces the first, last, or all occurrences of term with new_text in the
@@ -547,7 +512,12 @@ Text.replace term="" new_text="" mode=Mode.All matcher=Text_Matcher = if term.is
    > Example
      Getting the words in the sentence "I have not one, but two cats."
 
-        "I have not one, but two cats.".words
+        "I have not one, but two cats.".words == ['I', 'have', 'not', 'one', ',', 'but', 'two', 'cats', '.']
+
+   > Example
+     Getting the words in the Thai sentence "แมวมีสี่ขา"
+
+         "แมวมีสี่ขา".words == ['แมว', 'มี', 'สี่', 'ขา']
 Text.words : Boolean -> Vector.Vector Text
 Text.words keep_whitespace=False =
     iterator = BreakIterator.getWordInstance
@@ -559,9 +529,7 @@ Text.words keep_whitespace=False =
     build prev nxt = if nxt == -1 then Nothing else
         word = Text_Utils.substring this prev nxt
         word_not_whitespace = (Text_Utils.is_all_whitespace word).not
-        if word_not_whitespace then bldr.append word else
-            if keep_whitespace then
-                bldr.append word
+        if word_not_whitespace || keep_whitespace then bldr.append word
 
         next_nxt = iterator.next
         @Tail_Call build nxt next_nxt
@@ -570,6 +538,33 @@ Text.words keep_whitespace=False =
 
     bldr.to_vector
 
+## ALIAS Get Lines
+
+   Splits the text into lines, based on '\n', '\r' or '\r\n' line endings.
+
+   Empty lines are added for leading newlines. Multiple consecutive
+   newlines will also yield additional empty lines. A line ending at the end of
+   the text is not required, but if it is present it will not cause an empty
+   line to be added at the end.
+
+   > Example
+     Split the text 'a\nb\nc' into lines.
+
+        'a\nb\nc'.lines == ['a', 'b', 'c']
+
+   > Example
+     Split the text '\na\n\nb\n\n\n' into lines.
+
+        '\na\n\nb\n\n\n'.lines == ['', 'a', '', 'b', '', '']
+
+   > Example
+     Split the text '\na\nb\n' into lines, keeping the line endings.
+
+        '\na\nb\n'.lines keep_endings=True == ['\n', 'a\n', 'b\n']
+Text.lines : Boolean -> Vector.Vector Text
+Text.lines keep_endings=False =
+    Vector.Vector (Text_Utils.split_on_lines this keep_endings)
+
 ## Checks whether `this` is equal to `that`.
 
    Arguments:
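
The semantics documented for the literal `Text.split` path and for `Text.lines` can be illustrated outside Enso. The following is a sketch in Scala, not the library code: `splitLiteral` and `lines` are hypothetical names, the real work is done by `Text_Utils` helpers that this diff does not show, and non-overlapping, left-to-right delimiter matches are assumed. Case-insensitive and regex matching are left out.

```scala
// Illustrative sketch only; names and structure are assumptions.
object TextSplitSketch {

  /** Splits on every literal occurrence of `delimiter`, yielding one more
    * piece than there are (non-overlapping) occurrences.
    */
  def splitLiteral(text: String, delimiter: String): Vector[String] = {
    require(delimiter.nonEmpty, "The delimiter cannot be empty.")
    val out   = Vector.newBuilder[String]
    var start = 0
    var idx   = text.indexOf(delimiter, start)
    while (idx >= 0) {
      out += text.substring(start, idx)
      start = idx + delimiter.length
      idx   = text.indexOf(delimiter, start)
    }
    // The piece after the last delimiter is always added, even if empty.
    out += text.substring(start)
    out.result()
  }

  /** Splits on "\r\n", "\n" or "\r"; a line ending at the very end of the
    * text does not add an extra empty line.
    */
  def lines(text: String, keepEndings: Boolean = false): Vector[String] = {
    val out   = Vector.newBuilder[String]
    var start = 0
    var i     = 0
    while (i < text.length) {
      val c = text.charAt(i)
      if (c == '\n' || c == '\r') {
        // Treat "\r\n" as a single two-character line ending.
        val endingLength =
          if (c == '\r' && i + 1 < text.length && text.charAt(i + 1) == '\n') 2
          else 1
        val end = if (keepEndings) i + endingLength else i
        out += text.substring(start, end)
        i += endingLength
        start = i
      } else {
        i += 1
      }
    }
    // Text after the last line ending forms the final line, if any remains.
    if (start < text.length) out += text.substring(start)
    out.result()
  }
}
```

Applied to the inputs from the doc comments above, `splitLiteral("Namespace::package::package::Type", "::")` gives the four pieces listed there, and `lines` reproduces the three documented `Text.lines` examples.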

diff --git a/distribution/lib/Standard/Base/0.0.0-dev/src/Data/Text/Regex.enso b/distribution/lib/Standard/Base/0.0.0-dev/src/Data/Text/Regex.enso
@@ -12,7 +12,6 @@ import Standard.Base.Data.Text.Regex.Engine
 import Standard.Base.Data.Text.Regex.Engine.Default as Default_Engine
 import Standard.Base.Data.Text.Regex.Mode
 import Standard.Base.Data.Text.Regex.Option
-import Standard.Base.Data.Text.Split_Kind
 import Standard.Base.Data.Map
 
 import Standard.Base.Error.Extensions as Errors

diff --git a/distribution/lib/Standard/Base/0.0.0-dev/src/Data/Text/Regex/Engine/Default.enso b/distribution/lib/Standard/Base/0.0.0-dev/src/Data/Text/Regex/Engine/Default.enso
@@ -476,7 +476,7 @@ type Pattern
                     Mode_Error "Cannot match a negative number of times."
 
                 mode + 1
-            Mode.All -> 0
+            Mode.All -> -1
             Mode.Full -> Panic.throw <|
                 Mode_Error "Splitting on a full match yields an empty text."
             Mode.Bounded _ _ _ -> Panic.throw <|
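
The switch from `0` to `-1` for `Mode.All` is easiest to read against the `limit` argument of `java.util.regex.Pattern.split`. Assuming the value computed here is forwarded to that method (the hunk does not show the call), a limit of `0` drops trailing empty strings while a negative limit keeps them. A small Scala sketch of the difference, with `SplitLimitDemo` as an illustrative name:

```scala
import java.util.regex.Pattern

object SplitLimitDemo {
  private val comma = Pattern.compile(",")
  private val text  = "a,b,,"

  // Limit 0: no cap on the number of pieces, trailing empty strings removed.
  val withZeroLimit: Vector[String] = comma.split(text, 0).toVector

  // Negative limit: no cap on the number of pieces, trailing empty strings kept.
  val withNegativeLimit: Vector[String] = comma.split(text, -1).toVector
}
```

Here `withZeroLimit` is `Vector("a", "b")` and `withNegativeLimit` is `Vector("a", "b", "", "")`, so under the assumption above the regex path keeps empty pieces after a trailing delimiter, as the literal `Text_Matcher` path does.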

diff --git a/distribution/lib/Standard/Base/0.0.0-dev/src/Data/Text/Split_Kind.enso b/distribution/lib/Standard/Base/0.0.0-dev/src/Data/Text/Split_Kind.enso
diff --git a/distribution/lib/Standard/Base/0.0.0-dev/src/Main.enso b/distribution/lib/Standard/Base/0.0.0-dev/src/Main.enso
@@ -58,7 +58,7 @@ from project.Data.Range export Range
    Relevant issues:
    https://www.pivotaltracker.com/story/show/181403340
    https://www.pivotaltracker.com/story/show/181309938
-from project.Data.Text.Extensions export Text, Split_Kind, Line_Ending_Style, Case, Location
+from project.Data.Text.Extensions export Text, Line_Ending_Style, Case, Location
 from project.Data.Text.Matching export Case_Insensitive, Text_Matcher, Regex_Matcher
 from project.Error.Common export all
 from project.Error.Extensions export all

diff --git a/distribution/lib/Standard/Searcher/0.0.0-dev/src/Data_Science/Text.enso b/distribution/lib/Standard/Searcher/0.0.0-dev/src/Data_Science/Text.enso
@@ -8,7 +8,7 @@
    > Example
      Split the text on whitespace into a vector of items.
 
-         "ham eggs cheese tomatoes".split Split_Kind.Whitespace
+         "ham eggs cheese tomatoes".split "\s+" Regex_Matcher
 
    > Example
      Getting the words in the sentence "I have not one, but two cats."

diff --git a/project/DistributionPackage.scala b/project/DistributionPackage.scala
@@ -82,17 +82,6 @@ object DistributionPackage {
     }
   }
 
-  def downloadFileToLocation(
-    address: String,
-    location: File
-  ): File = {
-    val exitCode = (url(address) #> location).!
-    if (exitCode != 0) {
-      throw new RuntimeException(s"Downloading the file at $address failed.")
-    }
-    location
-  }
-
   def executableName(baseName: String): String =
     if (Platform.isWindows) baseName + ".exe" else baseName
 
@@ -154,7 +143,6 @@ object DistributionPackage {
       cacheFactory    = cacheFactory.sub("engine-libraries"),
       log             = log
     )
-    getStdlibDataFiles(distributionRoot, targetStdlibVersion)
 
     copyDirectoryIncremental(
       file("distribution/bin"),
@@ -238,19 +226,6 @@ object DistributionPackage {
 
   }
 
-  private def getStdlibDataFiles(
-    distributionRoot: File,
-    stdlibVersion: String
-  ): Unit = {
-    val exampleImageUrl =
-      "https://upload.wikimedia.org/wikipedia/commons/thumb/e/e9/" +
-      "Hue_alpha_falloff.png/320px-Hue_alpha_falloff.png"
-    downloadFileToLocation(
-      exampleImageUrl,
-      distributionRoot / s"lib/Standard/Examples/$stdlibVersion/data/image.png"
-    )
-  }
-
   private def buildEngineManifest(
     template: File,
     destination: File,