Regex.Compile using Truffle Regex, update find, match and match_all #5785

Merged
merged 70 commits into from
Mar 10, 2023
Changes from 61 commits
a1f269a
calling TRegexObject via matches
GregoryTravis Feb 23, 2023
c4c29e1
internal_pattern : Any
GregoryTravis Feb 23, 2023
49633f7
Merge branch 'develop' into wip/gmt/compile-regexp
GregoryTravis Feb 27, 2023
2cc6627
move new code to new modules
GregoryTravis Feb 27, 2023
28f2651
Pattern_2.matches
GregoryTravis Feb 27, 2023
2be426c
iterator
GregoryTravis Feb 28, 2023
3f6b5dd
to_text_debug
GregoryTravis Feb 28, 2023
14efa04
dead
GregoryTravis Feb 28, 2023
b0cacc8
change options to string and include them in src construction
GregoryTravis Mar 1, 2023
08d0cf0
Any, not Object
GregoryTravis Mar 1, 2023
ff8577d
Illegal_Argument
GregoryTravis Mar 1, 2023
c96511a
no UnicodeRegex
GregoryTravis Mar 1, 2023
7581dd5
Merge branch 'develop' into wip/gmt/compile-regexp
GregoryTravis Mar 1, 2023
4423481
match
GregoryTravis Mar 1, 2023
f2a3a68
groups work
GregoryTravis Mar 1, 2023
86a8955
case
GregoryTravis Mar 1, 2023
c59dd6e
install in match, find, find_all
GregoryTravis Mar 1, 2023
b67aab0
clean up
GregoryTravis Mar 1, 2023
d05c0e2
fix check_span
GregoryTravis Mar 1, 2023
fc94c95
clean up
GregoryTravis Mar 1, 2023
16f53d3
unicode woes
GregoryTravis Mar 2, 2023
8bbb758
mark normalization test pending
GregoryTravis Mar 2, 2023
5a34030
cleanup
GregoryTravis Mar 2, 2023
f6af863
convert to grapheme spans
GregoryTravis Mar 2, 2023
6a0ab81
Update test/Tests/src/Data/Text_Spec.enso
GregoryTravis Mar 2, 2023
3a131d0
cleanup, remove _2, escape
GregoryTravis Mar 3, 2023
0e5a7a4
rename to group
GregoryTravis Mar 3, 2023
4e51cbe
trying to read polyglot map
GregoryTravis Mar 3, 2023
be06de9
merge
GregoryTravis Mar 6, 2023
f3393d8
wip
GregoryTravis Mar 6, 2023
a197a74
Coerce polyglot values to supported Enso types
JaroslavTulach Mar 6, 2023
718efa7
undo catch
GregoryTravis Mar 6, 2023
799fbeb
Merge branch 'wip/gmt/compile-regexp' of github.com:enso-org/enso int…
GregoryTravis Mar 6, 2023
e8f7f67
disable syntax error test
GregoryTravis Mar 6, 2023
e2bf6f7
Coerce values obtained from readMember
JaroslavTulach Mar 7, 2023
e93cff1
jaroslav fix
GregoryTravis Mar 7, 2023
7d5f9e6
convert regex exception
GregoryTravis Mar 7, 2023
e6127ed
cleanup, correct error declarations
GregoryTravis Mar 7, 2023
69b710e
nonparticpating matches, docs
GregoryTravis Mar 7, 2023
3303cc4
docs
GregoryTravis Mar 7, 2023
3f72aa5
idiomatic
GregoryTravis Mar 7, 2023
8ce1ce6
docs
GregoryTravis Mar 7, 2023
f668bbd
docs
GregoryTravis Mar 7, 2023
6c8a62d
merge
GregoryTravis Mar 7, 2023
7e439f2
changelog
GregoryTravis Mar 7, 2023
354b1bf
fmt
GregoryTravis Mar 7, 2023
fe43b71
review
GregoryTravis Mar 8, 2023
1c12ed9
idiomatic
GregoryTravis Mar 8, 2023
3b2da3f
Update CHANGELOG.md
GregoryTravis Mar 8, 2023
e0a9e8b
Update distribution/lib/Standard/Base/0.0.0-dev/src/Data/Text/Regex_2…
GregoryTravis Mar 8, 2023
ee1431f
review
GregoryTravis Mar 8, 2023
81280ba
remove self from node
GregoryTravis Mar 8, 2023
bf4d119
review
GregoryTravis Mar 8, 2023
2181fea
Merge branch 'wip/gmt/compile-regexp' of github.com:enso-org/enso int…
GregoryTravis Mar 8, 2023
fb06590
matches must match at both ends
GregoryTravis Mar 8, 2023
d8ae85f
fmt
GregoryTravis Mar 8, 2023
92782fa
Merge branch 'develop' into wip/gmt/compile-regexp
GregoryTravis Mar 8, 2023
6e0ad33
dead
GregoryTravis Mar 8, 2023
16d82ef
Merge branch 'develop' into wip/gmt/compile-regexp
mergify[bot] Mar 9, 2023
4cd19cc
Merge branch 'develop' into wip/gmt/compile-regexp
mergify[bot] Mar 9, 2023
dac3966
Merge branch 'develop' into wip/gmt/compile-regexp
mergify[bot] Mar 9, 2023
09f57ef
Merge branch 'develop' into wip/gmt/compile-regexp
jdunkerley Mar 10, 2023
750abd1
Merge branch 'develop' into wip/gmt/compile-regexp
mergify[bot] Mar 10, 2023
7754e49
Fix changelog.
jdunkerley Mar 10, 2023
13228b4
Merge branch 'develop' into wip/gmt/compile-regexp
mergify[bot] Mar 10, 2023
f2bb972
Add TruffleBoundary.
jdunkerley Mar 10, 2023
154e595
Merge branch 'develop' into wip/gmt/compile-regexp
mergify[bot] Mar 10, 2023
079b06c
Ignore new files.
jdunkerley Mar 10, 2023
9b75487
Fix failing tests.
jdunkerley Mar 10, 2023
a2588a6
Merge branch 'develop' into wip/gmt/compile-regexp
jdunkerley Mar 10, 2023
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -330,6 +330,7 @@
and renamed them to `match`, `find`, `find_all` (respectively).][5721]
- [Updated `rename_columns` to new API. Added `first_row`, `second_row` and
`last_row` to Table types][5719]
- [Removed many regex compile flags; separated `match` into `match` and `match_all`.][5785]

[debug-shortcuts]:
https://github.com/enso-org/enso/blob/develop/app/gui/docs/product/shortcuts.md#debug
@@ -508,6 +509,7 @@
[5779]: https://github.com/enso-org/enso/pull/5779
[5757]: https://github.com/enso-org/enso/pull/5757
[5802]: https://github.com/enso-org/enso/pull/5802
[5785]: https://github.com/enso-org/enso/pull/5785

#### Enso Compiler

@@ -12,18 +12,18 @@ import project.Data.Text.Case_Sensitivity.Case_Sensitivity
import project.Data.Text.Encoding.Encoding
import project.Data.Text.Location
import project.Data.Text.Matching_Mode.Matching_Mode
import project.Data.Text.Regex
import project.Data.Text.Regex.Match.Match
import project.Data.Text.Regex.Regex_Mode.Regex_Mode
import project.Data.Text.Regex_Matcher.Regex_Matcher
import project.Data.Text.Regex_2
import project.Data.Text.Regex_2.Regex_Syntax_Error
import project.Data.Text.Span.Span
import project.Data.Text.Span.Utf_16_Span
import project.Data.Text.Text
import project.Data.Text.Text_Matcher.Text_Matcher
import project.Data.Text.Text_Sub_Range.Codepoint_Ranges
import project.Data.Text.Text_Sub_Range.Text_Sub_Range
import project.Data.Vector.Vector
import project.Error.Common.Compile_Error
import project.Error.Common.Index_Out_Of_Bounds
import project.Error.Error
import project.Error.Encoding_Error.Encoding_Error
@@ -227,10 +227,10 @@ Text.characters self =
example_find_insensitive =
## This matches `aBc` @ character 11
"aabbbbccccaaBcaaaa".find "a[ab]c" Case_Sensitivity.Insensitive
- Text.find : Text -> Case_Sensitivity -> Match | Nothing ! Compile_Error
+ Text.find : Text -> Case_Sensitivity -> Match | Nothing ! Regex_Syntax_Error
Text.find self pattern=".*" case_sensitivity=Case_Sensitivity.Sensitive =
case_insensitive = case_sensitivity.is_case_insensitive_in_memory
- Regex.compile pattern case_insensitive=case_insensitive . match self Matching_Mode.First
+ Regex_2.compile pattern case_insensitive=case_insensitive . match self

## Finds all the matches of the regular expression `pattern` in `self`,
returning a Vector. If not found, will be an empty Vector.
@@ -249,12 +249,10 @@ Text.find self pattern=".*" case_sensitivity=Case_Sensitivity.Sensitive =
example_find_all_insensitive =
## This matches `aABbbbc` @ character 0 and `aaBC` @ character 10
"aABbbbccccaaBCaaaa".find_all "a[ab]+c" Case_Sensitivity.Insensitive
- Text.find_all : Text -> Case_Sensitivity -> Vector Match ! Compile_Error
+ Text.find_all : Text -> Case_Sensitivity -> Vector Match ! Regex_Syntax_Error
Text.find_all self pattern=".*" case_sensitivity=Case_Sensitivity.Sensitive =
case_insensitive = case_sensitivity.is_case_insensitive_in_memory
- case Regex.compile pattern case_insensitive=case_insensitive . match self Regex_Mode.All of
- Nothing -> []
- matches -> matches
+ Regex_2.compile pattern case_insensitive=case_insensitive . match_all self

## ALIAS Check Matches

@@ -276,10 +274,10 @@ Text.find_all self pattern=".*" case_sensitivity=Case_Sensitivity.Sensitive =
regex = ".+ct@.+"
# Evaluates to true
"[email protected]".match regex Case_Sensitivity.Insensitive
- Text.match : Text -> Case_Sensitivity -> Boolean ! Compile_Error
+ Text.match : Text -> Case_Sensitivity -> Boolean ! Regex_Syntax_Error
Text.match self pattern=".*" case_sensitivity=Case_Sensitivity.Sensitive =
case_insensitive = case_sensitivity.is_case_insensitive_in_memory
- compiled_pattern = Regex.compile pattern case_insensitive=case_insensitive
+ compiled_pattern = Regex_2.compile pattern case_insensitive=case_insensitive
compiled_pattern.matches self
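
`Text.match` delegates to `matches`, which only succeeds when the match covers the whole input (see the commit "matches must match at both ends"). The sketch below is a Python analogy built on the standard `re` module, not on the Truffle regex engine, and both helper names are hypothetical. It contrasts a search followed by a bounds check with a true full match; in Python's engine the two can disagree when the first branch of an alternation matches only a prefix. Whether the Truffle engine behaves the same way is not established here.

```python
import re


def matches_by_bounds(pattern: str, text: str, ignore_case: bool = False) -> bool:
    """Search from index 0, then require the match to span the whole input."""
    flags = re.IGNORECASE if ignore_case else 0
    m = re.search(pattern, text, flags)
    return m is not None and m.start(0) == 0 and m.end(0) == len(text)


def matches_full(pattern: str, text: str, ignore_case: bool = False) -> bool:
    """Ask the engine directly for a match anchored at both ends."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.fullmatch(pattern, text, flags) is not None


if __name__ == "__main__":
    # A pattern that covers the whole input passes both checks.
    print(matches_by_bounds(r".+ct@.+", "CONTACT@example.org", ignore_case=True))
    # A match in the middle of the input is rejected by the bounds check.
    print(matches_by_bounds(r"a[ab]c", "aabbbbccccaaBcaaaa", ignore_case=True))
    # The search picks the branch "a", so the bounds check fails,
    # while fullmatch backtracks into the branch "ab".
    print(matches_by_bounds(r"a|ab", "ab"), matches_full(r"a|ab", "ab"))
```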

## ALIAS Split Text
@@ -1,9 +1,24 @@
## Internal text utilities for inspecting text primitives.

import project.Any.Any
import project.Data.Text.Text

## PRIVATE

Forces flattening of a text value.
optimize : Text
optimize text = @Builtin_Method "Prim_Text_Helper.optimize"

## PRIVATE

Compile the regex using the Truffle regex library.

Returns a Java RegexObject (Truffle)
(See https://github.com/oracle/graal/blob/master/regex/docs/README.md)

Arguments:
- pattern: the regex to compile
- options: string containing traditional regex flags (for example, "g"
as in "/foo/g")
compile_regex : Text -> Text -> Any
compile_regex pattern options = @Builtin_Method "Prim_Text_Helper.compile_regex"
@@ -0,0 +1,67 @@
from project.Data.Boolean import Boolean, True, False
import project.Data.Numbers.Integer
import project.Data.Range.Range
import project.Data.Text.Span.Span
import project.Data.Text.Span.Utf_16_Span
import project.Data.Text.Text
import project.Nothing.Nothing

type Match_2
## internal_regex_result : RegexResult (Truffle)
(See https://github.com/oracle/graal/blob/master/regex/docs/README.md)
Value (pattern : Pattern_2) (internal_regex_result : Any) (input : Text)

## PRIVATE
Returns the start character of group n.

Arguments:
- n: the integer group number. Note that the groups explicitly
defined in the regex are numbered starting at 1; group 0 refers to the
entire match range.
start : Integer -> Integer
start self n = self.internal_regex_result.getStart n

## PRIVATE
Returns the end character of group n.

Arguments:
- n: the integer group number
end : Integer -> Integer
end self n = self.internal_regex_result.getEnd n

## PRIVATE

Gets the span matched by the group with the provided identifier, or
`Nothing` if the group did not participate in the match. If no such group
exists for the provided identifier, a `No_Such_Group` error is returned.

Arguments:
- id: The integer index or name of that group.

? The Full Match
The group with index 0 is always the full match of the pattern.

? Named Groups by Index
If the regex contained named groups, these may also be accessed by
index based on their position in the pattern.

Note that it is possible for a group to "not participate in the match",
for example with a disjunction. In the example below, the "(d)" group
does not participate -- it neither matches nor fails.

"abc".find "ab((c)|(d))"

In this case, the group id for "(d)", which is 3, is a valid group id and
(Pattern_2.lookup_group 3) will return 3. If the caller tries to get group 3,
Match_2.group will return Nothing.
group : Integer | Text -> Span | Nothing ! No_Such_Group
group self id =
n = self.pattern.lookup_group id
start = self.start n
end = self.end n
does_not_participate = start == -1 || end == -1
case does_not_participate of
True -> Nothing
False ->
range = Range.new (self.start n) (self.end n)
(Utf_16_Span.Value range self.input).to_grapheme_span
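
The `-1` check in `group` can be tried outside Enso. The sketch below is a Python analogy using the standard `re` module, which also reports `-1` for the start and end of a group that exists in the pattern but did not participate in the match. `group_span` is a hypothetical helper and not part of this PR, and Python indexes by code point rather than by UTF-16 unit or grapheme.

```python
import re


def group_span(m, group_id):
    """Return (start, end) for a group, or None if it did not participate.

    `m` is an `re` match object. An unknown group number or name raises
    IndexError, which is Python's own behaviour.
    """
    start, end = m.start(group_id), m.end(group_id)
    if start == -1 or end == -1:
        return None
    return (start, end)


# Same pattern and input as the doc comment above: "(d)" is group 3.
m = re.search(r"ab((c)|(d))", "abc")

if __name__ == "__main__":
    for n in range(4):
        print(n, group_span(m, n))
```

Here the `IndexError` raised for a group id outside the pattern plays the role that `No_Such_Group` plays in `lookup_group`, while `None` stands in for `Nothing`.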
@@ -0,0 +1,177 @@
import project.Any.Any
#import project.Data.Boolean.Boolean
from project.Data.Boolean import Boolean, True, False
import project.Data.Map.Map
import project.Data.Numbers.Integer
import project.Data.Range.Extensions
import project.Data.Range.Range
import project.Data.Text.Span.Span
import project.Data.Text.Span.Utf_16_Span
import project.Data.Text.Regex.Match_2.Match_2
import project.Data.Text.Regex_2.No_Such_Group
import project.Data.Text.Text
import project.Data.Vector.Vector
import project.Error.Error
import project.Meta
import project.Nothing.Nothing
import project.Panic.Panic
import project.Polyglot

polyglot java import org.enso.base.Text_Utils

type Pattern_2
## internal_regex_object : RegexObject (Truffle)
(See https://github.com/oracle/graal/blob/master/regex/docs/README.md)
Value (internal_regex_object : Any)

## Returns `True` if the input matches against the pattern described by
`self`, otherwise `False`.

Arguments:
- input: The text to check for matching.
matches : Text -> Boolean
matches self input =
m = self.internal_regex_object.exec input 0
m . isMatch && m.getStart 0 == 0 && m.getEnd 0 == input.length

## Tries to match the provided `input` against the pattern `self`.

Arguments:
- input: The text to match the pattern described by `self` against.
match : Text -> Match_2 | Nothing
match self input =
it = Match_Iterator.new self input
case it.next of
Match_Iterator_Value.Next _ match _ -> match
Match_Iterator_Value.Last _ -> Nothing

## Tries to match the provided `input` against the pattern `self`.

Returns a Vector of Match_2 objects.

Arguments:
- input: The text to match the pattern described by `self` against.
match_all : Text -> Vector Match_2
match_all self input =
builder = Vector.new_builder
it = Match_Iterator.new self input
go it = case it.next of
Match_Iterator_Value.Next _ match next_it ->
builder.append match
go next_it
Match_Iterator_Value.Last _ -> Nothing
go it
builder.to_vector

## PRIVATE

Look up a match group name or number, and check that it is valid.

Arguments:
- id: The name or number of the group that was asked for.

Returns: a group number.

A group number is invalid if it is outside the range of groups
that were in the original pattern.

A group name is invalid if it was not defined in the original pattern.

A group name is an alias for a group number; if a name is passed to
this method, it returns the corresponding group number.

Note that it is possible for a group to "not participate in the match",
for example with a disjunction. In the example below, the "(d)" group
does not participate -- it neither matches nor fails.

"abc".find "ab((c)|(d))"

In this case, the group id for "(d)", which is 3, is a valid group id and
(Pattern_2.lookup_group 3) will return 3. If the caller tries to get group 3,
Match_2.group will return Nothing.

lookup_group : Integer | Text -> Integer ! No_Such_Group
lookup_group self id =
case id of
n : Integer -> case (n >= 0 && n < self.internal_regex_object.groupCount) of
True -> n
False -> Error.throw (No_Such_Group.Error n)
name : Text ->
# Maps name to number
groups = self.internal_regex_object.groups

n = case groups of
# If Nothing, there are no named groups
Nothing -> Error.throw (No_Such_Group.Error name)
_ ->
qq = (read_group_map groups name)
case qq of
Nothing -> Nothing
n : Integer -> n
case n of
_ : Integer -> n
Nothing -> Error.throw (No_Such_Group.Error name)

## PRIVATE

Performs the regex match, and iterates through the results. Yields both
the matched parts of the string, and the 'filler' parts between them.

At each step, it yields a Match_Iterator_Value, which has either a filler
and a match, or just the final filler. A Match_Iterator_Value.Last value is
returned at the end, and only at the end.
type Match_Iterator
new : Pattern_2 -> Text -> Match_Iterator
new pattern input = Match_Iterator.Value pattern input 0

Value (pattern : Pattern_2) (input : Text) (cursor : Integer)

next : Match_Iterator_Value
next self =
regex_result = self.pattern.internal_regex_object.exec self.input self.cursor
case regex_result.isMatch of
False ->
filler_range = Range.new self.cursor (Text_Utils.char_length self.input)
filler_span = (Utf_16_Span.Value filler_range self.input).to_grapheme_span
Match_Iterator_Value.Last filler_span
True ->
match_start = regex_result.getStart 0
filler_range = Range.new self.cursor match_start
filler_span = (Utf_16_Span.Value filler_range self.input).to_grapheme_span
match = Match_2.Value self.pattern regex_result self.input
next_cursor = match.end 0
next_iterator = Match_Iterator.Value self.pattern self.input next_cursor
Match_Iterator_Value.Next filler_span match next_iterator

to_text_debug : Vector Text
to_text_debug self =
vb = Vector.new_builder
go it = case it.next of
Match_Iterator_Value.Next filler match next_it ->
vb.append ('\"' + filler.text + '\"')
vb.append ("/" + (match.group 0).text + "/")
go next_it
Match_Iterator_Value.Last filler ->
vb.append ('\"' + filler.text + '\"')
go self
vb.to_vector

## PRIVATE
type Match_Iterator_Value
Next (filler : Span) (match : Match_2) (next_iterator : Match_Iterator)
Last (filler : Span)

## PRIVATE
group_map_contains : Any -> Text -> Boolean
group_map_contains map elem =
members = Polyglot.get_members map
as_vec = Vector.from_polyglot_array members
as_vec.contains elem

## PRIVATE
read_group_map : Any -> Text -> Integer | Nothing
read_group_map map name =
case group_map_contains map name of
True -> Polyglot.get_member map name
False -> Nothing
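
The cursor-driven loop in `Match_Iterator.next` can be sketched in a few lines. This is a Python analogy on the standard `re` module, not the Truffle API, and `iterate_matches` is a hypothetical name. Each step yields the filler before a match together with the matched text, and the final step yields only the trailing filler, mirroring `Next` and `Last`. As a simplification of this sketch alone, an empty match raises an error, because a cursor that is set to the end of an empty match would never advance.

```python
import re
from typing import Iterator, Optional, Tuple


def iterate_matches(pattern: str, text: str) -> Iterator[Tuple[str, Optional[str]]]:
    """Yield (filler, match_text) pairs; the last pair is (filler, None)."""
    compiled = re.compile(pattern)
    cursor = 0
    while True:
        m = compiled.search(text, cursor)
        if m is None:
            # No further match: everything after the cursor is trailing filler.
            yield (text[cursor:], None)
            return
        if m.end() == m.start():
            # Assumption of this sketch: patterns must consume input.
            raise ValueError("empty match would not advance the cursor")
        yield (text[cursor:m.start()], m.group(0))
        cursor = m.end()


if __name__ == "__main__":
    for filler, matched in iterate_matches(r"\d+", "ab12cd345ef"):
        print(repr(filler), repr(matched))
```

Collecting only the non-`None` second elements gives the equivalent of `match_all`, and taking the first one gives the equivalent of `match`.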