Regex.Compile using Truffle Regex, update find, match and match_all #5785

Merged
merged 70 commits into from
Mar 10, 2023
Changes from 22 commits
a1f269a
calling TRegexObject via matches
GregoryTravis Feb 23, 2023
c4c29e1
internal_pattern : Any
GregoryTravis Feb 23, 2023
49633f7
Merge branch 'develop' into wip/gmt/compile-regexp
GregoryTravis Feb 27, 2023
2cc6627
move new code to new modules
GregoryTravis Feb 27, 2023
28f2651
Pattern_2.matches
GregoryTravis Feb 27, 2023
2be426c
iterator
GregoryTravis Feb 28, 2023
3f6b5dd
to_text_debug
GregoryTravis Feb 28, 2023
14efa04
dead
GregoryTravis Feb 28, 2023
b0cacc8
change options to string and include them in src construction
GregoryTravis Mar 1, 2023
08d0cf0
Any, not Object
GregoryTravis Mar 1, 2023
ff8577d
Illegal_Argument
GregoryTravis Mar 1, 2023
c96511a
no UnicodeRegex
GregoryTravis Mar 1, 2023
7581dd5
Merge branch 'develop' into wip/gmt/compile-regexp
GregoryTravis Mar 1, 2023
4423481
match
GregoryTravis Mar 1, 2023
f2a3a68
groups work
GregoryTravis Mar 1, 2023
86a8955
case
GregoryTravis Mar 1, 2023
c59dd6e
install in match, find, find_all
GregoryTravis Mar 1, 2023
b67aab0
clean up
GregoryTravis Mar 1, 2023
d05c0e2
fix check_span
GregoryTravis Mar 1, 2023
fc94c95
clean up
GregoryTravis Mar 1, 2023
16f53d3
unicode woes
GregoryTravis Mar 2, 2023
8bbb758
mark normalization test pending
GregoryTravis Mar 2, 2023
5a34030
cleanup
GregoryTravis Mar 2, 2023
f6af863
convert to grapheme spans
GregoryTravis Mar 2, 2023
6a0ab81
Update test/Tests/src/Data/Text_Spec.enso
GregoryTravis Mar 2, 2023
3a131d0
cleanup, remove _2, escape
GregoryTravis Mar 3, 2023
0e5a7a4
rename to group
GregoryTravis Mar 3, 2023
4e51cbe
trying to read polyglot map
GregoryTravis Mar 3, 2023
be06de9
merge
GregoryTravis Mar 6, 2023
f3393d8
wip
GregoryTravis Mar 6, 2023
a197a74
Coerce polyglot values to supported Enso types
JaroslavTulach Mar 6, 2023
718efa7
undo catch
GregoryTravis Mar 6, 2023
799fbeb
Merge branch 'wip/gmt/compile-regexp' of github.com:enso-org/enso int…
GregoryTravis Mar 6, 2023
e8f7f67
disable syntax error test
GregoryTravis Mar 6, 2023
e2bf6f7
Coerce values obtained from readMember
JaroslavTulach Mar 7, 2023
e93cff1
jaroslav fix
GregoryTravis Mar 7, 2023
7d5f9e6
convert regex exception
GregoryTravis Mar 7, 2023
e6127ed
cleanup, correct error declarations
GregoryTravis Mar 7, 2023
69b710e
nonparticpating matches, docs
GregoryTravis Mar 7, 2023
3303cc4
docs
GregoryTravis Mar 7, 2023
3f72aa5
idiomatic
GregoryTravis Mar 7, 2023
8ce1ce6
docs
GregoryTravis Mar 7, 2023
f668bbd
docs
GregoryTravis Mar 7, 2023
6c8a62d
merge
GregoryTravis Mar 7, 2023
7e439f2
changelog
GregoryTravis Mar 7, 2023
354b1bf
fmt
GregoryTravis Mar 7, 2023
fe43b71
review
GregoryTravis Mar 8, 2023
1c12ed9
idiomatic
GregoryTravis Mar 8, 2023
3b2da3f
Update CHANGELOG.md
GregoryTravis Mar 8, 2023
e0a9e8b
Update distribution/lib/Standard/Base/0.0.0-dev/src/Data/Text/Regex_2…
GregoryTravis Mar 8, 2023
ee1431f
review
GregoryTravis Mar 8, 2023
81280ba
remove self from node
GregoryTravis Mar 8, 2023
bf4d119
review
GregoryTravis Mar 8, 2023
2181fea
Merge branch 'wip/gmt/compile-regexp' of github.com:enso-org/enso int…
GregoryTravis Mar 8, 2023
fb06590
matches must match at both ends
GregoryTravis Mar 8, 2023
d8ae85f
fmt
GregoryTravis Mar 8, 2023
92782fa
Merge branch 'develop' into wip/gmt/compile-regexp
GregoryTravis Mar 8, 2023
6e0ad33
dead
GregoryTravis Mar 8, 2023
16d82ef
Merge branch 'develop' into wip/gmt/compile-regexp
mergify[bot] Mar 9, 2023
4cd19cc
Merge branch 'develop' into wip/gmt/compile-regexp
mergify[bot] Mar 9, 2023
dac3966
Merge branch 'develop' into wip/gmt/compile-regexp
mergify[bot] Mar 9, 2023
09f57ef
Merge branch 'develop' into wip/gmt/compile-regexp
jdunkerley Mar 10, 2023
750abd1
Merge branch 'develop' into wip/gmt/compile-regexp
mergify[bot] Mar 10, 2023
7754e49
Fix changelog.
jdunkerley Mar 10, 2023
13228b4
Merge branch 'develop' into wip/gmt/compile-regexp
mergify[bot] Mar 10, 2023
f2bb972
Add TruffleBoundary.
jdunkerley Mar 10, 2023
154e595
Merge branch 'develop' into wip/gmt/compile-regexp
mergify[bot] Mar 10, 2023
079b06c
Ignore new files.
jdunkerley Mar 10, 2023
9b75487
Fix failing tests.
jdunkerley Mar 10, 2023
a2588a6
Merge branch 'develop' into wip/gmt/compile-regexp
jdunkerley Mar 10, 2023
Original file line number Diff line number Diff line change
@@ -13,6 +13,7 @@ import project.Data.Text.Encoding.Encoding
 import project.Data.Text.Location
 import project.Data.Text.Matching_Mode.Matching_Mode
 import project.Data.Text.Regex
+import project.Data.Text.Regex_2
 import project.Data.Text.Regex.Match.Match
 import project.Data.Text.Regex.Regex_Mode.Regex_Mode
 import project.Data.Text.Regex_Matcher.Regex_Matcher
@@ -230,7 +231,7 @@ Text.characters self =
 Text.find : Text -> Case_Sensitivity -> Match | Nothing ! Compile_Error
 Text.find self pattern=".*" case_sensitivity=Case_Sensitivity.Sensitive =
     case_insensitive = case_sensitivity.is_case_insensitive_in_memory
-    Regex.compile pattern case_insensitive=case_insensitive . match self Matching_Mode.First
+    Regex_2.compile_2 pattern case_insensitive=case_insensitive . match self

 ## Finds all the matches of the regular expression `pattern` in `self`,
    returning a Vector. If not found, will be an empty Vector.
@@ -252,9 +253,7 @@ Text.find self pattern=".*" case_sensitivity=Case_Sensitivity.Sensitive =
 Text.find_all : Text -> Case_Sensitivity -> Vector Match ! Compile_Error
 Text.find_all self pattern=".*" case_sensitivity=Case_Sensitivity.Sensitive =
     case_insensitive = case_sensitivity.is_case_insensitive_in_memory
-    case Regex.compile pattern case_insensitive=case_insensitive . match self Regex_Mode.All of
-        Nothing -> []
-        matches -> matches
+    Regex_2.compile_2 pattern case_insensitive=case_insensitive . match_all self

 ## ALIAS Check Matches

@@ -279,7 +278,7 @@ Text.find_all self pattern=".*" case_sensitivity=Case_Sensitivity.Sensitive =
 Text.match : Text -> Case_Sensitivity -> Boolean ! Compile_Error
 Text.match self pattern=".*" case_sensitivity=Case_Sensitivity.Sensitive =
     case_insensitive = case_sensitivity.is_case_insensitive_in_memory
-    compiled_pattern = Regex.compile pattern case_insensitive=case_insensitive
+    compiled_pattern = Regex_2.compile_2 pattern case_insensitive=case_insensitive
     compiled_pattern.matches self

 ## ALIAS Split Text
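The three methods touched in this file differ in how much of the input the pattern has to cover: `find` returns the first match, `find_all` returns every match, and `match` answers whether the input as a whole matches. A minimal Java sketch of that distinction follows. It uses `java.util.regex` as a stand-in for Truffle regex (an assumption made only so the example runs on a plain JDK), it returns matched text rather than match objects for brevity, and the helper names `findFirst`, `findAll` and `matchesWhole` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in sketch: java.util.regex replaces Truffle regex here.
final class FindSemanticsSketch {
    // Analogue of Text.find: text of the first match, or null when there is none.
    static String findFirst(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : null;
    }

    // Analogue of Text.find_all: every non-overlapping match, empty list when none.
    static List<String> findAll(Pattern p, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // Analogue of Text.match: true only when the pattern covers the whole input.
    static boolean matchesWhole(Pattern p, String input) {
        return p.matcher(input).matches();
    }

    public static void main(String[] args) {
        Pattern ab = Pattern.compile("ab");
        System.out.println(findFirst(ab, "abcdab"));
        System.out.println(findAll(ab, "abcdab"));
        System.out.println(matchesWhole(ab, "abcdab"));
        System.out.println(matchesWhole(ab, "ab"));
    }
}
```

The inputs in `main` are the same strings used in the `regs.enso` scratch file from this PR.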
Original file line number Diff line number Diff line change
@@ -1,9 +1,13 @@
 ## Internal text utilities for inspecting text primitives.

+import project.Any.Any
 import project.Data.Text.Text

 ## PRIVATE

    Forces flattening of a text value.
 optimize : Text
 optimize text = @Builtin_Method "Prim_Text_Helper.optimize"
+
+compile_regex : Text -> Text -> Any
+compile_regex pattern options = @Builtin_Method "Prim_Text_Helper.compile_regex"
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
import project.Data.Numbers.Integer
import project.Data.Range.Range
import project.Data.Text.Span.Span

type Match_2
    ## internal_regex_result : RegexResult (Truffle)
       (See https://github.com/oracle/graal/blob/master/regex/docs/README.md)
    Value (internal_regex_result : Any) (input : Text)

    start : Integer -> Integer
    start self n = self.internal_regex_result.getStart n

    end : Integer -> Integer
    end self n = self.internal_regex_result.getEnd n

    span : Integer -> Span
    span self n =
        range = Range.new (self.start n) (self.end n)
        Span.Value range self.input
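`Match_2` turns a group index into a span by pairing the group's start and end offsets with the original input. A small Java sketch of the same idea, using `java.util.regex.Matcher` as a stand-in for the Truffle `RegexResult` (an assumption). The `Span` class and the `span` helper are hypothetical, and returning `null` for a group that did not participate is a choice made for this sketch, not something the diff does.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in sketch: Matcher.start(n)/end(n) play the role of getStart n / getEnd n.
final class GroupSpanSketch {
    // Hypothetical analogue of Span: a half-open [start, end) range over the input.
    static final class Span {
        final int start;
        final int end;
        final String input;

        Span(int start, int end, String input) {
            this.start = start;
            this.end = end;
            this.input = input;
        }

        String text() {
            return input.substring(start, end);
        }
    }

    // Group 0 is the whole match; higher indices are capture groups.
    // java.util.regex reports -1 for a group that did not participate.
    static Span span(Matcher m, int group, String input) {
        int start = m.start(group);
        if (start < 0) {
            return null;
        }
        return new Span(start, m.end(group), input);
    }

    public static void main(String[] args) {
        String input = "ajeavabcdwri";
        Matcher m = Pattern.compile("a(b(cd))").matcher(input);
        if (m.find()) {
            for (int g = 0; g <= m.groupCount(); g++) {
                Span s = span(m, g, input);
                System.out.println(g + ": [" + s.start + ", " + s.end + ") " + s.text());
            }
        }
    }
}
```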
Original file line number Diff line number Diff line change
@@ -0,0 +1,88 @@
import project.Any.Any
#import project.Data.Boolean.Boolean
from project.Data.Boolean import Boolean, True, False
import project.Data.Map.Map
import project.Data.Numbers.Integer
import project.Data.Range.Extensions
import project.Data.Range.Range
import project.Data.Text.Span.Span
import project.Data.Text.Regex.Match_2.Match_2
import project.Data.Text.Text
import project.Data.Vector.Vector
import project.IO
import project.Nothing.Nothing

type Pattern_2
    ## internal_regex_object : RegexObject (Truffle)
       (See https://github.com/oracle/graal/blob/master/regex/docs/README.md)
    Value (internal_regex_object : Any)

    # start_iterator

    matches : Text -> Boolean
    matches self input =
        m = self.internal_regex_object.exec input 0
        m . isMatch && m.getStart 0 == 0 && m.getEnd 0 == input.length

    match_all : Text -> Vector Match_2
    match_all self input =
        builder = Vector.new_builder
        it = Match_Iterator.new self input
        go it = case it.next of
            Match_Iterator_Value.Next _ match next_it ->
                builder.append match
                go next_it
            Match_Iterator_Value.Last _ -> Nothing
        go it
        builder.to_vector

    match : Text -> Match_2 | Nothing
    match self input =
        it = Match_Iterator.new self input
        case it.next of
            Match_Iterator_Value.Next _ match _ -> match
            Match_Iterator_Value.Last _ -> Nothing

    to_text_debug : Text -> Vector Text
    to_text_debug self input =
        Match_Iterator.new self input . to_text_debug

type Match_Iterator
    new : Pattern_2 -> Text -> Match_Iterator
    new pattern input = Match_Iterator.Value pattern input 0

    Value (pattern : Pattern_2) (input : Text) (cursor : Integer)

    next : Match_Iterator_Value
    next self =
        regex_result = self.pattern.internal_regex_object.exec self.input self.cursor
        case regex_result.isMatch of
            False ->
                filler_range = Range.new self.cursor (self.input.length)
                filler_span = Span.Value filler_range self.input
                Match_Iterator_Value.Last filler_span
            True ->
                match_start = regex_result.getStart 0
                filler_range = Range.new self.cursor match_start
                filler_span = Span.Value filler_range self.input
                match = Match_2.Value regex_result self.input
                next_cursor = match.end 0
                next_iterator = Match_Iterator.Value self.pattern self.input next_cursor
                Match_Iterator_Value.Next filler_span match next_iterator

    to_text_debug : Vector Text
    to_text_debug self =
        vb = Vector.new_builder
        go it = case it.next of
            Match_Iterator_Value.Next filler match next_it ->
                vb.append ('\"' + filler.text + '\"')
                vb.append ("/" + (match.span 0).text + "/")
                go next_it
            Match_Iterator_Value.Last filler ->
                vb.append ('\"' + filler.text + '\"')
        go self
        vb.to_vector

type Match_Iterator_Value
    Next (filler : Span) (match : Match_2) (next_iterator : Match_Iterator)
    Last (filler : Span)
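`Match_Iterator` walks the input from a cursor and yields, for each step, the unmatched "filler" before a match plus the match itself, ending with a final filler. The Java sketch below reproduces the output shape of `to_text_debug` (fillers in double quotes, matches between slashes), using `java.util.regex` as a stand-in for Truffle regex (an assumption). One difference is deliberate: in this snapshot the next cursor is `match.end 0`, which would not move after a zero-length match, so the sketch advances the search position by one in that case. That guard is an addition of the sketch, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in sketch of the filler/match iteration; not the Enso implementation.
final class MatchIteratorSketch {
    private static String quote(String s) {
        return "\"" + s + "\"";
    }

    // Returns alternating fillers and matches, always ending with a filler.
    static List<String> segments(Pattern p, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(input);
        int fillerStart = 0; // where the next filler begins
        int searchFrom = 0;  // where the next search begins
        while (searchFrom <= input.length() && m.find(searchFrom)) {
            out.add(quote(input.substring(fillerStart, m.start())));
            out.add("/" + m.group() + "/");
            fillerStart = m.end();
            // Step past an empty match so the loop always terminates.
            searchFrom = (m.end() > m.start()) ? m.end() : m.end() + 1;
        }
        out.add(quote(input.substring(fillerStart)));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(segments(Pattern.compile("ab"), "heyabyeahab"));
        System.out.println(segments(Pattern.compile("x*"), "ab"));
    }
}
```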
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
import project.Any.Any
from project.Data.Boolean import Boolean, True, False
import project.Data.Numbers.Integer
import project.Data.Text.Prim_Text_Helper
import project.Data.Text.Regex.Pattern_2.Pattern_2
import project.Data.Text.Text
from project.Error.Common import Compile_Error, Syntax_Error
import project.Error.Illegal_Argument.Illegal_Argument
import project.IO
import project.Nothing.Nothing
import project.Panic.Panic

compile_2 : Text -> Boolean | Nothing -> Pattern_2 ! (Compile_Error | Illegal_Argument | Syntax_Error)
compile_2 self expression case_insensitive=Nothing =
    #all_options = options + self.engine_opts
    #options_bitmask = from_enso_options all_options
    options_string = if case_insensitive == True then "usgi" else "usg"

    maybe_regex_object = Panic.recover Any <|
        Prim_Text_Helper.compile_regex expression options_string

    internal_regex_object = maybe_regex_object.map_error case _ of
        # err : PatternSyntaxException -> Syntax_Error.Error ("The regex could not be compiled: " + err.getMessage)
        other -> other

    Pattern_2.Value internal_regex_object
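`compile_2` does two things before wrapping the result: it builds an options string (`"usg"`, plus `"i"` for case-insensitive matching) and it recovers from a compile failure so the error can be mapped. The commented-out line hints at turning a syntax exception into `Syntax_Error`. A hedged Java sketch of that flow is below. It compiles with `java.util.regex` instead of Truffle regex, and `RegexSyntaxError` is a hypothetical stand-in for `Syntax_Error`; the flag mapping to `java.util.regex` constants is an assumption of the sketch.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Stand-in sketch of "compile, and convert the engine's syntax exception".
final class CompileErrorSketch {
    // Hypothetical analogue of Syntax_Error.
    static final class RegexSyntaxError extends RuntimeException {
        RegexSyntaxError(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Same strings as in compile_2: "usg", plus "i" when case-insensitive.
    static String optionsString(boolean caseInsensitive) {
        return caseInsensitive ? "usgi" : "usg";
    }

    static Pattern compile(String expression, boolean caseInsensitive) {
        int flags = caseInsensitive ? (Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE) : 0;
        try {
            return Pattern.compile(expression, flags);
        } catch (PatternSyntaxException e) {
            // Re-throw as the library-level error, keeping the original as the cause.
            throw new RegexSyntaxError("The regex could not be compiled: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        System.out.println(optionsString(true));
        System.out.println(compile("abc", true).matcher("ABC").matches());
        try {
            compile("a(b", false);
        } catch (RegexSyntaxError e) {
            System.out.println(e.getMessage());
        }
    }
}
```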
Original file line number Diff line number Diff line change
@@ -7,7 +7,7 @@

 example_span =
     text = "Hello!"
-    Span.Value 0 3 text
+    Span.Value (Range.new 0 3) text

 import project.Data.Numbers.Integer
 import project.Data.Pair.Pair
Original file line number Diff line number Diff line change
@@ -0,0 +1,55 @@
package org.enso.interpreter.node.expression.builtin.text;

import com.oracle.truffle.api.dsl.Cached;
import org.enso.interpreter.dsl.BuiltinMethod;
import org.enso.interpreter.runtime.data.text.Text;

import com.oracle.truffle.api.dsl.Specialization;
import com.oracle.truffle.api.nodes.Node;
import com.oracle.truffle.api.source.Source;
import org.enso.interpreter.runtime.EnsoContext;

@BuiltinMethod(
    type = "Prim_Text_Helper",
    name = "compile_regex",
    description = "Compiles a regexp.",
    autoRegister = false)
public abstract class RegexCompileNode extends Node {
  static RegexCompileNode build() {
    return RegexCompileNodeGen.create();
  }

  abstract Object execute(Object self, Object pattern, Object options);



  @Specialization(limit = "3", guards = {
    "pattern.toString().equals(cachedPattern)",
    "options.toString().equals(cachedOptions)"
  })
  Object parseRegexPattern(Object self, Text pattern, Text options,
      @Cached("pattern.toString()") String cachedPattern,
      @Cached("options.toString()") String cachedOptions,
      @Cached("compile(cachedPattern, cachedOptions)") Object regex
  ) {
    return regex;
  }

  @Specialization
  Object alwaysCompile(Object self, Text pattern, Text options) {
    return compile(pattern.toString(), options.toString());
  }

  Object compile(String pattern, String options) {
    var ctx = EnsoContext.get(this);
    var env = ctx.getEnvironment();
    var s = "Flavor=ECMAScript/" + pattern + "/" + options;
    var src =
        Source.newBuilder("regex", s, "myRegex")
            .mimeType("application/tregex")
            .internal(true)
            .build();
    var regex = env.parseInternal(src).call();
    return regex;
  }
}
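The node above keeps up to three compiled regexes per call site, keyed on the pattern and options text, and otherwise compiles on every call. The plain-Java sketch below approximates that behaviour with a small map. It is not Truffle code, and the claim that entries cached before the limit was reached stay usable afterwards is my reading of the two specializations, not something verified here. `BoundedCompileCacheSketch`, `get` and `compileCalls` are hypothetical names, and `java.util.regex` stands in for `env.parseInternal(src).call()`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Plain-Java approximation of a size-limited inline cache; this is not Truffle code.
final class BoundedCompileCacheSketch {
    // Mirrors limit = "3" on the cached @Specialization.
    private static final int LIMIT = 3;

    private final Map<String, Pattern> cache = new LinkedHashMap<>();
    int compileCalls = 0;

    // Same source text layout that RegexCompileNode.compile builds.
    static String sourceText(String pattern, String options) {
        return "Flavor=ECMAScript/" + pattern + "/" + options;
    }

    Pattern get(String pattern, String options) {
        String key = sourceText(pattern, options);
        Pattern cached = cache.get(key);
        if (cached != null) {
            return cached; // analogue of parseRegexPattern returning the cached regex
        }
        Pattern compiled = compile(pattern, options);
        if (cache.size() < LIMIT) {
            cache.put(key, compiled);
        }
        return compiled; // beyond the limit: analogue of alwaysCompile
    }

    // Stand-in compiler; only the "i" option is interpreted in this sketch.
    private Pattern compile(String pattern, String options) {
        compileCalls++;
        int flags = options.contains("i") ? (Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE) : 0;
        return Pattern.compile(pattern, flags);
    }

    public static void main(String[] args) {
        BoundedCompileCacheSketch c = new BoundedCompileCacheSketch();
        c.get("ab", "usg");
        c.get("ab", "usg");
        System.out.println(sourceText("ab", "usgi"));
        System.out.println(c.compileCalls);
    }
}
```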
53 changes: 53 additions & 0 deletions regs.enso
Original file line number Diff line number Diff line change
@@ -0,0 +1,53 @@
from Standard.Base import all

import Standard.Base.Data.Text.Regex_2
import Standard.Base.Data.Text.Regex.Pattern_2.Pattern_2

main =
# 0301 == 0xc1
# 0xe9 = 0351
## result:
Nothing
(Match_2.Value TRegexResult[0, 1] 'é')
#accent_1 = '\u00E9'
## result:
(Match_2.Value TRegexResult[1, 2] 'áéó')
(Match_2.Value TRegexResult[0, 1] '́') # This is an unmoored accent between two single quotes, fml
accent_1 = '\u{301}'
accents = 'a\u{301}e\u{301}o\u{301}'
IO.println (accents.find accent_1)
IO.println (accent_1.find accent_1)
IO.println (accents.find accent_1 case_sensitivity=Case_Sensitivity.Sensitive)
IO.println (accent_1.find accent_1 case_sensitivity=Case_Sensitivity.Sensitive)

##
p = Regex_2.compile_2 "ab"

matches_no = p.match "qqq"
IO.println matches_no

matches = p.match_all "heyabyeahab yo abc no cab ab abab ye "
IO.println matches
IO.println (matches.map m-> m.span 0 . text)
match = p.match "heyabyeahab yo abc no cab ab abab ye "
IO.println match
IO.println (match.span 0 . text)

p2 = Regex_2.compile_2 "a(b(cd))"
m2 = p2.match "ajeavabcdwri"
IO.println p2.to_text_debug
IO.println (m2.span 0 . text)
IO.println (m2.span 1 . text)
IO.println (m2.span 2 . text)

p3a = Regex_2.compile_2 "abc"
p3b = Regex_2.compile_2 "abc" case_insensitive=True
m3a = p3a.match "ABC"
m3b = p3b.match "ABC"
IO.println m3a
IO.println m3b

IO.println ("abcdab".match "ab")
IO.println ("ab".match "ab")
IO.println ("abcdab".find "ab")
IO.println ("abcdab".find_all "ab")
14 changes: 11 additions & 3 deletions test/Tests/src/Data/Text_Spec.enso
Original file line number Diff line number Diff line change
@@ -57,8 +57,11 @@ type Manual
    - Note that currently the regex-based operations may not handle the edge
      cases described above too well.
 spec =
-    check_span result span = result.span 0 . to_grapheme_span . should_equal span
-    check_span_all result spans = result . map (m-> (m.span 0).to_grapheme_span) . should_equal spans
+    check_span match span =
+        match . should_not_equal Nothing
+        match.span 0 . should_equal span
+    check_span_all match spans = match . map (m-> (m.span 0)) . should_equal spans
+
     Test.group "Text" <|
         kshi = '\u0915\u094D\u0937\u093F'
         facepalm = '\u{1F926}\u{1F3FC}\u200D\u2642\uFE0F'
@@ -1186,7 +1189,12 @@ spec =
             ## Regex matching does not do case folding
             "Strasse".find "ß" Case_Sensitivity.Insensitive . should_equal Nothing

-            ## But it should handle the Unicode normalization
+        Test.specify "should handle the Unicode normalization" pending="Use this to test exposed normalization methods" <|
+            ## This test passed for the builtin Java regex library, using
+               Pattern.CANON_EQ, but since that option is buggy and rarely used,
+               we won't attempt to recreate it with Truffle regex. Instead,
+               expose normalization methods to allow developers to do it
+               themselves.
             accents = 'a\u{301}e\u{301}o\u{301}'
             check_span (accents.find accent_1) (Span.Value (1.up_to 2) 'a\u{301}e\u{301}o\u{301}')
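The pending test above leaves canonical equivalence to the caller: normalize both the pattern and the input, then match. A minimal Java sketch of that approach, using `java.text.Normalizer` from the JDK. The helper names `nfc`, `nfd` and `findNormalized` are hypothetical, and normalizing the pattern text as a plain string is a simplification that assumes the pattern is a literal.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Sketch: do the normalization explicitly instead of relying on Pattern.CANON_EQ.
final class NormalizationSketch {
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Normalizes pattern and input to NFC before searching.
    // Assumes the pattern is literal text, so normalizing it is safe.
    static boolean findNormalized(String literalPattern, String input) {
        return Pattern.compile(Pattern.quote(nfc(literalPattern))).matcher(nfc(input)).find();
    }

    public static void main(String[] args) {
        String composed = "\u00E9";    // e with acute accent as one code point
        String decomposed = "e\u0301"; // e followed by a combining acute accent
        System.out.println(composed.equals(decomposed));
        System.out.println(nfc(decomposed).equals(composed));
        System.out.println(Pattern.compile(Pattern.quote(composed)).matcher(decomposed).find());
        System.out.println(findNormalized(composed, decomposed));
    }
}
```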
