picat_regex

Regex module for Picat (http://picat-lang.org).

This is an experimental module for handling regexes in Picat (requires Picat v3.3#3 or higher).

It is an interface to PCRE2 (https://www.pcre.org/current/doc/html/index.html) that exposes some selected functionality. These pages are probably the most interesting:

pcre2pattern: https://www.pcre.org/current/doc/html/pcre2pattern.html
pcre2syntax: https://www.pcre.org/current/doc/html/pcre2syntax.html
pcre2perform: https://www.pcre.org/current/doc/html/pcre2perform.html

For more information and examples regarding regexes see for example:

https://www.rexegg.com/ (an overview of operators and symbols: https://www.rexegg.com/regex-quickstart.html)
https://www.regular-expressions.info/

This regex module does not cover the complete PCRE2 API, far from it. Instead it contains a few predicates/functions, mainly taken from my own use cases of regular expressions from my experience as a long-time Perl programmer.

Some initial notes:

I have only tested this on Ubuntu 20.04 (Linux).
As you might see in emu/bp_pcre2.c, I am not a C programmer (there is a reason why I prefer VHLL programming languages).

Files:

The files in this repo are:

- emu/bp_pcre2.c: The C functions with calls to PCRE2.
- emu/cpreds.c: Includes the new regex functions.
- emu/Makefile.linux64_pcre2: Makefile for the Linux version of Picat.
- emu/test_regex.pi: Some tests (go/0, go2/0 .. go9/0) testing different aspects of the regex module.
- lib/regex.pi: The Picat predicates/functions using the interface defined in bp_pcre2.c.

In ./ there are some test programs for the regex module and some other regex-related programs.

Testing the regex module (i.e. import regex.):

- regex_test_emu.pi: Some other tests of the regex module.
- wordle_regex.pi: Wordle solver. It uses the word list wordle_small.txt.
- regex_match_number.pi
- make_regex2.pi: Creates regexes given a list of words, and tests them using the regex module. (This is a variant of my http://hakank.org/picat/make_regex.pi).

Some other regex-related programs which do not require the regex module but which I thought were appropriate to include:

- regex_crossword.pi: Solving some of the regex crosswords from https://regexcrossword.com
- regex_generating_strings_v3.pi: Generating the possible strings given a regex. Note that the supported regexes are quite limited, e.g. no backreferences etc.

Compiling and testing

The regex module has been developed on a Linux Ubuntu 20.04 system using Picat version v3.7. To compile it you need the C source code for Picat, available at http://picat-lang.org/download.html (named picatnn.tar.gz, e.g. picat37_*.tar.gz).

Unpack the contents of the Picat *.tar.gz in a directory, say /home/hakank/picat.

Copy the files in this repo's directories lib and emu to the corresponding directories in the newly created directory (/home/hakank/picat/lib, /home/hakank/picat/emu/).

Note: If you have used an earlier version of regex.pi, it's recommended that you remove the previously generated regex.qi before testing.

Ensure that the pcre2-8 library is installed.

On my Ubuntu 20.04 this is done by
```
$ sudo apt update
$ sudo apt install pcre2-8
```
(The -8 part refers to the 8-bit character (code unit) width used in the regexes.)
Note: on Debian/Ubuntu the PCRE2 headers and libraries are normally packaged as libpcre2-dev, in case apt does not find a package named pcre2-8.
cd to the directory emu and run make:
```
$ cd emu
$ time sh make_picat_scip_linux
```
The Picat executable picat is now created in this directory (if no errors occurred).

The next step is to make the Picat executable aware of the regex module. This can be done in two ways. Here we assume that regex.pi is in the directory /home/hakank/picat/lib.

a) With the -path flag:
```
$ picat -path /home/hakank/picat/lib  program.pi
```
b) Or set the environment variable PICATPATH
```
$ export PICATPATH=".;/home/hakank/picat/lib"
$ picat program.pi
```

Test that it works as expected:

```
$ ./picat
Picat> import regex
Picat> regex("(?i)(hello),?\\s+(world|picat)!?","Hello, Picat!",Capture)
Capture = [['H',e,l,l,o,',',' ','P',i,c,a,t,'!'],['H',e,l,l,o],['P',i,c,a,t]]
yes
```

Some more tests are collected in the program test_regex.pi:

```
$ ./picat -g go test_regex.pi
...
$ ./picat -g go2 test_regex.pi
...
```

Supported predicates/functions in the regex module

After compiling the new picat executable and importing the module (import regex.), a number of Picat predicates/functions are available (see lib/regex.pi for details). Here is a short description of the module.

Here are the (type of) parameters for the functions:

Pattern is a PCRE2 pattern. For details, see https://www.pcre.org/current/doc/html/pcre2pattern.html

In regexes that contain a backslash (\s, \w, etc.), each backslash must be doubled, e.g. the regex "^(\w){3}\s+(\w)+$" must be written as
```
 "^(\\w){3}\\s+(\\w)+$"
```
Another gotcha is that string quotes ('"') must be escaped as well, e.g.
```
  "^([^\"]+?)\"(.+?)\"\\s+(.+?)$"
```
Subject is a string. It is the string/text that we want to do some operation on with the regex.
Capture is a list of the captures (empty if there are no capture groups). See below for more details about the captures.
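
As a small illustration of the escaping rule for Pattern described above, the following sketch writes the regex ^(\w){3}\s+(\w)+$ as a Picat string and tests it with regex/2. The subject string and the printed atoms are made up for the example, and it assumes that the regex module is on the Picat path as described in the previous section.

```picat
import regex.

main =>
    % Every backslash of the regex is doubled inside the Picat string.
    Pattern = "^(\\w){3}\\s+(\\w)+$",
    Subject = "abc   def",   % made-up subject for the example
    if regex(Pattern, Subject) then
        println(match)
    else
        println(no_match)
    end.
```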

Here are the functions/predicates:

regex(Pattern,Subject) True if the string Subject matches the regex Pattern.
regex(Pattern,Subject,Capture) True if the string Subject matches the regex Pattern. The list Capture will also contain the captures of the Pattern (if any).
regex_compile(Pattern) Compiles (caches) the regex Pattern to be used with regex_match/1-2.

Note: as of now, this is a global pattern, so only one pattern can be cached at a time.

regex_match(String)
regex_match(String,Capture)

Same as regex/2 and regex/3 respectively, except that they use the compiled pattern from regex_compile/1.
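
A minimal sketch of the compile-once, match-many usage that regex_compile/1 and regex_match/1 are meant for. The date-like pattern and the list of subjects are invented for the illustration.

```picat
import regex.

main =>
    % Compile (cache) the pattern once ...
    regex_compile("^\\d{4}-\\d{2}-\\d{2}$"),
    % ... and reuse it for several subjects (made-up examples).
    foreach (S in ["2024-01-31", "not a date", "1999-12-01"])
        if regex_match(S) then
            printf("%s: match%n", S)
        else
            printf("%s: no match%n", S)
        end
    end.
```

Since the compiled pattern is global, a later call to regex_compile/1 replaces the pattern used by all subsequent regex_match calls.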
regex_replace(Pattern,Replacement,Subject,Replaced)
Replaced = regex_replace(Pattern,Replacement,Subject)

Replaces all occurrences of Pattern in the Subject with the string Replacement. The output is the string Replaced. In Replacement one can use backreferences such as "$1", etc.

regex_replace2(Subject,Pattern,Replacement) = Replaced

This is a variant with the string Subject in the first position to simplify chaining of replacements.

Here's a simple example of categorizing characters into vowels and consonants:
```
Picat> regex_replace2("picat programming","[aeiou]","V").regex_replace2("[^aeiouV ]","C")=X
X = ['C','V','C','V','C',' ','C','C','V','C','C','V','C','C','V','C','C']
```
regex_replace_first(Pattern,Replacement,Subject,Replaced)
Replaced = regex_replace_first(Pattern,Replacement,Subject)

Replaces the first occurrence of Pattern in the Subject with the string Replacement. The output is the string Replaced. In Replacement one can use backreferences such as "$1", etc.

regex_replace_first2(Subject,Pattern,Replacement) = Replaced

This is a variant with the string Subject in the first position to simplify chaining of replacements.
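
The difference between the two replace families, and the use of backreferences in Replacement, can be sketched as follows. The input string and the pattern are made up; only the function forms of regex_replace and regex_replace_first described above are used.

```picat
import regex.

main =>
    Subject = "a=1 b=2",   % made-up input
    Pattern = "(\\w+)=(\\w+)",
    % Swap the two sides of every key=value pair using $1 and $2.
    All = regex_replace(Pattern, "$2=$1", Subject),
    % Same replacement, but only for the first occurrence.
    First = regex_replace_first(Pattern, "$2=$1", Subject),
    println(All),
    println(First).
```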
regex_find(Pattern,Subject,Captures)
regex_find(Pattern,Subject) = Captures

Finds the first occurrence of Pattern in Subject and outputs it in Captures. Note: The list Captures contains only the captured groups, not the "global" matched string.

regex_find_all(Pattern,Subject,Matches)
regex_find_all(Pattern,Subject) = Matches

The list Matches contains all occurrences of Pattern in the string Subject.

This tries to simulate Perl's
```
    while ($text =~ m!Pattern!gsm) {
     # ...
    }
```
For strings with newlines, the Pattern should start with "(?s)" or "(?sm)" for matching newlines. Note: The list Matches contains only the captured groups, not the "global" matched string.

regex_find_num(Pattern,Subject,Num,Matches)

The list Matches contains the first Num occurrences of Pattern in the string Subject. Note: The list Matches contains only the captured groups, not the "global" matched string.
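
A short sketch of the regex_find* family, assuming the signatures listed above. The text and the pattern are invented, and the exact shape of the printed lists is not shown here since it depends on the module.

```picat
import regex.

main =>
    Text = "x=1, y=22, z=333",   % made-up input
    Pattern = "(\\w)=(\\d+)",
    % All occurrences (captured groups only).
    All = regex_find_all(Pattern, Text),
    println(find_all=All),
    % Only the first two occurrences.
    regex_find_num(Pattern, Text, 2, FirstTwo),
    println(find_num=FirstTwo).
```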

Flags

The regexes are compiled in bp_pcre2.c without any PCRE2 option flags (except for regex_replace, which is set to replace all occurrences).

The flags (that we all know and love from Perl) are instead set inline in the pattern. The following are from https://www.pcre.org/current/doc/html/pcre2pattern.html#internaloptions:

- (?i)  for PCRE2_CASELESS
- (?m)  for PCRE2_MULTILINE
- (?n)  for PCRE2_NO_AUTO_CAPTURE
- (?s)  for PCRE2_DOTALL
- (?x)  for PCRE2_EXTENDED
- (?xx) for PCRE2_EXTENDED_MORE

The flags can be combined, e.g. (?sm) for combining PCRE2_MULTILINE and PCRE2_DOTALL.

Unsetting is done with "-", e.g. (?-sx). Setting and unsetting can also be combined, e.g. (?im-sx).
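
A minimal sketch of how an inline flag changes a match, using regex/2 and the (?i) flag from the list above. The subject and the printed terms are made up for the example.

```picat
import regex.

main =>
    Subject = "Hello, Picat!",   % made-up subject
    % No inline flag: PCRE2 matching is case sensitive by default.
    if regex("hello", Subject) then
        println(plain=match)
    else
        println(plain=no_match)
    end,
    % (?i) switches on PCRE2_CASELESS for the rest of the pattern.
    if regex("(?i)hello", Subject) then
        println(caseless=match)
    else
        println(caseless=no_match)
    end.
```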

Captures

Except for the predicates in the regex_find* family, the first element in Captures is the string from the first matched position to the last matched position. The rest of the elements are the individual grouping captures (if any).

For example, in the call

```
Picat> regex_match_capture("^(hakank?) was a Swedish (dancer|bassist|singer|dueller)$","hakan was a Swedish bassist",Capture)
```

the Capture is

```
[hakan was a Swedish bassist,hakan,bassist]
```

i.e.

```
Capture[1] = "hakan was a Swedish bassist"
Capture[2] = "hakan"
Capture[3] = "bassist"
```

This can be confusing since one often wants to ignore the first "global" capture. One way is to enumerate the captured fields and ignore that first capture, e.g.

```
Picat> regex_match_capture("^(hakank?) was a Swedish (dancer|bassist|singer|dueller)$",
         "hakan was a Swedish bassist",[_, Who, What])
  ->
   Who = "hakan"
   What = "bassist"
```

Note: To complicate the matter, the captures in regex_find/3 do not have the full matched string as the first element; there, All[1] (or Capture[1]) is the first matched group. This might be considered a feature...
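
The two capture conventions can be compared side by side with a sketch like the following. It uses only regex/3 and regex_find/3 as listed earlier; the pattern and the subject are made up for the illustration.

```picat
import regex.

main =>
    Pattern = "(\\w+)@(\\w+)",   % made-up pattern with two groups
    Subject = "user@example",    % made-up subject
    % regex/3: per the text above, the full match comes first, then the groups.
    if regex(Pattern, Subject, Capture) then
        println(regex=Capture)
    end,
    % regex_find/3: only the captured groups.
    if regex_find(Pattern, Subject, Found) then
        println(regex_find=Found)
    end.
```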

The C functions

The Picat definitions in lib/regex.pi call the C functions written in emu/bp_pcre2.c. Here is a short description of these, with the syntax for calling them from Picat:

bp.regex(Pattern,Subject)
bp.regex_capture(Pattern,Subject,Capture)
bp.regex_compile(Pattern)
bp.regex_match(Subject)
bp.regex_match_capture(Subject,Capture)
bp.regex_replace(Pattern,Replacement,Subject,Replaced)
bp.regex_find_matches(Pattern,Subject,Num,Matched).

Picat

For more about Picat see http://picat-lang.org and my Picat page: http://hakank.org/picat/
