rematch2

Match Regular Expressions with a Nicer ‘API’

A small wrapper on regular expression matching functions regexpr and gregexpr to return the results in tidy data frames.

Installation
Rematch vs rematch2
Usage
License

Installation

Stable version:

install.packages("rematch2")

Development version:

pak::pak("r-lib/rematch2")

Rematch vs rematch2

Note that rematch2 is not compatible with the original rematch package. There are at least three major changes:

The order of the arguments for the functions is different. In rematch2 the text vector is first, and pattern is second.
In the result, .match is the last column instead of the first.
rematch2 returns tibble data frames. See https://github.com/tidyverse/tibble.

Usage

First match

library(rematch2)

With capture groups:

dates <- c("2016-04-20", "1977-08-08", "not a date", "2016",
  "76-03-02", "2012-06-30", "2015-01-21 19:58")
isodate <- "([0-9]{4})-([0-1][0-9])-([0-3][0-9])"
re_match(text = dates, pattern = isodate)
#> # A tibble: 7 × 5
#>   ``    ``    ``    .text            .match    
#>   <chr> <chr> <chr> <chr>            <chr>     
#> 1 2016  04    20    2016-04-20       2016-04-20
#> 2 1977  08    08    1977-08-08       1977-08-08
#> 3 <NA>  <NA>  <NA>  not a date       <NA>      
#> 4 <NA>  <NA>  <NA>  2016             <NA>      
#> 5 <NA>  <NA>  <NA>  76-03-02         <NA>      
#> 6 2012  06    30    2012-06-30       2012-06-30
#> 7 2015  01    21    2015-01-21 19:58 2015-01-21

Named capture groups:

isodaten <- "(?<year>[0-9]{4})-(?<month>[0-1][0-9])-(?<day>[0-3][0-9])"
re_match(text = dates, pattern = isodaten)
#> # A tibble: 7 × 5
#>   year  month day   .text            .match    
#>   <chr> <chr> <chr> <chr>            <chr>     
#> 1 2016  04    20    2016-04-20       2016-04-20
#> 2 1977  08    08    1977-08-08       1977-08-08
#> 3 <NA>  <NA>  <NA>  not a date       <NA>      
#> 4 <NA>  <NA>  <NA>  2016             <NA>      
#> 5 <NA>  <NA>  <NA>  76-03-02         <NA>      
#> 6 2012  06    30    2012-06-30       2012-06-30
#> 7 2015  01    21    2015-01-21 19:58 2015-01-21

A slightly more complex example:

github_repos <- c(
    "metacran/crandb",
    "jeroenooms/[email protected]",
    "jimhester/covr#47",
    "hadley/dplyr@*release",
    "r-lib/remotes@550a3c7d3f9e1493a2ba",
    "/$&@R64&3"
)
owner_rx   <- "(?:(?<owner>[^/]+)/)?"
repo_rx    <- "(?<repo>[^/@#]+)"
subdir_rx  <- "(?:/(?<subdir>[^@#]*[^@#/]))?"
ref_rx     <- "(?:@(?<ref>[^*].*))"
pull_rx    <- "(?:#(?<pull>[0-9]+))"
release_rx <- "(?:@(?<release>[*]release))"

subtype_rx <- sprintf("(?:%s|%s|%s)?", ref_rx, pull_rx, release_rx)
github_rx  <- sprintf(
    "^(?:%s%s%s%s|(?<catchall>.*))$",
    owner_rx, repo_rx, subdir_rx, subtype_rx
)
re_match(text = github_repos, pattern = github_rx)
#> # A tibble: 6 × 9
#>   owner        repo      subdir ref          pull  release catchall .text .match
#>   <chr>        <chr>     <chr>  <chr>        <chr> <chr>   <chr>    <chr> <chr> 
#> 1 "metacran"   "crandb"  ""     ""           ""    ""      ""       meta… metac…
#> 2 "jeroenooms" "curl"    ""     "v0.9.3"     ""    ""      ""       jero… jeroe…
#> 3 "jimhester"  "covr"    ""     ""           "47"  ""      ""       jimh… jimhe…
#> 4 "hadley"     "dplyr"   ""     ""           ""    "*rele… ""       hadl… hadle…
#> 5 "r-lib"      "remotes" ""     "550a3c7d3f… ""    ""      ""       r-li… r-lib…
#> 6 ""           ""        ""     ""           ""    ""      "/$&@R6… /$&@… /$&@R…

All matches

Extract all names, and also first names and last names:

name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  "  Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
not <- re_match_all(notables, name_rex)
not
#> # A tibble: 2 × 4
#>   first     last      .text                                .match   
#>   <list>    <list>    <chr>                                <list>   
#> 1 <chr [2]> <chr [2]> "  Ben Franklin and Jefferson Davis" <chr [2]>
#> 2 <chr [1]> <chr [1]> "\tMillard Fillmore"                 <chr [1]>

not$first
#> [[1]]
#> [1] "Ben"       "Jefferson"
#> 
#> [[2]]
#> [1] "Millard"
not$last
#> [[1]]
#> [1] "Franklin" "Davis"   
#> 
#> [[2]]
#> [1] "Fillmore"
not$.match
#> [[1]]
#> [1] "Ben Franklin"    "Jefferson Davis"
#> 
#> [[2]]
#> [1] "Millard Fillmore"

Match positions

re_exec and re_exec_all are similar to re_match and re_match_all, but they also return match positions. These functions return match records. A match record has three components: match, start, end, and each component can be a vector. It is similar to a data frame in this respect.

pos <- re_exec(notables, name_rex)
pos
#> # A tibble: 2 × 4
#>   first            last             .text                           .match      
#>   <rmtch_rc>       <rmtch_rc>       <chr>                           <rmtch_rc>  
#> 1 <named list [3]> <named list [3]> "  Ben Franklin and Jefferson … <named list>
#> 2 <named list [3]> <named list [3]> "\tMillard Fillmore"            <named list>

Unfortunately R does not allow hierarchical data frames (i.e. a column of a data frame cannot be another data frame), but rematch2 defines some special classes and an $ operator, to make it easier to extract parts of re_exec and re_exec_all matches. You simply query the match, start or end part of a column:

pos$first$match
#> [1] "Ben"     "Millard"
pos$first$start
#> [1] 3 2
pos$first$end
#> [1] 5 8

re_exec_all is very similar, but these queries return lists, with arbitrary number of matches:

allpos <- re_exec_all(notables, name_rex)
allpos
#> # A tibble: 2 × 4
#>   first            last             .text                           .match      
#>   <rmtch_ll>       <rmtch_ll>       <chr>                           <rmtch_ll>  
#> 1 <named list [3]> <named list [3]> "  Ben Franklin and Jefferson … <named list>
#> 2 <named list [3]> <named list [3]> "\tMillard Fillmore"            <named list>

allpos$first$match
#> [[1]]
#> [1] "Ben"       "Jefferson"
#> 
#> [[2]]
#> [1] "Millard"
allpos$first$start
#> [[1]]
#> [1]  3 20
#> 
#> [[2]]
#> [1] 2
allpos$first$end
#> [[1]]
#> [1]  5 28
#> 
#> [[2]]
#> [1] 8

Name		Name	Last commit message	Last commit date
Latest commit History 108 Commits
.github		.github
R		R
man		man
tests		tests
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
DESCRIPTION		DESCRIPTION
LICENSE		LICENSE
LICENSE.md		LICENSE.md
NAMESPACE		NAMESPACE
NEWS.md		NEWS.md
README.Rmd		README.Rmd
README.md		README.md
codecov.yml		codecov.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Licenses found

Repository files navigation

rematch2

Installation

Rematch vs rematch2

Usage

First match

All matches

Match positions

License

About

Licenses found

Releases 3

Packages

Languages

License

Licenses found

r-lib/rematch2

Folders and files

Latest commit

History

Repository files navigation

rematch2

Installation

Rematch vs rematch2

Usage

First match

All matches

Match positions

License

About

Topics

Resources

License

Licenses found

Code of conduct

Stars

Watchers

Forks

Releases 3

Packages 0

Languages

Packages