Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix regular expression parsing #3244

Merged
merged 3 commits into from
Jul 19, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
28 changes: 21 additions & 7 deletions resources/RegexGrammar.pp
Original file line number Diff line number Diff line change
Expand Up @@ -46,13 +46,23 @@
%skip nl \n
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

\n can be in pcre class as well

Copy link
Contributor Author

@Seldaek Seldaek Jul 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I am unsure why this is here, I haven't experimented much, but it is rare to have actual newlines in regexes as they're usually escaped as \n, while this here matches a real newline.

I'll create a new issue for this to look at it later, together with proper support for the x modifier, because it's very much related. Edit: phpstan/phpstan#11360


// Character classes.
%token negative_class_ \[\^
%token class_ \[
%token _class \]
%token range \-
%token negative_class_ \[\^ -> class
%token class_ \[ -> class
%token class:posix_class \[:\^?[a-z]+:\]
%token class:class_ \[
%token class:_class_literal (?<=[^\\]\[|[^\\]\[\^)\]
%token class:_class \] -> default
%token class:range \-
%token class:escaped_end_class \\\]
// taken over from literals but class:character has \b support on top (backspace in character classes)
%token class:character \\([aefnrtb]|c[\x00-\x7f])
%token class:dynamic_character \\([0-7]{3}|x[0-9a-zA-Z]{2}|x{[0-9a-zA-Z]+})
%token class:character_type \\([CdDhHNRsSvVwWX]|[pP]{[^}]+})
%token class:literal \\.|.

// Internal options.
%token internal_option \(\?[\-+]?[imsx]\)
// See https://www.regular-expressions.info/refmodifiers.html
%token internal_option \(\?([imsxnJUX^]|xx)?-?([imsxnJUX^]|xx)\)

// Lookahead and lookbehind assertions.
%token lookahead_ \(\?=
Expand All @@ -77,6 +87,7 @@
%token nc:_named_capturing > -> default
%token nc:capturing_name .+?(?=(?<!\\)>)
%token non_capturing_ \(\?:
%token non_capturing_internal_option \(\?([imsxnJUX^]|xx)?-?([imsxnJUX^]|xx):
%token non_capturing_reset_ \(\?\|
%token atomic_group_ \(\?>
%token capturing_ \(
Expand Down Expand Up @@ -168,7 +179,7 @@
::negative_class_:: #negativeclass
| ::class_::
)
( <class_> | range() | literal() )+
( <range> | <_class_literal> )? ( <posix_class> | <class_> | range() | literal() | <escaped_end_class> )* <range>?
::_class::

#range:
Expand All @@ -183,15 +194,18 @@
| (
::named_capturing_:: <capturing_name> ::_named_capturing:: #namedcapturing
| ::non_capturing_:: #noncapturing
| non_capturing_internal_options() #noncapturing
| ::non_capturing_reset_:: #noncapturingreset
| ::atomic_group_:: #atomicgroup
| ::capturing_::
)
alternation() ::_capturing::

non_capturing_internal_options:
<non_capturing_internal_option>

literal:
<character>
| <range>
| <dynamic_character>
| <character_type>
| <anchor>
Expand Down
145 changes: 94 additions & 51 deletions src/Type/Php/RegexArrayShapeMatcher.php
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@
use Hoa\Compiler\Llk\TreeNode;
use Hoa\Exception\Exception;
use Hoa\File\Read;
use Nette\Utils\RegexpException;
use Nette\Utils\Strings;
use PhpParser\Node\Expr;
use PhpParser\Node\Name;
Expand All @@ -31,7 +32,11 @@
use function in_array;
use function is_int;
use function is_string;
use function rtrim;
use function sscanf;
use function str_replace;
use function strlen;
use function substr;
use const PREG_OFFSET_CAPTURE;
use const PREG_UNMATCHED_AS_NULL;

Expand Down Expand Up @@ -375,6 +380,13 @@ private function parseGroups(string $regex): ?array
self::$parser = Llk::load(new Read(__DIR__ . '/../../../resources/RegexGrammar.pp'));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ondrejmirtes Are we ok loading the file from here instead? Not sure what this does in terms of licensing, nor if there's a better place to put it. If so I'll delete the patch file and drop the hoa/regex requirement.

Copy link
Contributor

@staabm staabm Jul 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure this path will work in the packaged phpstan.phar - while I agree that having the RegexGrammar.pp in the phpstan-src repo might ease the contribution experience - as the patching is really cumbersome.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I completely agree moving from the patching as the patching is very hard to understand and the changes needed are more and more complex.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I got rid of it, we'll see what @ondrejmirtes says :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose to add the whole and unpatched RegexGrammar.pp as first/separate commit, then the patched/actual grammar and then the changes from this PR. This will hugely improve the review and git history.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, done 👍🏻

}

try {
Strings::match('', $regex);
} catch (RegexpException) {
// pattern is invalid, so let the RegularExpressionPatternRule report it
return null;
}
Seldaek marked this conversation as resolved.
Show resolved Hide resolved

try {
$ast = self::$parser->parse($regex);
Seldaek marked this conversation as resolved.
Show resolved Hide resolved
} catch (Exception) {
Expand Down Expand Up @@ -516,25 +528,37 @@ private function getQuantificationRange(TreeNode $node): array
$lastChild = $node->getChild($node->getChildrenNumber() - 1);
$value = $lastChild->getValue();

if ($value['token'] === 'n_to_m') {
if (sscanf($value['value'], '{%d,%d}', $n, $m) !== 2 || !is_int($n) || !is_int($m)) {
// normalize away possessive and lazy quantifier-modifiers
$token = str_replace(['_possessive', '_lazy'], '', $value['token']);
$value = rtrim($value['value'], '+?');

if ($token === 'n_to_m') {
if (sscanf($value, '{%d,%d}', $n, $m) !== 2 || !is_int($n) || !is_int($m)) {
throw new ShouldNotHappenException();
}

$min = $n;
$max = $m;
} elseif ($value['token'] === 'exactly_n') {
if (sscanf($value['value'], '{%d}', $n) !== 1 || !is_int($n)) {
} elseif ($token === 'n_or_more') {
if (sscanf($value, '{%d,}', $n) !== 1 || !is_int($n)) {
throw new ShouldNotHappenException();
}

$min = $n;
} elseif ($token === 'exactly_n') {
if (sscanf($value, '{%d}', $n) !== 1 || !is_int($n)) {
throw new ShouldNotHappenException();
}

$min = $n;
$max = $n;
} elseif ($value['token'] === 'zero_or_one') {
} elseif ($token === 'zero_or_one') {
$min = 0;
$max = 1;
} elseif ($value['token'] === 'zero_or_more') {
} elseif ($token === 'zero_or_more') {
$min = 0;
} elseif ($token === 'one_or_more') {
$min = 1;
}

return [$min, $max];
Expand Down Expand Up @@ -591,20 +615,8 @@ private function walkGroupAst(TreeNode $ast, TrinaryLogic &$isNonEmpty, TrinaryL
if ($literalValue !== null) {
if (Strings::match($literalValue, '/^\d+$/') === null) {
$isNumeric = TrinaryLogic::createNo();
}

if (!$inOptionalQuantification) {
$isNonEmpty = TrinaryLogic::createYes();
}
}

if ($ast->getValueToken() === 'character_type') {
if ($ast->getValueValue() === '\d') {
if ($isNumeric->maybe()) {
$isNumeric = TrinaryLogic::createYes();
}
} else {
$isNumeric = TrinaryLogic::createNo();
} elseif ($isNumeric->maybe()) {
$isNumeric = TrinaryLogic::createYes();
Seldaek marked this conversation as resolved.
Show resolved Hide resolved
}

if (!$inOptionalQuantification) {
Expand All @@ -613,32 +625,11 @@ private function walkGroupAst(TreeNode $ast, TrinaryLogic &$isNonEmpty, TrinaryL
}
}

if ($ast->getId() === '#range' || $ast->getId() === '#class') {
if ($isNumeric->maybe()) {
$allNumeric = null;
foreach ($children as $child) {
$literalValue = $this->getLiteralValue($child);

if ($literalValue === null) {
break;
}

if (Strings::match($literalValue, '/^\d+$/') === null) {
$allNumeric = false;
Seldaek marked this conversation as resolved.
Show resolved Hide resolved
break;
}

$allNumeric = true;
}

if ($allNumeric === true) {
$isNumeric = TrinaryLogic::createYes();
}
}

if (!$inOptionalQuantification) {
$isNonEmpty = TrinaryLogic::createYes();
}
// [^0-9] should not parse as numeric-string, and [^list-everything-but-numbers] is technically
// doable but really silly compared to just \d so we can safely assume the string is not numeric
// for negative classes
if ($ast->getId() === '#negativeclass') {
$isNumeric = TrinaryLogic::createNo();
}

foreach ($children as $child) {
Expand All @@ -653,13 +644,65 @@ private function walkGroupAst(TreeNode $ast, TrinaryLogic &$isNonEmpty, TrinaryL

private function getLiteralValue(TreeNode $node): ?string
{
if ($node->getId() === 'token' && $node->getValueToken() === 'literal') {
return $node->getValueValue();
if ($node->getId() !== 'token') {
return null;
}

// token is the token name from grammar without the namespace so literal and class:literal are both called literal here
$token = $node->getValueToken();
$value = $node->getValueValue();

if (in_array($token, ['literal', 'escaped_end_class'], true)) {
if (strlen($node->getValueValue()) > 1 && $value[0] === '\\') {
return substr($value, 1);
}

return $value;
}

// literal "-" in front/back of a character class like '[-a-z]' or '[abc-]', not forming a range
if ($token === 'range') {
return $value;
}

// literal "[" or "]" inside character classes '[[]' or '[]]'
if (in_array($token, ['class_', '_class_literal'], true)) {
return $value;
}

// character escape sequences, just return a fixed string
if (in_array($token, ['character', 'dynamic_character', 'character_type'], true)) {
if ($token === 'character_type' && $value === '\d') {
return '0';
}

return $value;
}

// [:digit:] and the like, more support coming later
if ($token === 'posix_class') {
if ($value === '[:digit:]') {
return '0';
}
if (in_array($value, ['[:alpha:]', '[:alnum:]', '[:upper:]', '[:lower:]', '[:word:]', '[:ascii:]', '[:print:]', '[:xdigit:]', '[:graph:]'], true)) {
return 'a';
}
if ($value === '[:blank:]') {
return " \t";
}
if ($value === '[:cntrl:]') {
return "\x00\x1F";
}
if ($value === '[:space:]') {
return " \t\r\n\v\f";
}
if ($value === '[:punct:]') {
return '!"#$%&\'()*+,\-./:;<=>?@[\]^_`{|}~';
}
}

// literal "-" outside of a character class like '~^((\\d{1,6})-)$~'
if ($node->getId() === 'token' && $node->getValueToken() === 'range') {
return $node->getValueValue();
if ($token === 'anchor' || $token === 'match_point_reset') {
return '';
}

return null;
Expand Down
2 changes: 1 addition & 1 deletion tests/PHPStan/Analyser/LegacyNodeScopeResolverTest.php
Original file line number Diff line number Diff line change
Expand Up @@ -8003,7 +8003,7 @@ public function dataPassedByReference(): array
'$arr',
],
[
'array{0?: string}',
'array<string>',
Seldaek marked this conversation as resolved.
Show resolved Hide resolved
'$matches',
],
[
Expand Down
75 changes: 61 additions & 14 deletions tests/PHPStan/Analyser/nsrt/preg_match_shapes.php
Original file line number Diff line number Diff line change
Expand Up @@ -127,28 +127,18 @@ function doUnknownFlags(string $s, int $flags): void {
assertType('array<array{string|null, int<-1, max>}|string|null>', $matches);
}

function doNonAutoCapturingModifier(string $s): void {
if (preg_match('/(?n)(\d+)/', $s, $matches)) {
// could be assertType('array{string}', $matches);
assertType('array<string>', $matches);
}
assertType('array<string>', $matches);
}

function doMultipleAlternativeCaptureGroupsWithSameNameWithModifier(string $s): void {
if (preg_match('/(?J)(?<Foo>[a-z]+)|(?<Foo>[0-9]+)/', $s, $matches)) {
// could be assertType('array{0: string, Foo: string, 1: string}', $matches);
assertType('array<string>', $matches);
assertType('array{0: string, Foo: numeric-string|non-empty-string, 1: non-empty-string, 2?: numeric-string}', $matches);
}
assertType('array<string>', $matches);
assertType('array{}|array{0: string, Foo: numeric-string|non-empty-string, 1: non-empty-string, 2?: numeric-string}', $matches);
}

function doMultipleConsecutiveCaptureGroupsWithSameNameWithModifier(string $s): void {
if (preg_match('/(?J)(?<Foo>[a-z]+)|(?<Foo>[0-9]+)/', $s, $matches)) {
// could be assertType('array{0: string, Foo: string, 1: string}', $matches);
assertType('array<string>', $matches);
assertType('array{0: string, Foo: numeric-string|non-empty-string, 1: non-empty-string, 2?: numeric-string}', $matches);
Seldaek marked this conversation as resolved.
Show resolved Hide resolved
}
assertType('array<string>', $matches);
assertType('array{}|array{0: string, Foo: numeric-string|non-empty-string, 1: non-empty-string, 2?: numeric-string}', $matches);
}

// https://github.com/hoaproject/Regex/issues/31
Expand Down Expand Up @@ -472,3 +462,60 @@ function (string $s): void {
assertType("array{string, non-empty-string}", $matches);
}
};

function bug11323(string $s): void {
if (preg_match('/([*|+?{}()]+)([^*|+[:digit:]?{}()]+)/', $s, $matches)) {
Seldaek marked this conversation as resolved.
Show resolved Hide resolved
assertType('array{string, non-empty-string, non-empty-string}', $matches);
}
if (preg_match('/\p{L}[[\]]+([-*|+?{}(?:)]+)([^*|+[:digit:]?{a-z}(\p{L})\a-]+)/', $s, $matches)) {
assertType('array{string, non-empty-string, non-empty-string}', $matches);
}
if (preg_match('{([-\p{L}[\]*|\x03\a\b+?{}(?:)-]+[^[:digit:]?{}a-z0-9#-k]+)(a-z)}', $s, $matches)) {
assertType('array{string, non-empty-string, non-empty-string}', $matches);
}
if (preg_match('{(\d+)(?i)insensitive((?x-i)case SENSITIVE here(?i:insensitive non-capturing group))}', $s, $matches)) {
assertType('array{string, numeric-string, non-empty-string}', $matches);
}
if (preg_match('{([]] [^]])}', $s, $matches)) {
assertType('array{string, non-empty-string}', $matches);
}
if (preg_match('{([[:digit:]])}', $s, $matches)) {
assertType('array{string, numeric-string}', $matches);
}
if (preg_match('{([\d])(\d)}', $s, $matches)) {
assertType('array{string, numeric-string, numeric-string}', $matches);
}
if (preg_match('{([0-9])}', $s, $matches)) {
assertType('array{string, numeric-string}', $matches);
}
if (preg_match('{(\p{L})(\p{P})(\p{Po})}', $s, $matches)) {
assertType('array{string, non-empty-string, non-empty-string, non-empty-string}', $matches);
}
if (preg_match('{(a)??(b)*+(c++)(d)+?}', $s, $matches)) {
assertType('array{string, string, string, non-empty-string, non-empty-string}', $matches);
}
if (preg_match('{(.\d)}', $s, $matches)) {
assertType('array{string, non-empty-string}', $matches);
}
if (preg_match('{(\d.)}', $s, $matches)) {
assertType('array{string, non-empty-string}', $matches);
}
if (preg_match('{(\d\d)}', $s, $matches)) {
assertType('array{string, numeric-string}', $matches);
}
if (preg_match('{(.(\d))}', $s, $matches)) {
assertType('array{string, non-empty-string, numeric-string}', $matches);
}
if (preg_match('{((\d).)}', $s, $matches)) {
assertType('array{string, non-empty-string, numeric-string}', $matches);
}
if (preg_match('{(\d([1-4])\d)}', $s, $matches)) {
assertType('array{string, numeric-string, numeric-string}', $matches);
}
if (preg_match('{(x?([1-4])\d)}', $s, $matches)) {
assertType('array{string, non-empty-string, numeric-string}', $matches);
}
if (preg_match('{([^1-4])}', $s, $matches)) {
assertType('array{string, non-empty-string}', $matches);
}
}
8 changes: 8 additions & 0 deletions tests/PHPStan/Analyser/nsrt/preg_match_shapes_php80.php
Original file line number Diff line number Diff line change
Expand Up @@ -11,3 +11,11 @@ function doOffsetCaptureWithUnmatchedNull(string $s): void {
}
assertType('array{}|array{array{string|null, int<-1, max>}, array{non-empty-string|null, int<-1, max>}, array{non-empty-string|null, int<-1, max>}, array{non-empty-string|null, int<-1, max>}}', $matches);
}

function doNonAutoCapturingModifier(string $s): void {
if (preg_match('/(?n)(\d+)/', $s, $matches)) {
// should be assertType('array{string}', $matches);
assertType('array{string, numeric-string}', $matches);
}
assertType('array{}|array{string, numeric-string}', $matches);
}
Loading