-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Catastrofic backtracking fix #426
base: main
Are you sure you want to change the base?
Conversation
… a PREG_BACKTRACK_LIMIT_ERROR and check result of preg_replace to be sure we do not set content or description to null
I also tested the updated regex again with a modified version of the feed item, see: |
I see this problem with your link. |
$this->description = preg_replace('!(<*\s*[^>]*)(src=)(.?)(\/[^\/])!','\1 src=\3'.$host.'\4', $this->description ); | ||
} | ||
if (property_exists($this, 'content') && !is_null($this->content)){ | ||
$this->content = preg_replace('!(<\s*[^>]*)(href=|src=)(.?)(\/[^\/])!','\1\2\3'.$host.'\4', $this->content) ?? $this->content; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have a lot of nested tags.
Change to:
$this->content = preg_replace('!(href=|src=)(.?)(\/[^\/])!','\1\2'.$host.'\3', $this->content) ?? $this->content;
Similarly for "$this->description"
This will fix error for preg_replace
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll write the most correct regular expression a little later.
Wait a bit. I have a lot to do now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I already fixed the regex in my PR. The cycles went down from over 5.8 million to 10.000 by removing the *
after the first <
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is also necessary to exclude the processing of "href" in tags <code>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point, maybe the regex should only search inside <a>
and <img>
tags or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is most correct regular expression:
preg_replace('~(<\s*[^>]*)(href=|src=)(.?)(\/[^\/])(?!(.(?!<code))*<\/code>)~','\1\2\3'.$host.'\4', $this->content)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe the regex should only search inside
<a>
and<img>
tags
I'm not 100% sure about this, but I'll think about it. In addition, it is still necessary to replace relative links like href="xxx..."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if analyze only <a>
and <img>
, don’t need to change them in <code>
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I modified the function to include all the other possible replacements for relative links.
Can you test my changes and make them in this PR?
All replacements can be moved to a separate function since there are more of them.
All links in https://go.dev/blog/feed.atom are replaced correctly.
protected function setHostInContent(string $host = null): void
{
if (is_null($host)) {
return;
}
// Replaced links like href="/aaa/bbb.xxx"
if (property_exists($this, 'content') && !is_null($this->content)) {
$this->content = preg_replace('~(<\s*[^>]*)(href=|src=)(.?)(\/[^\/])(?!(.(?!<code))*<\/code>)~','\1\2\3'.$host.'\4', $this->content) ?? $this->content;
}
if (property_exists($this, 'description') && !is_null($this->description)) {
$this->description = preg_replace('~(<\s*[^>]*)(href=|src=)(.?)(\/[^\/])(?!(.(?!<code))*<\/code>)~','\1\2\3'.$host.'\4', $this->description) ?? $this->description;
}
$itemFullLink = $this->getLink();
if (property_exists($this, 'link') && !is_null($itemFullLink)) {
$itemLink = implode("/", array_slice(explode("/", $itemFullLink), 0, -1))."/";
if (property_exists($this, 'content') && !is_null($this->content)){
// Replaced links like href="#aaa/bbb.xxx"
$this->content = preg_replace('~(<\s*[^>]*)(href=|src=)(.?)(#)(?!(.(?!<code))*<\/code>)~','\1\2\3'.$itemFullLink.'\4', $this->content) ?? $this->content;
// Replaced links like href="aaa/bbb.xxx"
$this->content = preg_replace('~(<\s*[^>]*)(href=|src=)(.?)(\w+\b)(?![:])(?!(.(?!<code))*<\/code>)~','\1\2\3'.$itemLink.'\4', $this->content) ?? $this->content;
}
if (property_exists($this, 'description') && !is_null($this->description)) {
// Replaced links like href="#aaa/bbb.xxx"
$this->description = preg_replace('~(<\s*[^>]*)(href=|src=)(.?)(#)(?!(.(?!<code))*<\/code>)~','\1\2\3'.$itemFullLink.'\4', $this->description) ?? $this->description;
// Replaced links like href="aaa/bbb.xxx"
$this->description = preg_replace('~(<\s*[^>]*)(href=|src=)(.?)(\w+\b)(?![:])(?!(.(?!<code))*<\/code>)~','\1\2\3'.$itemLink.'\4', $this->description) ?? $this->description;
}
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cleanest code:
protected function setHostInContent(string $host = null): void
{
if (is_null($host)) {
return;
}
// Replaced links like href="/aaa/bbb.xxx"
$pattern = '(<\s*[^>]*)(href=|src=)(.?)(\/[^\/])(?!(.(?!<code))*<\/code>)';
$this->pregReplaceInProperty('content', $pattern, '\1\2\3'.$host.'\4');
$this->pregReplaceInProperty('description', $pattern, '\1\2\3'.$host.'\4');
$itemFullLink = $this->getLink();
$itemLink = implode("/", array_slice(explode("/", $itemFullLink), 0, -1))."/";
// Replaced links like href="#aaa/bbb.xxx"
$pattern = '(<\s*[^>]*)(href=|src=)(.?)(#)(?!(.(?!<code))*<\/code>)';
$this->pregReplaceInProperty('content', $pattern, '\1\2\3'.$itemFullLink.'\4');
$this->pregReplaceInProperty('description', $pattern, '\1\2\3'.$itemFullLink.'\4');
// Replaced links like href="aaa/bbb.xxx"
$pattern = '(<\s*[^>]*)(href=|src=)(.?)(\w+\b)(?![:])(?!(.(?!<code))*<\/code>)';
$this->pregReplaceInProperty('content', $pattern, '\1\2\3'.$itemLink.'\4');
$this->pregReplaceInProperty('description', $pattern, '\1\2\3'.$itemLink.'\4');
}
public function pregReplaceInProperty(string $property, string $pattern, string $replacement): void
{
if (property_exists($this, $property) && !is_null($this->{$property})) {
$this->{$property} = preg_replace('~'.$pattern.'~', $replacement, $this->{$property}) ?? $this->{$property};
}
}
Fixes #425
Update
setHostInContent
regexes to reduce steps and hopefully prevent aPREG_BACKTRACK_LIMIT_ERROR
and check result ofpreg_replace
to be sure we do not set content or description tonull
.Also refactor
setHostInContent
andgetHostFromLink
a bit for readability.cc @IgorA100