Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: make ipv4 regex faster #73

Merged
merged 6 commits into from
Jun 7, 2024
Merged

perf: make ipv4 regex faster #73

merged 6 commits into from
Jun 7, 2024

Conversation

kurtextrem
Copy link
Contributor

@kurtextrem kurtextrem commented Oct 20, 2023

All credits for this regex to https://stackoverflow.com/a/36760050.

The regex is both more correct and faster (https://esbench.com/bench/6532596e7ff73700a4debb6a). Examples:
previous regex recognized "01.01.01.01" or "1.1.000.1" as correct ipv4 regex, but those aren't valid as per https://datatracker.ietf.org/doc/html/rfc5954#section-4.1.

I made this PR as draft, because tests are failing due to expected normalization of the ip (https://github.com/fastify/fast-uri/blob/77b659c7853730bb07e5b888b5136c4c5a1aca86/test/parse.test.js#L178C27-L178C41), but should an IP be normalized that is not valid?

Checklist

All credits for this regex to https://stackoverflow.com/a/36760050. 

The regex is both more correct and faster (https://esbench.com/bench/6532596e7ff73700a4debb6a).
Examples:
previous regex recognized "01.01.01.01" or "1.1.000.1" as correct ipv4 regex, but those aren't valid as per https://datatracker.ietf.org/doc/html/rfc5954#section-4.1.
@kurtextrem kurtextrem marked this pull request as draft October 20, 2023 11:15
@Uzlopak
Copy link
Contributor

Uzlopak commented Oct 20, 2023

Hi Kurt,

Func fact: Actually "01.01.01.01" is a valid ip. the 0 as prefix means octal number. But node does not support octals as IPv4 segments. :D

BTT:

I seems, that we only need the match and not sub groups. So you could use ?: to ignore the subgroups, which should have better performance.

But I find it strange that you use \b, which means word boundary. So you could do '100. 100. 100. 100'?
Also \.?\b){4} seems wrong. Doesnt it then pass also '100 100 100 100', because '.' is optionally and also '100.100.100.100.'?

@kurtextrem
Copy link
Contributor Author

kurtextrem commented Oct 20, 2023

Func fact: Actually "01.01.01.01" is a valid ip. the 0 as prefix means octal number. But node does not support octals as IPv4 segments. :D

Not according to the URI spec, which explicitly prohibits this: https://datatracker.ietf.org/doc/html/rfc3986#section-7.4

These additional IP address formats are not allowed in the URI syntax due to differences between platform implementations.

I know e.g. Windows will interpret it as octet number though, but I guess "fast-uri" is meant to parse a URI (so the mentioned spec applies).


The author on SO mentions this:

I've also omitted the ?: non-capturing group part as we don't really care about the captured items, they would not be captured either way if we didn't have a full-match in the first place.

... but good catch! I've updated the PR

\b is used to prevent a trailing dot:

^(?:(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])(\.(?!$)|$)){4}$
It uses the negative lookahead (?!) to remove the case where the ip might end with a .

just much shorter and faster (as it does not involve lookaheads):

You can replace "(.(?!$)|$)" with ".?\b" to make it shorter.

Hands down, I'd also have no idea how to come up with that replacement. Not even regexp-tree's optimizer does it. I'd have to dig in deeper why this works.

@kurtextrem
Copy link
Contributor Author

Coming from this thread: ota-meshi/eslint-plugin-regexp#659

/^(?:25[0-5]|(?:2[0-4]|1\d|[1-9]|)\d)(?:\.(?:25[0-5]|(?:2[0-4]|1\d|[1-9]|)\d)){3}$/

is now the fastest (at least in my microbenchmark) and most readable one. The linked thread pretty much explains why the negative lookahead and the word-boundry 'hack' (?) are the same and why this regex is now faster while being equivalent.

@Eomm
Copy link
Member

Eomm commented Nov 17, 2023

Did #74 superseded this PR?

@zekth
Copy link
Member

zekth commented Apr 17, 2024

@kurtextrem should we close this one?

@gurgunday
Copy link
Member

The regex is both more correct and faster (https://esbench.com/bench/6532596e7ff73700a4debb6a). Examples:
previous regex recognized "01.01.01.01" or "1.1.000.1" as correct ipv4 regex, but those aren't valid as per https://datatracker.ietf.org/doc/html/rfc5954#section-4.1.

Func fact: Actually "01.01.01.01" is a valid ip. the 0 as prefix means octal number. But node does not support octals as IPv4 segments. :D

Not according to the URI spec, which explicitly prohibits this: https://datatracker.ietf.org/doc/html/rfc3986#section-7.4

What about this claim? So in the end is 01.01.01.01 correct or not?

@kurtextrem
Copy link
Contributor Author

I updated the benchmark: https://esbench.com/bench/6532596e7ff73700a4debb6a
this seems still to be the fastest one -> #73 (comment) (called "3" in the test), 4 is the currently used one.
image

Depending on what we want, we should pick 3 (regex that matches the spec) or keep 4 (regex that retains the current semantics of matching the spec-invalid IP)

@gurgunday
Copy link
Member

Seems relevant

colinhacks/zod#3413

@kurtextrem
Copy link
Contributor Author

nice find. that one is the fastest now - the question regarding whether we want to become spec compatible or not remains (the zod one also doesn't match the ones I mentioned)

@gurgunday
Copy link
Member

gurgunday commented Apr 18, 2024

Okay, so 3 is spec compliant and faster than current?

Seems like an easy win

Can you make that the PR? (It already is, can you resolve conflicts?)

We can investigate colinhacks/zod#3413 later imo, and I suspect others will not agree to not being correct spec wise

@kurtextrem
Copy link
Contributor Author

kurtextrem commented Apr 18, 2024

Sorry, I expressed it a bit complicated. Zod's is the fastest one and also does not match 01.01.01.01 and also not 1.1.000.1, which is good, because those should not be matched as per spec.

I've updated to the one used by zod now. So, if we want to match IPs from the spec, let's drop the failing test that uses a non-spec IP (https://github.com/fastify/fast-uri/blob/a85e8941a3c539dd2d16f9b59cb723a64ed663bb/test/parse.test.js#L178C27-L178C41)?

@gurgunday
Copy link
Member

gurgunday commented Apr 18, 2024

I like it! What do others think?

Definitely SemVer major but note that it'll be faster and still spec compliant (RFC 3968)

@fastify/libraries, @mcollina

@Uzlopak
Copy link
Contributor

Uzlopak commented Apr 18, 2024

What about this claim? So in the end is 01.01.01.01 correct or not?

Well, yes and no. prefixing with 0 means octal notation. So for example 09 should be invalid.
And then, nodejs itself does not support octal notation. Not even long notation.

Only decimal notation.

@zekth
Copy link
Member

zekth commented Apr 19, 2024

I think it's good, we should update the tests then

@gurgunday gurgunday changed the base branch from main to next April 19, 2024 07:11
Signed-off-by: Jacob Groß <[email protected]>
@kurtextrem kurtextrem marked this pull request as ready for review April 19, 2024 09:34
@gurgunday
Copy link
Member

gurgunday commented Jun 4, 2024

Can you comment this out and say it's not valid? We now parse it correctly to 10.10.0.10 but it's no longer compatible...

'//10.10.000.10',

@kurtextrem
Copy link
Contributor Author

done

@zekth
Copy link
Member

zekth commented Jun 5, 2024

Please update the lint

@gurgunday gurgunday mentioned this pull request Jun 6, 2024
2 tasks
Copy link
Member

@zekth zekth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm

@gurgunday gurgunday merged commit 787a904 into fastify:next Jun 7, 2024
20 checks passed
@gurgunday
Copy link
Member

Thanks @kurtextrem

@kurtextrem
Copy link
Contributor Author

Thank you for the collab!

jsumners added a commit that referenced this pull request Jun 23, 2024
* Update for v5 (#80)

* update

* up

* workflows: update and pin node LTS versions (#81)

* update workflow

* start from 16

* perf: make ipv4 regex faster (#73)

* perf: make ipv4 regex faster

All credits for this regex to https://stackoverflow.com/a/36760050. 

The regex is both more correct and faster (https://esbench.com/bench/6532596e7ff73700a4debb6a).
Examples:
previous regex recognized "01.01.01.01" or "1.1.000.1" as correct ipv4 regex, but those aren't valid as per https://datatracker.ietf.org/doc/html/rfc5954#section-4.1.

* avoid capture groups

* update to fastest regexp variant

* comment out 10.10.000.1110

* lint

---------

Signed-off-by: Jacob Groß <[email protected]>
Co-authored-by: Gürgün Dayıoğlu <[email protected]>

---------

Signed-off-by: Jacob Groß <[email protected]>
Co-authored-by: Gürgün Dayıoğlu <[email protected]>
Co-authored-by: Jacob Groß <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants