-
Notifications
You must be signed in to change notification settings - Fork 29.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
buffer: add isAscii method #46046
buffer: add isAscii method #46046
Conversation
Is this really checking for valid ASCII (<=0x7F) or something else? That should help determine the name. There's really no way to validate a single byte encoding that uses all 256 byte values (e.g. latin1). |
@mscdex It is literally checking for valid ASCII (<=0x7F), referencing the simdutf repository. Happy new year btw! |
Since basically wrapping
I'd go with |
76af099
to
2fee6ca
Compare
Maybe this is pointing out something obvious, but |
After reading through #45823, I still don't really understand the motivation for this API. IIRC I guess most of the times I implement a check for a subset of ASCII, it's a strict subset, so most of my implementations would still have to iterate over the string. |
2fee6ca
to
d81eb34
Compare
@tniessen There are lots of packages that checks & validates ascii only input by either iterating through all inputs, or by using regexp. My reasoning for adding this is to provide a Ideally, I'd put this to
You are right, but with this pull request, we will have |
There is no way to validate this. Latin1 is a single byte encoding with values from 0x00 to 0xFF. |
Why is that? From my understanding:
If we are aiming for latin1-only, we can easily calculate it by checking if it contains only single-byte characters and is within the range of 0x00 to 0xFF. |
Here is a candidate implementation: function isLatin1() {
return true;
} A single-byte value is always within the range of 0 to 255, so if there are no invalid value for |
Am I missing something? I think you are looking from an ArrayBuffer perspective, whereas I'm looking from a Naive way of understanding if it's a valid latin1 character is:
For example, "Yağiz" is not latin1, because "ğ" has a representation of 2 bytes. If you don't have a string input, but have an ArrayBuffer input, you have to check trailing code points, if multiple bytes are used to construct an unicode character. An example implementation a.k.a |
Yeah I'm looking from an
Hum I'm not sure I follow, all chars have a representation of one-byte in latin1,
That's assuming the |
This was one of the reasons I wanted to introduce a new module for encoding. Maybe that discussion should continue. |
Do you have some performance-critical examples? I see that
My concern is that all these added APIs are not really specific to node, and "fast & native" in this case also means non-portable. |
I tried looking for any; even though several are not widespread, developers might develop their solution (js based) and have yet to release a package.
I agree but I don't think there is any other way to endorse developers to get access to performant functions (as far as I know). Initially, I wanted this function to live inside |
Appreciate any reviews. cc @nodejs/cpp-reviewers @nodejs/performance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Based on the lack of evidence and our guidelines, I guess the only semi-strong argument for including this in core is
which is not exactly true because it might as well be implemented outside of core, but most people won't do it. Partially, perhaps, because inputs tend to be small so shaving off a few microseconds here and there might not matter. And even when it's in core, using it comes at the high price of losing portability, being unable to run the same code in browsers, bun, deno, CF workers, ... (None of my comments are blocking; those who create/approve the PR take responsibility for it.) |
I'm +0 on this and don't see the big motivation either for the reasons @tniessen mentioned above, but I acknowledge others may have a use case where this being performant and in core is an advantage and I don't understand this space enough to block. |
So, this is basically the code below but SIMDified? (def a word) function isAscii(b) {
for (let i = 0, n = b.length; i < n; i++)
if (b[i] > 127)
return false;
return true;
} I don't expect performance to be all that different until the input starts getting large (10s of kilobytes, maybe even 100s) and it's also... dunno, kind of niche and trivial? Useful if you're implementing Kermit, otherwise not so much. |
Landed in babe6d7 |
Given that most |
PR-URL: #46046 Reviewed-By: Antoine du Hamel <[email protected]> Reviewed-By: Matteo Collina <[email protected]>
Notable changes: buffer: * (SEMVER-MINOR) add isAscii method (Yagiz Nizipli) #46046 deps: * upgrade npm to 9.4.0 (npm team) #46353 esm: * leverage loaders when resolving subsequent loaders (Maël Nison) #43772 fs: * (SEMVER-MINOR) add statfs() functions (Colin Ihrig) #46358 src,lib: * (SEMVER-MINOR) add constrainedMemory API for process (theanarkh) #46218 test_runner: * (SEMVER-MINOR) add reporters (Moshe Atlow) #45712 v8: * (SEMVER-MINOR) support gc profile (theanarkh) #46255 vm: * (SEMVER-MINOR) expose cachedDataRejected for vm.compileFunction (Anna Henningsen) #46320 PR-URL: #46455
Notable changes: buffer: * (SEMVER-MINOR) add isAscii method (Yagiz Nizipli) #46046 deps: * upgrade npm to 9.4.0 (npm team) #46353 esm: * leverage loaders when resolving subsequent loaders (Maël Nison) #43772 fs: * (SEMVER-MINOR) add statfs() functions (Colin Ihrig) #46358 src,lib: * (SEMVER-MINOR) add constrainedMemory API for process (theanarkh) #46218 test_runner: * (SEMVER-MINOR) add reporters (Moshe Atlow) #45712 v8: * (SEMVER-MINOR) support gc profile (theanarkh) #46255 vm: * (SEMVER-MINOR) expose cachedDataRejected for vm.compileFunction (Anna Henningsen) #46320 PR-URL: #46455
PR-URL: #46046 Reviewed-By: Antoine du Hamel <[email protected]> Reviewed-By: Matteo Collina <[email protected]>
Notable Changes buffer: * add isAscii method (Yagiz Nizipli) #46046 fs: * add statfs() functions (Colin Ihrig) #46358 src,lib: * add constrainedMemory API for process (theanarkh) #46218 v8: * support gc profile (theanarkh) #46255 vm: * expose cachedDataRejected for vm.compileFunction (Anna Henningsen) #46320 PR-URL: TDB
Notable Changes buffer: * add isAscii method (Yagiz Nizipli) #46046 fs: * add statfs() functions (Colin Ihrig) #46358 src,lib: * add constrainedMemory API for process (theanarkh) #46218 v8: * support gc profile (theanarkh) #46255 vm: * expose cachedDataRejected for vm.compileFunction (Anna Henningsen) #46320 PR-URL: #46920
Notable Changes: buffer: * add isAscii method (Yagiz Nizipli) #46046 fs: * add statfs() functions (Colin Ihrig) #46358 src,lib: * add constrainedMemory API for process (theanarkh) #46218 v8: * support gc profile (theanarkh) #46255 vm: * expose cachedDataRejected for vm.compileFunction (Anna Henningsen) #46320 PR-URL: #46920
Notable Changes: buffer: * (SEMVER-MINOR) add isAscii method (Yagiz Nizipli) #46046 doc,lib,src,test: * rename --test-coverage (Colin Ihrig) #46017 fs: * (SEMVER-MINOR) add statfs() functions (Colin Ihrig) #46358 src,lib: * (SEMVER-MINOR) add constrainedMemory API for process (theanarkh) #46218 test_runner: * add initial code coverage support (Colin Ihrig) #46017 * (SEMVER-MINOR) add reporters (Moshe Atlow) #45712 v8: * (SEMVER-MINOR) support gc profile (theanarkh) #46255 vm: * (SEMVER-MINOR) expose cachedDataRejected for vm.compileFunction (Anna Henningsen) #46320 PR-URL: #46920
PR-URL: #46046 Reviewed-By: Antoine du Hamel <[email protected]> Reviewed-By: Matteo Collina <[email protected]>
Notable changes: buffer: * (SEMVER-MINOR) add isAscii method (Yagiz Nizipli) #46046 doc,lib,src,test: * rename --test-coverage (Colin Ihrig) #46017 fs: * (SEMVER-MINOR) add statfs() functions (Colin Ihrig) #46358 src,lib: * (SEMVER-MINOR) add constrainedMemory API for process (theanarkh) #46218 test_runner: * add initial code coverage support (Colin Ihrig) #46017 * (SEMVER-MINOR) add reporters (Moshe Atlow) #45712 v8: * (SEMVER-MINOR) support gc profile (theanarkh) #46255 vm: * (SEMVER-MINOR) expose cachedDataRejected for vm.compileFunction (Anna Henningsen) #46320 PR-URL: https://github.com/nodejs/node/pull/46920/commits
Notable changes: buffer: * (SEMVER-MINOR) add isAscii method (Yagiz Nizipli) #46046 doc,lib,src,test: * rename --test-coverage (Colin Ihrig) #46017 fs: * (SEMVER-MINOR) add statfs() functions (Colin Ihrig) #46358 src,lib: * (SEMVER-MINOR) add constrainedMemory API for process (theanarkh) #46218 test_runner: * add initial code coverage support (Colin Ihrig) #46017 * (SEMVER-MINOR) add reporters (Moshe Atlow) #45712 v8: * (SEMVER-MINOR) support gc profile (theanarkh) #46255 vm: * (SEMVER-MINOR) expose cachedDataRejected for vm.compileFunction (Anna Henningsen) #46320 PR-URL: #46920
Notable changes: buffer: * (SEMVER-MINOR) add isAscii method (Yagiz Nizipli) #46046 doc,lib,src,test: * rename --test-coverage (Colin Ihrig) #46017 fs: * (SEMVER-MINOR) add statfs() functions (Colin Ihrig) #46358 src,lib: * (SEMVER-MINOR) add constrainedMemory API for process (theanarkh) #46218 test_runner: * add initial code coverage support (Colin Ihrig) #46017 * (SEMVER-MINOR) add reporters (Moshe Atlow) #45712 v8: * (SEMVER-MINOR) support gc profile (theanarkh) #46255 vm: * (SEMVER-MINOR) expose cachedDataRejected for vm.compileFunction (Anna Henningsen) #46320 PR-URL: #46920
Notable changes: buffer: * (SEMVER-MINOR) add isAscii method (Yagiz Nizipli) #46046 doc,lib,src,test: * rename --test-coverage (Colin Ihrig) #46017 fs: * (SEMVER-MINOR) add statfs() functions (Colin Ihrig) #46358 src,lib: * (SEMVER-MINOR) add constrainedMemory API for process (theanarkh) #46218 test_runner: * add initial code coverage support (Colin Ihrig) #46017 * (SEMVER-MINOR) add reporters (Moshe Atlow) #45712 v8: * (SEMVER-MINOR) support gc profile (theanarkh) #46255 vm: * (SEMVER-MINOR) expose cachedDataRejected for vm.compileFunction (Anna Henningsen) #46320 PR-URL: #46920
This is the second pull request on my node::encoding proposal.
This pull request will expose
require('buffer').isAscii
to validate if a buffer contains ascii-encoded data. For example:require('buffer').isAscii((new TextEncoder()).encode('Yağız'))
will return false, sinceğ
is not ascii/latin1 but UTF-8.I'm adding the
notable-change
label due to adding a new API.cc @nodejs/buffer