Add trim/0, ltrim/0 and rtrim/0 that trims leading and trailing whitespace #3056
src/jv_unicode.c
Outdated
@@ -118,3 +118,16 @@ int jvp_utf8_encode(int codepoint, char* out) {
  assert(out - start == jvp_utf8_encode_length(codepoint));
  return out - start;
}

// space codepoints for unicode basic latin and latin-1 supplement
This is the same as what Go's strings.TrimSpace/unicode.IsSpace considers whitespace: https://pkg.go.dev/unicode#IsSpace
Note that Go's strings.TrimSpace considers the White_Space property as well.
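For readers following along, here is a rough sketch of the Latin-only set being discussed: the eight whitespace codepoints in Basic Latin and Latin-1 Supplement that the Go documentation for unicode.IsSpace lists for the Latin-1 range. The name `is_latin1_space` is hypothetical and is not taken from the patch.

```c
#include <stdbool.h>

// Hypothetical helper (not from the patch): reports whether a codepoint is
// one of the whitespace characters in Basic Latin or Latin-1 Supplement.
bool is_latin1_space(int c) {
  return (c >= 0x0009 && c <= 0x000D) // "\t", "\n", "\u000b", "\f", "\r"
      || c == 0x0020                  // space
      || c == 0x0085                  // next line
      || c == 0x00A0;                 // no-break space
}
```

A check like this treats whitespace outside Latin-1, such as U+2003 (em space), as ordinary characters, which is the limitation raised in the review.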
docs/content/manual/manual.yml
Outdated
`"\f"`,
`"\u000b"` (vertical tab),
`"\u0085"` (next line) and
`"\u00a0"` (no-break space). These are the whitespace characters in
Is this exhaustive for Unicode, or just Latin scripts in Unicode? If the latter, why not be more exhaustive?
It is only whitespace from the Basic Latin and Latin-1 Supplement blocks, so it is not exhaustive. The reason is mostly to match what other implementations do. There are quite a lot of whitespace characters in other blocks; Wikipedia has a good list: https://en.wikipedia.org/wiki/Whitespace_character.
Reading about PCRE's \s class, it seems to match something similar.
Rust's trim/is_whitespace uses characters with the White_Space property (https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt). Maybe that is more reasonable?
Maybe we should document that the list is not stable, that we may add to it.
> Rust's trim/is_whitespace uses characters with the White_Space property (https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt). Maybe that is more reasonable?
I do like that better, yeah, especially in light of @itchyny's comment.
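To make the White_Space suggestion concrete, a predicate along the following lines would cover the codepoints that PropList.txt lists under White_Space at the time of writing. This is only a sketch: the name `has_white_space_property` is hypothetical, and the list should be re-checked against PropList.txt because it can change between Unicode versions.

```c
#include <stdbool.h>

// Hypothetical helper: reports whether a codepoint has the Unicode
// White_Space property, following the ranges listed in PropList.txt.
bool has_white_space_property(int c) {
  return (c >= 0x0009 && c <= 0x000D) // control characters tab..carriage return
      || c == 0x0020                  // SPACE
      || c == 0x0085                  // NEXT LINE
      || c == 0x00A0                  // NO-BREAK SPACE
      || c == 0x1680                  // OGHAM SPACE MARK
      || (c >= 0x2000 && c <= 0x200A) // EN QUAD..HAIR SPACE
      || c == 0x2028                  // LINE SEPARATOR
      || c == 0x2029                  // PARAGRAPH SEPARATOR
      || c == 0x202F                  // NARROW NO-BREAK SPACE
      || c == 0x205F                  // MEDIUM MATHEMATICAL SPACE
      || c == 0x3000;                 // IDEOGRAPHIC SPACE
}
```

Note that U+200B (zero width space) does not carry the White_Space property, so a predicate like this would not treat it as trimmable.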
Wait for utf8proc to be included for upcase/downcase?
Update to White_Space, but I guess we can wait. Also update docs and tests.
I found a list of languages on this function. I have no objection to naming this trim rather than strip, because we already have ltrimstr. Does anyone want ltrim and rtrim as well?
Personally mostly have needed |
48a7b8c to 913457b
Added |
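As a sketch of how trim, ltrim and rtrim can share one scan over a UTF-8 string: the code below computes the byte range that remains after trimming, with flags selecting the side. Everything here is illustrative and not taken from the patch. `trim_range`, `decode_utf8`, `space_pred` and `demo_is_space` are hypothetical names, the input is assumed to be valid NUL-terminated UTF-8, and `demo_is_space` is a deliberately small stand-in for a real whitespace predicate.

```c
#include <stddef.h>
#include <string.h>

// Caller-supplied whitespace test, e.g. one based on the White_Space property.
typedef int (*space_pred)(int codepoint);

// Stand-in predicate for the example only: a handful of whitespace codepoints.
int demo_is_space(int c) {
  return c == ' ' || c == '\t' || c == '\n' || c == 0x00A0 || c == 0x3000;
}

// Decodes one codepoint from s (assumed valid, NUL-terminated UTF-8),
// stores it in *cp and returns the number of bytes consumed.
static size_t decode_utf8(const unsigned char *s, int *cp) {
  if (s[0] < 0x80) { *cp = s[0]; return 1; }
  if ((s[0] & 0xE0) == 0xC0) {
    *cp = ((s[0] & 0x1F) << 6) | (s[1] & 0x3F);
    return 2;
  }
  if ((s[0] & 0xF0) == 0xE0) {
    *cp = ((s[0] & 0x0F) << 12) | ((s[1] & 0x3F) << 6) | (s[2] & 0x3F);
    return 3;
  }
  *cp = ((s[0] & 0x07) << 18) | ((s[1] & 0x3F) << 12) |
        ((s[2] & 0x3F) << 6) | (s[3] & 0x3F);
  return 4;
}

// Finds the byte range [*start, *end) of s left after trimming. The flags
// select which side to trim, so ltrim, rtrim and trim share one scan.
void trim_range(const char *s, space_pred is_space, int left, int right,
                size_t *start, size_t *end) {
  const unsigned char *u = (const unsigned char *)s;
  size_t len = strlen(s), i = 0;
  int cp;

  *start = 0;
  if (left) {
    // Skip leading whitespace codepoints.
    while (i < len) {
      size_t n = decode_utf8(u + i, &cp);
      if (!is_space(cp)) break;
      i += n;
    }
    *start = i;
  }

  *end = len;
  if (right) {
    // Remember where the last non-whitespace codepoint ends.
    size_t last = *start;
    while (i < len) {
      size_t n = decode_utf8(u + i, &cp);
      i += n;
      if (!is_space(cp)) last = i;
    }
    *end = last;
  }
}
```

Calling `trim_range(s, demo_is_space, 1, 1, &start, &end)` corresponds to trim, `1, 0` to ltrim and `0, 1` to rtrim. Scanning forward for the right-hand side avoids having to decode UTF-8 backwards, at the cost of visiting every codepoint.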
Looks good to me
@emanuele6 thanks for the review! Maybe wait for one more approval? I'm thinking that adding new functions might be a good idea to do with more consensus. About waiting for utf8proc from #2547: I think we can merge this separately and fix that later.
LGTM
After utf8proc is included later, let's see whether we can clean up (or get rid of) jv_unicode.c.
Yep, I had a quick look at it a few weeks ago and it seemed like it would be easy, something like
Trims leading and trailing whitespace. Was added to jq in jqlang/jq#3056