Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add trim/0, ltrim/0 and rtrim/0 that trims leading and trailing whitespace #3056

Merged
merged 1 commit into from
Mar 20, 2024

Conversation

wader
Copy link
Member

@wader wader commented Mar 1, 2024

No description provided.

src/jv_unicode.c Outdated
@@ -118,3 +118,16 @@ int jvp_utf8_encode(int codepoint, char* out) {
assert(out - start == jvp_utf8_encode_length(codepoint));
return out - start;
}

// space codepoints for unicode basic latin and latin-1 supplement
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is same as what golang strings.TrimSpace/unicode.IsSpace considers whitespace https://pkg.go.dev/unicode#IsSpace

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that Go's strings.TrimSpace considers White_Space property as well.

`"\f"`,
`"\u000b"` (vertical tab),
`"\u0085"` (next line) and
`"\u00a0"` (no-break space). These are the whitespace characters in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this exhaustive for Unicode, or just Latin scripts in Unicode? If the latter, why not be more exhaustive?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is only whitespace from basic block and latin-1, so not exhaustive. Reason is mostly to match what other implementations do. There are quite a lot of other whitespace characters in other blocks, wikipedia has a good list https://en.wikipedia.org/wiki/Whitespace_character.

Reading about PCRE's \s-class it seem to match something similar.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rust trim/is_whitespace uses characters with White_Space property https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt maybe more resonable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should document that the list is not stable, that we may add to it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rust trim/is_whitespace uses characters with White_Space property https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt maybe more resonable?

I do like that better, yeah, especially in light of @itchyny's comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait for utf8proc to be included for upcase/downcase?

Copy link
Member Author

@wader wader Mar 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update to White_Space but i guess we can wait. Also update docs and tests.

src/builtin.c Outdated Show resolved Hide resolved
@itchyny
Copy link
Contributor

itchyny commented Mar 3, 2024

I found a list of languages on this function. I have no objection to name this trim not strip, because we already have ltrimstr. Does anyone want ltrim and rtrim as well?

@wader
Copy link
Member Author

wader commented Mar 4, 2024

I found a list of languages on this function. I have no objection to name this trim not strip, because we already have ltrimstr. Does anyone want ltrim and rtrim as well?

Personally mostly have needed trim, and i agree that also having ltrim/rtrim would fits well with ltrimstr/rtrimstr.

@wader wader force-pushed the trim branch 2 times, most recently from 48a7b8c to 913457b Compare March 6, 2024 15:04
@wader wader changed the title Add trim/0 that trims leading and trailing whitespace Add trim/0, ltrim/0 and rtrim/0 that trims leading and trailing whitespace Mar 6, 2024
@wader
Copy link
Member Author

wader commented Mar 6, 2024

Added ltrim/0 and rtrim/0

Copy link
Member

@emanuele6 emanuele6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

src/builtin.c Outdated Show resolved Hide resolved
src/builtin.c Outdated Show resolved Hide resolved
@wader
Copy link
Member Author

wader commented Mar 19, 2024

@emanuele6 thanks for review! maybe wait for one more apporval? thinking adding new functions might be good idea with more consensus

About waiting for utf8proc from #2547: i think we can merge this separately and fix that later

Copy link
Contributor

@itchyny itchyny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
After utf8proc is included later, let's see whether we can clean up (or get rid of) jv_unicode.c.

@wader
Copy link
Member Author

wader commented Mar 20, 2024

LGTM After utf8proc is included later, let's see whether we can clean up (or get rid of) jv_unicode.c.

Yeap, had a quick look at it a few week ago and it seemed like it would be easy, something like utf8proc_get_property(c)->category == UTF8PROC_CATEGORY_ZS

@wader wader merged commit be437ec into jqlang:master Mar 20, 2024
29 checks passed
@wader wader deleted the trim branch March 20, 2024 10:04
wader added a commit to wader/jaq that referenced this pull request Sep 23, 2024
Trims leading and trailing whitespace.

Was added to jq in jqlang/jq#3056
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants