Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ICU-22876 C++ UnicodeSet/USet easy item iteration #3120

Merged
merged 1 commit into from
Sep 11, 2024

Conversation

markusicu
Copy link
Member

@markusicu markusicu commented Aug 28, 2024

Modern C++ iteration over the code points, ranges, or strings, or all of the elements, in a C++ UnicodeSet or C USet.

API approved by ICU-TC on 2024sep05.
Minor change sep09 during implementation testing:
Changed CodePointRange::iterator::operator++() from void to returning iterator & (usual signature)

  • UnicodeSet.ranges() returns a USetRanges which has begin() & end() which return USetRangeIterator which works like an idiomatic C++ iterator.
  • Parallel for UnicodeSet.codePoints() & UnicodeSet.strings().
  • For iterating over all elements (code points & strings) in C++, by popular demand, I added begin() & end() directly to class UnicodeSet, similar to Java UnicodeSet which is an Iterable. The header-only wrapper still has separate range and iterator classes.
  • For strings, adding new C API to get strings without copying their contents.
  • Also, UnicodeSet.strings() wants to be called that in C++ just like in Java, and just like codePoints() & ranges(). That then required me to rename the field strings to strings_.
  • Operates on C API USet, so that C API users should be able to use it. C++ UnicodeSet.ranges() and it siblings are trivial convenience functions.
  • As discussed some time ago, this is written as a header-only API. It is the first of its kind.
  • As such, I am defining a U_HEADER_NESTED_NAMESPACE and a U_HEADER_ONLY_NAMESPACE which put the new API into either icu::header when compiling user code or icu::internal when compiling ICU library code, so that even if we instantiate the API inside ICU it won't be the same as what a caller sees.
    • Exception: This does not work on Windows where we explicitly export APIs, because the library does not export the member functions that return the icu::header types.
    • We also don't need this on Windows because we explicitly export APIs.
    • So on Windows the namespace is always icu::header, and the UnicodeSet member functions are exported, but the USet wrapper types are not.
  • Naming: I am using USetXyz class names because these classes actually operate on a USet, the C wrapper around a C++ UnicodeSet, and they are defined in unicode/uset.h. This would also allow future additions of UnicodeSetXyz that don't go through the C API, if we ever wanted that.

Sample real, working call sites:

    UnicodeSet set(u"[abcçカ🚴]", errorCode);
    for (auto [start, end] : set.ranges()) {
        printf("set.range U+%04lx..U+%04lx\n", (long)start, (long)end);
    }

    UnicodeSet set(u"[abcçカ🚴{}{abc}{de}]", errorCode);
    for (auto el : set) {
        std::string u8;
        printf("set.string length %ld \"%s\"\n", (long)el.length(), el.toUTF8String(u8).c_str());
    }

Output:

   utility {
      UnicodeSetTest {
         TestUnicodeSetRanges {
set.range U+0061..U+0063
set.range U+00e7..U+00e7
set.range U+30ab..U+30ab
set.range U+1f6b4..U+1f6b4
         } OK:   TestUnicodeSetRanges 
      } OK:   UnicodeSetTest 
   } OK:   utility 

   utility {
      UnicodeSetTest {
         TestUnicodeSetElements {
set.string length 1 "a"
set.string length 1 "b"
set.string length 1 "c"
set.string length 1 "ç"
set.string length 1 "カ"
set.string length 2 "🚴"
set.string length 0 ""
set.string length 3 "abc"
set.string length 2 "de"
         } OK:   TestUnicodeSetElements 
      } OK:   UnicodeSetTest 
   } OK:   utility 
Checklist
  • Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22876
  • Required: The PR title must be prefixed with a JIRA Issue number.
  • Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • Required: Each commit message must be prefixed with a JIRA Issue number.
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

@jira-pull-request-webhook

This comment was marked as outdated.

@markusicu markusicu changed the title ICU-22876 first code for UnicodeSet.ranges() ICU-22876 C++ UnicodeSet/USet easy item iteration Aug 28, 2024
@markusicu
Copy link
Member Author

Hi everyone:

  • I think I have a complete set of new APIs now for this purpose, including the popular "iterate over all elements" thing.
  • I have updated the PR description, please read.
  • As noted there, I am considering changing the prefixes of the new classes from UnicodeSetXyz to USetXyz, since they are really operating on a C USet. Ok? (One could counter-argue that in C++ we use UnicodeSet. Or that few people will actually type these class names.)
  • Please take a look at the sample code in usettest.cpp. (No real unit test code yet.) Looks nice and colloquial?
  • Please take a look at the API signatures.
  • Please take a look at the NOTEs and TODOs.
  • @eggrobin I would like to ask you in particular to look at the implementation code as well.

@markusicu
Copy link
Member Author

@richgillam do you care whether the prefix of the new classes, which operate on USet but are in C++ and also created by UnicodeSet, is USetXyz or UnicodeSetXyz? For example,

  • USetCodePointIterator
  • UnicodeSetCodePointIterator

Right now, I am leaning towards the shorter version which seems slightly more accurate.
I should really send the approval email today...

@aheninger opinion?

@markusicu
Copy link
Member Author

I used Craig as a sounding board for the class prefix naming question. What tipped him over the top is that the new types are in uset.h, so he would expect them to start with "USet". I like that. Will rename.

@markusicu
Copy link
Member Author

Renamed. This would also allow future additions of UnicodeSetXyz that don't go through the C API, if we ever wanted that.

@richgillam
Copy link
Contributor

@richgillam do you care whether the prefix of the new classes, which operate on USet but are in C++ and also created by UnicodeSet, is USetXyz or UnicodeSetXyz? For example,

  • USetCodePointIterator
  • UnicodeSetCodePointIterator

Yes, I like USet here, for the reasons you outline.

richgillam
richgillam previously approved these changes Aug 30, 2024
Copy link
Contributor

@richgillam richgillam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, that's a lot of code to do something that seems like it should be fairly simple!

I'm not sure I completely followed everything, but it all seems to make sense. But I'll defer to the other reviewers, all of whom understand what's going on here better than I do.

@markusicu
Copy link
Member Author

Wow, that's a lot of code to do something that seems like it should be fairly simple!

I agree!
It's all because C++ didn't create a real iterator pattern and instead stuck with implementing iterators that look and quack like pointers.

@markusicu
Copy link
Member Author

@eggrobin FYI commit ed5e8e3 addresses the ICU-TC API proposal review

@markusicu markusicu marked this pull request as ready for review September 9, 2024 20:56
@markusicu
Copy link
Member Author

I think I am all done.
It seems to all work, the API reflects what has been approved, with one minor exception noted in the PR description.
API docs all there, although minimal for boilerplate C++ idioms.
Tested via range-based for loops as well as explicit calls of each new function.

It might still make sense to review this one commit at a time.
Once approved I will squash, then merge.

@markusicu
Copy link
Member Author

Dear reviewers, it would be lovely to get this reviewed, feedback addressed, approved, and merged... fairly quickly, to unblock @roubert's follow-up work with the header-only collator predicates. I would be happy to jump into a video meeting to talk things through.

richgillam
richgillam previously approved these changes Sep 10, 2024
Copy link
Contributor

@richgillam richgillam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like all that's changed since the last time I reviewed this is documentation and tests. The new documentation and tests look good.

Copy link
Contributor

@echeran echeran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Contributor

@richgillam richgillam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New documentation also looks good.

@jira-pull-request-webhook
Copy link

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@markusicu markusicu merged commit 37b4149 into unicode-org:main Sep 11, 2024
91 checks passed
@markusicu markusicu deleted the uniset-cpp-iter branch September 11, 2024 00:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants