Utf16 strings codepoint iteration and appending #91

Closed
ceztko opened this issue Jun 15, 2022 · 4 comments

ceztko commented Jun 15, 2022

Two important features are missing: the ability to iterate directly over the codepoints of UTF-16 strings, and the ability to append codepoints to existing UTF-16 encoded strings. For iterating codepoints, one implementation can be found in the ICU documentation [1][2].

[1] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/utf16_8h.html#a844bb48486904fdca40c8b883e9c80ee
[2] https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/utf16_8h.html#ae98a64ae0f42bc6ad4179293c3638be4
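
For reference, codepoint iteration over UTF-16 comes down to recognizing surrogate pairs. The following is a minimal sketch, not the ICU or utfcpp implementation. `next_code_point` is a hypothetical name, and like ICU's `U16_NEXT` it passes unpaired surrogates through unchanged rather than reporting an error.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes the code point starting at s[i] and advances i.
// Precondition: i < s.size(). Unpaired surrogates are returned unchanged.
inline char32_t next_code_point(const std::u16string& s, std::size_t& i)
{
    char32_t c = s[i++];
    // A lead surrogate (0xD800..0xDBFF) followed by a trail surrogate
    // (0xDC00..0xDFFF) encodes one code point above 0xFFFF.
    if (c >= 0xD800u && c <= 0xDBFFu && i < s.size()) {
        const char32_t trail = s[i];
        if (trail >= 0xDC00u && trail <= 0xDFFFu) {
            ++i;
            c = 0x10000u + ((c - 0xD800u) << 10) + (trail - 0xDC00u);
        }
    }
    return c;
}
```

A caller would loop with `for (std::size_t i = 0; i < s.size(); )` and call `next_code_point(s, i)` in the body, since the function advances the index itself.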

@ceztko ceztko changed the title Utf16 strings codepoint iteration Utf16 strings codepoint iteration and appending Jun 15, 2022
Author

ceztko commented Jun 15, 2022

For appending codepoints to existing utf16 strings I am currently testing the following methods:

    template <typename word_iterator>
    word_iterator append16(uint32_t cp, word_iterator result)
    {
        if (!utf8::internal::is_code_point_valid(cp))
            throw invalid_code_point(cp);

        if (cp < 0x10000u) {                    // one word
            *(result++) = static_cast<uint16_t>(cp);
        }
        else {                                  // two words
            uint32_t cp_1 = cp - 0x10000u;
            *(result++) = static_cast<uint16_t>(cp_1 / 0x400u + 0xd800u);
            *(result++) = static_cast<uint16_t>(cp_1 % 0x400u + 0xdc00u);
        }

        return result;
    }

    namespace unchecked
    {
        template <typename word_iterator>
        word_iterator append16(uint32_t cp, word_iterator result)
        {
            if (cp < 0x10000u) {                    // one word
                *(result++) = static_cast<uint16_t>(cp);
            }
            else {                                  // two words
                uint32_t cp_1 = cp - 0x10000u;
                *(result++) = static_cast<uint16_t>(cp_1 / 0x400u + 0xd800u);
                *(result++) = static_cast<uint16_t>(cp_1 % 0x400u + 0xdc00u);
            }

            return result;
        }
    }

    inline void append(char32_t cp, std::u16string& s)
    {
        append16(uint32_t(cp), std::back_inserter(s));
    }
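
To try the encoding step in isolation, here is a minimal self-contained sketch of the same arithmetic, without utfcpp's validity check or exception type. `append_utf16` is a hypothetical name, not part of the library, and the sketch assumes `cp` is already a valid Unicode scalar value.

```cpp
#include <iterator>
#include <string>

// Hypothetical stand-alone variant of the unchecked append16 above.
// Assumes cp is a valid scalar value (<= 0x10FFFF and not a surrogate).
template <typename OutputIt>
OutputIt append_utf16(char32_t cp, OutputIt out)
{
    if (cp < 0x10000u) {
        // Basic Multilingual Plane: one 16-bit code unit.
        *out++ = static_cast<char16_t>(cp);
    } else {
        // Supplementary plane: split the 20-bit offset into a surrogate pair.
        const char32_t v = cp - 0x10000u;
        *out++ = static_cast<char16_t>(0xD800u + (v >> 10));    // lead surrogate
        *out++ = static_cast<char16_t>(0xDC00u + (v & 0x3FFu)); // trail surrogate
    }
    return out;
}
```

With a `std::u16string s`, calling `append_utf16(char32_t(0x1F600), std::back_inserter(s))` appends the two code units 0xD83D and 0xDE00.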

Owner

nemtrif commented Dec 29, 2022

Planned for release 4.0. Thanks for the proposal.

@nemtrif nemtrif self-assigned this Dec 29, 2022
Author

ceztko commented Feb 13, 2023

Thank you. In the meantime, I noticed that it is quite easy to iterate over codepoints in utf16 strings using existing facilities; I did this in podofo. Could you also add a validate_next-like function that reads from utf16 content?

nemtrif added a commit that referenced this issue Jun 25, 2023
Support for appending codepoints to existing utf16 encoded strings.

See #91
nemtrif added a commit that referenced this issue Oct 21, 2023
* Redefined and renamed types for code units.

* Remove -Wsign-conversion from test builds.

* find_invalid and is_valid that work with C-style strings.

* Lifted the C++11 requirement for some functions
 that take std::string as an argument.

* Support for C++20 u8string

Issue #89

* Update test docker image to 4.0.0

* Update Dockerfile to run tests with a recent gcc compiler.

* Make some internal helper functions non-template

* Add append16 function

Support for appending codepoints to existing utf16 encoded strings.

See #91

* next16

* Tests and documentation for next16

* Rewrite CMakeLists

Drop the existing CMake structure and write the new one from scratch. The root CMakeLists.txt is used for installing the package without building and running tests. Testing is done via a separate CMakeLists.txt in the tests directory.

* Remove "samples" directory.

The content of that file is already in the documentation.

* Update README.md

Restructure the reference, add installation instructions, toc, other minor changes
Owner

nemtrif commented Oct 22, 2023

Fixed in release 4.0.0

@nemtrif nemtrif closed this as completed Oct 22, 2023