Undefined behavior for combining marks #1

deglm006 · 2019-12-03T03:58:54Z

It is unclear what the expected behavior should be when the input contains combining marks, as demonstrated here: https://github.com/deglm006/disemvowel-in-rust/commit/0a9e738c2f9589e46c871dabc860b8e68b92742a

The naïve approach of just removing any vowel character doesn't seem right, since this could leave us with floating accent marks or accents on characters we would not expect them on. I think this is just a minute detail, though, and for the most part it can be ignored.
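To make the concern concrete, here is a minimal sketch of the naïve approach. The function name `disemvowel_naive` and the ASCII-only vowel set are assumptions for illustration, not necessarily what the lab solution looks like.

```rust
/// Naive disemvowel: drop every `char` that is an ASCII vowel.
/// (Hypothetical helper; the vowel set is an assumption.)
fn disemvowel_naive(s: &str) -> String {
    s.chars()
        // Combining marks are separate `char`s, so they pass through untouched.
        .filter(|c| !matches!(c.to_ascii_lowercase(), 'a' | 'e' | 'i' | 'o' | 'u'))
        .collect()
}
```

With the decomposed input `"cafe\u{301}s"` (an `e` followed by U+0301 COMBINING ACUTE ACCENT), this returns `"cf\u{301}s"`: the `e` is gone, but the accent survives and now attaches to the `f`. The precomposed form `"caf\u{e9}s"` loses only the `a`, because U+00E9 is a single non-ASCII `char`.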

NicMcPhee · 2020-11-10T18:42:53Z

I think we'll probably never actually address this, TBH, but I'm going to leave it here for now just in case someone has a clever thought about it.

Ultimately the question of what constitutes a vowel and how vowels are represented via glyphs in various languages is quite complex, and well beyond the scope of this lab (or course). So we'll just stick to our simplistic notion of vowel-ness and move on.

NicMcPhee · 2022-05-18T16:21:38Z

I just learned through this video that the crate unicode_segmentation allows us to iterate over the graphemes of a string, which are bundles of characters (including combining marks) that collectively form user-perceived characters. (See Unicode® Standard Annex #29 "Unicode Text Segmentation" for the gory details.) I think that we could use this to ensure that a vowel doesn't get separated from its diacritical marks.

The problem is that a grapheme is returned as a &str, which it has to be, since it's a sequence of bytes holding all the elements that form that grapheme. How would you decide that such a thing was a vowel? If the "primary" character (I think I'm being very Latin alphabet centric here) is the first char of the grapheme, then we could extract it and see if it's a vowel. If it is, then we drop that entire grapheme.
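A rough sketch of this "check the primary character" idea, using only the standard library. The `clusters` helper is a simplified stand-in for real grapheme segmentation: it is an assumption for illustration and only recognizes marks in the Combining Diacritical Marks block (U+0300 to U+036F). With the crate, `s.graphemes(true)` from `unicode_segmentation::UnicodeSegmentation` would take its place. All function names here are hypothetical.

```rust
/// Assumption: only U+0300..=U+036F counts as a combining mark here.
/// Real segmentation (UAX #29) covers far more cases.
fn is_combining_mark(c: char) -> bool {
    ('\u{300}'..='\u{36f}').contains(&c)
}

/// Simplified stand-in for a grapheme iterator: each cluster is a base
/// `char` plus any combining marks that immediately follow it.
fn clusters(s: &str) -> Vec<&str> {
    let mut out = Vec::new();
    let mut start = 0;
    for (i, c) in s.char_indices() {
        // A new cluster begins at every char that is not a combining mark.
        if i > start && !is_combining_mark(c) {
            out.push(&s[start..i]);
            start = i;
        }
    }
    if start < s.len() {
        out.push(&s[start..]);
    }
    out
}

fn is_plain_vowel(c: char) -> bool {
    matches!(c.to_ascii_lowercase(), 'a' | 'e' | 'i' | 'o' | 'u')
}

/// Drop a whole cluster when its first `char` is a vowel.
fn disemvowel_by_first_char(s: &str) -> String {
    clusters(s)
        .into_iter()
        .filter(|g| !g.chars().next().map_or(false, is_plain_vowel))
        .collect()
}
```

Under these assumptions, `disemvowel_by_first_char("cafe\u{301}s")` returns `"cfs"`: the `e` and its accent are removed together. A precomposed `é` (U+00E9) is still left alone, since its first `char` is not one of the plain vowels.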

Or, perhaps more simply, we break things into graphemes and only remove graphemes that are exactly "regular" vowels, i.e., we ignore graphemes that have combining marks (accents, etc.).
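This simpler alternative can be sketched without building clusters at all: peek at the next `char` and remove a vowel only when no combining mark follows it. As before, the names are hypothetical, and treating only U+0300 to U+036F as combining marks is an assumption that a real grapheme iterator would not need.

```rust
/// Assumption: only U+0300..=U+036F counts as a combining mark here.
fn is_combining_mark(c: char) -> bool {
    ('\u{300}'..='\u{36f}').contains(&c)
}

fn is_plain_vowel(c: char) -> bool {
    matches!(c.to_ascii_lowercase(), 'a' | 'e' | 'i' | 'o' | 'u')
}

/// Remove a vowel only when it stands alone, i.e. when it is not
/// followed by a combining mark. Accented vowels are kept intact.
fn disemvowel_bare_vowels_only(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    let mut chars = s.chars().peekable();
    while let Some(c) = chars.next() {
        let has_mark = chars.peek().map_or(false, |&next| is_combining_mark(next));
        if is_plain_vowel(c) && !has_mark {
            continue; // a bare vowel: drop it
        }
        out.push(c);
    }
    out
}
```

Here `disemvowel_bare_vowels_only("cafe\u{301}s")` returns `"cfe\u{301}s"`: the bare `a` is removed, while the accented `e` stays together with its mark, so no accent is ever left floating.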
