Undefined behavior for combining marks #1

Open
deglm006 opened this issue Dec 3, 2019 · 2 comments

deglm006 commented Dec 3, 2019

It is unclear what the expected behavior should be when the input contains combining marks, as demonstrated here: https://github.com/deglm006/disemvowel-in-rust/commit/0a9e738c2f9589e46c871dabc860b8e68b92742a

The naïve approach of just removing any vowel character doesn't seem right, since this could leave us with floating accent marks or with accents on characters we would not expect them on. I think this is just a minute detail, though, and for the most part it can be ignored.
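To make the concern concrete, here is a minimal sketch of the naïve approach. It assumes "vowel" means the ASCII letters a, e, i, o, u in either case; the function name `disemvowel_naive` is made up for illustration and is not taken from the lab code.

```rust
/// Naive disemvowel: drop every `char` that is an ASCII vowel.
/// Assumption: "vowel" means a, e, i, o, u in either case.
fn disemvowel_naive(s: &str) -> String {
    s.chars()
        .filter(|c| !matches!(c.to_ascii_lowercase(), 'a' | 'e' | 'i' | 'o' | 'u'))
        .collect()
}
```

With the input `"cafe\u{301}"` (an `e` followed by U+0301 COMBINING ACUTE ACCENT), this returns `"cf\u{301}"`: the `e` is gone, but its accent survives and now attaches to the `f`.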

@NicMcPhee
Contributor

I think we'll probably never actually address this, TBH, but I'm going to leave it here for now just in case someone has a clever thought about it.

Ultimately the question of what constitutes a vowel and how vowels are represented via glyphs in various languages is quite complex, and well beyond the scope of this lab (or course). So we'll just stick to our simplistic notion of vowel-ness and move on.

@NicMcPhee
Contributor

I just learned through this video that the crate unicode_segmentation allows us to iterate over the graphemes of a string, which are bundles of characters (including combining marks) that collectively form user-perceived characters. (See Unicode® Standard Annex #29, "Unicode Text Segmentation", for the gory details.) I think that we could use this to ensure that a vowel doesn't get separated from its diacritical marks.

The problem is that a grapheme is returned as a &str, which it has to be, since it's a sequence of bytes containing all the elements that form that grapheme. How would you decide that such a thing is a vowel? If the "primary" character (I think I'm being very Latin-alphabet-centric here) is the first char of the grapheme, then we could extract it and check whether it's a vowel. If it is, then we drop the entire grapheme.
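A rough sketch of the "check the first char, drop the whole grapheme" idea follows. To stay free of dependencies it does not use the crate; the hypothetical `clusters` helper is only a stand-in that glues chars from the Combining Diacritical Marks block (U+0300 to U+036F) onto the preceding char. With the crate, `UnicodeSegmentation::graphemes(s, true)` would take its place. The vowel set and all function names are assumptions.

```rust
/// Assumption: "vowel" means the ASCII letters a, e, i, o, u in either case.
fn is_vowel(c: char) -> bool {
    matches!(c.to_ascii_lowercase(), 'a' | 'e' | 'i' | 'o' | 'u')
}

/// Rough stand-in for grapheme segmentation: a base `char` plus any following
/// chars from the Combining Diacritical Marks block (U+0300..=U+036F).
/// Real grapheme clusters (UAX #29) cover far more cases than this.
fn clusters(s: &str) -> Vec<&str> {
    let mut out = Vec::new();
    let mut start = 0;
    for (i, c) in s.char_indices() {
        let is_mark = ('\u{0300}'..='\u{036F}').contains(&c);
        // A non-mark char after the first position starts a new cluster.
        if i > start && !is_mark {
            out.push(&s[start..i]);
            start = i;
        }
    }
    if start < s.len() {
        out.push(&s[start..]);
    }
    out
}

/// Drop every cluster whose first `char` is a vowel, marks and all.
fn disemvowel_by_first_char(s: &str) -> String {
    clusters(s)
        .into_iter()
        .filter(|g| !g.chars().next().map_or(false, is_vowel))
        .collect()
}
```

Here `"cafe\u{301}"` becomes `"cf"`, so the accent disappears together with its vowel. A precomposed `"caf\u{e9}"` is left as `"cf\u{e9}"`, because the first char of that grapheme is `é` itself rather than `e`; catching that case would need a decomposition step first, which this sketch does not attempt.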

Or, perhaps more simply, we break things into graphemes and only remove graphemes that are exactly "regular" vowels, i.e., we leave untouched any grapheme that has combining marks (accents, etc.).
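The simpler variant might look something like this. As before, `clusters` is a hypothetical, dependency-free stand-in for the crate's grapheme iterator that only understands the Combining Diacritical Marks block (U+0300 to U+036F), and the ASCII vowel set is an assumption.

```rust
/// Rough stand-in for grapheme segmentation: a base `char` plus any following
/// chars from the Combining Diacritical Marks block (U+0300..=U+036F).
fn clusters(s: &str) -> Vec<&str> {
    let mut out = Vec::new();
    let mut start = 0;
    for (i, c) in s.char_indices() {
        let is_mark = ('\u{0300}'..='\u{036F}').contains(&c);
        // A non-mark char after the first position starts a new cluster.
        if i > start && !is_mark {
            out.push(&s[start..i]);
            start = i;
        }
    }
    if start < s.len() {
        out.push(&s[start..]);
    }
    out
}

/// True only when the cluster is exactly one `char` and that char is an
/// ASCII vowel (assumed vowel set: a, e, i, o, u in either case).
fn is_plain_vowel(g: &str) -> bool {
    let mut chars = g.chars();
    match (chars.next(), chars.next()) {
        (Some(c), None) => matches!(c.to_ascii_lowercase(), 'a' | 'e' | 'i' | 'o' | 'u'),
        _ => false,
    }
}

/// Remove plain vowels; clusters carrying combining marks pass through intact.
fn disemvowel_plain_only(s: &str) -> String {
    clusters(s)
        .into_iter()
        .filter(|g| !is_plain_vowel(g))
        .collect()
}
```

With this version `"cafe\u{301}"` becomes `"cfe\u{301}"`: the plain `a` is removed, while the accented `e` stays together with its mark, so nothing is left floating.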
