The naïve approach of just removing any vowel character doesn't seem right, since this could leave us with floating accent marks or accents on characters we would not expect them on. I think this is just a minor detail, though, and for the most part can be ignored.
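To make the concern concrete, here is a minimal sketch of the naïve approach. The names `is_ascii_vowel` and `disemvowel_naive` are hypothetical and not taken from the repository, and "vowel" is assumed to mean the ASCII letters a, e, i, o, u in either case.

```rust
/// Hypothetical helper: true for the ASCII vowels a, e, i, o, u (either case).
fn is_ascii_vowel(c: char) -> bool {
    matches!(c.to_ascii_lowercase(), 'a' | 'e' | 'i' | 'o' | 'u')
}

/// Naïve disemvowel (assumed shape, not the lab's actual code):
/// drop every `char` that is an ASCII vowel and keep everything else,
/// including any combining marks that followed a removed vowel.
fn disemvowel_naive(s: &str) -> String {
    s.chars().filter(|c| !is_ascii_vowel(*c)).collect()
}
```

With the input "cafe\u{301}" (an "e" followed by a combining acute accent), the "e" is removed but U+0301 survives, so the accent ends up attached to the "f". A precomposed "é" (U+00E9) is a single `char` that is not an ASCII vowel, so it is left alone entirely.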
I think we'll probably never actually address this, TBH, but I'm going to leave it here for now just in case someone has a clever thought about it.
Ultimately the question of what constitutes a vowel and how vowels are represented via glyphs in various languages is quite complex, and well beyond the scope of this lab (or course). So we'll just stick to our simplistic notion of vowel-ness and move on.
I just learned through this video that the crate unicode_segmentation allows us to iterate over the graphemes of a string, which are bundles of characters (including combining marks) that together form user-perceived characters. (See Unicode® Standard Annex #29 "Unicode Text Segmentation" for the gory details.) I think that we could use this to ensure that a vowel doesn't get separated from its diacritical marks.
The problem is that a grapheme is returned as a &str, which it has to be, since it's a sequence of bytes containing all the elements that form that grapheme. How would you decide that such a thing was a vowel? If the "primary" character (I think I'm being very Latin-alphabet-centric here) is the first character of the grapheme, then we could extract it and see if it's a vowel. If it is, then we drop that entire grapheme.
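A rough sketch of that "check the first character" idea might look like the following. The function names are hypothetical. To keep the example free of external dependencies, it accepts any iterator of grapheme slices; with the unicode_segmentation crate, those would presumably come from `s.graphemes(true)`.

```rust
/// Hypothetical helper: true for the ASCII vowels a, e, i, o, u (either case).
fn is_base_vowel(c: char) -> bool {
    matches!(c.to_ascii_lowercase(), 'a' | 'e' | 'i' | 'o' | 'u')
}

/// Treat a grapheme as a vowel if its first `char` is a plain vowel.
/// Assumption: the "primary" character comes first, which holds for a
/// Latin base letter followed by combining marks.
fn starts_with_vowel(grapheme: &str) -> bool {
    grapheme.chars().next().map_or(false, is_base_vowel)
}

/// Drop every grapheme whose first `char` is a vowel, marks and all.
/// The caller supplies the graphemes, e.g. (with unicode_segmentation)
/// `disemvowel_graphemes(s.graphemes(true))`.
fn disemvowel_graphemes<'a, I>(graphemes: I) -> String
where
    I: IntoIterator<Item = &'a str>,
{
    graphemes
        .into_iter()
        .filter(|g| !starts_with_vowel(g))
        .collect()
}
```

One caveat with this sketch: a precomposed "é" (U+00E9) is a single `char` that is not an ASCII vowel, so it would be kept, while the decomposed form "e" + U+0301 would be dropped. Normalizing the input first would be one way to make the two behave the same, but that is beyond what is discussed here.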
Or, perhaps more simply, we break things into graphemes and only remove graphemes that are exactly "regular" vowels, i.e., we ignore graphemes that have combining marks (accents, etc.).
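That simpler alternative could be sketched like this, again with hypothetical names and with the graphemes passed in by the caller (presumably from `s.graphemes(true)` in unicode_segmentation) so the example depends only on the standard library.

```rust
/// True only when the grapheme is exactly one unaccented ASCII vowel.
/// Anything carrying combining marks has more than one `char` and
/// therefore never matches.
fn is_plain_vowel(grapheme: &str) -> bool {
    matches!(
        grapheme,
        "a" | "e" | "i" | "o" | "u" | "A" | "E" | "I" | "O" | "U"
    )
}

/// Remove plain vowels and leave every other grapheme untouched,
/// including vowels that carry accents or other combining marks.
fn disemvowel_plain_only<'a, I>(graphemes: I) -> String
where
    I: IntoIterator<Item = &'a str>,
{
    graphemes
        .into_iter()
        .filter(|g| !is_plain_vowel(g))
        .collect()
}
```

Here the graphemes of "cafe\u{301}" would come out as "cfe\u{301}": the plain "a" is removed, and the accented "e" stays together with its accent.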
It is unclear what the expected behavior should be when we have combining marks, as demonstrated here: https://github.com/deglm006/disemvowel-in-rust/commit/0a9e738c2f9589e46c871dabc860b8e68b92742a