title | author | category | tags | excerpt |
---|---|---|---|---|
CFStringTransform | Mattt Thompson | Cocoa | nshipster, popular | NSString is the crown jewel of Foundation. But as powerful as it is, one would be remiss not to mention its toll-free bridged cousin, CFMutableString, or more specifically, CFStringTransform. |
There are two indicators that tell you everything you need to know about how nice a language is to use:
- API Consistency
- Quality of String Implementation
NSString is the crown jewel of Foundation. In an age where other languages still struggle to handle Unicode correctly, NSString is especially impressive. Not content to just work with whatever is thrown at it, NSString can parse strings into linguistic tags, determine the dominant language of the content, and convert between every string encoding imaginable. It's unfairly good.
But as powerful as NSString / NSMutableString are, one would be remiss not to mention their toll-free bridged cousin, CFMutableString, or more specifically, CFStringTransform.
As denoted by the CF prefix, CFStringTransform is part of Core Foundation. The function takes the following arguments, and returns a Boolean for whether or not the transform was successful:
- string: The string to be transformed. Since this argument is a CFMutableStringRef, an NSMutableString can be passed using a toll-free bridging cast.
- range: The range of the string over which the transformation should be applied. This argument is a pointer to a CFRange, rather than an NSRange value; passing NULL (nil in Swift) transforms the whole string.
- transform: The transformation to apply. This argument takes an ICU transform string, including any one of the string constants described below.
- reverse: Whether to run the transformation in reverse, where applicable.
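To make the four arguments concrete, here is a minimal sketch of a single call, assuming an Apple platform where Foundation exposes Core Foundation. The sample string and the variable names are illustrative only. The range is limited to the first word so that the by-pointer `range` argument is visible.

```swift
import Foundation

// Illustrative input, written with escapes so the UTF-16 length is unambiguous:
// "Café Crème" with precomposed accented letters.
let greeting = NSMutableString(string: "Caf\u{00E9} Cr\u{00E8}me")

// Only the first four UTF-16 units ("Café") should be transformed.
// The range is passed by pointer, and the function may update it to
// describe where the transformed text ended up.
var range = CFRangeMake(0, 4)

// string: the NSMutableString, bridged to CFMutableString
// range: pointer to the CFRange above (nil would mean the whole string)
// transform: one of the Core Foundation transform constants
// reverse: false, since this transform is applied in the forward direction
let succeeded = CFStringTransform(greeting as CFMutableString,
                                  &range,
                                  kCFStringTransformStripCombiningMarks,
                                  false)

print(succeeded, greeting, range.location, range.length)
```

Checking the returned Boolean before using the string is a cheap way to catch an unsupported transform.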
CFStringTransform covers a lot of ground with its transform argument. Here's a rundown of what it can do:
Énġlišh långuãge lẳcks iñterêßţing diaçrïtičş. As such, it can be useful to normalize extended Latin characters into ASCII-friendly representations. Rid any string of its squiggly bits using the kCFStringTransformStripCombiningMarks transformation.
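As a rough sketch of how that might look in practice (again assuming an Apple platform), the helper below wraps the transform in a function that returns a new string. The name `strippingCombiningMarks` is made up for this example and is not part of any framework.

```swift
import Foundation

/// Hypothetical helper: returns a copy of `text` with combining marks
/// removed, or nil if CFStringTransform reports failure.
func strippingCombiningMarks(_ text: String) -> String? {
    // CFStringTransform mutates in place, so work on a mutable copy.
    let buffer = NSMutableString(string: text)
    guard CFStringTransform(buffer as CFMutableString,
                            nil, // nil range: transform the whole string
                            kCFStringTransformStripCombiningMarks,
                            false) else {
        return nil
    }
    return buffer as String
}

if let plain = strippingCombiningMarks("Énġlišh långuãge") {
    print(plain)
}
```

Letters that are not a base letter plus a combining mark (ß, for instance) are not something this transform is designed to rewrite, so don't expect every "interesting" character to become ASCII.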
kCFStringTransformToUnicodeName allows you to determine the Unicode standard name for special characters, including Emoji. For instance, "🐑💨✨" is transformed into "{SHEEP} {DASH SYMBOL} {SPARKLES}", and "🐷" becomes "{PIG FACE}".
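Here is a small sketch of the name lookup, assuming an Apple platform. The exact delimiters wrapped around each name may vary with the underlying ICU data, so the example prints the result rather than promising a particular format.

```swift
import Foundation

// Illustrative input: a single emoji.
let emoji = NSMutableString(string: "🐷")

// Replace each character with its Unicode name, in place.
let named = CFStringTransform(emoji as CFMutableString,
                              nil,
                              kCFStringTransformToUnicodeName,
                              false)

// The name itself ("PIG FACE") should appear in the output;
// the surrounding punctuation is left for you to inspect.
print(named, emoji)
```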
With the notable exception of English (and its delightful spelling inconsistencies), writing systems generally encode speech sounds into a consistent written representation. European languages generally use the Latin alphabet (with a few added diacritics), Russian uses Cyrillic, Japanese uses Hiragana & Katakana, and Thai, Korean, & Arabic each have their own scripts.
Although each language has a particular inventory of sounds, some of which other languages may lack, the overlap across all of the major writing systems is remarkably high—enough so that one can rather effectively transliterate (not to be confused with translation) from one script to another.
CFStringTransform can transliterate back and forth between Latin and Arabic, Cyrillic, Greek, Korean (Hangul), Hebrew, Japanese (Hiragana & Katakana), Mandarin Chinese, and Thai.
Transformation | Input | Output |
---|---|---|
kCFStringTransformLatinArabic | mrḥbạ | مرحبا |
kCFStringTransformLatinCyrillic | privet | привет |
kCFStringTransformLatinGreek | geiá sou | γειά σου |
kCFStringTransformLatinHangul | annyeonghaseyo | 안녕하세요 |
kCFStringTransformLatinHebrew | şlwm | שלום |
kCFStringTransformLatinHiragana | hiragana | ひらがな |
kCFStringTransformLatinKatakana | katakana | カタカナ |
kCFStringTransformLatinThai | s̄wạs̄dī | สวัสดี |
kCFStringTransformHiraganaKatakana | にほんご | ニホンゴ |
kCFStringTransformMandarinLatin | 中文 | zhōng wén |
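The "back and forth" part comes from the reverse argument. The sketch below, assuming an Apple platform, runs the Latin-Cyrillic row of the table in both directions. The wrapper function `applying(_:to:reverse:)` is a hypothetical convenience for this example, not an existing API.

```swift
import Foundation

/// Hypothetical convenience wrapper around CFStringTransform that
/// returns a transformed copy, or nil if the transform fails.
func applying(_ transform: CFString, to text: String, reverse: Bool = false) -> String? {
    let buffer = NSMutableString(string: text)
    guard CFStringTransform(buffer as CFMutableString, nil, transform, reverse) else {
        return nil
    }
    return buffer as String
}

// Forward: Latin to Cyrillic, as in the table above.
let cyrillic = applying(kCFStringTransformLatinCyrillic, to: "privet")

// Reverse: the same constant, with reverse set to true, goes from
// Cyrillic back to Latin.
let latin = applying(kCFStringTransformLatinCyrillic, to: "привет", reverse: true)

print(cyrillic ?? "transform failed", latin ?? "transform failed")
```

Using one constant with a flag, rather than two constants per script pair, is why the table only needs to list each pairing once.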
And that's only using the constants defined in Core Foundation! By passing an ICU transform directly, CFStringTransform can transliterate between Latin and Arabic, Armenian, Bopomofo, Cyrillic, Georgian, Greek, Han, Hangul, Hebrew, Hiragana, Indic (Devanagari, Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil, & Telugu), Jamo, Katakana, Syriac, Thaana, & Thai.
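Passing an ICU transform directly just means handing over the transform's ID as a string. The sketch below assumes an Apple platform and assumes that the system's ICU data ships the "Georgian-Latin" transliterator, which is a standard ICU ID but is not one of the Core Foundation constants. The sample word is illustrative.

```swift
import Foundation

// Illustrative input: a Georgian greeting.
let georgian = NSMutableString(string: "გამარჯობა")

// An ICU transform ID passed as a plain string instead of a
// kCFStringTransform... constant. Assumption: "Georgian-Latin"
// is available in the platform's ICU data.
let icuTransform = "Georgian-Latin" as CFString

let transliterated = CFStringTransform(georgian as CFMutableString,
                                       nil,
                                       icuTransform,
                                       false)

// If the ID is not recognized, the call returns false and the
// string is best treated as unchanged.
print(transliterated, georgian)
```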
One of the more practical applications for string transformation is to normalize unpredictable user input. Even if your application doesn't specifically deal with other languages, you should be able to intelligently process anything the user types into your app.
For example, let's say you want to build a searchable index of movies on the device, which includes greetings from around the world:
let mutableString = NSMutableString(string: "Hello! こんにちは! สวัสดี! مرحبا! 您好!") as CFMutableString
- First, apply the kCFStringTransformToLatin transform to transliterate all non-English text into a Latin alphabetic representation.
CFStringTransform(mutableString, nil, kCFStringTransformToLatin, false)
Hello! こんにちは! สวัสดี! مرحبا! 您好! → Hello! kon'nichiha! s̄wạs̄dī! mrḥbạ! nín hǎo!
- Next, apply the kCFStringTransformStripCombiningMarks transform to remove any diacritics or accents.
CFStringTransform(mutableString, nil, kCFStringTransformStripCombiningMarks, false)
Hello! kon'nichiha! s̄wạs̄dī! mrḥbạ! nín hǎo! → Hello! kon'nichiha! swasdi! mrhba! nin hao!
- Finally, downcase the text with CFStringLowercase, and split the text into tokens with CFStringTokenizer to use as an index for the text.
// Lowercase in place before tokenizing, as described in the step above.
CFStringLowercase(mutableString, CFLocaleCopyCurrent())
// An options value of 0 selects word-unit tokenization.
let tokenizer = CFStringTokenizerCreate(nil, mutableString, CFRangeMake(0, CFStringGetLength(mutableString)), 0, CFLocaleCopyCurrent())
var mutableTokens: [String] = []
// Advance first, then read, so no empty token is appended at the end.
while CFStringTokenizerAdvanceToNextToken(tokenizer) != [] {
    let range = CFStringTokenizerGetCurrentTokenRange(tokenizer)
    let token = CFStringCreateWithSubstring(nil, mutableString, range) as String
    mutableTokens.append(token)
}
(hello, kon'nichiha, swasdi, mrhba, nin, hao)
By applying the same set of transformations on search text entered by the user, you have a universal way to search regardless of either the language of the search string or content!
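One way to sketch that idea, assuming an Apple platform: put the three steps behind a single function and run both the indexed content and the query through it. The function name `searchTokens(for:)` and the sample query are hypothetical, chosen only for this example.

```swift
import Foundation

/// Hypothetical helper: transliterates to Latin, strips combining marks,
/// lowercases, and splits the result into word tokens.
func searchTokens(for text: String) -> [String] {
    let buffer = NSMutableString(string: text) as CFMutableString
    CFStringTransform(buffer, nil, kCFStringTransformToLatin, false)
    CFStringTransform(buffer, nil, kCFStringTransformStripCombiningMarks, false)
    CFStringLowercase(buffer, CFLocaleCopyCurrent())

    // An options value of 0 selects word-unit tokenization.
    let fullRange = CFRangeMake(0, CFStringGetLength(buffer))
    let tokenizer = CFStringTokenizerCreate(nil, buffer, fullRange, 0, CFLocaleCopyCurrent())

    var tokens: [String] = []
    // An empty token type means there are no more tokens.
    while CFStringTokenizerAdvanceToNextToken(tokenizer) != [] {
        let range = CFStringTokenizerGetCurrentTokenRange(tokenizer)
        tokens.append(CFStringCreateWithSubstring(nil, buffer, range) as String)
    }
    return tokens
}

// Index the content once, then normalize each query the same way.
let index = Set(searchTokens(for: "Hello! こんにちは! สวัสดี! مرحبا! 您好!"))
let query = searchTokens(for: "HELLO")
let matches = query.allSatisfy { index.contains($0) }

print(matches)
```

Because both sides pass through the same function, any quirk in the normalization applies equally to content and query, which is what makes the comparison meaningful.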
For anyone wanting to be especially clever, all of the necessary transformations can actually be done in a single pass, by specifying the ICU transform "Any-Latin; Latin-ASCII; Any-Lower".
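A minimal sketch of the single-pass version, assuming an Apple platform. Note that Latin-ASCII is not identical to stripping combining marks, so the output may differ in small details from the step-by-step result above.

```swift
import Foundation

let text = NSMutableString(string: "Hello! こんにちは! สวัสดี! مرحبا! 您好!")

// A compound ICU transform: each rule runs in order, separated by semicolons.
let compound = "Any-Latin; Latin-ASCII; Any-Lower" as CFString

let ok = CFStringTransform(text as CFMutableString, nil, compound, false)

print(ok, text)
```

The trade-off is readability: one string is compact, while separate calls make it easier to inspect the intermediate result after each step.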
CFStringTransform can be an insanely powerful way to bend language to your will. And it's but one of many powerful features that await you if you're brave enough to explore outside of Objective-C's warm OO embrace.