Character encoding detection, brought to you by ICU

stanhu/charlock_holmes

CharlockHolmes

Character encoding detecting library for Ruby using ICU

Usage

First, require the library:

require 'charlock_holmes'

Encoding detection

contents = File.read('test.xml')
detection = CharlockHolmes::EncodingDetector.detect(contents)
# => {:encoding => 'UTF-8', :confidence => 100, :type => :text}

# optionally there will be a :language key as well, but
# that's mostly only returned for legacy encodings like ISO-8859-1

NOTE: CharlockHolmes::EncodingDetector.detect will return nil if it was unable to find an encoding.

For binary content, :type will be set to :binary
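
The three outcomes described above (nil, :binary, or a detected text encoding) can be handled in one place. The following is a minimal sketch and not part of CharlockHolmes: summarize_detection is a hypothetical helper that takes whatever detect returned and relies only on the :encoding, :confidence, :type and optional :language keys shown above.

```ruby
# Hypothetical helper (not part of CharlockHolmes): turns a detection result
# into a short human-readable description.
def summarize_detection(detection)
  # detect returns nil when no encoding could be found
  return 'unknown encoding' if detection.nil?
  # binary content is reported through the :type key
  return 'binary content' if detection[:type] == :binary

  summary = "#{detection[:encoding]} (confidence #{detection[:confidence]})"
  # :language is optional, so only mention it when it is present
  summary += ", language #{detection[:language]}" if detection[:language]
  summary
end

# Usage with the gem installed:
#   require 'charlock_holmes'
#   detection = CharlockHolmes::EncodingDetector.detect(File.read('test.xml'))
#   puts summarize_detection(detection)
```

Checking for nil first matters because indexing into a nil result (for example detection[:encoding]) would raise a NoMethodError.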

It is more efficient, though, to reuse one detector instance:

detector = CharlockHolmes::EncodingDetector.new

detection1 = detector.detect(File.read('test.xml'))
detection2 = detector.detect(File.read('test2.json'))

# and so on...
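
When many inputs need checking, the reuse pattern above can be wrapped in a small helper. This is only a sketch: detect_all is a hypothetical name, and the detector argument is assumed to be any object that responds to detect, such as a CharlockHolmes::EncodingDetector instance.

```ruby
# Hypothetical helper (not part of CharlockHolmes): runs a single detector
# over several strings and returns a hash of name => detection result.
# `detector` is any object responding to #detect, for example
# CharlockHolmes::EncodingDetector.new.
def detect_all(contents_by_name, detector)
  contents_by_name.transform_values { |content| detector.detect(content) }
end

# Usage with the gem installed:
#   require 'charlock_holmes'
#   detector = CharlockHolmes::EncodingDetector.new
#   files = { 'test.xml' => File.read('test.xml'), 'test2.json' => File.read('test2.json') }
#   detect_all(files, detector).each { |name, detection| p [name, detection] }
```

Entries whose encoding could not be detected map to nil, so callers should check for that before reading :encoding.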

String monkey patch

Alternatively, you can use the detect_encoding method on the String class:

require 'charlock_holmes/string'

contents = File.read('test.xml')

detection = contents.detect_encoding

Ruby 1.9 specific

NOTE: This method only exists on Ruby 1.9+

If you want to use this library to detect and set the encoding flag on strings, you can use the detect_encoding! method on the String class:

require 'charlock_holmes/string'

contents = File.read('test.xml')

# this will detect and set the encoding of `contents`, then return self
contents.detect_encoding!

Transcoding

Detecting the encoding of arbitrary content is useful, but you will usually also want to transcode that content into the encoding your application uses.

content = File.read('test2.txt')
detection = CharlockHolmes::EncodingDetector.detect(content)
utf8_encoded_content = CharlockHolmes::Converter.convert content, detection[:encoding], 'UTF-8'

The first parameter is the content to transcode, the second is the source encoding (the encoding the content is assumed to be in), and the third parameter is the destination encoding.
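
Because detect can return nil and can report binary content, a transcoding step usually needs a guard before it reads detection[:encoding]. The sketch below shows one way to combine the two calls. to_utf8 is a hypothetical helper, not part of the library, and it takes the detector and converter as arguments (assumed to respond to detect and convert as described above) so that the logic can also be exercised without ICU installed.

```ruby
# Hypothetical helper (not part of CharlockHolmes): detects the encoding of
# `content` and transcodes it to UTF-8 when that is possible.
#
# Returns nil when no encoding was detected, returns the content unchanged
# when it was detected as binary, and returns the converted string otherwise.
def to_utf8(content, detector:, converter:)
  detection = detector.detect(content)
  return nil if detection.nil?                     # nothing detected
  return content if detection[:type] == :binary    # do not transcode binary data

  # Arguments: content, source encoding, destination encoding
  converter.convert(content, detection[:encoding], 'UTF-8')
end

# Usage with the gem installed:
#   require 'charlock_holmes'
#   utf8 = to_utf8(File.read('test2.txt'),
#                  detector: CharlockHolmes::EncodingDetector,
#                  converter: CharlockHolmes::Converter)
```

Returning nil for undetectable input is a design choice made for this sketch; raising an error instead may suit applications that must not silently drop content.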

Installing

If the traditional gem install charlock_holmes doesn't work, you may need to specify the path to your ICU installation with the --with-icu-dir option, either by passing it to gem install directly or by configuring Bundler to pass it to RubyGems.

Configure Bundler to always use the correct arguments when installing:

bundle config build.charlock_holmes --with-icu-dir=/path/to/installed/icu4c

Using Gem to install directly without Bundler:

gem install charlock_holmes -- --with-icu-dir=/path/to/installed/icu4c

If you get a compile-time error that looks like error: delegating constructors are permitted only in C++11, or another error related to C++11, you need to set the --with-cxxflags=-std=c++11 option:

Bundler: bundle config build.charlock_holmes --with-icu-dir=/path/to/installed/icu4c --with-cxxflags=-std=c++11

Installing directly: gem install charlock_holmes -- --with-icu-dir=/path/to/installed/icu4c --with-cxxflags=-std=c++11

Homebrew

If you're installing on Mac OS X, Homebrew is the easiest way to install ICU.

However, be warned: it is a keg-only install (see homedir issue #167 for more info), meaning RubyGems won't find it unless you specify --with-icu-dir.

To install ICU with Homebrew:

brew install icu4c

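Configure Bundler to always use the correct arguments when installing (/usr/local/opt/icu4c is Homebrew's default location on Intel Macs; run brew --prefix icu4c to print the location on your machine):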

bundle config build.charlock_holmes --with-icu-dir=/usr/local/opt/icu4c

Using Gem to install directly without Bundler:

gem install charlock_holmes -- --with-icu-dir=/usr/local/opt/icu4c
