
character coding

Kimmo Koskenniemi edited this page Feb 24, 2018 · 7 revisions

Characters used by beta.py and their coding

beta.py is implemented in Python 3, which internally uses Unicode, a coding that in principle covers all characters of every established writing system still in use; see also the Wikipedia article on Unicode. Many modern systems and programming languages have moved over to Unicode, which has a parallel standard, ISO/IEC 10646. At the same time, the use of eight-bit codings such as Latin-1/ISO 8859-1 is fading away.

In files, Unicode characters are usually encoded according to the UTF-8 scheme, where each character is stored in one or several bytes. Typically, English letters and punctuation characters are stored in one byte, while national characters such as é, ü, ö, þ require two bytes (even if they had a one-byte code in Latin-1).
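This is easy to verify in Python 3 itself (the language beta.py is written in) by encoding a few characters and counting the bytes:

```python
# How many bytes does UTF-8 use per character?
# ASCII letters and punctuation take one byte, the national
# characters mentioned above take two.
for ch in "ab.éüöþ":
    print(ch, "->", len(ch.encode("utf-8")), "byte(s)")
```
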

Editors such as Emacs are often too clever with character codings: they are prepared to read in, process and save files in many different codings. See the discussion at the end of this page to understand and handle codings with Emacs.

Your terminal has to be in UTF-8

You can find out what coding your terminal is using by typing:

    $ locale charmap
    UTF-8

If the answer was UTF-8, the coding of the terminal is correct. If the answer was ISO-8859-1 or similar, the coding is wrong (Latin-1 or some other eight-bit code). You must change the coding; see e.g. $ man locale and the settings of your terminal program.
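The same check can also be made from inside Python via the standard locale module; this is a sketch for Unix-like systems, where nl_langinfo is available:

```python
import locale

# Adopt the locale settings from the environment, then ask for the
# character coding of LC_CTYPE; on a correctly set up terminal this
# prints "UTF-8".
locale.setlocale(locale.LC_CTYPE, "")
print(locale.nl_langinfo(locale.CODESET))
```
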

Finding out the coding of your rules and data

Finding out what the coding might be

The Emacs editor might be the best tool for finding out the coding system of your old files. Load the file into a buffer and look at the lower left corner of the window: U = Unicode UTF-8, 1 = Latin-1/ISO-8859-1, DOS = Microsoft DOS eight-bit code (possibly with code page 437). If you place the cursor on some national character and ask for its hex and octal codes by typing `^X =`, you get the code value. Wikipedia has good information about code pages and code values. Once you know the coding, you can probably use the iconv program to convert the data.
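If Emacs is not at hand, a rough trial decoding in Python can at least separate UTF-8 from the eight-bit codings; the function name here is just for illustration:

```python
def guess_coding(data: bytes) -> str:
    """Rough guess: valid UTF-8, or some eight-bit code."""
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Every byte sequence is valid Latin-1 and cp437, so we
        # cannot tell those two apart this way.
        return "an eight-bit code such as iso8859-1 or cp437"

print(guess_coding("yö".encode("utf-8")))      # utf-8
print(guess_coding("yö".encode("iso8859-1")))  # an eight-bit code ...
```
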

Checking whether the coding is correct

If the file is in UTF-8 coding and contains characters for which Latin-1 and UTF-8 differ, such as (ö), the command file will give you the answer:

    $ file sylfi.bta
    sylfi.bta: UTF-8 Unicode English text

If the answer contains "UTF-8", the coding is correct. If it does not, the coding is incorrect (or perhaps the file contains no characters beyond code value 127).

If your shell/terminal uses the correct coding, you can find out whether your rules and data are in UTF-8 by various commands, e.g. cat or head:

    $ head myrules.beta
    ...
    UIT; +VF:U*I=T;
    Y?; +N:Y?;

The question marks (?) instead of (ö) reveal that the coding of the rules is probably Latin-1, not UTF-8.

You can also try less:

    ...
    UIT; +VF:U*I=T;
    Y<99>; +N:Y<99>;
    ...

Again, we see that (ö) is not rendered properly. It is shown as <99>, the hexadecimal code value of the character in the DOS Swedish/Finnish code page 437.

Error messages because of wrong coding

Thus, both the beta rule grammars and the input data to beta.py have to be in Unicode UTF-8. If the input is Latin-1 and contains characters whose code value is 128 or higher, a Python program like beta.py will produce a cryptic error message such as:

    $ python3.5 beta.py -r foo.bta
    Traceback (most recent call last):
      File "beta.py", line 342, in <module>
        read_beta_grammar(args.rules, args.verbosity)
      File "beta.py", line 274, in read_beta_grammar
        for line in f:
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/codecs.py", line 321,
        in decode (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 117: invalid start byte

The last line of the error message tells the reason.
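The failure is easy to reproduce on its own: write a Latin-1 encoded rule line to a file and read it back assuming UTF-8, as beta.py does. The file name below is made up for the example:

```python
import os
import tempfile

# Write a rule line containing ö in Latin-1 coding.
path = os.path.join(tempfile.mkdtemp(), "latin1.bta")
with open(path, "wb") as f:
    f.write("yö; +N:yö;\n".encode("iso8859-1"))

# Reading the file back as UTF-8 raises the same UnicodeDecodeError,
# because the Latin-1 byte 0xf6 (ö) is not valid UTF-8.
try:
    with open(path, encoding="utf-8") as f:
        f.read()
except UnicodeDecodeError as e:
    print(e)
```
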

Converting data or rules from Latin-1 into UTF-8

If your data or rule grammars happen to be in the eight-bit Latin-1/ISO-8859-1 code, on GNU/Linux and on macOS you can convert the coding with the command:

    $ iconv -f iso8859-1 -t utf-8 < old-file.txt > new-file.txt
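If iconv is not available, the same conversion can be sketched in a few lines of Python; the function name is made up for the example:

```python
def latin1_to_utf8(src_path: str, dst_path: str) -> None:
    # Every byte is a valid Latin-1 character, so decoding cannot
    # fail; the text is then written back out in UTF-8 coding.
    with open(src_path, encoding="iso8859-1") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(src.read())
```
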

In Emacs, one may also convert the coding by loading in the Latin-1 encoded file and telling Emacs that the buffer is to be saved as UTF-8 with the command:

    ^X RET f utf-8-unix

The buffer will be saved (^X ^S) in Unicode UTF-8 encoding, which is exactly what one wanted.

Converting data or rules from DOS to UTF-8

Some very old files may be in DOS Code page 437 eight bit encoding which is quite different from Latin-1. Fortunately, the iconv program knows that conversion as well:

    $ iconv -f 437 -t utf-8 < myrules.text > myrules.utf-8
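Python's codecs know code page 437 as well, under the name "cp437", so the same kind of sketch as above works for the DOS files too (again, the function name is made up):

```python
def cp437_to_utf8(src_path: str, dst_path: str) -> None:
    # cp437 maps every byte to a character (e.g. 0x99 to Ö),
    # so decoding cannot fail; write the text back out as UTF-8.
    with open(src_path, encoding="cp437") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(src.read())
```
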