character coding
beta.py is implemented in Python 3, which uses Unicode internally. Unicode includes, in principle, all characters used by any established and still used writing system; see also the article in Wikipedia. Many modern systems and programming languages have moved to Unicode, which has a parallel standard, ISO/IEC 10646. At the same time, the use of eight-bit codings such as Latin-1/ISO 8859-1 is fading away.
In files, Unicode characters are usually encoded according to the UTF-8 scheme, where each character is stored in one to four bytes. English letters and punctuation characters are stored in one byte, and other national characters such as é, ü, ö, þ require two bytes (even if they had a one-byte code in Latin-1).
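The byte counts can be checked directly in Python 3. The following is only a small sketch; the helper name utf8_length and the sample characters are chosen for illustration and are not part of beta.py. The euro sign is included as an example of a character outside Latin-1, which needs three bytes.

```python
def utf8_length(text):
    """Return the number of bytes that text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

# ascii() is used so that the output does not depend on the coding of the terminal.
for ch in ["a", ";", "é", "ü", "ö", "þ", "€"]:
    print(ascii(ch), utf8_length(ch), ch.encode("utf-8"))

# The same letter is a single byte in Latin-1.
print(len("ö".encode("latin-1")))
```

The letters from the Latin-1 range come out as two bytes each in UTF-8, while the same (ö) is one byte when encoded as Latin-1.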
Editors, such as Emacs, are often too clever with character codings. They are prepared to read in, process and save files in many different codings. See the discussion at the end of this page to understand and handle codings with Emacs.
You can find out what coding your terminal is using by typing:
$ locale LC_CTYPE
UTF-8
If the answer is UTF-8, the coding of the terminal is correct. If the answer is ISO-8859-1 or similar, the terminal uses an eight-bit coding such as Latin-1, which is wrong for beta.py. In that case you must change the coding; see e.g. $ man locale and the settings of your terminal program.
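Python itself can also report which codings it has picked up from the environment. This is a hedged sketch using only the standard library; the function name terminal_codings is made up for this example, and the values you see depend entirely on your own system.

```python
import locale
import sys

def terminal_codings():
    """Collect the codings Python has picked up from the environment."""
    return {
        "locale": locale.getpreferredencoding(False),
        "stdout": getattr(sys.stdout, "encoding", None),
        "filesystem": sys.getfilesystemencoding(),
    }

for name, coding in terminal_codings().items():
    print(name, coding)
```

If the lines do not say UTF-8 (in some spelling), Python programs started from that terminal will probably have trouble printing national characters.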
The Emacs editor might be the best tool for finding out the coding system of your old files. Load the file in a buffer and look at the lower left corner of the window: U = Unicode UTF-8, 1 = Latin-1/ISO-8859-1, DOS = Microsoft DOS eight-bit code (possibly code page 437). If you place the cursor on some national character and type ^X =, Emacs shows the code value of that character, including its hex and octal codes. Wikipedia has good information about code pages and code values. Once you know the coding, you can probably use the iconv program to convert the data.
If the file is in UTF-8 coding and it contains characters for which Latin-1 and UTF-8 differ, such as (ö), the file command will tell you so:
$ file sylfi.bta
sylfi.bta: UTF-8 Unicode English text
If the answer contains "UTF-8", the coding is correct. If it does not contain "UTF-8", the coding is incorrect (or maybe there are no characters beyond code value 127).
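A similar three-way answer (plain ASCII, valid UTF-8, or something else) can be obtained in Python by simply trying to decode the bytes. This is a rough sketch, not what the file command does internally; the function name classify_coding and the sample rule lines are made up for illustration.

```python
def classify_coding(data):
    """Classify raw bytes as 'ascii', 'utf-8' or 'not utf-8'."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "not utf-8"

# Sample lines standing in for the contents of a rule file.
print(classify_coding("UIT; +VF:U*I=T;".encode("utf-8")))
print(classify_coding("Yö; +N:Yö;".encode("utf-8")))
print(classify_coding("Yö; +N:Yö;".encode("latin-1")))
```

For a real file, the bytes would be read with open(path, "rb").read(). A result of "not utf-8" says only that the file must be converted; it does not say which eight-bit coding was used.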
If your shell/terminal uses the correct coding, you can find out whether your rules and data are in UTF-8 with various commands, e.g. cat or head:
$ head myrules.beta
...
UIT; +VF:U*I=T;
Y?; +N:Y?;
The question marks (?) instead of (ö) reveal that the coding of the rules is probably Latin-1, not UTF-8.
You can also try less:
...
UIT; +VF:U*I=T;
Y<99>; +N:Y<99>;
...
Again, we see that the national letter is not rendered properly. It is shown as <99>, which is the hexadecimal value of the byte in the file. In DOS code page 437, hexadecimal 99 is the code for (Ö); the lower-case (ö) has the code 94.
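To see what a given byte value means in different codings, one can decode it in Python and ask for the Unicode name of the result. The sketch below assumes only the standard codecs cp437 and latin-1; the function name interpretations is hypothetical.

```python
import unicodedata

def interpretations(byte_value, codings=("cp437", "latin-1")):
    """Map each coding name to the Unicode name of the character byte_value denotes in it."""
    raw = bytes([byte_value])
    result = {}
    for coding in codings:
        ch = raw.decode(coding)
        # Control characters have no Unicode name, so a placeholder is used for them.
        result[coding] = unicodedata.name(ch, "(unnamed control character)")
    return result

for byte_value in (0x94, 0x99, 0xF6):
    print(hex(byte_value), interpretations(byte_value))
```

A byte that is a letter in one coding is typically a control character or a different symbol in the other, which is why the coding has to be known before the data can be converted.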
Thus, both the beta rule grammars and the input data to beta.py have to be in Unicode UTF-8. If the input is Latin-1 with characters whose code value is 128 or higher, a Python program like beta.py will produce a cryptic error message such as:
$ python3.5 beta.py -r foo.bta
Traceback (most recent call last):
  File "beta.py", line 342, in <module>
    read_beta_grammar(args.rules, args.verbosity)
  File "beta.py", line 274, in read_beta_grammar
    for line in f:
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 117: invalid start byte
The last line of the error message tells the reason: the byte 0xf6, which is the Latin-1 code for (ö), is not valid UTF-8.
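The same error can be caught in Python and turned into a shorter message. This is only a sketch of the idea and not how beta.py handles its input; the function name explain_decode_failure and the sample bytes are assumptions made for the example.

```python
def explain_decode_failure(data, coding="utf-8"):
    """Return None if data decodes cleanly, else a short description of the first bad byte."""
    try:
        data.decode(coding)
    except UnicodeDecodeError as err:
        bad = data[err.start]
        return "byte 0x{:02x} at position {} is not valid {} ({})".format(
            bad, err.start, coding, err.reason)
    return None

sample = "Yö;".encode("latin-1")  # the bytes b'Y\xf6;'
print(explain_decode_failure(sample))
print(explain_decode_failure("Yö;".encode("utf-8")))
```

The attributes start and reason of UnicodeDecodeError carry the same position and explanation that appear at the end of the traceback.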
If your data or rule grammars happen to be in the eight-bit Latin-1/ISO-8859-1 code, you can convert the coding in GNU/Linux and macOS with the command:
$ iconv -f iso8859-1 -t utf-8 < old-file.txt > new-file.txt
In Emacs, one may also convert the coding by loading the Latin-1 encoded file and telling Emacs that the buffer is to be saved as UTF-8 with the command:
^X RET f utf-8-unix
The buffer will then be saved (^X ^S) in Unicode UTF-8 encoding, which is exactly what one wants.
Some very old files may be in the DOS code page 437 eight-bit encoding, which is quite different from Latin-1. Fortunately, the iconv program knows that conversion as well:
$ iconv -f 437 -t utf-8 < myrules.text > myrules.utf-8
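If iconv is not available, the same conversions can be done with a few lines of Python 3. This is a minimal sketch under the assumption that the whole file fits in memory; the function name convert_coding is made up, and the file names in the comments are the ones used in the iconv examples above.

```python
def convert_coding(src_path, dst_path, from_coding, to_coding="utf-8"):
    """Read src_path as from_coding and write the same text to dst_path as to_coding."""
    # newline="" keeps the line endings of the original file unchanged.
    with open(src_path, "r", encoding=from_coding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=to_coding, newline="") as dst:
        dst.write(text)

# Usage, corresponding to the two iconv commands:
# convert_coding("old-file.txt", "new-file.txt", "latin-1")
# convert_coding("myrules.text", "myrules.utf-8", "cp437")
```

Note that both latin-1 and cp437 assign some character to every byte value, so choosing the wrong source coding gives no error, only wrong letters. It is worth checking a few national characters in the result.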