A Python package to enable Unicode input and display when running Python from the Windows console.
The package is not needed on Python 3.6 and newer since the underlying issue has been resolved (see https://docs.python.org/3/whatsnew/3.6.html#whatsnew36-pep528).
When running Python (3.5 and older) in the standard console on Windows, there are several problems when one tries to enter or display Unicode characters. The relevant issue is http://bugs.python.org/issue1602. This package solves some of them.
First, when you want to display Unicode characters in the Windows console, you have to select a font able to display them. Similarly, if you want to enter Unicode characters, you have to have your keyboard properly configured. This has nothing to do with Python, but is included here for completeness.
The standard stream objects (sys.stdin, sys.stdout, sys.stderr) are not capable of reading and displaying Unicode characters in the Windows console. This has nothing to do with encoding, since even sys.stdin.buffer.raw.readline() returns b"?\n" when entering α, and there is no encoding under which sys.stdout.buffer.raw.write displays α.

The streams module provides several alternative stream objects. stdin_raw, stdout_raw, and stderr_raw are raw stream objects using the WinAPI functions ReadConsoleW and WriteConsoleW to interact with the Windows console through UTF-16-LE encoded bytes. The stdin_text, stdout_text, and stderr_text objects are standard text IO wrappers over standard buffered IO over our raw streams, and are intended to be the primary replacements for the sys.std* streams. Unfortunately, other wrappers around std*_text are needed (see below), so there are more stream objects in the streams module.

The function streams.enable installs the chosen stream objects instead of the original ones. By default, it chooses appropriate stream objects itself. The function streams.disable restores the original stream objects (these are stored in the sys.__std*__ attributes by Python).

After replacing the stream objects, using print with a string containing Unicode characters works, as does displaying Unicode characters in the interactive loop. For input, see below.

The Python interactive loop doesn't use sys.stdin to read input, so fixing it doesn't help. Also, the input function may or may not use sys.stdin depending on whether sys.stdin and sys.stdout have the standard filenos and whether they are interactive. See http://bugs.python.org/issue17620 for more information.

To solve this, we install a custom readline hook. A readline hook is a function used by the Python REPL to read a single line interactively. It may also be used by the input function under certain conditions (see above). On Linux, this hook is usually set to the GNU readline function, which provides features like autocompletion, history, …

The module readline_hook provides our custom readline hook, which uses sys.stdin to get the input and is (de)activated by the functions readline_hook.enable and readline_hook.disable.

As we said, the readline hook can be called from two places: from the REPL and from the input function. In the first case the prompt is encoded using sys.stdin.encoding, but in the second case sys.stdout.encoding is used. So Python currently assumes that these two encodings are equal.

The Python tokenizer, which is used when parsing the input from the REPL, cannot handle UTF-16 or generally any encoding containing null bytes. Because UTF-16-LE is the encoding of Unicode used by Windows, we have to additionally wrap our text stream objects (std*_text). Thus, the streams module also contains the stream objects stdin_text_transcoded, stdout_text_transcoded, and stderr_text_transcoded. They basically just hide the underlying UTF-16-LE encoded buffered IO and set the encoding to UTF-8. These transcoding wrappers are used by default by streams.enable.
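The layering described above can be imitated with the standard library alone. The following is a minimal sketch, not the package's implementation: an in-memory io.BytesIO stands in for the raw console stream (the real stdin_raw relies on ReadConsoleW and only works on Windows), and all variable names are chosen for illustration. It shows why text coming from UTF-16-LE bytes has to be presented to the tokenizer as UTF-8: the UTF-16-LE form contains null bytes, the UTF-8 form does not.

```python
import io

# Stand-in for the raw console stream: the UTF-16-LE bytes that a
# ReadConsoleW-based reader would produce after the user types "α" + Enter.
# (Assumption: an in-memory buffer replaces the real Windows console.)
fake_raw = io.BytesIO("\u03b1\n".encode("utf-16-le"))

# A text wrapper over the byte stream, analogous in spirit to stdin_text.
text_stream = io.TextIOWrapper(fake_raw, encoding="utf-16-le")
line = text_stream.readline()

# The same line in the two encodings discussed above.
as_utf16 = line.encode("utf-16-le")
as_utf8 = line.encode("utf-8")

print(repr(line))           # the decoded text
print(b"\x00" in as_utf16)  # UTF-16-LE encodes "\n" as b"\n\x00"
print(b"\x00" in as_utf8)   # UTF-8 produces no null bytes for this input
```

Only the encoding reported to the consumer changes between the two forms; the decoded text is the same either way.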
There are additional issues on Python 2.
Since default Python 2 strings correspond to bytes rather than unicode, people usually call print with a bytes argument. Therefore, sys.stdout.write and sys.stderr.write should support a bytes argument. That is why we add the stdout_text_str and stderr_text_str stream objects to the streams module. They are used by default on Python 2.

When we enter a Unicode literal into the interactive interpreter, it gets processed by the Python tokenizer, which is bytes-based. When we enter u"\u03b1" into the interactive interpreter, the tokenizer gets essentially b'u"\xce\xb1"' plus the information that the encoding used is UTF-8. The problem is that the tokenizer uses the encoding only if sys.stdin is a file object (see https://hg.python.org/cpython/file/d356e68de236/Parser/tokenizer.c#l797). Hence, we introduce another stream object, streams.stdin_text_fileobj, that wraps stdin_text_transcoded and is also structurally compatible with the Python file object. This object is used by default on Python 2.

The check for interactive streams done by raw_input unfortunately requires that both sys.stdin and sys.stdout are file objects. Besides stdin_text_fileobj for stdin, we could also use stdout_text_str_fileobj for stdout. Unfortunately, that breaks print.

Using the print statement or function leads to calling PyFile_WriteObject with sys.stdout as an argument. Unfortunately, its generic write method is used only if it is not a file object. Otherwise, PyObject_Print is called, and this function is file-based, so it ends with an fprintf call, which is not something we want. In conclusion, we need stdout not to be a file object.

Given the situation described, the best solution seems to be reimplementing the raw_input and input builtin functions and monkeypatching __builtins__. This is done by our raw_input module on Python 2.

Similarly to the input from sys.stdin, the arguments in sys.argv are also bytes on Python 2, and the original ones may not be reconstructable. To overcome this, we add the unicode_argv module. The function unicode_argv.get_unicode_argv returns a Unicode version of sys.argv obtained by the WinAPI functions GetCommandLineW and CommandLineToArgvW. The function unicode_argv.enable monkeypatches sys.argv with the Unicode arguments.
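Why the original arguments may not be reconstructable can be illustrated without Windows. In the sketch below, the cp1252 code page and Python's errors="replace" handler are assumptions standing in for whatever lossy conversion the system applies to the command line; the substitute characters actually produced on Windows may differ. Once two different arguments collapse to the same bytes, no decoding can tell them apart, which is the motivation for going back to the wide-character command line.

```python
# Two different Unicode arguments, as a user might type them.
args = ["\u03b1.txt", "\u03b2.txt"]  # "α.txt" and "β.txt"

# Assumption: cp1252 with errors="replace" imitates a lossy conversion of
# the command line to bytes; it is not the exact algorithm Windows uses.
as_bytes = [arg.encode("cp1252", errors="replace") for arg in args]

# Decoding cannot bring back what the encoding step threw away.
round_tripped = [raw.decode("cp1252") for raw in as_bytes]

print(as_bytes)
print(round_tripped == args)
```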
Install the package from PyPI via pip install win-unicode-console (recommended), or download the archive and install it from the archive (e.g. pip install win_unicode_console-0.x.zip), or install the package manually by placing the directory win_unicode_console and the module run.py from the archive into the site-packages directory of your Python installation.
The top-level win_unicode_console module contains a function enable, which installs the various fixes offered by the win_unicode_console modules, and a function disable, which restores the original environment. By default, custom stream objects are installed as well as a custom readline hook. On Python 2, the raw_input and input functions are monkeypatched. sys.argv is not monkeypatched by default since, unfortunately, some Python 2 code strictly assumes str instances in the sys.argv list. Use enable(use_unicode_argv=True) if you want the monkeypatching. For further customization, see the sources. The logic should be clear.
Generic usage of the package is just calling win_unicode_console.enable() whenever the fixes should be applied and win_unicode_console.disable() to revert all the changes. Note that it should be the responsibility of a Python user on Windows to install win_unicode_console and fix their Python environment regarding Unicode interaction with the console, rather than of a third-party developer enabling win_unicode_console in their application, which adds a dependency. Our package should be seen as an external patch to Python on Windows rather than a feature package for other packages not directly related to fixing Unicode issues.

Different ways of using win_unicode_console to fix a Python environment on Windows follow.
- Python patch (recommended). Just call win_unicode_console.enable() in your sitecustomize or usercustomize module (see https://docs.python.org/3/tutorial/appendix.html#the-customization-modules for more information). This will enable win_unicode_console on every run of the Python interpreter (unless site is disabled). Doing so should not break executed scripts in any way. Otherwise, it is a bug of win_unicode_console that should be fixed.
- Opt-in runner. You may easily run a script with win_unicode_console enabled by using our runner module and its helper run script. To do so, execute py -i -m run script.py instead of py -i script.py for interactive mode, and similarly py -m run script.py instead of py script.py for non-interactive mode. Of course you may provide arguments to your script: py -i -m run script.py arg1 arg2. To run the bare interactive interpreter with win_unicode_console enabled, execute py -i -m run.
- Opt-out runner. In case you are using win_unicode_console as a Python patch, but you want to run a particular script with win_unicode_console disabled, you can also use the runner. To do so, execute py -i -m run --init-disable script.py.
- Customized runner. To move arbitrary initialization (e.g. enabling win_unicode_console with non-default arguments) from sitecustomize to the opt-in runner, move it to a separate module and use py -i -m run --init-module module script.py. That will import a module named module on startup instead of enabling win_unicode_console with default arguments.
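For the Python patch option, a sitecustomize module along the following lines is one possibility. This is a hedged sketch: the helper name should_enable and the platform and version checks are additions for illustration and are not part of the package; only win_unicode_console.enable() comes from the text above. The checks keep the same file harmless on other platforms and on Python 3.6+, where the package is not needed.

```python
# sitecustomize.py (sketch)
import sys


def should_enable(platform=sys.platform, version=sys.version_info[:2]):
    """Hypothetical helper: is this an interpreter the package targets?"""
    return platform == "win32" and version < (3, 6)


if should_enable():
    try:
        import win_unicode_console
    except ImportError:
        # Package not installed: leave the interpreter untouched.
        pass
    else:
        win_unicode_console.enable()
```

Swallowing the ImportError is a design choice for a file that runs on every interpreter start: a missing package then degrades to the unpatched behaviour instead of breaking startup.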
The win_unicode_console package was tested on Python 3.4, Python 3.5, and Python 2.7 (it is not needed on Python 3.6+). 32-bit or 64-bit shouldn't matter. It also interacts well with the following packages:

- The colorama package (https://pypi.python.org/pypi/colorama) makes ANSI escape character sequences (for producing colored terminal text and cursor positioning) work under MS Windows. It does so by wrapping the sys.stdout and sys.stderr streams. Since win_unicode_console replaces the streams in order to support Unicode, win_unicode_console.enable has to be called before colorama.init so everything works as expected.

  As of colorama v0.3.3, there was an early binding issue (tartley/colorama#32), so win_unicode_console.enable had to be called even before importing colorama. Note that this is already the case when win_unicode_console is used as a Python patch or as an opt-in runner. The issue has since been fixed.

- The pyreadline package (https://pypi.python.org/pypi/pyreadline/2.0) implements GNU readline features on Windows. It provides its own readline hook, which actually supports Unicode input. win_unicode_console.readline_hook detects when pyreadline is active, and in that case, by default, reuses its readline hook rather than installing its own, so GNU readline features are preserved on top of our Unicode streams.

- IPython (https://pypi.python.org/pypi/ipython) can also be used with win_unicode_console.

  As of IPython 3.2.1, there is an early binding issue (ipython/ipython#8669), so win_unicode_console.enable has to be called even before importing IPython. That is the case when win_unicode_console is used as a Python patch.

  There was also an issue that IPython was not compatible with the builtin function raw_input returning unicode on Python 2 (ipython/ipython#8670). If you hit this issue, you can make win_unicode_console.raw_input.raw_input return bytes by enabling it as win_unicode_console.enable(raw_input__return_unicode=False). This was fixed in IPython 4.
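The early binding issues mentioned for colorama and IPython come down to a library keeping its own reference to sys.stdout from before the streams were replaced. The sketch below reproduces the effect with the standard library only. EarlyBinder is a hypothetical stand-in, not code from either package, and the io.StringIO objects play the roles of the original and the replacement stream.

```python
import io
import sys


class EarlyBinder:
    """Hypothetical library object that grabs sys.stdout when created."""

    def __init__(self):
        self.stream = sys.stdout  # bound once, never looked up again

    def write(self, text):
        self.stream.write(text)


saved_stdout = sys.stdout
old_stream, new_stream = io.StringIO(), io.StringIO()
try:
    sys.stdout = old_stream
    early = EarlyBinder()    # created before the streams are replaced
    sys.stdout = new_stream  # conceptually what replacing the streams does
    late = EarlyBinder()     # created after the replacement
    early.write("early")
    late.write("late")
finally:
    sys.stdout = saved_stdout  # always restore the real stream

print(repr(old_stream.getvalue()))
print(repr(new_stream.getvalue()))
```

The object created first keeps writing to the stream that was current at that moment, which is why the order of enabling and importing matters in the cases above.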
Since version 0.4, the signature of streams.enable has been changed because there are now more options for the stream objects to be used. It now accepts a keyword argument for each of stdin, stdout, and stderr, setting the corresponding stream. None means “do not set”, Ellipsis means “use the default value”.

A function streams.enable_only was added. It works the same way as streams.enable, but the default value for each parameter is None.

The functions streams.enable_reader, streams.enable_writer, and streams.enable_error_writer have been removed. Example: instead of streams.enable_reader(transcode=True) use streams.enable_only(stdin=streams.stdin_text_transcoded).

There are also corresponding changes in the top-level enable function.

Since version 0.3, the custom stream objects have the standard filenos, so calling input doesn't handle Unicode without the custom readline hook.
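The None / Ellipsis convention introduced in version 0.4 can be sketched as follows. This illustrates the calling convention only, with placeholder strings in place of real stream objects; the names DEFAULTS, resolve, pick_streams and pick_streams_only are hypothetical, and the package's own code may be organized differently.

```python
# Placeholder names standing in for the default stream objects.
DEFAULTS = {
    "stdin": "default-stdin",
    "stdout": "default-stdout",
    "stderr": "default-stderr",
}


def resolve(name, requested):
    """Map one keyword argument to the stream to install (None = skip)."""
    if requested is Ellipsis:
        return DEFAULTS[name]  # "use the default value"
    return requested           # None means "do not set"; else use as given


def pick_streams(stdin=Ellipsis, stdout=Ellipsis, stderr=Ellipsis):
    """Mirror of the enable convention: every stream defaults to Ellipsis."""
    chosen = {
        "stdin": resolve("stdin", stdin),
        "stdout": resolve("stdout", stdout),
        "stderr": resolve("stderr", stderr),
    }
    # Drop the streams that should be left alone.
    return {name: s for name, s in chosen.items() if s is not None}


def pick_streams_only(stdin=None, stdout=None, stderr=None):
    """Mirror of the enable_only convention: every stream defaults to None."""
    return pick_streams(stdin=stdin, stdout=stdout, stderr=stderr)


print(pick_streams(stdout=None))
print(pick_streams_only(stdin=Ellipsis))
```

Using Ellipsis as the "default" marker leaves None free to mean "leave this stream alone", so the two intentions never collide.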
- The code of the streams module is based on the code submitted to http://bugs.python.org/issue1602.
- The idea of providing a custom readline hook and the code of the readline_hook module are based on https://github.com/pyreadline/pyreadline.
- The code related to unicode_argv.get_full_unicode_argv is based on http://code.activestate.com/recipes/572200/.
- The idea of using path hooks and the code related to unicode_argv.argv_setter_hook are based on https://mail.python.org/pipermail/python-list/2016-June/710183.html.