sys._enablelegacywindowsfsencoding() don't apply to os.fsencode and os.fsdecode #73427

JGoutin · 2017-01-11T12:28:01Z

BPO	29241
Nosy	@malemburg, @pfmoore, @vstinner, @tjguk, @methane, @zware, @zooba, @JGoutin, @iritkatriel

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2017-01-11.12:28:01.427>
labels = ['type-bug', '3.7', 'OS-windows']
title = "sys._enablelegacywindowsfsencoding() don't apply to os.fsencode and os.fsdecode"
updated_at = <Date 2022-01-17.01:40:59.439>
user = 'https://github.com/JGoutin'

bugs.python.org fields:

activity = <Date 2022-01-17.01:40:59.439>
actor = 'methane'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Windows']
creation = <Date 2017-01-11.12:28:01.427>
creator = 'JGoutin'
dependencies = []
files = []
hgrepos = []
issue_num = 29241
keywords = []
message_count = 18.0
messages = ['285220', '285238', '285290', '285354', '285374', '285393', '285398', '285421', '285423', '285424', '285429', '285888', '285969', '285975', '285979', '312378', '410719', '410738']
nosy_count = 9.0
nosy_names = ['lemburg', 'paul.moore', 'vstinner', 'tim.golden', 'methane', 'zach.ware', 'steve.dower', 'JGoutin', 'iritkatriel']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue29241'
versions = ['Python 3.6', 'Python 3.7']

Linked PRs

gh-73427: deprecate _enablelegacywindowsfsencoding #107729

The text was updated successfully, but these errors were encountered:

JGoutin · 2017-01-11T12:28:01Z

The doc say that calling "sys._enablelegacywindowsfsencoding()" is equivalent to use "PYTHONLEGACYWINDOWSFSENCODING" environment variable.

In fact, this no apply to "os.fsencode" and "os.fsdecode".

Example with Python 3.6 64Bits on Windows 7 64 bits :

EXAMPLE CODE 1 (sys._enablelegacywindowsfsencoding()):

import sys
import os

# Force the use of legacy encoding like versions of Python prior to 3.6.
sys._enablelegacywindowsfsencoding()

# Show actual file system encoding
encoding = sys.getfilesystemencoding()
print('File system encoding:', encoding)

# os.fsencode(filename) VS filename.encode(File system encoding)
print(os.fsencode('é'), 'é'.encode(encoding))

>> File system encoding: mbcs
>> b'\xc3\xa9' b'\xe9'

First is encoded with "utf-8" and not "mbcs" (The actual File system encoding)

EXAMPLE CODE 2 (PYTHONLEGACYWINDOWSFSENCODING):

import sys
import os

# Force the use of legacy encoding like versions of Python prior to 3.6.
# "PYTHONLEGACYWINDOWSFSENCODING" environment variable set before running Python.

# Show actual file system encoding
encoding = sys.getfilesystemencoding()
print('File system encoding:', encoding)

# os.fsencode(filename) VS filename.encode(File system encoding)
print(os.fsencode('é'), 'é'.encode(encoding))

>> File system encoding: mbcs
>> b'\xe9' b'\xe9'

Everything encoded with "mbcs" (The actual File system encoding)

In "os.fsencode" and "os.fsdecode" encoding and errors are cached on start and never updated by "sys._enablelegacywindowsfsencoding()" after.

zooba · 2017-01-11T16:43:23Z

If you import os first then that's acceptable and we should document it more clearly. Try calling enable before importing os.

I wouldn't be surprised if os is imported automatically, in which case we need to figure out some alternate caching mechanism that can be reset by the enable call (or demonstrate that the caching is of no benefit).

JGoutin · 2017-01-12T07:07:47Z

import sys

# Force the use of legacy encoding like versions of Python prior to 3.6.
sys._enablelegacywindowsfsencoding()

# Show actual file system encoding
encoding = sys.getfilesystemencoding()
print('File system encoding:', encoding)

# os.fsencode(filename) VS filename.encode(File system encoding)
import os
print(os.fsencode('é'), 'é'.encode(encoding))

>> File system encoding: mbcs
>> b'\xc3\xa9' b'\xe9'

The result is the same.

zooba · 2017-01-12T23:49:59Z

Then we do in fact need to make os.fsencode/fsdecode either stop caching the encoding completely, or figure out a way to reset the cache when that function is called.

malemburg · 2017-01-13T10:03:12Z

Adding Victor, who implemented the fs codec.

AFAIK, it's not possible to change the encoding after interpreter initialization, since it will have been already used for many different things by the time you get to executing code.

vstinner · 2017-01-13T12:46:24Z

My experience with changing the Python "filesystem encoding" (sys.getfilesystemencoding()) at runtime: it doesn't work.

The filesystem encoding must be set as soon as possible and must never change later. As soon as possible: before the first call to os.fsdecode(), which is implemented in C as Py_DecodeLocale(). For example, the encoding must be set before Python imports the first module.

The filesystem encoding must be set before Python decodes *any* operating system data: command line arguments, any filename or path, environment variables, etc.

Hopefully, Windows provides most operating system data as Unicode directly: command line arguments and environment variables are exposed as Unicode for example.

os.fsdecode() and os.fsencode() have an important property:
assert os.fsencode(os.fsdecode(data)) == data

On Windows, the other property is maybe more imporant:
assert os.fsdecode(os.fsencode(data)) == data

If the property becomes false, for example if the filesystem encoding is changed at runtime, you get mojibake. Example:

os.fsdecode() decodes the filename b'h\xc3\xa9llo' from UTF-8 => 'h\xe9llo'
sys._enablelegacywindowsfsencoding()
os.fsencode() encodes the filename to cp1252 => you get 'h\xc3\xa9llo'
instead of 'h\xe9llo', say hello to mojibake

--

Sorry, I didn't play with sys._enablelegacywindowsfsencoding() on Windows. I don't know if it would "work" if sys._enablelegacywindowsfsencoding() is the first instruction of an application. I expect that Python almost decodes nothing at startup on Windows, so it may work.

sys._enablelegacywindowsfsencoding() is a private method, so it shouldn't be used.

Maybe we could add a "fuse" (flag only sets to 1, cannot be reset to 0) to raise an exception if sys._enablelegacywindowsfsencoding() is called "too late", after the first call to os.fsdecode() / Py_DecodeLocale()?

zooba · 2017-01-13T14:16:40Z

Windows doesn't use the fs encoding at all until Python code requests/provides something in bytes. Except for the caching in fsencode/fsdecode, there's no problem setting it once at the start of your program (and it can only be set once - there's no parameter and it cannot be undone).

What I'm most interested in is whether caching the encoding in fsencode/fsdecode is actually an optimization - if not, remove it, and if so make a way to reset it. I'll get around to this sooner or later but I don't want to stop someone else from doing it.

JGoutin · 2017-01-13T17:02:06Z

Personally, I call "sys._enablelegacywindowsfsencoding()" for only one reason :
Temporary fixing issues with some third party libraries which use C code for files I/O (With filename as "mbcs" encoded bytes internally).

Theses libraries generally call "filename.encode(sys.getfilesystemencoding())" or "os.fsencode(filename)" before sending filenames from Python to C code.

Actually, I didn't see any side effect for using this function. Maybe because I call it at start before anything else.

Using the environment variable is not easy in my case.

I can look if encoding caching in fsencode is efficient (On Windows). And eventually propose a code change for this.

vstinner · 2017-01-13T17:13:17Z

Temporary fixing issues with some third party libraries which use C code for files I/O (With filename as "mbcs" encoded bytes internally).

Theses libraries generally call "filename.encode(sys.getfilesystemencoding())" or "os.fsencode(filename)" before sending filenames from Python to C code.

Hum, Python lacks a function to encode to/decode from the ANSI code page, something like codecs.code_page_encode() / code_page_decode() with CP_ACP. It would allow to get the same encoding in UTF-8 and legacy modes.

vstinner · 2017-01-13T17:14:28Z

Hum, it was long time ago since I worked on Windows. Well, Python has a "mbcs" codec which uses the ANSI code page which exists like "forever". These libraries should be patched to use "mbcs" instead of sys.getfilesystemencoding().

JGoutin · 2017-01-13T17:31:51Z

Yes, I reported this encoding issue to some of them.

But, there is still some problems :

Some libraries are not updated frequently (Or not still maintained), and still use fsencode.
Tests and CI don't see this problem if they don't have a test case for filename with accents or other uncommon characters in english.

This problem will not be easy to eliminate totally...

JGoutin · 2017-01-20T07:14:19Z

A little encoding cache benchmark.

Current Code:
=============

import sys

def _fscodec():
    encoding = sys.getfilesystemencoding()
    errors = sys.getfilesystemencodeerrors()

    def fsencode(filename):
        filename = fspath(filename)  # Does type-checking of `filename`.
        if isinstance(filename, str):
            return filename.encode(encoding, errors)
        else:
            return filename

    def fsdecode(filename):
        filename = fspath(filename)  # Does type-checking of `filename`.
        if isinstance(filename, bytes):
            return filename.decode(encoding, errors)
        else:
            return filename

    return fsencode, fsdecode

fsencode, fsdecode = _fscodec()
del _fscodec

---------

import os

%timeit os.fsdecode(b'\xc3\xa9')
The slowest run took 21.59 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 449 ns per loop

%timeit os.fsencode('é')
The slowest run took 24.20 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 412 ns per loop

Modified Code:
==============

from sys import getfilesystemencoding, getfilesystemencodeerrors

def fsencode(filename):
    filename = fspath(filename)  # Does type-checking of `filename`.
    if isinstance(filename, str):
        return filename.encode(getfilesystemencoding(),
                               getfilesystemencodeerrors())
    else:
        return filename

def fsdecode(filename):
    filename = fspath(filename)  # Does type-checking of `filename`.
    if isinstance(filename, bytes):
        return filename.decode(getfilesystemencoding(),
                               getfilesystemencodeerrors())
    else:
        return filename

---------

import os

%timeit os.fsdecode(b'\xc3\xa9')
The slowest run took 15.88 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 541 ns per loop

%timeit os.fsencode('é')
The slowest run took 19.32 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 502 ns per loop

Result:
=======

Cache is a 17% speed up optimization.

zooba · 2017-01-21T22:45:55Z

Thanks for checking that.

I don't think it's worth retaining the cache on Windows in the face of the broken behaviour. Any real-world case where a lot of paths are being encoded or decoded is also likely to involve file-system access which will dwarf the encoding time. Further, passing bytes on Windows will result in another decode/encode cycle anyway, so there will be a bigger performance impact in using str (though even then, probably only when the str is already represented using 16-bit characters).

Unless somebody wants to make a case for having a more complex mechanism to reset the cache, I'll make the change to remove it (protected by an 'if sys.platform.startswith('win')' check).

vstinner · 2017-01-22T00:29:07Z

Can't we just update the cache when the function changes the encoding?

zooba · 2017-01-22T01:20:26Z

How?

zooba · 2018-02-19T23:01:19Z

I took another look at this and it's still unclear whether it's worth the performance loss.

Perhaps moving fsencode and fsdecode (almost) entirely into C would be a better approach? That shouldn't send us backwards at all, and all they really do is a typecheck and then calling a function that's already written in C.

iritkatriel · 2022-01-16T20:06:15Z

With 3.6 being over, is _enablelegacywindowsfsencoding still needed or is it time to deprecate it?

methane · 2022-01-17T01:40:59Z

Mercurial still use it.
https://www.mercurial-scm.org/repo/hg-stable/file/tip/mercurial/pycompat.py#l113

Mercurial has plan to move filesystem name from ANSI Code Page to UTF-8, but I don't know about its progress.
https://www.mercurial-scm.org/wiki/WindowsUTF8Plan

methane · 2023-06-25T15:29:20Z

Mercurial was affected by this bug.
@indygreg fixed this bug by replacing os.fsencode() with a.encode("mbcs", "ignore").

https://repo.mercurial-scm.org/hg/rev/00e0c5c06ed5

I don't know much about mercurial internal, but they may use legacy fs encoding like passing bytes path to functions like open.

But they use legacystdio by setting envvar. I think they can use legacyfsencoding by setting envvar or Python/C API too.

> rg PYTHONLEGACY
contrib/win32/hg.bat
7:set PYTHONLEGACYWINDOWSSTDIO=1

tests/run-tests.py
1456:            env['PYTHONLEGACYWINDOWSSTDIO'] = '1'

mercurial/exewrapper.c
51:	_wputenv(L"PYTHONLEGACYWINDOWSSTDIO=1");

@indygreg Would you try PYTHONLEGACYWINDOWSFSENCODING in Mercurial?
Or can you recommend a Mercurial developer who is familiar with Windows?

vstinner · 2023-06-26T06:19:38Z

IMO it's time to deprecate it since https://peps.python.org/pep-0686/ got approved for Python 3.15. As you can see, supporting other encodings than UTF-8 remains complicated and is less tested. For me, this legacy encoding was supposed to be a temporary thing to ease the Python on Windows migration to UTF-8.

vstinner · 2023-06-26T06:21:26Z

I wrote an article about PEP 529: https://vstinner.github.io/python36-utf8-windows.html

vstinner · 2023-06-26T06:26:59Z

This issue can be fixed by adding a private os._update_fs_encoding() function which would update fsencode(), fsdecode(), and os.environ functions. sys._enablelegacyfsencoding() would call it if os is cached in sys.modules. Something like that.

methane · 2023-06-26T06:42:59Z

We can not fix every possible cached fsencoding, and paths already encoded in utf-8.
If Mercurail can use PYTHONLEGACYWINDOWSFSENCODING instead of _enablelegacywindowsfsencoding, no need to fix/keep it.

vstinner · 2023-06-26T08:01:29Z

Every attempt to change the filesystem encoding at runtime was a big failure: https://vstinner.github.io/painful-history-python-filesystem-encoding.html

I already proposed to deprecate sys._enablelegacywindowsfsencoding() some years ago, but apparently it was too early.

methane · 2023-06-26T15:22:19Z

I filed an issue to Mercurial bugzilla.
https://bz.mercurial-scm.org/show_bug.cgi?id=6829

zooba · 2023-06-26T16:37:53Z

I already proposed to deprecate sys._enablelegacywindowsfsencoding() some years ago, but apparently it was too early.

I don't remember your proposal, but it was always meant to be deprecated. Just never actually seemed important enough to do it.

Fixing this would require making fsencode and fsdecode less efficient for everyone, all the time. I don't think it's worth it.

methane · 2023-06-28T09:40:14Z

Almost only user of legacywindowsfsencoding is Mercurial. So we need help from Mercurial developer...

methane · 2023-12-28T08:33:35Z

Since there is no reply from Mercurial developers, I merged the deprecation of _enablelegacywindowsfsencoding.
Can we close this issue as "won't fix"?

zooba · 2023-12-28T12:16:44Z

Sounds good to me

…7729)

JGoutin mannequin added OS-windows type-bug An unexpected behavior, bug, or error labels Jan 11, 2017

zooba added the 3.7 (EOL) end of life label Jan 21, 2017

zooba self-assigned this Jan 21, 2017

zooba removed their assignment Feb 19, 2018

ezio-melotti transferred this issue from another repository Apr 10, 2022

methane mentioned this issue Jun 25, 2023

fsencode and fsdecode() ignore sys._enablelegacywindowsfsencoding() #106042

Closed

bedevere-bot mentioned this issue Aug 7, 2023

gh-73427: deprecate _enablelegacywindowsfsencoding #107729

Merged

iritkatriel added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Nov 29, 2023

methane added a commit that referenced this issue Dec 28, 2023

gh-73427: deprecate _enablelegacywindowsfsencoding (#107729)

bfee2f7

zooba closed this as not planned Won't fix, can't repro, duplicate, stale Dec 28, 2023

kulikjak pushed a commit to kulikjak/cpython that referenced this issue Jan 22, 2024

pythongh-73427: deprecate _enablelegacywindowsfsencoding (python#10…

47243d3

…7729)

aisk pushed a commit to aisk/cpython that referenced this issue Feb 11, 2024

pythongh-73427: deprecate _enablelegacywindowsfsencoding (python#10…

1edf7f1

…7729)

Glyphack pushed a commit to Glyphack/cpython that referenced this issue Sep 2, 2024

pythongh-73427: deprecate _enablelegacywindowsfsencoding (python#10…

a776e4f

…7729)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sys._enablelegacywindowsfsencoding() don't apply to os.fsencode and os.fsdecode #73427

sys._enablelegacywindowsfsencoding() don't apply to os.fsencode and os.fsdecode #73427

JGoutin mannequin commented Jan 11, 2017 •

edited by bedevere-bot

Loading

JGoutin mannequin commented Jan 11, 2017

zooba commented Jan 11, 2017

JGoutin mannequin commented Jan 12, 2017

zooba commented Jan 12, 2017

malemburg commented Jan 13, 2017

vstinner commented Jan 13, 2017

zooba commented Jan 13, 2017

JGoutin mannequin commented Jan 13, 2017

vstinner commented Jan 13, 2017

vstinner commented Jan 13, 2017

JGoutin mannequin commented Jan 13, 2017

JGoutin mannequin commented Jan 20, 2017

zooba commented Jan 21, 2017

vstinner commented Jan 22, 2017

zooba commented Jan 22, 2017

zooba commented Feb 19, 2018

iritkatriel commented Jan 16, 2022

methane commented Jan 17, 2022

methane commented Jun 25, 2023 •

edited

Loading

vstinner commented Jun 26, 2023

vstinner commented Jun 26, 2023

vstinner commented Jun 26, 2023

methane commented Jun 26, 2023

vstinner commented Jun 26, 2023

methane commented Jun 26, 2023

zooba commented Jun 26, 2023

methane commented Jun 28, 2023

methane commented Dec 28, 2023

zooba commented Dec 28, 2023

sys._enablelegacywindowsfsencoding() don't apply to os.fsencode and os.fsdecode #73427

sys._enablelegacywindowsfsencoding() don't apply to os.fsencode and os.fsdecode #73427

Comments

JGoutin mannequin commented Jan 11, 2017 • edited by bedevere-bot Loading

Linked PRs

JGoutin mannequin commented Jan 11, 2017

zooba commented Jan 11, 2017

JGoutin mannequin commented Jan 12, 2017

zooba commented Jan 12, 2017

malemburg commented Jan 13, 2017

vstinner commented Jan 13, 2017

zooba commented Jan 13, 2017

JGoutin mannequin commented Jan 13, 2017

vstinner commented Jan 13, 2017

vstinner commented Jan 13, 2017

JGoutin mannequin commented Jan 13, 2017

JGoutin mannequin commented Jan 20, 2017

zooba commented Jan 21, 2017

vstinner commented Jan 22, 2017

zooba commented Jan 22, 2017

zooba commented Feb 19, 2018

iritkatriel commented Jan 16, 2022

methane commented Jan 17, 2022

methane commented Jun 25, 2023 • edited Loading

vstinner commented Jun 26, 2023

vstinner commented Jun 26, 2023

vstinner commented Jun 26, 2023

methane commented Jun 26, 2023

vstinner commented Jun 26, 2023

methane commented Jun 26, 2023

zooba commented Jun 26, 2023

methane commented Jun 28, 2023

methane commented Dec 28, 2023

zooba commented Dec 28, 2023

JGoutin mannequin commented Jan 11, 2017 •

edited by bedevere-bot

Loading

methane commented Jun 25, 2023 •

edited

Loading