Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-91810: ElementTree: Use text file's encoding by default in XML declaration #91903

Merged

Conversation

serhiy-storchaka
Copy link
Member

ElementTree method write() and function tostring() now use the text file's
encoding ("UTF-8" if not available) instead of locale encoding in XML
declaration when encoding="unicode" is specified.

Closes #91810.

…ML declaration

ElementTree method write() and function tostring() now use the text file's
encoding ("UTF-8" if not available) instead of locale encoding in XML
declaration when encoding="unicode" is specified.
@serhiy-storchaka serhiy-storchaka added type-bug An unexpected behavior, bug, or error needs backport to 3.9 only security fixes needs backport to 3.10 only security fixes labels Apr 25, 2022
@miss-islington
Copy link
Contributor

Thanks @serhiy-storchaka for the PR 🌮🎉.. I'm working now to backport this PR to: 3.9, 3.10, 3.11.
🐍🍒⛏🤖

@serhiy-storchaka serhiy-storchaka deleted the etree-write-default-encoding branch May 11, 2022 06:31
@bedevere-bot bedevere-bot removed the needs backport to 3.11 only security fixes label May 11, 2022
@bedevere-bot
Copy link

GH-92663 is a backport of this pull request to the 3.11 branch.

@bedevere-bot
Copy link

GH-92664 is a backport of this pull request to the 3.10 branch.

@bedevere-bot bedevere-bot removed needs backport to 3.10 only security fixes needs backport to 3.9 only security fixes labels May 11, 2022
@bedevere-bot
Copy link

GH-92665 is a backport of this pull request to the 3.9 branch.

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request May 11, 2022
…ML declaration (pythonGH-91903)

ElementTree method write() and function tostring() now use the text file's
encoding ("UTF-8" if not available) instead of locale encoding in XML
declaration when encoding="unicode" is specified.
(cherry picked from commit 707839b)

Co-authored-by: Serhiy Storchaka <[email protected]>
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request May 11, 2022
…ML declaration (pythonGH-91903)

ElementTree method write() and function tostring() now use the text file's
encoding ("UTF-8" if not available) instead of locale encoding in XML
declaration when encoding="unicode" is specified.
(cherry picked from commit 707839b)

Co-authored-by: Serhiy Storchaka <[email protected]>
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request May 11, 2022
…ML declaration (pythonGH-91903)

ElementTree method write() and function tostring() now use the text file's
encoding ("UTF-8" if not available) instead of locale encoding in XML
declaration when encoding="unicode" is specified.
(cherry picked from commit 707839b)

Co-authored-by: Serhiy Storchaka <[email protected]>
miss-islington added a commit that referenced this pull request May 11, 2022
…XML declaration (GH-91903) (GH-92663)

ElementTree method write() and function tostring() now use the text file's
encoding ("UTF-8" if not available) instead of locale encoding in XML
declaration when encoding="unicode" is specified.
(cherry picked from commit 707839b)


Co-authored-by: Serhiy Storchaka <[email protected]>

Automerge-Triggered-By: GH:serhiy-storchaka
miss-islington added a commit that referenced this pull request May 11, 2022
…XML declaration (GH-91903) (GH-92664)

ElementTree method write() and function tostring() now use the text file's
encoding ("UTF-8" if not available) instead of locale encoding in XML
declaration when encoding="unicode" is specified.
(cherry picked from commit 707839b)


Co-authored-by: Serhiy Storchaka <[email protected]>

Automerge-Triggered-By: GH:serhiy-storchaka
miss-islington added a commit that referenced this pull request May 11, 2022
…ML declaration (GH-91903) (GH-92665)

ElementTree method write() and function tostring() now use the text file's
encoding ("UTF-8" if not available) instead of locale encoding in XML
declaration when encoding="unicode" is specified.
(cherry picked from commit 707839b)


Co-authored-by: Serhiy Storchaka <[email protected]>

Automerge-Triggered-By: GH:serhiy-storchaka
@Yhg1s
Copy link
Member

Yhg1s commented Jun 1, 2022

FYI, @pablogsal @ambv: this change broke existing behaviour in a subtle way. I'm not sure if it's worth rolling back in 3.9, especially considering it went into the final regular bugfix release. It may be worth fixing the regression in 3.10 or 3.11, or it may be deemed to be a documentation issue.

The documentation of ElementTree.write() says: xml_declaration controls if an XML declaration should be added to the file. Use False for never, True for always, None for only if not US-ASCII or UTF-8 or Unicode (default is None).

Before this change, passing a file opened in text mode regardless of the encoding used and passing encoding='unicode' to ElementTree.write() would not make ElementTree.write() produce the XML declaration.

After this change, passing a file opened in text mode with UTF-8 or ascii as encoding, and passing encoding='unicode' to ElementTree.write(), no XML declaration is produced (as before). However, passing a file opened in any other encoding (e.g. iso-8859-1) implicitly or explicitly, with encoding='unicode' passed to ElementTree.write() now does produce an XML declaration. This may come as a surprise to code that was manually adding an XML declaration before. For example:

from xml.etree import ElementTree as et

XML_HEADER = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE myelement>
"""

root = et.Element("myelement")
tree = et.ElementTree(root)

with open("myfile.xml", "w", encoding="iso-8859-1") as f:
    f.write(XML_HEADER)
    tree.write(f, encoding='unicode')
with open("myfile.xml", encoding="iso-8859-1") as f:
    print(f.read())

... will produce invalid XML because of this change:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE myelement>
<?xml version='1.0' encoding='iso-8859-1'?>
<myelement />

... where before this change it would not emit the second XML declaration. (This is arguably problematic code, but unfortunately the file's encoding can of course be implicit, which makes it harder to tell what's going on. This was synthesized from real code found while upgrading to 3.9.13 at Google.)

@pablogsal
Copy link
Member

I am a bit worried that for 3.10 people have already developed workarounds for this and we are going to break them again. On the other hand, one could argue that the situation is too specific and that's why nobody has raised issues so far. I am ok rolling it back, though, if that's the consensus.

@Yhg1s
Copy link
Member

Yhg1s commented Jun 1, 2022

Actually, this change didn't make it into 3.10.4, so it's only in 3.9.13 and 3.11 at this point.

@serhiy-storchaka
Copy link
Member Author

#93426 fixes this.

hello-adam pushed a commit to hello-adam/cpython that referenced this pull request Jun 2, 2022
…t in XML declaration (pythonGH-91903) (pythonGH-92665)

ElementTree method write() and function tostring() now use the text file's
encoding ("UTF-8" if not available) instead of locale encoding in XML
declaration when encoding="unicode" is specified.
(cherry picked from commit 707839b)


Co-authored-by: Serhiy Storchaka <[email protected]>

Automerge-Triggered-By: GH:serhiy-storchaka
@ambv
Copy link
Contributor

ambv commented Jun 16, 2022

I think we should fix it in 3.9.14 as well since it's a regression only in 3.9.13.

@ambv
Copy link
Contributor

ambv commented Jun 16, 2022

(In the mean time we did release 3.10.5 with the bug which made this version affected too)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-XML type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ElementTree should use UTF-8 for xml declaration.
8 participants