gh-82927: Update files related to HTML entities. #92504

Merged 2 commits on Jun 21, 2022
1 change: 1 addition & 0 deletions .github/CODEOWNERS
@@ -53,6 +53,7 @@ Python/pythonrun.c @iritkatriel
/Lib/html/ @ezio-melotti
/Lib/_markupbase.py @ezio-melotti
/Lib/test/test_html*.py @ezio-melotti
/Tools/scripts/*html5* @ezio-melotti

# Import (including importlib).
# Ignoring importlib.h so as to not get flagged on
4 changes: 2 additions & 2 deletions Doc/library/html.entities.rst
@@ -34,12 +34,12 @@ This module defines four dictionaries, :data:`html5`,

.. data:: name2codepoint

A dictionary that maps HTML entity names to the Unicode code points.
A dictionary that maps HTML4 entity names to the Unicode code points.


.. data:: codepoint2name

A dictionary that maps Unicode code points to HTML entity names.
A dictionary that maps Unicode code points to HTML4 entity names.


.. rubric:: Footnotes
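
The two HTML4 dictionaries documented in this hunk are inverses of each other: one maps names to integer code points, the other maps code points back to names. A minimal sketch of how they can be used follows; the helper name `entity_to_char` is hypothetical and not part of the module.

```python
from html.entities import codepoint2name, name2codepoint


def entity_to_char(name):
    """Return the character for an HTML4 entity name (hypothetical helper)."""
    # name2codepoint stores integers, so chr() is needed to get a string.
    return chr(name2codepoint[name])


# 'AElig' is listed in Lib/html/entities.py as 0x00c6.
aelig_codepoint = name2codepoint['AElig']
aelig_name = codepoint2name[aelig_codepoint]
```

Because the values are integers rather than strings, these dictionaries cover only entities that map to a single code point.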
9 changes: 6 additions & 3 deletions Lib/html/entities.py
@@ -3,8 +3,7 @@
__all__ = ['html5', 'name2codepoint', 'codepoint2name', 'entitydefs']


# maps the HTML entity name to the Unicode code point
# from https://html.spec.whatwg.org/multipage/named-characters.html
# maps HTML4 entity name to the Unicode code point
name2codepoint = {
'AElig': 0x00c6, # latin capital letter AE = latin capital ligature AE, U+00C6 ISOlat1
'Aacute': 0x00c1, # latin capital letter A with acute, U+00C1 ISOlat1
@@ -261,7 +260,11 @@
}


# maps the HTML5 named character references to the equivalent Unicode character(s)
# HTML5 named character references
# Generated by 'Tools/scripts/parse_html5_entities.py'
# from https://html.spec.whatwg.org/entities.json and
# https://html.spec.whatwg.org/multipage/named-characters.html.
# Map HTML5 named character references to the equivalent Unicode character(s).
html5 = {
'Aacute': '\xc1',
'aacute': '\xe1',
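
Unlike `name2codepoint`, the `html5` dictionary maps names directly to strings, which lets a single reference expand to more than one character. The sketch below illustrates the lookup; the helper name `lookup_html5` is hypothetical, and `html.unescape` is shown only as the usual high-level entry point in the standard library.

```python
import html
from html.entities import html5


def lookup_html5(name):
    """Return the character(s) for an HTML5 reference name, or None (hypothetical helper)."""
    return html5.get(name)


upper = lookup_html5('Aacute')      # entry shown in the hunk above
lower = lookup_html5('aacute')      # names are case-sensitive
decoded = html.unescape('&Aacute;')  # decodes references inside a full string
```

Lookups are case-sensitive, which is why the generated dictionary keeps `'Aacute'` and `'aacute'` as separate keys.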
@@ -0,0 +1,2 @@
The ``Tools/scripts/parseentities.py`` script used to parse HTML4 entities
has been removed.
27 changes: 18 additions & 9 deletions Tools/scripts/parse_html5_entities.py
@@ -2,10 +2,14 @@
"""
Utility for parsing HTML5 entity definitions available from:

http://dev.w3.org/html5/spec/entities.json
https://html.spec.whatwg.org/entities.json
https://html.spec.whatwg.org/multipage/named-characters.html

Written by Ezio Melotti and Iuliia Proskurnia.
The page now contains the following note:

"This list is static and will not be expanded or changed in the future."

Written by Ezio Melotti and Iuliia Proskurnia.
"""

import os
@@ -14,7 +18,9 @@
from urllib.request import urlopen
from html.entities import html5

entities_url = 'http://dev.w3.org/html5/spec/entities.json'
PAGE_URL = 'https://html.spec.whatwg.org/multipage/named-characters.html'
ENTITIES_URL = 'https://html.spec.whatwg.org/entities.json'
HTML5_SECTION_START = '# HTML5 named character references'

def get_json(url):
"""Download the json file from the url and returns a decoded object."""
@@ -62,29 +68,32 @@ def write_items(entities, file=sys.stdout):
# be before their equivalent lowercase version.
keys = sorted(entities.keys())
keys = sorted(keys, key=str.lower)
print(HTML5_SECTION_START, file=file)
print(f'# Generated by {sys.argv[0]!r}\n'
f'# from {ENTITIES_URL} and\n'
f'# {PAGE_URL}.\n'
f'# Map HTML5 named character references to the '
f'equivalent Unicode character(s).', file=file)
print('html5 = {', file=file)
for name in keys:
print(' {!r}: {!a},'.format(name, entities[name]), file=file)
print(f' {name!r}: {entities[name]!a},', file=file)
print('}', file=file)


if __name__ == '__main__':
# without args print a diff between html.entities.html5 and new_html5
# with --create print the new html5 dict
# with --patch patch the Lib/html/entities.py file
new_html5 = create_dict(get_json(entities_url))
new_html5 = create_dict(get_json(ENTITIES_URL))
if '--create' in sys.argv:
print('# map the HTML5 named character references to the '
'equivalent Unicode character(s)')
print('# Generated by {}. Do not edit manually.'.format(__file__))
write_items(new_html5)
elif '--patch' in sys.argv:
fname = 'Lib/html/entities.py'
temp_fname = fname + '.temp'
with open(fname) as f1, open(temp_fname, 'w') as f2:
skip = False
for line in f1:
if line.startswith('html5 = {'):
if line.startswith(HTML5_SECTION_START):
write_items(new_html5, file=f2)
skip = True
continue
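
The `--patch` flow above can be sketched in memory, without touching files. This is an illustration under stated assumptions, not the script's actual code: `render_section` and `patch_lines` are hypothetical names, the "Generated by" header lines are omitted, and the rule that the old dictionary ends at the first line starting with `}` is an assumption, since the rest of the loop is not shown in this diff.

```python
HTML5_SECTION_START = '# HTML5 named character references'


def render_section(entities):
    """Return the regenerated html5 section as a list of lines (hypothetical helper)."""
    # Two stable sorts, as in write_items(): case-sensitive first, then
    # case-insensitive, so an uppercase name precedes its lowercase twin.
    keys = sorted(sorted(entities), key=str.lower)
    lines = [HTML5_SECTION_START, 'html5 = {']
    lines += [f'    {name!r}: {entities[name]!a},' for name in keys]
    lines.append('}')
    return lines


def patch_lines(old_lines, entities):
    """Replace the old html5 section with a regenerated one (hypothetical helper)."""
    out = []
    skip = False
    for line in old_lines:
        if line.startswith(HTML5_SECTION_START):
            out.extend(render_section(entities))
            skip = True
            continue
        if skip:
            # Assumption: the old dict ends at the first line starting with '}'.
            if line.startswith('}'):
                skip = False
            continue
        out.append(line)
    return out
```

Matching on the `HTML5_SECTION_START` comment instead of `html5 = {` means the generated header comments are replaced together with the dictionary, so repeated runs do not accumulate stale comment lines.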
64 changes: 0 additions & 64 deletions Tools/scripts/parseentities.py

This file was deleted.