Support for UTF-8 Character Class Processing
- Added UTF-8 utility functions to Tf to determine whether a code point is in the XID_Start / XID_Continue character classes
- Added static data for holding code point flags for XID_Start / XID_Continue
- Added tests for XID_Start / XID_Continue code point validity
- Added pre-processing script to generate character class ranges for XID_Start / XID_Continue from source DerivedCoreProperties.txt
Showing 8 changed files with 2,842 additions and 1 deletion.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,46 @@ | ||
# Generating Character Classes from the Unicode Database | ||
|
||
To properly process UTF-8 encoded strings, the system needs to know which code | ||
points fall into which Unicode character class. This is useful, for example, | ||
when processing identifiers to determine whether the first character is a | ||
code point that falls into the `XID_Start` Unicode character class. Unicode | ||
defines a maximum of 17 * 2^16 code points, and we need an efficient way of | ||
representing character class containment for each of these code points. The | ||
data structures chosen for this are implemented in the | ||
`pxr/base/tf/unicode/unicodeCharacterClasses.template.cpp` file, but the code | ||
points belonging to each character class must be generated from a source | ||
version of the Unicode database. | ||
|
||
This directory contains a script `tfGenCharacterClasses.py` that will read in | ||
character class information from a source version of the Unicode database and | ||
generate the `pxr/base/tf/unicodeCharacterClasses.cpp` file from the provided | ||
`pxr/base/tf/unicode/unicodeCharacterClasses.template.cpp` file. The Unicode | ||
database provides a post-processed file called `DerivedCoreProperties.txt` in | ||
its core collateral. For the script to function, this file must be present | ||
locally on disk (see below for information about where to obtain the source | ||
Unicode character class data). Once the file is present locally, the | ||
character classes can be generated via the following command: | ||
|
||
``` | ||
# example run from the pxr/base/tf/unicode directory | ||
python tfGenCharacterClasses.py \ | ||
    --srcDir <path/to/directory/containing/DerivedCoreProperties.txt> \ | ||
    --destDir .. --srcTemplate unicodeCharacterClasses.template.cpp | ||
``` | ||
|
||
This command will overwrite the current | ||
`pxr/base/tf/unicodeCharacterClasses.cpp` file with the newly generated | ||
version. | ||
|
||
**NOTE: This script need only be run once when upgrading to a new** | ||
**Unicode version** | ||
|
||
## Source Unicode Database | ||
|
||
The Unicode Character Database consists of a set of files representing | ||
Unicode character properties and can be found at https://unicode.org/ucd/. | ||
The `DerivedCoreProperties.txt` file is located in the `ucd` directory of | ||
the collateral for whichever Unicode version you want to support. | ||
|
||
The current version of `pxr/base/tf/unicodeCharacterClasses.cpp` | ||
was generated from `DerivedCoreProperties.txt` for Unicode Version 15.1.0. |
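The README notes that class containment must be represented efficiently across the full code point space. As a rough illustration only, and not a description of how the Tf C++ code stores its data, the sketch below tests membership against a sorted list of inclusive, non-overlapping `(start, end)` ranges using binary search. The names `build_lookup` and `contains` and the sample ranges are hypothetical.

```python
from bisect import bisect_right

def build_lookup(ranges):
    """Precompute parallel lists of range starts and ends, sorted by start.

    Assumes the inclusive (start, end) ranges do not overlap.
    """
    ordered = sorted(ranges)
    starts = [start for start, _ in ordered]
    ends = [end for _, end in ordered]
    return starts, ends

def contains(lookup, code_point):
    """Return True if code_point lies inside one of the ranges."""
    starts, ends = lookup
    # Find the last range whose start is <= code_point, if any.
    index = bisect_right(starts, code_point) - 1
    return index >= 0 and code_point <= ends[index]

# Hypothetical sample data covering only ASCII letters; real ranges would
# come from DerivedCoreProperties.txt.
sample_lookup = build_lookup([(0x61, 0x7A), (0x41, 0x5A)])
print(contains(sample_lookup, ord("A")))  # True
print(contains(sample_lookup, ord("1")))  # False
```

Each lookup costs O(log n) in the number of ranges, at the price of requiring the ranges to be sorted and disjoint. A flat bit table indexed by code point would trade memory for constant-time lookups.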
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,164 @@ | ||
#!/usr/bin/env python | ||
# | ||
# Copyright 2023 Pixar | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "Apache License") | ||
# with the following modification; you may not use this file except in | ||
# compliance with the Apache License and the following modification to it: | ||
# Section 6. Trademarks. is deleted and replaced with: | ||
# | ||
# 6. Trademarks. This License does not grant permission to use the trade | ||
# names, trademarks, service marks, or product names of the Licensor | ||
# and its affiliates, except as required to comply with Section 4(c) of | ||
# the License and to reproduce the content of the NOTICE file. | ||
# | ||
# You may obtain a copy of the Apache License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the Apache License with the above modification is | ||
# distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
# KIND, either express or implied. See the Apache License for the specific | ||
# language governing permissions and limitations under the Apache License. | ||
# | ||
# A script for generating the character class sets for XID_Start and | ||
# XID_Continue character classes. This takes a source | ||
# DerivedCoreProperties.txt from the Unicode standard and generates a C++ | ||
# source file that populates data structures with the appropriate code points. | ||
'''This script reads the DerivedCoreProperties.txt from a versioned set of | ||
Unicode collateral and generates appropriate data structures for the XID_Start | ||
and XID_Continue character classes used to process UTF-8 encoded strings in | ||
the Tf library.''' | ||
|
||
import os | ||
|
||
from argparse import ArgumentParser | ||
|
||
DERIVEDCOREPROPERTIES_FILE = "DerivedCoreProperties.txt" | ||
TEMPLATE_FILE_NAME = "unicodeCharacterClasses.template.cpp" | ||
CPP_FILE_NAME = "unicodeCharacterClasses.cpp" | ||
|
||
xid_start_range_pairs = [] | ||
xid_continue_range_pairs = [] | ||
|
||
def _write_cpp_file(source_template_path : str, destination_directory : str): | ||
    """ | ||
    Writes the C++ code file that will initialize character class | ||
    sets with the values read by this script. | ||
    Args: | ||
        source_template_path : A string defining the path at which the source | ||
                               template file exists. | ||
        destination_directory: A string defining the path at which the | ||
                               generated cpp file will be written to. | ||
                               If the specified directory does not exist, | ||
                               it will be created. | ||
    """ | ||
    if not os.path.exists(source_template_path): | ||
        # adjacent string literals keep stray indentation out of the message | ||
        raise ValueError("Provided source template file " | ||
                         f"{source_template_path} does not exist!") | ||
|
||
    source_template_content = None | ||
    with open(source_template_path, 'r') as source_template_file: | ||
        source_template_content = source_template_file.read() | ||
|
||
    if not os.path.exists(destination_directory): | ||
        # makedirs also creates any missing parent directories | ||
        os.makedirs(destination_directory) | ||
|
||
    generated_cpp_file_name = os.path.join(destination_directory, | ||
                                           CPP_FILE_NAME) | ||
    with open(generated_cpp_file_name, 'w') as generated_cpp_file: | ||
        # we need to replace two markers, {xid_start_ranges} | ||
        # and {xid_continue_ranges} with the content we derived | ||
        # from DerivedCoreProperties.txt | ||
        xid_start_range_expression = "ranges = {\n" | ||
        for x in xid_start_range_pairs: | ||
            range_expression = "{" + str(x[0]) + ", " + str(x[1]) + "}" | ||
            xid_start_range_expression += f" {range_expression},\n" | ||
        xid_start_range_expression += " };" | ||
|
||
        xid_continue_range_expression = "ranges = {\n" | ||
        for x in xid_continue_range_pairs: | ||
            range_expression = "{" + str(x[0]) + ", " + str(x[1]) + "}" | ||
            xid_continue_range_expression += f" {range_expression},\n" | ||
        xid_continue_range_expression += " };" | ||
|
||
        destination_template_content = source_template_content.replace( | ||
            r"{xid_start_ranges}", xid_start_range_expression) | ||
        destination_template_content = destination_template_content.replace( | ||
            r"{xid_continue_ranges}", xid_continue_range_expression) | ||
|
||
        generated_cpp_file.write(destination_template_content) | ||
|
||
def _parseArguments(): | ||
    """ | ||
    Parses the arguments sent to the script. | ||
    Returns: | ||
        An object containing the parsed arguments as accessible fields. | ||
    """ | ||
    parser = ArgumentParser( | ||
        description='Generate character class sets for Unicode characters.') | ||
    parser.add_argument('--srcDir', required=False, default=os.getcwd(), | ||
        help='The source directory where the DerivedCoreProperties.txt ' | ||
             'file exists.') | ||
    parser.add_argument('--destDir', required=False, default=os.getcwd(), | ||
        help='The destination directory where the processed cpp file will ' | ||
             'be written to.') | ||
    parser.add_argument("--srcTemplate", required=False, | ||
        default=os.path.join(os.getcwd(), TEMPLATE_FILE_NAME), | ||
        help='The full path to the source template file to use.') | ||
|
||
    return parser.parse_args() | ||
|
||
if __name__ == '__main__': | ||
    arguments = _parseArguments() | ||
|
||
    # Parse the DerivedCoreProperties.txt file. Sections of that file contain | ||
    # the derived properties XID_Start and XID_Continue, based on the allowed | ||
    # character classes and code points sourced from UnicodeData.txt. Each | ||
    # line in the file that we are interested in has one of two forms: | ||
    #   codePoint ; XID_Start # character class Character Name | ||
    #   codePointRangeStart..codePointRangeEnd ; XID_Start | ||
    #       # character class [# of elements in range] Character Name | ||
    file_name = os.path.join(arguments.srcDir, DERIVEDCOREPROPERTIES_FILE) | ||
    if not os.path.exists(file_name): | ||
        # adjacent string literals keep stray indentation out of the message | ||
        raise RuntimeError("Error in script: Could not find " | ||
            f"'DerivedCoreProperties.txt' at path {arguments.srcDir}!") | ||
|
||
    with open(file_name, 'r') as file: | ||
        for line in file: | ||
            if "; XID_Start" in line: | ||
                # this is an XID_Start single code point or range | ||
                tokens = line.split(';') | ||
                code_points = tokens[0].strip() | ||
                if ".." in code_points: | ||
                    # this is a ranged code point | ||
                    code_point_ranges = code_points.split("..") | ||
                    start_code_point = int(code_point_ranges[0], 16) | ||
                    end_code_point = int(code_point_ranges[1], 16) | ||
                else: | ||
                    # this is a single code point | ||
                    start_code_point = int(code_points, 16) | ||
                    end_code_point = start_code_point | ||
|
||
                xid_start_range_pairs.append((start_code_point, | ||
                                              end_code_point)) | ||
            elif "; XID_Continue" in line: | ||
                # this is an XID_Continue single code point or range | ||
                tokens = line.split(';') | ||
                code_points = tokens[0].strip() | ||
                if ".." in code_points: | ||
                    # this is a ranged code point | ||
                    code_point_ranges = code_points.split("..") | ||
                    start_code_point = int(code_point_ranges[0], 16) | ||
                    end_code_point = int(code_point_ranges[1], 16) | ||
                else: | ||
                    # this is a single code point | ||
                    start_code_point = int(code_points, 16) | ||
                    end_code_point = start_code_point | ||
|
||
                xid_continue_range_pairs.append((start_code_point, | ||
                                                 end_code_point)) | ||
|
||
    _write_cpp_file(arguments.srcTemplate, arguments.destDir) |
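The two branches of the parsing loop above differ only in the property name they look for. As a minimal sketch, and not part of the commit itself, the shared logic could be factored into a single helper. The function name `parse_property_line` is hypothetical, and the sample lines follow the two line forms described in the script's comments.

```python
def parse_property_line(line, property_name):
    """Return an inclusive (start, end) tuple of code points, or None if the
    line does not assign property_name."""
    # Drop the trailing comment, then split into "code points ; property".
    data = line.split("#", 1)[0]
    fields = [field.strip() for field in data.split(";")]
    if len(fields) < 2 or fields[1] != property_name:
        return None
    # "0041..005A" is a range; "00AA" is a single code point.
    first, _, last = fields[0].partition("..")
    start = int(first, 16)
    end = int(last, 16) if last else start
    return (start, end)

# Sample lines in the format described by the script's comments.
ranged = "0041..005A    ; XID_Start # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z"
single = "00AA          ; XID_Start # Lo       FEMININE ORDINAL INDICATOR"
print(parse_property_line(ranged, "XID_Start"))
print(parse_property_line(single, "XID_Start"))
```

Comparing the property field exactly, rather than testing for a substring of the whole line, also avoids accidental matches on other property names that merely begin with the same text.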