[SOAR-0001] Improved mapping of identifiers #89

Merged
merged 11 commits into from
Aug 2, 2023
126 changes: 62 additions & 64 deletions Sources/_OpenAPIGeneratorCore/Extensions/String.swift
@@ -56,20 +56,19 @@ fileprivate extension String {

/// Returns a string sanitized to be usable as a Swift identifier.
///
/// For example, the string `$nake` would be returned as `_nake`, because
/// the dollar sign is not a valid character in a Swift identifier.
/// For example, the string `$nake…` would be returned as `_dollar_nake_x2026_`, because
/// both the dollar and ellipsis sign are not valid characters in a Swift identifier.
/// So, it replaces such characters with their HTML entity equivalents or Unicode hex representation,
/// in case it's not present in the `specialCharsMap`. It marks this replacement with `_` as a delimiter.
///
/// In addition to replacing illegal characters with an underscores, also
/// In addition to replacing illegal characters, it also
/// ensures that the identifier starts with a letter and not a number.
var sanitizedForSwiftCode: String {
guard !isEmpty else {
return "_empty"
}

// Only allow [a-zA-Z][a-zA-Z0-9_]*
// This is bad, is there something like percent encoding functionality but for general "allowed chars only"?

let firstCharSet: CharacterSet = .letters
let firstCharSet: CharacterSet = .letters.union(.init(charactersIn: "_"))
let numbers: CharacterSet = .decimalDigits
let otherCharSet: CharacterSet = .alphanumerics.union(.init(charactersIn: "_"))

@@ -83,13 +82,32 @@ fileprivate extension String {
sanitizedScalars.append("_")
outScalar = scalar
} else {
outScalar = "_"
sanitizedScalars.append("_")
if let entityName = Self.specialCharsMap[scalar] {
for char in entityName.unicodeScalars {
sanitizedScalars.append(char)
}
} else {
sanitizedScalars.append("x")
let hexString = String(scalar.value, radix: 16, uppercase: true)
for char in hexString.unicodeScalars {
sanitizedScalars.append(char)
}
}
sanitizedScalars.append("_")
continue
}
sanitizedScalars.append(outScalar)
}

let validString = String(UnicodeScalarView(sanitizedScalars))

// Special case for a single underscore.
// We can't add it to the map as it's a valid Swift identifier in other cases.
if validString == "_" {
return "_underscore_"
}

guard Self.keywords.contains(validString) else {
return validString
}
@@ -153,62 +171,6 @@ fileprivate extension String {
"true",
"try",
"throws",
"__FILE__",
"__LINE__",
"__COLUMN__",
"__FUNCTION__",
"__DSO_HANDLE__",
"_",
"(",
")",
"{",
"}",
"[",
"]",
"<",
">",
".",
".",
",",
"...",
":",
";",
"=",
"@",
"#",
"&",
"->",
"`",
"\\",
"!",
"?",
"?",
"\"",
"\'",
"\"\"\"",
"#keyPath",
"#line",
"#selector",
"#file",
"#fileID",
"#filePath",
"#column",
"#function",
"#dsohandle",
"#assert",
"#sourceLocation",
"#warning",
"#error",
"#if",
"#else",
"#elseif",
"#endif",
"#available",
"#unavailable",
"#fileLiteral",
"#imageLiteral",
"#colorLiteral",
")",
"yield",
"String",
"Error",
@@ -220,4 +182,40 @@ fileprivate extension String {
"Protocol",
"await",
]

/// A map of ASCII printable characters to their HTML entity names.
private static let specialCharsMap: [Unicode.Scalar: String] = [
" ": "space",
"!": "excl",
"\"": "quot",
"#": "num",
"$": "dollar",
"%": "percnt",
"&": "amp",
"'": "apos",
"(": "lpar",
")": "rpar",
"*": "ast",
"+": "plus",
",": "comma",
"-": "hyphen",
".": "period",
"/": "sol",
":": "colon",
";": "semi",
"<": "lt",
"=": "equals",
">": "gt",
"?": "quest",
"@": "commat",
"[": "lbrack",
"\\": "bsol",
"]": "rbrack",
"^": "hat",
"`": "grave",
"{": "lcub",
"|": "verbar",
"}": "rcub",
"~": "tilde",
]
}
@@ -0,0 +1,50 @@
# SOAR-0001

Encoding for Property Names

## Overview

- Proposal: SOAR-0001
- Author(s): [Denil](https://github.com/denil-ct)
- Status: **Awaiting Review**
- Issue: https://github.com/apple/swift-openapi-generator/issues/21
- Implementation:
    - https://github.com/apple/swift-openapi-generator/pull/89
- Affected components:
    - generator

### Introduction

The goal of this proposal is to improve the way we handle unsupported characters in property names when generating code from specs. Currently, we use a block list approach that replaces each offending character with `_`, which can cause name conflicts. By encoding the offending character instead, we create unique and valid property names. This avoids name collisions and ensures consistent code generation.

### Motivation

The current approach for handling unsupported characters in property names is not robust and can lead to unexpected and undesirable outcomes. For example, if a specification contains two properties, `a_b` and `a b`, the current implementation generates the same property name, `a_b`, for both, which creates a conflict. It can also lose information or meaning from the original specification. Therefore, we need a solution that handles any unsupported character in a consistent and reliable way, without compromising the quality and functionality of the generated code.
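
The collision described above can be reproduced with a minimal sketch. The function `replaceWithUnderscore` is a hypothetical stand-in for the previous behavior, not the generator's actual code, and it assumes that only ASCII letters, digits, and `_` are allowed (the real implementation uses broader character sets).

```swift
/// Hypothetical stand-in for the previous behavior (an assumption for illustration):
/// every scalar outside [A-Za-z0-9_] is replaced with a plain "_".
func replaceWithUnderscore(_ input: String) -> String {
    var output = String.UnicodeScalarView()
    for scalar in input.unicodeScalars {
        let value = scalar.value
        let isASCIILetter = (0x41...0x5A).contains(value) || (0x61...0x7A).contains(value)
        let isASCIIDigit = (0x30...0x39).contains(value)
        let isAllowed = isASCIILetter || isASCIIDigit || scalar == "_"
        output.append(isAllowed ? scalar : "_")
    }
    return String(output)
}

// Two distinct property names from a specification.
let specNames = ["a_b", "a b"]
let generated = specNames.map(replaceWithUnderscore)

// Both names collapse to the same Swift identifier.
print(generated)             // ["a_b", "a_b"]
print(Set(generated).count)  // 1
```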

### Proposed solution

The proposed solution is to use a mix of replacement words and hex encoding for any unsupported character in property names. We replace characters in the printable ASCII range (0x20-0x7E) with a word-based representation inspired by the [HTML entity names](https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references). Characters that have no replacement word are hex encoded. Hex encoding is a simple and standard way of representing any character as a sequence of hexadecimal digits. For example, the hex code of the asterisk (*) character is 2A, of the space ( ) character is 20, and of the slash (/) character is 2F (these three fall in the printable ASCII range, so under this proposal they get replacement words instead). Hex encoding also has the added benefit of not introducing any additional special characters.
In addition, we prefix each hex code with an `x` to indicate that it is a hex value, and we surround each replacement with underscore characters as delimiters to indicate a possible replacement.

Some examples:

yaml | swift
-- | --
a b | a_space_b
a*b | a_ast_b
ab_ | ab_
ab* | ab_ast_
/ab | _sol_ab
Hu&J_?kin | Hu_amp_J__quest_kin
$nake… | \_dollar_nake_x2026\_
message | message

This means that a future version of the generator might produce different names than the current version does. Users of the generator should be prepared to update their code before upgrading to that version.
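
The rows of the table above can be reproduced with a short sketch of the encoding. This is an illustration under stated assumptions, not the code from the pull request: `replacementWords` is a hypothetical excerpt of the PR's `specialCharsMap`, `encodeUnsupportedCharacters(in:)` is a hypothetical helper name, and the sketch allows only ASCII letters, digits, and `_`, while the real implementation uses broader character sets. The handling of empty strings, a single `_`, leading digits, and keywords is left out.

```swift
// Hypothetical excerpt of the replacement-word map; the PR's `specialCharsMap` has more entries.
let replacementWords: [Unicode.Scalar: String] = [
    " ": "space", "$": "dollar", "&": "amp", "*": "ast", "/": "sol", "?": "quest",
]

/// Hypothetical helper, not the generator's API. Keeps ASCII letters, digits,
/// and `_`; encodes every other scalar as `_<word>_` or `_x<HEX>_`.
func encodeUnsupportedCharacters(in name: String) -> String {
    var result = ""
    for scalar in name.unicodeScalars {
        let value = scalar.value
        let isASCIILetter = (0x41...0x5A).contains(value) || (0x61...0x7A).contains(value)
        let isASCIIDigit = (0x30...0x39).contains(value)
        if isASCIILetter || isASCIIDigit || scalar == "_" {
            // Allowed scalars pass through unchanged.
            result.unicodeScalars.append(scalar)
        } else if let word = replacementWords[scalar] {
            // Known special character: use its replacement word.
            result += "_\(word)_"
        } else {
            // Anything else: uppercase hex code point, prefixed with "x".
            result += "_x" + String(value, radix: 16, uppercase: true) + "_"
        }
    }
    return result
}

print(encodeUnsupportedCharacters(in: "a b"))        // a_space_b
print(encodeUnsupportedCharacters(in: "Hu&J_?kin"))  // Hu_amp_J__quest_kin
print(encodeUnsupportedCharacters(in: "$nake…"))     // _dollar_nake_x2026_
```

Note that an existing underscore, as in `ab_`, passes through unchanged, which is why the delimiters only indicate a possible replacement.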

### Detailed design

The implementation is quite simple, as you can see in https://github.com/apple/swift-openapi-generator/pull/89: we only changed the substitution logic, which used to substitute every special character with `_`. We added an encoding step that is applied to the special character before substituting it. Contributors should be aware of this change, review the places where they use this extension, and evaluate whether it is still suitable for them.

### API stability

This is an API-breaking change, as it produces different symbol names than before. Other components, such as the runtime and transports, should not be affected.