Ident mangling and unicode. #7539

Closed
huonw opened this issue Jul 1, 2013 · 11 comments
Labels
A-debuginfo Area: Debugging information in compiled programs (DWARF, PDB, etc.) A-linkage Area: linking into static, shared libraries and binaries A-Unicode Area: Unicode C-enhancement Category: An issue proposing an enhancement or a PR with one.

Comments

@huonw
Member

huonw commented Jul 1, 2013

#7488 added a fix to mangle Unicode identifiers because the Android assembler can't handle them. It is essentially the simplest mangling possible (using char::escape_unicode and replacing the leading \ with a $), and it was chosen without reference to how any other compiler mangles Unicode.

Presumably, matching any precedent (if there is one) would be best, in terms of tool support etc.

This bug represents the task of researching this and fixing it.
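
For illustration, the escape-based mangling described above can be sketched as follows. This is a minimal sketch, not the code from #7488: `mangle_ident` is a hypothetical name, and it assumes the output format of today's `char::escape_unicode` (`\u{f6}` for `ö`), which may not match what the standard library produced in 2013.

```rust
/// Hypothetical helper: escape every character outside `[_A-Za-z0-9]`
/// using `char::escape_unicode`, with the leading `\` swapped for `$`.
fn mangle_ident(ident: &str) -> String {
    let mut out = String::with_capacity(ident.len());
    for c in ident.chars() {
        if c.is_ascii_alphanumeric() || c == '_' {
            // Plain ASCII identifier characters pass through unchanged.
            out.push(c);
        } else {
            // `escape_unicode` yields `\`, `u`, `{`, hex digits, `}`.
            let mut escaped = c.escape_unicode();
            escaped.next(); // drop the leading backslash
            out.push('$');
            out.extend(escaped);
        }
    }
    out
}
```

Under these assumptions, `mangle_ident("pörk")` gives `p$u{f6}rk`. The result is pure ASCII, but it follows no convention that an existing demangler or debugger would recognize, which is the concern this issue raises.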

@bblum
Contributor

bblum commented Jul 3, 2013

Related to #2253, I think?

@huonw
Member Author

huonw commented Aug 25, 2013

Tools like GDB and c++filt should be considered, but a cursory search doesn't turn up anything about how they demangle or otherwise understand Unicode idents.

cc @michaelwoerister for the gdb aspect.

@emberian
Member

Triage: still an issue, but unicode identifiers are behind a feature gate. I looked into what clang does, which seems to be nothing:

int pörk() {
    return 1;
}

$ clang -c foo.cpp
$ objdump -t foo.o

foo.o:     file format elf64-x86-64

SYMBOL TABLE:
0000000000000000 l    df *ABS*  0000000000000000 foo.cpp
0000000000000000 l    d  .text  0000000000000000 .text
0000000000000000 l    d  .data  0000000000000000 .data
0000000000000000 l    d  .bss   0000000000000000 .bss
0000000000000000 l    d  .note.GNU-stack    0000000000000000 .note.GNU-stack
0000000000000000 l    d  .eh_frame  0000000000000000 .eh_frame
0000000000000000 g     F .text  000000000000000b _Z5pörkv

I'll look into what it does on ARM and Android once the toolchains are installed....
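
The `5` in `_Z5pörkv` is worth a closer look. The Itanium scheme prefixes each source name with its length, and the length emitted here is the UTF-8 byte count of `pörk` (5), not its character count (4). That is consistent with clang passing the identifier's bytes through unchanged. A small sketch, assuming the simplest case of a global function taking no parameters (`itanium_nullary_fn` is a hypothetical helper, not part of any compiler):

```rust
/// Hypothetical helper: build the Itanium-style symbol for a global
/// function with no parameters: `_Z`, the name's length, the name, `v`.
fn itanium_nullary_fn(name: &str) -> String {
    // `str::len` counts UTF-8 bytes, which matches the `5` that clang
    // emitted for the four-character identifier in the output above.
    format!("_Z{}{}v", name.len(), name)
}
```

With this sketch, `itanium_nullary_fn("pörk")` reproduces `_Z5pörkv`, while `"pörk".chars().count()` is 4.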

@emberian
Member

Yeah, on arm-linux-gnueabi, you get:

/tmp/foo-jkv3GM.s: Assembler messages:
/tmp/foo-jkv3GM.s:12: Error: Missing symbol name in directive
/tmp/foo-jkv3GM.s:12: Error: unrecognized symbol type "_Z5pörkv"
/tmp/foo-jkv3GM.s:12: Error: junk at end of line, first unrecognized character is `,'
/tmp/foo-jkv3GM.s:13: Error: junk at end of line, first unrecognized character is `"'
/tmp/foo-jkv3GM.s:17: Error: expected comma after name `' in .size directive
clang: error: assembler command failed with exit code 1 (use -v to see invocation)

Seems it's just plain unsupported on ARM. Perhaps we should mangle per-target?

@flaper87
Contributor

Triage bump:

I believe a single mangling scheme would be better, to keep symbols consistent across targets.

@steveklabnik steveklabnik added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Feb 16, 2015
@tromey
Contributor

tromey commented Mar 13, 2016

On Linux the C++ name-mangling scheme is defined by the so-called Itanium C++ ABI. It doesn't define the encoding for identifiers outside of ASCII. See http://mentorembedded.github.io/cxx-abi/abi.html#mangling-structure and search for "A-Z":

<unqualified source code identifier> is a pseudo-terminal representing the characters in the unqualified identifier for the entity in the source code. This ABI does not yet specify a mangling for identifiers containing characters outside of _A-Za-z0-9.

On most platforms, GCC sets the source charset to be UTF-8. I think the source charset is used when emitting symbols, and so this will be used. However, on a system where the host charset is EBCDIC (lol) the source charset will be UTF-EBCDIC. You probably don't need to worry about this.

GDB and BFD naively read the symbol names as bytes and hope the right thing happens. The demangler doesn't have any real support for any sort of Unicode mangling -- it does have a mode for Unicode support that was added for gcj, but this is only enabled if the Java demangling option is given, which isn't desirable for Rust. In this mode, a non-ASCII character is encoded as "__U[...hex digit...]+_" (e.g., "__U3FE_"). (Though, hilariously, the decoding also bails on any result > 256 ... not sure how that makes any sense at all.) IIRC we added this sort of encoding precisely to work around assembler issues.
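
To make the gcj-style encoding concrete, here is a rough sketch of an encoder and decoder for the `__U<hex digits>_` form described above. It is not the libiberty code: both function names are hypothetical, the choice of lowercase hex is an assumption, and the `limit` parameter exists only to imitate the reported behaviour of giving up on large code points.

```rust
/// Hypothetical encoder: characters outside `[_A-Za-z0-9]` become
/// `__U<hex>_`, where `<hex>` is the code point in lowercase hex.
fn encode_gcj_style(ident: &str) -> String {
    let mut out = String::new();
    for c in ident.chars() {
        if c.is_ascii_alphanumeric() || c == '_' {
            out.push(c);
        } else {
            out.push_str(&format!("__U{:x}_", c as u32));
        }
    }
    out
}

/// Hypothetical decoder: turns `__U<hex>_` back into a character.
/// Malformed sequences, and values at or above `limit`, are copied
/// through verbatim instead of being decoded.
fn decode_gcj_style(mangled: &str, limit: u32) -> String {
    let mut out = String::new();
    let mut rest = mangled;
    while !rest.is_empty() {
        if let Some(tail) = rest.strip_prefix("__U") {
            // Count the run of hex digits that follows the prefix.
            let digits = tail.bytes().take_while(|b| b.is_ascii_hexdigit()).count();
            if digits > 0 && tail[digits..].starts_with('_') {
                let decoded = u32::from_str_radix(&tail[..digits], 16)
                    .ok()
                    .filter(|&v| v < limit)
                    .and_then(char::from_u32);
                if let Some(c) = decoded {
                    out.push(c);
                    rest = &tail[digits + 1..]; // skip digits and closing `_`
                    continue;
                }
            }
        }
        // No decodable escape at this position: copy one character.
        let c = rest.chars().next().unwrap();
        out.push(c);
        rest = &rest[c.len_utf8()..];
    }
    out
}
```

With a limit of 256, `p__Uf6_rk` decodes back to `pörk`, but `__U3fe_` is left as-is, which mirrors the limitation noted above. An identifier that already contains a literal `__U` sequence would be ambiguous under this sketch; a real scheme would need to rule that out.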

I think the best thing to do, on Linux, would be not to share mangling with C++ at all; define a sensible Rust mangling (it could share concepts or whatever but ideally would create distinct names -- basically don't start with "_Z"); and then add a Rust mode to the libiberty demangler and basic Rust demangling support to gdb. The gdb parts of this are not a large project and were done in recent times for D, so there are even recent patches that can be cribbed from.

@steveklabnik
Member

Triage: not aware of any changes here.

@Centril
Contributor

Centril commented Oct 29, 2018

Triage: cc #55467.

@bstrie
Contributor

bstrie commented Nov 16, 2018

The recent pre-RFC for more principled symbol mangling would address this: https://internals.rust-lang.org/t/pre-rfc-a-new-symbol-mangling-scheme/8501

@michaelwoerister
Member

Yes, it would.

@Mark-Simulacrum
Member

I believe this can be closed now as we're eventually looking to switch to the new mangling scheme, and in particular this is already listed as an unresolved problem on the tracking issue (#60705); we can open a new bug if there's extensive discussion on the subject in the future.

flip1995 pushed a commit to flip1995/rust that referenced this issue Sep 3, 2021
Add new lint `negative_feature_names` and `redundant_feature_names`

Add new lint [`negative_feature_names`] to detect feature names with prefixes `no-` or `not-` and new lint [`redundant_feature_names`] to detect feature names with prefixes `use-`, `with-` or suffix `-support`
changelog: Add new lint [`negative_feature_names`] and [`redundant_feature_names`]