Ident mangling and unicode. #7539

huonw · 2013-07-01T23:57:10Z

#7488 added a fix to mangle unicode identifiers because the android assembler can't handle them. It is essentially the easiest mangling possible (using char::escape_unicode and replacing the leading \ with a $), with no reference to any other compilers that perform unicode mangling.
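For illustration, the scheme described above can be sketched roughly as follows. This is not the code from #7488: `mangle_ident` is a hypothetical name, the choice to pass ASCII alphanumerics and `_` through unchanged is an assumption, and the exact text produced by `char::escape_unicode` in today's standard library (`\u{f6}` for `ö`) may differ from what it produced in 2013.

```rust
/// Hypothetical helper (not the actual code from #7488): escape every
/// character that is not an ASCII letter, digit, or underscore.
fn mangle_ident(ident: &str) -> String {
    let mut out = String::with_capacity(ident.len());
    for c in ident.chars() {
        if c.is_ascii_alphanumeric() || c == '_' {
            // Assumed to be safe for every assembler; copied verbatim.
            out.push(c);
        } else {
            // `escape_unicode` yields e.g. `\u{f6}`; drop the leading
            // backslash and put a `$` in its place.
            out.push('$');
            out.extend(c.escape_unicode().skip(1));
        }
    }
    out
}
```

With the current standard library, `mangle_ident("pörk")` gives `p$u{f6}rk`.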

Presumably, matching any precedent (if there is one) would be best, in terms of tool support etc.

This bug represents the task of researching this and fixing it.

bblum · 2013-07-03T21:01:50Z

Related to #2253, I think?

huonw · 2013-08-25T11:54:27Z

Tools like GDB and c++filt should be considered, but a cursory search doesn't turn up anything about how they demangle or understand unicode idents.

cc @michaelwoerister for the gdb aspect.

emberian · 2013-12-30T21:46:33Z

Triage: still an issue, but unicode identifiers are behind a feature gate. I looked into what clang does, which seems to be nothing:

int pörk() {
    return 1;
}

$ clang -c foo.cpp
$ objdump -t foo.o

foo.o:     file format elf64-x86-64

SYMBOL TABLE:
0000000000000000 l    df *ABS*  0000000000000000 foo.cpp
0000000000000000 l    d  .text  0000000000000000 .text
0000000000000000 l    d  .data  0000000000000000 .data
0000000000000000 l    d  .bss   0000000000000000 .bss
0000000000000000 l    d  .note.GNU-stack    0000000000000000 .note.GNU-stack
0000000000000000 l    d  .eh_frame  0000000000000000 .eh_frame
0000000000000000 g     F .text  000000000000000b _Z5pörkv

I'll look into what it does on ARM and Android once the toolchains are installed.
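As a side note on the `_Z5pörkv` symbol above: the `5` is consistent with the length prefix counting UTF-8 bytes rather than characters, assuming foo.cpp was saved as UTF-8 (`ö` takes two bytes). A minimal sketch, where `itanium_mangle_nullary` is a hypothetical helper that only covers a free function with no parameters and no namespace:

```rust
/// Hypothetical helper: Itanium-style name for a free function `name()`
/// that takes no parameters. `_Z` is the mangling prefix, the number is
/// the identifier length, and the trailing `v` stands for "void"
/// (an empty parameter list).
fn itanium_mangle_nullary(name: &str) -> String {
    // `str::len` returns the length in bytes, not in characters.
    format!("_Z{}{}v", name.len(), name)
}
```

Here `itanium_mangle_nullary("pörk")` reproduces `_Z5pörkv`, even though the identifier has only four characters.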

emberian · 2013-12-30T21:49:32Z

Yeah, on arm-linux-gnueabi, you get:

/tmp/foo-jkv3GM.s: Assembler messages:
/tmp/foo-jkv3GM.s:12: Error: Missing symbol name in directive
/tmp/foo-jkv3GM.s:12: Error: unrecognized symbol type "_Z5pörkv"
/tmp/foo-jkv3GM.s:12: Error: junk at end of line, first unrecognized character is `,'
/tmp/foo-jkv3GM.s:13: Error: junk at end of line, first unrecognized character is `"'
/tmp/foo-jkv3GM.s:17: Error: expected comma after name `' in .size directive
clang: error: assembler command failed with exit code 1 (use -v to see invocation)

Seems it's just plain unsupported on ARM. Perhaps we should mangle per-target?

flaper87 · 2014-12-26T15:34:03Z

Triage bump:

I believe a single mangling scheme would be better, to keep things consistent across targets.

tromey · 2016-03-13T16:07:22Z

On Linux the C++ name-mangling scheme is defined by the so-called Itanium C++ ABI. It doesn't define the encoding for identifiers outside of ASCII. See http://mentorembedded.github.io/cxx-abi/abi.html#mangling-structure and search for "A-Z":

<unqualified source code identifier> is a pseudo-terminal representing the characters in the unqualified identifier for the entity in the source code. This ABI does not yet specify a mangling for identifiers containing characters outside of _A-Za-z0-9.

On most platforms, GCC sets the source charset to be UTF-8. I think the source charset is used when emitting symbols, and so this will be used. However, on a system where the host charset is EBCDIC (lol) the source charset will be UTF-EBCDIC. You probably don't need to worry about this.

GDB and BFD naively read the symbol names as bytes and hope the right thing happens. The demangler doesn't have any real support for any sort of Unicode mangling -- it does have a mode for Unicode support that was added for gcj, but this is only enabled if the Java demangling option is given, which isn't desirable for Rust. In this mode, a non-ASCII character is encoded as "__U[...hex digit...]+_" (e.g., "__U3FE_"). (Though, hilariously, the decoding also bails on any result > 256 ... not sure how that makes any sense at all.) IIRC we added this sort of encoding precisely to work around assembler issues.
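To make the `__U<hex digits>_` pattern concrete, here is a rough encoder sketch. It is not the gcj or libiberty implementation: `encode_java_style` is a hypothetical name, and the use of uppercase hex and the decision to pass all ASCII through unchanged are assumptions.

```rust
/// Hypothetical encoder for the `__U<hex digits>_` pattern: ASCII is
/// copied as-is, and every other character becomes its code point in hex.
fn encode_java_style(ident: &str) -> String {
    let mut out = String::with_capacity(ident.len());
    for c in ident.chars() {
        if c.is_ascii() {
            out.push(c);
        } else {
            // `c as u32` is the Unicode scalar value, e.g. 0xF6 for `ö`.
            out.push_str(&format!("__U{:X}_", c as u32));
        }
    }
    out
}
```

For example, `encode_java_style("pörk")` gives `p__UF6_rk`. One caveat of a scheme like this is that an identifier which already contains a literal `__U` sequence cannot be told apart from an escape when decoding.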

I think the best thing to do, on Linux, would be not to share mangling with C++ at all; define a sensible Rust mangling (it could share concepts or whatever but ideally would create distinct names -- basically don't start with "_Z"); and then add a Rust mode to the libiberty demangler and basic Rust demangling support to gdb. The gdb parts of this are not a large project and were done in recent times for D, so there are even recent patches that can be cribbed from.

steveklabnik · 2017-09-30T16:03:08Z

Triage: not aware of any changes here.

Centril · 2018-10-29T10:41:35Z

Triage: cc #55467.

bstrie · 2018-11-16T01:28:32Z

The recent pre-RFC for more principled symbol mangling would address this: https://internals.rust-lang.org/t/pre-rfc-a-new-symbol-mangling-scheme/8501

michaelwoerister · 2018-11-16T16:33:03Z

Yes, it would.

Mark-Simulacrum · 2019-09-18T00:34:31Z

I believe this can be closed now as we're eventually looking to switch to the new mangling scheme, and in particular this is already listed as an unresolved problem on the tracking issue (#60705); we can open a new bug if there's extensive discussion on the subject in the future.

steveklabnik added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Feb 16, 2015

huonw mentioned this issue Nov 5, 2015

Tracking issue for non-ASCII identifiers (feature "non_ascii_idents") #28979

Closed

estebank mentioned this issue May 8, 2019

Introduce Rust symbol mangling scheme. #57967

Merged

Mark-Simulacrum closed this as completed Sep 18, 2019

Centril mentioned this issue Sep 18, 2019

Tracking issue for RFC 2603, "Rust Symbol Mangling (v0)" #60705

Open

16 tasks

rikkimax mentioned this issue Jun 13, 2022

[DO NOT MERGE] Fix Issue 23179 - Unicode in symbol names in DLLs breaks MSVC linker dlang/dmd#14207

Closed
