symbol lookup error #7177

Closed
JacobUb opened this issue Dec 11, 2018 · 48 comments · Fixed by #8641

Comments

@JacobUb
Contributor

JacobUb commented Dec 11, 2018

After compiling successfully with 0.27 and LLVM 4.0.0 (Ubuntu on the WSL) running the executable gives this error:
./game_server: symbol lookup error: ./game_server: undefined symbol: errno, version Object::to_s<String::Builder>:String::Builder.
The specific method referenced can change between computers (computers that were able to run the executables before this error appeared).

Unfortunately I can't provide code to reproduce because the program has over 150k lines. I'm trying to comment out code in the hopes of narrowing down where this comes from.

The problem doesn't happen if I compile with --single-module or on macOS.

It's not much to go on but does anyone have an inkling of where this could be coming from?

@asterite
Member

Ubuntu on the WSL

Do you know if it works well in a regular Ubuntu?

@JacobUb
Contributor Author

JacobUb commented Dec 11, 2018

Ubuntu on the WSL

Do you know if it works well in a regular Ubuntu?

Good question. As soon as I get home I'll check.

@JacobUb
Contributor Author

JacobUb commented Dec 11, 2018

I just compiled it in an actual Ubuntu system and the same thing happens.
./game_server: symbol lookup error: ./game_server: undefined symbol: __libc_start_main, version L2Character#do_cast<Skill>:(Runnable::DelayedTask | Nil)
On WSL the undefined symbol is always errno even if the method mentioned changes every time.

@asterite asterite reopened this Dec 11, 2018
@asterite
Member

Sorry, closed accidentally.

I don't know what could be going on. You could check whether the missing symbol is indeed missing from the generated .o files (inside the cache). Unfortunately I won't have time to debug this, even less so without some code to reproduce it.

@JacobUb
Contributor Author

JacobUb commented Dec 11, 2018

The symbol errno appears in a file called "E-rrno.o" and nowhere else.
Edit: That's in the WSL, where the missing symbol is errno. In Ubuntu where it's __libc_start_main, the symbol doesn't appear in any file.
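
For anyone repeating this check, here is a minimal sketch of searching the cached object files for a symbol. It assumes binutils `nm` is installed and that the cache lives under ~/.cache/crystal (the path mentioned in this thread); `find_symbol` is a hypothetical helper name, not part of Crystal.

```shell
# find_symbol SYMBOL DIR: print every .o file under DIR whose symbol table
# mentions SYMBOL. Hypothetical helper for illustration only.
find_symbol() {
  symbol=$1
  dir=$2
  find "$dir" -name '*.o' -print | while IFS= read -r obj; do
    # nm prints one symbol per line; the symbol name is the last field.
    if nm "$obj" 2>/dev/null |
        awk -v s="$symbol" '$NF == s { found = 1 } END { exit !found }'; then
      printf '%s\n' "$obj"
    fi
  done
}

# Example (assumed cache location):
#   find_symbol errno ~/.cache/crystal
```

An empty result means no object file defines or references the symbol under that exact name.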

@RX14
Contributor

RX14 commented Dec 15, 2018

This only reproduces on one program, right? Compiling other programs works fine?

@kostya
Contributor

kostya commented Dec 15, 2018

It seems I had a similar problem once, fixed by removing ~/.cache/crystal/

@JacobUb
Contributor Author

JacobUb commented Dec 15, 2018

This only reproduces on one program, right? Compiling other programs works fine?

Yes, only this one. But it happens in 3 different Linux installations. And not in MacOS.

It seems I had a similar problem once, fixed by removing ~/.cache/crystal/

That was among the first things I tried but nothing changed.

--

I've tried commenting out different code paths to pinpoint the source of the problem but I didn't get anywhere. Things that are unrelated to each other, like logging or using a class that wraps a Hash and a Mutex instead of a normal Hash, look like triggers (the program runs if I remove either), but I have been using both extensively for a while without this happening. At one point even removing a single call to super (which other subclasses also call, for the same method, with no problems) made it work.
Anyways this appears to be so random that if I don't figure it out in the next few days I'll close this and then open a proper issue if I can ever find out the cause and some code to reproduce.

@JacobUb JacobUb closed this as completed Dec 25, 2018
@asterite
Member

@exilor Did you find the solution, or did you give up?

@JacobUb
Contributor Author

JacobUb commented Dec 25, 2018

@exilor Did you find the solution, or did you give up?

As I said in my last comment:

Anyways this appears to be so random that if I don't figure it out in the next few days I'll close this and then open a proper issue if I can ever find out the cause and some code to reproduce.

I didn't find a singular cause. I'll reopen if I have something tangible to report.

@JacobUb
Contributor Author

JacobUb commented Jan 18, 2019

The only way I've been able to reproduce the error is with this code:

{% for i in 0..35000 %}
  class Foo{{i}}
  end

  Foo{{i}}.new
{% end %}

Which gives the same kind of errors unless compiled with --single-module:
/home/jacob/.cache/crystal/crystal-run-test.tmp: symbol lookup error: /home/jacob/.cache/crystal/crystal-run-test.tmp: undefined symbol: errno, version Pointer(T)#clear:Nil
And again the method referenced at the end varies.
It looks like it has to do with the amount of code being compiled unless it's a separate issue with the same symptom.

@asterite
Member

Each class is compiled into a separate .o file, so it's possible we are hitting some limit.

@JacobUb
Contributor Author

JacobUb commented Jan 19, 2019

After deleting the contents of the crystal cache I've compiled the application and there were 8193 files in it afterwards. It's a suspicious number ((1024 * 8) + 1) but it could be a coincidence.
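
A small sketch of that arithmetic, in case it helps when comparing other counts: it prints the power of two nearest to a given file count and the distance to it. `nearest_power_of_two` is a hypothetical helper, and it assumes a positive integer argument.

```shell
# nearest_power_of_two N: print the power of two closest to N, followed by
# the distance between them. Hypothetical helper; N must be a positive integer.
nearest_power_of_two() {
  n=$1
  p=1
  # Find the smallest power of two that is >= n.
  while [ "$p" -lt "$n" ]; do
    p=$((p * 2))
  done
  lower=$((p / 2))
  if [ $((p - n)) -le $((n - lower)) ]; then
    echo "$p $((p - n))"
  else
    echo "$lower $((n - lower))"
  fi
}
```

For the count above, `nearest_power_of_two 8193` prints `8192 1`.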

@asterite
Member

My experience with the compiler tells me that such numbers are never a coincidence :-)

@asterite asterite reopened this Jan 24, 2019
@JacobUb
Contributor Author

JacobUb commented Jan 24, 2019

I've been commenting out classes to get the code size to decrease and eventually the executable runs. And then uncommenting that code and commenting out some other code just in case a few times. I'm pretty certain that those classes don't call any code that isn't also called elsewhere.
For every class that I comment out, the method referenced in the error changes.
In doing that, I've come across another suspicious number when compiling with -p:

[12/13] [817/4096] Codegen (bc+obj)

Apparently if it goes above 4096 the executable throws the error.
However in the crazy macro generated code I posted before this didn't happen (I had to get that /4096 to be much higher before it gave the same error).

Edit: It appears that as I get closer to 4096 (like 4098) the error changes from a symbol lookup error to any of the following:

__gmon_start__ version GLIBC_PRIVATE not defined in file libc.so.6
__gmon_start__ version GLIBC_2.2.5 not defined in file libm.so.6
__gmon_start__ version LIBXML2_2.4.30 not defined in file libxml2.so.2
__gmon_start__ version GCC_3.0 not defined in file libgcc_s.so.1

And these are "relocation errors" instead.
Is it possible that these... things are put together before or after all others and they sort of spill out because there are just too many?

@asterite
Member

@exilor Could you try to compile a compiler using this branch and then compile the program that fails (not the one using a macro to generate many classes, but the real one that was failing for you) and tell me whether it works now? The branch tries to reduce the number of generated object files. It's not perfect, meaning that it's still possible to reach that limit, but the compiler's source code reaches about 1200 object files, and it's a pretty big project.

@JacobUb
Contributor Author

JacobUb commented Jan 24, 2019

@asterite I've never compiled the compiler before but I'll try tomorrow (it's bedtime here). I'll keep you informed.

By the way, to be clear, this:

After deleting the contents of the crystal cache I've compiled the application and there were 8193 files in it afterwards. It's a suspicious number ((1024 * 8) + 1) but it could be a coincidence.

Was from compiling the actual program not the macro one.

@JacobUb
Contributor Author

JacobUb commented Jan 25, 2019

@asterite It didn't work and it took more classes commented out to get it to work compared to Crystal 0.27.
These are the errors it gave, commenting out one class at a time:

jacob@jacob:~/Dropbox/Programming/Crystal/l2_cr/game_server$ /home/jacob/Downloads/crystal-1-refactor-obj-name/bin/crystal src/game_server.cr -p
Using compiled compiler at `.build/crystal'
/home/jacob/.cache/crystal/crystal-run-game_server.tmp: symbol lookup error: /home/jacob/.cache/crystal/crystal-run-game_server.tmp: undefined symbol: __libc_start_main, version L2CharacterAI#on_intention_interact<L2EffectPointInstance>:(Int32 | Nil)
jacob@jacob:~/Dropbox/Programming/Crystal/l2_cr/game_server$ /home/jacob/Downloads/crystal-1-refactor-obj-name/bin/crystal src/game_server.cr -p
Using compiled compiler at `.build/crystal'
/home/jacob/.cache/crystal/crystal-run-game_server.tmp: symbol lookup error: /home/jacob/.cache/crystal/crystal-run-game_server.tmp: undefined symbol: __libc_start_main, version Array(T)#concat<Tuple(Int32, Int32, Int32, Int32, Int32, Int32, Int32, Int32, Int32, Int32, Int32, Int32, Int32, Int32, Int32, Int32, Int32, Int32, Int32, Int32, Int32, Int32, Int32, Int32, Int32, Int32, Int32, Int32, Int32)>:Array(Int32)
jacob@jacob:~/Dropbox/Programming/Crystal/l2_cr/game_server$ /home/jacob/Downloads/crystal-1-refactor-obj-name/bin/crystal src/game_server.cr -p
Using compiled compiler at `.build/crystal'
/home/jacob/.cache/crystal/crystal-run-game_server.tmp: symbol lookup error: /home/jacob/.cache/crystal/crystal-run-game_server.tmp: undefined symbol: __libc_start_main, version Quest#add_talk_id<Int32, Int32, Int32, Int32, Int32, Int32>:Array(AbstractEventListener+)
jacob@jacob:~/Dropbox/Programming/Crystal/l2_cr/game_server$ /home/jacob/Downloads/crystal-1-refactor-obj-name/bin/crystal src/game_server.cr -p
Using compiled compiler at `.build/crystal'
/home/jacob/.cache/crystal/crystal-run-game_server.tmp: symbol lookup error: /home/jacob/.cache/crystal/crystal-run-game_server.tmp: undefined symbol: __libc_start_main, version Loggable#to_log<String::Builder>:(String::Builder | Nil)
jacob@jacob:~/Dropbox/Programming/Crystal/l2_cr/game_server$ /home/jacob/Downloads/crystal-1-refactor-obj-name/bin/crystal src/game_server.cr -p
Using compiled compiler at `.build/crystal'
/home/jacob/.cache/crystal/crystal-run-game_server.tmp: relocation error: /home/jacob/.cache/crystal/crystal-run-game_server.tmp: symbol __gmon_start__ version GLIBC_2.3.2 not defined in file libpthread.so.0 with link time reference

After taking out one more class it worked. As before, when approaching the point of working the error changed from symbol lookup error to relocation error.

@RX14
Contributor

RX14 commented Sep 22, 2019

I'm hitting this with the crystal compiler: make spec.

make std_spec works, so I suggest too that this is probably a number of object files (classes) bug.

Could it be we're exceeding the maximum command-line size for the linker?

@asterite
Member

asterite commented Sep 22, 2019

We should probably generate one obj file for all generic instantiations of the same generic type. That means less reuse but also less explosion of obj files.

@RX14
Contributor

RX14 commented Sep 22, 2019

@asterite would be nice to be able to confirm that's the problem somehow...

@asterite
Member

You can compile with -v and see the list of linked objects. If it's something close to a power of 2 that might be it.
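
A rough way to do that count, assuming the verbose output contains the linker command as whitespace-separated arguments: pipe it through a filter that counts the arguments ending in ".o". `count_objects` is a hypothetical helper; which stream the verbose output goes to is also an assumption, hence the `2>&1` in the usage comment.

```shell
# count_objects: read linker command line(s) on stdin and print how many
# arguments end in ".o". Hypothetical helper for illustration.
count_objects() {
  # Strip quotes, put each argument on its own line, count the .o ones.
  tr -d "\"'" | tr -s '[:space:]' '\n' | grep -c '\.o$'
}

# Example (assumed invocation; the count covers every command printed):
#   crystal build --verbose src/game_server.cr 2>&1 | count_objects
```

Shared libraries such as libfoo.so are not counted, since the pattern only matches names that end in ".o".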

@JacobUb
Contributor Author

JacobUb commented Sep 22, 2019

I counted 5077 files with --verbose.

@RX14
Contributor

RX14 commented Dec 19, 2019

$ ls ~/.cache/crystal/data-programming-crystal-lang-crystal-master-spec-all_spec.cr/*.o | wc -l
4088

looks close enough to 4096 to me, especially taking into account the .so files from C libraries.

@RX14
Contributor

RX14 commented Dec 19, 2019

A workaround: install lld and use CC="cc -fuse-ld=lld".

This is a binutils bug (for fucks sake)

@RX14
Contributor

RX14 commented Dec 19, 2019

binutils LD is producing an invalid executable, nm -D on the output executable

nm: all_spec.ld: .gnu.version_d invalid entry
nm: all_spec.ld: no symbols
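
A sketch of turning that check into something scriptable, assuming binutils `nm` is available. `check_dynsyms` is a hypothetical helper that reports "broken" when `nm -D` fails or prints either of the two messages shown above.

```shell
# check_dynsyms FILE: print "ok" if nm -D can read the dynamic symbol table
# of FILE, or "broken" if nm fails or reports the messages seen in this
# thread. Hypothetical helper for illustration.
check_dynsyms() {
  if out=$(nm -D "$1" 2>&1) &&
      ! printf '%s\n' "$out" | grep -Eq 'invalid entry|no symbols'; then
    echo ok
  else
    echo broken
  fi
}

# Example: check_dynsyms .build/all_spec
```

Note that a statically linked or fully stripped binary also reports "no symbols", so "broken" here only means the dynamic symbol table could not be read.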

@JacobUb
Contributor Author

JacobUb commented Dec 19, 2019

CC="cc -fuse-ld=lld"

Where should that go?

@RX14
Contributor

RX14 commented Dec 19, 2019

it's an environment variable, you can prefix the crystal command with it in a shell
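
A minimal sketch of what that prefix looks like, wrapped in a hypothetical `with_lld` function so the variable is set only for the one command being run. The source path in the usage comment is the one from this thread.

```shell
# with_lld COMMAND [ARGS...]: run COMMAND with CC set so that the C compiler
# driver is asked to link with lld. Hypothetical wrapper for illustration.
with_lld() {
  CC="cc -fuse-ld=lld" "$@"
}

# Example:
#   with_lld crystal build src/game_server.cr
# which is the same as typing:
#   CC="cc -fuse-ld=lld" crystal build src/game_server.cr
```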

@JacobUb
Contributor Author

JacobUb commented Dec 19, 2019

It fails:

/usr/bin/ld.lld: error: unknown argument: --push-state
/usr/bin/ld.lld: error: unknown argument: --pop-state
/usr/bin/ld.lld: error: unknown argument: --push-state
/usr/bin/ld.lld: error: unknown argument: --pop-state
collect2: error: ld returned 1 exit status

@RX14
Contributor

RX14 commented Dec 19, 2019

@exilor you need at least LLVM 7, or use CC="clang -fuse-ld=lld"

@bcardiff
Member

Using lld-8 seems to work locally. Using crystallang/crystal:0.32.1-build

$ apt-get install lld-8
$ ln -s /usr/bin/ld.lld-8 /usr/bin/ld.lld
$ CC="cc -fuse-ld=lld" make std_spec

Should we update the -build images to include those tweaks?

@JacobUb
Contributor Author

JacobUb commented Dec 19, 2019

@RX14 I already had LLVM 7 installed and using that command outputs a warning from clang that -fuse-ld=lld will be unused during compilation.

@RX14
Contributor

RX14 commented Dec 19, 2019

@bcardiff this is a serious bug in binutils, and has to be reported upstream. These workarounds should be temporary, not attempted to be applied automatically.

--single-module is another workaround

I'd prefer either workaround be applied in crystal's CI setup, instead of at the docker image level though.

@bcardiff
Member

I was just checking with --single-module right now. I think we might need to add that to the current CI scripts.

The current infrastructure uses -build images directly. Installing new things and tweaking the environment is not smooth.

@bcardiff
Member

  • --single-module requires more memory and failed in the CI.
  • Changing the compiler as suggested in symbol lookup error #7177 (comment)
    • requires one more release cycle so it can't be used as of today to build 0.33
    • is pushing the limit but the problem is still there
      • (current std_spec uses 3826 obj files, 198 are tuples/named_tuples, ~1400 seems to be generic instances, the whole path names are ~150KB but they are passed in an array args using execvp)
  • Splitting the std-lib and build it in phases will not be comfortable as a development cycle
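
To put the ~150KB figure above in context, here is a sketch that compares a byte count against the system limit on the combined size of arguments and environment passed to exec. `getconf ARG_MAX` is standard POSIX; `fits_arg_max` is a hypothetical helper, and the result depends on the machine it runs on.

```shell
# fits_arg_max BYTES: succeed if BYTES is below the limit reported by
# getconf ARG_MAX. Hypothetical helper for illustration.
fits_arg_max() {
  limit=$(getconf ARG_MAX)
  [ "$1" -lt "$limit" ]
}

# Roughly 150KB of path names, the figure quoted above.
if fits_arg_max $((150 * 1024)); then
  echo "within ARG_MAX"
else
  echo "exceeds ARG_MAX"
fi
```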

I feel comfortable saying that, for developing crystal and contributing to the std-lib, lld is suggested/required to overcome this issue. Finding the reason in ld will not be helpful in time for us.

I vouch for:

  • shipping lld as default linker in docker -build images and
  • update docs to suggest installing/configuring lld
  • (optionally) detecting lld and use it if available

@RX14
Contributor

RX14 commented Dec 20, 2019

How much does this reduce the number of object files in practice? I feel any granularity change is going to hurt incremental build times.

Finding the reason in ld will not be helpful in time for us.

It won't be helpful in the short-term, but it is necessary in the long-term. Supporting lld only is just a workaround.

  • shipping lld as default linker in docker -build images

Are the -build images only used for crystal's CI, or are they more widely used? I forget. I'm fine with installing it there either way.

  • update docs to suggest installing/configuring lld

I think this should remain a workaround clearly stated/documented in this github issue, I'm not sure it belongs in the official docs (though I don't feel strongly about this). Especially since this bug only shows up on really big projects. i.e. the crystal compiler test suite itself.

  • (optionally) detecting lld and use it if available

I think that the crystal Makefile should add -fuse-ld=lld if lld is found. Doing it in the compiler would be neat, but i really am wary of "hiding" problems with workaround and magic. In my experience, workarounds must remain minimal-scope and minimal-invasiveness, or you end up with a mess. Since in practice this bug is only affecting this repo, we should limit the scope to here.

@bcardiff
Member

How much does this reduce the number of object files in practice? I feel any granularity change is going to hurt incremental build times.

current std_spec uses 3826 obj files, 198 are tuples/named_tuples, ~1400 seems to be generic instances

It won't be helpful in the short-term, but it is necessary in the long-term. Supporting lld only is just a workaround.

I have no idea where to diagnose what is happening with ld. For me it comes down to using X vs Y based on their limitations. I failed to find any useful information during the whole day.

Are the -build images only used for crystal's CI, or are they more widely used?

Only in the CI. They are built at https://github.com/crystal-lang/crystal-dist/blob/master/docker/crystal/Dockerfile#L16-L25 .

I was planning on adding apt-get install lld-8 and ln -sf /usr/bin/ld.lld-8 /usr/bin/ld and there will be no need to change anything else.

This is equivalent to letting contributors set up lld as ld somehow for building the spec/compiler.

I think this should remain a workaround clearly stated/documented in this github issue, I'm not sure it belongs in the official docs

The thing is that contributors might come to this issue while running make std_spec. Even if there is a -fuse-ld=lld you would still need to warn the user about installing lld, and lld is not a hard dependency right now. In osx ld seems to have no issues.

I think that the crystal Makefile should add -fuse-ld=lld if lld is found.

lld in ubuntu is 6.0, I guess it should work. lld-8 does not symlink to ld nor lld, there is no option to pass to -fuse-ld that will pick lld-8 directly.

That is why I would add the symlink ln -sf /usr/bin/ld.lld-8 /usr/bin/ld. Since -build images are only expected to be used here, then we are scoping to this repo somehow.

@RX14
Contributor

RX14 commented Dec 21, 2019

I have no idea where to diagnose what is happening with ld.

Me neither, but it's not practical to simply not support the linker with 90% market share. Not supporting ld is worse than not supporting Windows.

In practical terms, this means at the least submitting a bug report to them. We can leave this till later but I really don't want to forget.

ln -sf /usr/bin/ld.lld-8 /usr/bin/ld

This is the wrong way to do it, export ENV CC cc -fuse-ld=lld is the right way to do it if you want to do it container-wide. I still prefer the makefile method.

The thing is that contributors might come to this issue while running make std_spec

We definitely need to document it somewhere in the contributing docs.

In osx ld seems to have no issues.

I think ld is probably lld on osx.

there is no option to pass to -fuse-ld that will pick lld-8 directly.

Well, shit. I just played around with this, and clang supports -fuse-ld=/usr/bin/ld.lld-8, but gcc needs ln -sd /usr/bin/ld.lld-8 /usr/bin/ld.lld. Fixing debian's lld install and using -fuse-ld in the makefile would be very preferable to me over making lld the ld system-wide.

@bcardiff
Member

So I will rebuild 0.32.1-build images with https://github.com/crystal-lang/crystal-dist/pull/8/files

After that, I'll apply the following changes to the Makefile so the specs are built using lld: https://github.com/crystal-lang/crystal/compare/ci/use-lld

If lld is not installed the following error will be shown

CC="cc -fuse-ld=lld"  ./bin/crystal build  --exclude-warnings spec/std --exclude-warnings spec/compiler -o .build/std_spec spec/std_spec.cr
collect2: fatal error: cannot find 'ld'
compilation terminated.
Error: execution of command failed with code: 1: `cc -fuse-ld=lld "${@}" -o '/root/.cache/crystal/src-src-ecr-process.cr/macro_run'  -rdynamic  -lpcre -lm /usr/bin/../lib/crystal/lib/libgc.a -lpthread /src/src/ext/libcrystal.a -levent -lrt -ldl -L/usr/bin/../lib/crystal/lib -L/usr/lib -L/usr/local/lib`
Makefile:125: recipe for target '.build/std_spec' failed

In osx if -fuse-ld=lld is used the following error is shown, so the changes in the Makefile are applied on Linux only.

clang: warning: argument unused during compilation: '-fuse-ld=lld' [-Wunused-command-line-argument]
ld64.lld: warning: -sdk_version is required when emitting min version load command.  Setting sdk version to match provided min version
ld64.lld: error: Unable to find library for -lpthread
clang: error: linker command failed with exit code 1 (use -v to see invocation)

Sound good?

I've also updated the https://github.com/crystal-lang/crystal/wiki/All-required-libraries#ubuntu

@RX14
Contributor

RX14 commented Dec 24, 2019

@bcardiff I was thinking of only adding -fuse-ld=lld if ld.lld was found on the path... (and it's linux, I guess)
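
The condition described here can be sketched in shell as follows. This is only an illustration of the detection logic, not the actual Makefile change; `pick_cc` is a hypothetical helper that prints the CC value to use.

```shell
# pick_cc: print the value to use for CC. Adds -fuse-ld=lld only on Linux
# and only when ld.lld is found on the PATH. Hypothetical helper.
pick_cc() {
  if [ "$(uname -s)" = Linux ] && command -v ld.lld >/dev/null 2>&1; then
    echo "cc -fuse-ld=lld"
  else
    echo "cc"
  fi
}

# Example:
#   CC="$(pick_cc)" make std_spec
```

On systems without ld.lld this falls back to plain cc, so the build behaves as before instead of failing with "cannot find 'ld'".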

@JacobUb
Contributor Author

JacobUb commented Feb 14, 2020

After updating to 0.33.0 the error persists with the same message.

@bcardiff
Member

Even with lld?

@JacobUb
Contributor Author

JacobUb commented Feb 16, 2020

which lld says /usr/bin/lld so I think so.

@bcardiff
Member

After installing lld, you need to do one of the following:

  1. Since lld is a drop-in replacement, you can overwrite ld with lld.
  2. export CC="cc -fuse-ld=lld" and run things as usual
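
Having lld installed does not by itself mean it was used for the link. One way to check, sketched below, relies on lld recording a "Linker: LLD <version>" string in the .comment section of the binaries it produces, and on binutils `readelf` being installed. `linked_with_lld` is a hypothetical helper.

```shell
# linked_with_lld FILE: succeed if the .comment section of FILE contains
# the "Linker: LLD" marker. Hypothetical helper for illustration.
linked_with_lld() {
  readelf -p .comment "$1" 2>/dev/null | grep -q 'Linker: LLD'
}

# Example:
#   if linked_with_lld ./game_server; then echo "lld"; else echo "not lld"; fi
```

If the section has been stripped from the binary the check reports "not lld" even when lld was used, so treat a negative result as inconclusive.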

@JacobUb
Contributor Author

JacobUb commented Feb 16, 2020

Oh my. I can't believe it worked after so long! The executable went from 181 MB to 198 MB but the program seems to be working just fine.
Thanks @bcardiff and everyone else involved.
Edit: compilation time went down from almost 6 minutes with --single-module to 2 without it. Such a relief.

@oprypin
Member

oprypin commented Apr 5, 2020

On Arch Linux, for Crystal master, running make clean; make && make spec, I get the following:

.build/all_spec: symbol lookup error: .build/all_spec: undefined symbol: __libc_start_main, version spec/compiler/macro/macro_expander_spec.cr:68
make: *** [Makefile:81: spec] Error 127

@oprypin
Member

oprypin commented Apr 5, 2020

After installing lld, make spec works.

@funny-falcon
Contributor

Yep, installing lld helps.
