Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: disable CGO when building on Fedora to avoid linking issues on the Ubuntu-based image (#2140) #2141

Merged
merged 1 commit into from
Mar 18, 2021

Conversation

orpiske
Copy link
Contributor

@orpiske orpiske commented Mar 16, 2021

Disable CGO when building on Fedora to avoid linking issues on the Ubuntu based image (#2140)

Release Note

fix linking issues when building on Fedora

Copy link
Member

@nicolaferraro nicolaferraro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, do we need CGO in non-fedora?

Btw, the released kamel is usually built on my Fedora laptop (never at latest version)

@orpiske
Copy link
Contributor Author

orpiske commented Mar 17, 2021

Cool, do we need CGO in non-fedora?

Btw, the released kamel is usually built on my Fedora laptop (never at latest version)

I believe this might be a problem when/if we start building Camel K on any system that uses glibc 2.32 or newer. The reason is that on 2.32 they moved some of the pthread_*sigmask symbols from libpthread to libc. Therefore any binary built against 2.32, that uses pthread_* symbols won't work on older versions. The opposite is not true, though.

This is the kamel released:

objdump -T kamel | grep pthread                                                                                                                                                                                                     
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_mutex_lock
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.3.2 pthread_cond_wait
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_mutex_unlock
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.3.2 pthread_cond_broadcast
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_create
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_detach
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_attr_init
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_attr_getstacksize
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_attr_destroy
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_sigmask

This is the one built on Fedora 33:

objdump -T kamel| grep pthread                                                                                                                                                                       
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_mutex_lock
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.3.2 pthread_cond_wait
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_mutex_unlock
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.3.2 pthread_cond_broadcast
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_create
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_detach
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_attr_init
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_attr_getstacksize
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_attr_destroy
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.32  pthread_sigmask

So, I am guessing it may be a problem on any/most of the distros where the ABI is incompatible so crossing these version boundaries may be problematic.

So, yes, I think we might need to rely on CGO to ensure we have statically-linked binaries for the time being (specially since it looks like more changes like this will happen in the future).

@astefanutti
Copy link
Member

My understanding is that the Kamel binary added to the operator image should be built on the image OS. I think that the current approach, that is to build the kamel binary locally, and add it to the operator image, is either an heritage of how the operator SDK scaffolded the project at its creation, or a convenient way to re-use the binary that is also used for the client CLI. We already cross-compiled the binary, which deactivates CGO, when it's built on non Linux OS.

It seems a possible solution would be to refactor the Dockerfile, and use a build stage to compile the binary on the same OS as the final image.

@orpiske
Copy link
Contributor Author

orpiske commented Mar 17, 2021

My understanding is that the Kamel binary added to the operator image should be built on the image OS. I think that the current approach, that is to build the kamel binary locally, and add it to the operator image, is either an heritage of how the operator SDK scaffolded the project at its creation, or a convenient way to re-use the binary that is also used for the client CLI. We already cross-compiled the binary, which deactivates CGO, when it's built on non Linux OS.

It seems a possible solution would be to refactor the Dockerfile, and use a build stage to compile the binary on the same OS as the final image.

Compiling it on the same image could solve the problem, indeed. However there's one thing that concerns me with this approach: it makes it harder for distribution packagers and downstream users building and productizing this. This is specially true on environments with secluded internet access, where the dependencies may not be available at build time. Thus forcing the users/packagers to go through the work of copying previously cached dependencies into the build stage and dealing with all the downsides of doing that.

@astefanutti
Copy link
Member

However it's done (multi-stage build, vendoring, Go proxy), I think packaging a binary built against the runtime image it will be the executed on, is more robust, and worth the possible extra efforts.

@orpiske
Copy link
Contributor Author

orpiske commented Mar 17, 2021

However it's done (multi-stage build, vendoring, Go proxy), I think packaging a binary built against the runtime image it will be the executed on, is more robust, and worth the possible extra efforts.

I think it's worth mentioning that this affects not just the operator, but the CLI client as well. As I pointed with the CLI client that is available on the release page: if you are building the kamel CLI on any system that is based on glibc 2.32 (or newer), you are eventually going to have this problem. It is not the case yet because @nicolaferraro is using an older Fedora version, but eventually it might be.

@astefanutti
Copy link
Member

Right. I think that's a bit different for client binaries, as probably Mac OS and Windows ones are cross-compiled, which disable CGO as far as a know. So it's probably fine disabling it for Linux client binary. That leads to the question, why disabling it only for Fedora?

And does it necessarily imply it should be done for the operator binary, for which it's possible to guarantee compatibility with its container image?

@orpiske
Copy link
Contributor Author

orpiske commented Mar 17, 2021

Right. I think that's a bit different for client binaries, as probably Mac OS and Windows ones are cross-compiled, which disable CGO as far as a know. So it's probably fine disabling it for Linux client binary. That leads to the question, why disabling it only for Fedora?

I think it might be safe to disable it for any Linux system regardless of OS and I can adjust the PR accordingly. Fedora was just the one that I could reproduce the issue. I am not sure how many others distros have upgraded to 2.32 already, but I guess this is likely to become more common with newer distributions.

And does it necessarily imply it should be done for the operator binary, for which it's possible to guarantee compatibility with its container image?

Yes if you are building on Linux:

camel-k/script/Makefile

Lines 228 to 236 in 2ac21c4

images: bundle-kamelets test
mkdir -p build/_maven_output
mkdir -p build/_output/bin
ifneq ($(shell uname -s 2>/dev/null || echo Unknown),Linux)
GOOS=linux go build $(GOFLAGS) -o build/_output/bin/kamel ./cmd/kamel/*.go
else
cp kamel build/_output/bin
endif
docker build -t $(IMAGE_NAME):$(VERSION) -f build/Dockerfile .

@orpiske
Copy link
Contributor Author

orpiske commented Mar 17, 2021

FYI, I modified it to disable CGO on any Linux system.

@astefanutti
Copy link
Member

OK, hopefully the operator won't be affected by not using any of the // +build cgo packages from the Go stdlib, which is probably the reason the binary was dynamically linked.

@orpiske
Copy link
Contributor Author

orpiske commented Mar 17, 2021

OK, hopefully the operator won't be affected by not using any of the // +build cgo packages from the Go stdlib, which is probably the reason the binary was dynamically linked.

Hm, maybe let's wait, then? Although there is a problem that affects both the CLI and operator, if the solution may cause unforeseen problems ... then that's not ideal. I think the CGO think would work OK for the CLI, but the container may be a different problem indeed. I'll ping you to brainstorm this tomorrow.

@astefanutti
Copy link
Member

Hm, maybe let's wait, then? Although there is a problem that affects both the CLI and operator, if the solution may cause unforeseen problems ... then that's not ideal. I think the CGO think would work OK for the CLI, but the container may be a different problem indeed. I'll ping you to brainstorm this tomorrow.

Totally agree. I'd like to make sure we understand the impacts for the operator. That's the reason I proposed to consider building it against its container image.

@orpiske
Copy link
Contributor Author

orpiske commented Mar 18, 2021

@astefanutti and I talked today about this PR and it seems we have sufficient information to believe the change is safe. I am writing a summary of these, so we can understand the reasoning if the PR turns out to be broken.

  1. It seems that the only dynamic symbols required by the binary are either glibc or golang-specific:
DYNAMIC SYMBOL TABLE:
000000000061fda0 g    DF .text	0000000000000050  Base        _cgo_panic
0000000000472280 g    DF .text	0000000000000019  Base        _cgo_topofstack
000000000061fe00 g    DF .text	000000000000005a  Base        crosscall2
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 __errno_location
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 getnameinfo
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 getaddrinfo
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 freeaddrinfo
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 gai_strerror
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 malloc
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 free
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 getgrgid_r
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 getgrnam_r
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 getpwnam_r
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 getpwuid_r
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 realloc
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 sysconf
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 getgrouplist
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 stderr
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 fwrite
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 vfprintf
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 fputc
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 abort
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_mutex_lock
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.3.2 pthread_cond_wait
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_mutex_unlock
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.3.2 pthread_cond_broadcast
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_create
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 nanosleep
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_detach
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 strerror
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 fprintf
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_attr_init
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_attr_getstacksize
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_attr_destroy
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 sigfillset
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.32  pthread_sigmask
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 mmap
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 munmap
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 setenv
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 unsetenv
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 sigemptyset
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 sigaddset
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 sigaction
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 sigismember
  1. Cross-compiling the Linux binary on a mac results on a statically linked executable similar to the one generated by this patch.
  2. There are two packages that rely on cgo and for both there are pure go implementations that are used as fallback when cgo is disabled or not available:

3.1 The first one is os/user for which the documentation says: " ... For most Unix systems, this package has two internal implementations of resolving user and group ids to names. One is written in pure Go and parses /etc/passwd and /etc/group. The other is cgo-based and relies on the standard C library (libc) routines such as getpwuid_r and getgrnam_r ... When cgo is available, cgo-based (libc-backed) code is used by default. This can be overridden by using osusergo build tag, which enforces the pure Go implementation ... "

3.2 The second is the resolver on the net package. In this package the documentation states: "... On Unix systems, the resolver has two options for resolving names. It can use a pure Go resolver that sends DNS requests directly to the servers listed in /etc/resolv.conf, or it can use a cgo-based resolver that calls C library routines such as getaddrinfo and getnameinfo ... By default the pure Go resolver is used, because a blocked DNS request consumes only a goroutine, while a blocked C call consumes an operating system thread. When cgo is available, the cgo-based resolver is used instead under a variety of conditions: on systems that do not let programs make direct DNS requests (OS X), when the LOCALDOMAIN environment variable is present (even if empty), when the RES_OPTIONS or HOSTALIASES environment variable is non-empty, when the ASR_CONFIG environment variable is non-empty (OpenBSD only), when /etc/resolv.conf or /etc/nsswitch.conf specify the use of features that the Go resolver does not implement, and when the name being looked up ends in .local or is an mDNS name ..."

The binary does seem to be using the cgo implementation:

0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 getnameinfo
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 getaddrinfo
  1. The patch is trivial do revert if there is any problem.

So, given these findings, we agreed it would be reasonably safe to do this change.

Copy link
Member

@astefanutti astefanutti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants