fix: disable CGO when building on Fedora to avoid linking issues on the Ubuntu-based image (#2140) #2141

orpiske · 2021-03-16T17:17:25Z

Disable CGO when building on Fedora to avoid linking issues on the Ubuntu based image (#2140)

Release Note

fix linking issues when building on Fedora

nicolaferraro

Cool, do we need CGO in non-fedora?

Btw, the released kamel is usually built on my Fedora laptop (never at latest version)

orpiske · 2021-03-17T10:59:19Z

Cool, do we need CGO in non-fedora?

Btw, the released kamel is usually built on my Fedora laptop (never at latest version)

I believe this might be a problem when/if we start building Camel K on any system that uses glibc 2.32 or newer. The reason is that on 2.32 they moved some of the pthread_*sigmask symbols from libpthread to libc. Therefore any binary built against 2.32, that uses pthread_* symbols won't work on older versions. The opposite is not true, though.

This is the kamel released:

objdump -T kamel | grep pthread                                                                                                                                                                                                     
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_mutex_lock
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.3.2 pthread_cond_wait
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_mutex_unlock
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.3.2 pthread_cond_broadcast
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_create
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_detach
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_attr_init
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_attr_getstacksize
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_attr_destroy
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_sigmask

This is the one built on Fedora 33:

objdump -T kamel| grep pthread                                                                                                                                                                       
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_mutex_lock
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.3.2 pthread_cond_wait
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_mutex_unlock
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.3.2 pthread_cond_broadcast
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_create
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_detach
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_attr_init
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_attr_getstacksize
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_attr_destroy
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.32  pthread_sigmask

So, I am guessing it may be a problem on any/most of the distros where the ABI is incompatible so crossing these version boundaries may be problematic.

So, yes, I think we might need to rely on CGO to ensure we have statically-linked binaries for the time being (specially since it looks like more changes like this will happen in the future).

astefanutti · 2021-03-17T11:51:57Z

My understanding is that the Kamel binary added to the operator image should be built on the image OS. I think that the current approach, that is to build the kamel binary locally, and add it to the operator image, is either an heritage of how the operator SDK scaffolded the project at its creation, or a convenient way to re-use the binary that is also used for the client CLI. We already cross-compiled the binary, which deactivates CGO, when it's built on non Linux OS.

It seems a possible solution would be to refactor the Dockerfile, and use a build stage to compile the binary on the same OS as the final image.

orpiske · 2021-03-17T12:16:02Z

My understanding is that the Kamel binary added to the operator image should be built on the image OS. I think that the current approach, that is to build the kamel binary locally, and add it to the operator image, is either an heritage of how the operator SDK scaffolded the project at its creation, or a convenient way to re-use the binary that is also used for the client CLI. We already cross-compiled the binary, which deactivates CGO, when it's built on non Linux OS.

It seems a possible solution would be to refactor the Dockerfile, and use a build stage to compile the binary on the same OS as the final image.

Compiling it on the same image could solve the problem, indeed. However there's one thing that concerns me with this approach: it makes it harder for distribution packagers and downstream users building and productizing this. This is specially true on environments with secluded internet access, where the dependencies may not be available at build time. Thus forcing the users/packagers to go through the work of copying previously cached dependencies into the build stage and dealing with all the downsides of doing that.

astefanutti · 2021-03-17T14:14:12Z

However it's done (multi-stage build, vendoring, Go proxy), I think packaging a binary built against the runtime image it will be the executed on, is more robust, and worth the possible extra efforts.

orpiske · 2021-03-17T16:09:54Z

However it's done (multi-stage build, vendoring, Go proxy), I think packaging a binary built against the runtime image it will be the executed on, is more robust, and worth the possible extra efforts.

I think it's worth mentioning that this affects not just the operator, but the CLI client as well. As I pointed with the CLI client that is available on the release page: if you are building the kamel CLI on any system that is based on glibc 2.32 (or newer), you are eventually going to have this problem. It is not the case yet because @nicolaferraro is using an older Fedora version, but eventually it might be.

astefanutti · 2021-03-17T16:21:03Z

Right. I think that's a bit different for client binaries, as probably Mac OS and Windows ones are cross-compiled, which disable CGO as far as a know. So it's probably fine disabling it for Linux client binary. That leads to the question, why disabling it only for Fedora?

And does it necessarily imply it should be done for the operator binary, for which it's possible to guarantee compatibility with its container image?

orpiske · 2021-03-17T16:30:24Z

Right. I think that's a bit different for client binaries, as probably Mac OS and Windows ones are cross-compiled, which disable CGO as far as a know. So it's probably fine disabling it for Linux client binary. That leads to the question, why disabling it only for Fedora?

I think it might be safe to disable it for any Linux system regardless of OS and I can adjust the PR accordingly. Fedora was just the one that I could reproduce the issue. I am not sure how many others distros have upgraded to 2.32 already, but I guess this is likely to become more common with newer distributions.

And does it necessarily imply it should be done for the operator binary, for which it's possible to guarantee compatibility with its container image?

Yes if you are building on Linux:

camel-k/script/Makefile

Lines 228 to 236 in 2ac21c4

    
           images: bundle-kamelets test 
        
           	mkdir -p build/_maven_output 
        
           	mkdir -p build/_output/bin 
        
           ifneq ($(shell uname -s 2>/dev/null || echo Unknown),Linux) 
        
           	GOOS=linux go build $(GOFLAGS) -o build/_output/bin/kamel ./cmd/kamel/*.go 
        
           else 
        
           	cp kamel build/_output/bin 
        
           endif 
        
           	docker build -t $(IMAGE_NAME):$(VERSION) -f build/Dockerfile .

…he Ubuntu-based image (apache#2140)

orpiske · 2021-03-17T16:35:35Z

FYI, I modified it to disable CGO on any Linux system.

astefanutti · 2021-03-17T16:44:57Z

OK, hopefully the operator won't be affected by not using any of the // +build cgo packages from the Go stdlib, which is probably the reason the binary was dynamically linked.

orpiske · 2021-03-17T17:31:02Z

OK, hopefully the operator won't be affected by not using any of the // +build cgo packages from the Go stdlib, which is probably the reason the binary was dynamically linked.

Hm, maybe let's wait, then? Although there is a problem that affects both the CLI and operator, if the solution may cause unforeseen problems ... then that's not ideal. I think the CGO think would work OK for the CLI, but the container may be a different problem indeed. I'll ping you to brainstorm this tomorrow.

astefanutti · 2021-03-17T17:36:10Z

Hm, maybe let's wait, then? Although there is a problem that affects both the CLI and operator, if the solution may cause unforeseen problems ... then that's not ideal. I think the CGO think would work OK for the CLI, but the container may be a different problem indeed. I'll ping you to brainstorm this tomorrow.

Totally agree. I'd like to make sure we understand the impacts for the operator. That's the reason I proposed to consider building it against its container image.

orpiske · 2021-03-18T09:16:48Z

@astefanutti and I talked today about this PR and it seems we have sufficient information to believe the change is safe. I am writing a summary of these, so we can understand the reasoning if the PR turns out to be broken.

It seems that the only dynamic symbols required by the binary are either glibc or golang-specific:

DYNAMIC SYMBOL TABLE:
000000000061fda0 g    DF .text	0000000000000050  Base        _cgo_panic
0000000000472280 g    DF .text	0000000000000019  Base        _cgo_topofstack
000000000061fe00 g    DF .text	000000000000005a  Base        crosscall2
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 __errno_location
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 getnameinfo
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 getaddrinfo
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 freeaddrinfo
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 gai_strerror
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 malloc
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 free
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 getgrgid_r
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 getgrnam_r
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 getpwnam_r
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 getpwuid_r
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 realloc
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 sysconf
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 getgrouplist
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 stderr
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 fwrite
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 vfprintf
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 fputc
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 abort
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_mutex_lock
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.3.2 pthread_cond_wait
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_mutex_unlock
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.3.2 pthread_cond_broadcast
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_create
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 nanosleep
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_detach
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 strerror
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 fprintf
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_attr_init
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_attr_getstacksize
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 pthread_attr_destroy
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 sigfillset
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.32  pthread_sigmask
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 mmap
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 munmap
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 setenv
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 unsetenv
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 sigemptyset
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 sigaddset
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 sigaction
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 sigismember

Cross-compiling the Linux binary on a mac results on a statically linked executable similar to the one generated by this patch.
There are two packages that rely on cgo and for both there are pure go implementations that are used as fallback when cgo is disabled or not available:

3.1 The first one is os/user for which the documentation says: " ... For most Unix systems, this package has two internal implementations of resolving user and group ids to names. One is written in pure Go and parses /etc/passwd and /etc/group. The other is cgo-based and relies on the standard C library (libc) routines such as getpwuid_r and getgrnam_r ... When cgo is available, cgo-based (libc-backed) code is used by default. This can be overridden by using osusergo build tag, which enforces the pure Go implementation ... "

3.2 The second is the resolver on the net package. In this package the documentation states: "... On Unix systems, the resolver has two options for resolving names. It can use a pure Go resolver that sends DNS requests directly to the servers listed in /etc/resolv.conf, or it can use a cgo-based resolver that calls C library routines such as getaddrinfo and getnameinfo ... By default the pure Go resolver is used, because a blocked DNS request consumes only a goroutine, while a blocked C call consumes an operating system thread. When cgo is available, the cgo-based resolver is used instead under a variety of conditions: on systems that do not let programs make direct DNS requests (OS X), when the LOCALDOMAIN environment variable is present (even if empty), when the RES_OPTIONS or HOSTALIASES environment variable is non-empty, when the ASR_CONFIG environment variable is non-empty (OpenBSD only), when /etc/resolv.conf or /etc/nsswitch.conf specify the use of features that the Go resolver does not implement, and when the name being looked up ends in .local or is an mDNS name ..."

The binary does seem to be using the cgo implementation:

0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 getnameinfo
0000000000000000      DO *UND*	0000000000000000  GLIBC_2.2.5 getaddrinfo

The patch is trivial do revert if there is any problem.

So, given these findings, we agreed it would be reasonably safe to do this change.

astefanutti

Thanks!

oscerd approved these changes Mar 16, 2021

View reviewed changes

nicolaferraro approved these changes Mar 17, 2021

View reviewed changes

fix: disable CGO when building on Fedora to avoid linking issues on t…

052b505

…he Ubuntu-based image (apache#2140)

orpiske force-pushed the issue-2140 branch from 7e48b8d to 052b505 Compare March 17, 2021 16:34

astefanutti approved these changes Mar 18, 2021

View reviewed changes

astefanutti merged commit 33a02e7 into apache:master Mar 18, 2021

astefanutti mentioned this pull request Mar 19, 2021

CGO and linking woes when building with newer Fedora versions #2140

Closed

nicolaferraro mentioned this pull request Apr 13, 2021

Release 1.4.0 #2134

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: disable CGO when building on Fedora to avoid linking issues on the Ubuntu-based image (#2140) #2141

fix: disable CGO when building on Fedora to avoid linking issues on the Ubuntu-based image (#2140) #2141

orpiske commented Mar 16, 2021

nicolaferraro left a comment

orpiske commented Mar 17, 2021 •

edited

Loading

astefanutti commented Mar 17, 2021

orpiske commented Mar 17, 2021

astefanutti commented Mar 17, 2021

orpiske commented Mar 17, 2021 •

edited

Loading

astefanutti commented Mar 17, 2021

orpiske commented Mar 17, 2021 •

edited

Loading

orpiske commented Mar 17, 2021

astefanutti commented Mar 17, 2021

orpiske commented Mar 17, 2021

astefanutti commented Mar 17, 2021

orpiske commented Mar 18, 2021 •

edited

Loading

astefanutti left a comment

fix: disable CGO when building on Fedora to avoid linking issues on the Ubuntu-based image (#2140) #2141

fix: disable CGO when building on Fedora to avoid linking issues on the Ubuntu-based image (#2140) #2141

Conversation

orpiske commented Mar 16, 2021

nicolaferraro left a comment

Choose a reason for hiding this comment

orpiske commented Mar 17, 2021 • edited Loading

astefanutti commented Mar 17, 2021

orpiske commented Mar 17, 2021

astefanutti commented Mar 17, 2021

orpiske commented Mar 17, 2021 • edited Loading

astefanutti commented Mar 17, 2021

orpiske commented Mar 17, 2021 • edited Loading

orpiske commented Mar 17, 2021

astefanutti commented Mar 17, 2021

orpiske commented Mar 17, 2021

astefanutti commented Mar 17, 2021

orpiske commented Mar 18, 2021 • edited Loading

astefanutti left a comment

Choose a reason for hiding this comment

orpiske commented Mar 17, 2021 •

edited

Loading

orpiske commented Mar 17, 2021 •

edited

Loading

orpiske commented Mar 17, 2021 •

edited

Loading

orpiske commented Mar 18, 2021 •

edited

Loading