Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

working OpenTelemetry sidecar solution #8622

Closed
wants to merge 9 commits into from
Closed

Conversation

Tobrek
Copy link
Contributor

@Tobrek Tobrek commented May 23, 2022

What this PR does / why we need it:

fixes #8437 and #5883
This is a working rudimentary support of OpenTelemetry sidecar. Currently just loading the OpenTelemetry module and the config at http block is supported. All other OpenTelemetry nginx directives (https://github.com/open-telemetry/opentelemetry-cpp-contrib/tree/main/instrumentation/nginx#nginx-directives) are not supported right now.
This PR should be just a possible solution to fix the sidecar and required libraries problem.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation only

Which issue/s this PR fixes

fixes #8437
#5883

How Has This Been Tested?

I used a minikube environment with an installed Opentelemetry collector and connected Grafana.
I added an additional config map for OpenTelemetry config:

    "kind": "ConfigMap",
    "apiVersion": "v1",
    "metadata": {
        "name": "nginx-otel",
        "namespace": "ingress-nginx",
        "creationTimestamp": null
    },
    "data": {
        "otel-nginx.toml": "exporter = \"otlp\"\nprocessor = \"batch\"\n\n[exporters.otlp]\n# Alternatively the OTEL_EXPORTER_OTLP_ENDPOINT environment variable can also be used.\nhost = \"http://otel-collector:4317\"\nport = 4317\n\n[processors.batch]\nmax_queue_size = 2048\nschedule_delay_millis = 5000\nmax_export_batch_size = 512\n\n[service]\nname = \"nginx-proxy\" # Opentelemetry resource name\n\n[sampler]\nname = \"AlwaysOn\" # Also: AlwaysOff, TraceIdRatioBased\nratio = 0.1\nparent_based = false\n"
    }
}

After installing ingress-nginx using helm chart 4.1.1 and this values file

controller:
  image:
    registry: tobrek
    image: tobrek-nginx
    tag: v1.2.1
    digest: ''
  config:
    enable-opentelemetry: true
    opentelemetry-config: "/conf/otel-nginx.toml"
  extraModules: 
  - name: opentelemetry
    image: "tobrek/tobrek-nginx:opentelemetry-sidecar-1.2.1"
  service:
    type: NodePort
  hostPort:
    enabled: true
  extraVolumeMounts:
  - name: otel-nginx
    readOnly: true
    mountPath: /conf
  extraVolumes:
   - name: otel-nginx
     configMap:
      name: nginx-otel

i saw incomming traces in Grafana for "nginx-proxy".

Checklist:

  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
    I don't know if i changed all necessary documentation.
  • I've read the CONTRIBUTION guide
  • I have added tests to cover my changes.
    For what i know, yes.
  • All new and existing tests passed.
    new test cases - yes
    old tests - no. This one is failing locally:
panic: mkdir /.cache: permission denied

goroutine 1 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/testing/addr.init.0()
        /go/src/k8s.io/ingress-nginx/.modcache/sigs.k8s.io/[email protected]/pkg/internal/testing/addr/manager.go:51 +0xda
FAIL    k8s.io/ingress-nginx/internal/ingress/controller/store  0.010s

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented May 23, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 23, 2022
@k8s-ci-robot
Copy link
Contributor

Welcome @Tobrek!

It looks like this is your first PR to kubernetes/ingress-nginx 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/ingress-nginx has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 23, 2022
@k8s-ci-robot
Copy link
Contributor

Hi @Tobrek. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-priority size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 23, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Tobrek
To complete the pull request process, please assign elvinefendi after the PR has been reviewed.
You can assign the PR to them by writing /assign @elvinefendi in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@longwuyuan
Copy link
Contributor

You have change the nginx base image entry point and it needs a lot of reading to make sense of it.

Is it ok to ask for simple but elaborate and detailed explanation on why you need that change to the nginx Dockerfile.
Also, what happens to users who do not enable the opentelemetry. How are the non-users of opentelemetry impacted by the changes you are proposing here.

@Tobrek
Copy link
Contributor Author

Tobrek commented May 24, 2022

@longwuyuan Good point about test for use case without sidecar. I catch up on that test and the nginx is running without any errors, when started without the sidecar.

Regarding the changes in nginx Dockerfile:
First the changed CMD to ENTRYPOINT - i changed it back to CMD.

I removed all the (opentelemetry) entrypoint.sh parts, because i think they are not necessary. In the opentelemetry sidecar there are the files for the OpenTelemetry module and the libraries. The only difference for the module location (etc/nginx/modules vs /modules_mount/...) i recognized is that the load_module command have to have the path included. When copying the file to /etc/nginx/modules this may not needed. But the entrypoint.sh script copies the file to /etc/nginx/modules/modules (one modules to much). And this (too long) path must be also set explicit for the load_module command. So there is no big difference if the load_module command uses "/etc/nginx/modules/modules/otel_ngx_module.so" or "/modules_mount/etc/nginx/modules/otel_ngx_module.so".
Also if i recognized it correctly the nginx entrypoint is overwritten by nginx controller entrypoint. So if you want to copy the otel module to the "correct" place by executing the nginx entrypoint you have to refactor it. There are ways to do that, but I think it is more effort then benefit (see above).
In general, i reverted the opentelemetry entry point changes, to keep the OpenTelemetry module inside the sidecar. It can be handled there nearly in the same way as from the "correct" location, by less effort to move it to the new location.

The only change which is "new", is adding the /modules_mount path to the library search path by using ld-musl. My tests with LD_LIBRARY_PATH or copying the OpenTelemetry library files to any usable path (e.g. at init container runtime) didn't work, be to early / have no effect or not the permission to do that. E.g. to copy the files at init container runtime it requires root permission, but at this time it uses the www-data user. Setting weak permissions at /usr/lib, to be able to copy the files as www-data, is also no way you can handle it. That's why my solution is to add the /modules_mount path, where the OpenTelemetry library files will be, to the global library search path. That change is also active for using nginx without the sidecar. But this should have no negative effect (i hope), because normally the additional path doesn't exists.

In this context ... i reverted the copy of OpenTelemetry library files to /etc/nginx/modules in the sidecar init script. In this proposed solution the module and the library files are used by explicit path. That's why i don't see the need to mix the files up and have a clean path structure in the modules_mount.

@longwuyuan
Copy link
Contributor

@Tobrek part of what you described is acceptable and it will help if we can moe ahead with that. But there are other parts that are helpful but not integrated yet.

The project does want to move to having all modules as sidecars but it is not integrated yet. Meaning the helm chart and other manifest generation with opentelemetry, open tracing, mod security, others is not built into the helm chart or the manifest yaml generation bash scripts.

If you can suggest ho/where to set LD_LIBRARY_PATH with the way things are now, using the initcontainer technique that is working as of a few hours ago, we can make progress.

Otherwise I am waiting for advise from @esigo to change the design and deal with the libopentelemetry*.so files sitting in /modules_mount/.....

@strongjz
Copy link
Member

/kind bug
/triage accepted
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority labels May 31, 2022
@strongjz
Copy link
Member

/assign @Tobrek

/check-cla

If you havent yet @Tobrek can you sign the CNCF CLA https://easycla.lfx.linuxfoundation.org/#/

@Tobrek
Copy link
Contributor Author

Tobrek commented Jun 1, 2022

I signed the CNCF CLA several times, but it isn't accepted here. I don't know what I'm doing wrong.

@longwuyuan
Copy link
Contributor

longwuyuan commented Jun 1, 2022

@strongjz @Tobrek If a user wants to use opentelemetry as well as some other module, for example modsecurity, then are we going to have a ingress-controller pod with one container for ingress-controller code, a sidecar container for opentelemetry, a second sidecar for modsecurity, and potentially more than 3 containers in one single pod.

@Tobrek
Copy link
Contributor Author

Tobrek commented Jun 1, 2022

@longwuyuan the opentelemetry sidecar is no real sidecar in my opinion. This "sidecar" runs as init container to copy the files to the main container and then stops.

@longwuyuan
Copy link
Contributor

Thank you @Tobrek . We have data that creating ld-musl-x86_64.path is a absolute requirement.

Can we have a screen sharing to see your test or can you provide tips/steps on how I can checkout your fork and enable opentelemetry in a ingress-controller installation, using your fork. Or are there hard dependencies on rebuilding the base nginx image or a new opentelemetry image as per your changes.

@Tobrek
Copy link
Contributor Author

Tobrek commented Jun 2, 2022

@longwuyuan Sry currently i'm on vacation and online just in the evening (GMT+2). I think next week we can find a day, where i can show you what i did and the results.

@Tobrek
Copy link
Contributor Author

Tobrek commented Jun 13, 2022

Hello @longwuyuan, sry for the late response, but as written I was on vacation. Now I'm back and have my developing environment up and running again.
Sure, we can do a screen sharing session to show what i did and show my test. I can ping you at Slack to find a time slot if you want.

@longwuyuan
Copy link
Contributor

Yes, please ping me on slack. We have 2 PRs on this so hopefully we can gather info to make some progress.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 20, 2022
@Tobrek
Copy link
Contributor Author

Tobrek commented Jun 20, 2022

After discussion with @longwuyuan i rebased my fork with the latest changes and could fix my EasyCLA issue.
I tested my rebased fork (build sidecar, nginx and ingress-nginx locally and run the new version) and still got tracing information from nginx.

Reminder (as already mentioned in my initial post): additional required configuration to get the OpenTelemetry sidecar working:
otel configuration as configmap:

{
    "kind": "ConfigMap",
    "apiVersion": "v1",
    "metadata": {
        "name": "nginx-otel",
        "namespace": "ingress-nginx",
        "creationTimestamp": null
    },
    "data": {
        "otel-nginx.toml": "exporter = \"otlp\"\nprocessor = \"batch\"\n\n[exporters.otlp]\n# Alternatively the OTEL_EXPORTER_OTLP_ENDPOINT environment variable can also be used.\nhost = \"http://otel-collector:4317\"\nport = 4317\n\n[processors.batch]\nmax_queue_size = 2048\nschedule_delay_millis = 5000\nmax_export_batch_size = 512\n\n[service]\nname = \"nginx-proxy\" # Opentelemetry resource name\n\n[sampler]\nname = \"AlwaysOn\" # Also: AlwaysOff, TraceIdRatioBased\nratio = 0.1\nparent_based = false\n"
    }
}

additional lines in the values.yaml file (for exporter configuration above):

  extraVolumeMounts:
  # mount opentelemetry-cpp lib config from configmap
  - name: otel-nginx
    readOnly: true
    mountPath: /conf

  extraVolumes:
  # needed for otel-nginx extraVolumeMount
   - name: otel-nginx
     configMap:
      name: nginx-otel

Next open steps could / would be adding the Opentelemetry module configuration (nginx.conf, https://github.com/open-telemetry/opentelemetry-cpp-contrib/tree/main/instrumentation/nginx) to the helm chart.

@Tobrek
Copy link
Contributor Author

Tobrek commented Jun 20, 2022

I had a little chat with @longwuyuan why changing the nginx image. To don't loose the answers or type everything twice, here a copy of the slack conversation:

very suspected removal images/nginx/rootfs/Dockerfil line 36 37

Daniel Schulze
I removed the complete entrypoint stuff which copies the module to /etc/nginx/modules and load the module, when required, directly from /modules_mount/etc/nginx/modules
{{ if (shouldLoadOpentelemetryModule $cfg $servers) }}
load_module /modules_mount/etc/nginx/modules/otel_ngx_module.so;
{{ end }}
One reason and my main reason is that the complete entrypoint stuff was not working correctly and the images/nginx/rootfs/entrypoint.sh file was added for the sidecar.
The sidecar init_module.sh script did not match the expected path of the entrypoint.sh script. The init_modules.sh script was copying the module file to /modules_mount/etc/nginx/modules/modules/ and the entrypoint.sh script copies the module file using the wrong file name. So i expect to change the entrypoint.sh script regardless if i fix the script or use the easy way to use the /modules_mount path to include the module.

The only way to not change the nginx image would be to copy the module in the init_module.sh script to /modules_mount/etc/nginx/modules/otel_ngx_module.so/otel_ngx_module.so to match the entrypoint.sh expectation. And that looks hilarious.

With the idea of more sidecar / init container for other modules in mind, removing the entrypoint stuff is the better idea, because

  • as mentioned before now you can use the /mount_module as file system for all "external" modules - modules can be loaded with explicit path, /modules_mount/usr/lib is added to the library path
  • otherwise, without changing the entrypoint stuff, every sidecar module have to match the same strange path requirements as described above (/modules_mount/etc/nginx/modules/<module_name>/<module_name>) to be copied to /etc/nginx/modules in a correct way.

One more reason to remove the nginx entrypoint stuff: because ingress-nginx extends the nginx image, the ENTRYPOINT definition in rootfs/Dockerfile (and rootfs/Dockerfile.chroot) is overwritting the nginx/Dockerfile ENTRYPOINT, which should copy the files. I also found a way to handle "several entrypoint scripts", but this also requires a change to the nginx image.
(My solution to this problem would be to copy the nginx and the main entrypoint scripts to a specific directory and execute run-parts at this directory. Possible problems: order of execution, if the files are not named in an ordering way.)

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 23, 2022
@k8s-ci-robot
Copy link
Contributor

@Tobrek: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Tobrek
Copy link
Contributor Author

Tobrek commented Jun 23, 2022

This pull request is split into #8719 and #8735. Closing

@Tobrek Tobrek closed this Jun 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/docs cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

An error occurred between opentelemetry modules
4 participants