Evaluate Logging Options for provider and tenant (container) logs #250

anilmurty · 2024-08-26T23:21:56Z

Is your feature request related to a problem? Please describe.

When debugging customer issues today, we have limited logging capabilities available. Broadly (ideally) we need logs from two places: the provider (including the Kubernetes cluster) and the container (the tenant's application container), so that we can determine the source of an issue. Ideally, the logs should also be retained for a reasonable amount of time (at least a few hours, if not several days or weeks) so that we don't have only a short window in which to catch issues.

This is what we have (TODAY) in terms of logs, retention and ability to query things:

Provider/ Cluster:

Logs from the various k8s control plane components (API server, scheduler, etcd, kubelet, kube-proxy, container-runtime)
Logs from akash provider code (?) - are there any?
Availability: typically for the last hour?
Key Challenge: If you don't jump in and look at the logs fast enough (within an hour of the incident) you lose them.

Tenant/ Container:

Container logs streamed into Console.
Available for the duration of the lease. If the lease is closed, logs are lost.

Limited query capabilities - mostly have to grep for anything we want from the above two logs.

Describe the solution you'd like

The logs from the provider software, the k8s control plane components and the tenant containers (apps) are collected, stored and made queryable through some logging platform (like Logdy, Grafana/ Loki, the ELK stack or similar).

The provider should be able to configure (via CLI or provider console) where to send provider and k8s logs.

The user (tenant) should be able to configure (via Console or API) where the tenant/ container logs should be sent.

As a precursor to implementing any UI and API changes to support this, we want to evaluate whether fluentd is a good option for us to use for log collection.

Benefits of using fluentd:

Open Source
Can run one fluentd pod per node and collect logs from all containers on the node (?)
Can configure fluentd plugins to
- Receive both (provider and container) types of logs (?)
- Output logs to various destinations (kafka, elasticsearch etc)
- Output various data export formats
- Set resource limits (in terms of CPU and memory) for the fluentd pods

Goal of the exercise:

Evaluate whether fluentd can be used for collecting akash provider and tenant container logs. Will need to configure daemonset accordingly?
Evaluate resource load in terms of CPU and memory utilization of the nodes. We will want to run a multi-node cluster with some very chatty applications.
Evaluate network bandwidth utilization (internal/ E-W as well as external/ N-S) for exporting logs.
Ensure sensitive information can be masked.
Ensure logs can be visualized and queried with at least one common visualization tool (kibana, grafana, logdy)
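
For the masking goal above, a minimal sketch of what "masked" could mean in practice may help when evaluating any collector. The patterns and the `mask_line` helper below are hypothetical and written in plain Python for illustration; they are not fluentd configuration, and a real deployment would need patterns tuned to its own data.

```python
import re

# Hypothetical patterns for illustration only; tune these to the real log data.
MASK_PATTERNS = [
    # key=value or key: value pairs whose key name suggests a secret
    (
        re.compile(
            r"\b(password|passwd|secret|token|api[_-]?key)(\s*[=:]\s*)(\S+)",
            re.IGNORECASE,
        ),
        r"\1\2****",
    ),
    # HTTP bearer tokens
    (
        re.compile(r"\b(bearer\s+)[A-Za-z0-9._~+/-]+=*", re.IGNORECASE),
        r"\1****",
    ),
]


def mask_line(line: str) -> str:
    """Return the log line with values matching any pattern replaced by ****."""
    for pattern, replacement in MASK_PATTERNS:
        line = pattern.sub(replacement, line)
    return line


if __name__ == "__main__":
    print(mask_line("login ok password=hunter2 user=bob"))
    print(mask_line("Authorization: Bearer abc.def-123"))
```

Whichever collector is chosen, the same sample lines could be fed through its own filter stage to confirm it masks at least this much before logs leave the node.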

Describe alternatives you've considered

continuing to grep logs from kubectl for provider

Search

I did search for other open and closed issues before opening this

Code of Conduct

I agree to follow this project's Code of Conduct

Additional context

No response

andy108369 · 2024-09-18T09:19:03Z

I think we should look into the managed hassle-free solutions, so any Akash Provider (K8s cluster) can install the agent and get all the logs out of the box available on the dashboard:

https://newrelic.com/
https://www.datadoghq.com/
https://logz.io/
https://coralogix.com/
https://betterstack.com/logs
https://opensearch.org/ -- is an open source fork of ElasticSearch, which might be useful to go with instead of ES if we want to go down that path (to manage it ourselves)

And maintaining Elastic Search is a big pain. Few companies use cloud.elastic.co which is a managed ES solution.

I think most (if not all) of them support K8s pods logging (including akash-provider pod, etc), so we probably just need to pick the one that:

ideally the cheapest
easiest to install (similarly to netdata cloud agent installation, one-click install)
easiest & fastest to render the logs and useful UI/UX experience to parse through the logs (can simply test querying some of the known dseq, recently deployed)

fluentd DaemonSet and managed ElasticSearch
It should be very easy to install Fluentd directly into K8s since they provide DaemonSet install https://docs.fluentd.org/container-deployment/kubernetes
then we could just point it to the managed ElasticSearch https://cloud.elastic.co and that's it.
It looks like the basic managed ES would cost us 0.0532*(24*(365/12)) ≈ $38.84/month, when looking at the defaults here https://cloud.elastic.co/pricing?elektra=pricing-page
It is likely that we'll have to scale it vertically over time, but probably not too much. I think we'll want like at least a month or two of the most recent logs
Elastic Cloud uses HTTPS by default, and we will need to pass fluentd the auth user & password once we have the details from the ES cloud.
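
The monthly estimate above can be reproduced with a short calculation, which also makes it easy to re-run if the tier changes. This is only a sketch: the hourly rate is the one quoted in this comment, and the `monthly_cost` helper is a hypothetical name.

```python
# Average month length in hours: 24 hours * (365 days / 12 months)
HOURS_PER_MONTH = 24 * (365 / 12)


def monthly_cost(hourly_rate_usd: float) -> float:
    """Convert an hourly price into an average monthly price, rounded to cents."""
    return round(hourly_rate_usd * HOURS_PER_MONTH, 2)


if __name__ == "__main__":
    # 0.0532 USD/hour is the rate quoted in the comment above.
    print(monthly_cost(0.0532))
```

With the quoted rate this comes to about $38.84 per month, before any vertical scaling.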
OpenSearch helm charts
We can also try, in parallel, getting ES running within an existing K8s cluster on one of the providers without leases
since the ElasticSearch charts https://github.com/elastic/helm-charts were archived on May 16, 2023, we can use OpenSearch instead https://opensearch.org/docs/latest/install-and-configure/install-opensearch/helm/
That's not going to be a centralized and hassle-free managed solution yet, but at least we can evaluate whether the fluentd DaemonSet logs what we need (akash-provider pod logs, etc) and see how many resources OpenSearch consumes.

andy108369 · 2024-09-18T14:06:03Z

@shimpa1 installed fluentd+Loki+grafana, and everything looks to be working. He is now fixing a small issue with fluentd not seeing the kubernetes_metadata plugin. Once it is working, we can install it across the rest of the providers. ~~It is not going to be a centralized solution for ALL of the providers, but for EACH of the providers.~~ Artur asked to have it in a centralized place. So we'll get a VM with Loki+Grafana and point fluentd running on each provider to it.
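
Once a central Loki is receiving logs, the "query a known dseq" check suggested earlier could be scripted against Loki's HTTP `query_range` endpoint. The sketch below only builds the request URL. The host name, the `namespace` label, and the sample dseq are assumptions, since the labels actually available depend on how fluentd is configured to tag the streams.

```python
from urllib.parse import urlencode


def dseq_query(namespace: str, dseq: str) -> str:
    """Build a LogQL query selecting one namespace and filtering lines by dseq.

    The `namespace` label is an assumption; use whatever labels fluentd attaches.
    """
    return f'{{namespace="{namespace}"}} |= "{dseq}"'


def loki_query_range_url(base_url: str, logql: str, start_ns: int, end_ns: int,
                         limit: int = 100) -> str:
    """Build a URL for Loki's /loki/api/v1/query_range endpoint.

    start_ns and end_ns are Unix timestamps in nanoseconds.
    """
    params = urlencode(
        {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    )
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{params}"


if __name__ == "__main__":
    # Hypothetical host, namespace and dseq, for illustration only.
    query = dseq_query("akash-services", "17000000")
    print(loki_query_range_url("http://loki.example:3100", query,
                               1726650000000000000, 1726653600000000000))
```

The same LogQL string can be pasted into Grafana's Explore view to compare how quickly results render there.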

anilmurty added the awaiting-triage label Aug 26, 2024

anilmurty assigned devalpatel67 and andy108369 Aug 26, 2024

anilmurty added this to Akash Cohesive Product / Engineering Roadmap Aug 26, 2024

github-project-automation bot moved this to Backlog (not prioritized) in Akash Cohesive Product / Engineering Roadmap Aug 26, 2024

anilmurty moved this from Backlog (not prioritized) to Up Next (prioritized) in Akash Cohesive Product / Engineering Roadmap Aug 26, 2024

chainzero added repo/provider Akash provider-services repo issues and removed awaiting-triage labels Sep 18, 2024
