Investigate the best way to decide when to read system logs from files or journald #10797

Closed
Tracked by #37086
belimawr opened this issue Aug 14, 2024 · 14 comments · Fixed by #11618

@belimawr
Contributor

Debian 12 has stopped writing system logs to traditional log files and now only uses journald by default (see release notes).

This leaves the system integration unable to ingest some data, because it expects to read directly from files.

We need to find the best way to detect whether system logs are stored in files or in journald, and then configure the correct input (log/filestream or journald).

There is a similar issue in the Beats repository to handle the same situation in Filebeat's system module: elastic/beats#40526.

@mauri870
Member

Since all Debian 12 installations use systemd-journald, maybe a condition like os == "debian" && version >= 12 is enough? Or is this a more general problem of detecting whether a Linux OS uses journald or log files? If it is the latter, we could probe for specific files in /var/log (e.g. dmesg, kern.log) as well as check whether systemd-journald.service is running.
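
The service check suggested above could be sketched as follows. This is a minimal illustration, not code from Beats or Elastic-Agent: the function name `journald_is_active` and its injectable `run` parameter are assumptions made for this example. It relies only on `systemctl is-active --quiet`, which exits with status 0 when the unit is active.

```python
import subprocess


def journald_is_active(run=subprocess.run):
    """Return True if systemd reports systemd-journald.service as active.

    `run` is injectable (hypothetical design choice) so the check can be
    exercised without a systemd host.
    """
    try:
        result = run(
            ["systemctl", "is-active", "--quiet", "systemd-journald.service"],
            check=False,
        )
    except FileNotFoundError:
        # systemctl is not installed, so this is not a systemd host.
        return False
    # `systemctl is-active` exits with 0 only when the unit is active.
    return result.returncode == 0
```

A running journald does not by itself prove that no log files are written, since a syslog daemon may run alongside it, so a check like this would likely need to be combined with the file probe.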

@belimawr
Contributor Author

A condition might be enough to start with. At the moment I'm not sure which information about the distros (like name and flavour) is available to use as conditions in the policy.

It is also a general problem of detecting this on all Linux hosts, so we don't have to manually update it whenever a new distro/version starts (or stops) using journald for system logs.

The last part of the challenge (maybe not covered in this issue) is how to handle ingest pipelines and other assets that expect the event to be in a specific format (mostly the plain text from the traditional log files) that is different from what the journald input will create.

The ingest pipelines might just be a matter of updating them to also support the events from the journald input, as they're capable of quite complex logic.

@pierrehilbert
Contributor

@belimawr as discussed yesterday, could you come up with some options you have identified to solve this issue and support operating systems that rely on journald instead of syslog?

@belimawr
Contributor Author

Problem statement

To support the system integration on Debian 12 and other distributions
that are migrating away from traditional log files, we need to be able
to ingest logs from either journald or traditional log files. Ideally
this would be fully automated and transparent for the user.

Currently Filebeat's system module implements this by checking whether
the log files exist: if none are found, the journald input is used,
otherwise the log input is used. To perform this check, Filebeat
uses a "proxy input" (system-logs) that decides whether to use the log
or journald input and starts the corresponding input.
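
The selection rule described above can be sketched roughly as below. This is an illustration of the decision only, not Filebeat's actual implementation: the function name `choose_input` and the default glob patterns are assumptions made for this example.

```python
import glob
import os

# Illustrative patterns only (an assumption); a real configuration would
# supply its own paths.
DEFAULT_PATTERNS = ("/var/log/messages*", "/var/log/syslog*", "/var/log/system*")


def choose_input(patterns=DEFAULT_PATTERNS):
    """Return "log" if any pattern matches an existing file, else "journald"."""
    for pattern in patterns:
        # Ignore directories that happen to match the pattern.
        if any(os.path.isfile(path) for path in glob.glob(pattern)):
            return "log"
    return "journald"
```

The check runs once when the input starts, so a host that begins writing log files later would not be noticed without a restart.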

Observations

  • The Elastic-Agent groups inputs based on the type defined in the
    integration data-stream's manifest; it does not need to match the
    Beat input.

Questions

Do we need to support a scenario where the system integration can collect system logs from any supported OS without any configuration?

If we add a toggle in the system logs settings to enable journald
instead of traditional log files, we will support Debian 12. If this
is disabled by default, then there will be no issues for users
upgrading the integration on OSes that use traditional log files.

My answer:
We probably do, so users can have a single policy deployed across a
variety of hosts/OSes.

Do we need to keep the Elastic-Agent running a different process for each actual Beats input type?

Honestly, what is an input type here? Is it a Beat input type or just
what is defined in the integration's manifest? How strict does this
mapping to the Beat input type need to be?

If it is the latter, then we can use the system-logs input that will
decide whether the journald or log input needs to be instantiated. This
should work without any state loss because the state folder is based
on the type defined in the integration's manifest.

Best option

Use the system-logs input from Filebeat

This is the best option because it keeps Filebeat's system module and
the Elastic-Agent system integration consistent, and the logic to
decide which input to use is in a single place.

Challenges:

  • Elastic-Agent runs all inputs of the same type in a single process
    and each process has its own state folder, so changing the
    input type between integration versions will make the newer version
    of the integration lose its state and re-ingest all files.

  • The Elastic-Agent will never know which input is running because it
    sends the system-logs input configuration, and Filebeat only
    decides between the log and journald inputs when it actually
    starts the input.

  • Filebeat input initialisation uses the input type before the input
    is actually instantiated. Other configuration fields (like tags,
    processors, index, etc.) are also processed independently of the
    input instantiation. This also causes input.type: system-logs to
    be added to the event regardless of the input that was
    instantiated.

  • Some configurations from the log input, like multiline, have a
    different syntax on journald.

Keeping the same type in the integration/data stream manifest should
allow the state to be kept (this needs to be tested).

Other options considered

Add a toggle in the integration configuration to read from journald

This is by far the easiest from an implementation point of view: we
just add a "use journald to collect system logs" toggle and the user can
select whether to use traditional log files or journald.

Paths, exclude files, etc would all be ignored in journald, which
makes sense and it isn't too bad.

The downside is that users will have to use a different policy for
different OSes.

The Elastic-Agent inspects the system and decides which input to start.

This is not a good option because it requires custom logic in the
Elastic-Agent to handle a specific case from an integration.

@cmacknz
Member

cmacknz commented Oct 30, 2024

After discussion in the data plane team meeting today, we concluded that the current approach of the new systemlogs input does not work well for agent. Instead, we should pursue an approach based on agent's input conditions (https://www.elastic.co/guide/en/fleet/current/dynamic-input-configuration.html). This will clearly show which input is active in Elastic Agent's state and health reports.

First, we will revert the change in 8.16 that causes the system module to use the systemlogs input by default. This is so that no user is forced to use this input by default while we evaluate if we need it at all.

Then, we will work to implement a conditions based approach. This will require the introduction of a new OS version or distribution (Debian 12, Windows 10, etc) field in Elastic Agent's host provider. This condition can then be used by us to specify which OS versions require use of journald by default.

For example:

inputs:
  - type: log
    id: a
    paths:
      - /var/log/*.log
    condition: ${host.os_version} != 'debian12'
  - type: journald
    id: b
    condition: ${host.os_version} == 'debian12'
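
To make the intended semantics of the example concrete, here is a rough sketch of how such conditions would select exactly one of the two inputs per host. It is not Elastic Agent's condition parser: the `evaluate` and `active_inputs` helpers are hypothetical and only understand the two forms used above (`${var} == 'value'` and `${var} != 'value'`).

```python
import re

# Matches: ${some.var} == 'value'   or   ${some.var} != "value"
_CONDITION = re.compile(r"""^\$\{([\w.]+)\}\s*(==|!=)\s*(['"])(.*)\3$""")


def evaluate(condition, variables):
    """Evaluate a single equality/inequality condition against variables."""
    match = _CONDITION.match(condition.strip())
    if match is None:
        raise ValueError(f"unsupported condition: {condition!r}")
    name, operator, _quote, expected = match.groups()
    # Assumption of this sketch: a missing variable compares as None.
    actual = variables.get(name)
    return actual == expected if operator == "==" else actual != expected


def active_inputs(inputs, variables):
    """Return the ids of inputs with no condition or with a true condition."""
    return [
        item["id"]
        for item in inputs
        if "condition" not in item or evaluate(item["condition"], variables)
    ]


# The two inputs from the example, reduced to the fields the sketch needs.
inputs = [
    {"type": "log", "id": "a", "condition": "${host.os_version} != 'debian12'"},
    {"type": "journald", "id": "b", "condition": "${host.os_version} == 'debian12'"},
]
```

Because the two conditions are exact complements, every host ends up with one of the inputs and never both.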

It must be possible for users to force use of journald or syslog regardless of the condition in the integration. We need to confirm that this can be done with the existing package templating support. The current OS conditional support for winlog can likely be used as a reference.


The system logs input is being removed from the Filebeat system module to give us the freedom to see if we can also use this approach with Beats. Beats also have support for conditions, but primarily for autodiscovery which may not do what we need. While we evaluate this we do not want users using the new input.

After we have support for conditions based on the OS version, we should evaluate a way to simplify this detection. For example, the inputs would be conditional on the presence of the syslog file paths in the file system, but agent currently has no way for us to do this. If it is simpler to implement a condition based on the paths in the log input, we should pursue that immediately to avoid having to maintain the OS version detection logic.

@belimawr
Contributor Author

belimawr commented Nov 1, 2024

I've been doing some testing and managed to get the conditions working. Some key points:

  • There is a new input, journald, listed under the system integration
  • Integrations allow input types, or data streams within an input, to be enabled/disabled
  • The journald inputs (one for syslog and another for auth) will have the condition populated to run on Debian 12
  • The current (log input) syslog and auth will have the condition populated to not run on Debian 12
  • By default, both inputs (log and journald) will be enabled

This allows users to install the integration accepting the defaults and have it working on any supported OS, while still allowing them to fine-tune when to use the log or journald input according to their specific needs.

Screenshot of the integration

The copy can be greatly improved; I'm focusing on the overall user experience here.

Image

Policy example

I removed some fields for simplicity, but this is the rendering from Fleet:

inputs:
  - id: journald-system-e23caacf-2836-4068-981e-5e7cd7ffe3cc
    name: system-1
    revision: 1
    type: journald
    use_output: default
    meta:
      package:
        name: system
        version: 1.61.1
    data_stream:
      namespace: default
    package_policy_id: e23caacf-2836-4068-981e-5e7cd7ffe3cc
    streams:
      - id: journald-system.syslog-e23caacf-2836-4068-981e-5e7cd7ffe3cc
        type: journald
        data_stream:
          dataset: null
        condition: '${host.os_version} == "12 (bookworm)"'
  - id: logfile-system-e23caacf-2836-4068-981e-5e7cd7ffe3cc
    name: system-1
    revision: 1
    type: logfile
    use_output: default
    meta:
      package:
        name: system
        version: 1.61.1
    data_stream:
      namespace: default
    package_policy_id: e23caacf-2836-4068-981e-5e7cd7ffe3cc
    streams:
      - id: logfile-system.auth-e23caacf-2836-4068-981e-5e7cd7ffe3cc
        data_stream:
          dataset: system.auth
          type: logs
        condition: '${host.os_version} != "12 (bookworm)"'
        ignore_older: 72h
        paths:
          - /var/log/auth.log*
          - /var/log/secure*
        tags:
          - system-auth
      - id: logfile-system.syslog-e23caacf-2836-4068-981e-5e7cd7ffe3cc
        data_stream:
          dataset: system.syslog
          type: logs
        condition: '${host.os_version} != "12 (bookworm)"'
        paths:
          - /var/log/messages*
          - /var/log/syslog*
          - /var/log/system*

Elastic-Agent Overview

Image

Elastic-Agent status output

root@vagrant-debian-12:~/elastic-agent-8.15.42-SNAPSHOT-linux-x86_64# elastic-agent status --output full
┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   ├─ status: (HEALTHY) Running
   ├─ info
   │  ├─ id: 530dd244-5270-4e87-ae1c-7b631bad1ede
   │  ├─ version: 8.15.42
   │  └─ commit: 68be8a762a06ed48b83249211d3ca9d2d00457f4
   ├─ beat/metrics-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '10329'
   │  ├─ beat/metrics-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ beat/metrics-monitoring-metrics-monitoring-beats
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ filestream-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '10319'
   │  ├─ filestream-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ filestream-monitoring-filestream-monitoring-agent
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ http/metrics-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '10308'
   │  ├─ http/metrics-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ http/metrics-monitoring-metrics-monitoring-agent
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   └─ journald-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '10298'
      ├─ journald-default
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: OUTPUT
      └─ journald-default-journald-system-c72ba722-5701-41f9-8645-f79b5f1d001a
         ├─ status: (HEALTHY) Healthy
         └─ type: INPUT

As an alternative to listing each distro explicitly in the conditions field, the host provider could expose a list of the distros that use journald for system logs (the "use systemd distros"). The condition could then be simplified to something like:

arrayContains(${host.systemd_distros}, ${host.os_version})

cc: @nimarezainia, @cmacknz

@belimawr
Contributor Author

belimawr commented Nov 1, 2024

Filtering by OS/version is going to be interesting. Here are some examples of what we get with go-sysinfo:
Arch Linux:

  "os": {
    "type": "linux",
    "family": "arch",
    "platform": "arch",
    "name": "Arch Linux",
    "version": "",
    "major": 0,
    "minor": 0,
    "patch": 0,
    "build": "rolling"
  }

Amazon Linux 2:

  "os": {
    "type": "linux",
    "family": "redhat",
    "platform": "amzn",
    "name": "Amazon Linux",
    "version": "2",
    "major": 2,
    "minor": 0,
    "patch": 0,
    "codename": "Karoo"
  }

Debian 12:

  "os": {
    "type": "linux",
    "family": "debian",
    "platform": "debian",
    "name": "Debian GNU/Linux",
    "version": "12 (bookworm)",
    "major": 12,
    "minor": 0,
    "patch": 0,
    "codename": "bookworm"
  }

Debian 11:

  "os": {
    "type": "linux",
    "family": "debian",
    "platform": "debian",
    "name": "Debian GNU/Linux",
    "version": "11 (bullseye)",
    "major": 11,
    "minor": 0,
    "patch": 0,
    "codename": "bullseye"
  }

Ubuntu 22.04:

  "os": {
    "type": "linux",
    "family": "debian",
    "platform": "ubuntu",
    "name": "Ubuntu",
    "version": "22.04.5 LTS (Jammy Jellyfish)",
    "major": 22,
    "minor": 4,
    "patch": 5,
    "codename": "jammy"
  }

@cmacknz
Member

cmacknz commented Nov 1, 2024

The integration and condition approach is working as expected, with some refining to do on the conditions, also expected.

You are going to need to filter on both the distro name and version, case in point being Amazon Linux, we do not want a condition just against the number "2" as that is far too ambiguous.

@belimawr
Contributor Author

belimawr commented Nov 1, 2024

You are going to need to filter on both the distro name and version, case in point being Amazon Linux, we do not want a condition just against the number "2" as that is far too ambiguous.

Yes, version worked for Debian 12, but I had no hope it would hold true for other distros. Amazon Linux is the most interesting one: platform is more specific than family.

@belimawr
Contributor Author

belimawr commented Nov 5, 2024

Sharing more of the progress and challenges. For anybody who wants to follow the code, I've opened a draft PR for the integration (#11618) and a draft PR with the Elastic-Agent changes: elastic/elastic-agent#5941.

Regarding the challenges: to properly manage the ingest pipelines, I created new data_streams for the journald versions of the syslog and auth data streams, which also allows them to have their own tests. Journald provides much more structured information, while traditional log files rely on more parsing or even on fetching information from the host system, so parts of the processing are very different.

This creates an interesting situation where data_stream.dataset and event.dataset get set to syslog_journald and auth_journald instead of syslog and auth. I've briefly tried to fix it using ingest pipelines, but with no success as there are some mapping issues. I'm leaving it as is for now and I'll circle back later.

Another challenge is the system tests (when elastic-package runs an Elastic-Agent instance with the integration). The current implementation first starts an Elastic-Agent with the policy fully configured, then starts a container with the service being tested (in our case, a container with the journal files to be ingested). The problem is that the journald input does not watch the filesystem for files: it assumes either that the journal files are already there or that it is reading from a system journald that already exists. I'm also leaving this on standby for now and I'll come back to it later.

@cmacknz
Member

cmacknz commented Nov 5, 2024

This creates an interesting situation where data_stream.dataset and event.dataset get set to syslog_journald and auth_journald instead of syslog and auth.

We need to track fixing this as it will break all queries and visualizations. Make sure there is a bug tracking a fix for this.

@belimawr
Contributor Author

belimawr commented Nov 5, 2024

This creates an interesting situation where data_stream.dataset and event.dataset get set to syslog_journald and auth_journald instead of syslog and auth.

We need to track fixing this as it will break all queries and visualizations. Make sure there is a bug tracking a fix for this.

I'll try to fix it as part of my PR.

@belimawr
Contributor Author

belimawr commented Nov 5, 2024

Amazon Linux 2023 only uses journald. Here is the host info:

{
  "type": "linux",
  "family": "redhat",
  "platform": "amzn",
  "name": "Amazon Linux",
  "version": "2023",
  "major": 2023,
  "minor": 6,
  "patch": 20241010,
  "codename": "Amazon Linux"
}
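
Putting the host info from this thread together with the earlier point about matching on both distro and version, one possible shape for the check is sketched below. It is only an illustration: the `uses_journald` helper and the `MIN_JOURNALD_MAJOR` table are hypothetical, and treating every later major version as journald-only is an assumption, not something established in this issue.

```python
# Minimum major version, per platform, reported in this thread as
# journald-only. Assumption: later major versions keep that behaviour.
MIN_JOURNALD_MAJOR = {
    "debian": 12,
    "amzn": 2023,
}


def uses_journald(os_info):
    """Decide from the platform and major fields of the OS info."""
    minimum = MIN_JOURNALD_MAJOR.get(os_info.get("platform"))
    if minimum is None:
        # Unknown platform: fall back to traditional log files.
        return False
    return os_info.get("major", 0) >= minimum


# Trimmed copies of the host info shown in this thread.
debian_12 = {"family": "debian", "platform": "debian", "version": "12 (bookworm)", "major": 12}
debian_11 = {"family": "debian", "platform": "debian", "version": "11 (bullseye)", "major": 11}
amazon_2 = {"family": "redhat", "platform": "amzn", "version": "2", "major": 2}
amazon_2023 = {"family": "redhat", "platform": "amzn", "version": "2023", "major": 2023}
```

Matching on platform rather than family keeps Ubuntu (family debian, platform ubuntu) out of the Debian rule, and pairing it with the major version avoids a bare comparison against "2".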

@belimawr
Contributor Author

belimawr commented Nov 5, 2024

I managed to fix the values of data_stream.dataset and event.dataset using Beats processors. The processors declared in the integration configuration are added after the ones created by Elastic-Agent, so they can replace the values those set. It's not the most elegant solution, but it works. Here is the implementation.
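
The ordering effect can be illustrated with a toy model. This is not the Beats processor API: `add_fields` and `run_processors` are hypothetical helpers, events are simplified to flat dictionaries with dotted keys, and the dataset names are the ones quoted earlier in this thread.

```python
def add_fields(fields):
    """Return a processor that overwrites the given fields on an event."""
    def processor(event):
        updated = dict(event)
        updated.update(fields)
        return updated
    return processor


def run_processors(event, processors):
    """Apply processors in order; later ones see earlier ones' output."""
    for processor in processors:
        event = processor(event)
    return event


# Stand-in for the processors Elastic-Agent creates from the data stream.
agent_processors = [
    add_fields({"data_stream.dataset": "syslog_journald",
                "event.dataset": "syslog_journald"}),
]
# Stand-in for the processors declared in the integration configuration.
integration_processors = [
    add_fields({"data_stream.dataset": "syslog",
                "event.dataset": "syslog"}),
]

event = run_processors({"message": "example"},
                       agent_processors + integration_processors)
```

Since the integration's processors run last in this model, the final event carries the syslog dataset values.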

I also tested the dashboards, they all seem to work well, I added some screenshots in the integrations PR.
