Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(scap): properly detect threads in child pidns during /proc scan #860

Merged
merged 2 commits into from
Feb 7, 2023

Conversation

gnosek
Copy link
Contributor

@gnosek gnosek commented Feb 2, 2023

tid != vtid is not necessary (though sufficient) to check if a thread is in a child pid namespace. This leads to pidns_start_ts being wrong occasionally (when a thread happens to have tid == vtid by chance, even if it's in a child pidns).

Note: the name of the flag (PPM_CL_CHILD_IN_PIDNS) isn't the best for this use case but it effectively means what we want ("we know this thread is in a child pidns, even if tid == vtid")

What type of PR is this?

Uncomment one (or more) /kind <> lines:

/kind bug

/kind cleanup

/kind design

/kind documentation

/kind failing-test

/kind feature

Any specific area of the project related to this PR?

Uncomment one (or more) /area <> lines:

/area API-version

/area build

/area CI

/area driver-kmod

/area driver-bpf

/area driver-modern-bpf

/area libscap-engine-bpf

/area libscap-engine-gvisor

/area libscap-engine-kmod

/area libscap-engine-modern-bpf

/area libscap-engine-nodriver

/area libscap-engine-noop

/area libscap-engine-source-plugin

/area libscap-engine-savefile

/area libscap-engine-udig

/area libscap

/area libpman

/area libsinsp

/area tests

/area proposals

Does this PR require a change in the driver versions?

/version driver-API-version-major

/version driver-API-version-minor

/version driver-API-version-patch

/version driver-SCHEMA-version-major

/version driver-SCHEMA-version-minor

/version driver-SCHEMA-version-patch

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

@gnosek
Copy link
Contributor Author

gnosek commented Feb 2, 2023

cc @incertum

@poiana poiana added the size/XS label Feb 2, 2023
@FedeDP
Copy link
Contributor

FedeDP commented Feb 2, 2023

/milestone 0.11.0

@poiana poiana added this to the 0.11.0 milestone Feb 2, 2023
@incertum
Copy link
Contributor

incertum commented Feb 3, 2023

Greg you are the best, thanks for catching these subtleties I totally wasn't aware of 🙏 !

Q: Should it also be changed here

if (tinfo->m_vpid != tinfo->m_pid)
-> the reason why I also added the check there is to have a consistent boot_ts without imprecisions and that's why the method of retrieving boot_ts was also changed in that same PR in order to have it constant between agent reboots even -> however it would also be ok to remove the check there all together, sometimes I try to be over precise where it's not even needed.

@gnosek
Copy link
Contributor Author

gnosek commented Feb 3, 2023

This particular one cost me over a month of work a while back (3b7a794) so now I'm suspicious every time I see tid and vtid in the same expression ;)

Thanks for pointing out the other place we need to fix :) I think we can just check in_container there, as defined at:

if(flags & PPM_CL_CHILD_IN_PIDNS || tid != vtid)
but I'll take another look after the coffee kicks in.

@incertum
Copy link
Contributor

incertum commented Feb 3, 2023

This sounds about right @gnosek would say, but I'll also actually check it as well! -> ✔️

@incertum
Copy link
Contributor

incertum commented Feb 4, 2023

@gnosek [nit] would renaming in_container to in_pidns improve technical clarity? Because libs only recognizes something as actual container when resolve cgroup returns a valid id. In addition, this happens later as part of the container engine.

A current corner case is when you launch a container directly over containerd CLI:

sudo ctr image pull docker.io/library/busybox:latest
sudo ctr run --rm docker.io/library/busybox:latest hello sh

-> cursor will be just blinking no bash prompt, but you can now run commands!

@terylt also requested something like incertum@8c79505, but am gonna hold off until we have discussed if we would even want that.

@gnosek
Copy link
Contributor Author

gnosek commented Feb 6, 2023

@incertum,

It seems we can use in_container in the clone exit parser so I did. I also added another commit on top, hoping to reduce stat() calls while scanning /proc. With the current code it seems we're doing a stat() for every thread for every cgroup subsystems and these calls add up quickly. Please take a look if you can :)

would renaming in_container to in_pidns improve technical clarity?

Possibly :) though I'd leave it to a separate no-op commit. Also, pidns_start_ts technically isn't correct either, since it's the creation time of some cgroup subsys, not the pidns (probably close enough for practical purposes but still).

something like incertum@8c79505

I'm potentially overengineering the problem but I think I'd prefer a bitmap of non-init namespaces the thread is in and a separate filtercheck to access them, otherwise we're one PR away from host_pidns_netns_ipc_user etc ;)

@incertum
Copy link
Contributor

incertum commented Feb 6, 2023

It seems we can use in_container in the clone exit parser so I did. I also added another commit on top, hoping to reduce stat() calls while scanning /proc. With the current code it seems we're doing a stat() for every thread for every cgroup subsystems and these calls add up quickly. Please take a look if you can :)

Very nice changes, love them!

would renaming in_container to in_pidns improve technical clarity?

Possibly :) though I'd leave it to a separate no-op commit. Also, pidns_start_ts technically isn't correct either, since it's the creation time of some cgroup subsys, not the pidns (probably close enough for practical purposes but still).

Agreed re separate PR.
Re pidns_start_ts technically isn't correct either -> very painfully aware, was banging my head against the wall for 2 days to find an approach to kindish support this field in the proc scan without adding extra lookups, so that's what seemed the most acceptable solution. Would you have new ideas?

something like incertum@8c79505

I'm potentially overengineering the problem but I think I'd prefer a bitmap of non-init namespaces the thread is in and a separate filtercheck to access them, otherwise we're one PR away from host_pidns_netns_ipc_user etc ;)

Agreed and I think a bitmap would be the right engineering approach here, thanks for the valuable input ;)
#865

@incertum
Copy link
Contributor

incertum commented Feb 6, 2023

/approve

incertum
incertum previously approved these changes Feb 6, 2023
@poiana
Copy link
Contributor

poiana commented Feb 6, 2023

LGTM label has been added.

Git tree hash: 36cb75d3f702d685f46f873abddb6a1bed938d67

@gnosek
Copy link
Contributor Author

gnosek commented Feb 6, 2023

Re pidns_start_ts technically isn't correct either -> very painfully aware, was banging my head against the wall for 2 days to find an approach to kindish support this field in the proc scan without adding extra lookups, so that's what seemed the most acceptable solution. Would you have new ideas?

Hmm the obvious thing of stat("/proc/$PID/ns/pid") doesn't work (it returns the current timestamp in all the timestamp fields). Once I get out of my current stack of yaks to shave, I'll see if the kernel value is exposed anywhere in userspace (I'm assuming the kernel does track the pidns start time, otherwise we've already lost).

@incertum
Copy link
Contributor

incertum commented Feb 6, 2023

Hmm the obvious thing of stat("/proc/$PID/ns/pid") doesn't work (it returns the current timestamp in all the timestamp fields). Once I get out of my current stack of yaks to shave, I'll see if the kernel value is exposed anywhere in userspace (I'm assuming the kernel does track the pidns start time, otherwise we've already lost).

Exactly ... so in the kernel instrumentation it is the time stamp of the init task in the pid namespace https://github.com/falcosecurity/libs/blob/master/driver/bpf/fillers.h#L2649, and doing cross lookups during proc scan seems not good ...

@gnosek
Copy link
Contributor Author

gnosek commented Feb 6, 2023

@incertum well do I have news for you... :D

  • the task's start time is (apparently) accessible as simply the timestamp (any timestamp) on its /proc/
  • for every task, its root filesystem is accessible via /proc/<pid>/root
  • the first task in every pidns has pid==1 (as seen from the pidns)

Putting these together, it seems it's enough to stat("/proc/<pid>/root/proc/1") and pick whichever you want out of [acm]time.

One downside is that we then depend on the thread having the right /proc mounted. It should always be true for actual container runtimes but not necessarily for handcrafted namespaces. Does that hurt your use case much? We could try to handle it (e.g. stat /proc vs /proc/<pid>/root/proc and readlink /proc/1/ns/pid vs /proc/<pid>/ns/pid) but there's always a way to set up the namespaces and /proc mounts in a way that will confuse everything except a (very heavy) scan of parent threads, verifying namespaces as we go up. And even then it's going to be racy.

Still, at least we could claim it's the creation time of the pidns that the thread can see in its /proc, which I suppose is something ;)

gnosek added a commit to gnosek/falco-libs that referenced this pull request Feb 6, 2023
We can significantly simplify the logic of checking the pidns start time
(during the initial /proc scan), based on the following observations:
* the task's start time is (apparently) accessible as simply the timestamp (any timestamp) on its /proc/
* for every task, its root filesystem is accessible via /proc/<pid>/root
* the first task in every pidns has pid==1 (as seen from the pidns)

Putting these together, it seems it's enough to stat("/proc/<pid>/root/proc/1")
and pick whichever you want out of [acm]time.

Ref: falcosecurity#860 (comment)

Signed-off-by: Grzegorz Nosek <[email protected]>
@gnosek
Copy link
Contributor Author

gnosek commented Feb 6, 2023

@incertum, can you give it a spin and see if it works for you? 1575222 (it's https://github.com/gnosek/falco-libs/tree/fix-pidns-start-scan)

@incertum
Copy link
Contributor

incertum commented Feb 6, 2023

I like this a lot, it's one step up and better than the current implementation 🎉 . Primary use case is containers.
Hence would say we worry about corner cases down the road.

Will pull and test it later tonight, ty! Initial intuition is it looks great!

@incertum
Copy link
Contributor

incertum commented Feb 7, 2023

This is really great and it works @gnosek, just tested it.

Feel free to include here and I re-approve or I'll approve the follow up PR!

@gnosek
Copy link
Contributor Author

gnosek commented Feb 7, 2023

/hold

bad push

tid != vtid is not necessary (though sufficient) to check if
a thread is in a child pid namespace. This leads to pidns_start_ts
being wrong occasionally (when a thread happens to have tid == vtid
by chance, even if it's in a child pidns).

Signed-off-by: Grzegorz Nosek <[email protected]>
We can significantly simplify the logic of checking the pidns start time
(during the initial /proc scan), based on the following observations:
* the task's start time is (apparently) accessible as simply the timestamp (any timestamp) on its /proc/
* for every task, its root filesystem is accessible via /proc/<pid>/root
* the first task in every pidns has pid==1 (as seen from the pidns)

Putting these together, it seems it's enough to stat("/proc/<pid>/root/proc/1")
and pick whichever you want out of [acm]time.

Ref: falcosecurity#860 (comment)

Signed-off-by: Grzegorz Nosek <[email protected]>
@gnosek
Copy link
Contributor Author

gnosek commented Feb 7, 2023

/unhold

Copy link
Contributor

@incertum incertum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

if(scap_proc_fill_pidns_start_ts(handle->m_lasterr, tinfo, dir_name) == SCAP_FAILURE)
{
// ignore errors
// the thread may not have /proc visible so we shouldn't kill the scan if this fails
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ty!

@poiana poiana added the lgtm label Feb 7, 2023
@poiana
Copy link
Contributor

poiana commented Feb 7, 2023

LGTM label has been added.

Git tree hash: 284201999315c5563c5811977457d15743bd9c8b

Copy link
Contributor

@FedeDP FedeDP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@poiana
Copy link
Contributor

poiana commented Feb 7, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: FedeDP, gnosek, incertum

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [FedeDP,gnosek,incertum]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@poiana poiana merged commit b18eef4 into falcosecurity:master Feb 7, 2023
@gnosek gnosek deleted the fix-tid-vtid branch February 7, 2023 08:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants