Add support for indexing root filesystem #442

Merged
merged 8 commits into from
Jun 29, 2021

wagoodman
Contributor

Before this PR, indexing the root filesystem would typically never complete. This was mainly because sensitive system dirs such as /proc, /sys, and /dev were not ignored.

Changes:

  • Directory resolver now indexes a filesystem first (using stereoscope's existing FileTree object)
  • The filesystem index filters sensitive paths (/proc, /sys, /dev)
  • Add ETUI support to show indexing progress

When an unreadable path is found (permission denied, io error, etc), it is tracked and ignored so that indexing may continue.
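
As a rough illustration of the behavior described above, here is a minimal sketch using only the Go standard library. It is not the code from this PR: the names `unixSystemRuntimePrefixes`, `isSystemRuntimePath`, and `indexPaths` are hypothetical, the root is assumed to be an absolute Unix path, and a plain slice stands in for the stereoscope FileTree.

```go
package main

import (
	"fmt"
	"io/fs"
	"path/filepath"
	"strings"
)

// unixSystemRuntimePrefixes holds the paths named in the PR description.
// The variable name is an assumption made for this sketch.
var unixSystemRuntimePrefixes = []string{"/proc", "/sys", "/dev"}

// isSystemRuntimePath reports whether path is one of the prefixes or lives beneath one.
func isSystemRuntimePath(path string) bool {
	for _, prefix := range unixSystemRuntimePrefixes {
		if path == prefix || strings.HasPrefix(path, prefix+"/") {
			return true
		}
	}
	return false
}

// indexPaths walks root, skips system runtime paths, and records (rather than
// fails on) any path that cannot be read so that the walk can continue.
func indexPaths(root string) ([]string, map[string]error) {
	var indexed []string
	errPaths := make(map[string]error)

	_ = filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if isSystemRuntimePath(path) {
			// Only return SkipDir for directories; for a file it would skip
			// the remaining entries of the parent directory.
			if d != nil && d.IsDir() {
				return fs.SkipDir
			}
			return nil
		}
		if err != nil {
			// Permission denied, IO error, etc: track the path and keep going.
			errPaths[path] = err
			return nil
		}
		indexed = append(indexed, path)
		return nil
	})
	return indexed, errPaths
}

func main() {
	root, err := filepath.Abs(".")
	if err != nil {
		fmt.Println("unable to resolve working directory:", err)
		return
	}
	indexed, errPaths := indexPaths(root)
	fmt.Printf("indexed %d paths, %d unreadable\n", len(indexed), len(errPaths))
}
```

The prefix check appends a trailing slash before comparing so that a path such as /process is not mistaken for something under /proc.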

An added advantage of using the existing stereoscope FileTree object is that globbing idioms are now the same between the directory resolver and any image resolver. Before this PR there were subtle differences (e.g. using * vs ** would have different recursive search strategies, which could get confusing).

Future changes:

Depending on the size of the filesystem, this may still take a while. The only way to improve this process further is to remove globbing and not allow for basename searches (full paths required). Another alternative to speed this up without removing these features would be to try and leverage already-created filesystem indexes, but these indexes might be stale or untrusted.

Closes #283

@wagoodman wagoodman self-assigned this Jun 19, 2021

github-actions bot commented Jun 19, 2021

Benchmark Test Results

Benchmark results from the latest changes vs base branch
name                                                   old time/op    new time/op    delta
ImagePackageCatalogers/ruby-gemspec-cataloger-2           795µs ± 5%     928µs ± 2%  +16.70%  (p=0.008 n=5+5)
ImagePackageCatalogers/python-package-cataloger-2        1.10ms ± 7%    1.26ms ± 2%  +15.07%  (p=0.008 n=5+5)
ImagePackageCatalogers/javascript-package-cataloger-2     415µs ± 2%     479µs ± 1%  +15.25%  (p=0.008 n=5+5)
ImagePackageCatalogers/dpkgdb-cataloger-2                 387µs ± 3%     460µs ± 1%  +18.98%  (p=0.008 n=5+5)
ImagePackageCatalogers/rpmdb-cataloger-2                  416µs ± 5%     488µs ± 1%  +17.39%  (p=0.008 n=5+5)
ImagePackageCatalogers/java-cataloger-2                  5.51ms ± 7%    6.36ms ± 5%  +15.48%  (p=0.008 n=5+5)
ImagePackageCatalogers/apkdb-cataloger-2                  584µs ± 5%     698µs ± 2%  +19.45%  (p=0.008 n=5+5)
ImagePackageCatalogers/go-cataloger-2                     218µs ± 5%     262µs ± 3%  +20.57%  (p=0.008 n=5+5)
ImagePackageCatalogers/rust-cataloger-2                   318µs ± 3%     395µs ± 1%  +24.25%  (p=0.008 n=5+5)

name                                                   old alloc/op   new alloc/op   delta
ImagePackageCatalogers/ruby-gemspec-cataloger-2          97.4kB ± 0%    97.5kB ± 0%     ~     (p=0.222 n=5+5)
ImagePackageCatalogers/python-package-cataloger-2         579kB ± 0%     579kB ± 0%     ~     (p=1.000 n=5+5)
ImagePackageCatalogers/javascript-package-cataloger-2     112kB ± 0%     112kB ± 0%     ~     (p=0.056 n=5+5)
ImagePackageCatalogers/dpkgdb-cataloger-2                 115kB ± 0%     116kB ± 0%     ~     (p=0.222 n=5+5)
ImagePackageCatalogers/rpmdb-cataloger-2                  134kB ± 0%     134kB ± 0%   +0.01%  (p=0.008 n=5+5)
ImagePackageCatalogers/java-cataloger-2                  1.79MB ± 0%    1.79MB ± 0%     ~     (p=0.151 n=5+5)
ImagePackageCatalogers/apkdb-cataloger-2                 1.14MB ± 0%    1.14MB ± 0%     ~     (p=0.421 n=5+5)
ImagePackageCatalogers/go-cataloger-2                    53.9kB ± 0%    53.9kB ± 0%     ~     (p=0.421 n=5+5)
ImagePackageCatalogers/rust-cataloger-2                  88.9kB ± 0%    88.9kB ± 0%   +0.01%  (p=0.029 n=4+4)

name                                                   old allocs/op  new allocs/op  delta
ImagePackageCatalogers/ruby-gemspec-cataloger-2           1.96k ± 0%     1.96k ± 0%     ~     (all equal)
ImagePackageCatalogers/python-package-cataloger-2         5.89k ± 0%     5.89k ± 0%     ~     (all equal)
ImagePackageCatalogers/javascript-package-cataloger-2     1.93k ± 0%     1.93k ± 0%     ~     (all equal)
ImagePackageCatalogers/dpkgdb-cataloger-2                 2.37k ± 0%     2.37k ± 0%     ~     (all equal)
ImagePackageCatalogers/rpmdb-cataloger-2                  3.19k ± 0%     3.19k ± 0%     ~     (all equal)
ImagePackageCatalogers/java-cataloger-2                   22.3k ± 0%     22.3k ± 0%     ~     (all equal)
ImagePackageCatalogers/apkdb-cataloger-2                  1.85k ± 0%     1.85k ± 0%     ~     (all equal)
ImagePackageCatalogers/go-cataloger-2                     1.44k ± 0%     1.44k ± 0%     ~     (all equal)
ImagePackageCatalogers/rust-cataloger-2                   2.75k ± 0%     2.75k ± 0%     ~     (all equal)

Contributor

@dakaneye dakaneye left a comment

A few nits, some questions, and some larger questions here too, but great work on the improvements in this PR!

A few other testing ideas I had:

  • Should we have an integration test(s) for file system indexing / traversal?
  • Is the addPathToIndex method missing a unit test? I didn't see any test cases for symlinks, but I'm not sure if they exist already or were implicit in the other tests you added?

internal/string_helpers_test.go
syft/event/parsers/parsers.go
)

var systemRuntimePrefixes = []string{
Contributor

nit: unixSystemRuntimePrefixes

Contributor

also will we ever want this to be configurable?

Contributor Author

@wagoodman wagoodman Jun 25, 2021

I suspect one day we may; for now I've opted out of doing so.

syft/source/directory_resolver.go (outdated)
// why account for multiple roots? To cover cases when there is a symlink that references above the root path,
// in which case we need to additionally index where the link resolves to. it's for this reason why the filetree
// must be relative to the root of the filesystem (and not just relative to the given path).
roots := []string{root}
Contributor

this strikes me as something that could be its own method, with a unit test

Contributor

it's also not really clear to me what this is doing since the roots attribute isn't used after this loop. Is it just an error handling mechanism?

Contributor Author

roots is a local variable used as a scratch list of paths that should be indexed. When indexing a path, some symlink resolution may indicate that other paths outside of the current path need to be indexed as well... these paths get added to the list of paths to be indexed. One path is indexed at a time, and is popped off of the front of the list.

The roots variable isn't needed outside of this function since its only purpose is to track things that should be indexed, and by the time we leave this function that slice should always be empty (unless there is an error).

I'll move this to a separate function, add tests for it, and rename a few of the variables to improve readability here.
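
The scratch-list idea described in this comment can be sketched roughly as follows. This is an illustrative outline only, not the function that ended up in the branch: `indexAll` and `indexOneFunc` are hypothetical names, and the `seen` set (which guards against two links pointing at each other) is an addition made for this sketch that the comment above does not mention.

```go
package main

import "fmt"

// indexOneFunc indexes a single root and returns any additional roots that were
// discovered while doing so (for example, symlink targets that resolve outside
// of the root just indexed). Hypothetical type for this sketch.
type indexOneFunc func(root string) (discovered []string, err error)

// indexAll drains a queue of roots, popping one at a time off the front and
// appending any newly discovered roots to the back.
func indexAll(start string, indexOne indexOneFunc) ([]string, error) {
	pending := []string{start}
	seen := map[string]bool{start: true}
	var order []string

	for len(pending) > 0 {
		current := pending[0]
		pending = pending[1:]

		discovered, err := indexOne(current)
		if err != nil {
			return order, fmt.Errorf("unable to index root=%q: %w", current, err)
		}
		order = append(order, current)

		for _, next := range discovered {
			if !seen[next] {
				seen[next] = true
				pending = append(pending, next)
			}
		}
	}
	// On success the queue is empty, so nothing needs to outlive this function.
	return order, nil
}

func main() {
	// Hypothetical layout: /app holds a symlink resolving to /opt/shared,
	// which in turn links back to /app.
	links := map[string][]string{
		"/app":        {"/opt/shared"},
		"/opt/shared": {"/app"},
	}
	order, err := indexAll("/app", func(root string) ([]string, error) {
		return links[root], nil
	})
	fmt.Println(order, err)
}
```

Passing the per-root indexing step in as a function keeps the queue logic testable without touching a real filesystem.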

syft/source/directory_resolver.go (outdated)
}

// permission denied, IO error, etc... we keep track of the paths we can't see, but continue with indexing
if err != nil {
Contributor

should we be filtering the type of error here? seems pretty broad at present, which makes me question if there are any errors that should stop execution?

Contributor Author

I admit it is broad. I couldn't think of an error that I would get back for a single path that would indicate there is a larger problem that should stop execution. The closest thing I can imagine is an IO error; however, that doesn't necessarily mean there is a larger overall problem that should halt execution.

The largest risk is there being an over-arching problem with accessing the storage device and there is nothing indexed, and no error returned. There would be at least one warning (if not several) that would indicate something is wrong.

What I can do for now is restrict this to only permission denied errors, and we can adjust back if we find other cases later.
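
One way to restrict the handling to permission errors is to classify with `errors.Is`, roughly as below. This is a sketch under assumptions rather than the code in the branch: `accessErrTracker`, `newAccessErrTracker`, `permissionDenied`, and `errExampleIO` are hypothetical names introduced only for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// accessErrTracker remembers paths that could not be read.
type accessErrTracker struct {
	errPaths map[string]error
}

func newAccessErrTracker() *accessErrTracker {
	return &accessErrTracker{errPaths: make(map[string]error)}
}

// handle swallows permission errors (recording the path) and returns every
// other error to the caller so that it can stop indexing.
func (t *accessErrTracker) handle(path string, err error) error {
	if err == nil {
		return nil
	}
	if errors.Is(err, os.ErrPermission) {
		t.errPaths[path] = err
		return nil
	}
	return fmt.Errorf("unable to access path=%q: %w", path, err)
}

// permissionDenied builds the kind of wrapped error the os package returns.
func permissionDenied(path string) error {
	return &fs.PathError{Op: "open", Path: path, Err: fs.ErrPermission}
}

// errExampleIO stands in for an error that is not a permission problem.
var errExampleIO = errors.New("input/output error")

func main() {
	tracker := newAccessErrTracker()
	fmt.Println(tracker.handle("/root/secret", permissionDenied("/root/secret")))
	fmt.Println(tracker.handle("/data/blob", errExampleIO))
	fmt.Println(len(tracker.errPaths))
}
```

Because `errors.Is` unwraps `*fs.PathError`, the check matches errors returned by calls such as `os.Open` without inspecting platform-specific error codes.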

syft/source/directory_resolver.go (outdated)
syft/source/directory_resolver.go
syft/source/file_type.go (outdated)
@wagoodman
Contributor Author

wagoodman commented Jun 25, 2021

@dakaneye re: testing suggestions:

Should we have an integration test(s) for file system indexing / traversal?

There are already integration-level tests that flex directory scanning and check for scanning correctness. An integration test shouldn't be asserting how a task was performed (in this case asserting specific indexing conditions), only that the result is shallowly correct and asserts that components are wired up correctly. I was tempted to add an integration test that would flex skipping system paths (as asserted in unit tests) but decided against it (for the above reasons). But, I think this would at least be good to capture as a regression test (assert dir scanning of root doesn't take that long). I think a CLI test here makes the most sense, but the test harness will be much different than other tests since it would need to be done within a container so I'll need a bit to add this in.
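
The "doesn't take that long" assertion could be built on a small timeout helper such as the one below. This is only a sketch of the general pattern: `finishesWithin` is a hypothetical name, and the actual CLI test harness (which runs inside a container) is not shown here.

```go
package main

import (
	"fmt"
	"time"
)

// finishesWithin runs task in a goroutine and reports whether it completed
// before the limit elapsed. If the limit is hit, the goroutine is left running;
// a real harness would also need a way to cancel the work.
func finishesWithin(limit time.Duration, task func()) bool {
	done := make(chan struct{})
	go func() {
		defer close(done)
		task()
	}()

	select {
	case <-done:
		return true
	case <-time.After(limit):
		return false
	}
}

func main() {
	quick := finishesWithin(time.Second, func() {})
	slow := finishesWithin(10*time.Millisecond, func() { time.Sleep(200 * time.Millisecond) })
	fmt.Println(quick, slow)
}
```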

Is the addPathToIndex method missing a unit test? I didn't see any test cases for symlinks but not sure if they exist already or were implicit in the other tests you added?

Good find. I'll add a test against newDirectoryResolver that asserts the effects expected from addPathToIndex. All of the other tests check the behavior of the index via FilesByGlob and FilesByPath, so correctness is already covered, but it couldn't hurt to assert some expected behavior of the index directly.

@wagoodman wagoodman requested a review from dakaneye June 25, 2021 20:15
@bureado
Contributor

bureado commented Jun 28, 2021

Good stuff! Right now, at least for me, this seems to fail on encountering a dangling symlink. Going through the source to see if I can identify a workaround:

failed to catalog input: unable to determine resolver while cataloging packages: unable to index filesystem path="/usr/lib/debug/lib/udev/v4l_id-239-34.cm1.x86_64.debug": unable to access path="/usr/lib/debug/lib/udev/v4l_id-239-34.cm1.x86_64.debug": lstat /usr/lib/debug/lib/udev/v4l_id-239-34.cm1.x86_64.debug: no such file or directory

@bureado
Contributor

bureado commented Jun 28, 2021

Here's something quick that worked for me:

diff --git a/syft/source/directory_resolver.go b/syft/source/directory_resolver.go
index edd7834..0971d88 100644
--- a/syft/source/directory_resolver.go
+++ b/syft/source/directory_resolver.go
@@ -93,7 +93,12 @@ func (r *directoryResolver) indexTree(root string) ([]string, error) {
 				return nil
 			}
 
+			if info == nil {
+				return nil
+			}
+
 			newRoot, err := r.addPathToIndex(path, info)
+
 			if err = r.handleFileAccessErr(path, err); err != nil {
 				return fmt.Errorf("unable to index path: %w", err)
 			}
@@ -107,7 +112,7 @@ func (r *directoryResolver) indexTree(root string) ([]string, error) {
 }
 
 func (r *directoryResolver) handleFileAccessErr(path string, err error) error {
-	if errors.Is(err, os.ErrPermission) {
+	if errors.Is(err, os.ErrPermission) || errors.Is(err, os.ErrNotExist) {
 		// don't allow for permission errors to stop indexing, keep track of the paths and continue.
 		log.Warnf("unable to access path=%q: %+v", path, err)
 		r.errPaths[path] = err
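
For context on the nil check in the patch above: `filepath.Walk` invokes its callback with a nil `info` and a non-nil error when the `lstat` of a path fails. The standalone sketch below shows that case being tolerated; `walkTolerant` is a hypothetical function written for illustration and is not part of syft.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// walkTolerant walks root, recording paths that are missing or unreadable
// instead of aborting, and returns any other error to the caller.
func walkTolerant(root string) ([]string, map[string]error, error) {
	var visited []string
	errPaths := make(map[string]error)

	err := filepath.Walk(root, func(path string, info os.FileInfo, walkErr error) error {
		if walkErr != nil {
			if errors.Is(walkErr, os.ErrPermission) || errors.Is(walkErr, os.ErrNotExist) {
				// Track the path and continue with the rest of the walk.
				errPaths[path] = walkErr
				return nil
			}
			return fmt.Errorf("unable to index path=%q: %w", path, walkErr)
		}
		if info == nil {
			// Defensive guard mirroring the patch: never dereference a nil info.
			return nil
		}
		visited = append(visited, path)
		return nil
	})
	return visited, errPaths, err
}

func main() {
	visited, errPaths, err := walkTolerant(filepath.Join(os.TempDir(), "walk-tolerant-missing-example"))
	fmt.Println(len(visited), len(errPaths), err)
}
```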

@wagoodman
Contributor Author

@bureado great find! And much appreciated for hunting down a patch too! I'll incorporate in the branch shortly 🙌

@wagoodman wagoodman force-pushed the dir-performance branch 2 times, most recently from ab01847 to 415ae45 Compare June 29, 2021 16:41
Contributor

@dakaneye dakaneye left a comment

looks good to me! Great work @wagoodman

GijsCalis pushed a commit to GijsCalis/syft that referenced this pull request Feb 19, 2024
* change directory resolver to ignore system runtime paths + drive by index

Signed-off-by: Alex Goodman <[email protected]>

* add event/etui support for filesystem indexing (for dir resolver)

Signed-off-by: Alex Goodman <[email protected]>

* add warnings for path indexing problems

Signed-off-by: Alex Goodman <[email protected]>

* add directory resolver index tests

Signed-off-by: Alex Goodman <[email protected]>

* improve testing around directory resolver

Signed-off-by: Alex Goodman <[email protected]>

* renamed p var to path when not conflicting with import

Signed-off-by: Alex Goodman <[email protected]>

* pull docker image in CLI dir scan timeout test

Signed-off-by: Alex Goodman <[email protected]>

* ensure file not exist errors do not stop directory resolver indexing

Signed-off-by: Alex Goodman <[email protected]>
Development

Successfully merging this pull request may close these issues.

Scanning a full FS inside a container is running forever
4 participants