Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use concurrency for path scanning #405

Merged
merged 24 commits into from
Aug 16, 2024

Conversation

egibs
Copy link
Member

@egibs egibs commented Aug 15, 2024

As opposed to #402 which implemented concurrency at the scanPath level, this PR implements configurable concurrency at the path level, that is, all of the recursively discovered paths within a given scanPath.

By default, the concurrency is limited to runtime.NumCPU() and will automatically be set to 1 if the configured concurrency is less than 1. Similarly, the concurrency can not be set higher than the number of paths being scanned.

I converted map[any]any usages over to go-ordered-map to ensure that all of our dictionaries maintain their insertion order when iterated over.

Anecdotally, this change has provided a decent increase in speed depending on the concurrency value, number of files, and the size of each file.

Benchmarking with the default concurrency of runtime.NumCPU() takes ~50 seconds:

goos: darwin
goarch: arm64
pkg: github.com/chainguard-dev/bincapz/samples
cpu: Apple M3 Pro
BenchmarkRun-12    	       1	48120070292 ns/op	773853464 B/op	 4895873 allocs/op
PASS
ok  	github.com/chainguard-dev/bincapz/samples	49.446s

as opposed to ~80 seconds without concurrency (-j 1):

goos: darwin
goarch: arm64
pkg: github.com/chainguard-dev/bincapz/samples
cpu: Apple M3 Pro
BenchmarkRun-12    	       1	78739845000 ns/op	774942256 B/op	 4896912 allocs/op
PASS
ok  	github.com/chainguard-dev/bincapz/samples	80.366s

Anecdotally, scanning /usr/bin on my Framework Laptop takes ~60 seconds with concurrency and ~220 seconds without (-j 1).

@@ -80,7 +80,7 @@ func programKind(ctx context.Context, path string) string {
headerString := ""
n, err := io.ReadFull(f, header[:])
if err == nil || errors.Is(err, io.ErrUnexpectedEOF) {
kind, err := magic.Lookup(header[:n])
kind, err := magic.LookupSync(header[:n])
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt we were seeing much of a benefit from this.

LookupSync lookups up the file type based on the provided magic bytes without spawning any additional goroutines. You should provide at least the first 1024 bytes of the file in this slice. A magic.ErrUnknown will be returned if the file type is not known.

@egibs egibs force-pushed the performance-improvements branch from d7e0729 to 42d2cda Compare August 15, 2024 22:06
@@ -127,7 +128,7 @@ func processDest(ctx context.Context, c bincapz.Config, from, to map[string]*bin
for relPath, tr := range to {
fr, exists := from[relPath]
if !exists {
d.Added[relPath] = tr
d.Added.Set(relPath, tr)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the pattern for adding a KV pair to an orderedmap dictionary.

delete(d.Removed, rpath)
delete(d.Added, apath)
d.Modified.Set(apath, abs)
d.Modified.Delete(rpath)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the pattern for deleting a key from an orderedmap.

@@ -152,7 +159,7 @@ func recursiveScan(ctx context.Context, c bincapz.Config) (*bincapz.Report, erro
logger := clog.FromContext(ctx)
logger.Debug("recursive scan", slog.Any("config", c))
r := &bincapz.Report{
Files: map[string]*bincapz.FileReport{},
Files: orderedmap.New[string, *bincapz.FileReport](),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the pattern for initializing a new orderedmap.

@@ -0,0 +1,24 @@
// Copyright 2024 Chainguard, Inc.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are helper functions for inserting sorted keys into a map.

d.Modified[relPath].Behaviors = append(d.Modified[relPath].Behaviors, abs.Behaviors...)
rel, _ := d.Modified.Get(relPath)
rel.Behaviors = append(rel.Behaviors, abs.Behaviors...)
d.Modified.Set(relPath, rel)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the most difficult portion of migrating to orderedmap. I don't think a one-liner was doable but this is more readable.

Signed-off-by: egibs <[email protected]>
pkg/action/scan.go Outdated Show resolved Hide resolved
@egibs egibs force-pushed the performance-improvements branch from 3b10a15 to aa6bed2 Compare August 16, 2024 12:47
@egibs egibs force-pushed the performance-improvements branch from aa6bed2 to 8b05ed4 Compare August 16, 2024 12:48
@egibs egibs requested review from tstromberg and hectorj2f August 16, 2024 13:21
@egibs egibs marked this pull request as ready for review August 16, 2024 13:21
pkg/action/scan.go Outdated Show resolved Hide resolved
pkg/action/scan.go Outdated Show resolved Hide resolved
Copy link
Collaborator

@tstromberg tstromberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid improvement, I just have some readability comments regarding data structures.

pkg/action/scan.go Outdated Show resolved Hide resolved
pkg/action/scan.go Outdated Show resolved Hide resolved
fr, err := processFile(ctx, c, yrs, path, scanPath, trimPath, logger)
if err != nil {
mu.Lock()
result = append(result, findings{path: path, fr: &bincapz.FileReport{}})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would improve readability if a sync.Map was used to store the findings, so you wouldn't have to worry about manually managing the mutex. The slice ends up being converted to a map anyways later via scanPathFindings.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this out and got some weird results when running tests where the file reports were merged for two different tests. Managing the mutex(es) isn't ideal but the outcome is deterministic. I'll see if I was doing something wrong with the sync.Map, though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I was doing something wrong. I'll push the working code in a bit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in 81fca83 (#405).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up removals/fixes in c03d232 (#405) and c4888b3 (#405).

@egibs egibs requested a review from tstromberg August 16, 2024 17:35
@egibs egibs changed the title Use concurrency for path scanning; improve scan performance via ScanFileDescriptor Use concurrency for path scanning Aug 16, 2024
@tstromberg
Copy link
Collaborator

Love it. Thank you!

@egibs egibs merged commit 24c0a3b into chainguard-dev:main Aug 16, 2024
6 checks passed
@egibs egibs deleted the performance-improvements branch August 16, 2024 18:24
egibs added a commit to egibs/malcontent that referenced this pull request Sep 25, 2024
* Use file descriptors

Signed-off-by: egibs <[email protected]>

* Relocate file descriptor code; implement working concurrency for discovered paths

Signed-off-by: egibs <[email protected]>

* Move to ordered map; relocate File rendering

Signed-off-by: egibs <[email protected]>

* Limit concurrency to runtime.NumCPU()

Signed-off-by: egibs <[email protected]>

* Rename map variables

Signed-off-by: egibs <[email protected]>

* Directly initialize diff maps

Signed-off-by: egibs <[email protected]>

* Appease the linter

Signed-off-by: egibs <[email protected]>

* Configurable concurrency

Signed-off-by: egibs <[email protected]>

* Update samples_test.go

Signed-off-by: egibs <[email protected]>

* More consolidation

Signed-off-by: egibs <[email protected]>

* Add exists check

Signed-off-by: egibs <[email protected]>

* Use -j instead of -c for concurrency

Signed-off-by: egibs <[email protected]>

* Use an errgroup instead of a waitgroup + semaphore

Signed-off-by: egibs <[email protected]>

* Use SetLimit; add concurrency config to all tests; linting fixes

Signed-off-by: egibs <[email protected]>

* Move sorting code into scan.go; move rendering inside of map construction loop

Signed-off-by: egibs <[email protected]>

* Use 8-core runner for tests

Signed-off-by: egibs <[email protected]>

* Go back to ubuntu-latest

Signed-off-by: egibs <[email protected]>

* Run core tests in parallel

Signed-off-by: egibs <[email protected]>

* Better alias naming, consolidate structs

Signed-off-by: egibs <[email protected]>

* Run make fix

Signed-off-by: egibs <[email protected]>

* Replace mutexes with sync.Map

Signed-off-by: egibs <[email protected]>

* Remove struct/type

Signed-off-by: egibs <[email protected]>

* Appease the linter

Signed-off-by: egibs <[email protected]>

---------

Signed-off-by: egibs <[email protected]>
Signed-off-by: Evan Gibler <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants