Fix parsing of apk databases with large entries #1365

luhring · 2022-11-27T14:52:22Z

This PR is a suggested solution to #1354. An alternative implementation could have been to increase the scanner buffer size even further, but the chosen implementation avoids the need to find a perfect buffer size since whole entries no longer need to fit within the buffer. Consequently, it also removes the custom configuration applied to the scanner.

This PR also updates existing tests that were testing parseApkDBEntry to test parseApkDB instead, since parseApkDBEntry no longer exists in the suggested implementation.

Caveat: The test fixture used to demonstrate the failure and prove the fix is syft/pkg/cataloger/apkdb/test-fixtures/very-large-entries. This is the APK installed DB taken from the image where the failure was first noticed. This file is large (~4.6 MB)! Let me know if you think it makes sense to approach this test in a different way.

Closes #1354

Signed-off-by: Dan Luhring <[email protected]>

spiffcs

small nit on reading entryParsingInProgress

Thank you for the updated parsing logic for apply - it's pretty easy to follow how the metadata is built over time and then reset for a new entry.

Thanks also for making sure everything that could be nil is guarded with empty values.

I think I'm happy to merge this and then maybe follow up by lifting the large case to the integration level. If you have an image in mind that this is currently failing against that we should add a regression case against that's probably the ideal image.

spiffcs · 2022-11-28T15:55:35Z

syft/pkg/cataloger/apkdb/parse_apk_db.go

+		if line == "" {
+			// i.e. apk entry separator
+
+			if entryParsingInProgress {


Question: I might be reading the flow wrong here:

"If entry parsing In Progress (true)" then the entry is complete and we append?

Does the opposite sound better?

"If entry parsing is no longer in progress (false)" then currentEntry is complete and we append.

"If entry parsing In Progress (true)" then the entry is complete and we append?

Yes (almost) — "If entry parsing has been in progress AND we've now encountered an empty line, then call the entry complete NOW".

We don't know that parsing has completed for a given package entry until we enter the conditional branch on line 57, where the conditions on both line 53 and line 56 are evaluated to true.

The idea here is that each successive line of the file has a field that's applied to the same package entry unless the entry has been completely parsed, i.e. an empty line has been found. And only in that case would we start applying data to a new package entry (if there's more package data left in the file).

The if check on line 56 there is needed because if we encounter an empty line, but we hadn't been working on an entry, we shouldn't add anything to the package collection.

Very unopinionated on the variable name, so if you have something that makes more sense here, I can go ahead and update it!

spiffcs · 2022-11-28T16:26:07Z

syft/pkg/cataloger/apkdb/parse_apk_db.go

-	// We want sane defaults for collections, i.e. an empty array instead of null.
-	pkgFields["D"] = []string{}
-	pkgFields["p"] = []string{}
-	files := make([]pkg.ApkFileRecord, 0)


These are now defaulted in parseListValue

Yeah good catch, there's a bit of overlap here. Maybe I should adjust parseListValue to return nil for an empty list, and leave the nil-to-empty-slice conversion in nilFieldsToEmptySlice, since only the latter is handling an empty files list currently... 🤔

Also, question (that maybe I knew the answer to at one point?):

Why is it necessary to use empty slices (e.g. []string{} instead of nil) here in the parsing logic? I know sometimes we've wanted to ensure JSON output uses [] instead of null, but that's more of a presentation concern, and I'm not remembering other reasons why this is important...

Ahh sorry - I added this comment for myself to track where this had been moved to keep track of the diff.

I have no opinion either way on the overlap nilFieldsToEmptySlice and parseListValue and defer to you where the default of []string{} should live.

The presentation concern is valid, but I was not around when this was initially written. Another concern could be guarding against any kind of exceptions that may occur downstream for any functions that access these fields after assembly.

Adjusted for the overlap in f9cf2fc.

The "why []string{}?" question can be a lingering curiosity of mine 😄 — not needed for this PR in particular

spiffcs · 2022-11-28T16:31:35Z

syft/pkg/cataloger/apkdb/parse_apk_db.go

+
+		// save updated file
+		ctx.files[i] = latest
+	}


Do we want to add a default case for apply here to indicate if f.name is unknown or is that overkill?

We can! Is an extra/unknown field something we'd want to warn users about? (And if so, why? Just curious)

You might have more context on the why - How often if at all do new field names get added or removed here?

If it's super stable and no additional information will be added in the future then it's probably not worth the default case.

The why from my side is the proactive => "Hey! This new type of entry has been added and syft doesn't handle it" kind of message.

I gotcha, yeah.

AFAIK the format of the installed DB is very stable. I believe most of the current efforts on apk itself are focused on the upcoming apkv3, which has an entirely new encoding of this data altogether. The single letter field thing (e.g. P:, Z:, etc.) is an apkv2 thing (not v3) — so while the apkv2 format itself is probably not going to change, eventually apkv2 will probably be phased out. And when apkv3 is eventually rolled out, Syft will need an entirely new parser in order to support it.

So based on your thoughts, I'd suggest not handling extra fields explicitly right now, since I don't think that's very likely to happen?

spiffcs · 2022-11-28T16:34:30Z

cc @anchore/tools for a second review since the parsing logic has changed in more instances than just parsing larger DB entries

luhring · 2022-11-28T17:03:53Z

Thank you for the updated parsing logic for apply - it's pretty easy to follow how the metadata is built over time and then reset for a new entry.

Thanks! ❤️

I think I'm happy to merge this and then maybe follow up by lifting the large case to the integration level. If you have an image in mind that this is currently failing against that we should add a regression case against that's probably the ideal image.

I put one in the linked issue (#1354), we could use that?

Signed-off-by: Dan Luhring <[email protected]>

* main: (87 commits) feat: Add license parsing for java (anchore#1385) fix: cyclonedx component type for binaries (anchore#1406) fix: openjdk detection pattern (anchore#1415) bug: spdx checksum empty array; allow syft to generate SHA1 for spdx-tag-value documents (anchore#1404) Add NetBSD support. (anchore#1412) feat: add catalog delete (anchore#1377) docs: remove file classifier (anchore#1397) chore: update latest cyclonedx library (anchore#1390) feat: Add Java binary catalogers (anchore#1392) chore: Update SPDX license list to 3.19 (anchore#1389) fix: add manual vendor/product removal to fix false flags (anchore#1070) Update Stereoscope to c5ff155d72f166e2332e160a75c3ff2b8e9c7e2e (anchore#1395) chore: fix test busybox image sha (anchore#1393) fix: go version not properly identified in binary (anchore#1384) Update Stereoscope to 3b80d983223f6e6fc2d33b0ffa003d30268418e9 (anchore#1376) fix: Update node binary package name (anchore#1375) feat: Generic Binary Cataloger (anchore#1336) recover from bad parsing of golang binary (anchore#1371) Fix parsing of apk databases with large entries (anchore#1365) Update syft bootstrap tools to latest versions. (anchore#1369) ...

* main: (189 commits) feat: add h1digest when scanning go.mod (anchore#1405) feat: Add license parsing for java (anchore#1385) fix: cyclonedx component type for binaries (anchore#1406) fix: openjdk detection pattern (anchore#1415) bug: spdx checksum empty array; allow syft to generate SHA1 for spdx-tag-value documents (anchore#1404) Add NetBSD support. (anchore#1412) feat: add catalog delete (anchore#1377) docs: remove file classifier (anchore#1397) chore: update latest cyclonedx library (anchore#1390) feat: Add Java binary catalogers (anchore#1392) chore: Update SPDX license list to 3.19 (anchore#1389) fix: add manual vendor/product removal to fix false flags (anchore#1070) Update Stereoscope to c5ff155d72f166e2332e160a75c3ff2b8e9c7e2e (anchore#1395) chore: fix test busybox image sha (anchore#1393) fix: go version not properly identified in binary (anchore#1384) Update Stereoscope to 3b80d983223f6e6fc2d33b0ffa003d30268418e9 (anchore#1376) fix: Update node binary package name (anchore#1375) feat: Generic Binary Cataloger (anchore#1336) recover from bad parsing of golang binary (anchore#1371) Fix parsing of apk databases with large entries (anchore#1365) ...

Closes anchore#1354

Fix parsing of apk databases with large entries

1d52843

Signed-off-by: Dan Luhring <[email protected]>

spiffcs reviewed Nov 28, 2022

View reviewed changes

Return nil slice per PR feedback

f9cf2fc

Signed-off-by: Dan Luhring <[email protected]>

spiffcs requested a review from a team November 28, 2022 17:58

spiffcs approved these changes Nov 29, 2022

View reviewed changes

spiffcs merged commit f6996f7 into anchore:main Nov 29, 2022

luhring deleted the 1354-parsing-apk-installed branch November 29, 2022 15:19

GijsCalis pushed a commit to GijsCalis/syft that referenced this pull request Feb 19, 2024

Fix parsing of apk databases with large entries (anchore#1365)

6fbf750

Closes anchore#1354

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix parsing of apk databases with large entries #1365

Fix parsing of apk databases with large entries #1365

luhring commented Nov 27, 2022

spiffcs left a comment

spiffcs Nov 28, 2022

luhring Nov 28, 2022

spiffcs Nov 28, 2022

luhring Nov 28, 2022

spiffcs Nov 28, 2022

luhring Nov 28, 2022

spiffcs Nov 28, 2022 •

edited

Loading

luhring Nov 28, 2022

spiffcs Nov 28, 2022

luhring Nov 28, 2022

spiffcs commented Nov 28, 2022

luhring commented Nov 28, 2022

Fix parsing of apk databases with large entries #1365

Fix parsing of apk databases with large entries #1365

Conversation

luhring commented Nov 27, 2022

spiffcs left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

spiffcs Nov 28, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

spiffcs commented Nov 28, 2022

luhring commented Nov 28, 2022

spiffcs Nov 28, 2022 •

edited

Loading