-
Notifications
You must be signed in to change notification settings - Fork 594
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix parsing of apk databases with large entries #1365
Conversation
Signed-off-by: Dan Luhring <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small nit on reading entryParsingInProgress
Thank you for the updated parsing logic for apply
- it's pretty easy to follow how the metadata is built over time and then reset for a new entry.
Thanks also for making sure everything that could be nil is guarded with empty values.
I think I'm happy to merge this and then maybe follow up by lifting the large
case to the integration level. If you have an image in mind that this is currently failing against that we should add a regression case against that's probably the ideal image.
if line == "" { | ||
// i.e. apk entry separator | ||
|
||
if entryParsingInProgress { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: I might be reading the flow wrong here:
"If entry parsing In Progress (true)" then the entry is complete and we append?
Does the opposite sound better?
"If entry parsing is no longer in progress (false)" then currentEntry is complete and we append.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"If entry parsing In Progress (true)" then the entry is complete and we append?
Yes (almost) — "If entry parsing has been in progress AND we've now encountered an empty line, then call the entry complete NOW".
We don't know that parsing has completed for a given package entry until we enter the conditional branch on line 57, where the conditions on both line 53 and line 56 are evaluated to true
.
The idea here is that each successive line of the file has a field that's applied to the same package entry unless the entry has been completely parsed, i.e. an empty line has been found. And only in that case would we start applying data to a new package entry (if there's more package data left in the file).
The if
check on line 56 there is needed because if we encounter an empty line, but we hadn't been working on an entry, we shouldn't add anything to the package collection.
Very unopinionated on the variable name, so if you have something that makes more sense here, I can go ahead and update it!
// We want sane defaults for collections, i.e. an empty array instead of null. | ||
pkgFields["D"] = []string{} | ||
pkgFields["p"] = []string{} | ||
files := make([]pkg.ApkFileRecord, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are now defaulted in parseListValue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah good catch, there's a bit of overlap here. Maybe I should adjust parseListValue
to return nil
for an empty list, and leave the nil-to-empty-slice conversion in nilFieldsToEmptySlice
, since only the latter is handling an empty files
list currently... 🤔
Also, question (that maybe I knew the answer to at one point?):
Why is it necessary to use empty slices (e.g. []string{}
instead of nil
) here in the parsing logic? I know sometimes we've wanted to ensure JSON output uses []
instead of null
, but that's more of a presentation concern, and I'm not remembering other reasons why this is important...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh sorry - I added this comment for myself to track where this had been moved to keep track of the diff.
I have no opinion either way on the overlap nilFieldsToEmptySlice
and parseListValue
and defer to you where the default of []string{}
should live.
The presentation concern is valid, but I was not around when this was initially written. Another concern could be guarding against any kind of exceptions that may occur downstream for any functions that access these fields after assembly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adjusted for the overlap in f9cf2fc.
The "why []string{}
?" question can be a lingering curiosity of mine 😄 — not needed for this PR in particular
|
||
// save updated file | ||
ctx.files[i] = latest | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to add a default case for apply
here to indicate if f.name is unknown or is that overkill?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can! Is an extra/unknown field something we'd want to warn users about? (And if so, why? Just curious)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might have more context on the why - How often if at all do new field names get added or removed here?
If it's super stable and no additional information will be added in the future then it's probably not worth the default case.
The why from my side is the proactive => "Hey! This new type of entry has been added and syft doesn't handle it" kind of message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gotcha, yeah.
AFAIK the format of the installed DB is very stable. I believe most of the current efforts on apk
itself are focused on the upcoming apkv3, which has an entirely new encoding of this data altogether. The single letter field thing (e.g. P:
, Z:
, etc.) is an apkv2 thing (not v3) — so while the apkv2 format itself is probably not going to change, eventually apkv2 will probably be phased out. And when apkv3 is eventually rolled out, Syft will need an entirely new parser in order to support it.
So based on your thoughts, I'd suggest not handling extra fields explicitly right now, since I don't think that's very likely to happen?
cc @anchore/tools for a second review since the parsing logic has changed in more instances than just parsing larger DB entries |
Thanks! ❤️
I put one in the linked issue (#1354), we could use that? |
Signed-off-by: Dan Luhring <[email protected]>
* main: (87 commits) feat: Add license parsing for java (anchore#1385) fix: cyclonedx component type for binaries (anchore#1406) fix: openjdk detection pattern (anchore#1415) bug: spdx checksum empty array; allow syft to generate SHA1 for spdx-tag-value documents (anchore#1404) Add NetBSD support. (anchore#1412) feat: add catalog delete (anchore#1377) docs: remove file classifier (anchore#1397) chore: update latest cyclonedx library (anchore#1390) feat: Add Java binary catalogers (anchore#1392) chore: Update SPDX license list to 3.19 (anchore#1389) fix: add manual vendor/product removal to fix false flags (anchore#1070) Update Stereoscope to c5ff155d72f166e2332e160a75c3ff2b8e9c7e2e (anchore#1395) chore: fix test busybox image sha (anchore#1393) fix: go version not properly identified in binary (anchore#1384) Update Stereoscope to 3b80d983223f6e6fc2d33b0ffa003d30268418e9 (anchore#1376) fix: Update node binary package name (anchore#1375) feat: Generic Binary Cataloger (anchore#1336) recover from bad parsing of golang binary (anchore#1371) Fix parsing of apk databases with large entries (anchore#1365) Update syft bootstrap tools to latest versions. (anchore#1369) ...
* main: (189 commits) feat: add h1digest when scanning go.mod (anchore#1405) feat: Add license parsing for java (anchore#1385) fix: cyclonedx component type for binaries (anchore#1406) fix: openjdk detection pattern (anchore#1415) bug: spdx checksum empty array; allow syft to generate SHA1 for spdx-tag-value documents (anchore#1404) Add NetBSD support. (anchore#1412) feat: add catalog delete (anchore#1377) docs: remove file classifier (anchore#1397) chore: update latest cyclonedx library (anchore#1390) feat: Add Java binary catalogers (anchore#1392) chore: Update SPDX license list to 3.19 (anchore#1389) fix: add manual vendor/product removal to fix false flags (anchore#1070) Update Stereoscope to c5ff155d72f166e2332e160a75c3ff2b8e9c7e2e (anchore#1395) chore: fix test busybox image sha (anchore#1393) fix: go version not properly identified in binary (anchore#1384) Update Stereoscope to 3b80d983223f6e6fc2d33b0ffa003d30268418e9 (anchore#1376) fix: Update node binary package name (anchore#1375) feat: Generic Binary Cataloger (anchore#1336) recover from bad parsing of golang binary (anchore#1371) Fix parsing of apk databases with large entries (anchore#1365) ...
This PR is a suggested solution to #1354. An alternative implementation could have been to increase the scanner buffer size even further, but the chosen implementation avoids the need to find a perfect buffer size since whole entries no longer need to fit within the buffer. Consequently, it also removes the custom configuration applied to the scanner.
This PR also updates existing tests that were testing
parseApkDBEntry
to testparseApkDB
instead, sinceparseApkDBEntry
no longer exists in the suggested implementation.Caveat: The test fixture used to demonstrate the failure and prove the fix is
syft/pkg/cataloger/apkdb/test-fixtures/very-large-entries
. This is the APK installed DB taken from the image where the failure was first noticed. This file is large (~4.6 MB)! Let me know if you think it makes sense to approach this test in a different way.Closes #1354