Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix parsing of apk databases with large entries #1365

Merged
merged 2 commits into from
Nov 29, 2022

Conversation

luhring
Copy link
Contributor

@luhring luhring commented Nov 27, 2022

This PR is a suggested solution to #1354. An alternative implementation could have been to increase the scanner buffer size even further, but the chosen implementation avoids the need to find a perfect buffer size since whole entries no longer need to fit within the buffer. Consequently, it also removes the custom configuration applied to the scanner.

This PR also updates existing tests that were testing parseApkDBEntry to test parseApkDB instead, since parseApkDBEntry no longer exists in the suggested implementation.

Caveat: The test fixture used to demonstrate the failure and prove the fix is syft/pkg/cataloger/apkdb/test-fixtures/very-large-entries. This is the APK installed DB taken from the image where the failure was first noticed. This file is large (~4.6 MB)! Let me know if you think it makes sense to approach this test in a different way.

Closes #1354

Copy link
Contributor

@spiffcs spiffcs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small nit on reading entryParsingInProgress

Thank you for the updated parsing logic for apply - it's pretty easy to follow how the metadata is built over time and then reset for a new entry.

Thanks also for making sure everything that could be nil is guarded with empty values.

I think I'm happy to merge this and then maybe follow up by lifting the large case to the integration level. If you have an image in mind that this is currently failing against that we should add a regression case against that's probably the ideal image.

if line == "" {
// i.e. apk entry separator

if entryParsingInProgress {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: I might be reading the flow wrong here:

"If entry parsing In Progress (true)" then the entry is complete and we append?

Does the opposite sound better?

"If entry parsing is no longer in progress (false)" then currentEntry is complete and we append.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"If entry parsing In Progress (true)" then the entry is complete and we append?

Yes (almost) — "If entry parsing has been in progress AND we've now encountered an empty line, then call the entry complete NOW".

We don't know that parsing has completed for a given package entry until we enter the conditional branch on line 57, where the conditions on both line 53 and line 56 are evaluated to true.

The idea here is that each successive line of the file has a field that's applied to the same package entry unless the entry has been completely parsed, i.e. an empty line has been found. And only in that case would we start applying data to a new package entry (if there's more package data left in the file).

The if check on line 56 there is needed because if we encounter an empty line, but we hadn't been working on an entry, we shouldn't add anything to the package collection.

Very unopinionated on the variable name, so if you have something that makes more sense here, I can go ahead and update it!

// We want sane defaults for collections, i.e. an empty array instead of null.
pkgFields["D"] = []string{}
pkgFields["p"] = []string{}
files := make([]pkg.ApkFileRecord, 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are now defaulted in parseListValue

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah good catch, there's a bit of overlap here. Maybe I should adjust parseListValue to return nil for an empty list, and leave the nil-to-empty-slice conversion in nilFieldsToEmptySlice, since only the latter is handling an empty files list currently... 🤔

Also, question (that maybe I knew the answer to at one point?):

Why is it necessary to use empty slices (e.g. []string{} instead of nil) here in the parsing logic? I know sometimes we've wanted to ensure JSON output uses [] instead of null, but that's more of a presentation concern, and I'm not remembering other reasons why this is important...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh sorry - I added this comment for myself to track where this had been moved to keep track of the diff.

I have no opinion either way on the overlap nilFieldsToEmptySlice and parseListValue and defer to you where the default of []string{} should live.

The presentation concern is valid, but I was not around when this was initially written. Another concern could be guarding against any kind of exceptions that may occur downstream for any functions that access these fields after assembly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adjusted for the overlap in f9cf2fc.

The "why []string{}?" question can be a lingering curiosity of mine 😄 — not needed for this PR in particular


// save updated file
ctx.files[i] = latest
}
Copy link
Contributor

@spiffcs spiffcs Nov 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to add a default case for apply here to indicate if f.name is unknown or is that overkill?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can! Is an extra/unknown field something we'd want to warn users about? (And if so, why? Just curious)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might have more context on the why - How often if at all do new field names get added or removed here?

If it's super stable and no additional information will be added in the future then it's probably not worth the default case.

The why from my side is the proactive => "Hey! This new type of entry has been added and syft doesn't handle it" kind of message.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gotcha, yeah.

AFAIK the format of the installed DB is very stable. I believe most of the current efforts on apk itself are focused on the upcoming apkv3, which has an entirely new encoding of this data altogether. The single letter field thing (e.g. P:, Z:, etc.) is an apkv2 thing (not v3) — so while the apkv2 format itself is probably not going to change, eventually apkv2 will probably be phased out. And when apkv3 is eventually rolled out, Syft will need an entirely new parser in order to support it.

So based on your thoughts, I'd suggest not handling extra fields explicitly right now, since I don't think that's very likely to happen?

@spiffcs
Copy link
Contributor

spiffcs commented Nov 28, 2022

cc @anchore/tools for a second review since the parsing logic has changed in more instances than just parsing larger DB entries

@luhring
Copy link
Contributor Author

luhring commented Nov 28, 2022

Thank you for the updated parsing logic for apply - it's pretty easy to follow how the metadata is built over time and then reset for a new entry.

Thanks! ❤️

I think I'm happy to merge this and then maybe follow up by lifting the large case to the integration level. If you have an image in mind that this is currently failing against that we should add a regression case against that's probably the ideal image.

I put one in the linked issue (#1354), we could use that?

@spiffcs spiffcs requested a review from a team November 28, 2022 17:58
@spiffcs spiffcs merged commit f6996f7 into anchore:main Nov 29, 2022
@luhring luhring deleted the 1354-parsing-apk-installed branch November 29, 2022 15:19
spiffcs added a commit to raboof/syft that referenced this pull request Dec 20, 2022
* main: (87 commits)
  feat: Add license parsing for java (anchore#1385)
  fix: cyclonedx component type for binaries (anchore#1406)
  fix: openjdk detection pattern (anchore#1415)
  bug: spdx checksum empty array; allow syft to generate SHA1 for spdx-tag-value documents (anchore#1404)
  Add NetBSD support. (anchore#1412)
  feat: add catalog delete (anchore#1377)
  docs: remove file classifier (anchore#1397)
  chore: update latest cyclonedx library (anchore#1390)
  feat: Add Java binary catalogers (anchore#1392)
  chore: Update SPDX license list to 3.19 (anchore#1389)
  fix: add manual vendor/product removal to fix false flags (anchore#1070)
  Update Stereoscope to c5ff155d72f166e2332e160a75c3ff2b8e9c7e2e (anchore#1395)
  chore: fix test busybox image sha (anchore#1393)
  fix: go version not properly identified in binary (anchore#1384)
  Update Stereoscope to 3b80d983223f6e6fc2d33b0ffa003d30268418e9 (anchore#1376)
  fix: Update node binary package name (anchore#1375)
  feat: Generic Binary Cataloger (anchore#1336)
  recover from bad parsing of golang binary (anchore#1371)
  Fix parsing of apk databases with large entries (anchore#1365)
  Update syft bootstrap tools to latest versions. (anchore#1369)
  ...
spiffcs added a commit to cpendery/syft that referenced this pull request Dec 20, 2022
* main: (189 commits)
  feat: add h1digest when scanning go.mod (anchore#1405)
  feat: Add license parsing for java (anchore#1385)
  fix: cyclonedx component type for binaries (anchore#1406)
  fix: openjdk detection pattern (anchore#1415)
  bug: spdx checksum empty array; allow syft to generate SHA1 for spdx-tag-value documents (anchore#1404)
  Add NetBSD support. (anchore#1412)
  feat: add catalog delete (anchore#1377)
  docs: remove file classifier (anchore#1397)
  chore: update latest cyclonedx library (anchore#1390)
  feat: Add Java binary catalogers (anchore#1392)
  chore: Update SPDX license list to 3.19 (anchore#1389)
  fix: add manual vendor/product removal to fix false flags (anchore#1070)
  Update Stereoscope to c5ff155d72f166e2332e160a75c3ff2b8e9c7e2e (anchore#1395)
  chore: fix test busybox image sha (anchore#1393)
  fix: go version not properly identified in binary (anchore#1384)
  Update Stereoscope to 3b80d983223f6e6fc2d33b0ffa003d30268418e9 (anchore#1376)
  fix: Update node binary package name (anchore#1375)
  feat: Generic Binary Cataloger (anchore#1336)
  recover from bad parsing of golang binary (anchore#1371)
  Fix parsing of apk databases with large entries (anchore#1365)
  ...
GijsCalis pushed a commit to GijsCalis/syft that referenced this pull request Feb 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Syft finds no apks for some images with apks
2 participants