Make diff behave like diff(1); report consistent behaviors #628

Merged: 6 commits, Nov 15, 2024

@egibs (Member) commented Nov 15, 2024

With the clarification about diff(1) behavior in #599, I wanted to get something written up to address the current implementation gap.

This PR overhauls diff and tries to mimic what diff(1) does:

  • if two files are scanned, the diff is treated as a change
  • if two directories are scanned, the diff considers paths

When diffing directories, the source file report is first compared to the destination report to identify matching files, followed by files only present in the source path. Afterward, the opposite is done to identify files that exist only in the destination path.
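The two passes described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `report` and `diffResult` types and the `diffDirs` function are hypothetical, and the sketch assumes both scans are keyed by path relative to their scan root. The file names in `main` mirror the /tmp/old and /tmp/new example further down.

```go
package main

import (
	"fmt"
	"sort"
)

// report is a hypothetical stand-in for a per-file scan result.
type report struct {
	Risk string
}

// diffResult groups relative paths by how they differ between the two trees.
type diffResult struct {
	Modified []string // present in both trees
	Removed  []string // present only in the source tree
	Added    []string // present only in the destination tree
}

// diffDirs compares two scans keyed by path relative to each scan root.
func diffDirs(src, dest map[string]report) diffResult {
	var d diffResult
	// Pass 1: walk the source. A match is a candidate change, a miss is a removal.
	for p := range src {
		if _, ok := dest[p]; ok {
			d.Modified = append(d.Modified, p)
		} else {
			d.Removed = append(d.Removed, p)
		}
	}
	// Pass 2: walk the destination. Anything the source lacks is an addition.
	for p := range dest {
		if _, ok := src[p]; !ok {
			d.Added = append(d.Added, p)
		}
	}
	// Go randomizes map iteration order, so sort for stable output.
	sort.Strings(d.Modified)
	sort.Strings(d.Removed)
	sort.Strings(d.Added)
	return d
}

func main() {
	src := map[string]report{
		"_load_setup_py_data.py": {Risk: "MEDIUM"},
		"ls":                     {Risk: "MEDIUM"},
	}
	dest := map[string]report{
		"ls": {Risk: "LOW"},
	}
	fmt.Printf("%+v\n", diffDirs(src, dest))
}
```

Sorting at the end is a choice made for this sketch only; it keeps repeated runs comparable regardless of map iteration order.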

The processSrc, processDest, and fileDestination functions were confusing, and I think handleFile does everything we need for "modified" files. Otherwise, we just add reports directly to the Added/Removed map.

I also started tracking consistent behaviors across modified files (originally called existing, I think?) and updated the renderers to account for them. Depending on the format, consistent behaviors are shown with no + or -. In the terminal, they show up as cyan; the updated diff test data also contains these behaviors.
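As a rough illustration of the renderer rule (no + or - for consistent behaviors), here is a small sketch. The `behavior` type and `linePrefix` function are hypothetical and trimmed down for the example; terminal coloring is left out, and giving DiffAdded precedence when both flags are set is an assumption of this sketch.

```go
package main

import "fmt"

// behavior is a hypothetical, trimmed-down stand-in for a reported behavior.
type behavior struct {
	ID          string
	DiffAdded   bool
	DiffRemoved bool
}

// linePrefix returns the marker rendered in front of a behavior line.
func linePrefix(b behavior) string {
	switch {
	case b.DiffAdded:
		return "+++"
	case b.DiffRemoved:
		return "---"
	default:
		// Consistent behavior: no + or - marker, only padding for alignment.
		return "   "
	}
}

func main() {
	behaviors := []behavior{
		{ID: "compression/lzma", DiffRemoved: true},
		{ID: "shell/TERM"},
		{ID: "directory/traverse", DiffAdded: true},
	}
	for _, b := range behaviors {
		fmt.Printf("%s     %s\n", linePrefix(b), b.ID)
	}
}
```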

Examples:

Two directories:

$ go run cmd/mal/mal.go diff ./out/chainguard-dev/malcontent-samples/python/clean/conda-build/ ./out/chainguard-dev/malcontent-samples/python/clean/fonttools/
├─ 🟡 Deleted: out/chainguard-dev/malcontent-samples/python/clean/conda-build/_load_setup_py_data.py [MEDIUM]
│     ≡ execution [MEDIUM]
│       🟡 remote_commands/code_eval — evaluate code dynamically using exec(): exec(code,, import
│     ≡ filesystem [LOW]
│       🔵 file/open — opens files: open(
│     ≡ impact [LOW]
│       🔵 remote_access/py_setuptools — Python library installer that evaluates arbitrary code: exec(code
│     ≡ networking [MEDIUM]
│       🟡 download — download files: not downloaded yet
│       🔵 url/embedded — contains embedded HTTPS URLs: https://numpy.org/doc/stable/reference/distutils_status_migration.html
│     ≡ operating-system [LOW]
│       🔵 fd/read — reads from a file handle: compile(f.read()
│
├─ 🔵 Added: out/chainguard-dev/malcontent-samples/python/clean/fonttools/psLib.py []
│

Two relative directories:

$ go run cmd/mal/mal.go diff ../malcontent-samples/python/clean/hatch/ ../malcontent-samples/python/clean/idna/
├─ 🟡 Deleted: ../malcontent-samples/python/clean/hatch/migrate.py [MEDIUM]
│     ≡ discovery [MEDIUM]
│       🟡 system/environment — Dump values from the environment: os.environ.items()
│     ≡ execution [MEDIUM]
│       🟡 program — execute external program: subprocess.run([sys.executable, setup_py], env
│       🟡 remote_commands/code_eval — evaluate code dynamically using eval(): eval(value)
│     ≡ false-positives [LOW]
│       🔵 py_hatch — migrate py: '_HATCHLING_PORT_ADD_', literal_eval(value)
│     ≡ filesystem [LOW]
│       🔵 directory/list — lists contents of a directory: .listdir(
│       🔵 file/open — opens files: open(
│       🔵 symlink_resolve — resolves symbolic links: realpath
│     ≡ networking [MEDIUM]
│       🟡 download — download files: Download, download_url
│     ≡ operating-system [LOW]
│       🔵 fd/read — reads from a file handle: f.read()
│       🔵 fd/write — writes to a file handle: f.write(output)
│     ≡ process [MEDIUM]
│       🟡 executable_path — gets executable associated to this process: sys.executable
│
├─ 🟡 Added: ../malcontent-samples/python/clean/idna/setup.py [MEDIUM]
│     ≡ execution [MEDIUM]
│       🟡 remote_commands/code_eval — evaluate code dynamically using exec(): exec(open('idna, import
│     ≡ filesystem [LOW]
│       🔵 file/open — opens files: open(
│     ≡ networking [LOW]
│       🔵 url/embedded — contains embedded HTTPS URLs: https://github.com/kjd/idna
│     ≡ operating-system [LOW]
│       🔵 fd/read — reads from a file handle: ).read()
│

Two unrelated files:

$ go run cmd/mal/mal.go diff ../malcontent-samples/macOS/clean/ls ../malcontent-samples/linux/clean/ls.x86_64 
├─ 🟡 Changed: ../malcontent-samples/linux/clean/ls.x86_64 [LOW → MEDIUM]
│     ▲ data [NONE → LOW]
+++     🔵 compression/lzma — works with lzma files
│     ▲ discovery [NONE → LOW]
+++     🔵 system/hostname — get computer host name: gethostname
│     ≡ execution [LOW]
~~~     🔵 shell/TERM — Look up or override terminal settings
│     ≡ filesystem [LOW]
---     🔵 directory/traverse — traverse filesystem hierarchy
~~~     🔵 link_read — read value of a symbolic link
│     ▲ networking [NONE → LOW]
+++     🔵 url/embedded — contains embedded HTTPS URLs:
+++           https://gnu.org/licenses/gpl.html, https://translationproject.org/team/, https://wiki.xiph.org/MIME_Types_and_File_Extensions, https://www.gnu.org/software/coreutils/
│     ▲ process [NONE → MEDIUM]
+++     🟡 name_set — get or set the current process name: __progname
│

Two unrelated files in the same parent:

$ go run cmd/mal/mal.go diff ./out/chainguard-dev/malcontent-samples/linux/clean/ls.x86_64 ./out/chainguard-dev/malcontent-samples/macOS/clean/ls
├─ 🔵 Changed: out/chainguard-dev/malcontent-samples/macOS/clean/ls [MEDIUM → LOW]
│     X data [LOW → NONE]
---     🔵 compression/lzma — works with lzma files
│     X discovery [LOW → NONE]
---     🔵 system/hostname — get computer host name
│     ≡ execution [LOW]
~~~     🔵 shell/TERM — Look up or override terminal settings
│     ≡ filesystem [LOW]
+++     🔵 directory/traverse — traverse filesystem hierarchy: _fts_children, _fts_close, _fts_open, _fts_read, _fts_set
~~~     🔵 link_read — read value of a symbolic link
│     X networking [LOW → NONE]
---     🔵 url/embedded — contains embedded HTTPS URLs
│     X process [MEDIUM → NONE]
---     🟡 name_set — get or set the current process name
│

Moving further down the directory structure:

$HOME/go/1.23.2/bin/mal diff linux/clean/ls.x86_64 macOS/clean/ls
├─ 🔵 Changed: macOS/clean/ls [MEDIUM → LOW]
│     X data [LOW → NONE]
---     🔵 compression/lzma — works with lzma files
│     X discovery [LOW → NONE]
---     🔵 system/hostname — get computer host name
│     ≡ execution [LOW]
~~~     🔵 shell/TERM — Look up or override terminal settings
│     ≡ filesystem [LOW]
+++     🔵 directory/traverse — traverse filesystem hierarchy: _fts_children, _fts_close, _fts_open, _fts_read, _fts_set
~~~     🔵 link_read — read value of a symbolic link
│     X networking [LOW → NONE]
---     🔵 url/embedded — contains embedded HTTPS URLs
│     X process [MEDIUM → NONE]
---     🟡 name_set — get or set the current process name
│

Two directories that share a file of the same name:

$ go run cmd/mal/mal.go diff /tmp/old/ /tmp/new/
├─ 🟡 Deleted: /private/tmp/old/_load_setup_py_data.py [MEDIUM]
│     ≡ execution [MEDIUM]
│       🟡 remote_commands/code_eval — evaluate code dynamically using exec(): exec(code,, import
│     ≡ filesystem [LOW]
│       🔵 file/open — opens files: open(
│     ≡ impact [LOW]
│       🔵 remote_access/py_setuptools — Python library installer that evaluates arbitrary code: exec(code
│     ≡ networking [MEDIUM]
│       🟡 download — download files: not downloaded yet
│       🔵 url/embedded — contains embedded HTTPS URLs: https://numpy.org/doc/stable/reference/distutils_status_migration.html
│     ≡ operating-system [LOW]
│       🔵 fd/read — reads from a file handle: compile(f.read()
│
├─ 🔵 Changed: /private/tmp/new/ls [MEDIUM → LOW]
│     X data [LOW → NONE]
---     🔵 compression/lzma — works with lzma files
│     X discovery [LOW → NONE]
---     🔵 system/hostname — get computer host name
│     ≡ execution [LOW]
     🔵 shell/TERM — Look up or override terminal settings
│     ≡ filesystem [LOW]
+++     🔵 directory/traverse — traverse filesystem hierarchy: _fts_children, _fts_close, _fts_open, _fts_read, _fts_set
     🔵 link_read — read value of a symbolic link
│     X networking [LOW → NONE]
---     🔵 url/embedded — contains embedded HTTPS URLs
│     X process [MEDIUM → NONE]
---     🟡 name_set — get or set the current process name
│

Consistent archive diffs:

$ for i in (seq 1 10); go run cmd/mal/mal.go diff /tmp/py3.13-debugpy-bin-1.8.6-r1.apk /tmp/py3.13-debugpy-bin-1.8.7-r0.apk; end
├─ 🔵 Changed: /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /usr/bin/debugpy
│     ≡ filesystem [LOW]
│       🔵 path/usr_bin — path reference within /usr/bin: /usr/bin/python3.13
│
├─ 🟡 Moved: /private/tmp/py3.13-debugpy-bin-1.8.6-r1.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.6-r1.spdx.json -> /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.7-r0.spdx.json (score: 0.976238)
│     ≡ networking [MEDIUM]
│       🟡 download — download files: downloadLocation
│       🔵 url/embedded — contains embedded HTTPS URLs: https://spdx.org/spdxdocs/chainguard/melange/2dc1f85989cc45e9f3cb0cfa9c23
│
├─ 🔵 Changed: /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /usr/bin/debugpy
│     ≡ filesystem [LOW]
│       🔵 path/usr_bin — path reference within /usr/bin: /usr/bin/python3.13
│
├─ 🟡 Moved: /private/tmp/py3.13-debugpy-bin-1.8.6-r1.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.6-r1.spdx.json -> /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.7-r0.spdx.json (score: 0.976238)
│     ≡ networking [MEDIUM]
│       🟡 download — download files: downloadLocation
│       🔵 url/embedded — contains embedded HTTPS URLs: https://spdx.org/spdxdocs/chainguard/melange/2dc1f85989cc45e9f3cb0cfa9c23
│
├─ 🔵 Changed: /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /usr/bin/debugpy
│     ≡ filesystem [LOW]
│       🔵 path/usr_bin — path reference within /usr/bin: /usr/bin/python3.13
│
├─ 🟡 Moved: /private/tmp/py3.13-debugpy-bin-1.8.6-r1.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.6-r1.spdx.json -> /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.7-r0.spdx.json (score: 0.976238)
│     ≡ networking [MEDIUM]
│       🟡 download — download files: downloadLocation
│       🔵 url/embedded — contains embedded HTTPS URLs: https://spdx.org/spdxdocs/chainguard/melange/2dc1f85989cc45e9f3cb0cfa9c23
│
├─ 🔵 Changed: /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /usr/bin/debugpy
│     ≡ filesystem [LOW]
│       🔵 path/usr_bin — path reference within /usr/bin: /usr/bin/python3.13
│
├─ 🟡 Moved: /private/tmp/py3.13-debugpy-bin-1.8.6-r1.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.6-r1.spdx.json -> /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.7-r0.spdx.json (score: 0.976238)
│     ≡ networking [MEDIUM]
│       🟡 download — download files: downloadLocation
│       🔵 url/embedded — contains embedded HTTPS URLs: https://spdx.org/spdxdocs/chainguard/melange/2dc1f85989cc45e9f3cb0cfa9c23
│
├─ 🔵 Changed: /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /usr/bin/debugpy
│     ≡ filesystem [LOW]
│       🔵 path/usr_bin — path reference within /usr/bin: /usr/bin/python3.13
│
├─ 🟡 Moved: /private/tmp/py3.13-debugpy-bin-1.8.6-r1.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.6-r1.spdx.json -> /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.7-r0.spdx.json (score: 0.976238)
│     ≡ networking [MEDIUM]
│       🟡 download — download files: downloadLocation
│       🔵 url/embedded — contains embedded HTTPS URLs: https://spdx.org/spdxdocs/chainguard/melange/2dc1f85989cc45e9f3cb0cfa9c23
│
├─ 🔵 Changed: /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /usr/bin/debugpy
│     ≡ filesystem [LOW]
│       🔵 path/usr_bin — path reference within /usr/bin: /usr/bin/python3.13
│
├─ 🟡 Moved: /private/tmp/py3.13-debugpy-bin-1.8.6-r1.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.6-r1.spdx.json -> /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.7-r0.spdx.json (score: 0.976238)
│     ≡ networking [MEDIUM]
│       🟡 download — download files: downloadLocation
│       🔵 url/embedded — contains embedded HTTPS URLs: https://spdx.org/spdxdocs/chainguard/melange/2dc1f85989cc45e9f3cb0cfa9c23
│
├─ 🔵 Changed: /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /usr/bin/debugpy
│     ≡ filesystem [LOW]
│       🔵 path/usr_bin — path reference within /usr/bin: /usr/bin/python3.13
│
├─ 🟡 Moved: /private/tmp/py3.13-debugpy-bin-1.8.6-r1.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.6-r1.spdx.json -> /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.7-r0.spdx.json (score: 0.976238)
│     ≡ networking [MEDIUM]
│       🟡 download — download files: downloadLocation
│       🔵 url/embedded — contains embedded HTTPS URLs: https://spdx.org/spdxdocs/chainguard/melange/2dc1f85989cc45e9f3cb0cfa9c23
│
├─ 🔵 Changed: /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /usr/bin/debugpy
│     ≡ filesystem [LOW]
│       🔵 path/usr_bin — path reference within /usr/bin: /usr/bin/python3.13
│
├─ 🟡 Moved: /private/tmp/py3.13-debugpy-bin-1.8.6-r1.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.6-r1.spdx.json -> /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.7-r0.spdx.json (score: 0.976238)
│     ≡ networking [MEDIUM]
│       🟡 download — download files: downloadLocation
│       🔵 url/embedded — contains embedded HTTPS URLs: https://spdx.org/spdxdocs/chainguard/melange/2dc1f85989cc45e9f3cb0cfa9c23
│
├─ 🔵 Changed: /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /usr/bin/debugpy
│     ≡ filesystem [LOW]
│       🔵 path/usr_bin — path reference within /usr/bin: /usr/bin/python3.13
│
├─ 🟡 Moved: /private/tmp/py3.13-debugpy-bin-1.8.6-r1.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.6-r1.spdx.json -> /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.7-r0.spdx.json (score: 0.976238)
│     ≡ networking [MEDIUM]
│       🟡 download — download files: downloadLocation
│       🔵 url/embedded — contains embedded HTTPS URLs: https://spdx.org/spdxdocs/chainguard/melange/2dc1f85989cc45e9f3cb0cfa9c23
│
├─ 🔵 Changed: /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /usr/bin/debugpy
│     ≡ filesystem [LOW]
│       🔵 path/usr_bin — path reference within /usr/bin: /usr/bin/python3.13
│
├─ 🟡 Moved: /private/tmp/py3.13-debugpy-bin-1.8.6-r1.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.6-r1.spdx.json -> /private/tmp/py3.13-debugpy-bin-1.8.7-r0.apk ∴ /var/lib/db/sbom/py3.13-debugpy-bin-1.8.7-r0.spdx.json (score: 0.976238)
│     ≡ networking [MEDIUM]
│       🟡 download — download files: downloadLocation
│       🔵 url/embedded — contains embedded HTTPS URLs: https://spdx.org/spdxdocs/chainguard/melange/2dc1f85989cc45e9f3cb0cfa9c23
│

@egibs egibs requested a review from tstromberg November 15, 2024 04:21
@@ -319,6 +321,13 @@ func renderFileSummary(_ context.Context, fr *malcontent.FileReport, w io.Writer
content = fmt.Sprintf("%s%s%s %s %s", prefix, indent, bullet, rest, desc)
e = ""
}

if b.NoDiff {
prefix = "~~~"

@tstromberg (Collaborator):

To increase familiarity with folks who review diffs from diff or GitHub, can we leave NoDiff lines without a prefix or special appearance? I like the disambiguation that ~~~ gives, but since it isn't used elsewhere, I think it will raise more questions than answers.

@egibs (Member, Author):

Updated in 4d0921c (#628).

@tstromberg (Collaborator) left a comment:

This is great timing as I'm working on a PR to add diff sensitivity. Your PR cleans things up.

@@ -59,6 +59,7 @@ type Behavior struct {

DiffAdded bool `json:",omitempty" yaml:",omitempty"`
DiffRemoved bool `json:",omitempty" yaml:",omitempty"`
NoDiff bool `json:",omitempty" yaml:",omitempty"`

@tstromberg (Collaborator):

Is NoDiff any different than an absence of DiffAdded or DiffRemoved? If it isn't, I'd recommend just leaving the public structure as is.

If we had good support for enums, this would be a perfect place to have one, since it could make for a confusing situation where all 3 bools are unset or set.

@egibs (Member, Author), Nov 15, 2024:

It's not -- it's equivalent to both b.DiffRemoved and b.DiffAdded being false, so we can use that condition to check for behaviors that didn't change. I'll push that fix.
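For illustration, the enum idea raised above could look something like this in Go. Nothing here is part of the PR: `DiffState` and `stateOf` are hypothetical names, and treating "both flags set" as an error is an assumption of this sketch. It shows how the two existing booleans already encode the unchanged case as their zero value.

```go
package main

import (
	"errors"
	"fmt"
)

// DiffState is a hypothetical enum-style alternative to separate boolean flags.
type DiffState int

const (
	Unchanged DiffState = iota // zero value: neither added nor removed
	Added
	Removed
)

// String makes the state readable in logs and rendered output.
func (s DiffState) String() string {
	switch s {
	case Added:
		return "added"
	case Removed:
		return "removed"
	default:
		return "unchanged"
	}
}

// stateOf derives a single state from the two existing flags.
// Both flags being set is contradictory, so it is reported as an error.
func stateOf(diffAdded, diffRemoved bool) (DiffState, error) {
	switch {
	case diffAdded && diffRemoved:
		return Unchanged, errors.New("behavior marked both added and removed")
	case diffAdded:
		return Added, nil
	case diffRemoved:
		return Removed, nil
	default:
		return Unchanged, nil
	}
}

func main() {
	s, err := stateOf(false, false)
	fmt.Println(s, err)
}
```

With only two booleans there is a single contradictory combination left to guard against; a third NoDiff flag would add more.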

@egibs (Member, Author):

Updated in 4d0921c (#628).

@egibs egibs requested a review from tstromberg November 15, 2024 13:55
@egibs (Member, Author) commented Nov 15, 2024

Diffing archives is exhibiting inconsistent behavior so I need to fix that.

Edit: updated in 6180a11 (#628). Without this change, the files in each report were being compared as if they were single files rather than an extracted directory of files. Because processing is concurrent, each run would show a single, different file.

@egibs egibs marked this pull request as draft November 15, 2024 15:40
@egibs egibs marked this pull request as ready for review November 15, 2024 15:52
@egibs egibs marked this pull request as draft November 15, 2024 16:20
Signed-off-by: egibs <[email protected]>
@egibs egibs marked this pull request as ready for review November 15, 2024 16:27
@tstromberg (Collaborator) commented:

This is huge - thank you!

@egibs (Member, Author) commented Nov 15, 2024

Will merge in a bit. Working on one last bug.

@@ -190,7 +217,13 @@ func Diff(ctx context.Context, c malcontent.Config) (*malcontent.Report, error)
if srcFile != nil && destFile != nil {
formatSrc := displayPath(srcBase, srcFile.Path)
formatDest := displayPath(destBase, destFile.Path)
handleFile(ctx, c, srcFile, destFile, fmt.Sprintf("%s -> %s", formatSrc, formatDest), d)
if scoreFile(srcFile, destFile) {

@egibs (Member, Author), Nov 15, 2024:

This was the last remaining bug. Previously, we were hitting handleFile for the files we calculate moves for, so they would just show up as changed rather than moved. Now, we'll handle those files correctly via inferMoves while sending every other file to handleFile.

I also simplified the inferMove code path significantly.
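To make the routing concrete, here is a hedged sketch of the decision: a pair with the same relative path is a change, and a pair with different but similar paths is a move. Everything in it is hypothetical, including the `route` function, the 0.9 threshold, and the normalized Levenshtein similarity, which only stands in for whatever scoreFile and inferMoves actually compute.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

// levenshtein returns the edit distance between a and b, counted in runes.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		cur := make([]int, len(rb)+1)
		cur[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			cur[j] = min3(prev[j]+1, cur[j-1]+1, prev[j-1]+cost)
		}
		prev = cur
	}
	return prev[len(rb)]
}

// similarity maps edit distance onto [0, 1], where 1 means identical.
func similarity(a, b string) float64 {
	la, lb := len([]rune(a)), len([]rune(b))
	longest := la
	if lb > longest {
		longest = lb
	}
	if longest == 0 {
		return 1
	}
	return 1 - float64(levenshtein(a, b))/float64(longest)
}

// route decides how a source/destination pair of relative paths is reported.
func route(srcRel, destRel string, threshold float64) string {
	switch {
	case srcRel == destRel:
		return "changed" // same path in both trees: compare behaviors
	case similarity(filepath.Base(srcRel), filepath.Base(destRel)) >= threshold:
		return "moved" // different path, similar name: treat as a move
	default:
		return "unrelated" // left to the added/removed handling
	}
}

func main() {
	fmt.Println(route("ls", "ls", 0.9))
	fmt.Println(route("sbom/pkg-1.8.6-r1.spdx.json", "sbom/pkg-1.8.7-r0.spdx.json", 0.9))
}
```

The point of the sketch is only the ordering: move candidates are decided before the generic change handling, so they are not reported as changed.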

@egibs egibs merged commit ae10a42 into chainguard-dev:main Nov 15, 2024
8 checks passed
@egibs egibs deleted the diff1-behavior branch November 18, 2024 16:37