add --output-format=json output option to v2 list #8450

Closed

Conversation

cosmicexplorer
Contributor

@cosmicexplorer cosmicexplorer commented Oct 11, 2019

Problem

Resolves #8445.

Solution

  • Add an --output-format option to the list v2 @console_rule, and make --provides and --documented point to values of that enum option.
  • Add --output-format=json, which prints out lines of json with the keys:
    • was_root: whether the target was a target root, or one of the transitive dependencies of one of the roots
    • address: the target address
    • target_type: the target's type alias as written in BUILD files, e.g. python_library
    • intransitive_fingerprint: the intransitive fingerprint for that TargetAdaptor
    • transitive_fingerprint: the transitive fingerprint for that TargetAdaptor
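
To make the record shape concrete, here is a minimal sketch of how a consumer might read this output, assuming one JSON object per line with the keys listed above. The sample records and fingerprint values are invented for illustration, and `root_fingerprints` is a hypothetical helper, not part of Pants.

```python
import json

# Hypothetical sample of `./pants list --output-format=json` output: one JSON
# object per line. Addresses and fingerprints here are invented.
SAMPLE_OUTPUT = """\
{"was_root": true, "address": "my/python:binary", "target_type": "python_binary", "intransitive_fingerprint": "aaa", "transitive_fingerprint": "bbb"}
{"was_root": false, "address": "my/python:lib", "target_type": "python_library", "intransitive_fingerprint": "ccc", "transitive_fingerprint": "ddd"}
"""


def root_fingerprints(output: str) -> dict:
    """Map each target root's address to its transitive fingerprint."""
    result = {}
    for line in output.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if record["was_root"]:
            result[record["address"]] = record["transitive_fingerprint"]
    return result


print(root_fingerprints(SAMPLE_OUTPUT))
```

Parsing line by line keeps the consumer streaming-friendly, which is the main reason to emit JSON lines rather than one large array.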

Result

The following command line will output a string representing a stable hash of the transitive closure of the target my/python:binary:

$ ./pants list --output-format=json my/python:binary \
  | jq -r 'select(.was_root) | .transitive_fingerprint'
ef54aa0c26c8d91bb74cd575e0cac9378bef8e4a

Once #7356 lands, we can use the following command to print the fingerprints of all python_binary targets whenever their source files change:

$ ./pants --loop --query="type_filter('python_binary')" list --output-format=json :: \
  | jq -c 'select(.was_root) | {address, transitive_fingerprint}'
{"address":"my/python:binary","transitive_fingerprint":"ef54aa0c26c8d91bb74cd575e0cac9378bef8e4a"}

@cosmicexplorer
Contributor Author

Note that I didn't discuss this specific implementation with folks earlier and would love to hear input on this approach.

@cosmicexplorer
Contributor Author

cosmicexplorer commented Oct 11, 2019

This is likely to be extremely useful when running pants with --loop for scalameta/metals#935, according to a discussion with the author of that PR (but this does not at all block that PR).

Contributor

@benjyw benjyw left a comment

How do you feel about renaming the option to (in order of my personal preference) --output=json or --json or --verbose or something? That allows us to add more info in the future if needed. Essentially we generalize this from a specific hack to a more general-purpose mechanism, currently only used for the specific hack...

@cosmicexplorer
Contributor Author

I absolutely love --output=json or maybe --output-format=json!

@cosmicexplorer
Contributor Author

(Ideally that sets the stage for a v2 ./pants export too!)

@cosmicexplorer cosmicexplorer changed the title from "add --with-fingerprints json output option to v2 list" to "add --output-format=json output option to v2 list" Oct 11, 2019
@cosmicexplorer
Contributor Author

Done! Made an enum --output-format option, defaulting to --output-format=address-specs!

Contributor

@benjyw benjyw left a comment

Neat!

Contributor

@blorente blorente left a comment

If I understand this code correctly, this calculates the hashes of the fields of the target, right? So, if I have

target(
  name = "a",
  sources=["a.py", "b.py"],
)

The intransitive hash will be: hash([hash("a"), hash("a.py"), hash("b.py")]).

If this is true, should we consider hashing the contents of the files, instead of (or in addition to) the filenames?
So it would be:
hash([hash("a"), hash(read("a.py")), hash(read("b.py"))])

I think for the case of #8445, they might want to redeploy every time a source file changes, not just the list of sources. Or maybe not, I'm not too sure.
We could also have each target adaptor define its own intransitive_fingerprint, so targets with sources would know how to hash themselves. This might not be best, because it might introduce a layer of caching above the engine graph itself, but could be worth thinking about.

Also, I might totally be missing something.

Even if we end up hashing only the filenames, I think there's still value in this, so wouldn't be opposed to merging it.
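
The difference between the two schemes discussed above can be sketched as follows. This is only an illustration, not the code in this PR: the helper names are hypothetical, SHA-1 is an assumed digest, and the content variant hashes each path together with its contents (the "in addition to" option) rather than contents alone.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


def combine(parts) -> str:
    """Hash a list of hex digests into one digest."""
    return sha1_hex("\n".join(parts).encode("utf-8"))


def fingerprint_by_filename(name: str, sources: dict) -> str:
    # `sources` maps file path -> file contents (bytes); only the paths are hashed.
    return combine([sha1_hex(name.encode())] +
                   [sha1_hex(path.encode()) for path in sorted(sources)])


def fingerprint_by_content(name: str, sources: dict) -> str:
    # Hash each path together with its contents, so renames and edits both register.
    return combine([sha1_hex(name.encode())] +
                   [sha1_hex(path.encode() + b"\0" + sources[path])
                    for path in sorted(sources)])


before = {"a.py": b"print(1)\n", "b.py": b"print(2)\n"}
after = {"a.py": b"print(1)\n", "b.py": b"print(3)\n"}  # b.py edited, same file list

# Filename-only hashing cannot see the edit; content hashing can.
print(fingerprint_by_filename("a", before) == fingerprint_by_filename("a", after))  # True
print(fingerprint_by_content("a", before) == fingerprint_by_content("a", after))    # False
```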

@cosmicexplorer
Contributor Author

It's correct that this previously calculated just the hashes of the fields. It's not clear to me why the sources field is excluded from calculation in e.g. Struct._key(), but this implementation I've just pushed will explicitly attempt to extract sources from targets.

@illicitonion
Contributor

Stepping-back question: What's this actually useful for, if it doesn't include the digests of source files?

(FWIW it would now be trivial to mix in a source-file digest by mixing-in TargetAdaptor.sources.snapshot)

@illicitonion
Contributor

> Stepping-back question: What's this actually useful for, if it doesn't include the digests of source files?
>
> (FWIW it would now be trivial to mix in a source-file digest by mixing-in TargetAdaptor.sources.snapshot)

Oops, I was just looking at the comments here not the code - you're already doing this! :)

Contributor

@blorente blorente left a comment

Thank you so much for the thorough testing!

for dep in tht.dependencies:
  dep_intransitive_fingerprint = intransitive_fingerprint_dict.get(dep.root.address, None)
  if not dep_intransitive_fingerprint:
    dep_sources = getattr(dep.root.adaptor, 'sources', None)
Contributor

Nit: This block and the one before look like they could be extracted into a common function

Contributor Author

@cosmicexplorer cosmicexplorer Oct 17, 2019

Done now via @memoized_classproperty!

Contributor

@illicitonion illicitonion left a comment

Looks good, but I have a couple of questions :) Thanks!

# `stable_json_sha1()` to fail with a cycle detection. Since some python targets are only mapped
# to `TargetAdaptor` (and not `PythonTargetAdaptor`), we check every single target for a
# `requirements` kwarg, which is fine for now.
key, value = super()._coerce_key_values(key, value)
Contributor

I'm not sure from reading - does this cover all actual key-values of the Target, or just the ones explicitly listed in BUILD files?

In particular, if we change a default value in pants, or set a default with a flag or something, and the target doesn't set it, will the fingerprint change? Presumably it should, right?

Contributor Author

This covers all the key-values that are provided as field_adaptors, I believe, essentially because we're getting this info from a TargetAdaptor, which will explicitly only use HydrateableField @union members as the _kwargs provided by Struct -- see hydrate_target:

@rule
def hydrate_target(hydrated_struct: HydratedStruct) -> HydratedTarget:
  target_adaptor = hydrated_struct.value
  """Construct a HydratedTarget from a TargetAdaptor and hydrated versions of its adapted fields."""
  # Hydrate the fields of the adaptor and re-construct it.
  hydrated_fields = yield [Get(HydratedField, HydrateableField, fa)
                           for fa in target_adaptor.field_adaptors]
  kwargs = target_adaptor.kwargs()
  for field in hydrated_fields:
    kwargs[field.name] = field.value
  yield HydratedTarget(target_adaptor.address,
                       type(target_adaptor)(**kwargs),
                       tuple(target_adaptor.dependencies))

> In particular, if we change a default value in pants, or set a default with a flag or something, and the target doesn't set it, will the fingerprint change? Presumably it should, right?

Short answer: no, this fingerprint will not necessarily change, and yes, I absolutely think it should before we merge this.

Longer answer: Not all the things that contribute to Target#fingerprint transfer to a TargetAdaptor (which subclasses StructWithDeps, which subclasses Struct), only the things which are marked as hydrateable fields. As you imply, this is possibly not what we want at all for this purpose, and my use of TargetAdaptor here might not be correct.

@stuhood do you have any insight on how to bridge this? Does it break the v2 build graph traversal model if we're able to get an instance of a real Target in order to get a more normal fingerprint (i.e. a fingerprint containing everything the target payload does)? Are the Payload/Target concepts necessarily v1-only/requiring a v1 build graph, or is it possible to avoid having to reimplement payloads for all targets? I would love to pair on this or make an issue as necessary.

Contributor

Can we remove the fingerprints from the output before merging this?
It feels like target_type is an obviously useful thing to include; it's less obvious that these fingerprints are things we should be exposing as-is, but we can always add them in the future if we need to / firm them up :)

"was_root": True,
"address": "f:alias",
"target_type": "target",
"intransitive_fingerprint": "c108686e1fc1b327af1dbc295008762559d0b410",
Contributor

These tests look like they may be fragile because of the hard-coding of both ordering of dumped fields, and fingerprints. I'm worried that if we add new values to Target constructors, or change defaults, we'll need to blindly update fingerprints. Could we instead phrase the tests more like:

  • Run ./pants list --output-format=json f:alias
  • json.loads(output) and assert that the fields we expect to be stable are correct
  • Make a change to a file which we expect to alter the fingerprint, run ./pants list again, and see that the fingerprints we expect to change do, and the ones we don't expect to don't?

The particular tests I'd be wanting to see are:

  • If I change a source file, both of the owning target's fingerprints change, but for a dependee only the transitive fingerprint changes
  • If I change a random attribute on the target, as above
  • If I run pants with a flag which will change an attribute (say whether strict_deps is default on or off), as above
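
The first of those tests could be phrased roughly as below. This is a self-contained sketch against a toy in-memory graph rather than a real Pants integration test: `intransitive`, `transitive`, and `make_graph` are hypothetical stand-ins, and no specific digest value is ever asserted.

```python
import hashlib


def _digest(*parts: str) -> str:
    return hashlib.sha1("\0".join(parts).encode("utf-8")).hexdigest()


def intransitive(target: dict) -> str:
    """Fingerprint of one target's own attributes and source contents."""
    attrs = sorted(f"{k}={v}" for k, v in target["attrs"].items())
    srcs = sorted(f"{p}:{c}" for p, c in target["sources"].items())
    return _digest(*attrs, *srcs)


def transitive(graph: dict, address: str) -> str:
    """Fingerprint of a target plus everything it depends on (assumes no cycles)."""
    target = graph[address]
    dep_prints = [transitive(graph, dep) for dep in sorted(target["deps"])]
    return _digest(intransitive(target), *dep_prints)


def make_graph(lib_source: str) -> dict:
    # f:bin depends on f:lib; only f:lib's source text varies between graphs.
    return {
        "f:lib": {"attrs": {"strict_deps": False}, "sources": {"lib.py": lib_source}, "deps": []},
        "f:bin": {"attrs": {}, "sources": {"main.py": "import lib"}, "deps": ["f:lib"]},
    }


old, new = make_graph("x = 1"), make_graph("x = 2")

# The owning target: both fingerprints change.
assert intransitive(old["f:lib"]) != intransitive(new["f:lib"])
assert transitive(old, "f:lib") != transitive(new, "f:lib")
# The dependee: only the transitive fingerprint changes.
assert intransitive(old["f:bin"]) == intransitive(new["f:bin"])
assert transitive(old, "f:bin") != transitive(new, "f:bin")
```

Comparing fingerprints before and after a change, instead of matching literal hashes, keeps the test valid if the fingerprinting scheme itself is later revised.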

Contributor Author

Yes, I agree to all of the above, and would absolutely prefer not to add new testing that blindly tries to match fingerprints. I will add this testing!

@ShaneDelmore
Contributor

+1 I could use this for the work I am currently doing. The addition of target type to list is particularly useful.

@ShaneDelmore
Contributor

An additional attribute of targets that would be useful, but that I would not block the PR for, is internal/external.

@cosmicexplorer cosmicexplorer force-pushed the fingerprint-targets-v2 branch 2 times, most recently from 47ae05f to 30e15de Compare February 12, 2020 18:24
@cosmicexplorer
Contributor Author

> internal/external

./pants list as we've implemented here only lists target roots -- are we looking to list dependencies as well?

@cosmicexplorer
Contributor Author

cosmicexplorer commented Feb 15, 2020

> What is the definition of target root? I thought it was the output of filter-minimize, the targets you needed to invoke to get all code compiled eventually, but based on your usage I'm thinking I was wrong and it is actually all targets defined in BUILD.* files maybe.

"target roots" are all targets on the command line. filter-minimize (which is a v1 task that subclasses Filter in the twitter monorepo) mimics that to obtain the smallest number of target roots that pull in all of the invalid dependent targets. that's useful for us because we can then execute each target in its own pants invocation, which twitter uses in its internal CI -- that reduces the number of separate parallel pants invocations we know we have to invoke via https://github.com/twitter/scoot.

As a note, I've added a minimize() method to #7356, so filter-minimize should be available for the general public soon too.
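
For readers unfamiliar with the idea, the minimization described above can be sketched like this. It is not the filter-minimize task or the minimize() method from #7356, just an illustration under the assumption that the dependency graph is acyclic and is given as a plain dict from address to direct dependencies; `transitive_deps` and `minimize` are hypothetical names.

```python
def transitive_deps(graph: dict, address: str) -> set:
    """All addresses reachable from `address`, excluding itself (assumes a DAG)."""
    seen = set()
    stack = list(graph.get(address, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, ()))
    return seen


def minimize(graph: dict, targets) -> set:
    """Return the members of `targets` that no other member depends on.

    Building just these roots pulls in every other member transitively.
    """
    targets = set(targets)
    covered = set()
    for address in targets:
        covered |= transitive_deps(graph, address)
    return targets - covered


GRAPH = {
    "a:bin": ["a:lib"],
    "a:lib": ["a:util"],
    "a:util": [],
    "b:bin": ["a:util"],
}

print(sorted(minimize(GRAPH, GRAPH)))  # ['a:bin', 'b:bin']
```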

@cosmicexplorer
Contributor Author

This failed on unrelated test timeouts and a mypy check which is now fixed, so it is definitely green. Since @ShaneDelmore and @olafurpg have expressed interest in this feature and I don't know of an obvious alternative, I would love to merge this change. @stuhood I have modified the tests to avoid relying on any specific fingerprinting mechanism, which means that no code should be relying on these fingerprints, or the structure of this json output, being the same across pants versions (yet). Would that guarantee be sufficient to overcome your concern about investing further in list vs --query?

@cosmicexplorer
Contributor Author

Going to merge this if there are no further comments in the next day or so since there is a clear user need and some review has been performed.

add --with-fingerprints to list

coerce the provides key when fingerprinting targets

coerce the provides key when fingerprinting targets

convert the option to be named --output-format!

ensure fingerprints incorporate sources snapshots

use the new Enum type!

make fingerprints easier to create

clean up impl

bump deprecation version

fix ci

[ci skip-rust-tests]  # No Rust changes made.

[ci skip-jvm-tests]  # No JVM changes made.
@cosmicexplorer cosmicexplorer force-pushed the fingerprint-targets-v2 branch from 66a7bed to dba213f Compare May 4, 2020 03:08
@cosmicexplorer
Contributor Author

Splitting this into two PRs to separate:
(1) refactoring list_targets.py
(2) adding TransitiveFingerprintedTargets (and conform to the new v2 target api!)

@cosmicexplorer
Contributor Author

cosmicexplorer commented May 14, 2020

Noting that the buck build tool has supported a “show target hash” option for a while, which has allowed hooking it up to a system containing bazel (https://eng.uber.com/go-monorepo-bazel/). It would have been really great to have had less pushback on this PR initially when there was a clearly present user need.

@stuhood
Member

stuhood commented May 14, 2020

See the discussion in the linked ticket for more information on why this is subtle: bazelbuild/bazel#7962 ... file digests and action digests are useful for very different things: it's really important that users don't think that this is the latter thing (ie, they will need to do their own digesting of all of pants' other config, etc).

I've suggested in slack that I think a RuleGraph-aware query might allow for exposing more of the subtlety here, because it has the potential to allow for querying the digest of a particular goal, or of the inputs to a particular process (which would include all of the kinds of config you need in an action graph fingerprint).

Given that that might be a ways away though, I think that we could move forward here if we resolve a few things:

  1. the naming of the properties: we should make sure it is clear that they only include file contents (ie, the digest will not change if pants' configuration changes): maybe files_only_fingerprint?
  2. by default, list does not require hydrating targets, or walking into their transitive deps: both of those add a non-trivial cost, so the json output should probably not contain all of those properties by default. So the json should maybe further allow for field filtering, with digests disabled by default.
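
One way to read the second point is that each output field is computed only when requested. The sketch below illustrates that shape under stated assumptions: the field registry, `render`, and the placeholder fingerprint function are all hypothetical, and nothing here reflects an actual Pants option.

```python
import json
from typing import Callable, Dict, Iterable

# Cheap fields emitted by default; fingerprint fields must be requested explicitly.
DEFAULT_FIELDS = ("address", "target_type")

# Records which targets had the expensive path taken, for demonstration only.
fingerprint_calls = []


def _fingerprint(target: dict) -> str:
    # Stand-in for hydrating sources and walking transitive dependencies.
    fingerprint_calls.append(target["address"])
    return "<placeholder digest>"


FIELDS: Dict[str, Callable[[dict], object]] = {
    "address": lambda t: t["address"],
    "target_type": lambda t: t["target_type"],
    "transitive_fingerprint": _fingerprint,
}


def render(target: dict, requested: Iterable[str] = DEFAULT_FIELDS) -> str:
    """Emit one JSON line, computing only the requested fields."""
    return json.dumps({name: FIELDS[name](target) for name in requested})


target = {"address": "f:alias", "target_type": "target"}
print(render(target))  # {"address": "f:alias", "target_type": "target"}
print(render(target, DEFAULT_FIELDS + ("transitive_fingerprint",)))
```

With the default field set the fingerprint function is never called, so a plain listing pays none of the hydration cost.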

@benjyw
Contributor

benjyw commented May 14, 2020

content_fingerprint ?

@stuhood
Member

stuhood commented May 14, 2020

> content_fingerprint ?

Probably not specific enough... content of what? But for that matter, files_only is not very clear either, heh. Not sure what to call this.

@Eric-Arellano
Contributor

FYI #9912 will impact this, hopefully making things easier. We go back to having only one list implementation, this time using the Target API.

Base automatically changed from master to main March 19, 2021 19:20
@Eric-Arellano
Contributor

Closing as stale, which we're doing for all changes that haven't been touched in 1+ years to simplify project management.

This would still be a really neat feature, though. Do feel free to reopen. Thank you for showing what --query could look like for Pants!

huonw added a commit that referenced this pull request Mar 3, 2023
This has the `peek` output include the fingerprint of the sources
referenced in a target. This is a step towards #8445, by putting more
information into `peek`.

For instance, with this, one way to get a crude "hash" of a target would
be something like:

```shell
{ 
   pants dependencies --transitive --closed path/to:target | xargs pants peek
   # these might change behaviour and so need to be included
   cat pants.toml
   cat 3rdparty/python/default.lock # or whatever other lock files are relevant
} | openssl sha256
```

This is conservative: the hash can be different without the behaviour of
the target changing at all. For instance:

- irrelevant changes in `pants.toml`: adjusting comments, unrelated
subsystem config (e.g. settings in `[golang]` when `path/to:target` is a
Python-only `pex_binary`)
- upgrading 3rd party dependencies in the resolve that aren't
(transitively) used by `path/to:target`. This relates to #12733: if all
transitive 3rd party deps appeared in `pants dependencies --transitive`,
and `pants peek` included the right info for them (e.g. version and
fingerprints), the `cat 3rdparty/...` could be removed because the
`peek` pipe would handle it.
- target fields that don't impact execution behaviour, e.g. changing the
`skip_black` setting on a `python_source` target, without changing the
file contents (this might be _most_ fields on the (transitive)
dependencies of a packageable target?)

This is also only the hash of the input configuration, rather than a
hash of a built artefact. If there's processes that aren't deterministic
(e.g. `shell_command(command="date > output.txt",
output_files=["output.txt"])` somewhere in the chain), the exact output
artefact might be different if built twice, even if the hash hasn't
changed.

This PR is, in some sense, a partial revival of #8450, although is much
simpler, because the JSON-outputting `peek` target already exists, and
this doesn't try to solve the full problem.
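
The shell pipeline in that commit message amounts to hashing a concatenation of byte streams. A rough Python equivalent, with stand-in byte strings in place of real `pants peek` output and config files (all sample inputs are invented, and `crude_hash` is a hypothetical name), also shows why the result is conservative:

```python
import hashlib
from typing import Iterable


def crude_hash(chunks: Iterable[bytes]) -> str:
    """SHA-256 over the concatenation of all chunks, like `{ ...; } | openssl sha256`."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


peek_output = b'[{"address": "path/to:target"}]'  # stand-in for `pants peek` JSON
lockfile = b"# stand-in lock file\n"

h1 = crude_hash([peek_output, b"[GLOBAL]\n", lockfile])
# Adding only a comment to the config changes the hash, although behaviour is unchanged.
h2 = crude_hash([peek_output, b"[GLOBAL]\n# a comment\n", lockfile])
print(h1 != h2)  # True
```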
Development

Successfully merging this pull request may close these issues.

Expose the hash of a given target
7 participants