Split Gather phase from Audit phase #1806

Closed
paulirish opened this issue Mar 3, 2017 · 12 comments · Fixed by #3743
@paulirish (Member)

We currently can support auditing from saved artifacts, but we haven't really generalized this for all artifacts, just for performance stuff.

Splitting these two up would allow folks who want to run Lighthouse against 1000s of URLs to gather on one machine and audit on another (or farm out to many).

I think this involves saving the artifacts to disk as each gatherer finishes, and then allowing LH to pick up the disk artifacts later and run the remainder of the analysis.

As a side benefit, it'll be nice to do lighthouse --process-last-run (or whatever) during development of an audit/formatter/report rather than doing the entire round trip each iteration.

Off the top of my head, we probably have to figure out:

  1. What this looks like at the CLI
  2. How we save/store/delete the disk artifacts
  3. How the artifacts are retained in the devtools/extension case
  4. How we adapt the existing config-based approach to this

This is open for ideas, discussion, and anyone's help in moving this forward. :)

@patrickhulce (Collaborator)

Definitely would like a swing at this. I think the idea overlaps quite a bit with a transition to the gatherer runner emitting artifacts as they're computed and having a number of things happen in response (some audits computed while the next pass is happening, communicating the results/status to DevTools/WPT, etc.).

@brendankenny (Member) commented Mar 6, 2017

@patrickhulce are you working on this now? The initial request seems much simpler than what you're talking about :)

For the original issue, we really only need to formalize what we're already doing and make it more obvious/usable (like loading any artifact from file, not just performance logs or traces, and having --save-artifacts (or something new) save a config file capable of auditing the saved artifacts).

Running on any saved artifacts is technically already working when they're stored as literals in an artifacts object in the config, e.g. we use it in runner-test.js to test Runner running audits on already-generated artifacts (to be fair, it is mostly dealing with performance artifacts, but there are a few HTTPS artifacts sprinkled in there since the is-on-https audit is so fast)
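For illustration, the literal-artifacts shape looks roughly like this (a sketch of the config pattern being described; the exact artifact fields are assumptions, not copied from runner-test.js):

```javascript
// A config carrying pre-baked artifacts lets Runner skip gathering
// and run the listed audits directly against them.
const config = {
  audits: ['is-on-https'],
  artifacts: {
    HTTPS: {value: true},  // illustrative shape for the HTTPS artifact
  },
};

module.exports = config;
```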

@paulirish (Member, Author)

I put a really gross impl of this in https://github.com/GoogleChrome/lighthouse/compare/dumpartifacts

It hacks up runner pretty well.. but honestly the UX (for LH development) is pretty awesome.

In our last offline convo with Patrick, I indicated it probably makes sense to kick artifacts out of config completely (a breaking change). You could then specify either a separate artifacts config or maybe just an artifacts.json location. I like the idea that config and artifacts are managed separately.

@patrickhulce (Collaborator)

Yeah I'm not actively working on this since Paul showed me his working version, but just started a doc to outline my goals for the extensions I commented on :)

It'd be nice to do this and the report breaking change together in a config v2 format or something, maybe another doc like report requirements would be helpful...

@paulirish (Member, Author) commented Mar 17, 2017

branch: https://github.com/GoogleChrome/lighthouse/compare/dumpartifacts

updated. also includes the #2062 network records -> computed work

paulirish mentioned this issue Mar 30, 2017
brendankenny self-assigned this Apr 14, 2017
@paulirish (Member, Author)

We're getting close, but there are some details to figure out about the API:

lighthouse --only-gather
lighthouse -G

lighthouse --only-audit
lighthouse -A

lighthouse --only-report
lighthouse -R

TBD: how to handle input/output filenames (and their defaults), and the interaction with --output-path.

@patrickhulce (Collaborator) commented Apr 26, 2017

Filenames proposal

  • --artifacts-path is required for --only-audit and defaults to ./latest.artifacts.log
  • --audit-results-path is required for --only-report and defaults to ./latest.audit-results.log
    NOTE: this file is essentially just the contents of report.audits and the rest of the category and scoring information should be done in --only-report IMO
  • --output is only meaningful/used when running without these --only-* flags
  • --output-path is still used to control location of output
  • --output-path defaults to stdout in these modes with the remaining current exception for the html/domhtml output modes on --only-report

Example

lighthouse -G > latest.artifacts.log
lighthouse -A > latest.audit-results.log
lighthouse -R --output domhtml --view

Alternative proposal

Instead of --only to run exactly 1 stage, what if we control skipping of earlier steps by providing the paths as noted above (artifacts-path and audit-results-path), and we control skipping of later steps with flags such as --skip-report. This offers a greater level of flexibility for say, generating the lighthouse report and supporting --view while hacking on audits instead of running two commands back to back.

# same commands as above but slightly more verbose
lighthouse --skip-audit-results --output-path=./latest.artifacts.log
lighthouse --artifacts-path=./latest.artifacts.log --skip-report --output-path=./latest.audit-results.log
lighthouse --audit-results-path=./latest.audit-results.log --output domhtml --view

# new possibilities enabled
lighthouse --artifacts-path=./latest.artifacts.log --output domhtml --view
lighthouse --skip-report > latest.audit-results.log

@paulirish (Member, Author) commented Apr 27, 2017

Patrick and I discussed this morning and have a slight variation on the above. And then I revised further.


scoring?

first up, we could introduce a scoring phase, which is basically this bit in runner. I guess the question is.. should "the almighty lighthouse result object" include this scoring data?

  • (Currently I favor not exposing scoring phase as a separate thing now and merging it into "auditing" or "report generating". Patrick favors the latter. (We can always separate it later if need be.))

[screenshot: the scoring section of runner]


Basic CLI usage

lighthouse --do=G     # run just gather phase
lighthouse --do=A     # pick up from saved artifacts and run audits
lighthouse --do=R     # pick up from lighthouse result object, generate scores & report
lighthouse --do=GA    # gather and audit. (equiv to the current --output=json)
lighthouse --do=AR    # pick up from saved artifacts and do everything else
lighthouse --do=GAR   # run everything (basically the default)
lighthouse --do=GR    # invalid. you get an error message.

(For the arg name, I'm into --do or --run-stages or whatever.)
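The "GR is invalid" rule falls out of requiring the stages to be a contiguous run of the ordered pipeline G → A → R. A hypothetical validator for the flag (a sketch, not the shipped CLI code):

```javascript
// Valid --do values are contiguous runs of the ordered stages G, A, R,
// so skipping the middle (GR) is rejected.
const VALID_STAGE_RUNS = ['G', 'A', 'R', 'GA', 'AR', 'GAR'];

function parseStages(doFlag) {
  if (!VALID_STAGE_RUNS.includes(doFlag)) {
    throw new Error(`Invalid --do value "${doFlag}": stages must be contiguous (e.g. GA, AR, GAR).`);
  }
  return doFlag.split('');
}
```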

Then we want to satisfy the lighthouse-as-trace-processor use cases, which currently use config.artifacts. We think we can drop that approach and use these flags instead, but we have to sort out how we read these files from disk.

File paths

First up, saving a separate file for the trace and devtools log makes sense, as these files are pretty consumable by other tools. (Also, I argue that since artifact/trace output is usually >20MB, we shouldn't write it (or the full artifacts output) to stdout.)

So we can create a whole folder for our artifacts and write individual files for each artifact in there. (Yes some will be tiny (viewport, etc), but it seems very manageable to understand and debug.)

We'd redefine (I think?) --output-path to do this, and have it always define the folder path that we'll use.
The structure inside that folder is very predictable. It's wiped empty when someone is about to start gathering.

./latest-run/
 - /artifacts/
   - trace.json
   - devtoolslog.json
   - ...
 - result.json
 - report.html

I'm also okay with using our filenameprefix on these filenames as long as the filenames end with these strings -- for example --add-filename-prefix (defaults to off). idgaf.

okay so CLI usage with --output-path:

lighthouse --do=G # --artifacts-path defaults to `./latest-run/artifacts/`
lighthouse --do=G --output-path=./my-run/  # saves artifacts to ./my-run/artifacts/

lighthouse --do=A # looks for artifacts in `./latest-run/`
lighthouse --do=A --output-path=./my-run/  # grabs artifacts from ./my-run/artifacts/
                                           # and saves lh result to ./my-run/result.json

lighthouse --do=R # looks for lighthouse result in `./latest-run/result.json`
lighthouse --do=R --output-path=./my-run/  # looks for lh result object at ./my-run/result.json
                                           # and saves report as `./my-run/report.html`

--output-path

and with the change to --output-path it means our normal behavior without --do is affected

lighthouse --output-path=./airhorner-run/  # creates the below folder

./airhorner-run/
 - /artifacts/
   - trace.json
   - devtoolslog.json
   - ...
 - result.json
 - report.html

--save-assets

Some details TBD, but i'm not too worried. Will include saving a different trace than the original trace saved in artifacts/

@patrickhulce (Collaborator)

Love the folder structure move and the clarity of it. A couple quick additions:

write individual files for each artifact in there.

This unfortunately has to deal with the wrinkles of some artifacts scoped by pass while most aren't. A mega artifacts.json file was also thrown around, which I'd slightly lean towards with the --save-assets still handling the splitting into separate files

Currently I favor not exposing scoring phase as a separate thing now and merging it into "auditing" or "report generating". Patrick favors the latter.

Most of the example demonstrates the former, but I think merging scoring into report generating makes the most sense because that's essentially --output json and allows for iterating on categories and scoring from pre-computed audits rather than having to recompute the audits when scoring changes.

@paulirish (Member, Author)

This unfortunately has to deal with the wrinkles of some artifacts scoped by pass while most aren't.

Oh right. Maybe passName is included in the filename (or we add another folder level?)

./airhorner-run/
 - /artifacts/
   - defaultPass.trace.json
   - defaultPass.devtoolslog.json
   - dbwPass.trace.json
   - dbwPass.devtoolslog.json
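A tiny sketch of that naming scheme (hypothetical helper, not existing Lighthouse code): pass-scoped artifacts get a passName prefix, while pass-independent ones keep a bare filename.

```javascript
// Build the on-disk filename for an artifact, prefixing with the
// pass name only when the artifact is scoped to a particular pass.
function artifactFilename(name, passName) {
  return passName ? `${passName}.${name}.json` : `${name}.json`;
}
```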

--save-assets

Let's definitely kill --save-artifacts as part of this effort. --save-assets has value for the modified trace and screenshots file. Simple for me means this flag now just saves those two files inside of the run folder, next to the result.json and report.html

@brendankenny (Member)

Yeah, I agree it's probably not worth differentiating between audit results and the json report generation stages. It adds a whole extra step if someone is doing this piecemeal, and the only benefit seems to be speeding up development of scoring changes, and arguably the time saved over just rerunning the audits isn't that great (10 or 15 seconds for a bad site?)

@brendankenny (Member)

(but if there was a need I'd definitely be open to it in the future. Dropping the need to re-run gatherers is going to be such a huge improvement by itself that it's probably worth waiting to see what that's like first :)
