Split Gather phase from Audit phase #1806

Closed
paulirish opened this issue Mar 3, 2017 · 12 comments · Fixed by #3743
@paulirish (Member)

We currently can support auditing from saved artifacts, but we haven't really generalized this for all artifacts, just for performance stuff.

Splitting these two up would allow folks who want to run Lighthouse against 1000s of URLs to gather on one machine and audit on another (or farm out to many).

I think this involves saving the artifacts to disk as each gatherer finishes, and then allowing LH to pick up the disk artifacts later and run the remainder of the analysis.

As a side benefit, it'll be nice to do lighthouse --process-last-run (or whatever) during development of an audit/formatter/report rather than doing the entire round trip each iteration.

Off the top of my head, we probably have to figure out:

  1. What this looks like at the CLI
  2. How we save/store/delete the disk artifacts
  3. How the artifacts are retained in the devtools/extension case
  4. How we adapt the existing config-based approach to this

This is open for ideas, discussion, and anyone's help in moving this forward. :)

@patrickhulce (Collaborator)

Definitely would like a swing at this. I think the idea overlaps quite a bit with a transition to the gatherer runner emitting artifacts as they're computed and having a number of things happen in response (some audits computed while the next pass is happening, communicating the results/status to DevTools/WPT, etc.).

@brendankenny (Member) commented Mar 6, 2017

@patrickhulce are you working on this now? The initial request seems much simpler than what you're talking about :)

For the original issue, we really only need to formalize what we're already doing and make it more obvious/usable (like loading any artifact from file, not just performance logs or traces, and having --save-artifacts (or something new) save a config file capable of auditing the saved artifacts).

Running on any saved artifacts is technically already working when they're stored as literals in an artifacts object in the config, e.g. we use it in runner-test.js to test Runner running audits on already-generated artifacts (to be fair, it is mostly dealing with performance artifacts, but there are a few HTTPS artifacts sprinkled in there since the is-on-https audit is so fast)
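For illustration, the literal-artifacts shape looks roughly like this (a sketch of the config pattern being described; the exact artifact fields are assumptions, not copied from runner-test.js):

```javascript
// A config carrying pre-baked artifacts lets Runner skip gathering
// and run the listed audits directly against them.
const config = {
  audits: ['is-on-https'],
  artifacts: {
    HTTPS: {value: true},  // illustrative shape for the HTTPS artifact
  },
};

module.exports = config;
```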

@paulirish (Member, Author)

I put a really gross impl of this in https://github.com/GoogleChrome/lighthouse/compare/dumpartifacts

It hacks up runner pretty well.. but honestly the UX (for LH development) is pretty awesome.

In our last offline convo with Patrick, I indicated it probably makes sense to kick artifacts out of config completely (a breaking change). You could then specify either a separate artifacts config or maybe just an artifacts.json location. I like the idea that config and artifacts are managed separately.

@patrickhulce (Collaborator)

Yeah I'm not actively working on this since Paul showed me his working version, but just started a doc to outline my goals for the extensions I commented on :)

It'd be nice to do this and the report breaking change together in a config v2 format or something, maybe another doc like report requirements would be helpful...

@paulirish (Member, Author) commented Mar 17, 2017

branch: https://github.com/GoogleChrome/lighthouse/compare/dumpartifacts

updated. also includes the #2062 network records -> computed work

paulirish mentioned this issue Mar 30, 2017
brendankenny self-assigned this Apr 14, 2017
@paulirish (Member, Author)

We're getting close, but there are some details to figure out about the API:

lighthouse --only-gather
lighthouse -G

lighthouse --only-audit
lighthouse -A

lighthouse --only-report
lighthouse -R

TBD: how to handle input/output filenames (and their defaults), and the interaction with --output-path.

@patrickhulce (Collaborator) commented Apr 26, 2017

Filenames proposal

  • --artifacts-path is required for --only-audit and defaults to ./latest.artifacts.log
  • --audit-results-path is required for --only-report and defaults to ./latest.audit-results.log
    NOTE: this file is essentially just the contents of report.audits and the rest of the category and scoring information should be done in --only-report IMO
  • --output is only meaningful/used when running without these --only-* flags
  • --output-path is still used to control location of output
  • --output-path defaults to stdout in these modes with the remaining current exception for the html/domhtml output modes on --only-report

Example

lighthouse -G > latest.artifacts.log
lighthouse -A > latest.audit-results.log
lighthouse -R --output domhtml --view

Alternative proposal

Instead of --only to run exactly 1 stage, what if we control skipping of earlier steps by providing the paths as noted above (artifacts-path and audit-results-path), and we control skipping of later steps with flags such as --skip-report. This offers a greater level of flexibility for say, generating the lighthouse report and supporting --view while hacking on audits instead of running two commands back to back.

# same commands as above but slightly more verbose
lighthouse --skip-audit-results --output-path=./latest.artifacts.log
lighthouse --artifacts-path=./latest.artifacts.log --skip-report --output-path=./latest.audit-results.log
lighthouse --audit-results-path=./latest.audit-results.log --output domhtml --view

# new possibilities enabled
lighthouse --artifacts-path=./latest.artifacts.log --output domhtml --view
lighthouse --skip-report > latest.audit-results.log

@paulirish (Member, Author) commented Apr 27, 2017

Patrick and I discussed this morning and have a slight variation on the above. And then I revised further.


scoring?

first up, we could introduce a scoring phase, which is basically this bit in runner. I guess the question is.. should "the almighty lighthouse result object" include this scoring data?

  • (Currently I favor not exposing scoring phase as a separate thing now and merging it into "auditing" or "report generating". Patrick favors the latter. (We can always separate it later if need be.))

[screenshot: the scoring section of runner]


Basic CLI usage

lighthouse --do=G     # run just gather phase
lighthouse --do=A     # pick up from saved artifacts and run audits
lighthouse --do=R     # pick up from lighthouse result object, generate scores & report
lighthouse --do=GA    # gather and audit. (equiv to the current --output=json)
lighthouse --do=AR    # pick up from saved artifacts and do everything else
lighthouse --do=GAR   # run everything (basically the default)
lighthouse --do=GR    # invalid. you get an error message.

(For the arg name, I'm into --do or --run-stages or whatever.)
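The "GR is invalid" rule falls out of requiring the stages to be a contiguous run of the ordered pipeline G → A → R. A hypothetical validator for the flag (a sketch, not the shipped CLI code):

```javascript
// Valid --do values are contiguous runs of the ordered stages G, A, R,
// so skipping the middle (GR) is rejected.
const VALID_STAGE_RUNS = ['G', 'A', 'R', 'GA', 'AR', 'GAR'];

function parseStages(doFlag) {
  if (!VALID_STAGE_RUNS.includes(doFlag)) {
    throw new Error(`Invalid --do value "${doFlag}": stages must be contiguous (e.g. GA, AR, GAR).`);
  }
  return doFlag.split('');
}
```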

Then we want to satisfy the lighthouse-as-trace-processor use cases, which currently use config.artifacts. We think we can drop that approach and use these flags instead, but we have to sort out how we read these files from disk.

File paths

First up, saving a separate file for the trace and devtools log makes sense, as these files are pretty consumable by other tools. (Also, I argue that since artifact/trace output is usually >20MB, we shouldn't write it (or the full artifacts output) to stdout.)

So we can create a whole folder for our artifacts and write individual files for each artifact in there. (Yes some will be tiny (viewport, etc), but it seems very manageable to understand and debug.)

We'd redefine (I think?) --output-path to do this, and have it always define the folder path that we'll use.
The structure inside that folder is very predictable. It's wiped empty when someone is about to start gathering.

./latest-run/
 - /artifacts/
   - trace.json
   - devtoolslog.json
   - ...
 - result.json
 - report.html

I'm also okay with using our filenameprefix on these filenames as long as the filenames end with these strings -- for example --add-filename-prefix (defaults to off). idgaf.

okay so CLI usage with --output-path:

lighthouse --do=G # --artifacts-path defaults to `./latest-run/artifacts/`
lighthouse --do=G --output-path=./my-run/  # saves artifacts to ./my-run/artifacts/

lighthouse --do=A # looks for artifacts in `./latest-run/`
lighthouse --do=A --output-path=./my-run/  # grabs artifacts from ./my-run/artifacts/
                                           # and saves lh result to ./my-run/result.json

lighthouse --do=R # looks for lighthouse result in `./latest-run/result.json`
lighthouse --do=R --output-path=./my-run/  # looks for lh result object at ./my-run/result.json
                                           # and saves report as `./my-run/report.html`

--output-path

and with the change to --output-path it means our normal behavior without --do is affected

lighthouse --output-path=./airhorner-run/  # creates the below folder

./airhorner-run/
 - /artifacts/
   - trace.json
   - devtoolslog.json
   - ...
 - result.json
 - report.html

--save-assets

Some details TBD, but i'm not too worried. Will include saving a different trace than the original trace saved in artifacts/

@patrickhulce (Collaborator)

Love the folder structure move and the clarity of it. A couple quick additions:

write individual files for each artifact in there.

This unfortunately has to deal with the wrinkles of some artifacts scoped by pass while most aren't. A mega artifacts.json file was also thrown around, which I'd slightly lean towards with the --save-assets still handling the splitting into separate files

Currently I favor not exposing scoring phase as a separate thing now and merging it into "auditing" or "report generating". Patrick favors the latter.

Most of the example demonstrates the former, but I think merging scoring into report generating makes the most sense because that's essentially --output json and allows for iterating on categories and scoring from pre-computed audits rather than having to recompute the audits when scoring changes.

@paulirish (Member, Author)

This unfortunately has to deal with the wrinkles of some artifacts scoped by pass while most aren't.

Oh right. Maybe passName is included in the filename (or we add another folder level?)

./airhorner-run/
 - /artifacts/
   - defaultPass.trace.json
   - defaultPass.devtoolslog.json
   - dbwPass.trace.json
   - dbwPass.devtoolslog.json
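A tiny sketch of that naming scheme (hypothetical helper, not existing Lighthouse code): pass-scoped artifacts get a passName prefix, while pass-independent ones keep a bare filename.

```javascript
// Build the on-disk filename for an artifact, prefixing with the
// pass name only when the artifact is scoped to a particular pass.
function artifactFilename(name, passName) {
  return passName ? `${passName}.${name}.json` : `${name}.json`;
}
```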

--save-assets

Let's definitely kill --save-artifacts as part of this effort. --save-assets has value for the modified trace and screenshots file. Simple for me means this flag now just saves those two files inside of the run folder, next to the result.json and report.html

@brendankenny (Member)

Yeah, I agree it's probably not worth differentiating between audit results and the json report generation stages. It adds a whole extra step if someone is doing this piecemeal, and the only benefit seems to be speeding up development of scoring changes, and arguably the time saved over just rerunning the audits isn't that great (10 or 15 seconds for a bad site?)

@brendankenny (Member)

(but if there was a need I'd definitely be open to it in the future. Dropping the need to re-run gatherers is going to be such a huge improvement by itself that it's probably worth waiting to see what that's like first :)
