Restructure links extracts to keep original links #1055

Merged 5 commits into main from raw-links on Sep 5, 2022
Conversation

dontcallmedom (Member)

Currently, reffy canonicalizes URLs before recording them in the links extracts.
While that canonicalization is useful to group links that point to the same spec, it loses critical information (e.g. for links to multipage specs) and produces data that does not reflect the actual contents of the spec, making it hard to produce reliable analyses.

The patch instead records each original link untouched and annotates it with the spec that can be identified after canonicalization.

This means the structure of links extracts will change.
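For illustration, here is a sketch of the shape change (all property names other than anchors are assumptions based on this discussion, not necessarily the final format):

// Hypothetical before/after shapes for a links extract (illustrative only).
// Before: keys were canonicalized spec URLs, values plain arrays of anchors.
const linksBefore = {
  "https://html.spec.whatwg.org/multipage/": ["the-img-element", "attr-img-src"]
};

// After: keys are the original, untouched link URLs; each value is an object
// whose anchors property lists the anchors, annotated with the spec that
// canonicalization identified (annotation property name assumed here).
const linksAfter = {
  "https://html.spec.whatwg.org/multipage/embedded-content.html": {
    anchors: ["the-img-element", "attr-img-src"],
    specShortname: "html"
  }
};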

@@ -294,7 +294,7 @@ async function saveSpecResults(spec, settings) {
 async function crawlList(speclist, crawlOptions) {
   // Make a shallow copy of crawl options object since we're going
   // to modify properties in place
-  crawlOptions = Object.assign({}, crawlOptions);
+  crawlOptions = Object.assign({speclist}, crawlOptions);
Member

We've never been very clean on separating inputs and outputs. Unless I missed a copy somewhere, we directly store the results of the crawl within the spec objects in speclist, at least temporarily while the results get saved to disk. This means speclist may also contain a few expanded crawl results (or all of them when useCrawl is set or when the user does not specify an output folder), up to >50MB of data.

Here, we're asking Puppeteer to send the list to a separate Chrome process, so Puppeteer will need to serialize/deserialize the list each time. That works, but it seems like a waste of processing time and memory.

I think we should rather make a deeper copy here and create a new list of spec objects that only contain core metadata about each spec (and that won't be touched by the crawl). That will still send >500KB of data but that should be fine.
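A minimal sketch of such a deeper copy, assuming spec objects carry metadata properties along these lines (names illustrative):

// Illustrative only: keep core metadata so the list sent to the Chrome
// process stays small and is never mutated by the crawl results.
const listForBrowser = speclist.map(spec => ({
  url: spec.url,
  shortname: spec.shortname,
  series: spec.series
}));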

I can live with it, but it could perhaps be argued that the extraction code should remain atomic in the sense that its results should only depend on information about the spec and on the spec's contents (extraction code is not the easiest bit to debug since it runs in a headless Chrome instance, so better keep the logic simple). An alternative would be to split the extraction of links into two parts:

  1. The extraction of the raw links in Chrome through extract-links.mjs, without attempting to link them back to specs.
  2. A post-processing module that associates extracted links with specs and runs in the Node.js process; post-processing is a good place to run logic that needs to look at other specs (see the sketch right after this list).
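As a rough sketch of step 2 (helper names and data shapes are assumptions, not Reffy's actual API):

// Hypothetical post-processing step that runs in Node.js: associate each
// raw extracted link with the spec it targets, via canonicalization.
function associateLinksWithSpecs(rawLinks, specs, canonicalizeUrl) {
  const specsByUrl = new Map(specs.map(spec => [canonicalizeUrl(spec.url), spec]));
  const links = {};
  for (const [url, anchors] of Object.entries(rawLinks)) {
    const spec = specsByUrl.get(canonicalizeUrl(url));
    links[url] = {
      anchors,
      // Annotate the link with the spec when one can be identified
      specShortname: spec ? spec.shortname : undefined
    };
  }
  return links;
}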

dontcallmedom (Member, Author)

That approach makes more sense indeed; I've updated the PR to that effect.

@tidoust (Member) left a comment

That looks good to me, thanks for the update!

See inline comments, regarding Strudy in particular, because this will break it...

const fs = require('fs');
const path = require('path');

function canonicalizeUrl(url, options) {
Member

Strudy currently needs canonicalizeUrl, so deleting the mjs file and moving the logic here will break it. That was one of the remaining todos in #747 (comment). I suppose you're looking into Strudy in any case and planning to fix the code there?
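For context, canonicalization here roughly means mapping the many URLs that can point at a spec back to a single URL. A toy sketch follows; the real function handles many more cases (e.g. dated /TR/ versions and multipage specs):

// Toy illustration only; not the actual logic in Reffy or Strudy.
function canonicalizeUrlSketch(url) {
  const u = new URL(url);
  u.hash = '';    // drop the fragment
  u.search = '';  // drop query parameters
  return u.toString();
}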


module.exports = {
dependsOn: ['links'],
input: 'crawl',
Member

Ah, the post-processor is a bit clunky here. This post-processing module can run directly once the spec has been crawled, so input: 'spec' would be the right choice in theory (this would also make it possible to rely on the default save logic that the post-processor contains, instead of re-defining it here, because the post-processor does something else for crawl-wide post-processing modules).

That said, to be able to do that, you would need a way to access the list of specs in the crawl. That could be done either by updating the signature of the run function to run: async function(spec, specs, options), or by copying speclist into the options as you had done previously (no need to worry about memory copies a priori).

Anyway, no big deal for now. We can revisit that later on.
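Under that idea, the module might look like this (the specs parameter reflects the hypothetical signature change discussed above, not an existing Reffy API):

// Hypothetical per-spec variant of the post-processing module, assuming
// the run function were extended to receive the full list of specs.
module.exports = {
  dependsOn: ['links'],
  input: 'spec',
  run: async function (spec, specs, options) {
    // Associate this spec's raw links with the other specs in the crawl here
    return spec;
  }
};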

@tidoust (Member) commented Sep 2, 2022

Forgot to add: we'll need to update webref's crawl job to run the post-processing module in update-ed.yml

More complex options are only needed in strudy, which will maintain its own version for now
See also #1055 (comment)
@dontcallmedom dontcallmedom merged commit 841db67 into main Sep 5, 2022
@dontcallmedom dontcallmedom deleted the raw-links branch September 5, 2022 11:20
dontcallmedom added a commit to w3c/strudy that referenced this pull request Sep 5, 2022
dontcallmedom added a commit to w3c/strudy that referenced this pull request Sep 5, 2022
Following change in w3c/reffy#1055
Also split code between CLI and re-usable library
dontcallmedom added a commit that referenced this pull request Sep 6, 2022
Breaking change:
-  Restructure links extracts to keep original links (#1055)

Dependencies bumped:
- Bump web-specs from 2.19.0 to 2.23.0 (#1056)
- Bump puppeteer from 16.1.1 to 17.1.1 (#1057)
- Bump rollup from 2.78.0 to 2.79.0 (#1054)

How to upgrade:
The breaking change only affects code that makes use of links extracts.
In these extracts, each linked URL now maps to an object with an anchors
property that lists the anchors, instead of mapping directly to an array of anchors.
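In consumer code, the upgrade is roughly the following (a sketch, assuming a links object keyed by linked URL):

// Before: each linked URL mapped directly to an array of anchors.
for (const [url, anchors] of Object.entries(links)) { /* ... */ }

// After: each linked URL maps to an object with an anchors property.
for (const [url, { anchors }] of Object.entries(links)) { /* ... */ }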
tidoust pushed a commit to w3c/strudy that referenced this pull request Sep 6, 2022
* Adapt to removal of canonicalization code in reffy

per w3c/reffy#1055 (comment)

* Adapt to new structure of links data

Following change in w3c/reffy#1055
Also split code between CLI and re-usable library

* Add comment about duplicated code with Reffy