Restructure links extracts to keep original links #1055

Merged 5 commits into main from raw-links on Sep 5, 2022
Conversation

dontcallmedom (Member)

Currently, reffy canonicalizes URLs before recording them in the links extracts.
While that canonicalization is useful to group links that point to the same spec, it loses critical information (e.g. for links to multipage specs) and produces data that does not reflect the actual contents of the spec, making it hard to produce reliable analyses.

The patch instead records each original link untouched and annotates it with the spec that can be identified after canonicalization.

This means the structure of links extracts will change.
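For illustration, here is a sketch of the shape change (all property names other than anchors are assumptions based on this discussion, not necessarily the final format):

// Hypothetical before/after shapes for a links extract (illustrative only).
// Before: keys were canonicalized spec URLs, values plain arrays of anchors.
const linksBefore = {
  "https://html.spec.whatwg.org/multipage/": ["the-img-element", "attr-img-src"]
};

// After: keys are the original, untouched link URLs; each value is an object
// whose anchors property lists the anchors, annotated with the spec that
// canonicalization identified (annotation property name assumed here).
const linksAfter = {
  "https://html.spec.whatwg.org/multipage/embedded-content.html": {
    anchors: ["the-img-element", "attr-img-src"],
    specShortname: "html"
  }
};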

@@ -294,7 +294,7 @@ async function saveSpecResults(spec, settings) {
 async function crawlList(speclist, crawlOptions) {
   // Make a shallow copy of crawl options object since we're going
   // to modify properties in place
-  crawlOptions = Object.assign({}, crawlOptions);
+  crawlOptions = Object.assign({speclist}, crawlOptions);
Member

We've never been very clean on separating inputs and outputs. Unless I missed a copy somewhere, we directly store the results of the crawl within the spec objects in speclist, at least temporarily while the results get saved to disk. This means speclist may also contain a few expanded crawl results (or all of them when useCrawl is set or when the user does not specify an output folder), up to >50MB of data.

Here, we're asking Puppeteer to send the list to a separate Chrome process, so Puppeteer will need to serialize/deserialize the list each time. That works, but it seems like a waste of processing time and memory.

I think we should rather make a deeper copy here and create a new list of spec objects that only contain core metadata about each spec (and that won't be touched by the crawl). That will still send >500KB of data but that should be fine.
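A minimal sketch of such a deeper copy, assuming spec objects carry metadata properties along these lines (names illustrative):

// Illustrative only: keep core metadata so the list sent to the Chrome
// process stays small and is never mutated by the crawl results.
const listForBrowser = speclist.map(spec => ({
  url: spec.url,
  shortname: spec.shortname,
  series: spec.series
}));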

I can live with it, but it could perhaps be argued that the extraction code should remain atomic in the sense that its results should only depend on information about the spec and on the spec's contents (extraction code is not the easiest bit to debug since it runs in a headless Chrome instance, so better keep the logic simple). An alternative would be to split the extraction of links into two parts:

  1. The extraction of the raw links in Chrome through extract-links.mjs, without attempting to link them back to specs.
  2. A post-processing module that associates extracted links with specs and runs in the Node.js process; post-processing is a good place to run logic that needs to look at other specs (see the sketch right after this list).
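As a rough sketch of step 2 (helper names and data shapes are assumptions, not Reffy's actual API):

// Hypothetical post-processing step that runs in Node.js: associate each
// raw extracted link with the spec it targets, via canonicalization.
function associateLinksWithSpecs(rawLinks, specs, canonicalizeUrl) {
  const specsByUrl = new Map(specs.map(spec => [canonicalizeUrl(spec.url), spec]));
  const links = {};
  for (const [url, anchors] of Object.entries(rawLinks)) {
    const spec = specsByUrl.get(canonicalizeUrl(url));
    links[url] = {
      anchors,
      // Annotate the link with the spec when one can be identified
      specShortname: spec ? spec.shortname : undefined
    };
  }
  return links;
}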

dontcallmedom (Member, Author)

That approach makes more sense indeed; I've updated the PR to that effect.

@tidoust (Member) left a comment

That looks good to me, thanks for the update!

See inline comments, regarding Strudy in particular, because this will break it...

const fs = require('fs');
const path = require('path');

function canonicalizeUrl(url, options) {
Member

Strudy currently needs canonicalizeUrl, so deleting the mjs file and moving the logic here will break it. That was one of the remaining todos in #747 (comment). I suppose you're looking into Strudy in any case and planning to fix the code there?
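For context, canonicalization here roughly means mapping the many URLs that can point at a spec back to a single URL. A toy sketch follows; the real function handles many more cases (e.g. dated /TR/ versions and multipage specs):

// Toy illustration only; not the actual logic in Reffy or Strudy.
function canonicalizeUrlSketch(url) {
  const u = new URL(url);
  u.hash = '';    // drop the fragment
  u.search = '';  // drop query parameters
  return u.toString();
}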


module.exports = {
dependsOn: ['links'],
input: 'crawl',
Member

Ah, the post-processor is a bit clunky here. This post-processing module can run directly once the spec has been crawled, so input: 'spec' would be the right choice in theory (this would also make it possible to rely on the default save logic that the post-processor contains, instead of re-defining it here, because the post-processor does something else for crawl-wide post-processing modules).

That said, to be able to do that, you would need a way to access the list of specs in the crawl. That could be done either by updating the signature of the run function to run: async function(spec, specs, options), or by copying speclist into the options as you had done previously (no need to worry about memory copies a priori).

Anyway, no big deal for now. We can revisit that later on.
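Under that idea, the module might look like this (the specs parameter reflects the hypothetical signature change discussed above, not an existing Reffy API):

// Hypothetical per-spec variant of the post-processing module, assuming
// the run function were extended to receive the full list of specs.
module.exports = {
  dependsOn: ['links'],
  input: 'spec',
  run: async function (spec, specs, options) {
    // Associate this spec's raw links with the other specs in the crawl here
    return spec;
  }
};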

@tidoust (Member) commented Sep 2, 2022

Forgot to add: we'll need to update webref's crawl job to run the post-processing module in update-ed.yml

More complex options are only needed in strudy, which will maintain its own version for now
See also #1055 (comment)
@dontcallmedom dontcallmedom merged commit 841db67 into main Sep 5, 2022
@dontcallmedom dontcallmedom deleted the raw-links branch September 5, 2022 11:20
dontcallmedom added a commit to w3c/strudy that referenced this pull request Sep 5, 2022
dontcallmedom added a commit to w3c/strudy that referenced this pull request Sep 5, 2022
Following change in w3c/reffy#1055
Also split code between CLI and re-usable library
dontcallmedom added a commit that referenced this pull request Sep 6, 2022
Breaking change:
-  Restructure links extracts to keep original links (#1055)

Dependencies bumped:
- Bump web-specs from 2.19.0 to 2.23.0 (#1056)
- Bump puppeteer from 16.1.1 to 17.1.1 (#1057)
- Bump rollup from 2.78.0 to 2.79.0 (#1054)

How to upgrade:
The breaking change only affects code that makes use of links extracts.
In these extracts, each linked URL now maps to an object with an anchors
property that lists the anchors, instead of mapping directly to an array of anchors.
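In consumer code, the upgrade is roughly the following (a sketch, assuming a links object keyed by linked URL):

// Before: each linked URL mapped directly to an array of anchors.
for (const [url, anchors] of Object.entries(links)) { /* ... */ }

// After: each linked URL maps to an object with an anchors property.
for (const [url, { anchors }] of Object.entries(links)) { /* ... */ }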
tidoust pushed a commit to w3c/strudy that referenced this pull request Sep 6, 2022
* Adapt to removal of canonicalization code in reffy

per w3c/reffy#1055 (comment)

* Adapt to new structure of links data

Following change in w3c/reffy#1055
Also split code between CLI and re-usable library

* Add comment about duplicated code with Reffy