
Make W3C script use the W3C API as source #828

Merged — 9 commits, Dec 23, 2024
Conversation

@tidoust (Collaborator) commented Oct 20, 2024

[Note: requires #820 and #826 to be merged first, otherwise generation will fail or make a couple of tests fail]

This is a complete rewrite of the W3C update script to switch from the still-maintained-but-deprecated tr.rdf file to the more complete and current W3C API.

What changes? Essentially nothing substantial in terms of data, but:

  • When an entry is updated, the source property will target the API endpoint from which the data was pulled, such as:
    https://api.w3.org/specifications/_shortname_
  • The W3C API has a few additional statuses that were not reported in tr.rdf such as DNOTE, FPWD, LCWD, and the registry statuses.
  • The script fills out properties more systematically for versions.
  • The order of the properties for each entry is not always exactly the same as the order generated by the previous script.
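For reference, the status abbreviations mentioned above expand as follows (these are the standard W3C labels; the mapping object is illustrative, not code from the script):

```javascript
// Standard expansions of the additional status codes exposed by the W3C API
// (illustrative lookup table, not the script's actual code).
const extraStatuses = {
  DNOTE: "Draft Note",
  FPWD: "First Public Working Draft",
  LCWD: "Last Call Working Draft" // historical status from older W3C Process versions
};
```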

The first time the script runs, it will:

  • Fix a few entries of very old specs in Specref, for which the title is not the title of the actual spec.
  • Add entries for the draft registries published by a couple of groups.
  • Complete a few entries with additional versions that did not exist in tr.rdf for some reason.
  • Create consistent obsoletes properties, as Specref contains a few obsoletes properties that don't have a matching obsoletedBy property.
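The consistency pass can be sketched as follows (a hypothetical helper, not the script's actual code; `obsoletes` and `obsoletedBy` are the Specref property names):

```javascript
// Ensure every `obsoletes` link has a matching `obsoletedBy` back-reference.
// `db` maps spec IDs to Specref entries (sketch, not the script's actual code).
function addMissingObsoletedBy(db) {
  for (const [id, entry] of Object.entries(db)) {
    for (const target of entry.obsoletes || []) {
      const obsoleted = db[target];
      if (!obsoleted) continue; // unknown target, leave as-is
      obsoleted.obsoletedBy = obsoleted.obsoletedBy || [];
      if (!obsoleted.obsoletedBy.includes(id)) obsoleted.obsoletedBy.push(id);
    }
  }
  return db;
}
```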

The script only updates recently published entries by default. That is, it does not attempt to refresh the whole list. That's needed because the W3C API follows the HAL convention:
https://en.wikipedia.org/wiki/Hypertext_Application_Language

One consequence is that each API request returns only a minimal amount of information, and re-generating the entire w3c.json file requires sending ~30000 requests. That would be impractical to do on an hourly basis at best, all the more so because the W3C API server has rate-limiting rules in place (6000 requests every 10 minutes). More importantly, it would be a waste of resources, as data essentially never changes once published.
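For context, a HAL payload carries its navigation links under `_links`, so walking a paginated collection means following the `next` relation until it disappears. A minimal helper (illustrative, not the script's code):

```javascript
// Extract the URL of the next page from a HAL response, or null on the last page.
function nextPageUrl(halPage) {
  const next = halPage._links && halPage._links.next;
  return next ? next.href : null;
}
```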

Thus the script takes an incremental approach instead and only refreshes:

  1. Recently published specifications, where "recently published" means published since the newest publication date known to Specref, minus 2 weeks by default. The "minus 2 weeks" is meant to catch data fixes that are sometimes made shortly after publication.
  2. Specifications for which the base info (title, URL) is not aligned with the W3C API. That's meant to fix the data in Specref during the transition, and to catch further updates that could be made to the W3C API once in a while.
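Under the first rule, the default synchronization point can be sketched as (the function name is made up):

```javascript
// Default sync point: newest publication date known to Specref, minus two
// weeks, to catch data fixes made shortly after publication (sketch).
const TWO_WEEKS_MS = 14 * 24 * 60 * 60 * 1000;
function defaultSyncPoint(newestPublicationDate) {
  return new Date(newestPublicationDate.getTime() - TWO_WEEKS_MS);
}
```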

All in all, a typical update should send ~500 requests to the W3C API. The code throttles requests to 1 every 100ms. Running the script should take ~1-2 minutes.
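The throttling amounts to something like the following (a simplified sequential sketch; the script's actual implementation may differ):

```javascript
// Run async tasks one at a time, waiting `intervalMs` between them (sketch).
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function runThrottled(tasks, intervalMs = 100) {
  const results = [];
  for (const task of tasks) {
    results.push(await task());
    await sleep(intervalMs);
  }
  return results;
}
```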

A more thorough refresh may be forced by calling the script with a date parameter (format YYYY-MM-DD, with month and day optional). The date gets interpreted as the synchronization point. For example, to refresh all specs published since 2023, run:

```
node scripts/w3c.js 2023
```

To force a "full" refresh (any year prior to 1995 would work):

```
node scripts/w3c.js 1995
```
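Interpreting that argument (YYYY-MM-DD with month and day optional) boils down to something like this (hypothetical helper name):

```javascript
// Parse "YYYY", "YYYY-MM" or "YYYY-MM-DD" into a UTC Date (sketch).
function parseSyncArg(arg) {
  const [year, month = 1, day = 1] = arg.split("-").map(Number);
  return new Date(Date.UTC(year, month - 1, day));
}
```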

A full refresh sends ~30k requests to the W3C API and may take >2h. I suggest running a full refresh manually once, shortly after this script starts being used, and then running it again every few months to capture potential fixes made to the data in the meantime.

Running that full refresh will also be useful to fix the few obsoletedBy properties that are not fully correct, and to move a few hasErrata links to the right spec version, as some entries have these links at the root level of the entry in Specref, whereas the latest version is no longer the REC that linked to the errata.

I worked with @deniak, who helped me understand the data in the W3C API, and who fixed and completed the data where Specref had more correct info. See also #826 for updating a few entries in Specref that would create issues with the new script.

The script contains a number of comments explaining the different cases that need to be handled to fully map the data in the W3C API onto the data in Specref. A few entries will remain where the mapping is somewhat imperfect, notably when the shortname of a spec evolved from a level-less shortname to a shortname with a level, and sometimes back again (examples include user-timing and performance-timeline). There are also a few entries for old specs that are flagged as retired in Specref (isRetired: true) but not in the W3C API. Mismatches are reported to the console as warnings and should be addressed over time. In all such cases, the script preserves the information in Specref.

The script also preserves the information in Specref in case of transient network errors while fetching info from the W3C API.

The new overwrites rules are needed during the transition (the changes need to be made at the same time as the data gets updated), but can be dropped afterwards. They affect specifications that switched from a shortname without a level to a shortname with a level. Longer term, these should be handled through the notion of specification series (see #811).

@tidoust tidoust mentioned this pull request Oct 21, 2024
@deniak (Collaborator) left a comment
Fantastic job @tidoust !
I'm not very familiar with how specref records the data but the W3C API logic looks good to me.
I left some minor comments to the PR.

tidoust and others added 4 commits October 22, 2024 23:16
Co-authored-by: Denis Ah-Kang <[email protected]>
@tobie (Owner) commented Oct 23, 2024

Thanks for all of this work. LMK when you feel we're ready to shift APIs and let's sync about it before we go ahead with it.

@deniak (Collaborator) left a comment

Thanks again for that work @tidoust.
The PR looks good to me.

The `--filter=[shortname]` CLI option tells the script to only update the
specs whose shortnames start with the provided one. This option is intended
to be used to force a refresh of a specific family of specs without having to
run a full and lengthy refresh.
The overwrites will only become needed once the related entries get updated,
and that will only happen once we run a full refresh. In the meantime, they
create missing entries.
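The prefix matching behind `--filter` can be sketched as (illustrative only, not the script's actual code):

```javascript
// Keep only the specs whose shortname starts with the `--filter` value (sketch).
function filterShortnames(shortnames, prefix) {
  return prefix ? shortnames.filter(name => name.startsWith(prefix)) : shortnames;
}
```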

The User Timing specs have been going back and forth between shortnames,
which led to some entries being incorrectly associated with `user-timing-1`,
while the latest publications did not appear under `user-timing-3`. This
update performs a full refresh of the entries and adds a few aliases for
continuity (these entries should not have existed at all, but since they do,
we need to keep them around). Main entries are now `user-timing-1`,
`user-timing-2` and `user-timing`, with `user-timing-3` defined as an alias
of `user-timing`, consistent with the info in the W3C API and with how the
Web Performance WG wanted to proceed with the third level.

The entries for `vocab-dcat`, `xmlenc-core`, `wot-architecture` and
`wot-thing-description` will need to be manually fixed as well, but that can
wait until we run a full sync.
@tidoust (Collaborator, Author) commented Nov 14, 2024
tidoust commented Nov 14, 2024

A couple of final updates:

  • I dropped the overwrites after all because they will only become needed when we run a full refresh, and we should only do that in a second step.
  • I did force a refresh of the User Timing entries because the history of shortnames has been hectic there, and aligning with the W3C API seems more future-proof.

I also merged the main branch to make sure that the script continues to run well with recent updates made to w3c.json.

> LMK when you feel we're ready to shift APIs and let's sync about it before we go ahead with it.

@tobie, I think we're ready :) Now this code needs the Node.js bump in #820, and I do not know how that bump impacts the deployment on the server.

@tobie tobie merged commit 19a9511 into tobie:main Dec 23, 2024
1 check passed