Consider supporting entries for specification series #811

Open
tidoust opened this issue Aug 20, 2024 · 3 comments

Comments

@tidoust
Collaborator

tidoust commented Aug 20, 2024

Via discussion in #810 (comment)

Various specs in W3C come in levels that belong to the same series. Both levels and series have shortnames in W3C, but the series shortname is best seen as a way to redirect to the "current" level (which evolves over time). For example, css-fonts-3, css-fonts-4 and css-fonts-5 all belong to the css-fonts series. At the time of writing, the current spec is css-fonts-4.

Levels may be created after a first iteration on a spec. For example, media-source-1 was known as media-source until level 2 got created. media-source is now the shortname of the underlying series, the current spec being media-source-2.

Specref does not have the concept of series. That means no entry for css-fonts, and no easy way to "follow" the current specification in the latter case, requiring clunky workarounds, as done in #810. It would be great to support series shortnames more directly in Specref.

The end goal would be to make it easy to reference a spec by:

  1. its series shortname, such as css-fonts, in which case the reference goes to the latest version of the current level in the series.
  2. its shortname, such as css-fonts-5, in which case the reference goes to the latest version of that level.
  3. a dated version, such as css-fonts-4-20240201, in which case the reference goes to the specified version. The dated version typically includes the level, so we should not need to create or use a css-fonts-20240201 entry!
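
The three cases above could be told apart roughly as follows. This is only a sketch under stated assumptions: the `SERIES` table and the `resolveReference` function are hypothetical names, the table is filled by hand from the examples in this issue, and nothing here reflects how Specref actually stores or resolves entries.

```javascript
// Hypothetical series table: series shortname -> shortname of the current level.
// Values come from the examples in this issue, not from Specref or the W3C API.
const SERIES = new Map([
  ['css-fonts', 'css-fonts-4'],
  ['media-source', 'media-source-2']
]);

// Classify a reference identifier as a dated version, a series or a level.
function resolveReference(id, series = SERIES) {
  const key = id.toLowerCase();

  // Case 3: dated version, e.g. "css-fonts-4-20240201" (8 trailing digits).
  const dated = key.match(/^(.+)-(\d{8})$/);
  if (dated) {
    return { kind: 'dated', shortname: dated[1], date: dated[2] };
  }

  // Case 1: series shortname, redirected to the current level.
  if (series.has(key)) {
    return { kind: 'series', shortname: series.get(key), date: null };
  }

  // Case 2: anything else is treated as a regular (level) shortname.
  return { kind: 'level', shortname: key, date: null };
}
```

With this sketch, `css-fonts` resolves to `css-fonts-4`, `css-fonts-5` stays as is, and `media-source-20140717` is still recognized as a dated version of `media-source`.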

Adding support for series requires switching the W3C update script to the W3C API, because tr.rdf does not know anything about series either.

Note the need to preserve past dated versions in the case where the level gets introduced afterwards (e.g. media-source-20140717 needs to exist even though the media-source spec is now known as media-source-1). Also note that some specs start without a level, then get a level, and then get back to a no-level mode...

(I don't think the notion of specification series exists in other SDOs, although that could perhaps be useful to track versions of ETSI and ISO specs longer term)

tidoust added a commit that referenced this issue Aug 20, 2024
The `encrypted-media` entry was an alias of `encrypted-media-1`, but the series
shortname should now redirect to `encrypted-media-2`.

Unfortunately, this means adding a bunch of aliases to preserve previous dated
entries. That's slightly ugly (although easy to prepare with a good text editor
that supports multi-line edits) but there's no real better way until Specref
understands series shortnames (tracked in #811).
@tobie
Owner

tobie commented Aug 20, 2024

My suggestion would be to have the dated versions tied to the unversioned entry, so: FOO-20240201, not FOO-2-20240201.

@tidoust
Collaborator Author

tidoust commented Aug 20, 2024

> My suggestion would be to have the dated versions tied to the unversioned entry, so: FOO-20240201, not FOO-2-20240201.

In the css-fonts example, both css-fonts-4 and css-fonts-5 are currently active (that situation is not uncommon for CSS specs). That means it's possible to end up in a situation where both levels get published on the same day. That would create duplicate CSS-FONTS-YYYYMMDD entries. That shouldn't happen much in practice, for sure, but that may happen...
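
To make the collision concrete, here is a small sketch. The two publications and their shared date are made up for illustration, and `findCollisions` is a hypothetical helper, not something that exists in Specref.

```javascript
// Hypothetical publications: two levels of the same series published on the same day.
const publications = [
  { series: 'css-fonts', shortname: 'css-fonts-4', date: '20240201' },
  { series: 'css-fonts', shortname: 'css-fonts-5', date: '20240201' }
];

// Return the keys that more than one publication would map to.
function findCollisions(pubs, makeKey) {
  const counts = new Map();
  for (const pub of pubs) {
    const key = makeKey(pub);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return [...counts].filter(([, count]) => count > 1).map(([key]) => key);
}

// Dated keys tied to the series shortname collide...
const bySeries = findCollisions(publications, p => `${p.series}-${p.date}`.toUpperCase());
// ...while dated keys tied to the level shortname do not.
const byLevel = findCollisions(publications, p => `${p.shortname}-${p.date}`.toUpperCase());

console.log(bySeries); // [ 'CSS-FONTS-20240201' ]
console.log(byLevel);  // []
```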

@tobie
Owner

tobie commented Aug 20, 2024

Fair point.

tidoust added a commit to tidoust/specref that referenced this issue Oct 20, 2024
This is a complete re-write of the W3C update script to switch from the
still-maintained-but-deprecated `tr.rdf` file to the more complete and current
W3C API instead.

What changes? Essentially nothing substantial in terms of data, but:
- When an entry is updated, the `source` property will target the API endpoint
from which the data was pulled, such as:
`https://api.w3.org/specifications/_shortname_`
- The W3C API has a few additional statuses that were not reported in `tr.rdf`
such as `DNOTE`, `FPWD`, `LCWD`, and the registry statuses.
- The script fills out properties more systematically for versions.
- The order of the properties for each entry is not always exactly the same as
the order generated by the previous script.

The first time the script runs, it will:
- Fix a few entries of very old specs in Specref, for which the title is not
the title of the actual spec.
- Add entries for the draft registries published by a couple of groups.
- Complete a few entries with additional versions that did not exist in
`tr.rdf` for some reason.
- Create consistent `obsoletes` properties, as Specref contains a few
`obsoletes` properties that don't have a matching `obsoletedBy` property.
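
As an illustration of the last point, the check for `obsoletes` properties without a matching `obsoletedBy` could look like the sketch below. It assumes both properties are arrays of entry identifiers; `findUnmatchedObsoletes` and the sample entries are hypothetical and are not taken from the script.

```javascript
// List every "A obsoletes B" relation for which B does not say "obsoletedBy A".
// Assumes `obsoletes` and `obsoletedBy` are arrays of entry identifiers.
function findUnmatchedObsoletes(entries) {
  const mismatches = [];
  for (const [id, entry] of Object.entries(entries)) {
    for (const target of entry.obsoletes || []) {
      const back = (entries[target] && entries[target].obsoletedBy) || [];
      if (!back.includes(id)) {
        mismatches.push({ id, obsoletes: target });
      }
    }
  }
  return mismatches;
}

// Hypothetical data: "new-spec" obsoletes two entries, only one links back.
const sample = {
  'new-spec': { obsoletes: ['old-spec-a', 'old-spec-b'] },
  'old-spec-a': { obsoletedBy: ['new-spec'] },
  'old-spec-b': {}
};
console.log(findUnmatchedObsoletes(sample));
// [ { id: 'new-spec', obsoletes: 'old-spec-b' } ]
```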

The script only updates recently published entries by default. That is, it does
not attempt to refresh the whole list. That's needed because the W3C API
follows the HAL convention:
https://en.wikipedia.org/wiki/Hypertext_Application_Language

One consequence is that each API request returns only a minimal amount of
information, and re-generating the entire `w3c.json` file requires sending
~30000 requests, which would be at best impractical to do on an hourly basis,
all the more so because the W3C API server has rate-limiting rules in place
(6000 requests every 10 minutes). More importantly, that would be a waste of
resources as data essentially never changes once published.

Thus the script takes an incremental approach instead and only refreshes:
1. Specifications recently published... where recently published means
specifications published since the newest publication date known to Specref
minus 2 weeks by default. The "minus 2 weeks" is meant to catch data fixes that
are sometimes made shortly after publication.
2. Specifications for which the base info (title, URL) is not aligned with the
W3C API. That's meant to fix the data in Specref during the transition, and to
catch further updates that could be made to the W3C API once in a while.
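
The two refresh conditions can be sketched as below. This is a simplified illustration, not the script's code: `computeSyncPoint`, `needsRefresh` and the `published`, `title` and `href` field names are assumptions, and dates are taken to be `YYYY-MM-DD` strings.

```javascript
const TWO_WEEKS_MS = 14 * 24 * 60 * 60 * 1000;

// Synchronization point: newest known publication date minus a safety margin.
// `knownDates` is a non-empty array of "YYYY-MM-DD" strings.
function computeSyncPoint(knownDates, marginMs = TWO_WEEKS_MS) {
  if (knownDates.length === 0) {
    throw new Error('At least one known publication date is required');
  }
  // ISO dates sort lexicographically, so a string comparison finds the newest.
  const newest = knownDates.reduce((max, d) => (d > max ? d : max));
  const timestamp = new Date(`${newest}T00:00:00Z`).getTime();
  return new Date(timestamp - marginMs).toISOString().slice(0, 10);
}

// A spec gets refreshed if it was published on or after the sync point (case 1)
// or if its base info differs from what the API reports (case 2).
function needsRefresh(local, remote, syncPoint) {
  return remote.published >= syncPoint ||
    local.title !== remote.title ||
    local.href !== remote.href;
}

console.log(computeSyncPoint(['2024-02-01', '2024-10-15', '2023-01-01']));
// 2024-10-01
```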

All in all, a typical update should send ~500 requests to the W3C API. The
code throttles requests to 1 every 100ms. Running the script should take ~1-2
minutes.
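
For reference, one request every 100ms works out to 6000 requests per 10 minutes, the same figure as the rate limit quoted above. A minimal sketch of such throttling follows, assuming requests are sent strictly one after the other; `runThrottled` and `minDurationSeconds` are hypothetical helpers, not the script's actual functions.

```javascript
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Run async tasks sequentially, starting at most one task every `intervalMs`.
async function runThrottled(tasks, intervalMs = 100) {
  const results = [];
  for (const task of tasks) {
    const started = Date.now();
    results.push(await task());
    const elapsed = Date.now() - started;
    if (elapsed < intervalMs) {
      await sleep(intervalMs - elapsed);
    }
  }
  return results;
}

// Lower bound on the run time: throttling delay only, network latency ignored.
function minDurationSeconds(requestCount, intervalMs = 100) {
  return (requestCount * intervalMs) / 1000;
}

console.log(minDurationSeconds(500));   // 50 (seconds, typical update)
console.log(minDurationSeconds(30000)); // 3000 (seconds, i.e. 50 minutes, full refresh)
```

The lower bounds are consistent with the estimates given in this message: actual runs take longer because each request also has to wait for the server's response.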

A more thorough refresh may be forced by calling the script with a date as
parameter (format YYYY-MM-DD, with month and day optional). The date gets
interpreted as the synchronization point. For example, to refresh all specs
published since 2023, run:

```
node scripts/w3c.js 2023
```

To force a "full" refresh (any year prior to 1995 would work):

```
node scripts/w3c.js 1995
```
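
The date parameter could be parsed along these lines. This is a sketch only: `parseSyncDate` is a hypothetical name, and defaulting a missing month or day to `01` is an assumption about what "optional" means here.

```javascript
// Parse "YYYY", "YYYY-MM" or "YYYY-MM-DD" into a full "YYYY-MM-DD" string.
// Assumption: a missing month or day defaults to "01".
function parseSyncDate(arg) {
  const match = /^(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?$/.exec(arg);
  if (!match) {
    throw new Error(`Invalid date "${arg}", expected YYYY[-MM[-DD]]`);
  }
  const [, year, month = '01', day = '01'] = match;
  return `${year}-${month}-${day}`;
}

console.log(parseSyncDate('2023'));       // 2023-01-01
console.log(parseSyncDate('2023-06'));    // 2023-06-01
console.log(parseSyncDate('2023-06-15')); // 2023-06-15
```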

A full refresh sends ~30k requests to the W3C API and may take >2h. I suggest
running a full refresh manually once, shortly after this script starts being
used, and then to run it again every few months to capture potential fixes that
might have been made to the data in the meantime.

Running that full refresh will also be useful to fix the few `obsoletedBy`
properties that are not fully correct, and to move a few `hasErrata` links
to the right spec version, as some entries have these links at the root level
of the entry in Specref, whereas the latest version is no longer the REC that
linked to the errata.

I worked with @deniak to fix and complete the data in the W3C API where Specref
had more correct info. I also updated entries that contained incorrect info in
Specref.

The script contains a number of comments to explain the different cases that
need to be handled to be able to fully map the data in the W3C API with the
data in Specref. There will remain a few entries where the mapping is somewhat
imperfect, notably when the shortname of a spec evolved from a level-less
shortname to a shortname with a level, and sometimes back to a level-less
shortname (examples include `user-timing`, `performance-timeline`). There are
also a few entries for old specs that are flagged as retired in Specref
(`isRetired: true`) but not in the W3C API. Mismatches are reported to the
console as warnings. These should be addressed over time. The script preserves
the information in Specref in any case.

The script also preserves the information in Specref in case of transient
network errors while fetching info from the W3C API.

The new overwrites rules are needed during the transition (the changes need to
be made at the same time as the data gets updated), but can be dropped
afterwards. They affect specifications that switched from a shortname without
a level to a shortname with a level. Longer term, these should be handled
through the notion of specification series (see tobie#811).
@tidoust tidoust mentioned this issue Nov 29, 2024
tobie pushed a commit that referenced this issue Dec 23, 2024
* Make W3C script use the W3C API as source

* Update scripts/w3c.js

Co-authored-by: Denis Ah-Kang <[email protected]>

* Update scripts/w3c.js

Co-authored-by: Denis Ah-Kang <[email protected]>

* Update scripts/w3c.js

Co-authored-by: Denis Ah-Kang <[email protected]>

* Update scripts/w3c.js

Co-authored-by: Denis Ah-Kang <[email protected]>

* Add CLI filter option to only refresh specific specs

The `--filter=[shortname]` CLI option tells the script to only update the
specs whose shortnames start with the provided one. This option is intended
to be used to force refresh of a specific family of specs without having to
run a full and lengthy refresh.
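
A minimal sketch of the prefix matching described above, assuming the option is read from the command-line arguments; `parseFilter` and `applyFilter` are hypothetical names and the real option handling in `scripts/w3c.js` may differ.

```javascript
// Extract the value of a `--filter=<shortname>` argument, or null if absent.
function parseFilter(argv) {
  const prefix = '--filter=';
  const arg = argv.find(a => a.startsWith(prefix));
  return arg ? arg.slice(prefix.length) : null;
}

// Keep only the shortnames that start with the filter (all of them if no filter).
function applyFilter(shortnames, filter) {
  return filter ? shortnames.filter(name => name.startsWith(filter)) : shortnames;
}

const filter = parseFilter(['node', 'scripts/w3c.js', '--filter=user-timing']);
console.log(applyFilter(['user-timing', 'user-timing-1', 'performance-timeline'], filter));
// [ 'user-timing', 'user-timing-1' ]
```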

* Drop overwrites / Force refresh of user-timing entries.

The overwrites will only become needed once the related entries get updated
and that will only happen once we run a full refresh. In the meantime, they
create missing entries.

The User Timing specs have been going back and forth between shortnames,
which led to some entries incorrectly associated with `user-timing-1`, while
the latest publications did not appear under `user-timing-3`. This update performs
a full refresh of the entries and adds a few aliases for continuity (these
entries should not have existed at all but since they do, we need to keep
them around). Main entries are now `user-timing-1`, `user-timing-2` and
`user-timing`, with `user-timing-3` defined as an alias of `user-timing`,
consistent with the info in the W3C API and with how the Web Performance WG wanted to proceed with the third level.
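
For illustration, resolving such an alias could look like the sketch below. It assumes aliases are expressed through an `aliasOf` property; `resolveAlias` and the sample entries (including the title) are hypothetical and are not copied from `w3c.json`.

```javascript
// Follow `aliasOf` links until a main entry is reached.
// Returns null for unknown identifiers; `maxDepth` guards against alias loops.
function resolveAlias(entries, id, maxDepth = 10) {
  let key = id;
  for (let depth = 0; depth < maxDepth; depth++) {
    const entry = entries[key];
    if (!entry) return null;
    if (!entry.aliasOf) return { id: key, entry };
    key = entry.aliasOf;
  }
  throw new Error(`Alias chain too long starting at "${id}"`);
}

// Hypothetical entries mirroring the layout described above.
const sampleEntries = {
  'user-timing': { title: 'User Timing' },
  'user-timing-3': { aliasOf: 'user-timing' }
};
console.log(resolveAlias(sampleEntries, 'user-timing-3').id); // user-timing
```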

The entries for `vocab-dcat`, `xmlenc-core`, `wot-architecture` and
`wot-thing-description` will need to be manually fixed as well, but that can
wait until we run a full sync.

---------

Co-authored-by: Denis Ah-Kang <[email protected]>