-
Notifications
You must be signed in to change notification settings - Fork 14
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #116 from alphagov/msw/rfc-116
RFC 116: store attachment data in content items
- Loading branch information
Showing
1 changed file
with
393 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,393 @@ | ||
# Store attachment data in content items | ||
|
||
## Summary | ||
|
||
Add a new field to the details hash of content items, `attachments`, | ||
which has metadata about the document's attachments (if any). | ||
|
||
|
||
## Problem | ||
|
||
Documents on GOV.UK can have attachments, which come in a few | ||
different types. Whitehall, which has the richest model of | ||
attachments, has: | ||
|
||
- External attachments (links to other websites) | ||
- HTML attachments | ||
- File attachments (which can have previews) | ||
|
||
When thinking about how attachments are displayed to the user, there | ||
are two sorts: | ||
|
||
- *Document-level attachments*, such on Whitehall publication and | ||
consultation documents, which are displayed separately to the main | ||
content. | ||
|
||
- *Inline attachments*, such as in other Whitehall documents, manual | ||
sections, specialist documents, and travel advice documents, which | ||
are referenced in the page content and rendered as links. | ||
|
||
Publishing applications vary in how easily accessible they make | ||
attachment metadata in their content items. Whitehall only sends | ||
rendered HTML to the Publishing API. Other publishing apps send a | ||
combination of rendered HTML and metadata. | ||
|
||
This inconistency restricts what we can do with attachments, | ||
particularly for Whitehall content. For example, we cannot generate | ||
comprehensive [schema.org][] metadata for attachments: [see this | ||
comment][]. | ||
|
||
Additionally, users of the content API cannot make use of attachments | ||
without parsing the document body. This inhibits creative use of our | ||
content, and makes life more difficult for users such as the National | ||
Archives, who need to be able to record metadata like the HoC paper | ||
number of official publications. | ||
|
||
We should make metadata for Whitehall attachments available in the | ||
same way as other publishing applications, but that requires some | ||
changes into how we represent attachments across the stack. | ||
|
||
|
||
## Proposal | ||
|
||
In our content schemas we already have an [`asset_link` type][], which | ||
is used for attachments in specialist documents and manual sections. | ||
It's also used in travel advice pages for the page image and a single | ||
attachment. This type appears to solve all of our problems: | ||
[Publishing API can render such attachments with govspeak][], and | ||
content items can expose this data to enable programmatic use. | ||
|
||
However, the metadata Govspeak expects and the `asset_link` type are | ||
only *informally* the same. And furthermore, Whitehall has much richer | ||
attachment metadata than either `asset_link` or Govspeak allow. | ||
|
||
Given that `asset_link` is currently used for both images and | ||
attachments, adding all of the missing metadata to `asset_link` feels | ||
a mistake: images and attachments have different metadata, so we | ||
shouldn't conflate the two. | ||
|
||
I propose we do this: | ||
|
||
1. Split `asset_link` into three new types: `file_attachment_asset`, | ||
`specialist_publisher_attachment_asset` and `image_asset`, updating | ||
the schemas which use `asset_link` accordingly, with these fields: | ||
|
||
- `file_attachment_asset`: | ||
- `url` (required) | ||
- `content_type` (required) | ||
- `title` | ||
- `specialist_publisher_attachment_asset` (extends `file_attachment_asset`): | ||
- `content_id` (required) | ||
- `created_at` | ||
- `updated_at` | ||
- `image_asset`: | ||
- `url` (required) | ||
- `content_type` (required) | ||
|
||
**Migration concerns:** This would need to be done in three steps: | ||
1. Update manuals-publisher to not set the `created_at` and `updated_at` for attachments. | ||
2. Republish all manual sections. | ||
3. Update the schemas. | ||
|
||
1. Add new optional fields for extra metadata: | ||
|
||
- `file_attachment_asset`: | ||
- `accessible` | ||
- `alternative_format_contact_email` | ||
- `attachment_type` (enum with permitted values `"file"`) | ||
- `file_size` | ||
- `filename` | ||
- `id` | ||
- `locale` | ||
- `number_of_pages` | ||
- `preview_url` | ||
- `image_asset` | ||
- `alt_text` | ||
- `caption` | ||
- `credit` | ||
|
||
**Migration concerns:** This can be done without breaking | ||
compatibility, as all new fields are optional. | ||
|
||
1. Add a new attachment type for publication attachments: | ||
|
||
`publication_attachment_asset`: | ||
- `command_paper_number` | ||
- `hoc_paper_number` | ||
- `isbn` | ||
- `parliamentary_session` | ||
- `unique_reference` | ||
- `unnumbered_command_paper` | ||
- `unnumbered_hoc_paper` | ||
- plus a one-of combination of: | ||
- the properties of `file_attachment_asset` | ||
- or, | ||
- `attachment_type` (required; enum with permitted values `"html"`) | ||
- `id` | ||
- `locale` | ||
- `title` | ||
- `url` (required) | ||
- or, | ||
- `attachment_type` (required; enum with permitted values `"external"`) | ||
- `id` | ||
- `locale` | ||
- `title` | ||
- `url` (required) | ||
|
||
I've left out the following fields because they can be deduced from | ||
others: `csv?`, `external?`, `external_url`, `file_extension`, `html?`, | ||
`opendocument?`, `pdf?` | ||
|
||
And these fields because they don't seem to be used: | ||
`print_meta_data_contact_address`, `web_isbn`. | ||
|
||
And these fields because the Publishing Workflow team are planning | ||
to remove them from Whitehall: `order_url`, `price_in_pence`. | ||
|
||
And these fields because they don't need to be exposed outside the | ||
publishing app: `attachable_id`, `attachment_data_id`, `deleted`, | ||
`ordering`, `slug`. | ||
|
||
**Migration concerns:** This can be done without breaking | ||
compatibility, as the new type will be unused for now. | ||
|
||
1. Make `attachment_type` and `id` in `file_attachment_asset` mandatory. | ||
|
||
**Migration concerns:** this would need to be done in three steps: | ||
1. Update specialist publisher, travel advice publisher, and | ||
manuals publisher to set the fields. | ||
2. Republish all affected documents. | ||
3. Make the field mandatory. | ||
|
||
1. Add a *mandatory* `details.attachments` field to all formats which | ||
can have attachments. | ||
|
||
**Migration concerns:** this would need to be done in four steps: | ||
1. Add the field as optional. | ||
2. Update publishing apps to write to it. | ||
3. Republish all affected documents. | ||
4. Make the field mandatory. | ||
|
||
1. Add a *mandatory* `details.featured_attachments` list to | ||
publication formats, which is an ordered list of govspeak or HTML | ||
strings (using the `#/definitions/multiple_content_types` schema) | ||
rendering attachments from `details.attachments`. | ||
|
||
**Migration concerns:** this would need to be done in four steps: | ||
1. Add the field as optional. | ||
2. Update publishing apps (just Whitehall and Content Publisher) to write to it. | ||
3. Republish all affected documents. | ||
4. Make the field mandatory. | ||
|
||
1. Remove the `details.documents` field. | ||
|
||
**Migration concerns:** this would need to be done in four steps: | ||
1. Update frontend apps to render from | ||
`details.featured_attachments`. | ||
2. Update publishing apps to not set `details.documents`. | ||
3. Republish all affected documents. | ||
4. Remove the field. | ||
|
||
The final schemas will be: | ||
|
||
``` | ||
local FileAttachmentAssetProperties = { | ||
accessible: { type: "boolean", }, | ||
alternative_format_contact_email: { type: "string", }, | ||
attachment_type: { type: "string", enum: ["file"], }, | ||
content_type: { type: "string", }, | ||
file_size: { type: "integer", }, | ||
filename: { type: "string", }, | ||
id: { type: "string" }, | ||
locale: { "$ref": "#/definitions/locale", }, | ||
number_of_pages: { type: "integer", }, | ||
preview_url: { type: "string", format: "uri", }, | ||
title: { type: "string", }, | ||
url: { type: "string", format: "uri", }, | ||
}; | ||
local HtmlAttachmentAssetProperties = { | ||
attachment_type: { type: "string", enum: ["html"], }, | ||
id: { type: "string" }, | ||
locale: { "$ref": "#/definitions/locale", }, | ||
title: { type: "string", }, | ||
url: { type: "string", format: "uri", }, | ||
}; | ||
local ExternalAttachmentAssetProperties = { | ||
attachment_type: { type: "string", enum: ["external"], }, | ||
id: { type: "string" }, | ||
locale: { "$ref": "#/definitions/locale", }, | ||
title: { type: "string", }, | ||
url: { type: "string", format: "uri", }, | ||
}; | ||
local PublicationAttachmentAssetProperties = { | ||
command_paper_number: { type: "string", }, | ||
hoc_paper_number: { type: "string", }, | ||
isbn: { type: "string", }, | ||
parliamentary_session: { type: "string", }, | ||
unique_reference: { type: "string", }, | ||
unnumbered_command_paper: { type: "boolean", }, | ||
unnumbered_hoc_paper: { type: "boolean", }, | ||
}; | ||
{ | ||
image_asset: { | ||
type: "object", | ||
additionalProperties: false, | ||
required: [ | ||
"content_type", | ||
"url", | ||
], | ||
properties: { | ||
alt_text: { type: "string", }, | ||
caption: { type: "string", }, | ||
content_type: { type: "string", }, | ||
credit: { type: "string", }, | ||
url: { type: "string", format: "uri", }, | ||
}, | ||
}, | ||
file_attachment_asset: { | ||
type: "object", | ||
additionalProperties: false, | ||
required: [ | ||
"attachment_type", | ||
"content_type", | ||
"url", | ||
], | ||
properties: FileAttachmentAssetProperties, | ||
}, | ||
specialist_publisher_attachment_asset: { | ||
type: "object", | ||
additionalProperties: false, | ||
required: [ | ||
"attachment_type", | ||
"content_id", | ||
"content_type", | ||
"url", | ||
], | ||
properties: FileAttachmentAssetProperties + { | ||
content_id: { "$ref": "#/definitions/guid", }, | ||
created_at: { format: "date-time", }, | ||
updated_at: { format: "date-time", }, | ||
}, | ||
}, | ||
publication_attachment_asset: { | ||
oneOf: [ | ||
{ | ||
type: "object", | ||
additionalProperties: false, | ||
required: [ | ||
"attachment_type", | ||
"content_type", | ||
"url", | ||
], | ||
properties: FileAttachmentAssetProperties + PublicationAttachmentAssetProperties, | ||
}, | ||
{ | ||
type: "object", | ||
additionalProperties: false, | ||
required: [ | ||
"attachment_type", | ||
"url", | ||
], | ||
properties: HtmlAttachmentAssetProperties + PublicationAttachmentAssetProperties, | ||
}, | ||
{ | ||
type: "object", | ||
additionalProperties: false, | ||
required: [ | ||
"attachment_type", | ||
"url", | ||
], | ||
properties: ExternalAttachmentAssetProperties + PublicationAttachmentAssetProperties, | ||
} | ||
], | ||
}, | ||
} | ||
``` | ||
|
||
I've validated this proposed schema by regenerating the full schemas | ||
and checking the examples after making the following changes: | ||
|
||
- Change `travel_advice` to use `file_attachment_asset` and `image_asset` | ||
- Change `asset_link_list` to use `specialist_publisher_asset` | ||
|
||
The `what-is-content-design.json` example failed as it is missing | ||
content IDs for its attachments and it was using the specialist | ||
publisher attachment format. For the actual change we'd either | ||
introduce new attachment list types, or do away with `asset_link_list` | ||
entirely and just define the lists in the relevant schemas. | ||
|
||
### Some design considerations | ||
|
||
**Why two lists?** | ||
|
||
As mentioned earlier, there are really two different types of | ||
attachments in documents. There are attachments which appear in the | ||
body, and attachments which appear separately to the normal document | ||
text, like in publications. | ||
|
||
The `details.attachments` list would contain both types of | ||
attachments, so there is one list with everything in. The | ||
`details.featured_attachments` list fulfils its current purpose: | ||
providing the ordered list of document-level attachments for | ||
publication-like document types. | ||
|
||
**Why make the lists mandatory?** | ||
|
||
Is there a difference between a missing list and a present, but empty, | ||
list? Maybe not, but I think it's better to be explicit that there | ||
are no attachments for a document. | ||
|
||
**Why split `asset_link`?** | ||
|
||
`asset_link` is currently used for two different things: images and | ||
attachments. However images have strictly less permissable metadata | ||
than attachments, it doesn't make sense to give a page image an ISBN | ||
for example. These should be separate types. | ||
|
||
**Why have multiple attachment types?** | ||
|
||
Different publishing apps have different allowable metadata, so it | ||
makes sense to have different types constraining what a publishing app | ||
is able to provide. | ||
|
||
We have prior art in using a single-valued enum to make a tagged | ||
union: it's how the top-level schemas do it, eg: | ||
|
||
```json | ||
"schema_name": { | ||
"type": "string", | ||
"enum": [ | ||
"detailed_guide" | ||
] | ||
}, | ||
``` | ||
|
||
**Why `publication_attachment_asset`?** | ||
|
||
Whitehall calls the Publication and Consultation types, which are the | ||
types with featured attachments which present all the extra metadata, | ||
"publicationesque" types. | ||
|
||
**Why do we need a type for specialist publisher?** | ||
|
||
The `specialist_publisher_attachment_asset` is needed because | ||
Specialist Publisher presents the `created_at` and `updated_at` fields | ||
to publishers. Specialist Publisher uses the Publishing API as its | ||
database, so all of that information needs to live in the content | ||
item. | ||
|
||
Manuals Publisher and Travel Advice Publisher have their own | ||
databases, rather than using Publishing API directly as a backing | ||
store, so they dont have this same problem. | ||
|
||
[schema.org]: http://schema.org/ | ||
[see this comment]: https://github.com/alphagov/govuk_publishing_components/pull/1247#pullrequestreview-338008254 | ||
[`asset_link` type]: https://github.com/alphagov/govuk-content-schemas/blob/47a751e7eb193738c2ec43be03b149527a2b8e15/formats/shared/definitions/asset_links.jsonnet | ||
[Publishing API can render such attachments with govspeak]: https://github.com/alphagov/publishing-api/blob/1b8e328540ab96759807787ab87dcb33bbcc72e3/app/presenters/details_presenter.rb#L71-L73 |