-
Notifications
You must be signed in to change notification settings - Fork 14
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
RFC 116: store attachment data in content items
- Loading branch information
Showing
1 changed file
with
227 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,227 @@ | ||
# Store attachment data in content items | ||
|
||
## Summary | ||
|
||
Add a new field to the details hash of content items, `attachments`, | ||
which has metadata about the page's attachments (if any). | ||
|
||
|
||
## Problem | ||
|
||
Pages on GOV.UK can have attachments, which come in a few different | ||
types. Whitehall, which probably has the richest model of | ||
attachments, has: | ||
|
||
- External attachments (links to other websites) | ||
- HTML attachments | ||
- File attachments (which can have previews) | ||
|
||
Attachments are referenced in the content of the page, and make up | ||
part of the govspeak (or HTML) the publishing app sends to the | ||
publishing-api. | ||
|
||
Attachments are *not* referenced anywhere else in the content item. | ||
To get the details of an attachment, you have to parse the body of the | ||
page. This restricts what we can do with attachments. For example, | ||
we cannot generate comprehensive [schema.org][] metadata for | ||
attachments: [see this comment][]. | ||
|
||
Additionally, users of the content API cannot make use of attachments | ||
without parsing the page body. This inhibits creative use of our | ||
content. | ||
|
||
|
||
## Proposal | ||
|
||
In our content schemas we already have an [`asset_link` type][], which | ||
is used for attachments in specialist documents and manual sections. | ||
It's also used in travel advice pages for the page image and a single | ||
attachment. This type solves all of our problems: [Publishing API can | ||
render such attachments with govspeak][], and content items can expose | ||
this data to enable programmatic use. | ||
|
||
However, the metadata Govspeak expects and the `asset_link` type are | ||
*informally* the same. And furthermore, Whitehall has much richer | ||
attachment metadata than either `asset_link` or Govspeak allow. | ||
|
||
Given that `asset_link` is currently used for both images and | ||
attachments, adding all of the missing metadata to `asset_link` feels | ||
a mistake: images and attachments have different metadata, so we | ||
shouldn't conflate the two. | ||
|
||
I propose we do this: | ||
|
||
1. Split `asset_link` into two new types: `attachment_asset` and | ||
`image_asset`, updating the schemas which use `asset_link` | ||
accordingly. | ||
|
||
**Migration concerns:** This can be done without breaking | ||
compatibility. | ||
|
||
1. Add new optional fields for extra metadata: | ||
|
||
- Attachments: | ||
- `accessible` | ||
- `alternative_format_contact_email` | ||
- `command_paper_number` | ||
- `file_size` | ||
- `hoc_paper_number` | ||
- `isbn` | ||
- `locale` | ||
- `number_of_pages` | ||
- `parliamentary_session` | ||
- `preview_url` | ||
- `unique_reference` | ||
- `unnumbered_command_paper` | ||
- `unnumbered_hoc_paper` | ||
- Images: | ||
- `alt_text` | ||
- `caption` | ||
- `credit` | ||
- `id` | ||
|
||
I've left out the following fields because they can be deduced from | ||
others: `csv?`, `external?`, `external_url`, `file_extension`, `html?`, | ||
`opendocument?`, `pdf?` | ||
|
||
And these fields because they don't seem to be used: | ||
`print_meta_data_contact_address`, `web_isbn`. | ||
|
||
And these fields because the Publishing Workflow team are planning | ||
to remove them from Whitehall: `order_url`, `price_in_pence`. | ||
|
||
And these fields because they don't need to be exposed outside the | ||
publishing app: `attachable_id`, `attachable_type`, | ||
`attachment_data_id`, `deleted`, `ordering`, `slug`. | ||
|
||
**Migration concerns:** This can be done without breaking | ||
compatibility. | ||
|
||
1. Replace `content_id` in attachments with `id`. | ||
|
||
**Migration concerns:** this would need to be done in five steps: | ||
add `id`; update app to read from both fields but write to `id`; | ||
republish docs; remove `content_id`; update apps to not use | ||
`content_id`. | ||
|
||
1. Remove non-useful metadata fields: | ||
|
||
- Attachments: `created_at` and `updated_at` | ||
- Images: `content_type` | ||
|
||
**Migration concerns:** this would need to be done in two steps: | ||
update apps; remove fields. | ||
|
||
1. Remove the `details.attachments` field and add a *mandatory* | ||
top-level `attachments` field to all formats which can have | ||
attachments. | ||
|
||
**Migration concerns:** this would need to be done in six steps: | ||
add *optional* top-level field; update apps to read from both | ||
fields but write to the top-level one; republish docs; make field | ||
mandatory; remove old field; update apps to not use | ||
`details.attachments`. | ||
|
||
1. Change the `details.documents` list to store govspeak hashes. | ||
|
||
**Migration concerns:** this would need to be done in five steps: | ||
change schema to allow both types; update frontend apps; update | ||
publishing apps; republish docs; change schema to only allow new | ||
type. | ||
|
||
1. Document the relationship between the `attachment_asset` type in | ||
our content schemas and the data Govspeak attachments use. For | ||
example, all fields required in one must be required in the other. | ||
Any pairs of fields which do the same job in both should have the | ||
same name. Implement any changes arising from this, for example, | ||
more Whitehall-like attachment rendering in Govspeak. | ||
|
||
The final `attachment_asset_list` / `attachment_asset` / `image_asset` | ||
schema will be: | ||
|
||
``` | ||
{ | ||
image_asset: { | ||
type: "object", | ||
additionalProperties: false, | ||
required: [ | ||
"url", | ||
], | ||
properties: { | ||
id: { type: "string", }, | ||
alt_text: { type: "string", }, | ||
caption: { type: "string", }, | ||
credit: { type: "string", }, | ||
url: { type: "string", format: "uri", }, | ||
}, | ||
}, | ||
attachment_asset: { | ||
type: "object", | ||
additionalProperties: false, | ||
required: [ | ||
"url", | ||
"content_type", | ||
], | ||
properties: { | ||
accessible: { type: "boolean", }, | ||
alternative_format_contact_email: { type: "string", }, | ||
command_paper_number: { type: "string", }, | ||
content_type: { type: "string", }, | ||
file_size: { type: "integer", }, | ||
filename: { type: "string", }, | ||
hoc_paper_number: { type: "string", }, | ||
id: { type: "string", }, | ||
isbn: { type: "string", }, | ||
locale: { "$ref": "#/definitions/locale", }, | ||
number_of_pages: { type: "integer", }, | ||
parliamentary_session: { type: "string", }, | ||
preview_url: { type: "string", format: "uri", }, | ||
title: { type: "string", }, | ||
unique_reference: { type: "string", }, | ||
unnumbered_command_paper: { type: "boolean", }, | ||
unnumbered_hoc_paper: { type: "boolean", }, | ||
url: { type: "string", format: "uri", }, | ||
}, | ||
}, | ||
attachment_asset_list: { | ||
description: "An ordered list of attachment links", | ||
type: "array", | ||
items: { | ||
"$ref": "#/definitions/attachment_asset", | ||
}, | ||
}, | ||
} | ||
``` | ||
|
||
### Some design considerations | ||
|
||
**Why two lists?** | ||
|
||
There are really two different types of attachments in documents. | ||
There are attachments which may be referenced in Govspeak but don't | ||
otherwise appear on the page; in this case the attachment hashes are more like metadata | ||
than actual page content. There are also attachments which appear at | ||
the end of the page text, like in publications. | ||
|
||
The top-level `attachments` list would contain both types of | ||
attachments, so there is one list with everything in. The | ||
`details.documents` list would fulfil its current purpose: providing | ||
the ordered list of attachments for publication-like document types. | ||
|
||
**Why split `asset_link`?** | ||
|
||
`asset_link` is currently used for two different things: images and | ||
attachments. However images have strictly less permissable metadata | ||
than attachments, it doesn't make sense to give a page image an ISBN | ||
for example. These should be separate types. | ||
|
||
**Why make the lists mandatory?** | ||
|
||
Is there a difference between a missing list and a present, but empty, | ||
list? Maybe not, but I think it's better to be explicit that there | ||
are no attachments for a document. | ||
|
||
[schema.org]: http://schema.org/ | ||
[see this comment]: https://github.com/alphagov/govuk_publishing_components/pull/1247#pullrequestreview-338008254 | ||
[`asset_link` type]: https://github.com/alphagov/govuk-content-schemas/blob/47a751e7eb193738c2ec43be03b149527a2b8e15/formats/shared/definitions/asset_links.jsonnet | ||
[Publishing API can render such attachments with govspeak]: https://github.com/alphagov/publishing-api/blob/1b8e328540ab96759807787ab87dcb33bbcc72e3/app/presenters/details_presenter.rb#L71-L73 |