From bde9f74a3f445b2ba8c3c28343171e2634237cd6 Mon Sep 17 00:00:00 2001 From: Michael Walker Date: Fri, 3 Jan 2020 11:51:49 +0000 Subject: [PATCH] RFC 116: store attachment data in content items --- ...-store-attachment-data-in-content-items.md | 227 ++++++++++++++++++ 1 file changed, 227 insertions(+) create mode 100644 rfc-116-store-attachment-data-in-content-items.md diff --git a/rfc-116-store-attachment-data-in-content-items.md b/rfc-116-store-attachment-data-in-content-items.md new file mode 100644 index 00000000..e5079adc --- /dev/null +++ b/rfc-116-store-attachment-data-in-content-items.md @@ -0,0 +1,227 @@ +# Store attachment data in content items + +## Summary + +Add a new field to the details hash of content items, `attachments`, +which has metadata about the page's attachments (if any). + + +## Problem + +Pages on GOV.UK can have attachments, which come in a few different +types. Whitehall, which probably has the richest model of +attachments, has: + +- External attachments (links to other websites) +- HTML attachments +- File attachments (which can have previews) + +Attachments are referenced in the content of the page, and make up +part of the govspeak (or HTML) the publishing app sends to the +publishing-api. + +Attachments are *not* referenced anywhere else in the content item. +To get the details of an attachment, you have to parse the body of the +page. This restricts what we can do with attachments. For example, +we cannot generate comprehensive [schema.org][] metadata for +attachments: [see this comment][]. + +Additionally, users of the content API cannot make use of attachments +without parsing the page body. This inhibits creative use of our +content. + + +## Proposal + +In our content schemas we already have an [`asset_link` type][], which +is used for attachments in specialist documents and manual sections. +It's also used in travel advice pages for the page image and a single +attachment. This type solves all of our problems: [Publishing API can +render such attachments with govspeak][], and content items can expose +this data to enable programmatic use. + +However, the metadata Govspeak expects and the `asset_link` type are +*informally* the same. And furthermore, Whitehall has much richer +attachment metadata than either `asset_link` or Govspeak allow. + +Given that `asset_link` is currently used for both images and +attachments, adding all of the missing metadata to `asset_link` feels +a mistake: images and attachments have different metadata, so we +shouldn't conflate the two. + +I propose we do this: + +1. Split `asset_link` into two new types: `attachment_asset` and + `image_asset`, updating the schemas which use `asset_link` + accordingly. + + **Migration concerns:** This can be done without breaking + compatibility. + +1. Add new optional fields for extra metadata: + + - Attachments: + - `accessible` + - `alternative_format_contact_email` + - `command_paper_number` + - `file_size` + - `hoc_paper_number` + - `isbn` + - `locale` + - `number_of_pages` + - `parliamentary_session` + - `preview_url` + - `unique_reference` + - `unnumbered_command_paper` + - `unnumbered_hoc_paper` + - Images: + - `alt_text` + - `caption` + - `credit` + - `id` + + I've left out the following fields because they can be deduced from + others: `csv?`, `external?`, `external_url`, `file_extension`, `html?`, + `opendocument?`, `pdf?` + + And these fields because they don't seem to be used: + `print_meta_data_contact_address`, `web_isbn`. + + And these fields because the Publishing Workflow team are planning + to remove them from Whitehall: `order_url`, `price_in_pence`. + + And these fields because they don't need to be exposed outside the + publishing app: `attachable_id`, `attachable_type`, + `attachment_data_id`, `deleted`, `ordering`, `slug`. + + **Migration concerns:** This can be done without breaking + compatibility. + +1. Replace `content_id` in attachments with `id`. + + **Migration concerns:** this would need to be done in five steps: + add `id`; update app to read from both fields but write to `id`; + republish docs; remove `content_id`; update apps to not use + `content_id`. + +1. Remove non-useful metadata fields: + + - Attachments: `created_at` and `updated_at` + - Images: `content_type` + + **Migration concerns:** this would need to be done in two steps: + update apps; remove fields. + +1. Remove the `details.attachments` field and add a *mandatory* + top-level `attachments` field to all formats which can have + attachments. + + **Migration concerns:** this would need to be done in six steps: + add *optional* top-level field; update apps to read from both + fields but write to the top-level one; republish docs; make field + mandatory; remove old field; update apps to not use + `details.attachments`. + +1. Change the `details.documents` list to store govspeak hashes. + + **Migration concerns:** this would need to be done in five steps: + change schema to allow both types; update frontend apps; update + publishing apps; republish docs; change schema to only allow new + type. + +1. Document the relationship between the `attachment_asset` type in + our content schemas and the data Govspeak attachments use. For + example, all fields required in one must be required in the other. + Any pairs of fields which do the same job in both should have the + same name. Implement any changes arising from this, for example, + more Whitehall-like attachment rendering in Govspeak. + +The final `attachment_asset_list` / `attachment_asset` / `image_asset` +schema will be: + +``` +{ + image_asset: { + type: "object", + additionalProperties: false, + required: [ + "url", + ], + properties: { + id: { type: "string", }, + alt_text: { type: "string", }, + caption: { type: "string", }, + credit: { type: "string", }, + url: { type: "string", format: "uri", }, + }, + }, + attachment_asset: { + type: "object", + additionalProperties: false, + required: [ + "url", + "content_type", + ], + properties: { + accessible: { type: "boolean", }, + alternative_format_contact_email: { type: "string", }, + command_paper_number: { type: "string", }, + content_type: { type: "string", }, + file_size: { type: "integer", }, + filename: { type: "string", }, + hoc_paper_number: { type: "string", }, + id: { type: "string", }, + isbn: { type: "string", }, + locale: { "$ref": "#/definitions/locale", }, + number_of_pages: { type: "integer", }, + parliamentary_session: { type: "string", }, + preview_url: { type: "string", format: "uri", }, + title: { type: "string", }, + unique_reference: { type: "string", }, + unnumbered_command_paper: { type: "boolean", }, + unnumbered_hoc_paper: { type: "boolean", }, + url: { type: "string", format: "uri", }, + }, + }, + attachment_asset_list: { + description: "An ordered list of attachment links", + type: "array", + items: { + "$ref": "#/definitions/attachment_asset", + }, + }, +} +``` + +### Some design considerations + +**Why two lists?** + +There are really two different types of attachments in documents. +There are attachments which may be referenced in Govspeak but don't +otherwise appear on the page; in this case the attachment hashes are more like metadata +than actual page content. There are also attachments which appear at +the end of the page text, like in publications. + +The top-level `attachments` list would contain both types of +attachments, so there is one list with everything in. The +`details.documents` list would fulfil its current purpose: providing +the ordered list of attachments for publication-like document types. + +**Why split `asset_link`?** + +`asset_link` is currently used for two different things: images and +attachments. However images have strictly less permissable metadata +than attachments, it doesn't make sense to give a page image an ISBN +for example. These should be separate types. + +**Why make the lists mandatory?** + +Is there a difference between a missing list and a present, but empty, +list? Maybe not, but I think it's better to be explicit that there +are no attachments for a document. + +[schema.org]: http://schema.org/ +[see this comment]: https://github.com/alphagov/govuk_publishing_components/pull/1247#pullrequestreview-338008254 +[`asset_link` type]: https://github.com/alphagov/govuk-content-schemas/blob/47a751e7eb193738c2ec43be03b149527a2b8e15/formats/shared/definitions/asset_links.jsonnet +[Publishing API can render such attachments with govspeak]: https://github.com/alphagov/publishing-api/blob/1b8e328540ab96759807787ab87dcb33bbcc72e3/app/presenters/details_presenter.rb#L71-L73