RFC 116: store attachment data in content items

alphagov · Jan 23, 2020 · bde9f74 · bde9f74
1 parent 9016b48
commit bde9f74
Showing 1 changed file with 227 additions and 0 deletions.
diff --git a/rfc-116-store-attachment-data-in-content-items.md b/rfc-116-store-attachment-data-in-content-items.md
@@ -0,0 +1,227 @@
+# Store attachment data in content items
+
+## Summary
+
+Add a new field to the details hash of content items, `attachments`,
+which has metadata about the page's attachments (if any).
+
+
+## Problem
+
+Pages on GOV.UK can have attachments, which come in a few different
+types.  Whitehall, which probably has the richest model of
+attachments, has:
+
+- External attachments (links to other websites)
+- HTML attachments
+- File attachments (which can have previews)
+
+Attachments are referenced in the content of the page, and make up
+part of the govspeak (or HTML) the publishing app sends to the
+publishing-api.
+
+Attachments are *not* referenced anywhere else in the content item.
+To get the details of an attachment, you have to parse the body of the
+page.  This restricts what we can do with attachments.  For example,
+we cannot generate comprehensive [schema.org][] metadata for
+attachments: [see this comment][].
+
+Additionally, users of the content API cannot make use of attachments
+without parsing the page body.  This inhibits creative use of our
+content.
+
+
+## Proposal
+
+In our content schemas we already have an [`asset_link` type][], which
+is used for attachments in specialist documents and manual sections.
+It's also used in travel advice pages for the page image and a single
+attachment.  This type solves all of our problems: [Publishing API can
+render such attachments with govspeak][], and content items can expose
+this data to enable programmatic use.
+
+However, the metadata Govspeak expects and the `asset_link` type are
+*informally* the same.  And furthermore, Whitehall has much richer
+attachment metadata than either `asset_link` or Govspeak allow.
+
+Given that `asset_link` is currently used for both images and
+attachments, adding all of the missing metadata to `asset_link` feels
+a mistake: images and attachments have different metadata, so we
+shouldn't conflate the two.
+
+I propose we do this:
+
+1. Split `asset_link` into two new types: `attachment_asset` and
+   `image_asset`, updating the schemas which use `asset_link`
+   accordingly.
+
+   **Migration concerns:** This can be done without breaking
+   compatibility.
+
+1. Add new optional fields for extra metadata:
+
+   - Attachments:
+     - `accessible`
+     - `alternative_format_contact_email`
+     - `command_paper_number`
+     - `file_size`
+     - `hoc_paper_number`
+     - `isbn`
+     - `locale`
+     - `number_of_pages`
+     - `parliamentary_session`
+     - `preview_url`
+     - `unique_reference`
+     - `unnumbered_command_paper`
+     - `unnumbered_hoc_paper`
+   - Images:
+     - `alt_text`
+     - `caption`
+     - `credit`
+     - `id`
+
+   I've left out the following fields because they can be deduced from
+   others: `csv?`, `external?`, `external_url`, `file_extension`, `html?`,
+   `opendocument?`, `pdf?`
+
+   And these fields because they don't seem to be used:
+   `print_meta_data_contact_address`, `web_isbn`.
+
+   And these fields because the Publishing Workflow team are planning
+   to remove them from Whitehall: `order_url`, `price_in_pence`.
+
+   And these fields because they don't need to be exposed outside the
+   publishing app: `attachable_id`, `attachable_type`,
+   `attachment_data_id`, `deleted`, `ordering`, `slug`.
+
+   **Migration concerns:** This can be done without breaking
+   compatibility.
+
+1. Replace `content_id` in attachments with `id`.
+
+    **Migration concerns:** this would need to be done in five steps:
+    add `id`; update app to read from both fields but write to `id`;
+    republish docs; remove `content_id`; update apps to not use
+    `content_id`.
+
+1. Remove non-useful metadata fields:
+
+   - Attachments: `created_at` and `updated_at`
+   - Images: `content_type`
+
+   **Migration concerns:** this would need to be done in two steps:
+   update apps; remove fields.
+
+1. Remove the `details.attachments` field and add a *mandatory*
+   top-level `attachments` field to all formats which can have
+   attachments.
+
+    **Migration concerns:** this would need to be done in six steps:
+    add *optional* top-level field; update apps to read from both
+    fields but write to the top-level one; republish docs; make field
+    mandatory; remove old field; update apps to not use
+    `details.attachments`.
+
+1. Change the `details.documents` list to store govspeak hashes.
+
+   **Migration concerns:** this would need to be done in five steps:
+   change schema to allow both types; update frontend apps; update
+   publishing apps; republish docs; change schema to only allow new
+   type.
+
+1. Document the relationship between the `attachment_asset` type in
+   our content schemas and the data Govspeak attachments use.  For
+   example, all fields required in one must be required in the other.
+   Any pairs of fields which do the same job in both should have the
+   same name.  Implement any changes arising from this, for example,
+   more Whitehall-like attachment rendering in Govspeak.
+
+The final `attachment_asset_list` / `attachment_asset` / `image_asset`
+schema will be:
+
+```
+{
+  image_asset: {
+    type: "object",
+    additionalProperties: false,
+    required: [
+      "url",
+    ],
+    properties: {
+      id: { type: "string", },
+      alt_text: { type: "string", },
+      caption: { type: "string", },
+      credit: { type: "string", },
+      url: { type: "string", format: "uri", },
+    },
+  },
+  attachment_asset: {
+    type: "object",
+    additionalProperties: false,
+    required: [
+      "url",
+      "content_type",
+    ],
+    properties: {
+      accessible: { type: "boolean", },
+      alternative_format_contact_email: { type: "string", },
+      command_paper_number: { type: "string", },
+      content_type: { type: "string", },
+      file_size: { type: "integer", },
+      filename: { type: "string", },
+      hoc_paper_number: { type: "string", },
+      id: { type: "string", },
+      isbn: { type: "string", },
+      locale: { "$ref": "#/definitions/locale", },
+      number_of_pages: { type: "integer", },
+      parliamentary_session: { type: "string", },
+      preview_url: { type: "string", format: "uri", },
+      title: { type: "string", },
+      unique_reference: { type: "string", },
+      unnumbered_command_paper: { type: "boolean", },
+      unnumbered_hoc_paper: { type: "boolean", },
+      url: { type: "string", format: "uri", },
+    },
+  },
+  attachment_asset_list: {
+    description: "An ordered list of attachment links",
+    type: "array",
+    items: {
+      "$ref": "#/definitions/attachment_asset",
+    },
+  },
+}
+```
+
+### Some design considerations
+
+**Why two lists?**
+
+There are really two different types of attachments in documents.
+There are attachments which may be referenced in Govspeak but don't
+otherwise appear on the page; in this case the attachment hashes are more like metadata
+than actual page content.  There are also attachments which appear at
+the end of the page text, like in publications.
+
+The top-level `attachments` list would contain both types of
+attachments, so there is one list with everything in.  The
+`details.documents` list would fulfil its current purpose: providing
+the ordered list of attachments for publication-like document types.
+
+**Why split `asset_link`?**
+
+`asset_link` is currently used for two different things: images and
+attachments.  However images have strictly less permissable metadata
+than attachments, it doesn't make sense to give a page image an ISBN
+for example.  These should be separate types.
+
+**Why make the lists mandatory?**
+
+Is there a difference between a missing list and a present, but empty,
+list?  Maybe not, but I think it's better to be explicit that there
+are no attachments for a document.
+
+[schema.org]: http://schema.org/
+[see this comment]: https://github.com/alphagov/govuk_publishing_components/pull/1247#pullrequestreview-338008254
+[`asset_link` type]: https://github.com/alphagov/govuk-content-schemas/blob/47a751e7eb193738c2ec43be03b149527a2b8e15/formats/shared/definitions/asset_links.jsonnet
+[Publishing API can render such attachments with govspeak]: https://github.com/alphagov/publishing-api/blob/1b8e328540ab96759807787ab87dcb33bbcc72e3/app/presenters/details_presenter.rb#L71-L73