Skip to content

Commit

Permalink
RFC 116: store attachment data in content items
Browse files Browse the repository at this point in the history
  • Loading branch information
barrucadu committed Jan 23, 2020
1 parent 9016b48 commit bde9f74
Showing 1 changed file with 227 additions and 0 deletions.
227 changes: 227 additions & 0 deletions rfc-116-store-attachment-data-in-content-items.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,227 @@
# Store attachment data in content items

## Summary

Add a new field to the details hash of content items, `attachments`,
which has metadata about the page's attachments (if any).


## Problem

Pages on GOV.UK can have attachments, which come in a few different
types. Whitehall, which probably has the richest model of
attachments, has:

- External attachments (links to other websites)
- HTML attachments
- File attachments (which can have previews)

Attachments are referenced in the content of the page, and make up
part of the govspeak (or HTML) the publishing app sends to the
publishing-api.

Attachments are *not* referenced anywhere else in the content item.
To get the details of an attachment, you have to parse the body of the
page. This restricts what we can do with attachments. For example,
we cannot generate comprehensive [schema.org][] metadata for
attachments: [see this comment][].

Additionally, users of the content API cannot make use of attachments
without parsing the page body. This inhibits creative use of our
content.


## Proposal

In our content schemas we already have an [`asset_link` type][], which
is used for attachments in specialist documents and manual sections.
It's also used in travel advice pages for the page image and a single
attachment. This type solves all of our problems: [Publishing API can
render such attachments with govspeak][], and content items can expose
this data to enable programmatic use.

However, the metadata Govspeak expects and the `asset_link` type are
*informally* the same. And furthermore, Whitehall has much richer
attachment metadata than either `asset_link` or Govspeak allow.

Given that `asset_link` is currently used for both images and
attachments, adding all of the missing metadata to `asset_link` feels
a mistake: images and attachments have different metadata, so we
shouldn't conflate the two.

I propose we do this:

1. Split `asset_link` into two new types: `attachment_asset` and
`image_asset`, updating the schemas which use `asset_link`
accordingly.

**Migration concerns:** This can be done without breaking
compatibility.

1. Add new optional fields for extra metadata:

- Attachments:
- `accessible`
- `alternative_format_contact_email`
- `command_paper_number`
- `file_size`
- `hoc_paper_number`
- `isbn`
- `locale`
- `number_of_pages`
- `parliamentary_session`
- `preview_url`
- `unique_reference`
- `unnumbered_command_paper`
- `unnumbered_hoc_paper`
- Images:
- `alt_text`
- `caption`
- `credit`
- `id`

I've left out the following fields because they can be deduced from
others: `csv?`, `external?`, `external_url`, `file_extension`, `html?`,
`opendocument?`, `pdf?`

And these fields because they don't seem to be used:
`print_meta_data_contact_address`, `web_isbn`.

And these fields because the Publishing Workflow team are planning
to remove them from Whitehall: `order_url`, `price_in_pence`.

And these fields because they don't need to be exposed outside the
publishing app: `attachable_id`, `attachable_type`,
`attachment_data_id`, `deleted`, `ordering`, `slug`.

**Migration concerns:** This can be done without breaking
compatibility.

1. Replace `content_id` in attachments with `id`.

**Migration concerns:** this would need to be done in five steps:
add `id`; update app to read from both fields but write to `id`;
republish docs; remove `content_id`; update apps to not use
`content_id`.

1. Remove non-useful metadata fields:

- Attachments: `created_at` and `updated_at`
- Images: `content_type`

**Migration concerns:** this would need to be done in two steps:
update apps; remove fields.

1. Remove the `details.attachments` field and add a *mandatory*
top-level `attachments` field to all formats which can have
attachments.

**Migration concerns:** this would need to be done in six steps:
add *optional* top-level field; update apps to read from both
fields but write to the top-level one; republish docs; make field
mandatory; remove old field; update apps to not use
`details.attachments`.

1. Change the `details.documents` list to store govspeak hashes.

**Migration concerns:** this would need to be done in five steps:
change schema to allow both types; update frontend apps; update
publishing apps; republish docs; change schema to only allow new
type.

1. Document the relationship between the `attachment_asset` type in
our content schemas and the data Govspeak attachments use. For
example, all fields required in one must be required in the other.
Any pairs of fields which do the same job in both should have the
same name. Implement any changes arising from this, for example,
more Whitehall-like attachment rendering in Govspeak.

The final `attachment_asset_list` / `attachment_asset` / `image_asset`
schema will be:

```
{
image_asset: {
type: "object",
additionalProperties: false,
required: [
"url",
],
properties: {
id: { type: "string", },
alt_text: { type: "string", },
caption: { type: "string", },
credit: { type: "string", },
url: { type: "string", format: "uri", },
},
},
attachment_asset: {
type: "object",
additionalProperties: false,
required: [
"url",
"content_type",
],
properties: {
accessible: { type: "boolean", },
alternative_format_contact_email: { type: "string", },
command_paper_number: { type: "string", },
content_type: { type: "string", },
file_size: { type: "integer", },
filename: { type: "string", },
hoc_paper_number: { type: "string", },
id: { type: "string", },
isbn: { type: "string", },
locale: { "$ref": "#/definitions/locale", },
number_of_pages: { type: "integer", },
parliamentary_session: { type: "string", },
preview_url: { type: "string", format: "uri", },
title: { type: "string", },
unique_reference: { type: "string", },
unnumbered_command_paper: { type: "boolean", },
unnumbered_hoc_paper: { type: "boolean", },
url: { type: "string", format: "uri", },
},
},
attachment_asset_list: {
description: "An ordered list of attachment links",
type: "array",
items: {
"$ref": "#/definitions/attachment_asset",
},
},
}
```

### Some design considerations

**Why two lists?**

There are really two different types of attachments in documents.
There are attachments which may be referenced in Govspeak but don't
otherwise appear on the page; in this case the attachment hashes are more like metadata
than actual page content. There are also attachments which appear at
the end of the page text, like in publications.

The top-level `attachments` list would contain both types of
attachments, so there is one list with everything in. The
`details.documents` list would fulfil its current purpose: providing
the ordered list of attachments for publication-like document types.

**Why split `asset_link`?**

`asset_link` is currently used for two different things: images and
attachments. However images have strictly less permissable metadata
than attachments, it doesn't make sense to give a page image an ISBN
for example. These should be separate types.

**Why make the lists mandatory?**

Is there a difference between a missing list and a present, but empty,
list? Maybe not, but I think it's better to be explicit that there
are no attachments for a document.

[schema.org]: http://schema.org/
[see this comment]: https://github.com/alphagov/govuk_publishing_components/pull/1247#pullrequestreview-338008254
[`asset_link` type]: https://github.com/alphagov/govuk-content-schemas/blob/47a751e7eb193738c2ec43be03b149527a2b8e15/formats/shared/definitions/asset_links.jsonnet
[Publishing API can render such attachments with govspeak]: https://github.com/alphagov/publishing-api/blob/1b8e328540ab96759807787ab87dcb33bbcc72e3/app/presenters/details_presenter.rb#L71-L73

0 comments on commit bde9f74

Please sign in to comment.