diff --git a/rfc-116-store-attachment-data-in-content-items.md b/rfc-116-store-attachment-data-in-content-items.md
new file mode 100644
index 00000000..14ff5694
--- /dev/null
+++ b/rfc-116-store-attachment-data-in-content-items.md
@@ -0,0 +1,147 @@
+# Store attachment data in content items
+
+## Summary
+
+Add a new field to the details hash of content items, `attachments`,
+which has metadata about the page's attachments (if any).
+
+
+## Problem
+
+Pages on GOV.UK can have attachments, which come in a few different
+types. Whitehall, which probably has the richest model of
+attachments, has:
+
+- External attachments (links to other websites)
+- HTML attachments
+- File attachments (which can have previews)
+
+Attachments are referenced in the content of the page, and make up
+part of the govspeak (or HTML) the publishing app sends to the
+publishing-api.
+
+Attachments are *not* referenced anywhere else in the content item.
+To get the details of an attachment, you have to parse the body of the
+page. This restricts what we can do with attachments. For example:
+
+- We cannot generate comprehensive [schema.org][] metadata for attachments: [see this comment](https://github.com/alphagov/govuk_publishing_components/pull/1247#pullrequestreview-338008254).
+- Publishing API is unable to tell Asset Manager to take an asset out of draft state, which means publishing apps have to communicate with both.
+
+Additionally, users of the content API cannot make use of attachments
+without parsing the page body. This inhibits creative use of our
+content.
+
+
+## Proposal
+
+1. Add a new required field called `attachments` (of type [`asset_link_list`][]) to the `details` of formats which can have attachments.
+
+2. Add a new optional `preview_url` field to the `asset_link` type, making the schema:
+
+    ```
+    {
+      asset_link: {
+        type: "object",
+        additionalProperties: false,
+        required: [
+          "url",
+          "content_type",
+        ],
+        properties: {
+          content_id: {
+            "$ref": "#/definitions/guid",
+          },
+          url: {
+            type: "string",
+            format: "uri",
+          },
+          preview_url: {
+            type: "string",
+            format: "uri",
+          },
+          content_type: {
+            type: "string",
+          },
+          title: {
+            type: "string",
+          },
+          created_at: {
+            format: "date-time",
+          },
+          updated_at: {
+            format: "date-time",
+          },
+        },
+      },
+      asset_link_list: {
+        description: "An ordered list of asset links",
+        type: "array",
+        items: {
+          "$ref": "#/definitions/asset_link",
+        },
+      },
+    }
+    ```
+
+### Some design considerations
+
+**Why in the details hash?**
+
+[That's how specialist publisher does it][], and it seems sensible.
+
+**Why change `asset_link`? Why not a new schema?**
+
+I think the only thing missing from the `asset_link` schema is a
+preview URL, so any new `attachment` schema would be almost identical,
+which would be confusing. Plus, we already use `asset_link` for
+specialist documents.
+
+**Why make the `asset_link_list` mandatory?**
+
+Is there a difference between a missing list and a present, but empty,
+list? Maybe not, but I think it's better to be explicit that there
+are no attachments for a document.
+
+For implementing this RFC, the field should be optional, to avoid
+breaking publishing apps which haven't yet been updated. But once
+Publishing API has been updated to accept the field, and publishing
+apps have been updated to set it, then it should be made mandatory.
+
+**Why `preview_url`? Why not other metadata?**
+
+Whitehall CSV attachments have automatically generated previews, which
+show you some initial portion of the file without needing to download
+it. That seems like a very useful feature to me, so it would be nice
+to expose it in the metadata.
+
+Other Whitehall metadata, like ISBN, command/act paper number, and
+price (for ordering a physical copy) seem much more special-case and
+I'm not sure there would be much benefit to having them.
+
+Information about how to ask for an accessible format only makes sense
+as free text, so I see that as less useful to add to the content item,
+which I imagine as being consumed by machines.
+
+### Does this solve the problems?
+
+**Can we generate schema.org metadata for attachments?**
+
+Yes, the only thing we need for that is the attachment URL.
+
+**Can Publishing API communicate with Asset Manager?**
+
+With a little fiddling of URLs, yes. Asset URLs follow one of two
+formats: URLs for Whitehall assets, which have a "legacy URL path",
+and URLs for all other assets, which have a UUID. Publishing API
+could use the `publishing_app` field to determine which URL format to
+expect, extract the path or UUID (if it's not an external URL), and
+send messages to Asset Manager about the asset.
+
+Publishing apps would still need to talk to both Publishing API and
+Asset Manager unless we implement uploading assets in Publishing API,
+however. But that is beyond the scope of this RFC.
+
+
+[schema.org]: http://schema.org/
+[`asset_link_list`]: https://github.com/alphagov/govuk-content-schemas/blob/master/formats/shared/definitions/asset_links.jsonnet
+[That's how specialist publisher does it]: https://github.com/alphagov/govuk-content-schemas/blob/47a751e7eb193738c2ec43be03b149527a2b8e15/formats/specialist_document.jsonnet#L38
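
The URL handling described under "Can Publishing API communicate with Asset Manager?" could look roughly like the Python sketch below. It is only an illustration of the idea, not how Publishing API or Asset Manager work today: the asset host, the `/media/<id>/<filename>` path layout, the sample URLs, and the names `AssetRef` and `asset_ref` are all assumptions made up for this example. The identifier is taken from the path as-is and its format is not validated.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse

# Assumption: the host that serves Asset Manager assets.
ASSET_HOST = "assets.publishing.service.gov.uk"
# Assumption: non-Whitehall asset paths look like /media/<id>/<filename>.
MEDIA_PREFIX = "/media/"


@dataclass(frozen=True)
class AssetRef:
    """Hypothetical result type: what Publishing API would pass to Asset Manager."""
    kind: str  # "external", "legacy_path", "asset_id" or "unknown"
    value: Optional[str]


def asset_ref(url: str, publishing_app: str) -> AssetRef:
    """Classify an asset_link URL using the publishing_app field."""
    parsed = urlparse(url)
    if parsed.hostname != ASSET_HOST:
        # External attachments are links to other sites; nothing to tell Asset Manager.
        return AssetRef("external", None)
    if publishing_app == "whitehall":
        # Whitehall assets are identified by their whole "legacy URL path".
        return AssetRef("legacy_path", parsed.path)
    if parsed.path.startswith(MEDIA_PREFIX):
        # All other assets: take the identifier segment that follows /media/.
        identifier = parsed.path[len(MEDIA_PREFIX):].split("/")[0]
        if identifier:
            return AssetRef("asset_id", identifier)
    return AssetRef("unknown", None)


if __name__ == "__main__":
    # Illustrative `details.attachments` value; every URL here is invented.
    attachments = [
        {
            "url": "https://assets.publishing.service.gov.uk/media/3f1c2a9e-5d4b-4c8a-9e61-7a2b0c4d5e6f/report.csv",
            "preview_url": "https://assets.publishing.service.gov.uk/media/3f1c2a9e-5d4b-4c8a-9e61-7a2b0c4d5e6f/report.csv/preview",
            "content_type": "text/csv",
            "title": "Report data",
        },
        {"url": "https://www.example.org/report", "content_type": "text/html"},
    ]
    for attachment in attachments:
        print(asset_ref(attachment["url"], "specialist-publisher"))
```

Keying the decision on `publishing_app` rather than guessing from the path alone follows the suggestion in the RFC text; a real implementation would also need to decide what to do with URLs that match neither format, which this sketch simply reports as `"unknown"`.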