From 06195023a1cff8fe88e6ebe30b28a9f819d3c3bb Mon Sep 17 00:00:00 2001 From: Michael Walker Date: Mon, 6 Jan 2020 09:54:10 +0000 Subject: [PATCH] fixup! RFC 116: store attachment data in content items --- ...-store-attachment-data-in-content-items.md | 126 ++++++------------ 1 file changed, 44 insertions(+), 82 deletions(-) diff --git a/rfc-116-store-attachment-data-in-content-items.md b/rfc-116-store-attachment-data-in-content-items.md index ab7062c3..aeaf37af 100644 --- a/rfc-116-store-attachment-data-in-content-items.md +++ b/rfc-116-store-attachment-data-in-content-items.md @@ -22,10 +22,9 @@ publishing-api. Attachments are *not* referenced anywhere else in the content item. To get the details of an attachment, you have to parse the body of the -page. This restricts what we can do with attachments. For example: - -- We cannot generate comprehensive [schema.org][] metadata for attachments: [see this comment](https://github.com/alphagov/govuk_publishing_components/pull/1247#pullrequestreview-338008254). -- Publishing API is unable to tell Asset Manager to take an asset out of draft state, which means publishing apps have to communicate with both. +page. This restricts what we can do with attachments. For example, +we cannot generate comprehensive [schema.org][] metadata for +attachments: [see this comment][]. Additionally, users of the content API cannot make use of attachments without parsing the page body. This inhibits creative use of our @@ -34,67 +33,45 @@ content. ## Proposal -1. Add a new required field called `attachments` (of type [`asset_link_list`][]) to the `details` of formats which can have attachments. - -2. Add a new optional `preview_url` field to the `asset_link` type, making the schema: - - ``` - { - asset_link: { - type: "object", - additionalProperties: false, - required: [ - "url", - "content_type", - ], - properties: { - content_id: { - "$ref": "#/definitions/guid", - }, - url: { - type: "string", - format: "uri", - }, - preview_url: { - type: "string", - format: "uri", - }, - content_type: { - type: "string", - }, - title: { - type: "string", - }, - created_at: { - format: "date-time", - }, - updated_at: { - format: "date-time", - }, - }, - }, - asset_link_list: { - description: "An ordered list of asset links", - type: "array", - items: { - "$ref": "#/definitions/asset_link", - }, - }, - } - ``` +In our content schemas we already have an [`asset_link` type][], which +[Specialist Publisher uses for storing its attachment data][]. This +solves almost all of our problems: [Publishing API can render such +attachments with govspeak][], and content items can expose this data +to enable programmatic use. + +However, the metadata Govspeak expects and the `asset_link` type are +*informally* the same. And furthermore, Whitehall has much richer +attachment metadata than either `asset_link` or Govspeak allow. So +this proposal has three parts: + +1. Add a new required field called `attachments` (of type + `asset_link_list`) to the `details` of formats which can have + attachments. + +2. Document the relationship between the `asset_link` type in our + content schemas and the data Govspeak attachments use. For + example, all fields required in one must be required in the other. + Any pairs of fields which do the same job in both should have the + same name. Implement any changes arising from this. + +3. Add new optional fields to the `asset_link` type: + + - `file_size` (used by govspeak) + - `preview_url` (used by whitehall) ### Some design considerations **Why in the details hash?** -[That's how specialist publisher does it][], and it seems sensible. +That's how Specialist Publisher does it and how Publishing API expects +it if it's going to use Govspeak to render attachments. **Why change `asset_link`? Why not a new schema?** -I think the only thing missing from the `asset_link` schema is a -preview URL, so any new `attachment` schema would be almost identical, -which would be confusing. Plus, we already use `asset_link` for -specialist documents. +I think there are only a couple of fields missing from the +`asset_link` schema, so any new `attachment` schema would be almost +identical, which would be confusing. Plus, we already use +`asset_link` for specialist documents. **Why make the `asset_link_list` mandatory?** @@ -107,7 +84,7 @@ breaking publishing apps which haven't yet been updated. But once Publishing API has ben updated to accept the field, and publishing apps have been updated to set it, then it should be made mandatory. -**Why `preview_url`? Why not other metadata?** +**Why `preview_url`? Why not more whitehall metadata?** Whitehall CSV attachments have automatically generated previews, which show you some initial portion of the file without needing to download @@ -115,34 +92,19 @@ it. That seems like a very useful feature to me, so it would be nice to expose it in the metadata. Other whitehall metadata, like ISBN, command/act paper number, and -price (for ordering a physical copy) seem much more special-case and -I'm not sure there would be much benefit to having them. +price (for ordering a physical copy) seem much more special-case. We +don't want to overcomplicate Govspeak by adding many new attributes to +its attachment model. We can instead add new fields as we need to +and, in doing so, might discover that Whitehall has some features +which aren't needed. Information about how to ask for an accessible format only makes sense as free text, so I see that as less useful to add to the content item, which I imagine as being consumed by machines. -### Does this solve the problems? - -**Can we generate schema.org metadata for attachments?** - -Yes, the only thing we need for that is the attachment URL and content -type. - -**Can Publishing API communicate with Asset Manager?** - -With a little fiddling of URLs, yes. Asset URLs follow one of two -formats: URLs for whitehall assets, which have a "legacy URL path", -and URLs for all other assets, which have a UUID. Publishing API -could use the `publishing_app` field to determine which URL format to -expect, extract the path or UUID (if it's not an external URL), and -send messages to Asset Manager about the asset. - -Publishing apps would still need to talk to both Publishing API and -Asset Manager unless we implement uploading assets in Publishing API, -however. But that is beyond the scope of this RFC. - [schema.org]: http://schema.org/ -[`asset_link_list`]: https://github.com/alphagov/govuk-content-schemas/blob/master/formats/shared/definitions/asset_links.jsonnet -[That's how specialist publisher does it]: https://github.com/alphagov/govuk-content-schemas/blob/47a751e7eb193738c2ec43be03b149527a2b8e15/formats/specialist_document.jsonnet#L38 +[see this comment]: https://github.com/alphagov/govuk_publishing_components/pull/1247#pullrequestreview-338008254 +[`asset_link` type]: https://github.com/alphagov/govuk-content-schemas/blob/47a751e7eb193738c2ec43be03b149527a2b8e15/formats/shared/definitions/asset_links.jsonnet +[Specialist Publisher uses for storing its attachment data]: https://github.com/alphagov/govuk-content-schemas/blob/47a751e7eb193738c2ec43be03b149527a2b8e15/formats/specialist_document.jsonnet#L38 +[Publishing API can render such attachments with govspeak]: https://github.com/alphagov/publishing-api/blob/1b8e328540ab96759807787ab87dcb33bbcc72e3/app/presenters/details_presenter.rb#L71-L73