RFC 116: store attachment data in content items
barrucadu committed Jan 3, 2020
1 parent 9016b48 commit e387912
Showing 1 changed file with 147 additions and 0 deletions.
147 changes: 147 additions & 0 deletions rfc-116-store-attachment-data-in-content-items.md
# Store attachment data in content items

## Summary

Add a new field to the details hash of content items, `attachments`,
which has metadata about the page's attachments (if any).


## Problem

Pages on GOV.UK can have attachments, which come in a few different
types. Whitehall, which probably has the richest model of
attachments, has:

- External attachments (links to other websites)
- HTML attachments
- File attachments (which can have previews)

Attachments are referenced in the content of the page, and make up
part of the govspeak (or HTML) the publishing app sends to the
publishing-api.

Attachments are *not* referenced anywhere else in the content item.
To get the details of an attachment, you have to parse the body of the
page. This restricts what we can do with attachments. For example:

- We cannot generate comprehensive [schema.org][] metadata for attachments: [see this comment](https://github.com/alphagov/govuk_publishing_components/pull/1247#pullrequestreview-338008254).
- Publishing API is unable to tell Asset Manager to take an asset out of draft state, which means publishing apps have to communicate with both.

Additionally, users of the content API cannot make use of attachments
without parsing the page body. This inhibits creative use of our
content.


## Proposal

1. Add a new required field called `attachments` (of type [`asset_link_list`][]) to the `details` of formats which can have attachments.

2. Add a new optional `preview_url` field to the `asset_link` type, making the schema:

```
{
  asset_link: {
    type: "object",
    additionalProperties: false,
    required: [
      "url",
      "content_type",
    ],
    properties: {
      content_id: {
        "$ref": "#/definitions/guid",
      },
      url: {
        type: "string",
        format: "uri",
      },
      preview_url: {
        type: "string",
        format: "uri",
      },
      content_type: {
        type: "string",
      },
      title: {
        type: "string",
      },
      created_at: {
        format: "date-time",
      },
      updated_at: {
        format: "date-time",
      },
    },
  },
  asset_link_list: {
    description: "An ordered list of asset links",
    type: "array",
    items: {
      "$ref": "#/definitions/asset_link",
    },
  },
}
```
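
To make the shape concrete, here is a rough sketch of what a `details` hash carrying `attachments` might look like, with a hand-rolled check of the rules the schema above states (required `url` and `content_type`, no additional properties). The example URLs, the `asset_link_errors` helper, and the simplified URI check are all assumptions made for illustration; the real validation would be done by the JSON schema, not by code like this.

```python
from urllib.parse import urlparse

# Property names copied from the asset_link schema above.
ALLOWED = {
    "content_id", "url", "preview_url", "content_type",
    "title", "created_at", "updated_at",
}
REQUIRED = {"url", "content_type"}


def asset_link_errors(link):
    """Return a list of problems with one asset_link (hypothetical helper).

    An empty list means the link satisfies the rules checked here.
    """
    errors = []
    for name in sorted(REQUIRED - set(link)):
        errors.append(f"missing required property: {name}")
    for name in sorted(set(link) - ALLOWED):
        errors.append(f"unexpected property: {name}")
    # Simplified stand-in for format: "uri": only checks for a scheme.
    for name in ("url", "preview_url"):
        if name in link and not urlparse(link[name]).scheme:
            errors.append(f"{name} is not an absolute URI")
    return errors


# Hypothetical details hash; the URLs are made up for this example.
details = {
    "body": "<p>Example body</p>",
    "attachments": [
        {
            "url": "https://assets.example.com/media/abc/data.csv",
            "preview_url": "https://assets.example.com/media/abc/data.csv/preview",
            "content_type": "text/csv",
            "title": "Example data",
        },
        # Invalid on purpose: content_type is required.
        {"url": "https://assets.example.com/media/def/report.pdf"},
    ],
}

problems = [asset_link_errors(link) for link in details["attachments"]]
```

A consumer of the content item could then read `details["attachments"]` directly instead of parsing the body.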

### Some design considerations

**Why in the details hash?**

[That's how Specialist Publisher does it][], and it seems sensible.

**Why change `asset_link`? Why not a new schema?**

I think the only thing missing from the `asset_link` schema is a
preview URL, so any new `attachment` schema would be almost identical,
which would be confusing. Plus, we already use `asset_link` for
specialist documents.

**Why make the `asset_link_list` mandatory?**

Is there a difference between a missing list and a present, but empty,
list? Maybe not, but I think it's better to be explicit that there
are no attachments for a document.

While this RFC is being implemented, the field should be optional, to
avoid breaking publishing apps which haven't yet been updated. Once
Publishing API has been updated to accept the field, and publishing
apps have been updated to set it, it should be made mandatory.

**Why `preview_url`? Why not other metadata?**

Whitehall CSV attachments have automatically generated previews, which
show you some initial portion of the file without needing to download
it. That seems like a very useful feature to me, so it would be nice
to expose it in the metadata.

Other Whitehall metadata, like ISBN, command/act paper number, and
price (for ordering a physical copy), seems much more special-case and
I'm not sure there would be much benefit to having it.

Information about how to ask for an accessible format only makes sense
as free text, so I see that as less useful to add to the content item,
which I imagine as being consumed by machines.

### Does this solve the problems?

**Can we generate schema.org metadata for attachments?**

Yes, the only thing we need for that is the attachment URL.
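
As a rough illustration, the attachment list could be mapped to schema.org data along these lines. Representing each attachment as a `MediaObject` under `associatedMedia` is my assumption for this sketch, not necessarily what govuk_publishing_components would do, and `attachments_to_schema_org` is a hypothetical helper.

```python
import json


def attachments_to_schema_org(attachments):
    """Map asset_link hashes to schema.org MediaObject hashes.

    Only the URL is needed; title and content type are added when present.
    """
    media = []
    for attachment in attachments:
        item = {"@type": "MediaObject", "contentUrl": attachment["url"]}
        if "title" in attachment:
            item["name"] = attachment["title"]
        if "content_type" in attachment:
            item["encodingFormat"] = attachment["content_type"]
        media.append(item)
    return media


# Made-up attachment, in the shape of the asset_link schema.
attachments = [
    {
        "url": "https://assets.example.com/media/abc/data.csv",
        "content_type": "text/csv",
        "title": "Example data",
    },
]

structured_data = {
    "@context": "http://schema.org",
    "@type": "Article",
    "associatedMedia": attachments_to_schema_org(attachments),
}

print(json.dumps(structured_data, indent=2))
```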

**Can Publishing API communicate with Asset Manager?**

With a little fiddling of URLs, yes. Asset URLs follow one of two
formats: URLs for Whitehall assets, which have a "legacy URL path",
and URLs for all other assets, which have a UUID. Publishing API
could use the `publishing_app` field to determine which URL format to
expect, extract the path or UUID (if it's not an external URL), and
send messages to Asset Manager about the asset.
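
A minimal sketch of that URL fiddling, under stated assumptions: the asset host name, the idea that the whole path of a Whitehall asset URL is its legacy URL path, and the UUID appearing somewhere in the path of other asset URLs are all guesses made for illustration. `asset_reference` is a hypothetical helper, not an existing Publishing API method.

```python
import re
from urllib.parse import urlparse

# Assumption: assets are served from this host; anything else is external.
ASSET_HOSTS = {"assets.publishing.service.gov.uk"}

# Standard 8-4-4-4-12 hexadecimal UUID layout.
UUID_PATTERN = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def asset_reference(url, publishing_app):
    """Work out how to refer to an attachment when talking to Asset Manager.

    Returns ("legacy_url_path", path) for Whitehall assets, ("uuid", id) for
    other assets, or None for external URLs and URLs we cannot interpret.
    """
    parsed = urlparse(url)
    if parsed.hostname not in ASSET_HOSTS:
        return None  # external attachment: nothing to tell Asset Manager
    if publishing_app == "whitehall":
        return ("legacy_url_path", parsed.path)
    match = UUID_PATTERN.search(parsed.path)
    if match is None:
        return None
    return ("uuid", match.group(0))


# Made-up URLs in the two assumed formats, plus an external link.
whitehall_ref = asset_reference(
    "https://assets.publishing.service.gov.uk/government/uploads/report.pdf",
    "whitehall",
)
other_ref = asset_reference(
    "https://assets.publishing.service.gov.uk/media/"
    "123e4567-e89b-12d3-a456-426614174000/data.csv",
    "specialist-publisher",
)
external_ref = asset_reference("https://example.com/report.pdf", "whitehall")
```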

However, publishing apps would still need to talk to both Publishing
API and Asset Manager unless we implement uploading assets in
Publishing API, which is beyond the scope of this RFC.


[schema.org]: http://schema.org/
[`asset_link_list`]: https://github.com/alphagov/govuk-content-schemas/blob/master/formats/shared/definitions/asset_links.jsonnet
[That's how specialist publisher does it]: https://github.com/alphagov/govuk-content-schemas/blob/47a751e7eb193738c2ec43be03b149527a2b8e15/formats/specialist_document.jsonnet#L38
