duckduckgrayduck edited this page Apr 8, 2023 · 8 revisions

DocumentCloud Add-On Example

This repository contains an example Add-On for DocumentCloud. It is designed to be copied and modified to allow one to easily write Add-Ons to bring custom functionality to DocumentCloud. Please use the green "Use this Template" button to create a new repository instead of forking. This will be easier as the new Add-On will diverge significantly from this code base.

Files

addon.py

This file is part of the python-documentcloud library.

This file contains a base class AddOn, which implements shared functionality for all DocumentCloud Add-Ons to use. In most cases, you should not need to edit this file. You will subclass this class in main.py.

Upon initializing this class, it parses the JSON passed in as an argument, and populates a number of member variables.

  • client - A DocumentCloud client. This is a python library (https://github.com/MuckRock/python-documentcloud) allowing easy access to the DocumentCloud API. It will be configured with the access token and refresh token passed in, which gives you access to the API as the user who activated the Add-On.

  • id - a UUID to identify this run. This is used to update progress, status and upload files for this particular run of the Add-On. This will be None if called from the test_addon.py script.

  • documents - A list of document IDs selected when the Add-On was activated

  • query - The search query which was active when the Add-On was activated

  • user_id - The user ID of the user who activated this run of the Add-On.

  • org_id - The organization ID of the active organization of the user who activated this run of the Add-On.

  • data - The Add-On specific data.

There are also some methods which provide useful functionality for an Add-On. Please note that set_progress, set_message, and upload_file will only work when the Add-On is invoked from DocumentCloud. If you invoke the Add-On from the command line during development, these methods will be no-ops.

  • set_progress(self, progress) - This takes a single integer argument between 0 and 100 which represents the percent of progress the Add-On run has made, to inform the user of the progress. As it takes some time to be shown to the user, this is primarily of use for long running Add-Ons.

  • set_message(self, message) - Takes a string of max length 255 characters. This sets a status message to let the user know what the status of the Add-On run is. Similar to set_progress, it is mostly useful for long running Add-Ons.

  • upload_file(self, file) - Takes a file object to attach to this Add-On run. This will be presented to the user for download. This is useful for Add-Ons which want to return data such as a CSV file or other exports of data to the user. It is currently limited to one file per run, so please ZIP your files if you need to return more than one. The file will be available for download for five days, after which it will be permanently deleted from the server.

  • send_mail(self, subject, content) - This is used to email yourself at the email address associated with your DocumentCloud account. This can be used to send a notification when an Add-On run is complete or just to send additional information to the user who ran the Add-On. It takes two character strings, one for the subject and one for the body content of the email. The content is plain text and does not currently support Markdown or HTML.
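As a minimal sketch of preparing values for these methods, the helpers below compute a progress percentage in the 0 to 100 range, trim a status message to 255 characters, and bundle several output files into one ZIP so that a single file can be uploaded. The names percent_done, fit_message and bundle_as_zip are hypothetical and are not part of python-documentcloud.

```python
import zipfile
from pathlib import Path


def percent_done(done, total):
    """Return an integer from 0 to 100 suitable for set_progress."""
    if total <= 0:
        return 100
    return max(0, min(100, int(100 * done / total)))


def fit_message(message, limit=255):
    """Trim a status message to the 255 character limit of set_message."""
    return message if len(message) <= limit else message[: limit - 3] + "..."


def bundle_as_zip(paths, zip_path):
    """Write several files into one ZIP, since only one file is allowed per run."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in paths:
            # store each file under its base name rather than its full path
            archive.write(path, arcname=Path(path).name)
    return zip_path
```

Inside main, these could be combined as self.set_progress(percent_done(i, total)) and self.set_message(fit_message(text)), with the resulting ZIP opened and passed to self.upload_file.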

For scheduled Add-Ons, there are two additional methods that provide some useful features.

  • store_event_data(self, scratch) - Allows you to store some data between scheduled Add-On runs. Scratch is a JSON file, where you can designate items you would like to store between runs. The Scraper Add-On, for example, keeps track of which documents on a site have already been seen between runs so that documents are not uploaded twice. The Klaxon Site Monitor Add-On keeps track of an archive.org timestamp between Add-On runs.
  • load_event_data(self) - Loads stored event data at the beginning of a new scheduled run. The event data is stored as a JSON file.
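The "already seen" bookkeeping described for the Scraper Add-On can be sketched as a small pure function. This is an illustration only, not the Scraper's actual code: the function name find_new_urls and the "seen" key are assumptions chosen for this example.

```python
def find_new_urls(found_urls, event_data):
    """Return URLs not seen in earlier runs, plus the updated event data.

    event_data is assumed to be a JSON-serializable dict such as
    {"seen": ["https://..."]}; the "seen" key is a convention chosen
    for this sketch, not something the library defines.
    """
    seen = set(event_data.get("seen", []))
    # dict.fromkeys removes duplicates while keeping the original order
    new_urls = [url for url in dict.fromkeys(found_urls) if url not in seen]
    # keep the stored list sorted so the saved JSON is stable between runs
    updated = {**event_data, "seen": sorted(seen | set(new_urls))}
    return new_urls, updated
```

In a scheduled Add-On, one might call find_new_urls(urls, self.load_event_data() or {}), upload only the returned new URLs, and then pass the updated dictionary to self.store_event_data. The "or {}" guards against the assumption that no data exists on the first run.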

The script also accepts command line options to make testing easier during development. If the Add-On requires authentication, it needs your DocumentCloud username and password, which are used to fetch a refresh and access token. They can be passed in as command line arguments (--username and --password), or in environment variables (DC_USERNAME and DC_PASSWORD).

You can also pass in a list of document IDs (--documents), a search query (--query), and JSON parameters for your Add-On (--data) - be sure to properly quote your JSON at the command line.

Example invocation:

python main.py --documents 123 --data '{"name": "World"}'
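To illustrate how options like these can be turned into Python values, here is a minimal sketch using argparse. It is not the parser used by python-documentcloud; the function name parse_args and the defaults shown are assumptions made for this example.

```python
import argparse
import json
import os


def parse_args(argv=None, environ=None):
    """Parse the testing options described above into a plain dictionary."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(description="Run an Add-On locally")
    # credentials fall back to the environment variables when not given
    parser.add_argument("--username", default=environ.get("DC_USERNAME"))
    parser.add_argument("--password", default=environ.get("DC_PASSWORD"))
    parser.add_argument("--documents", type=int, nargs="+", default=[])
    parser.add_argument("--query", default="")
    # --data arrives as a JSON string, so decode it into Python objects
    parser.add_argument("--data", type=json.loads, default={})
    return vars(parser.parse_args(argv))
```

Because --data is decoded with json.loads, the single quotes around the JSON in the example invocation matter: they stop the shell from stripping the double quotes that JSON requires.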

main.py

This is the file to edit to implement your Add-On specific functionality. You should define a class which inherits from AddOn from addon.py. Then you can create an instance and call its main method, which is the entry point for your Add-On logic. You may access the data parsed by AddOn and use the helper methods defined there. The HelloWorld example Add-On demonstrates using many of these features.

If you need to add more files, remember to instantiate the main Add-On class from a file called main.py - that is what the GitHub action will call with the Add-On parameters upon being dispatched.
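The overall shape of main.py can be sketched as follows. So that the example runs on its own, it defines a small stand-in for the AddOn base class; the stand-in, its keyword arguments and its messages list are assumptions made for this sketch and do not reflect the real class, which reads its values from the command line or the dispatched JSON.

```python
class AddOn:
    """Stand-in for documentcloud.addon.AddOn so this sketch runs on its own.

    In a real main.py, delete this class and use
    `from documentcloud.addon import AddOn` instead.
    """

    def __init__(self, documents=None, query="", data=None):
        # the real class fills these in itself; arguments are for this sketch only
        self.documents = documents or []
        self.query = query
        self.data = data or {}
        self.messages = []

    def set_message(self, message):
        # the real method reports the status message to DocumentCloud
        self.messages.append(message)


class HelloWorld(AddOn):
    """A hypothetical Add-On that greets the name entered in the form."""

    def main(self):
        name = self.data.get("name", "World")
        self.set_message(f"Hello {name}!")


if __name__ == "__main__":
    # with the real base class this would be HelloWorld().main()
    HelloWorld(data={"name": "World"}).main()
```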

If your Add-On performs the same task on large sets of documents and you want to handle this more efficiently without raising the hard timeout, you can inherit from SoftTimeOutAddOn like so: class HelloWorldAddOn(SoftTimeOutAddOn):

You will also need to import the SoftTimeOutAddOn class like so: from documentcloud.addon import SoftTimeOutAddOn

You may specify the soft timeout in seconds in the class body, before main(): soft_time_limit = 60 sets the soft timeout to one minute. The default soft timeout is five minutes. If you convert your Add-On to use soft timeouts, please set the hard timeout in the run-addon.yml workflow file, described later in this guide, to a time longer than five minutes.

The DocumentCloud Filecoin Add-On is a good example of an Add-On that uses the soft timeout system to efficiently handle large sets of documents.

The Soft Timeout Test Add-On also exists as a really basic example of how to implement an Add-On using soft timeouts.
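The general idea behind a soft timeout can be sketched without the library: work through the items until a time budget is spent, then stop cleanly and hand back whatever is left so that a follow-up run can continue. This is an illustration of the concept only, not the implementation in SoftTimeOutAddOn; the function name and the clock parameter are assumptions, and only soft_time_limit mirrors a name used above.

```python
import time


def process_until_soft_timeout(items, handle, soft_time_limit=300, clock=time.monotonic):
    """Handle items until the time budget is spent, then return the leftovers."""
    start = clock()
    remaining = list(items)
    while remaining:
        handle(remaining.pop(0))
        # check after each item so a single item is never interrupted
        if remaining and clock() - start > soft_time_limit:
            break
    return remaining
```

An empty return value means everything was handled within the budget. A non-empty one is the list of items a later run would still need to process.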

config.yaml

This is a YAML file which defines the data your Add-On expects to receive. DocumentCloud will use it to show a corresponding form with the proper fields. It uses the JSON Schema format, but allows you to use YAML for convenience. You may read more about JSON Schema, but here are the basics to get started:

# The config.yaml format is a JSON Schema document in YAML format
# https://yaml.org/
#
# Learn more about JSON Schema:
# https://json-schema.org/
#
# This example configuration shows the options supported by the
# DocumentCloud add-on system

# The title is the title of your Add-On
title: Hello World
# The description will be shown above the form when activating the Add-On
description: This is an updated simple test add-on
# Top level type should always be object
type: object
# How does this add-on accept documents
# If more than one type is specified, the user will be prompted to choose one
# This is not a JSON Schema property
documents:
  # By the current search query
  - query
  # By the currently selected documents
  - selected
# Properties are the fields for your form
properties:
  # the key is the name of the variable that will be returned to your code
  name:
    # the title is what will be shown as the form label
    title: Name
    # a string is text - it will use a text input type
    type: string
    # description will be shown under the form field
    # the top level description accepts markdown syntax and will be converted to HTML
    description: Please enter your full name
  url:
    title: URL
    type: string
    # format will validate the string is of the given format
    # All accepted formats:
    # https://json-schema.org/understanding-json-schema/reference/string.html#built-in-formats
    # Common formats:
    # date-time, time, date, email, uri, uuid, regex
    # date, time and date-time will create input boxes of the corresponding type
    format: uri
    # default will pre-populate the form field with this value
    # it will also use this value if the add-on is dispatched with no value
    # for this property
    default: https://www.example.com
  age:
    title: Age
    # a number is an integer or floating point number
    # an integer is only integers 
    # these will use a number input type
    type: integer
  vip:
    title: VIP
    # a boolean is either true or false - it will use a checkbox for the form
    type: boolean
  keywords:
    # an array is an ordered collection of elements
    # the form will present buttons to allow the input of an arbitrary number
    # of array items
    type: array
    # items specifies the options of the individual items in the array
    # you may use the same options as for top level properties
    # items of this array must be strings
    items:
      type: string
    description: Keywords to search and notify on
    default:
      - court
      - foia

# which properties are required to run this add-on - will be checked by the form
required:
  - name
  - age

# If you would like your Add-On to use the soft timeout system 
# to more efficiently run long jobs, you must add the following line:
version: 2


# Event options lets you configure add-ons to be run automatically on certain events
# These can be at regularly scheduled intervals, or on every document upload
eventOptions:
  # The name defines which one of your properties is shown for scheduled events
  # It should be a field which can identify the scheduled run
  name: url
  # events is a list of which event options the user may select from when scheduling
  # Add-On runs
  events:
    # once an hour, day or week
    - hourly
    - daily
    - weekly
    # or run the Add-On for every new file uploaded
    - upload

At the top level you have the following properties:

  • title - The title for your Add-On
  • description - a description for your Add-On - will be displayed above the form when someone runs the add-on
  • type - This should always be set to object
  • documents - This is a list containing none, one or both of query or selected specifying how your Add-On accepts documents. See Document Selection for more details.
  • properties - This is an object describing the data fields your add-on accepts
    • The name will be the name of the variable the data is returned in
    • title - The label shown on the form for this field
    • type - This may be string, number, integer, boolean or array, as shown in the example above
  • required - a list of the required properties
  • version - 2 if you plan on using soft timeouts, otherwise you may exclude this property.
  • eventOptions - options for allowing the Add-On to run on scheduled events. This is optional and if left out, the Add-On will not be able to be scheduled. This option is useful for Add-Ons such as scrapers that you would like to run automatically on a schedule. There are options for running once an hour, day, or week, as well as on every document upload.
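When the form is submitted, the values arrive in your Add-On as ordinary Python objects decoded from JSON. The sketch below shows which Python type corresponds to each schema type and how the required list could be checked by hand. It is a simplified illustration, not the validation DocumentCloud itself performs; check_data and PYTHON_TYPES are hypothetical names.

```python
# JSON Schema type names and the Python types that JSON decoding produces
PYTHON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "array": list,
}


def check_data(config, data):
    """Return a list of problems found in data for the given config."""
    problems = []
    properties = config.get("properties", {})
    for key in config.get("required", []):
        if key not in data:
            problems.append(f"missing required property: {key}")
    for key, value in data.items():
        expected = properties.get(key, {}).get("type")
        if expected is None:
            continue
        # bool is a subclass of int in Python, so rule it out for numeric types
        if expected in ("integer", "number") and isinstance(value, bool):
            problems.append(f"{key} should be {expected}")
        elif not isinstance(value, PYTHON_TYPES.get(expected, object)):
            problems.append(f"{key} should be {expected}")
    return problems
```

For the example configuration above, a submission of {"name": "Ada"} would be reported as missing the required age property.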

requirements.txt

This is a standard pip requirements.txt file. It allows you to specify python packages to be installed before running the Add-On. You may add any dependencies your Add-On has here. By default we install the python-documentcloud API library and the requests HTTP request package. You may upgrade the python-documentcloud version when new releases come out in order to take advantage of new features.

.github/workflows/run-addon.yml

This is the GitHub Actions configuration file for running the add-on. It references a reusable workflow from the MuckRock/documentcloud-addon-workflows repository. This workflow sets up python, installs dependencies and runs the main.py to start the Add-On. It accepts two inputs:

  • timeout - Number of minutes before the run times out. The default is 5. You may increase this if your add-on will run for longer than that. Please note that the longest timeout GitHub Actions supports is 360 minutes (six hours). This is the hard timeout. If you are performing the same task on large sets of documents, please consider implementing your Add-On using soft timeouts.
  • python-version - The version of python you would like to use. Defaults to 3.10.

To set an input such as timeout:

jobs:
  Run-Add-On:
    uses: MuckRock/documentcloud-addon-workflows/.github/workflows/run-addon.yml@v1
    with:
      timeout: 30

It is recommended you use the reusable workflow in order to receive future improvements to the workflow. If needed you may fork the reusable workflow and edit it as needed. If you do edit it, you should leave the first step in place, which uses the UUID as its name, to allow DocumentCloud to identify the run.

It would be possible to make a similar workflow for other programming languages if one wanted to write Add-Ons in a language besides Python.

.github/workflows/update-config.yml

This is the GitHub Actions configuration file for updating the configuration file. It references a reusable workflow from the MuckRock/documentcloud-addon-workflows repository. This workflow sends a POST request to DocumentCloud whenever a new config.yaml file is pushed to the repository. It accepts one input:

LICENSE

The license this code is provided under, the 3-Clause BSD License.

Run Your Add-On in DocumentCloud

If you write your own Add-On, you can run it from within DocumentCloud's user interface through a few simple steps.

First, install the GitHub DocumentCloud App. Note that for this to work properly, you must have your primary GitHub and MuckRock accounts set to use the same email address. You can set your primary MuckRock account email here.

As you add the GitHub DocumentCloud App, give it access to only those repositories in your GitHub account that are Add-Ons you want to run. You can modify this from this page once you have the app installed in GitHub.

[Screenshot: the page linked above, showing a single repository linked to the DocumentCloud GitHub app.]

Then your Add-On will appear for you under "Browse All Add-Ons" and you can activate it there.

Permissions and Security

Currently, the DocumentCloud team reviews and vets each Add-On that's integrated directly within the site (i.e., the ones you see in the Add-On dropdown). Add-Ons that a user downloads and runs locally, however, are not necessarily vetted or reviewed by the DocumentCloud team and you should only run Add-Ons that are published by individuals you trust.

Currently, Add-Ons are essentially given full access to your user account, and can do anything you can while logged in, including reading all of your documents, deleting or modifying them, sharing documents with other users, and much more.

For Add-Ons run through the site, they do not see your account credentials, just a unique token granted to that Add-On. For Add-Ons run through a GitHub Action or run locally, there is the potential for a maliciously written Add-On to obtain your credentials so it is particularly important to understand and trust the source of the Add-On before you run it.

As we open up Add-Ons to additional third-party contributions, we'll begin to offer more limited access tokens that constrain permissions to just the documents and actions explicitly granted to them, as well as defining certain time scopes for that access.

Document Selection

When you run an Add-On via the DocumentCloud web interface, it will take one of four options for what documents to act on:

  • Selected: When you run the Add-On, it will try to act on the documents that are currently selected with a check mark.
  • Query: When you run the Add-On, it will try to act on all of the documents that are currently listed in the search results, including documents that are not in the current view. Note that large numbers of search results, or search results that include documents you don't have permissions for, are more likely to cause errors.
  • Both: Some Add-Ons will let you select between the two options above. If you don't currently have any documents selected, it will default to acting on the documents in the search results while letting you know that you may select documents instead. To do so, cancel the Add-On, select the documents, and pick the Add-On again.
  • Neither: Some Add-Ons don't actually take any documents as input, such as an Add-On that imports documents from a link or scrapes a webpage for document links.

Note that currently, these options determine what specific document IDs are sent to the Add-On, but the Add-On still has permissions to your entire document collection. In the future, as we better understand Add-On use cases, we plan to restrict access permissions to only the subset of documents that an Add-On requires to successfully run.
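From the Add-On author's side, the options above come down to checking which of the two inputs was supplied. The sketch below shows one way to express that choice. The function choose_documents and its two callables, fetch_by_ids and search, are hypothetical placeholders for whatever your Add-On uses to load documents; they are not library functions.

```python
def choose_documents(documents, query, fetch_by_ids, search):
    """Prefer explicitly selected documents, then fall back to the search query."""
    if documents:
        return fetch_by_ids(documents)
    if query:
        return search(query)
    # neither was supplied, for example an Add-On that imports from a link
    return []
```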

Reference

Full parameter reference

This is a reference of all of the data passed in to the Add-On. A single JSON object is passed in to main.py as a quoted string. The init function parses this out and converts it to useful python objects for your main function to use. The following are the top level keys in the object.

  • token - An access token which will be valid for 5 minutes, giving you API access authorized as the user who activated the add-on. The init function uses this value to configure the DocumentCloud client object.

  • refresh_token - A refresh token which will be valid for 1 day, allowing you to obtain new access tokens when they expire. The init function uses this value to configure the DocumentCloud client object.

  • base_uri - This can be used to point the API server to other instances, such as our internal staging server. It should not be used unless you are running your own instance of DocumentCloud. It is also used in the initialization of the DocumentCloud client.

  • auth_uri - The corresponding auth_uri if a base_uri is specified.

  • documents - This is the list of Document IDs which is passed in to main

  • query - This is the search query which is passed in to main

  • data - This is the Add-On specific data, as defined when registering the Add-On with DocumentCloud. It is passed in to main in the params dictionary under the key data

  • user and organization - The user ID and organization ID of the user who activated the Add-On. They are also passed in to main through the params dictionary under the keys user and organization respectively.

  • id - A UUID to uniquely identify this Add-On run. It allows DocumentCloud to identify the run, as well as allowing the run to send back progress, status message and file updates.
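As a rough sketch of the parsing step, the example below decodes such a JSON string and pulls out the keys listed above. All values in EXAMPLE_PAYLOAD are made-up placeholders, and parse_payload is a hypothetical helper rather than the library's init function; the tokens and URIs are left out because they are only used to configure the client.

```python
import json

# placeholder values only; a real payload is generated by DocumentCloud
EXAMPLE_PAYLOAD = json.dumps(
    {
        "documents": [123],
        "query": "",
        "data": {"name": "World"},
        "user": 1,
        "organization": 2,
        "id": "00000000-0000-0000-0000-000000000000",
    }
)


def parse_payload(raw):
    """Decode the JSON string and map its keys to friendlier names."""
    params = json.loads(raw)
    return {
        "documents": params.get("documents", []),
        "query": params.get("query", ""),
        "data": params.get("data", {}),
        # exposed on the AddOn instance as user_id and org_id
        "user_id": params.get("user"),
        "org_id": params.get("organization"),
        "id": params.get("id"),
    }
```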