ResearchHub/pdf2html

pdf2html

Overview

This repo contains two packages:

  • pdf2html: The pdf2html command line utility, written primarily in C++.
  • pdf2html-lambda: A lambda function that calls the pdf2html CLI to convert PDF files from S3, written in Go.

Requirements

The following tools are required in order to develop and run the CLI and lambda function:

Additionally, gcc and make are required. On a Mac, both can be installed with the Xcode command line tools:

xcode-select --install

The following are also required, but can be installed via Homebrew. This is easily done using the Brewfile at the root of the repo. If you don't have Homebrew or are not on a Mac, follow the links to go through the install process for each.

To install via the Brewfile, run the following command at the root of the repo:

brew bundle install
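
Before building, it can help to confirm that the prerequisites are actually on your PATH. The following is a minimal sketch, not part of the repo: check_tools is a hypothetical helper, gcc and make come from the requirements above, and docker is an assumption based on the container-based build described in the next section. Adjust the list to match the tools you need.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repo): report whether each named
# tool is available on PATH. Returns 0 if all are found, 1 if any is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "ok: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# gcc and make are named in the requirements; docker is an assumption.
check_tools gcc make docker || echo "Install the missing tools before building."
```

The helper only checks that a command exists. It does not verify versions, so a tool that is present but too old will still be reported as ok.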

Building pdf2html-lambda

The lambda function is written in Go and is managed using AWS SAM (Serverless Application Model). The lambda itself is packaged as a Docker container, leveraging the container image support of AWS Lambda. This allows us to build the pdf2html CLI utility directly into the final container of the pdf2html-lambda function.

Most of the build process is managed using Make. Run the following command to build the lambda, which also builds the pdf2html CLI utility:

make build

Note: Be sure to rebuild the lambda after every code change to ensure the version of the lambda that is run is always up-to-date.

Running locally

We use MinIO to simulate AWS S3 locally. MinIO runs as a Docker container, which is configured via the root-level docker-compose.yml file.

You can run all of the required backend services by running:

make start.backend

Then you can start the lambda by running:

make start.lambda

You can also invoke the lambda once with a test payload by running:

make invoke