book-segmentation

Data, code, and trained models for segmenting the document structure of printed books and labeling each segment with one of ten categories.

The data, categorization system, and models are described in more detail here:

Lara McConnaughey, Jennifer Dai and David Bamman (2017), "The Labeled Segmentation of Printed Books" (EMNLP 2017)

This model makes use of data from Ted Underwood's DataMunging repo.

Usage

To segment a HathiTrust book named book.zip using the default model:

    python code/segment_book.py book.zip models/labseg10/

This should output a list of page numbers and labels for all pages in book.zip.
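To run the command above over several books, a small wrapper along the following lines may help. This is a minimal sketch, not part of the repository: the helper names build_command, segment_book and segment_many are hypothetical, it assumes it is run from the repository root with an interpreter that has the dependencies installed, and it returns the script's output as raw text because the exact output format is not specified here.

```python
import subprocess
import sys


def build_command(book_zip, model_dir="models/labseg10/",
                  script="code/segment_book.py"):
    """Build the argument list for the documented command line.

    Hypothetical helper: it only mirrors
    `python code/segment_book.py book.zip models/labseg10/`.
    """
    return [sys.executable, script, book_zip, model_dir]


def segment_book(book_zip, model_dir="models/labseg10/"):
    """Run the segmenter on one book and return its stdout as text."""
    result = subprocess.run(
        build_command(book_zip, model_dir),
        stdout=subprocess.PIPE,
        universal_newlines=True,  # decode stdout to str
        check=True,               # raise CalledProcessError on failure
    )
    return result.stdout


def segment_many(book_zips, model_dir="models/labseg10/"):
    """Segment several books; returns {zip path: raw output text}."""
    return {path: segment_book(path, model_dir) for path in book_zips}
```

Calling segment_many(["book.zip", "other_book.zip"]) would then return one block of output text per book; other_book.zip is a placeholder name.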

Requires NumPy (pip install numpy --user), SciPy (pip install scipy --user) and TensorFlow 1.0.
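Before running the segmenter, it can be useful to confirm that these packages are importable. The following is a minimal sketch using only the standard library; missing_modules is a hypothetical helper name, and the check covers presence only, not the installed version.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names
            if importlib.util.find_spec(name) is None]


# Import names of the three dependencies listed above.
REQUIRED = ["numpy", "scipy", "tensorflow"]

missing = missing_modules(REQUIRED)
if missing:
    print("Missing packages: " + ", ".join(missing))
else:
    print("All required packages are importable.")
```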
