Labeled segmentation for the document structure of printed books

dbamman/book-segmentation

book-segmentation

Data, code and trained models to segment the document structure of printed books and label each segment according to ten categories:

  • Title page (including half titles)
  • Ad card (advertisements)
  • Publisher information
  • Dedication
  • Preface
  • Table of contents
  • Text
  • Appendix
  • Index
  • N/A

The data, categorization system, and models are described in more detail in:

Lara McConnaughey, Jennifer Dai and David Bamman (2017), "The Labeled Segmentation of Printed Books" (EMNLP 2017)

This model makes use of data from Ted Underwood's DataMunging repo.

Usage

To segment a HathiTrust book named book.zip using the default model, run:

    python code/segment_book.py book.zip models/labseg10/

This should output a list of page numbers and labels for all pages in book.zip.
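Because the output is one label per page, a common next step is to collapse consecutive pages with the same label into segments. The following is a minimal sketch of that step, not part of this repository. It assumes the output has been captured as text with one tab-separated `page<TAB>label` pair per line; that format, the helper names `parse_output` and `pages_to_segments`, and the label strings in the example are all illustrative assumptions, so adjust the parsing to match what segment_book.py actually prints.

```python
from itertools import groupby


def parse_output(text):
    """Parse lines of 'page<TAB>label' (an assumed format) into (page, label) pairs."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        page, label = line.split("\t", 1)
        pairs.append((int(page), label))
    return pairs


def pages_to_segments(page_labels):
    """Collapse consecutive pages sharing a label into (first_page, last_page, label)."""
    segments = []
    for label, run in groupby(page_labels, key=lambda pair: pair[1]):
        pages = [page for page, _ in run]
        segments.append((pages[0], pages[-1], label))
    return segments


# Hypothetical captured output for a six-page book.
example = (
    "1\tTitle page\n"
    "2\tPublisher information\n"
    "3\tTable of contents\n"
    "4\tText\n"
    "5\tText\n"
    "6\tIndex\n"
)

segments = pages_to_segments(parse_output(example))
for first, last, label in segments:
    print(f"{first}-{last}\t{label}")
```

Grouping only merges adjacent pages, so a label that reappears later in the book (for example, a second run of ad cards at the back) yields a separate segment.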

Dependencies

NumPy (pip install numpy --user), SciPy (pip install scipy --user), and TensorFlow 1.0.
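Before running the segmenter, it can help to confirm that the dependencies are importable in the current environment. The sketch below is an optional convenience and not part of this repository; the helper name `missing_dependencies` is hypothetical. It checks only that each module can be found, not that the installed TensorFlow is version 1.0.

```python
import importlib.util


def missing_dependencies(module_names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Import names of the packages listed above.
required = ["numpy", "scipy", "tensorflow"]
missing = missing_dependencies(required)
if missing:
    print("Missing dependencies: " + ", ".join(missing))
else:
    print("All dependencies found.")
```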
