The Percy Bysshe Shelley Manuscript Corpus

This corpus was prepared by Tatsuo Tokoo with support from a Grant-in-Aid for Scientific Research (KAKENHI) from the Japan Society for the Promotion of Science (JSPS) in 2001-2002. The Maryland Institute for Technology in the Humanities (MITH) presents it here with minor revisions and tools for manipulating it.

The corpus is licensed under a Creative Commons Attribution 3.0 Unported License, and all supporting software is released by MITH under the Apache License, Version 2.0.

The Corpus Format

The corpus uses a custom markup scheme implemented by Tatsuo Tokoo. Each line represents a line of the manuscript, with metadata about the line following the text and set apart by forward slashes. Within the text of the line, angle brackets are used to indicate "doubtful words" and square brackets to indicate deletions.
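
To make the description above concrete, here is a minimal Python sketch of how one such line might be pulled apart. It is not the Scala library in src, and it rests on assumptions not confirmed by the corpus documentation: that the text never contains a forward slash, that every slash-delimited segment after the text is a metadata field, and that brackets do not nest. The sample line in the usage note is invented for illustration and is not taken from the corpus; parse_line is a hypothetical name.

```python
import re

# Angle brackets mark "doubtful words"; square brackets mark deletions.
DOUBTFUL = re.compile(r"<([^<>]*)>")
DELETED = re.compile(r"\[([^\[\]]*)\]")


def parse_line(raw):
    """Split one corpus line into text, metadata fields, and markup.

    Assumption: the text comes before the first "/" and each later
    slash-delimited segment is one metadata field.
    """
    text, *fields = raw.rstrip("\n").split("/")
    text = text.strip()
    metadata = [field.strip() for field in fields if field.strip()]

    # A plain reading: drop deleted spans, keep doubtful words unmarked,
    # then collapse the whitespace left behind.
    reading = DELETED.sub("", text)
    reading = DOUBTFUL.sub(r"\1", reading)
    reading = " ".join(reading.split())

    return {
        "text": text,
        "metadata": metadata,
        "doubtful": DOUBTFUL.findall(text),
        "deleted": DELETED.findall(text),
        "reading": reading,
    }
```

For example, parse_line("a <doubtful> word and a [deleted] one /field-a/field-b/") returns the two metadata fields, the doubtful word, and the deleted word separately, along with the reading "a doubtful word and a one". Whether deletions should be dropped or kept in a reading depends on the analysis; dropping them here is only one choice.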

Working with the Corpus

The src directory contains utilities for working with this format in Java (or any other language that targets the Java Virtual Machine; the library itself is written in Scala), along with some simple command line tools.

The library uses the Maven build system. For example, if you have Maven installed on your machine, the following command converts the corpus to a format suitable for topic modeling with MALLET:

mvn compile exec:java \
  -Dexec.mainClass="edu.umd.mith.sga.mss.MalletConverter" \
  -Dexec.args="pbs-mss-pages.txt"

If you also have MALLET installed, you can then use the following commands to train a topic model:

mallet import-file --remove-stopwords --keep-sequence \
  --input pbs-mss-pages.txt --output pbs-mss-pages.mallet

mallet train-topics \
  --num-topics 30 --optimize-interval 10 \
  --input pbs-mss-pages.mallet \
  --output-topic-keys pbs-mss-pages-30-keys.txt \
  --output-model pbs-mss-pages-30.model

The examples directory includes some sample output files.
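
Once a model has been trained, the topic keys file can be loaded for further inspection. The Python sketch below assumes the layout MALLET usually writes for --output-topic-keys: one topic per line with three tab-separated columns (topic number, Dirichlet weight, and the top words separated by spaces). The function name parse_topic_keys is hypothetical, and the layout should be checked against the file your MALLET version produces.

```python
def parse_topic_keys(lines):
    """Parse topic-keys lines into (topic_id, weight, words) tuples.

    Assumption: each non-empty line has the form
    "<topic number>\t<weight>\t<word> <word> ...".
    """
    topics = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        topic_id, weight, words = line.split("\t", 2)
        topics.append((int(topic_id), float(weight), words.split()))
    return topics
```

A file handle can be passed directly, for instance with open("pbs-mss-pages-30-keys.txt", encoding="utf-8") as handle: topics = parse_topic_keys(handle). Sorting the result by weight, highest first, is one way to see which topics the optimized model treats as most prevalent.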
