From suriano.huygens.knaw.nl:
Christofforo Suriano was the first Venetian envoy to the Dutch Republic. His arrival in The Hague marks the beginning of official Dutch-Venetian diplomatic relations. This digital edition presents all 725 letters sent by Suriano to the Venetian Senate between 1616 and 1623, totaling approximately 7,000 pages. His correspondence is of great importance for Dutch, Venetian, and European history in the first phase of the Thirty Years’ War.
For more info, see the Suriano project.
In this repo we prepare the data for the website of the correspondence of Christofforo Suriano:
- the entry point of that website is suriano.huygens.knaw.nl, which contains materials not covered in this repo;
- the correspondence itself is served by edition.suriano.huygens.knaw.nl.
This repo also contains a Text-Fabric copy of the corpus (in fact, that copy has been instrumental in building the data for the website).
This copy contains the transcriptions and thumbnails of the scans.
Here are the express instructions to get going:
- install Python;
- install Text-Fabric: `pip install 'text-fabric[all]'`;
- run the Text-Fabric browser on the Suriano corpus: `tf HuygensING/suriano`.
This will first download the data; note that the thumbnails occupy 300 MB of space, so this may take a while. After that, a browser window opens with an interface on the Suriano correspondence. You can read the text, and if you click on a camera icon (📷) you see the scan of the corresponding page.
You can also run your own programs on the corpus through the Text-Fabric API. Here is a tutorial to get started.
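For instance, here is a minimal sketch of loading the corpus and inspecting it through the general Text-Fabric API (the node type inventory is corpus-specific; see the tutorial for details):

```python
from tf.app import use

# Load the Suriano corpus; the first call downloads the data.
A = use("HuygensING/suriano")

F = A.api.F

# Report how many nodes there are of each node type.
for nodeType in F.otype.all:
    print(nodeType, len(list(F.otype.s(nodeType))))
```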
The following people all played a role in the construction of this dataset.
Research at Huygens
- Nina Lamal: researcher, deeply familiar with all the ins and outs of the corpus;
- Helmer Helmers: researcher and leader of the Suriano project.
Funder
- Menno Witteveen: entrepreneur and historian.
Transcribers
- Alexa Bianchini
- Ruben Celani
- Michele Correra
- Flavia di Giampaolo
- Federica D’Uonno
- Vera Frantellizzi
- Cristina Lezzi
- Giorgia Priotti
- Angelo Restaino
- Filippo Sedda
Development by TeamText
- Dirk Roorda: developer; wrote the conversion code and new Text-Fabric functions for named entity recognition;
- Sebastiaan van Daalen: front-end developer of TextAnnoViz (the main framework of the website), with exquisite knowledge of historical manuscript editions;
- Bram Buitendijk: back-end developer on the annotation infrastructure (un-t-ann-gle and AnnoRepo);
- Hayco de Jong: back-end developer on the search infrastructure (TextRepo, Broccoli, and Brinta);
- Bas Leenknegt: all-round developer on back-end and front-end matters (TAV, TextRepo);
- Hennie Brugman: team leader, connecting the dots between time, budget, and people.
Here we describe how we have constructed the Suriano dataset (and how you can replicate it).
We proceed as follows:
- incoming page scans are renamed and checked for completeness;
- incoming transcriptions in Word are converted to TEI; in the process, the page sequences in the transcriptions are compared with the pages in the scans;
- the TEI is converted to Text-Fabric;
- using Text-Fabric and a spreadsheet of named entity triggers, we mark thousands of named entities in the text (a sketch of this trigger matching follows after this list);
- by means of Text-Fabric we generate a WATM export: a set of text fragments and annotations;
- by means of Text-Fabric we generate IIIF manifests for the page scans (a sketch of such a manifest also follows after this list);
- the WATM output is exported to the TeamText virtual machine;
- the manifests and other static files are exported to a persistent volume on our k8s network.
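To make the trigger-based entity marking concrete: each row of the spreadsheet couples an entity (an id and a kind) to one or more surface forms, the triggers, and occurrences of a trigger in the text are then marked as that entity. The following is a hypothetical sketch of that idea, not the actual Text-Fabric NER code; the column names and the single-token matching are simplifying assumptions.

```python
import csv

# Hypothetical trigger table with columns "eid", "kind", and "triggers"
# (semicolon-separated surface forms). Real triggers may span multiple
# tokens and matching may be context-sensitive.
def loadTriggers(path):
    triggers = {}
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            for trigger in row["triggers"].split(";"):
                triggers[trigger.strip()] = (row["eid"], row["kind"])
    return triggers


def markEntities(tokens, triggers):
    # tokens: a list of word strings;
    # returns (start, end, eid, kind) spans for every matched trigger.
    spans = []
    for i, token in enumerate(tokens):
        if token in triggers:
            eid, kind = triggers[token]
            spans.append((i, i + 1, eid, kind))
    return spans
```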
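As for the IIIF manifests: a manifest is a JSON document that tells an image viewer which scans belong together and where to fetch them. Below is a heavily stripped-down sketch of the shape of such a manifest (IIIF Presentation API 3.0); all URLs and dimensions are invented for the example.

```python
# Illustrative shape of a IIIF Presentation 3.0 manifest for one letter;
# every identifier below is made up.
manifest = {
    "@context": "http://iiif.io/api/presentation/3/context.json",
    "id": "https://example.org/iiif/letter001/manifest",
    "type": "Manifest",
    "label": {"en": ["Letter 1"]},
    "items": [
        {
            "id": "https://example.org/iiif/letter001/canvas/1",
            "type": "Canvas",
            "height": 3000,
            "width": 2000,
            "items": [
                {
                    "id": "https://example.org/iiif/letter001/page/1",
                    "type": "AnnotationPage",
                    "items": [
                        {
                            "id": "https://example.org/iiif/letter001/anno/1",
                            "type": "Annotation",
                            "motivation": "painting",
                            "body": {
                                "id": "https://example.org/iiif/images/scan001.jpg",
                                "type": "Image",
                                "format": "image/jpeg",
                            },
                            "target": "https://example.org/iiif/letter001/canvas/1",
                        }
                    ],
                }
            ],
        }
    ],
}
```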
This is where the control of this repo stops. The infrastructure of TeamText takes over from here:
- the WATM is imported into TextRepo and AnnoRepo: essentially it is a stream of tokens and a set of web annotations (illustrated after this list);
- additional configuration to steer the final display and the search indexes is added to Broccoli and Brinta;
- finally, TextAnnoViz displays the letters on the website, fed by the contents of AnnoRepo, TextRepo, Brinta and Broccoli.
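Schematically, that combination of a token stream and web annotations looks as follows (the actual WATM field names differ; this only shows the principle):

```python
# Schematic illustration only, not the exact WATM format:
# the text is a stream of tokens, and each web annotation targets
# a span of token positions with a typed body.
tokens = ["Christofforo", "Suriano", "writes", "from", "The", "Hague", "."]

annotations = [
    {
        "body": {"type": "entity", "kind": "person", "eid": "suriano"},
        "target": {"start": 0, "end": 2},  # "Christofforo Suriano"
    },
    {
        "body": {"type": "entity", "kind": "place", "eid": "den.haag"},
        "target": {"start": 4, "end": 6},  # "The Hague"
    },
]
```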
There is a large degree of isomorphism between the Text-Fabric data and the final website data.
Technical information on the actual deployment is in our internal repository (not open, behind a firewall).
The source data, the TEI that we derived from it, and more are available on SurfDrive (public read-only link). This does not include the original high-resolution scans, since they are not available as a downloadable package. These scans are also on SurfDrive, but not accessible via a public link. If you are interested in these scans, contact Nina Lamal.
Note that (very) low-resolution versions of these scans are provided in this repo: thumb.
Many aspects of the curation process have been carried out by programs in a rule-based way. These processes have produced a number of report files.
There are extensive README files in the report directory.
- README_DATASOURCE: description of the contents of the datasource directory on SurfDrive;
- README_SCANS: description of the contents of the scan directory on SurfDrive;
- README: description of the report files.
See the README.md in the programs directory.