Commit

Merge pull request #8 from petermr/dev
Dev
petermr committed May 18, 2015
2 parents 30fd0f8 + b1c9102 commit 70459ce
Showing 399 changed files with 82,622 additions and 34 deletions.
65 changes: 65 additions & 0 deletions docs/CREATING_CM.md
@@ -0,0 +1,65 @@
# CREATING CM directories

*** BY FAR THE SAFEST WAY IS TO USE QUICKSCRAPE ***

Any other method is likely to leave files out of sync unless you are careful.
However, it can sometimes be useful to create CMDirs from single files.
This documentation may not have been thoroughly checked.

## Input

For a command:

``` norma -i foo/bar/a12345.suffix -o plugh/xyzzy```

the system will create a CMDir of the form:

``` plugh/xyzzy/a12345```

It will then use ```suffix``` to create either reserved files (e.g. ```fulltext.xml```) in the CMDir or reserved subdirectories
of the form:

``` plugh/xyzzy/a12345/image```

to hold the images. As there can be several images (e.g. ```plugh/xyzzy/a12345.png```), we keep the given names, such as:

``` plugh/xyzzy/a12345/image/a12345.png```

This is verbose and also leads to a separate CMDir for each image.
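
The naming rule above can be sketched in a few lines of Python. This is an illustration only, not norma's own code: the function name `cmdir_for` is hypothetical, and it assumes the CMDir name is simply the input file name with its last suffix removed.

```python
from pathlib import PurePosixPath


def cmdir_for(input_file: str, output_dir: str) -> PurePosixPath:
    """Return <output_dir>/<base name of input_file> (hypothetical helper)."""
    # "foo/bar/a12345.suffix" -> "a12345" (last suffix stripped)
    base_name = PurePosixPath(input_file).stem
    return PurePosixPath(output_dir) / base_name


# Mirrors: norma -i foo/bar/a12345.suffix -o plugh/xyzzy
print(cmdir_for("foo/bar/a12345.suffix", "plugh/xyzzy"))
```

Because the CMDir name comes from the input base name, two inputs with different base names always land in different CMDirs, which is why single-file image input produces one CMDir per image.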

## File types

The following suffixes are supported:

### Single reserved files

The CMDir is generated from the ```-o mydir``` parameter and the base name of the input file (```FilenameUtils.getBaseName()```).

```mydir/bar``` is the CMDir.

```foo/bar.xml``` is copied to ```mydir/bar/fulltext.xml```
```foo/bar.html``` is copied to ```mydir/bar/fulltext.html```
```foo/bar.pdf``` is copied to ```mydir/bar/fulltext.pdf```
```foo/bar.epub``` is copied to ```mydir/bar/fulltext.epub```
```foo/bar.txt``` is copied to ```mydir/bar/fulltext.txt```

### Image files

```foo/bar.png``` is copied to ```mydir/bar/image/bar.png```

Analogous copies for:
```gif```, ```jpg```, ```tif```

### Supplemental Data files

```foo/bar.doc``` is copied to ```mydir/bar/supplement/bar.doc```

Analogous copies for:
```docx```, ```csv```, ```tex```, ```ppt```, ```pptx```, ...

### SVG Data files

```foo/bar.svg``` is copied to ```mydir/bar/svg/bar.svg```
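
The suffix rules in this section can be summarised as a small lookup. The Python sketch below is an assumption-laden illustration rather than the actual norma implementation: the function `destination` and the three sets are hypothetical names, the supplemental set contains only the suffixes named above (the list is open-ended), and it assumes a supplemental file keeps its own name, as image and SVG files do.

```python
from pathlib import PurePosixPath

# Suffix groups as listed in this document (hypothetical names).
RESERVED = {"xml", "html", "pdf", "epub", "txt"}
IMAGE = {"png", "gif", "jpg", "tif"}
SUPPLEMENT = {"doc", "docx", "csv", "tex", "ppt", "pptx"}  # list is open-ended


def destination(input_file: str, output_dir: str) -> PurePosixPath:
    """Return where input_file would be copied inside its CMDir."""
    src = PurePosixPath(input_file)
    suffix = src.suffix.lstrip(".")            # "bar.xml" -> "xml"
    cmdir = PurePosixPath(output_dir) / src.stem
    if suffix in RESERVED:
        return cmdir / f"fulltext.{suffix}"    # reserved file name
    if suffix in IMAGE:
        return cmdir / "image" / src.name      # original name kept
    if suffix in SUPPLEMENT:
        return cmdir / "supplement" / src.name
    if suffix == "svg":
        return cmdir / "svg" / src.name
    raise ValueError(f"unsupported suffix: {suffix!r}")


print(destination("foo/bar.xml", "mydir"))
print(destination("foo/bar.png", "mydir"))
```

Unknown suffixes raise an error here; how norma itself treats them is not stated in this document.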



Binary file added examples/hocr-tesseract-ijsem-140.zip
Binary file not shown.
@@ -0,0 +1,66 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>
</title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta name='ocr-system' content='tesseract 3.03' />
<meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word'/>
</head>
<body>
<div class='ocr_page' id='page_1' title='image "ijs.0.000174-0-000.pbm.png"; bbox 0 0 994 516; ppageno 0'>
<div class='ocr_carea' id='block_1_1' title="bbox 273 4 951 34">
<p class='ocr_par' dir='ltr' id='par_1_1' title="bbox 273 4 951 34">
<span class='ocr_line' id='line_1_1' title="bbox 273 4 951 34; baseline 0 -7"><span class='ocrx_word' id='word_1_1' title='bbox 273 4 448 34; x_wconf 78' lang='eng' dir='ltr'><em>‘Rhodoplanes</em></span> <span class='ocrx_word' id='word_1_2' title='bbox 458 4 616 34; x_wconf 77' lang='eng' dir='ltr'><em>cryptolactis’</em></span> <span class='ocrx_word' id='word_1_3' title='bbox 633 5 701 27; x_wconf 91' lang='eng' dir='ltr'><em>DSM</em></span> <span class='ocrx_word' id='word_1_4' title='bbox 711 5 773 27; x_wconf 84' lang='eng'><em>9987</em></span> <span class='ocrx_word' id='word_1_5' title='bbox 791 5 951 33; x_wconf 88' lang='eng' dir='ltr'>(AB087718)</span>
</span>
</p>
</div>
<div class='ocr_carea' id='block_1_2' title="bbox 0 9 324 504">
<p class='ocr_par' dir='ltr' id='par_1_2' title="bbox 0 9 324 504">
<span class='ocr_line' id='line_1_2' title="bbox 165 9 272 51; baseline 0 -1"><span class='ocrx_word' id='word_1_6' title='bbox 165 9 272 51; x_wconf 95' lang='eng' dir='ltr'><em> </em></span>
</span>
<span class='ocr_line' id='line_1_3' title="bbox 138 51 272 120; baseline 0 -4"><span class='ocrx_word' id='word_1_7' title='bbox 138 51 272 120; x_wconf 95' lang='eng' dir='ltr'><em> </em></span>
</span>
<span class='ocr_line' id='line_1_4' title="bbox 63 120 251 187; baseline 0.027 -4"><span class='ocrx_word' id='word_1_8' title='bbox 63 120 251 187; x_wconf 95' lang='eng' dir='ltr'><em> </em></span>
</span>
<span class='ocr_line' id='line_1_5' title="bbox 0 187 303 321; baseline 0.02 -70"><span class='ocrx_word' id='word_1_9' title='bbox 0 187 303 321; x_wconf 95' lang='eng' dir='ltr'><em> </em></span>
</span>
<span class='ocr_line' id='line_1_6' title="bbox 0 321 324 461; baseline 0 0"><span class='ocrx_word' id='word_1_10' title='bbox 0 321 324 461; x_wconf 95' lang='eng' dir='ltr'><em> </em></span>
</span>
<span class='ocr_line' id='line_1_7' title="bbox 0 461 275 504; baseline 0 12"><span class='ocrx_word' id='word_1_11' title='bbox 0 461 275 504; x_wconf 95' lang='eng' dir='ltr'><em> </em></span>
</span>
</p>
</div>
<div class='ocr_carea' id='block_1_3' title="bbox 244 67 990 445">
<p class='ocr_par' dir='ltr' id='par_1_3' title="bbox 244 67 844 168">
<span class='ocr_line' id='line_1_8' title="bbox 288 67 844 100; baseline 0.002 -7"><span class='ocrx_word' id='word_1_12' title='bbox 288 70 455 100; x_wconf 81' lang='eng' dir='ltr'><em>Rhodoplanes</em></span> <span class='ocrx_word' id='word_1_13' title='bbox 465 78 547 93; x_wconf 77' lang='eng' dir='ltr'><em>roseus</em></span> <span class='ocrx_word' id='word_1_14' title='bbox 557 72 624 94; x_wconf 92' lang='eng' dir='ltr'><em>DSM</em></span> <span class='ocrx_word' id='word_1_15' title='bbox 635 67 710 94; x_wconf 86' lang='eng' dir='ltr'><em>5909T</em></span> <span class='ocrx_word' id='word_1_16' title='bbox 721 70 844 100; x_wconf 83' lang='eng' dir='ltr'>(D25313)</span>
</span>
<span class='ocr_line' id='line_1_9' title="bbox 244 136 758 168; baseline 0 -6"><span class='ocrx_word' id='word_1_17' title='bbox 244 139 410 168; x_wconf 80' lang='eng' dir='ltr'><em>Rhodoplanes</em></span> <span class='ocrx_word' id='word_1_18' title='bbox 420 139 517 168; x_wconf 83' lang='eng' dir='ltr'><em>elegans</em></span> <span class='ocrx_word' id='word_1_19' title='bbox 527 136 628 162; x_wconf 80' lang='eng' dir='ltr'><em>A8130T</em></span> <span class='ocrx_word' id='word_1_20' title='bbox 635 139 728 167; x_wconf 88' lang='eng' dir='ltr'>(D2531</span> <span class='ocrx_word' id='word_1_21' title='bbox 737 139 758 168; x_wconf 94' lang='eng'><em>1)</em></span>
</span>
</p>

<p class='ocr_par' dir='ltr' id='par_1_4' title="bbox 263 203 670 235">
<span class='ocr_line' id='line_1_10' title="bbox 263 203 670 235; baseline 0.002 -6"><span class='ocrx_word' id='word_1_22' title='bbox 263 207 345 230; x_wconf 82' lang='eng' dir='ltr'><strong>Strain</strong></span> <span class='ocrx_word' id='word_1_23' title='bbox 356 203 499 230; x_wconf 86' lang='eng' dir='ltr'><em>TUT3530T</em></span> <span class='ocrx_word' id='word_1_24' title='bbox 509 207 670 235; x_wconf 87' lang='eng' dir='ltr'><strong>(AB087717)</strong></span>
</span>
</p>

<p class='ocr_par' dir='ltr' id='par_1_5' title="bbox 281 273 990 445">
<span class='ocr_line' id='line_1_11' title="bbox 303 273 889 305; baseline 0.002 -7"><span class='ocrx_word' id='word_1_25' title='bbox 303 276 473 298; x_wconf 77' lang='eng' dir='ltr'><em>Blastochlaris</em></span> <span class='ocrx_word' id='word_1_26' title='bbox 484 276 563 298; x_wconf 82' lang='eng' dir='ltr'><em>viridis</em></span> <span class='ocrx_word' id='word_1_27' title='bbox 573 277 657 299; x_wconf 91' lang='eng' dir='ltr'>ATCC</span> <span class='ocrx_word' id='word_1_28' title='bbox 671 273 889 305; x_wconf 71' lang='eng' dir='ltr'>19567T(D25314)</span>
</span>
<span class='ocr_line' id='line_1_12' title="bbox 336 343 937 375; baseline 0.002 -7"><span class='ocrx_word' id='word_1_29' title='bbox 336 345 506 368; x_wconf 80' lang='eng' dir='ltr'><em>Blastochloris</em></span> <span class='ocrx_word' id='word_1_30' title='bbox 516 345 659 374; x_wconf 76' lang='eng' dir='ltr'><em>sulfoviridis</em></span> <span class='ocrx_word' id='word_1_31' title='bbox 671 346 736 369; x_wconf 88' lang='eng' dir='ltr'><em>DSM</em></span> <span class='ocrx_word' id='word_1_32' title='bbox 748 343 807 369; x_wconf 84' lang='eng' dir='ltr'><em>729T</em></span> <span class='ocrx_word' id='word_1_33' title='bbox 815 346 937 375; x_wconf 88' lang='eng' dir='ltr'>(D86514)</span>
</span>
<span class='ocr_line' id='line_1_13' title="bbox 281 412 990 445; baseline 0 -7"><span class='ocrx_word' id='word_1_34' title='bbox 281 415 538 444; x_wconf 79' lang='eng' dir='ltr'><em>Rhodopseudomonas</em></span> <span class='ocrx_word' id='word_1_35' title='bbox 545 415 659 444; x_wconf 81' lang='eng' dir='ltr'><em>palustris</em></span> <span class='ocrx_word' id='word_1_36' title='bbox 674 416 758 439; x_wconf 90' lang='eng' dir='ltr'>ATCC</span> <span class='ocrx_word' id='word_1_37' title='bbox 771 412 990 445; x_wconf 75' lang='eng' dir='ltr'>17001T(D25312)</span>
</span>
</p>
</div>
<div class='ocr_carea' id='block_1_4' title="bbox 293 481 956 513">
<p class='ocr_par' dir='ltr' id='par_1_6' title="bbox 293 481 956 513">
<span class='ocr_line' id='line_1_14' title="bbox 293 481 956 513; baseline 0 -6"><span class='ocrx_word' id='word_1_38' title='bbox 293 484 465 507; x_wconf 76' lang='eng' dir='ltr'><em>Rhodoblastus</em></span> <span class='ocrx_word' id='word_1_39' title='bbox 476 484 625 513; x_wconf 74' lang='eng' dir='ltr'><em>acidophilus</em></span> <span class='ocrx_word' id='word_1_40' title='bbox 634 485 718 508; x_wconf 88' lang='eng' dir='ltr'>ATCC</span> <span class='ocrx_word' id='word_1_41' title='bbox 728 481 820 508; x_wconf 90' lang='eng' dir='ltr'><em>25092T</em></span> <span class='ocrx_word' id='word_1_42' title='bbox 828 484 956 513; x_wconf 85' lang='eng' dir='ltr'>(M34128)</span>
</span>
</p>
</div>
</div>
</body>
</html>