Update ingest test fixtures
christinestraub authored Nov 3, 2023
1 parent 1311330 commit d0b5a9d
Showing 10 changed files with 239 additions and 279 deletions.

Original file line number Diff line number Diff line change
@@ -59,16 +59,6 @@
},
"text": "Data on environmental sustainable corrosion inhibitor for stainless steel in aggressive environment"
},
{
"type": "Title",
"element_id": "c21a7f75a507e8d1d940e30b66575616",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
"page_number": 1
},
"text": "(Jee"
},
{
"type": "Title",
"element_id": "6d1999c49562bd7c2b15a41327b8fc36",
@@ -429,16 +419,6 @@
},
"text": "Fig. 2. Corrosion rate versus exposure time for stainless steel immersed in 0.5 M H2SO4 solution in the absence and presence of ES."
},
{
"type": "UncategorizedText",
"element_id": "57e2eb94df928d0cf17b2c0d41ae042e",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
"page_number": 3
},
"text": "100 4"
},
{
"type": "UncategorizedText",
"element_id": "ad57366865126e55649ecb23ae1d4888",
@@ -521,13 +501,13 @@
},
{
"type": "Image",
"element_id": "aa4f0eca72d0603d384878e68fe5be57",
"element_id": "506dff384be7ac4026b4227e860b3a39",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
"page_number": 4
},
"text": "5 1 os =10; =o ° © —\" 205 i —~é é —ip a5 — Control -2 — & 2.5 T T T 0.0000001 + —-0.00001 0.001 O14 Current Density (A/cm2)"
"text": "25 14 os ~1 2 0 i = —4 Zs 4 8 — bg 14 é — 2g 137 — Control 24 8g 25 T T T 0.0000001 0.00001 0.001 01 Current Density (A/cm2)"
},
{
"type": "FigureCaption",
@@ -599,16 +579,6 @@
},
"text": "C/0"
},
{
"type": "UncategorizedText",
"element_id": "e2b6d7e2ab125149fa820500cedfffbb",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
"page_number": 4
},
"text": "—=—Cc/0"
},
{
"type": "FigureCaption",
"element_id": "a24d672821f7acb0bbe2c8a813debe16",
@@ -631,13 +601,13 @@
},
{
"type": "Image",
"element_id": "273fb301b173075f79b2cbdab962e2ff",
"element_id": "59b3c54b48b40dffd68bd4d3e1859e95",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
"page_number": 5
},
"text": "SEM HV: Q0KY WD: 14.89 rmrm ‘9EM MAO: 209 x Det: DOE Pectomsence In nanospact"
"text": "SEM HV: 20.0KV 9D: 14.90 men ‘DEM MAO: 209 x Det: DOE 260 om Pectormence In nanos pac:"
},
{
"type": "FigureCaption",
@@ -661,13 +631,13 @@
},
{
"type": "Image",
"element_id": "520d1da08c86ce165cd2843e2dc27f98",
"element_id": "032df41d39ff55ef057e900ef83bad04",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
"page_number": 5
},
"text": "SEMHV: 20.0KV WD: 15.54 mm EM ING: ACO x Dei: OSE"
"text": "SEM HY: 70.0KV SEM IAAG: 400 x"
},
{
"type": "FigureCaption",
@@ -739,6 +709,16 @@
},
"text": "Austenitic stainless steel Type 316 was used in this study with chemical composition reported in [1,2]. The chemicals used were of annular grade. The inhibitor concentrations are in the range of 2, 4, 6, 8 and 10 g [3–5]. The structural formula of egg shell powder is shown in Fig. 9."
},
{
"type": "Image",
"element_id": "ee7729e0ad3c974c68a2b6bc1f09378a",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
"page_number": 6
},
"text": "Tie} = O iy oH H3;COCHN™ OH"
},
{
"type": "FigureCaption",
"element_id": "389bd6e22f3ac105897fa0a75807197d",
@@ -929,6 +909,16 @@
},
"text": "steps of the linear polarization plot are substituted to get corrosion current. Nova software was used with linear polarization resistance (LPR) and the current was set to 10 mA (maximum) and 10 nA (minimum). LSV staircase parameter start potential (cid:3) 1.5 v, step potential 0.001 m/s and stop potential of þ1.5 v set was used in this study."
},
{
"type": "Title",
"element_id": "c9015d53b90846454375a2fdf2829c66",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
"page_number": 7
},
"text": "Acknowledgements"
},
{
"type": "Title",
"element_id": "ee7d6fc036b5c1d6c5f5ebb9bf533f01",
Original file line number Diff line number Diff line change
@@ -59,16 +59,6 @@
},
"text": "A benchmark dataset for the multiple depot vehicle scheduling problem"
},
{
"type": "Title",
"element_id": "77b037daa0a8a3f7349bd57dda36499f",
"metadata": {
"data_source": {},
"filetype": "application/pdf",
"page_number": 1
},
"text": "(eee"
},
{
"type": "Title",
"element_id": "9d8efece3117b2eec928f8ee4d4888e4",
Original file line number Diff line number Diff line change
@@ -511,7 +511,7 @@
},
{
"type": "Image",
"element_id": "2326155e29fbcc80862533eba5d9c75c",
"element_id": "0ef9d50781a8637826772ff44e47f462",
"metadata": {
"data_source": {
"url": "/home/runner/work/unstructured/unstructured/test_unstructured_ingest/example-docs/layout-parser-paper.pdf",
@@ -524,7 +524,7 @@
"filetype": "application/pdf",
"page_number": 4
},
"text": "Efficient Data Annotation Model Customization Document Images Community Platform ‘a >) ¥ DIA Model Hub i .) Customized Model Training] == | Layout Detection Models | ——= DIA Pipeline Sharing ~ OCR Module = { Layout Data stuctue ) = (storage Visualization VY"
"text": "Model Customization Document Images Community Platform Efficient Data Annotation ¥ DIA Model Hub Customized Model Training) == | Layout Detection Models | <== DIA Pipeline Sharing OCR Module S— | Layout Data Structure | === | Storage & Visualization A u r Libran"
},
{
"type": "NarrativeText",
@@ -659,7 +659,7 @@
},
"filetype": "application/pdf",
"page_number": 5,
"text_as_html": "<table><thead><th>Dataset</th><th>| Base Model'|</th><th>Notes</th></thead><tr><td>PubLayNet B8]|</td><td>F/M</td><td>Layouts of modern scientific documents</td></tr><tr><td>PRImA</td><td>M</td><td>nned modern magazines and scientific reports</td></tr><tr><td>Newspapei</td><td>F</td><td>canned US newspapers from the 20th century</td></tr><tr><td>TableBank</td><td>F</td><td>Table region on modern scientific and business document</td></tr><tr><td>HJDataset</td><td>F/M</td><td>Layouts of history Japanese documents</td></tr></table>"
"text_as_html": "<table><thead><th>Dataset</th><th>| Base Mode!'|</th><th>| Notes</th></thead><tr><td>PubLayNet 38]|</td><td>F/M</td><td>Layouts of modern scientific documents</td></tr><tr><td>PRInA BJ</td><td>M</td><td>Layouts of scanned modern magazines and scientific reports</td></tr><tr><td>Newspaper</td><td>F</td><td>Layouts of scanned US newspapers from the 20th century</td></tr><tr><td>TableBank</td><td>F</td><td>Table region on modern scientific and business document</td></tr><tr><td>HJDataset BT]</td><td>F/M</td><td>Layouts of history Japanese documents</td></tr></table>"
},
"text": "Base Model1 Large Model Notes Dataset PubLayNet [38] PRImA [3] Newspaper [17] TableBank [18] HJDataset [31] F / M M F F F / M M - - F - Layouts of modern scientific documents Layouts of scanned modern magazines and scientific reports Layouts of scanned US newspapers from the 20th century Table region on modern scientific and business document Layouts of history Japanese documents"
},
@@ -801,7 +801,7 @@
},
{
"type": "Image",
"element_id": "3b30176246b01e00c3051a7e2a11669c",
"element_id": "553c63e448f250b7466cdca0d5058f24",
"metadata": {
"data_source": {
"url": "/home/runner/work/unstructured/unstructured/test_unstructured_ingest/example-docs/layout-parser-paper.pdf",
@@ -814,7 +814,7 @@
"filetype": "application/pdf",
"page_number": 6
},
"text": "- ° . 3 a a 4 a 3 oo er ‘ 2 § 8 a 8 3 3 ‘ £ 4 A g a 9 ‘ 3 ¥ Coordinate g 4 5 3 + § 3 H Extra Features [O=\") [Bo] eaing i Text | | Type | | ower ° & a ¢ o [ coordinatel textblock1, 3 3 ’ g Q 3 , textblock2 , layoutl ] 4 q ® A list of the layout elements Ff"
"text": "0 3 B Rectangle Qvodilateral S 4 9 3 8 s > o oy ° vy 3 Coordinate § a g 3 + i} a HY Block] [Block] [Reading] 4 $ Extra features | {ON | |e ote £ o w a c o [ coordinatel textblock1 | 3 ve , ut 3 i lock2 1. 1 Ey oy textbloc! , ayoutl ] g q @ A list of the layout elements 4"
},
{
"type": "FigureCaption",
@@ -1085,7 +1085,7 @@
},
"filetype": "application/pdf",
"page_number": 8,
"text_as_html": "<table><thead><th>block.pad(top, bottom,</th><th>right,</th><th>left)</th><th>Enlarge the current block according to the input</th></thead><tr><td>block.scale(fx, fy)</td><td></td><td></td><td>Scale the current block given the ratio ion in x and y di</td></tr><tr><td>block.shift(dx, dy)</td><td></td><td></td><td>Move the current block with the shift distances in x and y direction</td></tr><tr><td>block1.is_in(block2)</td><td></td><td></td><td>Whether block] is inside of block2</td></tr><tr><td>; block1. intersect (block2)</td><td></td><td></td><td>Return the intersection region of block and block2. . . . Coordinate type to be determined based on the inputs.</td></tr><tr><td>; block1.union(block2)</td><td></td><td></td><td>Return the union region of block1 and block2. . . . Coordinate type to be determined based on the inputs.</td></tr><tr><td>block1.relative_to(block2)</td><td></td><td></td><td>Convert the absolute coordinates of block to ' ' relative coordinates to block2</td></tr><tr><td>. block1.condition_on(block2)</td><td></td><td></td><td>Calculate the absolute coordinates of block1 given . the canvas block2’s absolute coordinates</td></tr><tr><td>block. crop_image (image)</td><td></td><td></td><td>Obtain the image segments in the block region</td></tr></table>"
"text_as_html": "<table><thead><th>block.pad(top, bottom,</th><th>right,</th><th>left) |</th><th>Enlarge the current block according to the input</th></thead><tr><td>block.scale(fx, fy)</td><td></td><td></td><td>Scale the current block given the ratio . ; ; in x and y direction</td></tr><tr><td>. block.shift(dx, dy)</td><td></td><td></td><td>Move the current block with the shift distances in x and y direction</td></tr><tr><td>block1.is_in(block2)</td><td></td><td></td><td>Whether block] is inside of block2</td></tr><tr><td>block1. intersect (block2)</td><td></td><td></td><td>Return the intersection region of block1 and block2. . . . Coordinate type to be determined based on the</td></tr><tr><td>. block1.union(block2)</td><td></td><td></td><td>Return the union region of block1 and block2. . . . Coordinate type to be determined based on the</td></tr><tr><td>. block1.relative_to(block2)</td><td></td><td></td><td>Convert the absolute coordinates of block1 to . . relative coordinates to block2</td></tr><tr><td>block1.condition_on(block2)</td><td></td><td></td><td>Calculate the absolute coordinates of block1 given 7 . the canvas block2’s absolute coordinates</td></tr><tr><td>block. crop_image (image)</td><td></td><td></td><td>Obtain the image segments in the block region</td></tr></table>"
},
"text": "Operation Name Description block.pad(top, bottom, right, left) Enlarge the current block according to the input Scale the current block given the ratio in x and y direction block.scale(fx, fy) Move the current block with the shift distances in x and y direction block.shift(dx, dy) Whether block1 is inside of block2 block1.is in(block2) Return the intersection region of block1 and block2. Coordinate type to be determined based on the inputs. block1.intersect(block2) Return the union region of block1 and block2. Coordinate type to be determined based on the inputs. block1.union(block2) Convert the absolute coordinates of block1 to relative coordinates to block2 block1.relative to(block2) Calculate the absolute coordinates of block1 given the canvas block2’s absolute coordinates block1.condition on(block2) Obtain the image segments in the block region block.crop image(image)"
},
@@ -1193,7 +1193,7 @@
},
{
"type": "Image",
"element_id": "294441b6458d8a005ea7588ecb6efc10",
"element_id": "f47a514361f98885c44032539defd182",
"metadata": {
"data_source": {
"url": "/home/runner/work/unstructured/unstructured/test_unstructured_ingest/example-docs/layout-parser-paper.pdf",
@@ -1206,7 +1206,7 @@
"filetype": "application/pdf",
"page_number": 9
},
"text": "x09 Burpunog uayor Aeydsiq 1 vondo 10g Guypunog usyoy apir:z uondo Mode I: Showing Layout on the Original Image Mode Il: Drawing OCR'd Text at the Correspoding Position"
"text": "10g Bupunog vayoy fejdsiq :, uondo 10g 6upunog uayo4 apit 7 vondo Mode I: Showing Layout on the Original Image Mode II: Drawing OCR'd Text at the Correspoding Position"
},
{
"type": "NarrativeText",
@@ -1295,7 +1295,7 @@
},
{
"type": "Image",
"element_id": "6e6e9ba62b25fdfb8734842354a7ce64",
"element_id": "3b10103e6e1a9915917ddaacc2b32a87",
"metadata": {
"data_source": {
"url": "/home/runner/work/unstructured/unstructured/test_unstructured_ingest/example-docs/layout-parser-paper.pdf",
@@ -1308,7 +1308,7 @@
"filetype": "application/pdf",
"page_number": 10
},
"text": "Intra-column reading order Token Categories tie (Adress 2) tee (NE sumber Variable HEE company type Column Categories (J tite we) adaress —_ (7) section Header by ‘e * Column reading order a a (a) Illustration of the original Japanese Maximum Allowed Height BRE B>e EER eR (b) Illustration of the recreated document with dense text structure for better OCR performance"
"text": "Intra-column reading order i s & s e Number Column reading order Variable Company Type | L st id fa | f fs - i. il (a) Illustration of the original Japanese document with detected layout elements highlighted in colored boxes = Column Categories z (J tite = Ei r f| Sah i mai — = 3 YW a2 mx ia 2 Ae ion Hea g i 2 ae section Header & H 4 fe § i Ls ie & 3 (b) Illustration of the recreated document with dense text structure for better OCR performance"
},
{
"type": "NarrativeText",
@@ -1482,7 +1482,7 @@
},
{
"type": "Image",
"element_id": "e55a2d8ec9ea5f1d6c11788f33f2b97d",
"element_id": "38ec7bb20e004337f270a2753a9c1672",
"metadata": {
"data_source": {
"url": "/home/runner/work/unstructured/unstructured/test_unstructured_ingest/example-docs/layout-parser-paper.pdf",
@@ -1495,7 +1495,7 @@
"filetype": "application/pdf",
"page_number": 11
},
"text": "(spe peepee, ‘Active Learning Layout Annotate Layout Dataset | + ‘Annotation Toolkit ¥ a Deep Leaming Layout Model Training & Inference, ¥ ; Handy Data Structures & Post-processing El Apis for Layout Det a LAR ror tye eats) 4 Text Recognition | <—— Default ane Customized ¥ ee Layout Structure Visualization & Export | <—— | visualization & Storage The Japanese Document Helpful LayoutParser Digitization Pipeline Modules"
"text": "——EE : Active Learning Layout Annotate Layout Dataset | + | annotation Toolkit ¥ ¥ Layout Detection Deep Learning ‘ieee Model Training & Inferenc an Handy Data Structures 2 bis fol Layout Date 4 Text Recognition | <— [ Pe/autand Customized ¥ jee eae rs Visualization & Export | *——\"| Vicyaiization & Storage The Japanese Document Helpful LayoutParser Digitization Pipeline Modules"
},
{
"type": "NarrativeText",
@@ -1703,7 +1703,7 @@
},
{
"type": "Image",
"element_id": "7f494d0f1a8170f2ed0da01c039fcbd2",
"element_id": "022b00b724d86f4bf89acd047d4ed816",
"metadata": {
"data_source": {
"url": "/home/runner/work/unstructured/unstructured/test_unstructured_ingest/example-docs/layout-parser-paper.pdf",
@@ -1716,7 +1716,7 @@
"filetype": "application/pdf",
"page_number": 13
},
"text": "(@) Partial table at the bottom (&) Full page table (6) Partial table at the top (d) Mis-detected tet line"
"text": "(2) Partial table at the bottom (b) Full page table (c) Partial table at the top (@) Mis-detected text line"
},
{
"type": "FigureCaption",