Package for demonstrating a novel way to approach table extraction @ Doctrine

DoctrineLegal/poc-table-pdf-extract

POC Table PDF extraction

Colibrie is a blazing fast tool to extract tables from PDFs

Why Colibrie?

  • Efficient: Colibrie is orders of magnitude faster than existing solutions (roughly 100 to 500 times faster than Camelot in the benchmark below).
  • Visual fidelity: Colibrie can produce a 1:1 HTML representation of every table it finds.
  • Reliable: Colibrie will find every valid table, provided the PDF is compatible with Colibrie's core principle.
  • Output: each table can be exported to multiple formats, including:
    • Pandas DataFrame
    • HTML

Benchmark

Some numbers comparing Camelot (a popular library for extracting tables from PDFs) and Colibrie:

| PDF file | Pages | Camelot time (s) | Colibrie time (s) | Camelot valid tables | Camelot false positives | Colibrie valid tables | Colibrie false positives |
|---|---|---|---|---|---|---|---|
| small pdf | 1 | 0.53 | 0.00545 | 1 | 0 | 1 | 0 |
| medium pdf | 11 | 5.95 | 0.02100 | 4 | 0 | 4 | 0 |
| big pdf | 167 | 105.00 | 0.21900 | 62 | 1 | 61 | 0 |
| giant pdf | 269 | 182.00 | 0.69000 | 175 | 1 | 177 | 0 |
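
To put the timings in perspective, the speed-up per file can be computed directly from the benchmark figures. The following is only a small sketch: the dictionary restates the timings listed above, and the variable names are illustrative.

```python
# Timings in seconds, copied from the benchmark: (camelot, colibrie).
timings = {
    "small pdf": (0.53, 0.00545),
    "medium pdf": (5.95, 0.02100),
    "big pdf": (105.00, 0.21900),
    "giant pdf": (182.00, 0.69000),
}

# Speed-up = Camelot time divided by Colibrie time.
speedups = {
    name: camelot / colibrie
    for name, (camelot, colibrie) in timings.items()
}

for name, ratio in speedups.items():
    print(f"{name}: about {ratio:.0f}x faster")
```

On these four files the ratio ranges from about 97x (small pdf) to about 479x (big pdf).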

Current limitations

  • Colibrie only works with text-based PDFs, not scanned documents. (As Tabula explains, "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)
  • For the moment, Colibrie doesn't work on PDFs whose tables have no structural lines (like this one or this one), but it can handle a few missing lines (like this one or this one).
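
The text-based requirement can be checked before running an extraction. The sketch below is not part of Colibrie: `has_text_layer` and `pdf_is_text_based` are hypothetical helpers, the 20-character threshold is an arbitrary assumption, and the second function assumes PyMuPDF (`fitz`) is installed.

```python
def has_text_layer(page_texts, min_chars=20):
    """Return True if the page texts hold at least min_chars non-whitespace characters."""
    total = sum(len("".join(text.split())) for text in page_texts)
    return total >= min_chars


def pdf_is_text_based(path, min_chars=20):
    """Open a PDF with PyMuPDF and check whether it carries selectable text."""
    # Assumption: PyMuPDF is available. It is imported lazily so that
    # has_text_layer stays usable without it.
    import fitz

    with fitz.open(path) as doc:
        return has_text_layer((page.get_text() for page in doc), min_chars)
```

A scanned document typically yields empty strings for every page, so `pdf_is_text_based` would return False for it. This check says nothing about whether the tables have structural lines.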

Installation

From source

pip install poetry

git clone https://github.com/DoctrineLegal/poc-table-pdf-extract

cd poc-table-pdf-extract

poetry install

Using pip

pip install colibrie
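
After installing, one way to confirm the package is visible to the current interpreter is to query its metadata. This is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper, and the distribution name is assumed to be `colibrie`, as in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Assumption: the distribution is published under the name "colibrie".
print(installed_version("colibrie") or "colibrie is not installed")
```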

Usage

PDF used in the example: example.pdf

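from colibrie.extract_tables import extract_table

# Extract every table found in the PDF
tables = extract_table('example.pdf')

for table in tables:
    # HTML representation of the table
    print(table.to_html())
    # The same table as a pandas DataFrame
    df = table.to_df()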

Output (one table, shown here as its two column groups):

Classifi cation des associations agréées de surveillance de la qualité de l’air

| Catégorie | Échelon | Coeffi cient | Salaire minimal hiérarchique |
|---|---|---|---|
| 7 | 1 | 255 | 1 307,13 € |
| | 2 | 268 | 1 373,77 € |
| | 3 | 282 | 1 445,53 € |
| | 4 | 296 | 1 517,30 € |
| | 5 | 311 | 1 594,19 € |
| | 6 | 327 | 1 676,20 € |
| | 7 | 344 | 1 763,34 € |
| | 8 | 362 | 1 855,61 € |
| | 9 | 381 | 1 953,01 € |
| | 10 | 401 | 2 055,53 € |
| | 11 | 422 | 2 163,17 € |
| | 12 | 444 | 2 275,94 € |
| 6 | 1 | 310 | 1 589,06 € |
| | 2 | 326 | 1 671,08 € |
| | 3 | 344 | 1 763,34 € |
| | 4 | 363 | 1 860,74 € |
| | 5 | 384 | 1 968,38 € |
| | 6 | 406 | 2 081,16 € |
| | 7 | 430 | 2 204,18 € |
| | 8 | 457 | 2 342,58 € |
| | 9 | 485 | 2 486,11 € |
| | 10 | 515 | 2 639,89 € |
| | 11 | 549 | 2 814,17 € |
| | 12 | 585 | 2 998,71 € |

Classifi cation des bureaux d’études techniques, des cabinets d’ingénieurs-conseils et des sociétés de conseils

| | Position | Coeffi cient | Salaire minimal hiérarchique |
|---|---|---|---|
| ETAM | 1.1. | 230 | 1 558,80 € |
| | 1.2. | 240 | 1 587,50 € |
| | 1.3. | 250 | 1 618,50 € |
| | 2.1. | 275 | 1 683,75 € |
| | 2.2. | 310 | 1 786,70 € |
| | 2.3. | 355 | 1 922,60 € |
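
Since each table exposes `to_html()` and `to_df()`, saving the results to disk takes only a few lines. The sketch below is illustrative rather than part of Colibrie: `export_tables` and the `table_<n>` file names are hypothetical, and it assumes `to_html()` returns a string and `to_df()` returns a pandas DataFrame.

```python
from pathlib import Path


def export_tables(tables, out_dir):
    """Write each table as table_<n>.html and table_<n>.csv; return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index, table in enumerate(tables, start=1):
        # Assumption: to_html() returns the table markup as a string.
        html_path = out / f"table_{index}.html"
        html_path.write_text(table.to_html(), encoding="utf-8")
        # Assumption: to_df() returns a pandas DataFrame.
        csv_path = out / f"table_{index}.csv"
        table.to_df().to_csv(csv_path, index=False)
        written += [html_path, csv_path]
    return written
```

With Colibrie installed, a call would look like `export_tables(extract_table('example.pdf'), 'tables_out')`.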
