Colibrie is a blazing-fast tool for extracting tables from PDFs.
- Efficient: Colibrie is multiple orders of magnitude faster than any existing solution.
- Visual fidelity: Colibrie can provide a 1:1 HTML representation of every table it finds.
- Reliable: Colibrie finds every valid table, without exception, provided the PDF is compatible with Colibrie's core principle.
- Output: each table can be exported to multiple formats, including:
  - pandas DataFrame
  - HTML
Some numbers comparing Camelot (a popular library for extracting tables from PDFs) and Colibrie:
Camelot time (s) | Colibrie time (s) | Camelot valid tables | Camelot false positives | Colibrie valid tables | Colibrie false positives | Page count | PDF file |
---|---|---|---|---|---|---|---|
0.53 | 0.00545 | 1 | 0 | 1 | 0 | 1 | small pdf |
5.95 | 0.02100 | 4 | 0 | 4 | 0 | 11 | medium pdf |
105.00 | 0.21900 | 62 | 1 | 61 | 0 | 167 | big pdf |
182.00 | 0.69000 | 175 | 1 | 177 | 0 | 269 | giant pdf |
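To make the comparison easier to read, the timings in the table above can be turned into speedup factors. The snippet below is a minimal sketch: the `timings` dictionary and the `speedup` helper are illustrative names, not part of Colibrie, and the numbers are copied from the table.

```python
# Timings in seconds, copied from the comparison table: (camelot, colibrie).
timings = {
    "small pdf": (0.53, 0.00545),
    "medium pdf": (5.95, 0.02100),
    "big pdf": (105.00, 0.21900),
    "giant pdf": (182.00, 0.69000),
}


def speedup(camelot_seconds, colibrie_seconds):
    """Return how many times faster Colibrie was than Camelot."""
    return camelot_seconds / colibrie_seconds


# Speedup factor for each benchmark file.
speedups = {name: speedup(*pair) for name, pair in timings.items()}

for name, factor in speedups.items():
    print(f"{name}: about {factor:.0f}x faster")
```

On these four files the factor ranges from roughly 100x to roughly 480x, that is, about two orders of magnitude.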
- Colibrie only works with text-based PDFs and not scanned documents. (As Tabula explains, "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)
- For the moment, Colibrie doesn't work on PDFs whose tables have no structural lines (like this one or this one), but it can handle a few missing lines (like this one or this one).
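The "click and drag to select text" test can be approximated in code before handing a file to Colibrie. The sketch below is only a heuristic and makes several assumptions: it uses the third-party `pypdf` package (not a Colibrie dependency as far as this document states), and the helper names `looks_text_based` and `pdf_page_texts` as well as the `min_chars` threshold are hypothetical.

```python
def looks_text_based(page_texts, min_chars=20):
    """Heuristic: True if at least one page yields some selectable text.

    min_chars is an arbitrary threshold chosen for illustration.
    """
    return any(len(text.strip()) >= min_chars for text in page_texts)


def pdf_page_texts(path):
    """Return the extracted text of every page, using pypdf (assumed installed)."""
    from pypdf import PdfReader  # third-party package, not part of Colibrie

    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]


# Usage (requires pypdf and a real file):
# if looks_text_based(pdf_page_texts("example.pdf")):
#     ...  # the PDF is probably text-based
```

A scanned document typically yields empty or near-empty text on every page, so it fails this check. A text-based PDF can still fall under the second limitation if its tables have no structural lines.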
pip install poetry
git clone https://github.com/DoctrineLegal/poc-table-pdf-extract
cd poc-table-pdf-extract
poetry install
pip install colibrie
PDF used in the example: example.pdf
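from colibrie.extract_tables import extract_table

# Extract every table found in the PDF
tables = extract_table('example.pdf')
for table in tables:
    # 1:1 HTML representation of the table
    print(table.to_html())
    # The same table as a pandas DataFrame
    df = table.to_df()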
Classifi cation des associations agréées de surveillance de la qualité de l’air | | | | Classifi cation des bureaux d’études techniques, des cabinets d’ingénieurs-conseils et des sociétés de conseils | | | |
---|---|---|---|---|---|---|---|
Catégorie | Échelon | Coeffi cient | Salaire minimal hiérarchique | Position | | Coeffi cient | Salaire minimal hiérarchique |
7 | 1 2 3 4 5 6 7 8 9 10 11 12 | 255 268 282 296 311 327 344 362 381 401 422 444 | 1 307,13 € 1 373,77 € 1 445,53 € 1 517,30 € 1 594,19 € 1 676,20 € 1 763,34 € 1 855,61 € 1 953,01 € 2 055,53 € 2 163,17 € 2 275,94 € | ETAM | 1.1. | 230 | 1 558,80 € |
| | | | | | 1.2. | 240 | 1 587,50 € |
| | | | | | 1.3. | 250 | 1 618,50 € |
6 | 1 2 3 4 5 6 7 8 9 10 11 12 | 310 326 344 363 384 406 430 457 485 515 549 585 | 1 589,06 € 1 671,08 € 1 763,34 € 1 860,74 € 1 968,38 € 2 081,16 € 2 204,18 € 2 342,58 € 2 486,11 € 2 639,89 € 2 814,17 € 2 998,71 € | | 2.1. | 275 | 1 683,75 € |
| | | | | | 2.2. | 310 | 1 786,70 € |
| | | | | | 2.3. | 355 | 1 922,60 € |
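Building on the usage example, the extracted tables can also be written to disk. The sketch below relies only on the `to_html()` and `to_df()` methods shown above and assumes that `to_html()` returns a string and that `to_df()` returns a regular pandas DataFrame. The `export_tables` helper, the file naming scheme, and the `tables_out` directory are hypothetical.

```python
from pathlib import Path


def export_tables(tables, out_dir):
    """Write each table to table_<n>.html and table_<n>.csv; return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index, table in enumerate(tables, start=1):
        # HTML representation (assumed to be returned as a string)
        html_path = out / f"table_{index}.html"
        html_path.write_text(table.to_html(), encoding="utf-8")
        # CSV export through the pandas DataFrame
        csv_path = out / f"table_{index}.csv"
        table.to_df().to_csv(csv_path, index=False)
        written.extend([html_path, csv_path])
    return written


def main():
    # Imported here so the helper above can be reused without Colibrie installed.
    from colibrie.extract_tables import extract_table

    for path in export_tables(extract_table("example.pdf"), "tables_out"):
        print(path)
```

Calling `main()` next to `example.pdf` would create two files per extracted table. Note that a table with merged cells, like the one above, may not survive a CSV export as faithfully as it does in HTML.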