tabula-py is a simple Python wrapper of tabula-java, which can read tables in a PDF.
You can read tables from a PDF and convert them into a pandas DataFrame. tabula-py also enables you to convert a PDF file into a CSV, a TSV or a JSON file.
You can see the example notebook and try it on Google Colab. We also highly recommend reading our documentation, especially the FAQ section.
- Java 8+
- Python 3.8+
I have confirmed that it works on macOS and Ubuntu, and some users report that it also works on Windows 10. See the documentation for detailed installation instructions for Windows 10.
- Documentation
- FAQ, which is helpful if you run into an issue
- Example notebook on Google Colaboratory
Ensure you have a Java runtime and set the PATH for it.
pip install tabula-py
If you want to leverage faster execution with jpype, install with the jpype extra.
pip install tabula-py[jpype]
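Because tabula-py depends on a Java runtime being reachable through PATH, it can help to check this before the first extraction. The following is a minimal sketch that uses only the Python standard library. The helper names `find_java` and `java_version_banner` are hypothetical and are not part of tabula-py.

```python
import shutil
import subprocess


def find_java():
    """Return the full path of the `java` executable found on PATH, or None."""
    return shutil.which("java")


def java_version_banner(java_path):
    """Return the text printed by `java -version`.

    The JVM writes its version banner to stderr, so stderr is read first.
    """
    result = subprocess.run(
        [java_path, "-version"],
        capture_output=True,
        text=True,
        check=False,
        timeout=30,
    )
    return (result.stderr or result.stdout).strip()


if __name__ == "__main__":
    java_path = find_java()
    if java_path is None:
        print("No `java` executable found on PATH. Install Java 8+ and update PATH.")
    else:
        print(f"Found Java at {java_path}")
        print(java_version_banner(java_path))
```

If the script reports that no executable was found, installing a Java runtime and adding its `bin` directory to PATH is the usual fix.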
tabula-py enables you to extract tables from a PDF into a DataFrame or JSON. It can also extract tables from a PDF and save them as a CSV, TSV, or JSON file.
import tabula
# Read a PDF into a list of DataFrames
dfs = tabula.read_pdf("test.pdf", pages='all')
# Read a remote PDF into a list of DataFrames
dfs2 = tabula.read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf")
# Convert a PDF into a CSV file
tabula.convert_into("test.pdf", "output.csv", output_format="csv", pages='all')
# Convert all PDFs in a directory
tabula.convert_into_by_batch("input_directory", output_format='csv', pages='all')
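The example above only shows CSV output, so here is a hedged sketch that covers the TSV and JSON cases as well. It assumes that `output_format` accepts `"csv"`, `"tsv"`, and `"json"` in `convert_into`, and that `read_pdf` accepts `output_format="json"`. The helpers `output_path_for`, `convert_to_all_formats`, and `read_tables_as_json` are hypothetical names introduced for illustration only. `tabula` is imported inside the functions so that the path helper can be used without a Java runtime.

```python
from pathlib import Path

# Output formats named in this README.
SUPPORTED_FORMATS = ("csv", "tsv", "json")


def output_path_for(pdf_path, output_format):
    """Hypothetical helper: derive 'test.csv' from 'test.pdf'."""
    if output_format not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported output format: {output_format!r}")
    return str(Path(pdf_path).with_suffix("." + output_format))


def convert_to_all_formats(pdf_path, pages="all"):
    """Write one CSV, one TSV, and one JSON file next to the input PDF."""
    import tabula  # deferred import: requires tabula-py and Java

    outputs = []
    for fmt in SUPPORTED_FORMATS:
        out = output_path_for(pdf_path, fmt)
        tabula.convert_into(pdf_path, out, output_format=fmt, pages=pages)
        outputs.append(out)
    return outputs


def read_tables_as_json(pdf_path, pages="all"):
    """Return the extracted tables as parsed JSON instead of DataFrames."""
    import tabula  # deferred import: requires tabula-py and Java

    return tabula.read_pdf(pdf_path, output_format="json", pages=pages)
```

With tabula-py and Java installed, `convert_to_all_formats("test.pdf")` would be expected to write `test.csv`, `test.tsv`, and `test.json` alongside the input file.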
See an example notebook for more details. I also recommend reading the tutorial article written by @aegis4048, and another tutorial written by @tdpetrou.
Interested in helping out? I'd love to have your help!
You can help by:
- Reporting a bug.
- Adding or editing documentation.
- Contributing code via a Pull Request. See also the contribution guide.
- Writing a blog post or spreading the word about tabula-py to people who might benefit from using it.
- @lahoffm
- @jakekara
- @lcd1232
- @kirkholloway
- @CurtLH
- @nikhilgk
- @krassowski
- @alexandreio
- @rmnevesLH
- @red-bin
- @Gallaecio
- @bpben
- @Bueddl
- @cjotade
- @codeboy5
- @manohar-voggu
- @deveshSingh06
- @grfeller
- @djbrown
- @swar
- @mvoggu
- @tdpetrou
You can also support our continued work on tabula-py with a donation on GitHub Sponsors or Patreon.