Account Invoice Import Simple PDF

Beta License: AGPL-3 OCA/edi Translate me on Weblate Try me on Runbot

This module is an extension of the module account_invoice_import: it adds support for simple PDF invoices, i.e. PDF invoices that don't have an embedded XML file. This module has been developed to overcome the drawbacks of the OCA module account_invoice_import_invoice2data; its advantages are the following:

  • Possibility to add support for a new vendor without developer skills: the accountant can do it!
  • Adding support for a new vendor is faster.
  • More tolerance on vendor invoice layout changes.
  • Easier to install.

With this module, you can import all the invoices that you were able to import with the module account_invoice_import_invoice2data. In fact, this module uses the same design when importing a PDF vendor bill:

  1. raw text extraction of the PDF file,
  2. identify the partner using the VAT number (if the VAT number is present in the raw text extraction) or some keywords,
  3. use regular expressions (regex) to extract the data needed to create the vendor bill in Odoo (single line configuration).
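The last two steps can be illustrated with a minimal sketch. It is not the module's actual code: the vendor records, the sample text and the regular expressions below are hypothetical, and in the module the regexes are generated from the configuration made in Odoo rather than written by hand.

```python
import re

# Hypothetical vendor configuration, for illustration only.
VENDORS = [
    {"name": "Example Telecom", "vat": "FR12345678901", "keywords": []},
    {"name": "Example Hosting", "vat": None, "keywords": ["example hosting", "datacenter"]},
]


def identify_vendor(raw_text, vendors):
    """Return the first vendor matched by VAT number, else by keywords."""
    # Strip whitespace so that 'FR 12 345678901' still matches the VAT number.
    compact = re.sub(r"\s+", "", raw_text).upper()
    for vendor in vendors:
        if vendor["vat"] and vendor["vat"].upper() in compact:
            return vendor
    lowered = raw_text.lower()
    for vendor in vendors:
        keywords = vendor["keywords"]
        if keywords and all(word.lower() in lowered for word in keywords):
            return vendor
    return None


def extract_fields(raw_text, patterns):
    """Apply one regex per field and keep the first capture group of each match."""
    values = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, raw_text)
        if match:
            values[field] = match.group(1)
    return values


# Stand-in for the raw text produced by step 1.
SAMPLE_TEXT = (
    "Example Telecom\n"
    "VAT: FR 12 345678901\n"
    "Invoice number: INV-2022-0042\n"
    "Total: 120.00 EUR\n"
)

# Hand-written patterns standing in for the auto-generated ones.
SAMPLE_PATTERNS = {
    "invoice_number": r"Invoice number:\s*(\S+)",
    "amount_total": r"Total:\s*([\d.]+)",
}
```

Calling identify_vendor(SAMPLE_TEXT, VENDORS) selects the first vendor through its VAT number, and extract_fields(SAMPLE_TEXT, SAMPLE_PATTERNS) returns the invoice number and total as strings, which would still need to be converted to proper types.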

The main difference with the OCA module account_invoice_import_invoice2data is that the regular expressions are auto-generated from the configuration made by the user in Odoo. No need to be a regex expert! But you can still write regex to extract some fields for some very specific needs.

The module can extract the following fields:

  • Total Amount with taxes
  • Total Untaxed Amount
  • Total Tax Amount
  • Invoice Date
  • Due Date
  • Start Date
  • End Date
  • Invoice Number
  • Description (for that field, you have to write a regex)

In this list, only 3 fields are required:

  • Invoice Date
  • 2 out of the 3 Amount fields (the 3rd can be deduced from the 2 others: Total Amount = Total Untaxed + Total Tax)
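The relation between the three amounts can be sketched as follows. The function name complete_amounts and its arguments are hypothetical and only illustrate the arithmetic; they are not taken from the module.

```python
from decimal import Decimal


def complete_amounts(total=None, untaxed=None, tax=None):
    """Fill in the missing amount using: total = untaxed + tax."""
    known = sum(value is not None for value in (total, untaxed, tax))
    if known < 2:
        raise ValueError("At least 2 of the 3 amounts are required")
    if total is None:
        total = untaxed + tax
    elif untaxed is None:
        untaxed = total - tax
    elif tax is None:
        tax = total - untaxed
    return {"total": total, "untaxed": untaxed, "tax": tax}
```

For example, complete_amounts(total=Decimal("120.00"), untaxed=Decimal("100.00")) yields a tax amount of Decimal("20.00"). Decimal is used here to avoid floating-point rounding on monetary values.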

To take advantage of the fields Start Date and End Date, you need the OCA module account_invoice_start_end_dates from the account-closing project.

To know the full story behind the development of this module, read Akretion's blog post.


The most important technical component of this module is the tool that converts the PDF to text. Converting PDF to text is not an easy job. As outlined in this blog post, different tools can give quite different results. The best results are usually achieved with tools based on a PDF viewer, which excludes pure-Python tools. But pure-Python tools are easier to install than tools based on a PDF viewer. It is important to understand that, if you change the PDF-to-text tool, you will almost certainly get a slightly different text output, which may oblige you to update the field extraction rules. This can be time-consuming if you have already configured many vendors.

The module supports 4 different extraction methods:

  1. PyMuPDF which is a Python binding for MuPDF, a lightweight PDF toolkit/viewer/renderer published under the AGPL licence by the company Artifex Software.
  2. pdftotext python library, which is a python binding for the pdftotext tool.
  3. pdftotext command line tool, which is based on poppler, a PDF rendering library used by xpdf and Evince (the PDF reader of Gnome).
  4. pdfplumber, which is a Python library built on top of the Python library pdfminer.six. pdfplumber is a pure-Python solution, so it's very easy to install on all OSes.

PyMuPDF and pdftotext both give a very good text output. So far, I can't say which one is best. pdfplumber often gives lower-quality text output, but its advantage is that it's a pure-Python solution, so you will always be able to install it whatever your technical environment is.

You can choose one extraction method and only install the tools/libs for that method.

To install PyMuPDF, if you use Debian (Bullseye aka v11 or higher) or Ubuntu (20.04 or higher), run the following command:

sudo apt install python3-fitz

You can also install it via pip:

sudo pip3 install --upgrade PyMuPDF

but beware that PyMuPDF is just a binding on MuPDF, so it will require MuPDF and all the development libs required to compile the binding. That's why PyMuPDF is much easier to install via the packages of your Linux distribution (package name python3-fitz on Debian/Ubuntu, but the package name may be different in other distributions) than with pip.

To install pdftotext python lib, run:

sudo apt install build-essential libpoppler-cpp-dev pkg-config python3-dev

and then install the lib via pip:

sudo pip3 install --upgrade pdftotext

On OSes other than Debian/Ubuntu, follow the instructions on the project page.

To install pdftotext command line, run:

sudo apt install poppler-utils

To install the pdfplumber python lib, run:

sudo pip3 install --upgrade pdfplumber

This module also requires the following Python libraries:

  • regex which is backward-compatible with the re module of the Python standard library, but has additional functionalities.
  • dateparser which is a powerful date parsing library.

The dateparser lib itself depends on regex, so you can install both Python libraries via pip with the following command:

sudo pip3 install --upgrade dateparser

The dateparser lib is not compatible with all regex lib versions. As of September 2022, the version requirement declared by dateparser for regex is !=2019.02.19, !=2021.8.27, <2022.3.15. So the latest version of regex which is compatible with dateparser is 2022.3.2. To know the version of regex installed in your environment, run:

sudo pip3 show regex

To force regex to version 2022.3.2, run:

sudo pip3 install regex==2022.3.2
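The version constraint quoted above can also be checked from Python. This is a minimal sketch that assumes purely numeric dotted version strings and hard-codes the requirement as stated for September 2022; the requirement may differ in other dateparser releases, and the function names are hypothetical.

```python
def parse_version(version):
    """Turn '2022.3.2' into (2022, 3, 2); only numeric dotted versions are handled."""
    return tuple(int(part) for part in version.split("."))


# Requirement quoted above: !=2019.02.19, !=2021.8.27, <2022.3.15
EXCLUDED = {parse_version("2019.02.19"), parse_version("2021.8.27")}
UPPER_BOUND = parse_version("2022.3.15")


def regex_version_ok(version):
    """Return True if the given regex version satisfies the quoted requirement."""
    parsed = parse_version(version)
    return parsed not in EXCLUDED and parsed < UPPER_BOUND
```

To check the installed version, the string returned by importlib.metadata.version("regex") (Python 3.8 or higher) can be passed to regex_version_ok.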

By default, for the PDF to text conversion, the module tries the different methods in the order mentioned in the INSTALL section. It first tries PyMuPDF; if that fails (for example because the lib is not properly installed), it tries the pdftotext python lib; if that one also fails, it tries the pdftotext command line; and if that fails too, it finally tries pdfplumber. If none of the 4 methods work, Odoo will display an error message.
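This fallback behaviour can be sketched as a loop over candidate extractors. The sketch below is an illustration of the pattern, not the module's implementation: pdf_to_text and the extractor callables passed to it are hypothetical names.

```python
def pdf_to_text(pdf_bytes, extractors):
    """Try each (name, callable) pair in order and return the first success.

    Each callable takes the PDF content as bytes and returns the raw text.
    Any exception (missing library, unreadable file...) moves on to the next one.
    """
    errors = {}
    for name, extract in extractors:
        try:
            text = extract(pdf_bytes)
        except Exception as exc:
            errors[name] = exc
            continue
        return name, text
    details = ", ".join(f"{name}: {exc}" for name, exc in errors.items())
    raise RuntimeError(f"No PDF to text method worked ({details})")
```

The extractors list would hold one entry per method, in the order pymupdf, pdftotext.lib, pdftotext.cmd, pdfplumber. Passing a single-entry list reproduces the forced-method behaviour, where a failure leads directly to an error.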

If you want to force Odoo to use a specific text extraction method, go to the menu Configuration > Technical > Parameters > System Parameters and create a new System Parameter:

  • Key: invoice_import_simple_pdf.pdf2txt
  • Value: select the proper value for the method you want to use:
    1. pymupdf
    2. pdftotext.lib
    3. pdftotext.cmd
    4. pdfplumber

In this configuration, Odoo will only use the selected text extraction method and, if it fails, it will display an error message.
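The same parameter can also be set programmatically, for instance from an Odoo shell session. The helper below is a hypothetical sketch: it assumes env is an Odoo environment and relies on the standard set_param method of the ir.config_parameter model; the key and the allowed values are the ones listed above.

```python
PARAM_KEY = "invoice_import_simple_pdf.pdf2txt"
ALLOWED_METHODS = ("pymupdf", "pdftotext.lib", "pdftotext.cmd", "pdfplumber")


def force_pdf2txt_method(env, method):
    """Store the forced text extraction method as a system parameter.

    'env' is assumed to be an Odoo environment (e.g. the one available
    in 'odoo shell'); this helper is not part of the module.
    """
    if method not in ALLOWED_METHODS:
        raise ValueError(f"Unknown method {method!r}, expected one of {ALLOWED_METHODS}")
    env["ir.config_parameter"].sudo().set_param(PARAM_KEY, method)
```

In an Odoo shell, force_pdf2txt_method(env, "pdfplumber") followed by env.cr.commit() would persist the setting.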

You will find a full demonstration about how to configure each Vendor and import the PDF invoices in this screencast.

Bugs are tracked on GitHub Issues. In case of trouble, please check there if your issue has already been reported. If you spotted it first, help us smash it by providing detailed and welcome feedback.

Do not contact contributors directly about support or help with technical issues.

  • Akretion

This module is maintained by the OCA.

Odoo Community Association

OCA, or the Odoo Community Association, is a nonprofit organization whose mission is to support the collaborative development of Odoo features and promote its widespread use.

Current maintainer:

alexis-via

This module is part of the OCA/edi project on GitHub.

You are welcome to contribute. To learn how, please visit https://odoo-community.org/page/Contribute.