Introduction

This program parses PDF documents and extracts Key Performance Indicators (KPIs) from them. The software design is best described as a pipeline architecture: each step is executed sequentially and depends on the output of the previous step. This is reflected in the naming of the directories and files, which run from A_Configuration_and_Logs through G_MAIN. In G_MAIN, a Jupyter Notebook executes the pipeline steps by calling the method analyze_pdfs() (located in Rainer.F_Extract.Extract). In addition to this README.md, individual steps may be described by a separate README.md file in their respective directories.

Information and Copyright

This program was designed by Rainer Gogel ([email protected]), Frankfurt in February 2022 as part of the module "Projekt Intelligente System" at the University of Applied Sciences in Frankfurt, Germany. All rights reserved. No guarantees.

General Information:

This program relies heavily on the coordinates of text objects in a PDF document. PDFMiner (the original library, not my PDFMiner.py file) stores the coordinates of layout objects such as LTTextContainer, LTTextLine or LTChar. The origin of the coordinate system is the bottom left corner of the page, where x = 0 and y = 0. The layout objects are hierarchical and can be iterated over: an LTTextContainer contains LTTextLine objects, which in turn contain LTChar objects. Each of these objects has its own xy-coordinates.
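
To make the coordinate model concrete, here is a minimal sketch that walks a layout hierarchy and reports each object's bounding box. The `Node` class and the `walk` function are hypothetical stand-ins written for this illustration, not code from this repository. They mimic two properties of pdfminer.six layout objects: a `bbox` tuple `(x0, y0, x1, y1)` and the ability to iterate over child objects.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), origin at bottom left


@dataclass
class Node:
    """Hypothetical stand-in for a layout object such as LTTextContainer."""
    kind: str
    bbox: BBox
    children: List["Node"] = field(default_factory=list)

    def __iter__(self) -> Iterator["Node"]:
        return iter(self.children)


def walk(obj, depth: int = 0) -> Iterator[Tuple[int, str, BBox]]:
    """Yield (depth, kind, bbox) for obj and all nested layout objects."""
    yield depth, getattr(obj, "kind", type(obj).__name__), obj.bbox
    try:
        children = iter(obj)
    except TypeError:
        return  # leaf objects (e.g. single characters) may not be iterable
    for child in children:
        if hasattr(child, "bbox"):  # skip children without coordinates
            yield from walk(child, depth + 1)


# A tiny hand-built hierarchy: container -> line -> character
char = Node("LTChar", (72.0, 700.0, 78.0, 712.0))
line = Node("LTTextLine", (72.0, 700.0, 140.0, 712.0), [char])
box = Node("LTTextContainer", (72.0, 680.0, 300.0, 712.0), [line])

rows = list(walk(box))
for depth, kind, bbox in rows:
    print("  " * depth + kind, bbox)
```

With pdfminer.six installed, the same `walk` function could presumably be applied to each page returned by `pdfminer.high_level.extract_pages(path)`, since those page objects also carry a `bbox` and are iterable.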

My program takes a threefold approach: find neighbour values, find table values, and extract values from matching sentences.

I. Find neighbours and extract values

The xy-coordinates make it possible to search for layout objects that are positional neighbours of a keyword.
These neighbours might contain the value belonging to the keyword if, for instance, the keyword and its value
appear next to each other in a graphic.
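
The neighbour search can be sketched as follows. This is an illustration under assumptions, not the repository's implementation: `TextBox`, `right_neighbour` and the `max_gap` threshold are hypothetical names, and only the neighbour to the right on the same row is considered.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TextBox:
    """Hypothetical text object with its bounding box (origin at bottom left)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def overlaps_vertically(a: TextBox, b: TextBox) -> bool:
    """True if the two boxes share part of the same y-range (same row)."""
    return a.y0 < b.y1 and b.y0 < a.y1


def right_neighbour(keyword: TextBox, boxes: List[TextBox],
                    max_gap: float = 100.0) -> Optional[TextBox]:
    """Return the closest box to the right of the keyword on the same row."""
    candidates = [
        b for b in boxes
        if b is not keyword
        and overlaps_vertically(keyword, b)
        and 0 <= b.x0 - keyword.x1 <= max_gap
    ]
    return min(candidates, key=lambda b: b.x0 - keyword.x1, default=None)


keyword = TextBox("Revenue", 50, 400, 110, 412)
boxes = [
    keyword,
    TextBox("1,234", 130, 401, 170, 411),  # same row, 20 units to the right
    TextBox("5,678", 260, 401, 300, 411),  # same row, but beyond max_gap
    TextBox("Costs", 50, 380, 100, 392),   # row below the keyword
]
hit = right_neighbour(keyword, boxes)
print(hit.text if hit else None)
```

A fuller version would also look above, below and to the left of the keyword; the distance threshold is a tuning choice.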

II. Find table values

PDF documents often contain tables that visual detection programs may fail to recognize because they lack
the separation lines typically found in tables. My approach is to first get the y-coordinates of the
keyword and the x-coordinates of the reporting year, assuming the latter is the table column header, and
then to search for values located at these xy-coordinates.
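
As a rough sketch of this idea, the lookup below treats the keyword's y-range as the table row and the year header's x-range as the table column, and returns the text object lying in both. `TextBox`, `ranges_overlap` and `table_value` are hypothetical names chosen for this example; the actual matching logic in the repository may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TextBox:
    """Hypothetical text object with its bounding box (origin at bottom left)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def ranges_overlap(a0: float, a1: float, b0: float, b1: float) -> bool:
    """True if the intervals [a0, a1] and [b0, b1] share more than a point."""
    return a0 < b1 and b0 < a1


def table_value(keyword: TextBox, year_header: TextBox,
                boxes: List[TextBox]) -> Optional[TextBox]:
    """Return the box in the keyword's row and the year header's column."""
    for b in boxes:
        if b in (keyword, year_header):
            continue
        same_row = ranges_overlap(b.y0, b.y1, keyword.y0, keyword.y1)
        same_col = ranges_overlap(b.x0, b.x1, year_header.x0, year_header.x1)
        if same_row and same_col:
            return b
    return None


keyword = TextBox("Revenue", 50, 400, 110, 412)
year_2021 = TextBox("2021", 200, 500, 230, 512)
year_2020 = TextBox("2020", 300, 500, 330, 512)
boxes = [
    keyword, year_2021, year_2020,
    TextBox("1,234", 195, 400, 232, 412),  # row "Revenue", column "2021"
    TextBox("1,100", 295, 400, 332, 412),  # row "Revenue", column "2020"
]
cell = table_value(keyword, year_2021, boxes)
print(cell.text if cell else None)
```

Because only coordinates are compared, this works for tables without ruling lines, at the cost of being sensitive to misaligned columns.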

III. Extract values from sentences that match certain search criteria

Matching sentences are sentences that contain the keyword and at least one other term from "search_word_list"
in the "config.ini" file. These matching sentences are then searched for values that meet certain criteria.

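The sentence-based approach might look roughly like the sketch below. The function names, the naive sentence splitter and the number pattern are assumptions made for illustration; the criteria the program actually applies to candidate values are not reproduced here.

```python
import re
from typing import List

# Matches integers and decimals such as "2021", "12.4" or "1,234.5"
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def matching_sentences(text: str, keyword: str,
                       search_words: List[str]) -> List[str]:
    """Return sentences containing the keyword and at least one search word."""
    hits = []
    # Naive split: whitespace that follows sentence-ending punctuation
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        lower = sentence.lower()
        if keyword.lower() in lower and any(w.lower() in lower for w in search_words):
            hits.append(sentence)
    return hits


def extract_numbers(sentence: str) -> List[str]:
    """Return all number-like tokens found in the sentence."""
    return NUMBER.findall(sentence)


text = (
    "Revenue increased to 12.4 million euros in 2021. "
    "The board met twice. "
    "Revenue is discussed in the next chapter."
)
sentences = matching_sentences(text, "revenue", ["million", "euros"])
numbers = [extract_numbers(s) for s in sentences]
print(sentences)
print(numbers)
```

Here only the first sentence matches, and both "12.4" and "2021" are returned as candidates; filtering out the reporting year would be one example of a further criterion.
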
The program then aggregates the results of the three approaches and weights them according to the parameters set under [F_Extract] in the "config.ini" file. Please see the README.md under "A_Configuration_and_Logs".
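
One simple way to combine weighted results is a weighted vote, sketched below. The weight names and values are invented for this example and do not correspond to the actual parameters under [F_Extract]; the aggregation used by the program may work differently.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def aggregate(candidates: Dict[str, List[str]],
              weights: Dict[str, float]) -> Optional[Tuple[str, float]]:
    """Sum the weight of every approach that proposed a value; return the best."""
    scores: Dict[str, float] = defaultdict(float)
    for approach, values in candidates.items():
        for value in values:
            scores[value] += weights.get(approach, 0.0)
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical weights, one per approach
weights = {"neighbour": 1.0, "table": 2.0, "sentence": 1.5}
# Hypothetical candidate values found by each approach for one keyword
candidates = {
    "neighbour": ["1,234"],
    "table": ["1,234"],
    "sentence": ["980"],
}
result = aggregate(candidates, weights)
print(result)
```

In this sketch a value found by several approaches accumulates their weights, so agreement between approaches raises its score.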

How to run the program

Step 1:

IMPORTANT: First, set the ABSOLUTE path of the "config.ini" file as the "config_ini_path" parameter:

directory "A_Configuration_and_Logs"  -> "conf_and_log.py" file -> "config_ini_path" parameter

Second, set the "base_path" parameter in the "config.ini" file to the absolute path of the folder "Rainer":

directory "A_Configuration_and_Logs"  -> "config.ini" file -> "base_path" parameter

Then set the remaining settings parameters. Please read the README.md in "A_Configuration_and_Logs".
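
For orientation, the sketch below shows how an absolute "base_path" could be read and validated with Python's standard `configparser`. The section name `[PATHS]`, the sample path and the `load_base_path` helper are assumptions for this example; the actual layout of "config.ini" is described in the README.md in "A_Configuration_and_Logs".

```python
import configparser
from pathlib import Path

# Hypothetical excerpt of a config.ini file (POSIX-style sample path)
SAMPLE_INI = """
[PATHS]
base_path = /home/user/projects/Rainer
"""


def load_base_path(ini_text: str) -> Path:
    """Parse the ini text and return base_path, rejecting relative paths."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    base_path = Path(parser["PATHS"]["base_path"])
    if not base_path.is_absolute():
        raise ValueError(f"base_path must be absolute, got {base_path}")
    return base_path


base_path = load_base_path(SAMPLE_INI)
print(base_path)
```

To read from a file instead of a string, `parser.read(config_ini_path)` can be used in place of `read_string`.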

Step 2 (Optional):

Some PDF documents do not contain positional metadata (i.e. coordinates) and therefore need to be transformed first.
Please see the README.md in "C_File_Conversion". To perform this transformation, run the "transform_pdf()"
function in "file_conversion.py". The source directory is "Annual_Reports" and the target directory is
"Annual_Reports_Converted"; these defaults can be changed in the "config.ini" file under [C_File_Conversion].

Step 3:

Put all the PDF docs that you want to analyze into: "B_Reports.Reports_For_Analysis"

Step 4:

Run the program in G_MAIN. This directory contains a Jupyter notebook that calls the "analyze_pdfs()" method from
"F_Extract.Extract.py". The result can be displayed as a pandas DataFrame, whose "to_csv" or "to_excel" method
saves the result in the directory and under the name passed to it as a parameter. The call can also be placed
in a Python file and run from the command line.
