In this repository we will provide all our code and data. The result is available here: www.pharmagelder.ch
If you are looking for the data, see next chapter.
We provide the data in many different formats.
- searchable database: www.pharmagelder.ch
- Different statistics on companies and addresses (tops, etc.)
- CSV-Sheets for each canton and newest (2019) data
- All matched addresses for 2019
- The raw export 2019 without matching: The addresses and the accumulated data
- database dump of the current version (2020.08.31): pharmagelder.sql
- data/README.md for scripting / exporting
- data/_docs/PDF_SCREENSHOTS.md to find the best template
Checkout Git Checkout this Github Repo at a folder of your choice
git init
git remote add origin [email protected]:blickvisual/pharmagelder.git
Install Python Dependencies
#If you use virtual env (recommended)
source env/bin/activate
#Install dependencies
pip install -r requirements.txt
Install Abbeyy Finereader For OCR-Recognition, we used Abbeyy Finereader. It's not for free, but worth it. https://www.abbyy.com/de-de/finereader/
Install libPostal
Have a look at the GitHub Repository. On Mac, do the following:
brew install curl autoconf automake libtool pkg-config
git clone https://github.com/openvenues/libpostal
cd libpostal
./bootstrap.sh
./configure --datadir=[...some dir with a few GB of space...]
make -j4
sudo make install
# On Linux it's probably a good idea to run
sudo ldconfig
If pip install postal
throws an error, install pip install nose
first.
In the Folder data/1. pdfexport/files you will find a subfolder for each pharmaceutical company. Put the PDFs (if available) there. Each folder needs two files:
-
- Lists.ipynb
-
- Accumulations.ipynb
0. Lists.ipynb
extracts all doctors and organisations. 1. Accumulations.ipynb
extracts the accumulations - all the anonym doctors and organisations. Recommendation: If 1. Accumulations.ipynb
doesnt work anymore: Do it by hand.
The exported files will be stored in the folder data/1. pdfexport/export/. Do never change something directly in this csv-files. Use the python-notebooks or do it later on google spreadsheets.
As mentioned before, each pharmaceutical company gets two python notebooks. Update them (or copy them) to export another company. There are plenty of helper functions for extracting the data. Everything documented in data/README.md. The code ist stored in data/lib
Read the Code Documentation data/README.md!
There are different kind of exports:
- Readble PDFs -> Use tabulaPy
- Scanned PDFs -> Use Abbeyy to generate CSV and then Pandas
- Excel -> Use Pandas
- Website -> Copy the source code, then use Beautiful Soup
You will find many examples in the data/1. pdfexport/files folder.
After everything is extraced, create a google spreadsheet for each file. Be aware: During import, set number recognition to False. We made for lists
and accummulations
a folder each.
Then you need a table where you store the IDs of each file. Then you are able to sync google spreadsheet with Git.
See the detailed documentation in tasks/2. spreadsheet_to_github/0. Download Spreadsheets.ipynb There are two files in the folder tasks/2. spreadsheet_to_github/
- 0. Download Spreadsheets.ipynb use this to trigger the sync manually per jupyter notebook
- 0. Download Spreadsheets.py use this if you like to automate the sync. You need to edit it!
Now everyone can edit the files, make fixes while your git repository is always up to date. The files will be stored in the folder data/2. manual check
We take our "old" data from the live database, not old csv files. Take a dump of your live database and import it to your own mysql database.
Now export the data: 3. export livedata/0. export database.ipynb
. This file exports the data to a csv.
Be aware: To not use this data for calculations! The export drops duplicates and merges different tables. The generated files are only for matching purpose later!
Now you need to concatenate all files, clean them and prepare them for fuzzy matching. Therefore run each notebook in the folder tasks/4. transform files
.
Have a look at 4. check dataset.ipynb
. If there are addresses witouth a source (eq pharma company), add them to the csv tasks/6. database export/sources/liste_companies.csv
and `pharma_source.csv``
Now we need to find pairs. We do this in four steps:
- Compare the data to the exported live data. It uses fuzzy matching and looks for the most similar address.
0_match_to_livedb.ipynb
Run it twice and changerun_for = 'hcp'
! - There are some tricky address. We force them to pairs. There are two google spreadsheets, for hcp and hco each. There we put the tricky candidats. The following scripts will use this spread sheets and creates the matching
1_overright_parent_by_force_matching.ipynb
You need to update the spread sheet url! - Now we compare all addresses without a sibling to each others and create a network
2_match_missing.ipynb
Run it twice and changerun_for = 'hcp'
! - Lets group everything. After that, our data is ready.
3_group_matches.ipynb
Now we create a new database dump with only the new data. You can run this on your live database and the update queries will do the rest.
6. database export/0. create database import.ipynb
data/4. database/data_dump.sql
is your file. If the import fails, you can import file by file and find the problem.
To check your data, there are some notebooks. This notebooks will create excel sheets.
To do this, your local mysql database needs to be up to date, so if you have not imported data_dump.sql
yet, do it know!
tasks/99. Analyzes/Summarize Pharmas by disclosure.ipynb
will create a table with information about each company and its disclosuretasks/99. Analyzes/Summarize Pharmas by Recipient Type.ipynb
does the same, but categorizes by hcp and hco.Export Recipients 2018 Clean.ipynb
exports all recipients from year 2018Export Recipients 2018.ipynb
does the same, will give you each address and group, the address is in it. Be patient...Export all Recipients.ipynb
exports everything
Ask author: https://github.com/simonhuwiler/