... because I can't get my hands on the detailed results of more recent censuses.
Important Note
If you use this data in a publication, Statistics Indonesia (BPS) requires you to cite BPS, or otherwise acknowledge it, as the source of the data.
If you also credit me, or mention that I converted the data from PDF to CSV, I'll be glad, though I don't know exactly how that should be done.
- The PDF file is the original source material (book) from which I extracted the data. It contains tabulated population counts, which I extracted using a tool called Camelot.
- The Python files are the scripts I used to extract and tidy the data.
- The CSV files are the outputs of the Python scripts. They contain the population data in CSV format, which can be loaded and read using Excel.
  - ID-population-kec-by-book.csv has the rows for district, region, and province aggregates mixed together, exactly as found in the book. Use this file if you need to look something up exactly as it appears in the book.
  - ID-population-kec-tidy.csv is the tidier format, with one row per district. I recommend using this one.
  - warnings-row-with-newline.csv is there only for debugging purposes and does not contain any meaningful population data.
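Besides Excel, the CSV files can also be read from a script. The following is a minimal sketch using only Python's standard library. The helper name `load_rows` is hypothetical, and UTF-8 encoding is an assumption. Since the column names are not listed here, the code does not assume any and simply reports whatever header the file has. `pandas.read_csv` would work equally well.

```python
import csv
import sys


def load_rows(path):
    """Read a CSV file into a list of dicts keyed by its header row."""
    # Assumption: the file is UTF-8; "utf-8-sig" also tolerates a leading BOM.
    # newline="" lets the csv module handle line breaks inside quoted cells.
    with open(path, newline="", encoding="utf-8-sig") as f:
        return list(csv.DictReader(f))


if __name__ == "__main__" and len(sys.argv) > 1:
    rows = load_rows(sys.argv[1])
    print(len(rows), "rows")
    if rows:
        print("columns:", list(rows[0].keys()))
```

To try it, save the sketch under any name and pass the path of the tidy CSV file as the only argument.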
Environment used to perform this work:
- Windows 7
- Python 3.8.5
  - pandas 1.3.4
  - camelot-py 0.10.1
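If you want to compare your own environment with the one listed above, a small check like the following may help. It is only a sketch: the names `EXPECTED` and `installed_versions` are made up for this example, and a different version does not necessarily mean the scripts will fail.

```python
from importlib import metadata

# Package versions as listed in this README.
EXPECTED = {"pandas": "1.3.4", "camelot-py": "0.10.1"}


def installed_versions(expected):
    """Map each package name to (expected_version, installed_version_or_None)."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None  # the package is not installed in this environment
        report[name] = (wanted, found)
    return report


if __name__ == "__main__":
    for name, (wanted, found) in installed_versions(EXPECTED).items():
        status = "same" if found == wanted else "differs"
        print(f"{name}: listed {wanted}, installed {found} ({status})")
```

`importlib.metadata` is part of the standard library from Python 3.8 onward, so the check itself needs no extra dependency.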
These are the steps I followed to obtain the data:
- Make sure the dependencies are installed.
- Have the PDF file and the Python scripts in one folder.
- Read the data from the PDF by invoking (in the folder):
  python reading_data.py
  This step will create ID-population-kec-by-book.csv and warnings-row-with-newline.csv.
- Tidy the data into a more convenient format by invoking:
  python transforming_data_tidy.py
  This step will create ID-population-by-kec-tidy.csv.
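The two invocations above can also be chained from a single script. This is a sketch, not part of the repository: the function name `run_steps` is hypothetical, and it assumes only what the steps state, namely that both scripts sit in the same folder as the PDF and are run from that folder in this order.

```python
import subprocess
import sys
from pathlib import Path

# Script names as given in the steps above, in the order they must run.
SCRIPTS = ("reading_data.py", "transforming_data_tidy.py")


def run_steps(folder, scripts=SCRIPTS):
    """Run each script in order inside `folder`, stopping at the first failure."""
    folder = Path(folder)
    for script in scripts:
        if not (folder / script).is_file():
            raise FileNotFoundError(f"{script} not found in {folder}")
        # sys.executable reuses the interpreter that runs this sketch;
        # check=True raises CalledProcessError if a script exits non-zero.
        subprocess.run([sys.executable, script], cwd=folder, check=True)


if __name__ == "__main__" and len(sys.argv) > 1:
    run_steps(sys.argv[1])
```

Running the scripts with the folder as the working directory mirrors invoking them "in the folder", so any relative paths they use should resolve the same way.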
This work is available thanks to:
- Statistics Indonesia (Badan Pusat Statistik), the Indonesian official statistics bureau that carried out the census and published the data.
- Camelot, the Python library used to pull the data from PDF format.
Also:
- Original link from which I downloaded the PDF file. I cannot absolutely guarantee that this file is the original, but I think it's fine.