Use PyPy to speed up *_postgres2es.py scripts #137

Open
anthonyfok opened this issue Oct 22, 2021 · 0 comments · May be fixed by #139
Labels: Enhancement (New feature or request)


anthonyfok commented Oct 22, 2021

While tweaking the Dockerfile for ghcr.io/opendrr/python-env (replacing some pip-installed Python libraries with Debian-prepackaged ones) and searching for python3-psycopg2 (apt search psycopg2), I came across the python3-psycopg2cffi package. That eventually led me to PyPy, the alternative Python implementation whose just-in-time compiler promises much faster execution than the standard interpreted CPython implementation.

Early test results look promising. The time to run the following command:

geojsonobject = self.getGeoJson(sqlquerystring, self.pgConnection())

is reduced from 11 seconds with python3 to 6 seconds with pypy3. The time spent in populateElasticsearchIndex() does not change much.
Overall, it appears we could shave up to 30% off the time needed to run the *_postgres2es.py scripts, though the gain apparently varies among scripts and datasets.
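
To compare the two interpreters step by step rather than by eye, each call could be wrapped in a small timer. This is only a sketch: `timed` and `fake_get_geojson` are hypothetical names introduced here, and the stand-in function takes the place of the real `self.getGeoJson(...)` call, which needs a live database.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Add the wall-clock time of the enclosed block to results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        results[label] = results.get(label, 0.0) + elapsed


def fake_get_geojson():
    """Hypothetical stand-in for self.getGeoJson(sqlquerystring, ...)."""
    return {"type": "FeatureCollection", "features": []}


timings = {}
with timed("getGeoJson", timings):
    geojsonobject = fake_get_geojson()

# Print the accumulated seconds per step.
for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f}s")
```

Running the same script under python3 and pypy3 and comparing the printed totals would show which steps actually benefit.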

CPython:

$ PYTHONUNBUFFERED=1 python3 psra_postgres2es.py | ts -s
00:00:03 SELECT *, ST_AsGeoJSON(geom_poly)  FROM results_psra_national.psra_all_indicators_s  ORDER BY psra_all_indicators_s."Sauid"  LIMIT 10000  OFFSET 0
00:00:14 populateElasticsearchIndex()
00:00:24 SELECT *, ST_AsGeoJSON(geom_poly)  FROM results_psra_national.psra_all_indicators_s  ORDER BY psra_all_indicators_s."Sauid"  LIMIT 10000  OFFSET 10000
00:00:36 populateElasticsearchIndex()
00:00:49 ...

PyPy:

$ PYTHONUNBUFFERED=1 pypy3 psra_postgres2es.py | ts -s
00:00:01 SELECT *, ST_AsGeoJSON(geom_poly)  FROM results_psra_national.psra_all_indicators_s  ORDER BY psra_all_indicators_s."Sauid"  LIMIT 10000  OFFSET 0
00:00:07 populateElasticsearchIndex()
00:00:17 SELECT *, ST_AsGeoJSON(geom_poly)  FROM results_psra_national.psra_all_indicators_s  ORDER BY psra_all_indicators_s."Sauid"  LIMIT 10000  OFFSET 10000
00:00:22 populateElasticsearchIndex()
00:00:34 ...
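
The per-step durations can be worked out from the `ts -s` timestamps in the two runs above. The sketch below assumes each line is printed when its step starts, so a step lasts until the next timestamp; the labels are abbreviated and the helper names are hypothetical.

```python
def to_seconds(stamp):
    """Convert an HH:MM:SS stamp, as printed by `ts -s`, to seconds."""
    hours, minutes, seconds = (int(part) for part in stamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def step_durations(log):
    """Return (label, seconds) pairs; each step lasts until the next stamp."""
    stamps = [(to_seconds(stamp), label) for stamp, label in log]
    return [
        (label, following - start)
        for (start, label), (following, _) in zip(stamps, stamps[1:])
    ]


# Timestamps copied from the CPython and PyPy runs; labels are shortened.
CPYTHON = [("00:00:03", "SELECT"), ("00:00:14", "populate"),
           ("00:00:24", "SELECT"), ("00:00:36", "populate"),
           ("00:00:49", "...")]
PYPY = [("00:00:01", "SELECT"), ("00:00:07", "populate"),
        ("00:00:17", "SELECT"), ("00:00:22", "populate"),
        ("00:00:34", "...")]

cpython_steps = step_durations(CPYTHON)
pypy_steps = step_durations(PYPY)

# Relative drop in elapsed time at the last timestamp shown in each log.
reduction = 1 - to_seconds(PYPY[-1][0]) / to_seconds(CPYTHON[-1][0])

for (label, before), (_, after) in zip(cpython_steps, pypy_steps):
    print(f"{label:<9} {before:>3}s -> {after:>3}s")
print(f"elapsed at last stamp: {reduction:.0%} lower")
```

Under that assumption, the two SELECT steps take 11 s and 12 s with CPython against 6 s and 5 s with PyPy, while the populateElasticsearchIndex() steps stay at roughly 10 to 13 s on both.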

Pull request will follow soon.

Prerequisite:

  • pypy3 and python3-psycopg2cffi preinstalled in ghcr.io/opendrr/python-env container image (probably tagged 1.1.0?)
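
With both packages in the image, a script could pick the database driver at run time. This is a minimal sketch, not the change in the pull request: `register_psycopg2_compat` is a hypothetical helper, and it relies on the `psycopg2cffi.compat.register()` shim described in the psycopg2cffi README, which makes later `import psycopg2` statements resolve to psycopg2cffi.

```python
import platform


def register_psycopg2_compat():
    """Alias psycopg2cffi as psycopg2 when running under PyPy.

    Returns True if the shim was registered, False otherwise
    (CPython, or psycopg2cffi not installed).
    """
    if platform.python_implementation() != "PyPy":
        return False
    try:
        from psycopg2cffi import compat
    except ImportError:
        return False
    compat.register()
    return True


# Call this before anything imports psycopg2.
using_cffi = register_psycopg2_compat()
print("psycopg2cffi shim active:", using_cffi)
```

Under python3 the helper does nothing, so the same script would keep using the regular python3-psycopg2 package there.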
@anthonyfok anthonyfok added this to the Sprint 45 milestone Oct 22, 2021
@anthonyfok anthonyfok self-assigned this Oct 22, 2021
anthonyfok added a commit to anthonyfok/opendrr-api that referenced this issue Oct 22, 2021
anthonyfok added a commit to anthonyfok/opendrr-api that referenced this issue Oct 22, 2021
anthonyfok added a commit to anthonyfok/python-env that referenced this issue Oct 22, 2021
Fast-forward to debian:sid-20201012-slim, which is the last snapshot
before Python 3.9 became the default in Debian

Install PyPy (pypy3 7.3.2) for speed improvement; see OpenDRR/opendrr-api#137

Refresh most Python libraries from Debian packages, and avoid building
psycopg2 from source:

 - python3-numpy: 1.19.2 (was 1.18.5)
 - python3-pandas: 1.0.5 (was 1.0.4)
 - python3-psycopg2: 2.8.5 (was 2.6)
 - python3-psycopg2cffi: 2.8.1 (newly added)
 - python3-requests: 2.23.0 (was 2.23.0)
 - python3-sqlalchemy: 1.3.19 (was 1.3.17)

From PyPI using pip3:

 - elasticsearch: 7.12.0 (was 7.7.1)

Install git 2.30.2 and git-lfs 2.13.2 from Debian 11 (bullseye)
as the old git-lfs 2.11.0 hangs at "GIT_LFS_SKIP_SMUDGE=1 git checkout".

Move installation of extra utilities needed by add_data.sh from
OpenDRR/opendrr-api python/Dockerfile to this repo:

 - dos2unix eatmydata jq moreutils nano time

Add org.opencontainers.image.* labels

Clean up /var/lib/apt/lists/* to further save space.

Image size has been reduced from 662MB to 561MB.
anthonyfok added a commit to anthonyfok/opendrr-api that referenced this issue Oct 22, 2021
@anthonyfok anthonyfok modified the milestones: Sprint 45, Sprint 46 Nov 8, 2021
@anthonyfok anthonyfok modified the milestones: Sprint 46, Sprint 47 Nov 22, 2021
@anthonyfok anthonyfok removed this from the Sprint 47 milestone Jan 17, 2022