
Postgres throws FATAL: sorry, too many clients already #799

Closed
nesnoj opened this issue Jun 7, 2022 · 7 comments · Fixed by #1062 · May be fixed by #1034

nesnoj commented Jun 7, 2022

In motorized individual travel #553 I introduced parallelization in c3542f3, which also leads to parallel DB queries/commits.

As reported via email, postgres sometimes throws FATAL: sorry, too many clients already. For testing I increased the config param max_connections from 100 (default) to 300, and the error now occurs only rarely (and only with many other parallel tasks). Tracking the open connections, I see that my tasks use a maximum of 4 connections at once. I observed that the connections in my tasks are closed over time, although I don't know how long they stay idle. So with 20 tasks (in addition to the other, non-emobility tasks running) the limit of 100 is easily exceeded.

Shall we increase max_connections, or is there a better way of handling connections?
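One way to keep the total connection count bounded, regardless of how many tasks run in parallel, is to make tasks share a pool with a hard cap: callers block and wait for a free slot instead of opening connection number 101. A minimal, hypothetical sketch in pure Python (the `connect` factory stands in for a real psycopg2/SQLAlchemy connection; none of these names come from the egon-data codebase):

```python
import queue

class BoundedConnectionPool:
    """Hand out at most `max_size` connections at a time (illustrative sketch)."""

    def __init__(self, connect, max_size=4):
        self._connect = connect              # factory creating a real DB connection
        self._slots = queue.Queue(maxsize=max_size)
        for _ in range(max_size):
            self._slots.put(None)            # None = free slot, connection not yet opened

    def acquire(self, timeout=30):
        # Blocks (up to `timeout` seconds) when all slots are in use,
        # instead of opening an extra connection on the server.
        conn = self._slots.get(timeout=timeout)
        return conn if conn is not None else self._connect()

    def release(self, conn):
        # Return the live connection so the next caller can reuse it.
        self._slots.put(conn)

# Usage with a dummy connection factory that counts how often it is called:
made = []
pool = BoundedConnectionPool(lambda: made.append(1) or object(), max_size=2)
a = pool.acquire()
b = pool.acquire()
pool.release(a)
c = pool.acquire()   # reuses `a` instead of opening a third connection
```

With this shape, 20 tasks sharing one pool of size 4 can never hold more than 4 server connections, which is essentially what a properly shared SQLAlchemy engine pool provides out of the box.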

@nesnoj nesnoj added the 🐛 bug Something isn't working label Jun 7, 2022
@nesnoj nesnoj assigned gnn and nesnoj Jun 7, 2022

nesnoj commented Jun 8, 2022

Now I'm getting some psycopg2.errors.DeadlockDetected exceptions, which seem to be an Airflow problem that has been discussed before but never fully solved - there's still an open issue on this.

Full log (example):

[2022-06-08 05:31:48,801] {taskinstance.py:901} INFO - Executing <Task(MotorizedIndividualTravel (versioned)): generate_model_data_eGon100RE_bunch75-89> on 2022-06-07T21:46:34.686278+00:00
[2022-06-08 05:31:48,810] {standard_task_runner.py:54} INFO - Started process 3770084 to run task
[2022-06-08 05:31:48,839] {standard_task_runner.py:77} INFO - Running: ['airflow', 'run', 'egon-data-processing-pipeline', 'generate_model_data_eGon100RE_bunch75-89', '2022-06-07T21:46:34.686278+00:00', '--job_id', '218', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/dags/pipeline.py', '--cfg_path', '/tmp/tmpn1015rip']
[2022-06-08 05:31:48,839] {standard_task_runner.py:78} INFO - Job 218: Subtask generate_model_data_eGon100RE_bunch75-89
[2022-06-08 05:31:49,085] {logging_mixin.py:120} INFO - Running <TaskInstance: egon-data-processing-pipeline.generate_model_data_eGon100RE_bunch75-89 2022-06-07T21:46:34.686278+00:00 [running]> on host XXX
[2022-06-08 05:31:49,284] {logging_mixin.py:120} INFO - SCENARIO: eGon100RE, SCENARIO VARIATION: Reference 2050, BUNCH: 75-89
[2022-06-08 05:31:50,761] {logging_mixin.py:120} INFO - 44300 EV loaded (1000 unique) in 15 grid districts.
[2022-06-08 05:31:50,763] {logging_mixin.py:120} INFO - GENERATE MODEL DATA...
[2022-06-08 05:31:50,763] {logging_mixin.py:120} INFO - Processing grid district: bus 1437... (1/15)
[2022-06-08 05:31:50,768] {logging_mixin.py:120} INFO -   Loading trips...
[2022-06-08 05:32:11,277] {logging_mixin.py:120} INFO -   Preprocessing data...
[2022-06-08 05:32:17,087] {logging_mixin.py:120} INFO -   Generating load timeseries...
[2022-06-08 05:33:13,918] {logging_mixin.py:120} INFO -   Writing model timeseries...
[2022-06-08 05:35:41,220] {taskinstance.py:1150} ERROR - (psycopg2.errors.DeadlockDetected) deadlock detected
DETAIL:  Process 77606 waits for ShareLock on transaction 27186; blocked by process 77675.
Process 77675 waits for ShareLock on transaction 27177; blocked by process 77606.
HINT:  See server log for query details.
CONTEXT:  while inserting index tuple (8,92) in relation "egon_etrago_load_pkey"

[SQL: INSERT INTO grid.egon_etrago_load (scn_name, load_id, bus, type, carrier, p_set, q_set, sign) VALUES (%(scn_name)s, %(load_id)s, %(bus)s, %(type)s, %(carrier)s, %(p_set)s, %(q_set)s, %(sign)s)]
[parameters: {'scn_name': 'eGon100RE', 'load_id': 879, 'bus': 7423, 'type': None, 'carrier': 'land transport EV', 'p_set': None, 'q_set': None, 'sign': -1}]
(Background on this error at: http://sqlalche.me/e/13/e3q8)
Traceback (most recent call last):
  File "/storage/eGon-data-sh/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1276, in _execute_context
    self.dialect.do_execute(
  File "/storage/eGon-data-sh/venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 608, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.DeadlockDetected: deadlock detected
DETAIL:  Process 77606 waits for ShareLock on transaction 27186; blocked by process 77675.
Process 77675 waits for ShareLock on transaction 27177; blocked by process 77606.
HINT:  See server log for query details.
CONTEXT:  while inserting index tuple (8,92) in relation "egon_etrago_load_pkey"


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/storage/eGon-data-sh/venv/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 984, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/storage/eGon-data-sh/code/src/egon/data/datasets/__init__.py", line 194, in skip_task
    result = super(type(task), task).execute(*xs, **ks)
  File "/storage/eGon-data-sh/venv/lib/python3.8/site-packages/airflow/operators/python_operator.py", line 113, in execute
    return_value = self.execute_callable()
  File "/storage/eGon-data-sh/venv/lib/python3.8/site-packages/airflow/operators/python_operator.py", line 118, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/storage/eGon-data-sh/code/src/egon/data/datasets/emobility/motorized_individual_travel/model_timeseries.py", line 942, in generate_model_data_bunch
    write_model_data_to_db(
  File "/storage/eGon-data-sh/code/src/egon/data/datasets/emobility/motorized_individual_travel/model_timeseries.py", line 788, in write_model_data_to_db
    write_to_db()
  File "/storage/eGon-data-sh/code/src/egon/data/db.py", line 327, in commit
    ret = func(*args, **kwargs)
  File "/storage/eGon-data-sh/code/src/egon/data/datasets/emobility/motorized_individual_travel/model_timeseries.py", line 710, in write_to_db
    session.commit()
  File "/storage/eGon-data-sh/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 1046, in commit
    self.transaction.commit()
  File "/storage/eGon-data-sh/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 504, in commit
    self._prepare_impl()
  File "/storage/eGon-data-sh/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 483, in _prepare_impl
    self.session.flush()
  File "/storage/eGon-data-sh/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2540, in flush
    self._flush(objects)
  File "/storage/eGon-data-sh/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2682, in _flush
    transaction.rollback(_capture_exception=True)
  File "/storage/eGon-data-sh/venv/lib/python3.8/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.raise_(
  File "/storage/eGon-data-sh/venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
    raise exception
  File "/storage/eGon-data-sh/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2642, in _flush
    flush_context.execute()
  File "/storage/eGon-data-sh/venv/lib/python3.8/site-packages/sqlalchemy/orm/unitofwork.py", line 422, in execute
    rec.execute(self)
  File "/storage/eGon-data-sh/venv/lib/python3.8/site-packages/sqlalchemy/orm/unitofwork.py", line 586, in execute
    persistence.save_obj(
  File "/storage/eGon-data-sh/venv/lib/python3.8/site-packages/sqlalchemy/orm/persistence.py", line 239, in save_obj
    _emit_insert_statements(
  File "/storage/eGon-data-sh/venv/lib/python3.8/site-packages/sqlalchemy/orm/persistence.py", line 1083, in _emit_insert_statements
    c = cached_connections[connection].execute(statement, multiparams)
  File "/storage/eGon-data-sh/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1011, in execute
    return meth(self, multiparams, params)
  File "/storage/eGon-data-sh/venv/lib/python3.8/site-packages/sqlalchemy/sql/elements.py", line 298, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/storage/eGon-data-sh/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1124, in _execute_clauseelement
    ret = self._execute_context(
  File "/storage/eGon-data-sh/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1316, in _execute_context
    self._handle_dbapi_exception(
  File "/storage/eGon-data-sh/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1510, in _handle_dbapi_exception
    util.raise_(
  File "/storage/eGon-data-sh/venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
    raise exception
  File "/storage/eGon-data-sh/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1276, in _execute_context
    self.dialect.do_execute(
  File "/storage/eGon-data-sh/venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 608, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (psycopg2.errors.DeadlockDetected) deadlock detected
DETAIL:  Process 77606 waits for ShareLock on transaction 27186; blocked by process 77675.
Process 77675 waits for ShareLock on transaction 27177; blocked by process 77606.
HINT:  See server log for query details.
CONTEXT:  while inserting index tuple (8,92) in relation "egon_etrago_load_pkey"

[SQL: INSERT INTO grid.egon_etrago_load (scn_name, load_id, bus, type, carrier, p_set, q_set, sign) VALUES (%(scn_name)s, %(load_id)s, %(bus)s, %(type)s, %(carrier)s, %(p_set)s, %(q_set)s, %(sign)s)]
[parameters: {'scn_name': 'eGon100RE', 'load_id': 879, 'bus': 7423, 'type': None, 'carrier': 'land transport EV', 'p_set': None, 'q_set': None, 'sign': -1}]
(Background on this error at: http://sqlalche.me/e/13/e3q8)
[2022-06-08 05:35:41,223] {taskinstance.py:1187} INFO - Marking task as FAILED. dag_id=egon-data-processing-pipeline, task_id=generate_model_data_eGon100RE_bunch75-89, execution_date=20220607T214634, start_date=20220608T033148, end_date=20220608T033541
[2022-06-08 05:35:46,532] {local_task_job.py:102} INFO - Task exited with return code 1


nesnoj commented Jun 10, 2022

As proposed by @gnn, I'm testing with some modifications made in 157e29a.


nesnoj commented Jun 22, 2022

Options:

  1. Increase max_connections
  2. Handle in Add error handling and retry to session_scope() #816
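Option 2 refers to adding error handling and retry to session_scope() (#816). The general shape of a deadlock retry can be sketched as below; note this is an illustrative reconstruction, not the actual #816 code, and `DeadlockDetected` here is a stand-in class rather than the real psycopg2 exception. The idea works because PostgreSQL aborts only one victim transaction of a deadlock cycle, so re-running the victim usually succeeds once the other writer has committed:

```python
import functools
import random
import time

class DeadlockDetected(Exception):
    """Stand-in for psycopg2.errors.DeadlockDetected in this sketch."""

def retry_on_deadlock(max_retries=3, base_delay=0.1):
    """Re-run the wrapped function when the DB reports a deadlock."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except DeadlockDetected:
                    if attempt == max_retries:
                        raise
                    # Jittered exponential backoff so the competing
                    # tasks don't immediately collide again.
                    time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
        return wrapper
    return decorator

# Usage: a write that deadlocks twice and then succeeds.
calls = {"n": 0}

@retry_on_deadlock(max_retries=3, base_delay=0.0)
def write_batch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise DeadlockDetected("deadlock detected")
    return "committed"
```

Retrying only helps if each attempt runs in a fresh transaction; retrying inside an already-aborted transaction would fail immediately.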


nesnoj commented Jun 30, 2022

We decided to go for option 1.

@gnn After some reading I tried to adjust the docker-compose.yml as follows (added command: postgres -c max_connections=300):

version: '3'
services:
  egon-data-local-database-service:
    user: {uid}:{gid}
    image: postgres:12-postgis
    container_name: {--docker-container-name}
    restart: unless-stopped
    build:
      context: .
      dockerfile: {airflow}/Dockerfile.postgis
    ports:
    - "{--database-host}:{--database-port}:5432"
    environment:
      POSTGRES_DB: {--airflow-database-name}
      POSTGRES_USER: {--database-user}
      POSTGRES_PASSWORD: {--database-password}
    command: postgres -c max_connections=300
    volumes:
    - ./database-data/:/var/lib/postgresql/data
    - /dev/shm:/dev/shm

However, after logging in (docker exec -i -t <CONTAINER_ID> /bin/bash) and running cat /var/lib/postgresql/data/postgresql.conf | grep max_connections, I found that the default value was retained. Is this because Airflow takes care of the DB startup, skipping my args?
Could you please take care of this?


nesnoj commented Sep 7, 2022

@gnn If option 1 is still the preferred way, could you come up with an idea and/or a solution?
Otherwise this could also be included in #816, but until then it requires manually editing postgres' conf each time the container is created.


nesnoj commented Oct 26, 2022


@gnn Uh, I was just looking in the wrong place - the solution works! As you said, the config file isn't touched by the custom command: statement. But the value is actually set, which is confirmed by checking SELECT * FROM pg_settings WHERE name='max_connections'; - here boot_val is 100 but setting is 1000 😄 :

      name       | setting | unit |                       category                       |                     short_desc                     | extra_desc |  context   | vartype |    source    | min_val | max_val | enumvals | boot_val | reset_val | sourcefile | sourceline | pending_restart 
-----------------+---------+------+------------------------------------------------------+----------------------------------------------------+------------+------------+---------+--------------+---------+---------+----------+----------+-----------+------------+------------+-----------------
 max_connections | 1000    |      | Connections and Authentication / Connection Settings | Sets the maximum number of concurrent connections. |            | postmaster | integer | command line | 1       | 262143  |          | 100      | 1000      |            |            | f

Also, it is retained across restarts!

So it might be unnecessary to use another script to set the value; the line above seems to be sufficient. Or am I overlooking something, @gnn?

Also, thanks to @Bachibouzouk for your help!


gnn commented Oct 27, 2022

Yeah, I meant SELECT * FROM pg_settings WHERE name='max_connections';.
Or

SELECT name, setting, short_desc, vartype, max_val, boot_val, reset_val, pending_restart
FROM pg_settings
WHERE name = 'max_connections';

if you want to limit it to the "important" stuff and avoid line breaks (at least on my monitor; YMMV). Or SHOW max_connections; if you just want the value. That's what I used to check that max_connections was set to 500 after starting the CI. I read the thread you linked, though only after digging it up myself (I could've saved some time by reading your post thoroughly), and went for a different solution, because I had some sort of problem with command: postgres -c max_connections=300. You can do a git show f25cb1f789865ffe4eec8be251c7b308ce587543 on the server to check out how I set max_connections. The commit is not officially part of the CI, but it's a separate file in the "patches" directory, which is on the same level as "code" and "working-directory". Here's a short explanation.
Anyway, you should be able to increase the max_connections value on a live server by manipulating the value in pg_settings via SQL.

@gnn gnn mentioned this issue Nov 1, 2022
gnn added a commit that referenced this issue Nov 14, 2022
gnn added a commit that referenced this issue Nov 14, 2022
gnn added a commit that referenced this issue Dec 7, 2022
More generally, implement a function which caches `Engine` objects on a
per argument basis and change `engine` to return the same `Engine` for
every call from the same process, as determined by process id.
Since engines are responsible for managing the connection pool, are
supposed to be created once, globally per application, but are also not
supposed to be shared across process boundaries, this is the best
compromise which hopefully fixes or at least mitigates issue #799.
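The per-process caching described in this commit message can be sketched in a few lines. This is an illustrative reconstruction, not the actual #1062 code: the `cached_per_process` name is invented, and a dummy dict factory stands in for sqlalchemy.create_engine. Keying the cache on the process id matters because engines own a connection pool and must not be shared across process boundaries (e.g. after a fork):

```python
import functools
import os

def cached_per_process(factory):
    """Return the same object for identical arguments within one process.

    The process id is part of the cache key, so a forked worker never
    reuses an engine (and its pooled sockets) created by its parent.
    """
    cache = {}

    @functools.wraps(factory)
    def wrapper(*args):
        key = (os.getpid(), args)
        if key not in cache:
            cache[key] = factory(*args)
        return cache[key]

    return wrapper

@cached_per_process
def engine(url="postgresql://localhost/egon"):
    # Stand-in for sqlalchemy.create_engine(url); each engine owns a pool.
    return {"url": url}

# Repeated calls in the same process share one engine (and one pool):
shared = engine() is engine()
```

Because every call in a process returns the same engine, the process keeps a single connection pool instead of opening a fresh set of connections per task, which is what exhausted max_connections in the first place.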
@gnn gnn closed this as completed in #1062 Jan 18, 2023