This repository has been archived by the owner on May 14, 2020. It is now read-only.

Free database space in demo environment #247

Closed
8 tasks done
giorgiosironi opened this issue Sep 9, 2019 · 31 comments
Labels: continuous delivery, jats-ingester

Comments

@giorgiosironi
Member

giorgiosironi commented Sep 9, 2019

demo.libero.pub has an open alert for >80% disk utilization. It is currently at 86%, so if left unaddressed the demo may stop working in 2-3 weeks.

Old Docker images have all been cleaned up already. It boils down to the data in the Postgres database of jats-ingester.

ubuntu@ip-172-31-66-198:~$ sudo du -h --max-depth=1 /var/lib/docker/volumes
40M     /var/lib/docker/volumes/data-blog-articles
8.0K    /var/lib/docker/volumes/data-s3
25M     /var/lib/docker/volumes/data-search
56M     /var/lib/docker/volumes/data-scholarly-articles
12K     /var/lib/docker/volumes/sample-configuration_public-api
2.5G    /var/lib/docker/volumes/data-jats-ingester
1016K   /var/lib/docker/volumes/sample-configuration_public-browser
2.6G    /var/lib/docker/volumes

Proposed solution

  • Find the right commands to run in Airflow, psql or elsewhere to clean up a sizeable subset of old logs, or data that is no longer needed to run the demo: for example, logs of DAGs executed more than 1 month ago, or even all DAG runs executed more than 1 month ago (a rough sketch of what this could look like follows the list)
  • Execute in the demo and possibly the unstable environment
  • Store the knowledge somewhere accessible (README?) for when this happens again
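
For illustration, a minimal sketch of what that cleanup could look like, assuming the stock Airflow 1.10 metadata schema (log.dttm, xcom.execution_date, task_instance.execution_date, dag_run.execution_date); AIRFLOW_DB_URI is a placeholder for the real connection string and the 30-day window is only an example:

# Sketch only: prune Airflow metadata older than 30 days.
# AIRFLOW_DB_URI is a placeholder; column names come from the stock
# Airflow 1.10 schema and should be double-checked before running anything.
psql "$AIRFLOW_DB_URI" <<'SQL'
BEGIN;
DELETE FROM log           WHERE dttm           < NOW() - INTERVAL '30 days';
DELETE FROM xcom          WHERE execution_date < NOW() - INTERVAL '30 days';
DELETE FROM task_instance WHERE execution_date < NOW() - INTERVAL '30 days';
DELETE FROM dag_run       WHERE execution_date < NOW() - INTERVAL '30 days';
COMMIT;
SQL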

/cc @GiancarloFusiello

Checklist

  • unstable environment has running DAGs
  • unstable environment has running redis broker
  • unstable environment does not have kombu_* tables anymore
  • unstable environment's disk is well below 80% full
  • demo environment has running DAGs
  • demo environment has running redis broker
  • demo environment does not have kombu_* tables anymore
  • demo environment's disk is well below 80% full
giorgiosironi added the continuous delivery label Sep 9, 2019
@giorgiosironi
Member Author

Labelled with continuous delivery because there isn't a clear operations label.

@giorgiosironi
Member Author

Possible data-retention target for this service: 2 weeks, with the use case being debugging of recent ingestions.

GiancarloFusiello self-assigned this Sep 12, 2019
@GiancarloFusiello

@giorgiosironi Are there instructions for accessing the jats-ingester database? e.g. host address, getting SSH access

@GiancarloFusiello

Adding for future reference. How to access the unstable/demo environments: https://github.com/libero/environments/#tasks

@GiancarloFusiello

airflow-db=# SELECT
airflow-db-# pg_database.datname,
airflow-db-# pg_size_pretty(pg_database_size(pg_database.datname)) AS size
airflow-db-# FROM pg_database;

  datname   |  size
------------+---------
 postgres   | 7675 kB
 airflow-db | 2526 MB
 template1  | 7537 kB
 template0  | 7537 kB
(4 rows)

@GiancarloFusiello

airflow-db=# SELECT
airflow-db-# relname as "Table",
airflow-db-# pg_size_pretty(pg_total_relation_size(relid)) As "Size",
airflow-db-# pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) as "External Size"
airflow-db-# FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC;

         Table         |    Size    | External Size
-----------------------+------------+---------------
 kombu_message         | 612 MB     | 77 MB
 xcom                  | 603 MB     | 565 MB
 log                   | 504 MB     | 103 MB
 task_instance         | 473 MB     | 381 MB
 job                   | 191 MB     | 128 MB
 celery_taskmeta       | 78 MB      | 39 MB
 dag_run               | 56 MB      | 38 MB
 task_fail             | 104 kB     | 80 kB
 dag                   | 72 kB      | 56 kB
 kombu_queue           | 40 kB      | 32 kB
 task_reschedule       | 24 kB      | 24 kB
 kube_worker_uuid      | 24 kB      | 16 kB
 users                 | 24 kB      | 24 kB
 variable              | 24 kB      | 24 kB
 slot_pool             | 24 kB      | 24 kB
 alembic_version       | 24 kB      | 16 kB
 sla_miss              | 24 kB      | 24 kB
 kube_resource_version | 24 kB      | 16 kB
 celery_tasksetmeta    | 24 kB      | 24 kB
 dag_pickle            | 16 kB      | 16 kB
 import_error          | 16 kB      | 16 kB
 connection            | 16 kB      | 16 kB
 known_event           | 16 kB      | 16 kB
 chart                 | 16 kB      | 16 kB
 known_event_type      | 8192 bytes | 8192 bytes
(25 rows)

@giorgiosironi
Member Author

Kombu is the message queue, right? Best candidate for a cleanup that doesn't delete useful data.

@GiancarloFusiello

GiancarloFusiello commented Sep 12, 2019

At first glance, we can eliminate the 612 MB used by the kombu_message table by periodically clearing entries older than a certain timestamp, or by using another method of messaging rather than using the database as a message broker (see the Kombu documentation for more information).

It seems we are not the only ones dealing with the issue of database-entry retention. This blog post by Clairvoyant describes their experience with Apache Airflow maintenance, details issues similar to the ones we are facing, and links to their open-sourced maintenance DAGs.

As a next step, I will look into whether we can use these DAGs on the Libero jats-ingester.
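
As a rough illustration (not the Clairvoyant DAGs themselves), "clearing entries older than a certain timestamp" could boil down to something like the following, run on a schedule. The "timestamp" column name comes from kombu's SQLAlchemy transport and AIRFLOW_DB_URI is a placeholder, so both need verifying against the real schema and connection details:

# Sketch only: drop broker messages older than 7 days from kombu_message.
# Verify the column name against the actual kombu_message schema first.
psql "$AIRFLOW_DB_URI" <<'SQL'
DELETE FROM kombu_message WHERE "timestamp" < NOW() - INTERVAL '7 days';
SQL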

@GiancarloFusiello

GiancarloFusiello commented Sep 12, 2019

I can also see that the latest Airflow release is version 1.10.5, while jats-ingester is currently using 1.10.3. There's nothing in the changelog that suggests any changes relating to this, but there are a lot of changes/new features. I would be interested in seeing if any new settings have been added that might deal with this issue.

@giorgiosironi
Member Author

using another method of messaging rather than using the database as a message broker

It's probably fair to say that message persistence is not very important at this point, if we want to transition to something that keeps messages in memory, especially as in this context the queue is internal to the application rather than a communication medium between services.
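
For reference, a minimal sketch of what pointing the Celery broker at Redis could look like, using Airflow's environment-variable overrides; the redis hostname and the result-backend DSN below are placeholders for whatever this stack actually uses:

# Sketch only: use Redis as the Celery broker instead of the database.
# Hostnames and credentials are placeholders.
export AIRFLOW__CELERY__BROKER_URL=redis://redis:6379/0
export AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://airflow:airflow@postgres:5432/airflow-db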

@GiancarloFusiello

I would be interested in seeing if any new settings have been added that might deal with this issue

Just checked the config file in 1.10.5. No new settings relating to this issue. Focusing on using Clairvoyant DAGs.

@GiancarloFusiello

unstable database size readings:

airflow-db=# SELECT pg_database.datname, pg_size_pretty(pg_database_size(pg_database.datname)) AS size FROM pg_database;
  datname   |  size
------------+---------
 postgres   | 7683 kB
 airflow-db | 643 MB
 template1  | 7545 kB
 template0  | 7545 kB
(4 rows)

airflow-db=# SELECT relname as "Table", pg_size_pretty(pg_total_relation_size(relid)) As "Size", pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) as "External Size" FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC;
         Table         |    Size    | External Size
-----------------------+------------+---------------
 kombu_message         | 201 MB     | 25 MB
 log                   | 164 MB     | 36 MB
 task_instance         | 123 MB     | 94 MB
 job                   | 62 MB      | 40 MB
 dag_run               | 31 MB      | 21 MB
 xcom                  | 28 MB      | 18 MB
 celery_taskmeta       | 25 MB      | 12 MB
 dag                   | 72 kB      | 56 kB
 import_error          | 64 kB      | 56 kB
 task_fail             | 48 kB      | 40 kB
 kombu_queue           | 40 kB      | 32 kB
 slot_pool             | 24 kB      | 24 kB
 users                 | 24 kB      | 24 kB
 variable              | 24 kB      | 24 kB
 task_reschedule       | 24 kB      | 24 kB
 kube_resource_version | 24 kB      | 16 kB
 alembic_version       | 24 kB      | 16 kB
 kube_worker_uuid      | 24 kB      | 16 kB
 sla_miss              | 24 kB      | 24 kB
 celery_tasksetmeta    | 24 kB      | 24 kB
 chart                 | 16 kB      | 16 kB
 connection            | 16 kB      | 16 kB
 known_event           | 16 kB      | 16 kB
 dag_pickle            | 16 kB      | 16 kB
 known_event_type      | 8192 bytes | 8192 bytes
(25 rows)

@giorgiosironi
Member Author

Last night demo.libero.pub was down for 1 hour:
https://alerts.newrelic.com/accounts/1451451/incidents/84974895/violations
May or may not be related to this.

Despite the broken DAGs, we should be able to deploy the Kombu-to-Redis change if it works in unstable, and then delete the kombu_* tables?
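
If the Redis broker works out, removing the tables could be as small as this sketch (AIRFLOW_DB_URI is a placeholder; to be run only once nothing reads those tables any more):

# Sketch only: remove the database-broker tables once Redis is in use.
# kombu_message references kombu_queue, so it is dropped first.
psql "$AIRFLOW_DB_URI" <<'SQL'
DROP TABLE IF EXISTS kombu_message;
DROP TABLE IF EXISTS kombu_queue;
SQL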

@giorgiosironi
Member Author

giorgiosironi commented Sep 18, 2019

Current demo status:

ubuntu@ip-172-31-66-198:~$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1       20G   20G     0 100% /

@giorgiosironi
Member Author

Current unstable status is much better, though I don't know why:

ubuntu@ip-172-31-72-104:~$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1       20G  8.2G   12G  43% /

@giorgiosironi
Member Author

Added a checklist to the issue description because I was starting to forget all the aspects to check. Deployment to demo is triggered by Git tags in environments.

@GiancarloFusiello

GiancarloFusiello commented Sep 19, 2019

Environment: unstable.libero.pub

  • dropped kombu tables
  • ran maintenance DAGs

Results:

ubuntu@ip-172-31-72-104:~$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1       20G  9.2G   11G  48% /

airflow-db=# SELECT pg_database.datname, pg_size_pretty(pg_database_size(pg_database.datname)) AS size FROM pg_database;
  datname   |  size
------------+---------
 postgres   | 7683 kB
 airflow-db | 458 MB
 template1  | 7545 kB
 template0  | 7545 kB
(4 rows)

airflow-db=# SELECT relname as "Table", pg_size_pretty(pg_total_relation_size(relid)) As "Size", pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) as "External Size" FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC;
         Table         |    Size    | External Size
-----------------------+------------+---------------
 log                   | 174 MB     | 38 MB
 task_instance         | 131 MB     | 100 MB
 job                   | 66 MB      | 43 MB
 dag_run               | 33 MB      | 23 MB
 celery_taskmeta       | 27 MB      | 13 MB
 xcom                  | 18 MB      | 7688 kB
 dag                   | 72 kB      | 56 kB
 import_error          | 64 kB      | 56 kB
 task_fail             | 48 kB      | 40 kB
 kube_worker_uuid      | 24 kB      | 16 kB
 sla_miss              | 24 kB      | 24 kB
 celery_tasksetmeta    | 24 kB      | 24 kB
 variable              | 24 kB      | 24 kB
 users                 | 24 kB      | 24 kB
 task_reschedule       | 24 kB      | 24 kB
 kube_resource_version | 24 kB      | 16 kB
 slot_pool             | 24 kB      | 24 kB
 alembic_version       | 24 kB      | 16 kB
 chart                 | 16 kB      | 16 kB
 known_event           | 16 kB      | 16 kB
 dag_pickle            | 16 kB      | 16 kB
 connection            | 16 kB      | 16 kB
 known_event_type      | 8192 bytes | 8192 bytes
(23 rows)

The overall disk space usage has increased, but I suspect this is due to other services. I can see a ~200 MB reduction in airflow-db, most likely due to removing the kombu tables. I don't see any difference in the task_instance or log table sizes.

I can confirm that data over 30 days old has been removed:

airflow-db=# select count(*) from task_instance where execution_date < '2019-08-20 00:00:00.00+00';
 count
-------
     0
(1 row)

airflow-db=# select count(*) from task_instance where execution_date < '2019-08-21 00:00:00.00+00';
 count
-------
  1412
(1 row)

@giorgiosironi should I run the command to truly erase deleted data from postgres?

@giorgiosironi
Member Author

Yes, also including VACUUM if it helps?
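
For reference: plain VACUUM only marks the freed space as reusable inside Postgres, while VACUUM FULL rewrites the tables and returns the space to the OS, at the cost of an exclusive lock on each table while it runs. A sketch, with AIRFLOW_DB_URI as a placeholder:

# Sketch only: reclaim space after the deletions. VACUUM FULL locks each table
# while rewriting it, so best run when the scheduler/workers can tolerate a pause.
psql "$AIRFLOW_DB_URI" -c 'VACUUM (FULL, VERBOSE, ANALYZE);'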

@GiancarloFusiello

Before running VACUUM there are some other things to consider:

Environment: unstable.libero.pub

ubuntu@ip-172-31-72-104:~$ sudo df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1       20G  9.2G   11G  48% /

ubuntu@ip-172-31-72-104:~$ sudo du -h --max-depth=1 /var
1.5M    /var/backups
12G    /var/lib
24K    /var/tmp
884K    /var/db
78M    /var/cache
261M    /var/log
4.0K    /var/crash
4.0K    /var/local
36K    /var/snap
4.0K    /var/mail
28K    /var/spool
4.0K    /var/opt
12G    /var

ubuntu@ip-172-31-72-104:~$ sudo du -h --max-depth=1 /var/lib/docker
869M    /var/lib/docker/volumes
132K    /var/lib/docker/network
4.0K    /var/lib/docker/tmp
9.9G    /var/lib/docker/overlay2
4.5M    /var/lib/docker/containers
20K    /var/lib/docker/builder
23M    /var/lib/docker/image
20K    /var/lib/docker/plugins
4.0K    /var/lib/docker/trust
72K    /var/lib/docker/buildkit
4.0K    /var/lib/docker/swarm
4.0K    /var/lib/docker/runtimes
11G    /var/lib/docker

Also we may want to clear old images:

ubuntu@ip-172-31-72-104:~$ docker images
REPOSITORY                                      TAG                                        IMAGE ID            CREATED             SIZE
liberoadmin/jats-ingester                       58b5ed50fe75336096277287c62607e893694eaa   739d4d4fd787        3 hours ago         822MB
liberoadmin/jats-ingester                       b9a9459b40274801c0c318058f65cec2b547eed9   c53f5058bee2        3 days ago          821MB
liberoadmin/jats-ingester                       66babb725e418f175fc79a2d774245e396a95e6d   50a23b954b89        3 days ago          820MB
liberoadmin/pattern-library                     eaf37b88f2a409730a89d3328ff093b99621b3a8   610613a2a174        3 weeks ago         20.4MB
redis                                           5.0.5-alpine                               ed7d2ff5a623        4 weeks ago         29.3MB
liberoadmin/jats-ingester                       bbbb88aca5e439f93e7624875cef70bf3600e08c   2614af153335        5 weeks ago         820MB
liberoadmin/pattern-library                     d2e19a77aae00388885e662b5762bd569a06e0d6   8668e6541ed7        6 weeks ago         20.4MB
liberoadmin/browser                             3fdd32e20638208fa2c8b78baaddbd94d895b87a   8883d13fc39c        2 months ago        147MB
liberoadmin/content-store                       591c6c04973c9705fec450d805dae2fad9dee7ab   01ba3ba5d06d        2 months ago        158MB
liberoadmin/dummy-api                           230d84f944ec4da7030e328a8bd1f242717476d1   da581becc94f        2 months ago        88.6MB
liberoadmin/search                              f13c7fe2aa5f3cd1e2f62234995788bed7147b91   07f2f1b565a2        2 months ago        205MB
docker.elastic.co/elasticsearch/elasticsearch   7.1.1                                      b0e9f9f047e6        3 months ago        894MB
postgres                                        11.2                                       3eda284d1840        4 months ago        312MB
postgres                                        11.2-alpine                                cd1fb3df8252        4 months ago        70.8MB
linkyard/yaml                                   1.1.0                                      09bbee4819be        10 months ago       35.3MB
nginx                                           1.15.5-alpine                              aae476eee77d        11 months ago       17.7MB
nginx                                           1.15.2-alpine                              36f3464a2197        14 months ago       18.6MB

@giorgiosironi
Member Author

docker image prune -a is my go-to command, though it's not in a cron job yet. It removes all images that are not currently used by a running container, not just the old ones, so it's safe to run in most conditions.

It may leave a 10-month-old image around if it's in use; the creation time of an image can be much older than the time it was pulled onto that machine (which is not tracked).
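
If we do put it in cron eventually, something like this sketch could work (the schedule and the "until" filter are arbitrary choices here, not anything we have agreed on):

# Sketch only, e.g. /etc/cron.d/docker-prune: weekly prune of unused images
# older than a week. Adjust schedule and filter to taste.
0 3 * * 0  root  docker image prune -af --filter "until=168h"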

@giorgiosironi
Member Author

Environment: demo.libero.pub

ubuntu@ip-172-31-66-198:~$ sudo du -h --max-depth=1 /var/lib/docker/overlay2/777f629d847ee55ffc7817d0c5a3c103138add35b219e86d3e4a730b912a7880/
2.8G    /var/lib/docker/overlay2/777f629d847ee55ffc7817d0c5a3c103138add35b219e86d3e4a730b912a7880/merged
8.0K    /var/lib/docker/overlay2/777f629d847ee55ffc7817d0c5a3c103138add35b219e86d3e4a730b912a7880/work
2.0G    /var/lib/docker/overlay2/777f629d847ee55ffc7817d0c5a3c103138add35b219e86d3e4a730b912a7880/diff
4.7G    /var/lib/docker/overlay2/777f629d847ee55ffc7817d0c5a3c103138add35b219e86d3e4a730b912a7880/

Seems to be files written inside an Elasticsearch container:

ubuntu@ip-172-31-66-198:~$ sudo find /var/lib/docker/overlay2/777f629d847ee55ffc7817d0c5a3c103138add35b219e86d3e4a730b912a7880/ | grep elasticsearch|  head -n 2
/var/lib/docker/overlay2/777f629d847ee55ffc7817d0c5a3c103138add35b219e86d3e4a730b912a7880/merged/tmp/hsperfdata_elasticsearch
/var/lib/docker/overlay2/777f629d847ee55ffc7817d0c5a3c103138add35b219e86d3e4a730b912a7880/merged/tmp/hsperfdata_elasticsearch/1

Stopping the container as I can't log in with docker exec due to lack of disk space.

@giorgiosironi
Member Author

Have to actually remove the container to clear the space. It has a data-search volume so should keep state.

@giorgiosironi
Member Author

giorgiosironi commented Sep 19, 2019

This freed 2GB:

ubuntu@ip-172-31-66-198:~/sample-configuration$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1       20G   18G  1.8G  92% /

Deployment can now proceed and will restart this container.

@giorgiosironi
Member Author

A deployment cleared most of the space, which apparently consisted of files written inside containers (e.g. Elasticsearch logs):

ubuntu@ip-172-31-66-198:~/sample-configuration$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1       20G  9.9G  9.5G  51% /

Recreating the containers during deployment deleted all those files. This also explains why unstable didn't have the same problem, as deployments are very frequent there.
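
For next time, per-container writable-layer growth can be spotted without digging through /var/lib/docker/overlay2 by hand:

# Show each container's writable-layer size in the SIZE column.
docker ps --all --size

# Break down disk usage by images, containers, local volumes and build cache.
docker system df -v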

@giorgiosironi
Member Author

Everything is healthy at the Docker level now, but

curl -v -X POST https://demo--api-gateway.libero.pub/search/populate

returns a 502.
Boils down to

PUT http://search_elasticsearch:9200/articles/article/3914828 [status:403 request:0.131s]
[2019-09-19 15:58:45,005] ERROR in app: Exception on /populate [POST]
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2311, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1834, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1737, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.7/site-packages/flask/_compat.py", line 36, in reraise
    raise value
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1832, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1818, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "./search/populate.py", line 51, in populate
    index='articles', doc_type='article', id=document['id'], body=document
  File "./search/index.py", line 14, in index_document
    search.index(index=index, doc_type=doc_type, id=id, body=body)
  File "/usr/local/lib/python3.7/site-packages/elasticsearch/client/utils.py", line 84, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/elasticsearch/client/__init__.py", line 354, in index
    body=body,
  File "/usr/local/lib/python3.7/site-packages/elasticsearch/transport.py", line 353, in perform_request
    timeout=timeout,
  File "/usr/local/lib/python3.7/site-packages/elasticsearch/connection/http_urllib3.py", line 236, in perform_request
    self._raise_error(response.status, raw_data)
  File "/usr/local/lib/python3.7/site-packages/elasticsearch/connection/base.py", line 162, in _raise_error
    status_code, error_message, additional_info
elasticsearch.exceptions.AuthorizationException: AuthorizationException(403, 'cluster_block_exception', 'blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];')

The index being in some kind of read-only mode?

giorgiosironi changed the title from "Free database space in jats-ingester demo" to "Free database space in demo environment" Sep 19, 2019
@giorgiosironi
Member Author

(renamed the ticket since the space is shared between all applications in the same environment)

@GiancarloFusiello

Everything is healthy at the Docker level now, but

curl -v -X POST https://demo--api-gateway.libero.pub/search/populate

returns a 502.

The index being in some kind of read-only mode?

Appears to be related to the disk space usage: https://www.elastic.co/guide/en/elasticsearch/reference/current/disk-allocator.html
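
Once disk usage is back under the flood-stage watermark, the block has to be cleared manually on this Elasticsearch version (automatic release only arrived in later 7.x releases, as far as I can tell). A sketch, assuming it is run from somewhere that can resolve search_elasticsearch as in the log above:

# Sketch only: clear the flood-stage read-only block from all indices
# once enough disk space has been freed.
curl -X PUT "http://search_elasticsearch:9200/_all/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index.blocks.read_only_allow_delete": null}'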

@GiancarloFusiello

Perhaps we should also look at log rotation? https://www.elastic.co/guide/en/elasticsearch/reference/7.3/logging.html

@GiancarloFusiello

demo latest:

ubuntu@ip-172-31-66-198:~/sample-configuration$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1       20G   11G  8.8G  55% /

  • Kombu tables removed
  • redis deployed and in operation
  • maintenance DAGs running

@GiancarloFusiello

Will create separate tickets for Elasticsearch-specific tasks.

@giorgiosironi happy to close this?

@giorgiosironi
Member Author

Looks healthy and the checklist is complete, thanks.
