DBT Pod running out of disk space #145

Closed
njuguna-n opened this issue Sep 3, 2024 · 40 comments

@njuguna-n
Contributor

Describe the bug
A dbt pod running on MoH Kenya servers was terminated due to exceeding its storage limit. See logs below.

To Reproduce

  1. SSH into the MoH servers using the credentials shared in 1Password
  2. Run kubectl -n echis-cht-sync describe pod cht-sync-dbt-789964dbb6-dcqch to view details about the pod and note the message explaining why it was evicted (see the sketch below for listing all evicted pods in the namespace)
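
A hedged sketch for step 2 (namespace and pod name taken from this issue): evicted pods remain in the Failed phase, so they can be listed in one command and the eviction reason pulled out with jsonpath.

# list all failed/evicted pods in the namespace
kubectl -n echis-cht-sync get pods --field-selector=status.phase=Failed
# print the eviction reason and message for a specific pod
kubectl -n echis-cht-sync get pod cht-sync-dbt-789964dbb6-dcqch \
  -o jsonpath='{.status.reason}{": "}{.status.message}{"\n"}'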

Expected behavior
The pod does not run out of disk space

Logs

Name:             cht-sync-dbt-789964dbb6-dcqch
Namespace:        echis-cht-sync
Priority:         0
Service Account:  default
Node:             echischtsync/10.127.105.170
Start Time:       Sat, 31 Aug 2024 06:02:09 +0000
Labels:           app=cht-sync
                  pod-template-hash=789964dbb6
Annotations:      <none>
Status:           Failed
Reason:           Evicted
Message:          The node was low on resource: ephemeral-storage. Threshold quantity: 52732898884, available: 51464084Ki. Container dbt was using 30335059, request is 0, has larger consumption of ephemeral-storage.
IP:               10.42.52.181
IPs:
  IP:           10.42.52.181
Controlled By:  ReplicaSet/cht-sync-dbt-789964dbb6
@njuguna-n
Contributor Author

Ran bash in the container with kubectl exec -it cht-sync-dbt-789964dbb6-ssrjl --container dbt bash -n echis-cht-sync
Running df -h to view the available disk space on the file system yields the results below

Filesystem                         Size  Used Avail Use% Mounted on
overlay                            983G  362G  571G  39% /
tmpfs                               64M     0   64M   0% /dev
tmpfs                               30G     0   30G   0% /sys/fs/cgroup
shm                                 64M   16K   64M   1% /dev/shm
/dev/mapper/ubuntu--vg-ubuntu--lv  983G  362G  571G  39% /etc/hosts
tmpfs                               30G     0   30G   0% /proc/acpi
tmpfs                               30G     0   30G   0% /proc/scsi
tmpfs                               30G     0   30G   0% /sys/firmware

Running du -h --max-depth=1 | sort -n -r to show disk usage in the top-level directories in the /dbt directory sorted by size yields the results below

740K	./logs
33M	.
27M	./dbt_packages
8.0K	./.dbt
3.8M	./target
1.3M	./old_manifest

The dbt logs don't seem to be an issue currently. The pod is still running without issues so I will check in on this later today or tomorrow.
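
As an aside, sort -n compares only the leading number, which is why 740K lands above 27M in that listing. A variant using GNU sort's -h flag (assuming GNU coreutils is present in the image) orders human-readable sizes correctly:

# sort human-readable sizes (K/M/G) largest first
du -h --max-depth=1 /dbt | sort -rh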

@andrablaj
Member

@njuguna-n @mrjones-plip any updates on this problem? Are there still storage concerns?

@njuguna-n
Contributor Author

@andrablaj I have been unable to SSH into the server this morning to check on the new pod, probably due to some ongoing maintenance. I will update here once I get access.

@eljhkrr
Member

eljhkrr commented Sep 9, 2024

Connected with the MoH infrastructure team today; they haven't expanded storage yet. The plan is to complete it this week.

@njuguna-n
Contributor Author

Thanks @eljhkrr. We will still need to get to the bottom of what is taking up the disk space.

PS: The servers are still offline so I will try logging in again tomorrow.

@njuguna-n
Contributor Author

Servers are still inaccessible, so no update on this.

@njuguna-n
Contributor Author

This is still an issue. The pod has restarted two more times with the same error. Logs below

Name:             cht-sync-dbt-789964dbb6-ssrjl
Namespace:        echis-cht-sync
Priority:         0
Service Account:  default
Node:             echischtsync/10.127.105.170
Start Time:       Tue, 03 Sep 2024 05:49:09 +0000
Labels:           app=cht-sync
                  pod-template-hash=789964dbb6
Annotations:      <none>
Status:           Failed
Reason:           Evicted
Message:          The node was low on resource: ephemeral-storage. Threshold quantity: 52732898884, available: 50599036Ki. Container dbt was using 31786940, request is 0, has larger consumption of ephemeral-storage.
Name:             cht-sync-dbt-789964dbb6-k6pkp
Namespace:        echis-cht-sync
Priority:         0
Service Account:  default
Node:             echischtsync/10.127.105.170
Start Time:       Tue, 10 Sep 2024 14:08:55 +0000
Labels:           app=cht-sync
                  pod-template-hash=789964dbb6
Annotations:      <none>
Status:           Failed
Reason:           Evicted
Message:          The node was low on resource: ephemeral-storage. Threshold quantity: 52732898884, available: 51163920Ki. Container dbt was using 30294666, request is 0, has larger consumption of ephemeral-storage.

This is not an issue with the dbt container or its logs, since it only takes up about 32MB. Looking at the cht-sync-couch2pg pod, it failed for a similar reason and uses even less space:

Name:             cht-sync-couch2pg--elgeyomarakwet-echis-go-ke-668577fd5b-v27gg
Namespace:        echis-cht-sync
Priority:         0
Service Account:  default
Node:             echischtsync/10.127.105.170
Start Time:       Sat, 31 Aug 2024 06:02:09 +0000
Labels:           app=cht-sync
                  pod-template-hash=668577fd5b
Annotations:      <none>
Status:           Failed
Reason:           Evicted
Message:          The node was low on resource: ephemeral-storage. Threshold quantity: 52732898884, available: 3438972Ki. Container couch2pg was using 8Ki, request is 0, has larger consumption of ephemeral-storage.

The only other pod is the postgres one, which is the culprit. Running kubectl -n echis-cht-sync exec cht-sync-postgres-0 -- df -h /tmp yields the result below:

Filesystem      Size  Used Avail Use% Mounted on
overlay         983G  482G  452G  52% /

Running kubectl -n echis-cht-sync exec cht-sync-postgres-0 -- du -sh /* gives the output below. I have edited it slightly to remove some permission-denied entries.

0	/bin
4.0K	/boot
16K	/dev
944K	/etc
4.0K	/home
0	/lib
0	/lib64
4.0K	/media
4.0K	/mnt
4.0K	/opt
0	/proc
16K	/root
20K	/run
0	/sbin
4.0K	/srv
0	/sys
4.0K	/tmp
408M	/usr
459G	/var

Running kubectl -n echis-cht-sync exec cht-sync-postgres-0 -- du -sh /var/* results in

4.0K	/var/backups
1.8M	/var/cache
459G	/var/lib
4.0K	/var/local
0	/var/lock
96K	/var/log
4.0K	/var/mail
4.0K	/var/opt
0	/var/run
4.0K	/var/spool
4.0K	/var/tmp

Now trying to find out what is taking up the storage in /var/lib

@njuguna-n
Contributor Author

After running du -h --max-depth=1 | sort -n -r in the /var/lib directory I get the result below:

460G	./postgresql
460G	.
72K	./systemd
28K	./pam
28K	./apt
24K	./ucf
6.3M	./dpkg
4.0K	./misc

My theory is that the large amounts of data being transformed by dbt result in postgres using up most of the temporary storage available while running the queries. I am not sure how we can resolve this other than increasing the resource limits available. @witash @dianabarsan any ideas?
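
A hedged sketch of narrowing that down: list the largest relations so we can see whether the space is in the source tables, the dbt models, or leftover temp tables (the postgres user and database name are assumptions):

kubectl -n echis-cht-sync exec cht-sync-postgres-0 -- \
  psql -U postgres -d cht_sync_db -c "
    SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) AS total_size
    FROM pg_class
    WHERE relkind IN ('r', 'm')          -- ordinary tables and materialized views
    ORDER BY pg_total_relation_size(oid) DESC
    LIMIT 10;"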

@witash
Contributor

witash commented Sep 12, 2024

I thought they were running cht-sync against the existing production postgres outside the cluster?
They changed to a postgres in the cluster instead?

@witash
Contributor

witash commented Sep 12, 2024

Also, the eviction threshold of 52732898884 bytes (~50GB) seems quite high.
I can't see where it's configured, but we could probably lower it even to 1GB or something.

@njuguna-n
Contributor Author

@witash wouldn't that mean that the pods would get evicted more often?

@witash
Contributor

witash commented Sep 12, 2024

I think this message

Message:          The node was low on resource: ephemeral-storage. Threshold quantity: 52732898884, available: 50599036Ki. Container dbt was using 31786940, request is 0, has larger consumption of ephemeral-storage

means that because the available ephemeral storage on the node (50599036Ki) is less than the threshold (52732898884 bytes), the kubelet is evicting pods. So by setting the threshold lower, it would only evict pods if the available ephemeral storage dropped below 1GB.
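
For reference, a hedged sketch of where that threshold might live, assuming the node runs k3s (suggested by the local-path StorageClass mentioned later in this thread); the exact config path and flag spelling should be verified on the node:

# /etc/rancher/k3s/config.yaml (assumed path), passing a kubelet argument:
#   kubelet-arg:
#     - "eviction-hard=nodefs.available<1Gi"
# then restart k3s and confirm the kubelet picked it up:
sudo systemctl restart k3s
kubectl get --raw "/api/v1/nodes/echischtsync/proxy/configz" | jq '.kubeletconfig.evictionHard'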

@njuguna-n
Contributor Author

@eljhkrr are you familiar with where we can set this eviction threshold?

@njuguna-n
Contributor Author

This might be unrelated but is worth highlighting. It seems the postgres pod lost data: running select min(m.saved_timestamp) from v1.medic m; returns 2024-09-10 18:15:48.398, and most of the currently running pods started at around the same time (a quick way to gauge the gap against CouchDB is sketched after the pod list below).
cc: @alexosugo @eljhkrr @witash @dianabarsan

NAME                                                             READY   STATUS                   RESTARTS         AGE
cht-sync-couch2pg--baringo-echis-go-ke-8cf5cc87f-fl8k5           1/1     Running                  7 (22h ago)      4d18h
cht-sync-couch2pg--bomet-echis-go-ke-7b557946ff-fvfgz            1/1     Running                  1 (2d14h ago)    2d14h
cht-sync-couch2pg--bungoma-echis-go-ke-7bb48c859c-dzzxt          1/1     Running                  6 (2d14h ago)    4d18h
cht-sync-couch2pg--busia-echis-go-ke-68bf657f8-2tcnp             1/1     Running                  7 (22h ago)      2d15h
cht-sync-couch2pg--elgeyomarakwet-echis-go-ke-668577fd5b-mdwfg   1/1     Running                  6 (2d14h ago)    2d15h
cht-sync-couch2pg--embu-echis-go-ke-64cfcc67f4-kmv7w             1/1     Running                  7 (22h ago)      2d15h
cht-sync-couch2pg--garissa-echis-go-ke-76b84c6874-qjdq4          1/1     Running                  6 (2d14h ago)    2d15h
cht-sync-couch2pg--homabay-echis-go-ke-76fcc5ddf7-4n896          1/1     Running                  15 (27m ago)     2d18h
cht-sync-couch2pg--isiolo-echis-go-ke-8d6f699cc-tq8xb            1/1     Running                  0                4d18h
cht-sync-couch2pg--kajiado-echis-go-ke-7bcc5886c6-b5965          1/1     Running                  0                2d14h
cht-sync-couch2pg--kakamega-echis-go-ke-676d9dd994-x8df2         1/1     Running                  29 (28m ago)     2d15h
cht-sync-couch2pg--kericho-echis-go-ke-64dffd4f48-h4njh          1/1     Running                  4 (24h ago)      2d14h
cht-sync-couch2pg--kiambu-echis-go-ke-5df788c65f-nx7pt           1/1     Running                  8 (29m ago)      2d18h
cht-sync-couch2pg--kilifi-echis-go-ke-5b6b78dfb7-wzft4           1/1     Running                  10 (39h ago)     2d18h
cht-sync-couch2pg--kirinyaga-echis-go-ke-fbdcfc748-d8ggp         1/1     Running                  6 (2d14h ago)    4d18h
cht-sync-couch2pg--kisii-echis-go-ke-7557c7fb64-bvmc4            1/1     Running                  7 (38h ago)      2d18h
cht-sync-couch2pg--kisumu-echis-go-ke-75bf9844c8-526pl           1/1     Running                  6 (2d14h ago)    2d15h
cht-sync-couch2pg--kitui-echis-go-ke-7bdfff86bf-crrpw            1/1     Running                  9 (28m ago)      2d18h
cht-sync-couch2pg--kwale-echis-go-ke-6f8fb8749f-ldwkc            1/1     Running                  8 (43h ago)      2d15h
cht-sync-couch2pg--laikipia-echis-go-ke-89774b7d9-4z2cd          1/1     Running                  6 (2d14h ago)    2d18h
cht-sync-couch2pg--lamu-echis-go-ke-588fbb88d4-h9fq4             1/1     Running                  6 (2d14h ago)    2d15h
cht-sync-couch2pg--machakos-echis-go-ke-7f5448f4fb-bw62p         1/1     Running                  7 (27h ago)      4d18h
cht-sync-couch2pg--makueni-echis-go-ke-5b6c6d6ccb-wgng6          1/1     Running                  0                2d14h
cht-sync-couch2pg--mandera-echis-go-ke-5cbdbb975c-sgtml          1/1     Running                  1 (2d14h ago)    2d14h
cht-sync-couch2pg--marsabit-echis-go-ke-86687777-j9lnx           1/1     Running                  6 (2d14h ago)    2d15h
cht-sync-couch2pg--meru-echis-go-ke-54fbcd74cf-tqpjr             1/1     Running                  7 (2d2h ago)     2d18h
cht-sync-couch2pg--migori-echis-go-ke-6bd8dfb78b-6tcvg           1/1     Running                  12 (21h ago)     2d18h
cht-sync-couch2pg--mombasa-echis-go-ke-68f77b68bd-59rbz          1/1     Running                  6 (2d14h ago)    2d18h
cht-sync-couch2pg--muranga-echis-go-ke-59c49b5d6d-s6tlh          1/1     Running                  7 (43h ago)      2d15h
cht-sync-couch2pg--nairobi-echis-go-ke-57d4d46b9f-rf8lv          1/1     Running                  86 (10h ago)     2d15h
cht-sync-couch2pg--nakuru-echis-go-ke-5564d9d797-g4bbb           1/1     Running                  1 (35h ago)      2d14h
cht-sync-couch2pg--nandi-echis-go-ke-8c6849695-97ckz             1/1     Running                  8 (27m ago)      2d15h
cht-sync-couch2pg--narok-echis-go-ke-57b78f54b8-sclgn            1/1     Running                  0                2d14h
cht-sync-couch2pg--nyamira-echis-go-ke-6b4db7dfb8-nsmv6          1/1     Running                  7 (22h ago)      2d15h
cht-sync-couch2pg--nyandarua-echis-go-ke-5c6c6cdfd5-btk9d        1/1     Running                  6 (2d14h ago)    2d18h
cht-sync-couch2pg--nyeri-echis-go-ke-7b4cb79757-6mfrw            1/1     Running                  10 (27h ago)     4d18h
cht-sync-couch2pg--samburu-echis-go-ke-c88c6c459-67wmr           1/1     Running                  6 (2d14h ago)    2d18h
cht-sync-couch2pg--taitataveta-echis-go-ke-745fb9dcf4-vjbkt      1/1     Running                  6 (2d14h ago)    2d18h
cht-sync-couch2pg--tanariver-echis-go-ke-f59488b7f-rnnb2         1/1     Running                  6 (2d14h ago)    2d15h
cht-sync-couch2pg--tharakanithi-echis-go-ke-78cfc7848-8p6nr      1/1     Running                  6 (2d14h ago)    2d18h
cht-sync-couch2pg--transnzoia-echis-go-ke-6b758dd697-xjhfg       1/1     Running                  2 (72m ago)      2d14h
cht-sync-couch2pg--turkana-echis-go-ke-b7867ddcc-gmdvk           1/1     Running                  6 (2d14h ago)    4d18h
cht-sync-couch2pg--uasingishu-echis-go-ke-78fc9fcf7c-xv8x8       1/1     Running                  6 (2d14h ago)    2d18h
cht-sync-couch2pg--vihiga-echis-go-ke-7bb44f5d79-xc22w           1/1     Running                  6 (2d14h ago)    4d18h
cht-sync-couch2pg--wajir-echis-go-ke-66fc97f86d-hj7mz            1/1     Running                  6 (2d14h ago)    4d18h
cht-sync-couch2pg--westpokot-echis-go-ke-56cc68df66-lzxhw        1/1     Running                  6 (2d14h ago)    2d15h
cht-sync-dbt-789964dbb6-qxs98                                    1/1     Running                  0                2d14h
cht-sync-postgres-0                                              1/1     Running                  0                2d14h
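
A hedged sketch of gauging the gap: compare the postgres row count and earliest timestamp against CouchDB's doc_count for the source database (the postgres user, CouchDB host, credentials, and the "medic" database name are placeholders/assumptions):

kubectl -n echis-cht-sync exec cht-sync-postgres-0 -- \
  psql -U postgres -d cht_sync_db -c "SELECT count(*), min(saved_timestamp) FROM v1.medic;"
# CouchDB reports the total document count for a database on GET /{db}
curl -s -u <user>:<pass> https://<couchdb-host>/medic | jq '.doc_count'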

@witash
Contributor

witash commented Sep 13, 2024

Yeah, I just noticed that too. Although the postgres service has a configured persistent volume, it's not mounted, so it may not actually be using it. Also, the claim is only for 1Gi, while the data is about 450GB.

kubectl -n echis-cht-sync describe pod cht-sync-postgres-0
Name:             cht-sync-postgres-0
...
Containers:
  postgres-c:
    ...
    Mounts:               <none>
Volumes:
  postgres-data:
    Type:        PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:   postgres-data-cht-sync-postgres-0
    ReadOnly:    false
Events:          <none>
kubectl -n echis-cht-sync describe pv pvc-74bae13c-d740-4ece-afd3-05ab100859bf
Name:              pvc-74bae13c-d740-4ece-afd3-05ab100859bf
...
StorageClass:      local-path
Status:            Bound
Claim:             echis-cht-sync/postgres-data-cht-sync-postgres-0
Reclaim Policy:    Delete
Access Modes:      RWO
VolumeMode:        Filesystem
Capacity:          1Gi

So I don't fully understand all of this yet, but I think it could be related.
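
A hedged sketch of confirming the mismatch: check the claimed PVC size and whether the postgres data directory is actually backed by the volume or sitting on node-local overlay storage (the data path is assumed to be the postgres default):

# size requested by the claim
kubectl -n echis-cht-sync get pvc postgres-data-cht-sync-postgres-0 \
  -o jsonpath='{.spec.resources.requests.storage}{"\n"}'
# volume mounts actually declared on the container
kubectl -n echis-cht-sync get pod cht-sync-postgres-0 \
  -o jsonpath='{.spec.containers[0].volumeMounts}{"\n"}'
# if this reports an overlay filesystem, the data is on ephemeral node storage
kubectl -n echis-cht-sync exec cht-sync-postgres-0 -- df -h /var/lib/postgresql/data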

@njuguna-n
Contributor Author

Yes, looks like it could be related. Could postgres be saving all the data in ephemeral instead of persistent storage thus causing the issues?

@witash
Contributor

witash commented Sep 13, 2024

Yes, looks like it could be related. Could postgres be saving all the data in ephemeral instead of persistent storage thus causing the issues?

Yes, I think it is. The thing that confuses me is that I used

kubectl get --raw "/api/v1/nodes/echischtsync/proxy/stats/summary" | jq 

to get the ephemeral storage usage, and it doesn't report that it's using much:

      "ephemeral-storage": {
        "time": "2024-09-13T10:05:57Z",
        "availableBytes": 359237095424,
        "capacityBytes": 1054657961984,
        "usedBytes": 9802,
        "inodesFree": 65160966,
        "inodes": 65470464,
        "inodesUsed": 2
      },

but this may not be accurate, or there's some other complication; otherwise it makes sense.
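
For what it's worth, the same stats/summary endpoint also reports per-pod ephemeral-storage usage, which might make the discrepancy easier to pin down (a hedged sketch; the jq filter reflects the summary API's usual shape):

kubectl get --raw "/api/v1/nodes/echischtsync/proxy/stats/summary" | \
  jq '.pods[] | {pod: .podRef.name, ephemeralUsedBytes: .["ephemeral-storage"].usedBytes}'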

@njuguna-n
Contributor Author

Once we attach a new persistent volume we will lose all the data in ephemeral storage right?

@witash
Contributor

witash commented Sep 13, 2024

Once we attach a new persistent volume we will lose all the data in ephemeral storage right?

Yes. If we really wanted to save it, we could:

  1. pg_dump the current database
  2. helm uninstall
  3. pull the latest cht-sync (with the fix merged)
  4. set couch2pg replicas to 0 in the template
  5. helm install
  6. restore from the pg_dump file
  7. set couch2pg replicas back to 1
  8. helm upgrade to apply

I think since it's only a few days' worth of data anyway, it's simpler to start from the beginning.

  1. merge and release fix(#145): adds volume mount for postgres #153
  2. helm uninstall
  3. add storageSize: "950GB" (or similar; the node currently has 1TB, assuming we can use all of it) under the postgres section (see the helm sketch at the end of this comment)
  4. helm reinstall

or set postgres.enabled to false and go back to using the postgres that is outside the cluster

I think the postgres outside the cluster might be easier to maintain long term:

  1. this persistent volume is not easily resizable, so we need to be sure this is the most space it would ever need (even years later)
  2. DBAs can manage it in the way they are probably used to, instead of having to go through Kubernetes
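
A hedged sketch of the start-from-the-beginning path above (the release name, chart reference, and values layout are assumptions from this thread; verify against the actual cht-sync chart before running):

helm -n echis-cht-sync uninstall cht-sync
# in values.yaml, under the postgres section (key name per the thread):
#   postgres:
#     enabled: true
#     storageSize: "950GB"
helm -n echis-cht-sync install cht-sync <cht-sync-chart> -f values.yaml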

medic-ci pushed a commit that referenced this issue Sep 13, 2024
## [1.0.2](v1.0.1...v1.0.2) (2024-09-13)

### Bug Fixes

* #145: adds volume mount for postgres (a128f71), closes #145
@medic-ci

🎉 This issue has been resolved in version 1.0.2 🎉

The release is available on GitHub.

Your semantic-release bot 📦🚀

@njuguna-n
Contributor Author

@eljhkrr is out of office. @witash, are you able to do the helm upgrade?

@witash
Contributor

witash commented Sep 13, 2024

OK, I'll do it now.

@witash
Contributor

witash commented Sep 13, 2024

Alright, it's running now.
I will keep an eye on it over the weekend.

@witash
Contributor

witash commented Sep 16, 2024

Now, the next problem. Mostly it ran fine over the weekend; almost all instances finished syncing. This is much faster than expected; I guess network latency between Medic infrastructure and the instances made it much slower during testing.

But, because the couch2pg syncing is so much faster, the source table is way ahead of the dbt tables, and this is causing problems for the incremental update, which works best when syncing small data sets.

Now it's basically trying to do a full refresh, but using the incremental logic, which requires a temp table.

             delete from "cht_sync_db"."v1"."document_metadata"
             where (uuid) in (
                 select (uuid)
                 from "document_metadata__dbt_tmp134537377717"
             );

     insert into "cht_sync_db"."v1"."document_metadata" ("uuid", "_deleted", "saved_timestamp", "doc_type")
     (
         select "uuid", "_deleted", "saved_timestamp", "doc_type"
         from "document_metadata__dbt_tmp134537377717"
     )

document_metadata__dbt_tmp134537377717 is going to be >200M rows, so this is going to be very slow and use a lot of space.

Several instances were stalled for some reason. Restarting the pods seems to have fixed the issue for now, but there are still >10M changes to sync.
I would suggest waiting until they are close to finished (should be sometime today) and then manually populating the document_metadata table, by stopping the dbt pod and simply running:

DROP TABLE v1.document_metadata CASCADE;
insert into "cht_sync_db"."v1"."document_metadata" ("uuid", "_deleted", "saved_timestamp", "doc_type") (SELECT _id, _deleted, "saved_timestamp", doc->>'type' FROM v1.medic WHERE _deleted = false);

which is as fast as it's going to get.
The current query took 17 hours for only 7M rows, so it's unlikely to finish anytime soon.
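
A hedged sketch of the stop-dbt-and-copy-manually path, using kubectl scale as a quicker alternative to editing the helm template (the deployment name and postgres user are assumptions):

# pause dbt so it stops launching new runs
kubectl -n echis-cht-sync scale deployment cht-sync-dbt --replicas=0
# open a psql session in the postgres pod and run the SQL above
kubectl -n echis-cht-sync exec -it cht-sync-postgres-0 -- psql -U postgres -d cht_sync_db
# once the manual insert is done, bring dbt back
kubectl -n echis-cht-sync scale deployment cht-sync-dbt --replicas=1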

@njuguna-n
Contributor Author

All counties have completely synced except Isiolo, which is stuck for now. I was just about to manually populate the document_metadata table but noticed that it had just completed updating (see logs below). I think we should give the dbt run a few more hours before assessing the need to manually update the tables.

13:45:36  1 of 77 START sql incremental model v1.dbt_results ............................. [RUN]
13:45:37  1 of 77 OK created sql incremental model v1.dbt_results ........................ [INSERT 0 0 in 0.80s]
13:45:37  2 of 77 START sql incremental model v1.document_metadata ....................... [RUN]
06:56:54  2 of 77 OK created sql incremental model v1.document_metadata .................. [INSERT 0 161519768 in 234677.55s]
06:56:54  3 of 77 START sql materialized_view model v1.ext_ssview_activity_all_time ...... [RUN]
06:56:55  3 of 77 OK created sql materialized_view model v1.ext_ssview_activity_all_time . [REFRESH MATERIALIZED VIEW in 0.25s]
06:56:55  4 of 77 START sql materialized_view model v1.ext_ssview_activity_monthly ....... [RUN]
06:56:55  4 of 77 OK created sql materialized_view model v1.ext_ssview_activity_monthly .. [REFRESH MATERIALIZED VIEW in 0.16s]
06:56:55  5 of 77 START sql incremental model v1.contact ................................. [RUN]

@njuguna-n
Contributor Author

Digging a bit deeper, the next dbt run will have to update ~92 million rows on the document_metadata table which will take at least a few more days given past run times. So updating that table manually makes sense.

@njuguna-n
Contributor Author

@witash what would be the best way to pause the dbt pod?

@witash
Contributor

witash commented Sep 17, 2024

You can edit the dbt template, change replicas from 1 to 0, and run helm upgrade. Then check the process list and kill the query if it's still running before starting the manual copy (a sketch follows below).

But maybe just let it run at this point? Copying the table manually will be faster, but it will still take a day or so.

Either way, unfortunately I have lost access and cannot do it myself.
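
A hedged sketch of the check-the-process-list-and-kill step (the postgres user is an assumption):

kubectl -n echis-cht-sync exec cht-sync-postgres-0 -- \
  psql -U postgres -d cht_sync_db -c "
    SELECT pid, state, now() - query_start AS runtime, left(query, 80) AS query
    FROM pg_stat_activity
    WHERE state <> 'idle'
    ORDER BY runtime DESC;"
# then, for the offending pid:
#   SELECT pg_terminate_backend(<pid>);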

@njuguna-n
Contributor Author

I have started the manual insertion process for the table and the query is now running.

@njuguna-n
Contributor Author

The insert query for document_metadata is still running. I will update here once that is done.

@njuguna-n
Contributor Author

njuguna-n commented Sep 19, 2024

The query is complete and I have restarted dbt. The first run will likely still take some time as tables and views are being built and/or being populated with data. I will keep an eye on it and provide an update later.

@njuguna-n
Contributor Author

@witash the storage error is still occurring.

Name:             cht-sync-dbt-789964dbb6-f8pjt
Namespace:        echis-cht-sync
Priority:         0
Service Account:  default
Node:             echischtsync/10.127.105.170
Start Time:       Thu, 19 Sep 2024 12:04:16 +0000
Labels:           app=cht-sync
                  pod-template-hash=789964dbb6
Annotations:      <none>
Status:           Failed
Reason:           Evicted
Message:          The node was low on resource: ephemeral-storage. Threshold quantity: 52732898884, available: 51040476Ki. Container dbt was using 30608366, request is 0, has larger consumption of ephemeral-storage.

The dbt pod has been evicted multiple times and has not managed a complete run:

cht-sync-dbt-789964dbb6-747wv                                    1/1     Running                  0                49m
cht-sync-dbt-789964dbb6-f8pjt                                    0/1     Error                    0                108m
cht-sync-dbt-789964dbb6-fnjzb                                    0/1     Error                    0                166m
cht-sync-dbt-789964dbb6-jcjnh                                    0/1     Error                    0                3h40m
cht-sync-dbt-789964dbb6-k2j6r                                    0/1     Error                    0                7h4m
cht-sync-dbt-789964dbb6-wz7kq                                    0/1     ContainerStatusUnknown   1                4h36m
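
A hedged sketch of checking the node-level picture directly, since eviction is driven by the node's disk pressure rather than by any single pod (node name from this thread):

kubectl describe node echischtsync | grep -i -A 2 DiskPressure
kubectl get --raw "/api/v1/nodes/echischtsync/proxy/stats/summary" | jq '.node.fs'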

@njuguna-n
Contributor Author

@alexosugo is it possible to have a postgres instance that is hosted outside the node running CHT Sync?

@witash
Contributor

witash commented Sep 19, 2024

Those big temp tables are using up all the disk space.
I noticed disk space was almost used up (99%); in pg_stat_activity there were several queries creating temp tables for contacts.
Then, a second later, there was nothing in pg_stat_activity and disk space was down to 70% (still not great, but enough to avoid pod eviction).

investigator@echisproxy:~$ kubectl -n echis-cht-sync exec cht-sync-postgres-0 -- df -h /var/lib/postgresql/data
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv  983G  639G  294G  69% /var/lib/postgresql/data

Not sure what stopped all the queries.

Maybe we can do the same manual insert with contact and data_record to avoid creating temp tables.
If we need to free up disk space, there are a lot of big task and target documents that aren't used in queries, although deleting them is a little risky if we need them again.
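
A hedged sketch of sizing those dbt scratch tables directly, since dbt names them with a __dbt_tmp suffix as seen in the logs above (postgres user and database assumed):

kubectl -n echis-cht-sync exec cht-sync-postgres-0 -- \
  psql -U postgres -d cht_sync_db -c "
    SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) AS size
    FROM pg_class
    WHERE relname LIKE '%__dbt_tmp%'
    ORDER BY pg_total_relation_size(oid) DESC;"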

@witash
Contributor

witash commented Sep 19, 2024

I'm going to stop dbt for now to keep it from churning.

@witash
Contributor

witash commented Sep 20, 2024

Thinking about it some more, the temp tables are only the first problem; with 70% of disk space used, it's likely that we would run out of disk space further down the pipeline (all the models are materialized views, which are also going to take up space) even if we did a manual insert of contact and data_record.

So I would suggest we

  1. Delete all the tasks from the source table:
DELETE FROM v1.medic where doc->>'type' = 'task'

This would delete 170M rows, which have large doc columns.

If they were needed again, we would just remove the sequences from couchdb_progress, which would trigger the couch2pg instances to start from the beginning; it does not reinsert duplicates so it would only add the missing docs.

But if we really did need all that data, we would need another solution anyway, which would probably just be more storage.

  2. Then we can try a manual insert of contact and data_record and see what the storage usage is like.
  3. If it's okay, restart dbt.
  4. If the external postgres has more storage available, that could be a good option for long-term stability. But if it doesn't, it's not going to help. If we moved to that option, deleting the tasks would still help, since we would need to do a pg_dump and restore.

@njuguna-n or @alexosugo, please review this plan before I start the delete query (a sketch of the delete-and-reclaim commands is below).
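
A hedged sketch of the delete-and-reclaim sequence (postgres user assumed). Note that a plain DELETE only marks pages as reusable; a VACUUM FULL, which takes an exclusive lock and needs room for a compacted copy of the table, is what actually returns space to the OS.

# size the problem first
kubectl -n echis-cht-sync exec cht-sync-postgres-0 -- \
  psql -U postgres -d cht_sync_db -c "SELECT count(*) FROM v1.medic WHERE doc->>'type' = 'task';"
# delete the task docs
kubectl -n echis-cht-sync exec cht-sync-postgres-0 -- \
  psql -U postgres -d cht_sync_db -c "DELETE FROM v1.medic WHERE doc->>'type' = 'task';"
# reclaim the space on disk
kubectl -n echis-cht-sync exec cht-sync-postgres-0 -- \
  psql -U postgres -d cht_sync_db -c "VACUUM FULL v1.medic;"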

@njuguna-n
Contributor Author

@witash I like the approach of deleting the task documents. They are not immediately required for the dashboards, and there is a way to recover them from CouchDB should we need them. Having the dashboards working, up to date, and reliable is more important at the moment, so I say go for it!

@njuguna-n
Contributor Author

@witash we also need to run the same delete query on the document_metadata table.

@witash
Contributor

witash commented Sep 20, 2024

Deleting tasks has worked well so far; it freed 364GB, which should be plenty for the rest of the pipeline.

Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv  983G  275G  659G  30% /var/lib/postgresql/data

I restarted couch2pg and dbt and will continue to monitor.

@njuguna-n
Contributor Author

We are no longer experiencing this issue after deleting tasks to clear up some space. It might recur once we add tasks back, but we can reopen this issue at that point or create a new one.

@github-project-automation github-project-automation bot moved this from In Progress to Done in Product Team Activities Sep 27, 2024