DBT Pod running out of disk space #145

Closed
njuguna-n opened this issue Sep 3, 2024 · 40 comments

@njuguna-n
Contributor

Describe the bug
A dbt pod running on MoH Kenya servers was terminated due to exceeding its storage limit. See logs below.

To Reproduce

  1. SSH into the MoH servers using the credentials shared in 1Password
  2. Run kubectl -n echis-cht-sync describe pod cht-sync-dbt-789964dbb6-dcqch to view details about the pod and note the message explaining why it was evicted (see the sketch below for listing all evicted pods in the namespace)
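
A hedged sketch for step 2 (namespace and pod name taken from this issue): evicted pods remain in the Failed phase, so they can be listed in one command and the eviction reason pulled out with jsonpath.

# list all failed/evicted pods in the namespace
kubectl -n echis-cht-sync get pods --field-selector=status.phase=Failed
# print the eviction reason and message for a specific pod
kubectl -n echis-cht-sync get pod cht-sync-dbt-789964dbb6-dcqch \
  -o jsonpath='{.status.reason}{": "}{.status.message}{"\n"}'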

Expected behavior
The pod does not run out of disk space

Logs

Name:             cht-sync-dbt-789964dbb6-dcqch
Namespace:        echis-cht-sync
Priority:         0
Service Account:  default
Node:             echischtsync/10.127.105.170
Start Time:       Sat, 31 Aug 2024 06:02:09 +0000
Labels:           app=cht-sync
                  pod-template-hash=789964dbb6
Annotations:      <none>
Status:           Failed
Reason:           Evicted
Message:          The node was low on resource: ephemeral-storage. Threshold quantity: 52732898884, available: 51464084Ki. Container dbt was using 30335059, request is 0, has larger consumption of ephemeral-storage.
IP:               10.42.52.181
IPs:
  IP:           10.42.52.181
Controlled By:  ReplicaSet/cht-sync-dbt-789964dbb6
@njuguna-n
Contributor Author

Ran bash in the container with kubectl exec -it cht-sync-dbt-789964dbb6-ssrjl --container dbt bash -n echis-cht-sync
Running df -h to view the available disk space on the file system yields the results below

Filesystem                         Size  Used Avail Use% Mounted on
overlay                            983G  362G  571G  39% /
tmpfs                               64M     0   64M   0% /dev
tmpfs                               30G     0   30G   0% /sys/fs/cgroup
shm                                 64M   16K   64M   1% /dev/shm
/dev/mapper/ubuntu--vg-ubuntu--lv  983G  362G  571G  39% /etc/hosts
tmpfs                               30G     0   30G   0% /proc/acpi
tmpfs                               30G     0   30G   0% /proc/scsi
tmpfs                               30G     0   30G   0% /sys/firmware

Running du -h --max-depth=1 | sort -n -r to show disk usage in the top-level directories in the /dbt directory sorted by size yields the results below

740K	./logs
33M	.
27M	./dbt_packages
8.0K	./.dbt
3.8M	./target
1.3M	./old_manifest

The dbt logs don't seem to be an issue currently. The pod is still running without issues so I will check in on this later today or tomorrow.
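
As an aside, sort -n compares only the leading number, which is why 740K lands above 27M in that listing. A variant using GNU sort's -h flag (assuming GNU coreutils is present in the image) orders human-readable sizes correctly:

# sort human-readable sizes (K/M/G) largest first
du -h --max-depth=1 /dbt | sort -rh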

@andrablaj
Member

@njuguna-n @mrjones-plip any updates on this problem? Are there still storage concerns?

@njuguna-n
Contributor Author

@andrablaj I have been unable to SSH into the server this morning to check on the new pod, probably due to some ongoing maintenance. I will update here once I get access.

@eljhkrr
Member

eljhkrr commented Sep 9, 2024

Connected with the MoH infrastructure team today; they haven't expanded storage yet. The plan is to complete it this week.

@njuguna-n
Contributor Author

Thanks @eljhkrr. We will still need to get to the bottom of what is taking up the disk space.

PS: The servers are still offline so I will try logging in again tomorrow.

@njuguna-n
Contributor Author

Servers are still inaccessible, so no update on this.

@njuguna-n
Contributor Author

This is still an issue. The pod has restarted two more times with the same error. Logs below

Name:             cht-sync-dbt-789964dbb6-ssrjl
Namespace:        echis-cht-sync
Priority:         0
Service Account:  default
Node:             echischtsync/10.127.105.170
Start Time:       Tue, 03 Sep 2024 05:49:09 +0000
Labels:           app=cht-sync
                  pod-template-hash=789964dbb6
Annotations:      <none>
Status:           Failed
Reason:           Evicted
Message:          The node was low on resource: ephemeral-storage. Threshold quantity: 52732898884, available: 50599036Ki. Container dbt was using 31786940, request is 0, has larger consumption of ephemeral-storage.
Name:             cht-sync-dbt-789964dbb6-k6pkp
Namespace:        echis-cht-sync
Priority:         0
Service Account:  default
Node:             echischtsync/10.127.105.170
Start Time:       Tue, 10 Sep 2024 14:08:55 +0000
Labels:           app=cht-sync
                  pod-template-hash=789964dbb6
Annotations:      <none>
Status:           Failed
Reason:           Evicted
Message:          The node was low on resource: ephemeral-storage. Threshold quantity: 52732898884, available: 51163920Ki. Container dbt was using 30294666, request is 0, has larger consumption of ephemeral-storage.

This is not an issue with the dbt container or its logs, since it only takes up about 32MB. Looking at the cht-sync-couch2pg pod, it failed for a similar reason and uses even less space:

Name:             cht-sync-couch2pg--elgeyomarakwet-echis-go-ke-668577fd5b-v27gg
Namespace:        echis-cht-sync
Priority:         0
Service Account:  default
Node:             echischtsync/10.127.105.170
Start Time:       Sat, 31 Aug 2024 06:02:09 +0000
Labels:           app=cht-sync
                  pod-template-hash=668577fd5b
Annotations:      <none>
Status:           Failed
Reason:           Evicted
Message:          The node was low on resource: ephemeral-storage. Threshold quantity: 52732898884, available: 3438972Ki. Container couch2pg was using 8Ki, request is 0, has larger consumption of ephemeral-storage.

The only other pod is the postgres one, which is the culprit. Running kubectl -n echis-cht-sync exec cht-sync-postgres-0 -- df -h /tmp yields the result below:

Filesystem      Size  Used Avail Use% Mounted on
overlay         983G  482G  452G  52% /

Running kubectl -n echis-cht-sync exec cht-sync-postgres-0 -- du -sh /* gives the output below. I have edited it slightly to remove some permission-denied entries.

0	/bin
4.0K	/boot
16K	/dev
944K	/etc
4.0K	/home
0	/lib
0	/lib64
4.0K	/media
4.0K	/mnt
4.0K	/opt
0	/proc
16K	/root
20K	/run
0	/sbin
4.0K	/srv
0	/sys
4.0K	/tmp
408M	/usr
459G	/var

Running kubectl -n echis-cht-sync exec cht-sync-postgres-0 -- du -sh /var/* results in

4.0K	/var/backups
1.8M	/var/cache
459G	/var/lib
4.0K	/var/local
0	/var/lock
96K	/var/log
4.0K	/var/mail
4.0K	/var/opt
0	/var/run
4.0K	/var/spool
4.0K	/var/tmp

Now trying to find out what is taking up the storage in /var/lib

@njuguna-n
Contributor Author

After running du -h --max-depth=1 | sort -n -r in the /var/lib directory I get the result below:

460G	./postgresql
460G	.
72K	./systemd
28K	./pam
28K	./apt
24K	./ucf
6.3M	./dpkg
4.0K	./misc

My theory is that the large amounts of data being transformed by dbt result in postgres using up most of the temporary storage available while running the queries. I am not sure how we can resolve this other than increasing the resource limits available. @witash @dianabarsan any ideas?
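
A hedged sketch of narrowing that down: list the largest relations so we can see whether the space is in the source tables, the dbt models, or leftover temp tables (the postgres user and database name are assumptions):

kubectl -n echis-cht-sync exec cht-sync-postgres-0 -- \
  psql -U postgres -d cht_sync_db -c "
    SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) AS total_size
    FROM pg_class
    WHERE relkind IN ('r', 'm')          -- ordinary tables and materialized views
    ORDER BY pg_total_relation_size(oid) DESC
    LIMIT 10;"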

@witash
Contributor

witash commented Sep 12, 2024

I thought they were running cht-sync against the existing production postgres outside the cluster?
They changed to a postgres in the cluster instead?

@witash
Contributor

witash commented Sep 12, 2024

Also, the eviction threshold of 52732898884 bytes (~50GB) seems quite high.
I can't see where it's configured, but we could probably lower it even to 1GB or something.

@njuguna-n
Contributor Author

@witash wouldn't that mean that the pods would get evicted more often?

@witash
Contributor

witash commented Sep 12, 2024

I think this message

Message:          The node was low on resource: ephemeral-storage. Threshold quantity: 52732898884, available: 50599036Ki. Container dbt was using 31786940, request is 0, has larger consumption of ephemeral-storage

means that because the available ephemeral storage on the node (50599036Ki) is less than the threshold (52732898884 bytes), the kubelet is evicting pods. So by setting the threshold lower, it would only evict pods if the available ephemeral storage dropped below 1GB.
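
For reference, a hedged sketch of where that threshold might live, assuming the node runs k3s (suggested by the local-path StorageClass mentioned later in this thread); the exact config path and flag spelling should be verified on the node:

# /etc/rancher/k3s/config.yaml (assumed path), passing a kubelet argument:
#   kubelet-arg:
#     - "eviction-hard=nodefs.available<1Gi"
# then restart k3s and confirm the kubelet picked it up:
sudo systemctl restart k3s
kubectl get --raw "/api/v1/nodes/echischtsync/proxy/configz" | jq '.kubeletconfig.evictionHard'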

@njuguna-n
Contributor Author

@eljhkrr are you familiar with where we can set this eviction threshold?

@njuguna-n
Contributor Author

This might be unrelated but is worth highlighting. It seems the postgres pod lost data: running select min(m.saved_timestamp) from v1.medic m; returns 2024-09-10 18:15:48.398, and most of the currently running pods started at around the same time (a quick way to gauge the gap against CouchDB is sketched after the pod list below).
cc: @alexosugo @eljhkrr @witash @dianabarsan

NAME                                                             READY   STATUS                   RESTARTS         AGE
cht-sync-couch2pg--baringo-echis-go-ke-8cf5cc87f-fl8k5           1/1     Running                  7 (22h ago)      4d18h
cht-sync-couch2pg--bomet-echis-go-ke-7b557946ff-fvfgz            1/1     Running                  1 (2d14h ago)    2d14h
cht-sync-couch2pg--bungoma-echis-go-ke-7bb48c859c-dzzxt          1/1     Running                  6 (2d14h ago)    4d18h
cht-sync-couch2pg--busia-echis-go-ke-68bf657f8-2tcnp             1/1     Running                  7 (22h ago)      2d15h
cht-sync-couch2pg--elgeyomarakwet-echis-go-ke-668577fd5b-mdwfg   1/1     Running                  6 (2d14h ago)    2d15h
cht-sync-couch2pg--embu-echis-go-ke-64cfcc67f4-kmv7w             1/1     Running                  7 (22h ago)      2d15h
cht-sync-couch2pg--garissa-echis-go-ke-76b84c6874-qjdq4          1/1     Running                  6 (2d14h ago)    2d15h
cht-sync-couch2pg--homabay-echis-go-ke-76fcc5ddf7-4n896          1/1     Running                  15 (27m ago)     2d18h
cht-sync-couch2pg--isiolo-echis-go-ke-8d6f699cc-tq8xb            1/1     Running                  0                4d18h
cht-sync-couch2pg--kajiado-echis-go-ke-7bcc5886c6-b5965          1/1     Running                  0                2d14h
cht-sync-couch2pg--kakamega-echis-go-ke-676d9dd994-x8df2         1/1     Running                  29 (28m ago)     2d15h
cht-sync-couch2pg--kericho-echis-go-ke-64dffd4f48-h4njh          1/1     Running                  4 (24h ago)      2d14h
cht-sync-couch2pg--kiambu-echis-go-ke-5df788c65f-nx7pt           1/1     Running                  8 (29m ago)      2d18h
cht-sync-couch2pg--kilifi-echis-go-ke-5b6b78dfb7-wzft4           1/1     Running                  10 (39h ago)     2d18h
cht-sync-couch2pg--kirinyaga-echis-go-ke-fbdcfc748-d8ggp         1/1     Running                  6 (2d14h ago)    4d18h
cht-sync-couch2pg--kisii-echis-go-ke-7557c7fb64-bvmc4            1/1     Running                  7 (38h ago)      2d18h
cht-sync-couch2pg--kisumu-echis-go-ke-75bf9844c8-526pl           1/1     Running                  6 (2d14h ago)    2d15h
cht-sync-couch2pg--kitui-echis-go-ke-7bdfff86bf-crrpw            1/1     Running                  9 (28m ago)      2d18h
cht-sync-couch2pg--kwale-echis-go-ke-6f8fb8749f-ldwkc            1/1     Running                  8 (43h ago)      2d15h
cht-sync-couch2pg--laikipia-echis-go-ke-89774b7d9-4z2cd          1/1     Running                  6 (2d14h ago)    2d18h
cht-sync-couch2pg--lamu-echis-go-ke-588fbb88d4-h9fq4             1/1     Running                  6 (2d14h ago)    2d15h
cht-sync-couch2pg--machakos-echis-go-ke-7f5448f4fb-bw62p         1/1     Running                  7 (27h ago)      4d18h
cht-sync-couch2pg--makueni-echis-go-ke-5b6c6d6ccb-wgng6          1/1     Running                  0                2d14h
cht-sync-couch2pg--mandera-echis-go-ke-5cbdbb975c-sgtml          1/1     Running                  1 (2d14h ago)    2d14h
cht-sync-couch2pg--marsabit-echis-go-ke-86687777-j9lnx           1/1     Running                  6 (2d14h ago)    2d15h
cht-sync-couch2pg--meru-echis-go-ke-54fbcd74cf-tqpjr             1/1     Running                  7 (2d2h ago)     2d18h
cht-sync-couch2pg--migori-echis-go-ke-6bd8dfb78b-6tcvg           1/1     Running                  12 (21h ago)     2d18h
cht-sync-couch2pg--mombasa-echis-go-ke-68f77b68bd-59rbz          1/1     Running                  6 (2d14h ago)    2d18h
cht-sync-couch2pg--muranga-echis-go-ke-59c49b5d6d-s6tlh          1/1     Running                  7 (43h ago)      2d15h
cht-sync-couch2pg--nairobi-echis-go-ke-57d4d46b9f-rf8lv          1/1     Running                  86 (10h ago)     2d15h
cht-sync-couch2pg--nakuru-echis-go-ke-5564d9d797-g4bbb           1/1     Running                  1 (35h ago)      2d14h
cht-sync-couch2pg--nandi-echis-go-ke-8c6849695-97ckz             1/1     Running                  8 (27m ago)      2d15h
cht-sync-couch2pg--narok-echis-go-ke-57b78f54b8-sclgn            1/1     Running                  0                2d14h
cht-sync-couch2pg--nyamira-echis-go-ke-6b4db7dfb8-nsmv6          1/1     Running                  7 (22h ago)      2d15h
cht-sync-couch2pg--nyandarua-echis-go-ke-5c6c6cdfd5-btk9d        1/1     Running                  6 (2d14h ago)    2d18h
cht-sync-couch2pg--nyeri-echis-go-ke-7b4cb79757-6mfrw            1/1     Running                  10 (27h ago)     4d18h
cht-sync-couch2pg--samburu-echis-go-ke-c88c6c459-67wmr           1/1     Running                  6 (2d14h ago)    2d18h
cht-sync-couch2pg--taitataveta-echis-go-ke-745fb9dcf4-vjbkt      1/1     Running                  6 (2d14h ago)    2d18h
cht-sync-couch2pg--tanariver-echis-go-ke-f59488b7f-rnnb2         1/1     Running                  6 (2d14h ago)    2d15h
cht-sync-couch2pg--tharakanithi-echis-go-ke-78cfc7848-8p6nr      1/1     Running                  6 (2d14h ago)    2d18h
cht-sync-couch2pg--transnzoia-echis-go-ke-6b758dd697-xjhfg       1/1     Running                  2 (72m ago)      2d14h
cht-sync-couch2pg--turkana-echis-go-ke-b7867ddcc-gmdvk           1/1     Running                  6 (2d14h ago)    4d18h
cht-sync-couch2pg--uasingishu-echis-go-ke-78fc9fcf7c-xv8x8       1/1     Running                  6 (2d14h ago)    2d18h
cht-sync-couch2pg--vihiga-echis-go-ke-7bb44f5d79-xc22w           1/1     Running                  6 (2d14h ago)    4d18h
cht-sync-couch2pg--wajir-echis-go-ke-66fc97f86d-hj7mz            1/1     Running                  6 (2d14h ago)    4d18h
cht-sync-couch2pg--westpokot-echis-go-ke-56cc68df66-lzxhw        1/1     Running                  6 (2d14h ago)    2d15h
cht-sync-dbt-789964dbb6-qxs98                                    1/1     Running                  0                2d14h
cht-sync-postgres-0                                              1/1     Running                  0                2d14h
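
A hedged sketch of gauging the gap: compare the postgres row count and earliest timestamp against CouchDB's doc_count for the source database (the postgres user, CouchDB host, credentials, and the "medic" database name are placeholders/assumptions):

kubectl -n echis-cht-sync exec cht-sync-postgres-0 -- \
  psql -U postgres -d cht_sync_db -c "SELECT count(*), min(saved_timestamp) FROM v1.medic;"
# CouchDB reports the total document count for a database on GET /{db}
curl -s -u <user>:<pass> https://<couchdb-host>/medic | jq '.doc_count'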

@witash
Contributor

witash commented Sep 13, 2024

Yeah, I just noticed that too. Although the postgres service has a configured persistent volume, it's not mounted, so it may not actually be using it. Also, the claim is only for 1Gi, while the data is about 450GB.

kubectl -n echis-cht-sync describe pod cht-sync-postgres-0
Name:             cht-sync-postgres-0
...
Containers:
  postgres-c:
    ...
    Mounts:               <none>
Volumes:
  postgres-data:
    Type:        PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:   postgres-data-cht-sync-postgres-0
    ReadOnly:    false
Events:          <none>
kubectl -n echis-cht-sync describe pv pvc-74bae13c-d740-4ece-afd3-05ab100859bf
Name:              pvc-74bae13c-d740-4ece-afd3-05ab100859bf
...
StorageClass:      local-path
Status:            Bound
Claim:             echis-cht-sync/postgres-data-cht-sync-postgres-0
Reclaim Policy:    Delete
Access Modes:      RWO
VolumeMode:        Filesystem
Capacity:          1Gi

So I don't fully understand all of this yet, but I think it could be related.
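
A hedged sketch of confirming the mismatch: check the claimed PVC size and whether the postgres data directory is actually backed by the volume or sitting on node-local overlay storage (the data path is assumed to be the postgres default):

# size requested by the claim
kubectl -n echis-cht-sync get pvc postgres-data-cht-sync-postgres-0 \
  -o jsonpath='{.spec.resources.requests.storage}{"\n"}'
# volume mounts actually declared on the container
kubectl -n echis-cht-sync get pod cht-sync-postgres-0 \
  -o jsonpath='{.spec.containers[0].volumeMounts}{"\n"}'
# if this reports an overlay filesystem, the data is on ephemeral node storage
kubectl -n echis-cht-sync exec cht-sync-postgres-0 -- df -h /var/lib/postgresql/data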

@njuguna-n
Contributor Author

Yes, looks like it could be related. Could postgres be saving all the data in ephemeral instead of persistent storage thus causing the issues?

@witash
Contributor

witash commented Sep 13, 2024

Yes, looks like it could be related. Could postgres be saving all the data in ephemeral instead of persistent storage thus causing the issues?

Yes, I think it is. The thing that confuses me is that I used

kubectl get --raw "/api/v1/nodes/echischtsync/proxy/stats/summary" | jq 

to get the ephemeral storage usage, and it doesn't report that it's using much:

      "ephemeral-storage": {
        "time": "2024-09-13T10:05:57Z",
        "availableBytes": 359237095424,
        "capacityBytes": 1054657961984,
        "usedBytes": 9802,
        "inodesFree": 65160966,
        "inodes": 65470464,
        "inodesUsed": 2
      },

but this may not be accurate, or there's some other complication; otherwise it makes sense.
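
For what it's worth, the same stats/summary endpoint also reports per-pod ephemeral-storage usage, which might make the discrepancy easier to pin down (a hedged sketch; the jq filter reflects the summary API's usual shape):

kubectl get --raw "/api/v1/nodes/echischtsync/proxy/stats/summary" | \
  jq '.pods[] | {pod: .podRef.name, ephemeralUsedBytes: .["ephemeral-storage"].usedBytes}'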

@njuguna-n
Contributor Author

Once we attach a new persistent volume we will lose all the data in ephemeral storage right?

@witash
Contributor

witash commented Sep 13, 2024

Once we attach a new persistent volume we will lose all the data in ephemeral storage right?

Yes. If we really wanted to save it, we could:

  1. pg_dump the current database
  2. helm uninstall
  3. pull the latest cht-sync (with the fix merged)
  4. set couch2pg replicas to 0 in the template
  5. helm install
  6. restore from the pg_dump file
  7. set couch2pg replicas back to 1
  8. helm upgrade to apply

I think since it's only a few days' worth of data anyway, it's simpler to start from the beginning.

  1. merge and release fix(#145): adds volume mount for postgres #153
  2. helm uninstall
  3. add storageSize: "950GB" (or similar; the node currently has 1TB, assuming we can use all of it) under the postgres section (see the helm sketch at the end of this comment)
  4. helm reinstall

or set postgres.enabled to false and go back to using the postgres that is outside the cluster

I think the postgres outside the cluster might be easier to maintain long term:

  1. this persistent volume is not easily resizable, so we need to be sure this is the most space it would ever need (even years later)
  2. DBAs can manage it in the way they are probably used to, instead of having to go through Kubernetes
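
A hedged sketch of the start-from-the-beginning path above (the release name, chart reference, and values layout are assumptions from this thread; verify against the actual cht-sync chart before running):

helm -n echis-cht-sync uninstall cht-sync
# in values.yaml, under the postgres section (key name per the thread):
#   postgres:
#     enabled: true
#     storageSize: "950GB"
helm -n echis-cht-sync install cht-sync <cht-sync-chart> -f values.yaml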

medic-ci pushed a commit that referenced this issue Sep 13, 2024
## [1.0.2](v1.0.1...v1.0.2) (2024-09-13)

### Bug Fixes

* #145: adds volume mount for postgres (a128f71), closes #145
@medic-ci

🎉 This issue has been resolved in version 1.0.2 🎉

The release is available on GitHub.

Your semantic-release bot 📦🚀

@njuguna-n
Contributor Author

@eljhkrr is out of office. @witash, are you able to do the helm upgrade?

@witash
Contributor

witash commented Sep 13, 2024

OK, I'll do it now.

@witash
Contributor

witash commented Sep 13, 2024

Alright, it's running now.
I will keep an eye on it over the weekend.

@witash
Contributor

witash commented Sep 16, 2024

Now, the next problem. Mostly it ran fine over the weekend; almost all instances finished syncing. This is much faster than expected; I guess network latency between Medic infrastructure and the instances made it much slower during testing.

But, because the couch2pg syncing is so much faster, the source table is way ahead of the dbt tables, and this is causing problems for the incremental update, which works best when syncing small data sets.

Now it's basically trying to do a full refresh, but using the incremental logic, which requires a temp table.

             delete from "cht_sync_db"."v1"."document_metadata"
             where (uuid) in (
                 select (uuid)
                 from "document_metadata__dbt_tmp134537377717"
             );

     insert into "cht_sync_db"."v1"."document_metadata" ("uuid", "_deleted", "saved_timestamp", "doc_type")
     (
         select "uuid", "_deleted", "saved_timestamp", "doc_type"
         from "document_metadata__dbt_tmp134537377717"
     )

document_metadata__dbt_tmp134537377717 is going to be >200M rows, so this is going to be very slow and use a lot of space.

Several instances were stalled for some reason. Restarting the pods seems to have fixed the issue for now, but there are still >10M changes to sync.
I would suggest waiting until they are close to finished (should be sometime today) and then manually populating the document_metadata table, by stopping the dbt pod and simply running:

DROP TABLE v1.document_metadata CASCADE;
insert into "cht_sync_db"."v1"."document_metadata" ("uuid", "_deleted", "saved_timestamp", "doc_type") (SELECT _id, _deleted, "saved_timestamp", doc->>'type' FROM v1.medic WHERE _deleted = false);

which is as fast as it's going to get.
The current query took 17 hours for only 7M rows, so it's unlikely to finish anytime soon.
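
A hedged sketch of the stop-dbt-and-copy-manually path, using kubectl scale as a quicker alternative to editing the helm template (the deployment name and postgres user are assumptions):

# pause dbt so it stops launching new runs
kubectl -n echis-cht-sync scale deployment cht-sync-dbt --replicas=0
# open a psql session in the postgres pod and run the SQL above
kubectl -n echis-cht-sync exec -it cht-sync-postgres-0 -- psql -U postgres -d cht_sync_db
# once the manual insert is done, bring dbt back
kubectl -n echis-cht-sync scale deployment cht-sync-dbt --replicas=1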

@njuguna-n
Contributor Author

All counties have completely synced except Isiolo, which is stuck for now. I was just about to manually populate the document_metadata table but noticed that it had just completed updating (see logs below). I think we should give the dbt run a few more hours before assessing the need to manually update the tables.

13:45:36  1 of 77 START sql incremental model v1.dbt_results ............................. [RUN]
13:45:37  1 of 77 OK created sql incremental model v1.dbt_results ........................ [INSERT 0 0 in 0.80s]
13:45:37  2 of 77 START sql incremental model v1.document_metadata ....................... [RUN]
06:56:54  2 of 77 OK created sql incremental model v1.document_metadata .................. [INSERT 0 161519768 in 234677.55s]
06:56:54  3 of 77 START sql materialized_view model v1.ext_ssview_activity_all_time ...... [RUN]
06:56:55  3 of 77 OK created sql materialized_view model v1.ext_ssview_activity_all_time . [REFRESH MATERIALIZED VIEW in 0.25s]
06:56:55  4 of 77 START sql materialized_view model v1.ext_ssview_activity_monthly ....... [RUN]
06:56:55  4 of 77 OK created sql materialized_view model v1.ext_ssview_activity_monthly .. [REFRESH MATERIALIZED VIEW in 0.16s]
06:56:55  5 of 77 START sql incremental model v1.contact ................................. [RUN]

@njuguna-n
Contributor Author

Digging a bit deeper, the next dbt run will have to update ~92 million rows on the document_metadata table which will take at least a few more days given past run times. So updating that table manually makes sense.

@njuguna-n
Contributor Author

@witash what would be the best way to pause the dbt pod?

@witash
Contributor

witash commented Sep 17, 2024

You can edit the dbt template, change replicas from 1 to 0, and run helm upgrade. Then check the process list and kill the query if it's still running before starting the manual copy (a sketch follows below).

But maybe just let it run at this point? Copying the table manually will be faster, but it will still take a day or so.

Either way, unfortunately I have lost access and cannot do it myself.
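
A hedged sketch of the check-the-process-list-and-kill step (the postgres user is an assumption):

kubectl -n echis-cht-sync exec cht-sync-postgres-0 -- \
  psql -U postgres -d cht_sync_db -c "
    SELECT pid, state, now() - query_start AS runtime, left(query, 80) AS query
    FROM pg_stat_activity
    WHERE state <> 'idle'
    ORDER BY runtime DESC;"
# then, for the offending pid:
#   SELECT pg_terminate_backend(<pid>);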

@njuguna-n
Contributor Author

I have started the manual insertion process for the table and the query is now running.

@njuguna-n
Contributor Author

The insert query for document_metadata is still running. I will update here once that is done.

@njuguna-n
Contributor Author

njuguna-n commented Sep 19, 2024

The query is complete and I have restarted dbt. The first run will likely still take some time as tables and views are being built and/or being populated with data. I will keep an eye on it and provide an update later.

@njuguna-n
Contributor Author

@witash the storage error is still occurring.

Name:             cht-sync-dbt-789964dbb6-f8pjt
Namespace:        echis-cht-sync
Priority:         0
Service Account:  default
Node:             echischtsync/10.127.105.170
Start Time:       Thu, 19 Sep 2024 12:04:16 +0000
Labels:           app=cht-sync
                  pod-template-hash=789964dbb6
Annotations:      <none>
Status:           Failed
Reason:           Evicted
Message:          The node was low on resource: ephemeral-storage. Threshold quantity: 52732898884, available: 51040476Ki. Container dbt was using 30608366, request is 0, has larger consumption of ephemeral-storage.

The dbt pod has been evicted multiple times and has not managed a complete run:

cht-sync-dbt-789964dbb6-747wv                                    1/1     Running                  0                49m
cht-sync-dbt-789964dbb6-f8pjt                                    0/1     Error                    0                108m
cht-sync-dbt-789964dbb6-fnjzb                                    0/1     Error                    0                166m
cht-sync-dbt-789964dbb6-jcjnh                                    0/1     Error                    0                3h40m
cht-sync-dbt-789964dbb6-k2j6r                                    0/1     Error                    0                7h4m
cht-sync-dbt-789964dbb6-wz7kq                                    0/1     ContainerStatusUnknown   1                4h36m
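
A hedged sketch of checking the node-level picture directly, since eviction is driven by the node's disk pressure rather than by any single pod (node name from this thread):

kubectl describe node echischtsync | grep -i -A 2 DiskPressure
kubectl get --raw "/api/v1/nodes/echischtsync/proxy/stats/summary" | jq '.node.fs'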

@njuguna-n
Contributor Author

@alexosugo is it possible to have a postgres instance that is hosted outside the node running CHT Sync?

@witash
Contributor

witash commented Sep 19, 2024

Those big temp tables are using up all the disk space.
I noticed disk space was almost used up (99%); in pg_stat_activity there were several queries creating temp tables for contacts.
Then, a second later, there was nothing in pg_stat_activity and disk space was down to 70% (still not great, but enough to avoid pod eviction).

investigator@echisproxy:~$ kubectl -n echis-cht-sync exec cht-sync-postgres-0 -- df -h /var/lib/postgresql/data
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv  983G  639G  294G  69% /var/lib/postgresql/data

Not sure what stopped all the queries.

Maybe we can do the same manual insert with contact and data_record to avoid creating temp tables.
If we need to free up disk space, there are a lot of big task and target documents that aren't used in queries, although deleting them is a little risky if we need them again.
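
A hedged sketch of sizing those dbt scratch tables directly, since dbt names them with a __dbt_tmp suffix as seen in the logs above (postgres user and database assumed):

kubectl -n echis-cht-sync exec cht-sync-postgres-0 -- \
  psql -U postgres -d cht_sync_db -c "
    SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) AS size
    FROM pg_class
    WHERE relname LIKE '%__dbt_tmp%'
    ORDER BY pg_total_relation_size(oid) DESC;"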

@witash
Contributor

witash commented Sep 19, 2024

I'm going to stop dbt for now to keep it from churning.

@witash
Contributor

witash commented Sep 20, 2024

Thinking about it some more, the temp tables are only the first problem; with 70% of disk space used, it's likely that we would run out of disk space further down the pipeline (all the models are materialized views, which are also going to take up space) even if we did a manual insert of contact and data_record.

So I would suggest we

  1. Delete all the tasks from the source table:
DELETE FROM v1.medic where doc->>'type' = 'task'

This would delete 170M rows, which have large doc columns.

If they were needed again, we would just remove the sequences from couchdb_progress, which would trigger the couch2pg instances to start from the beginning; it does not reinsert duplicates so it would only add the missing docs.

But if we really did need all that data, we would need another solution anyway, which would probably just be more storage.

  2. Then we can try a manual insert of contact and data_record and see what the storage usage is like.
  3. If it's okay, restart dbt.
  4. If the external postgres has more storage available, that could be a good option for long-term stability. But if it doesn't, it's not going to help. If we moved to that option, deleting the tasks would still help, since we would need to do a pg_dump and restore.

@njuguna-n or @alexosugo, please review this plan before I start the delete query (a sketch of the delete-and-reclaim commands is below).
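
A hedged sketch of the delete-and-reclaim sequence (postgres user assumed). Note that a plain DELETE only marks pages as reusable; a VACUUM FULL, which takes an exclusive lock and needs room for a compacted copy of the table, is what actually returns space to the OS.

# size the problem first
kubectl -n echis-cht-sync exec cht-sync-postgres-0 -- \
  psql -U postgres -d cht_sync_db -c "SELECT count(*) FROM v1.medic WHERE doc->>'type' = 'task';"
# delete the task docs
kubectl -n echis-cht-sync exec cht-sync-postgres-0 -- \
  psql -U postgres -d cht_sync_db -c "DELETE FROM v1.medic WHERE doc->>'type' = 'task';"
# reclaim the space on disk
kubectl -n echis-cht-sync exec cht-sync-postgres-0 -- \
  psql -U postgres -d cht_sync_db -c "VACUUM FULL v1.medic;"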

@njuguna-n
Contributor Author

@witash I like the approach of deleting the task documents. They are not immediately required for the dashboards, and there is a way to recover them from CouchDB should we need them. Having the dashboards working, up to date, and reliable is more important at the moment, so I say go for it!

@njuguna-n
Contributor Author

@witash we also need to run the same delete query on the document_metadata table.

@witash
Contributor

witash commented Sep 20, 2024

Deleting tasks has worked well so far; it freed 364GB, which should be plenty for the rest of the pipeline.

Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv  983G  275G  659G  30% /var/lib/postgresql/data

I restarted couch2pg and dbt and will continue to monitor.

@njuguna-n
Contributor Author

We are no longer experiencing this issue after deleting tasks to clear up some space. It might recur once we add tasks back, but we can reopen this issue at that point or create a new one.

@github-project-automation github-project-automation bot moved this from In Progress to Done in Product Team Activities Sep 27, 2024