DBT Pod running out of disk space #145
Comments
Ran bash in the container and checked the size of the dbt logs. The dbt logs don't seem to be an issue currently. The pod is still running without issues, so I will check in on this later today or tomorrow.
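A minimal sketch of that kind of check, assuming the echis-cht-sync namespace from the reproduction steps below; the pod placeholder and the log paths are assumptions to adjust to the actual deployment:

```bash
# List the pods and open a shell in the dbt one.
kubectl -n echis-cht-sync get pods
kubectl -n echis-cht-sync exec -it <dbt-pod> -- bash

# Inside the container: size of the dbt logs/artifacts (paths assume the default
# dbt project layout) and overall filesystem usage as the container sees it.
du -sh logs/ target/ 2>/dev/null
df -h
```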
@njuguna-n @mrjones-plip any updates on this problem? Are there still storage concerns?
@andrablaj I have been unable to SSH into the server this morning to check on the new pod, probably due to some ongoing maintenance. I will update here once I get access.
Connected with the MoH infrastructure team today and they haven't expanded storage yet. The plan is to complete it this week.
Thanks @eljhkrr. We will still need to get to the bottom of what is taking up the disk space. PS: The servers are still offline, so I will try logging in again tomorrow.
Servers are still inaccessible, so no update on this.
This is still an issue. The pod has restarted two more times with the same error. Logs below
This is not an issue with the dbt container or its logs, since they only take up about 32MB. Having a look at the other pods next.
The only other pod is the postgres one, and it is the culprit. Ran a few commands inside it to narrow down the usage, and now trying to find out what is taking up the storage there.
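For reference, a hedged sketch of how one might chase this down inside the postgres pod (the pod name is a placeholder, and the data directory path is the postgres image default, which may differ here):

```bash
# Find the biggest directories inside the postgres data directory.
kubectl -n echis-cht-sync exec -it <postgres-pod> -- \
  sh -c 'du -x -h -d 2 /var/lib/postgresql/data | sort -h | tail -n 20'
```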
After running a few more checks, my theory is that the large amounts of data being transformed by dbt result in postgres using up most of the temporary storage available while running the queries. I am not sure how we can resolve this other than increasing the resource limits available. @witash @dianabarsan any ideas?
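One way to test that theory is to ask postgres how much temporary file space it has been writing; a sketch assuming psql is available inside the postgres pod (pod name, user, and database are placeholders, and the pgsql_tmp path assumes the default data directory):

```bash
# Per-database temporary-file usage since the last stats reset.
kubectl -n echis-cht-sync exec -it <postgres-pod> -- \
  psql -U <user> -d <database> -c \
  "SELECT datname, temp_files, pg_size_pretty(temp_bytes) AS temp_written
     FROM pg_stat_database ORDER BY temp_bytes DESC;"

# Temp files currently on disk; pgsql_tmp normally lives under the data directory.
kubectl -n echis-cht-sync exec -it <postgres-pod> -- \
  sh -c 'du -sh /var/lib/postgresql/data/base/pgsql_tmp 2>/dev/null || true'
```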
I thought they were running cht-sync to the existing production postgres outside the cluster?
Also, the eviction threshold of 52732898884 bytes (50GB) seems quite high.
@witash wouldn't that mean that the pods would get evicted more often?
I think this message means that because the available ephemeral storage on the node (50599036Ki) is less than the threshold (52732898884 bytes), it is evicting pods. So by setting the threshold lower, it would evict pods only if available ephemeral storage was less than 1GB.
@eljhkrr are you familiar with where we can set this eviction threshold?
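For context, hard eviction thresholds are kubelet settings rather than anything in the cht-sync chart; a hedged sketch of where they typically live (the file path and exact keys depend on how the node's kubelet is configured, so treat this as an assumption to verify on the MoH nodes):

```bash
# Inspect the kubelet's current eviction settings via the configz endpoint.
NODE=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
kubectl get --raw "/api/v1/nodes/${NODE}/proxy/configz" | grep -o '"evictionHard":{[^}]*}'

# On the node itself the threshold usually lives in the KubeletConfiguration
# (commonly /var/lib/kubelet/config.yaml), e.g.
#   evictionHard:
#     nodefs.available: "1Gi"
# followed by restarting the kubelet.
```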
This might be unrelated but worth highlighting: it seems the postgres pod lost data.
Yeah, I just noticed that also. Although the postgres service has a configured persistent volume, it's not mounted, so it may not actually be using it. Also, the claim is only for 1GB, and the data is about 450GB.
So I don't fully understand all this yet, but I think it could be related.
Yes, looks like it could be related. Could postgres be saving all the data in ephemeral instead of persistent storage, thus causing the issues?
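A quick way to confirm whether the claim is actually mounted into the postgres pod (pod name is a placeholder):

```bash
# Is the claim bound, and how big is it?
kubectl -n echis-cht-sync get pvc

# Does the postgres pod actually mount a persistentVolumeClaim volume,
# or only ephemeral/emptyDir storage?
kubectl -n echis-cht-sync get pod <postgres-pod> \
  -o jsonpath='{.spec.volumes}{"\n"}{.spec.containers[*].volumeMounts}{"\n"}'
```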
Yes, I think it is. The thing that is confusing me is that the command I used to get the ephemeral storage usage doesn't report that it's using much.
But that may not be accurate, or there's some other complication; otherwise it makes sense.
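For what it's worth, the kubelet's stats summary endpoint reports per-pod ephemeral-storage usage and can be cross-checked against du inside the pod; a sketch (the node name is a placeholder and jq is assumed to be available):

```bash
# Per-pod ephemeral-storage usage as reported by the kubelet.
NODE=<node-name>
kubectl get --raw "/api/v1/nodes/${NODE}/proxy/stats/summary" \
  | jq '.pods[] | {pod: .podRef.name, ephemeralBytes: ."ephemeral-storage".usedBytes}'
```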
Once we attach a new persistent volume, we will lose all the data in ephemeral storage, right?
Yes. If we really wanted to save it, we could copy the data out first, but I think since it's only a few days of data anyway, it's simpler to start from the beginning.
Or set postgres.enabled to false and go back to using the postgres that is outside the cluster; I think the postgres outside the cluster might be easier to maintain long term.
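The original comment doesn't show the exact command, but one way to preserve the data before recreating the pod with a mounted volume would be a plain dump and restore; a sketch with placeholder names:

```bash
# Dump from the current (ephemeral) postgres; note this needs enough free space
# somewhere to hold the dump, which is part of the problem here.
kubectl -n echis-cht-sync exec <postgres-pod> -- \
  pg_dump -U <user> -d <database> -Fc -f /tmp/cht-sync.dump
kubectl -n echis-cht-sync cp <postgres-pod>:/tmp/cht-sync.dump ./cht-sync.dump

# ...then restore into the new instance once the persistent volume is mounted.
kubectl -n echis-cht-sync cp ./cht-sync.dump <new-postgres-pod>:/tmp/cht-sync.dump
kubectl -n echis-cht-sync exec <new-postgres-pod> -- \
  pg_restore -U <user> -d <database> /tmp/cht-sync.dump
```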
🎉 This issue has been resolved in version 1.0.2 🎉 The release is available on GitHub release. Your semantic-release bot 📦🚀
OK, I'll do it now.
Alright, it's running now.
Now, the next problem. Mostly it ran fine over the weekend; almost all instances finished syncing. This is much faster than expected; I guess network latency between Medic infrastructure and the instances made it much slower during testing. But because the couch2pg syncing is so much faster, the source table is way ahead of the dbt tables, and this is causing problems for the incremental update, which works best when syncing small data sets. Now it's basically trying to do a full refresh, but using the incremental logic, which requires a temp table.
Several instances were stalled for some reason. Restarting the pods seems to have fixed the issue for now, but there are still >10M changes to sync, and the current rate is about as fast as it's going to get.
All counties have completely synced except Isiolo, which is stuck for now. I was just about to manually populate the table.
Digging a bit deeper, the next dbt run will have to update ~92 million rows on that table.
@witash what would be the best way to pause the dbt pod?
Can edit the dbt template, change replicas from 1 to 0, and run helm upgrade. Then check the processlist and kill the query if it's still running before starting the manual copy. But maybe just let it run at this point? Copying the table manually will be faster, but will still take a day or so. Either way, unfortunately I have lost access and cannot do it myself.
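A sketch of that sequence, assuming the deployment is named cht-sync-dbt (the deployment, pod, user, and database names are assumptions; setting the replica value in the chart and running helm upgrade achieves the same thing):

```bash
# Pause dbt by scaling its deployment to zero replicas.
kubectl -n echis-cht-sync scale deployment cht-sync-dbt --replicas=0

# Check for a still-running dbt query and terminate it before the manual copy.
kubectl -n echis-cht-sync exec -it <postgres-pod> -- \
  psql -U <user> -d <database> -c \
  "SELECT pid, state, now() - query_start AS runtime, left(query, 80) AS query
     FROM pg_stat_activity WHERE state <> 'idle';"
kubectl -n echis-cht-sync exec -it <postgres-pod> -- \
  psql -U <user> -d <database> -c "SELECT pg_terminate_backend(<pid>);"
```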
I have started the manual insertion process for the table and the query is now running.
The insert query is still running.
The query is complete and I have restarted dbt. The first run will likely still take some time as tables and views are being built and/or populated with data. I will keep an eye on it and provide an update later.
@witash the storage error is still occurring. The dbt pod has been evicted multiple times and has not managed to have a complete run.
@alexosugo is it possible to have a postgres instance that is hosted outside the node running CHT Sync?
Those big temp tables are using up all the disk space.
Not sure what stopped all the queries; maybe we can do the same manual insert with contact and data_record to avoid creating temp tables.
I'm going to stop dbt for now to keep it from churning.
Thinking about it some more, the temp tables are only the first problem; with 70% of disk space used, it's likely that we would run out of disk space further down the pipeline (all the models are materialized views, which are also going to take up space) even if we did a manual insert of contact and data_record. So I would suggest we delete the task documents from the source table (see the sketch after this comment).
This would delete about 170M rows, which are large. If they were needed again, we would just remove the sequences from couchdb_progress, which would trigger the couch2pg instances to start from the beginning; it does not reinsert duplicates, so it would only add the missing docs. But if we did really need all that data, we need another solution anyway, which would probably be just more storage.
@njuguna-n or @alexosugo please review this plan before I start the delete query.
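A hedged sketch of the kind of delete described above; the table and column names (a raw couch2pg table with a jsonb doc column) are assumptions about the cht-sync schema and would need to be checked against the actual database before running anything:

```bash
# Delete task documents from the raw source table (table name is a placeholder;
# CHT task docs carry type 'task' in their JSON body).
kubectl -n echis-cht-sync exec -it <postgres-pod> -- \
  psql -U <user> -d <database> -c \
  "DELETE FROM <couchdb_table> WHERE doc->>'type' = 'task';"

# If the task docs are ever needed again, clearing the saved sequences in
# couchdb_progress makes couch2pg re-read from the beginning and backfill only
# the missing docs, as described above.
```

Note that a plain DELETE only marks rows dead; depending on autovacuum, a VACUUM may be needed before the space becomes reusable, and returning it to the operating system generally requires VACUUM FULL, which itself needs temporary space.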
@witash I like the approach to delete the task documents. They are not immediately required for the dashboards, and there is a way to recover them from CouchDB should we need them. Having the dashboards working, up to date, and reliable is more important at the moment, so I say go for it!
@witash we also need to run the same delete query on the
Deleting tasks worked well so far; it removed 364GB, which should be plenty for the rest of the pipeline.
Restarted couch2pg and dbt, and will continue to monitor.
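A simple way to keep an eye on it from the cluster side (pod name and data directory path are placeholders):

```bash
# Free space on the volume backing the postgres data directory.
kubectl -n echis-cht-sync exec <postgres-pod> -- df -h /var/lib/postgresql/data

# Any recent eviction events in the namespace?
kubectl -n echis-cht-sync get events --sort-by=.lastTimestamp | grep -i evict
```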
We are no longer experiencing this issue after deleting tasks to clear up some space. It might recur once we add back tasks, but we can reopen it at that point or create a new issue.
Describe the bug
A dbt pod running on MoH Kenya servers was terminated due to exceeding its storage limit. See logs below.
To Reproduce
kubectl -n echis-cht-sync describe pod cht-sync-dbt-789964dbb6-dcqch
to view details about the pod and note the message providing the reason the pod was evicted.
Expected behavior
The pod does not run out of disk space.
Logs