ECS Consuming Hugely Disproportionate Data Disk Space in Overheads, and Consuming for No Apparent Reason #561

Open
AB442 opened this issue Nov 7, 2022 · 11 comments


@AB442

AB442 commented Nov 7, 2022

Expected Behavior

This is an ECS v3.7 installation from the OVA. The expectation is that ECS storage uses a reasonable amount of disk space when a file is uploaded. For example, when a 1 GB file is uploaded, the user expects about 1 GB of disk space to be consumed; understandably this could be something like 1.5 GB with various overheads. Deleting the file is expected to release all of the consumed disk space, allowing the space to be used for additional files.

Actual Behavior

ECS consumes massively more disk space on the data disk than the uploaded files should occupy (7 to 10 times more, multiplicative). Beyond this, disk space also continues to be consumed passively after an upload. I have not been able to observe when this passive consumption stops; it appears to keep consuming more disk space as time goes on. The system was not rebooted during this time. All of the excess consumed space falls under metadata and protection overhead. I have a single-node installation, and neither metadata nor protection is enabled in my deploy.yml or bucket files. The machine has a large disk, but this issue caused another ECS deployment of mine, with a smaller disk, to crash.

Examples of what happened: in my experience the disk space consumed by user files is dwarfed by metadata overhead (over 2x the size of the user files) and protection overhead (over 4x the size of the user files). More worrying is that ECS appears to consume extra disk space passively, for no apparent reason. For example, when I logged off on a Friday evening approximately 700 GB was consumed; by Monday morning it was over 1 TB, and all of the additional consumption appeared to be metadata and protection overhead. I can watch the consumption rising in real time: in the last few hours I have added no files, but an additional 30 GB was used.

I have a sizeable amount of disk space on this ECS node, so I have been able to monitor this consumption. However, on another ECS node that I set up some weeks ago with a 100 GB data disk, the result was worse: I decided to check that node and it has now crashed with 96% of the disk space used by ECS. I cannot check its logs specifically, but it shows the same symptoms as my main node. The smaller test node only had a single 6.8 MB file uploaded to it. Neither node has been rebooted since it was set up, as I'm aware of the known issue that rebooting ECS can tie up disk space.
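To make the reported ratios concrete, here is a minimal Python sketch of the arithmetic. It assumes the overheads simply add on top of the user data, which is an illustrative model and not a description of how ECS accounts for capacity internally. The function name `total_footprint` is hypothetical; the 2x and 4x figures are the lower bounds reported above.

```python
def total_footprint(user_gb, metadata_ratio, protection_ratio):
    """Return (total_gb, multiplier) when overheads are expressed as
    multiples of the user data and simply added on top of it."""
    if user_gb <= 0:
        raise ValueError("user_gb must be positive")
    total_gb = user_gb * (1 + metadata_ratio + protection_ratio)
    return total_gb, total_gb / user_gb


# Lower bounds reported in this issue: metadata over 2x, protection over 4x.
total_gb, multiplier = total_footprint(1.0, metadata_ratio=2.0, protection_ratio=4.0)
print(f"1 GB of user data occupies at least {total_gb:.1f} GB ({multiplier:.0f}x)")
```

Under this additive assumption, the two reported overheads alone account for the low end of the 7 to 10 times figure.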

I would greatly appreciate assistance with this issue. As you can see, it is quite serious: it effectively renders ECS Community Edition unusable for any length of time, and certainly prevents evaluation of the utility. See the screenshots below.

Steps to Reproduce Behavior

  1. Install ECS from OVA v3.7
  2. Perform the necessary steps (1 & 2) with default values in deploy.yml (protection = false, etc.), adding only the required environment details such as IPs to the file
  3. Once the platform is up and running, begin uploading files and monitor the disk space used on the ECS dashboard. The disk space consumed should far outstrip the size of the uploaded files. The capacity utilization tab of the dashboard shows the breakdown of usage; if the issue has occurred, the usage for metadata and protection overheads will be far higher than for the user files (2 to 4 times the size for each). Stopping the upload and waiting allows one to observe further growth in disk space consumption over approximately the next day.
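For anyone reproducing this without watching the dashboard, the passive growth can also be recorded from the host with a small script. The following is a minimal sketch using Python's standard `shutil.disk_usage`; it measures the whole filesystem, not the ECS breakdown into user data, metadata, and protection. The function names, the CSV layout, and the paths are assumptions for illustration, and the mount point of the ECS data disk must be substituted for `"."`.

```python
import csv
import os
import shutil
import time
from datetime import datetime, timezone

FIELDS = ["timestamp", "total_gb", "used_gb", "free_gb"]


def sample(path):
    """Take one reading of the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "total_gb": round(usage.total / 1e9, 3),
        "used_gb": round(usage.used / 1e9, 3),
        "free_gb": round(usage.free / 1e9, 3),
    }


def log_usage(path, out_file, interval_s, samples):
    """Append `samples` readings to a CSV file, `interval_s` seconds apart."""
    new_file = not os.path.exists(out_file) or os.path.getsize(out_file) == 0
    with open(out_file, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        for i in range(samples):
            writer.writerow(sample(path))
            f.flush()  # keep readings on disk even if the node later crashes
            if i < samples - 1:
                time.sleep(interval_s)


if __name__ == "__main__":
    # "." is a placeholder; point this at the data disk's mount point.
    print(sample("."))
```

For example, `log_usage(".", "usage.csv", 3600, 72)` would take one reading per hour for three days, which is long enough to cover the idle period described in step 3.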

Relevant Output and Logs


image1

image2


Notifies: @nikhil-vr

@nikhil-vr
Collaborator

Metadata usage is very high in the example above, which is unexpected for ECS code. We reserve many copies of btree pages; I'm not sure whether that is the cause, or whether it is btree/journal garbage, which is collected after 30/15 days. Does the capacity usage stay the same, or is it dropping?

I'm currently working on 3.8 and will test this, and will also optimize some GC parameters for smaller systems.

@AB442
Author

AB442 commented Dec 5, 2022

We left the node running for about 2 weeks, but the usage kept increasing day after day. As this is a single-node installation, I'm not sure whether this also occurs in multi-node deployments. Even if garbage collection did clean it out after 15 / 30 days, I don't think it would help systems with smaller disks; for example, I have seen the garbage build up enough to crash a machine with a 500 GB disk in a day or two. Thanks for the response.
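Whether cleanup after 15 or 30 days arrives in time depends on how fast the disk fills. A rough Python sketch of that extrapolation follows, using the 700 GB and 1 TB readings reported earlier in this thread. The 60-hour interval (Friday evening to Monday morning), the 2 TB capacity, and the function names are assumptions for illustration, and linear growth is itself an assumption.

```python
def growth_rate_gb_per_hour(start_gb, end_gb, hours):
    """Average growth between two capacity readings."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return (end_gb - start_gb) / hours


def hours_until_full(capacity_gb, used_gb, rate_gb_per_hour):
    """Hours until the disk is full if growth stays linear."""
    if rate_gb_per_hour <= 0:
        return float("inf")  # no growth means the disk never fills
    return max(capacity_gb - used_gb, 0.0) / rate_gb_per_hour


# Reported readings: about 700 GB on Friday evening, over 1 TB on Monday morning.
# The 60 hours in between is an assumed figure.
rate = growth_rate_gb_per_hour(700.0, 1000.0, 60.0)

# Hypothetical 2 TB data disk with 1 TB already used.
remaining = hours_until_full(2000.0, 1000.0, rate)
print(f"growth: {rate:.1f} GB/h, full in about {remaining / 24:.1f} days")
```

Because the Monday reading was "over 1 TB", the rate computed here is a lower bound on the growth actually observed.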

@nikhil-vr
Collaborator

Please test 3.8. I have made some optimizations to reduce metadata on small systems; please try it and let me know.

@lriva94

lriva94 commented Jan 26, 2023

Hello, I installed version 3.8 with 4 nodes (which may not count as a small system). I ingested very little data, yet system metadata is increasing every day. No GC actions seem to be triggered.
It seems the issue is not fixed even in 3.8.

HistoricalCapa
vdcCapa

@lriva94

lriva94 commented Jan 26, 2023

I used the OVA installation.

@AB442
Author

AB442 commented Jan 26, 2023

I can test 3.8 from source, as it appears the OVA still has the issue. Hopefully I will have some results in the coming days. As mentioned, I installed 3.7 from the OVA and it presented this issue.

Update: I tried to install from source but encountered a few errors in step 1 which I unfortunately don't have time to debug at the moment, but I think it's a reasonable assumption that the issue likely exists there also. I have the ECS 3.8 OVA set up so if there is anything you would like to try out then let me know.

@nikhil-vr
Collaborator

A manual deployment will not change this. We are investigating and will let you know the outcome.

@tihopia

tihopia commented Feb 23, 2023

Is it known whether this problem is absent from earlier builds? Is it worth installing 3.6.2.0 to prevent the disk space consumption?

It seems that versions 3.7 and 3.8 both suffer from this. We have installed 3.8 with both the OVA and the manual method and found the same symptoms as described here.

@nikhil-vr
Collaborator

We are releasing a new OVA image for 3.8 next week and will update once it is posted.

@AB442
Author

AB442 commented Feb 23, 2023

To answer tihopia's question: I installed multiple versions and the problem was present on all of them, including 3.6 and 3.5.

@jonnyoboy2110

Is there any way to remove all the excess data that is being stored, or to prevent any more of it from being stored? I use my server for testing, so I don't need any of the data after testing, but it's becoming difficult to set up a whole new server every time we need to test because the server has used up all of the storage on my machine.


5 participants