Reduce memory usage #57

Closed
expectocode opened this issue Dec 3, 2016 · 5 comments

Comments

@expectocode

Running on a machine with 1GB RAM is hellishly slow when processing a decent number of messages. It would be a great help if you made it not load so much into memory :) Love the project, it's very useful for some of my own :D

@tvdstaaij
Owner

I noticed this myself too, and I also know where it's coming from. When implementing #6, a design decision had to be made between three options, ranging from memory-saving but impractical to convenient but memory-intensive. I chose the middle ground, which buffers the message data of one dialog at a time in memory. So as it is now, the number of dialogs shouldn't matter much, but the number and size of the messages in the largest dialog determine the bulk of the memory load, and this gets quite a bit higher than I anticipated.
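For readers following the trade-off above, here is a minimal Python sketch of the memory-saving end of that spectrum: reading a dialog's JSONL dump one line at a time instead of buffering the whole dialog. It is only an illustration, not the project's actual code. The function names and the `from` and `text` field names are hypothetical.

```python
import json
from typing import Iterable, Iterator, TextIO


def iter_messages(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed message per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def format_dialog(lines: Iterable[str], out: TextIO) -> int:
    """Write one plaintext line per message and return the number written.

    Only one message is held in memory at a time, so memory use does not
    grow with the size of the dialog. The 'from' and 'text' keys are
    hypothetical field names chosen for this sketch.
    """
    count = 0
    for msg in iter_messages(lines):
        out.write(f"{msg.get('from', '?')}: {msg.get('text', '')}\n")
        count += 1
    return count
```

Passing an open file object as `lines` streams it line by line. The cost of this approach is that a formatter which needs to look back at earlier messages has to keep that extra state itself.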

I'm not yet sure about the best way to tackle this problem, and I don't have much time for GitHub projects at the moment, but I would certainly like to solve this at some point. I am actually one of the users who would directly benefit from this, seeing as I run the script on a server with 2GB RAM, of which at least a third is used by other software.

@tvdstaaij
Owner

I should also mention that, as far as I know, this is specifically a problem with formatters and not with the dumping process itself. If all formatters are disabled (so that only JSONL files will be produced) the memory usage should be acceptable.

However, as I wrote this and looked at some of my code, I realized that I made a mistake in some conditionals that could cause high memory usage even if all formatters are disabled. The above commit on master fixes this.

@ghost

ghost commented Dec 18, 2016

Thanks, that's how I use it and hopefully this will help :) I really appreciate your responsiveness

tvdstaaij added a commit that referenced this issue Mar 4, 2017
…old to new [#57,#74]

This is a necessary followup to 706776a and should also eliminate the excessive memory usage problem during formatting. Changes the progress file format, so a fresh backup is necessary after applying this commit.

Minor regression: breaks reply author support (e.g. "in reply to Kenny") in plaintext formatter until an alternative method for achieving this is implemented.

@tvdstaaij
Owner

I decided to rework the formatting system to use a less memory-intensive method as part of #74, which is developed on the dump-old-to-new branch and will probably be integrated in the next major release (because it requires a clean dump after upgrading). The formatter memory problem is currently solved on this branch. Memory usage may increase a bit again if I re-implement reply formatting functionality, but this should not be significant. Closing this in favor of #74.

@anfederico

anfederico commented Jun 17, 2018

Say you've backed up a total of 1,000,000 messages and formatted all of them (plaintext, one day per file). If you run the backup again and scrape 1,000 more messages, are you reformatting everything again? Or just the 1,000?

I've backed up around 2 years of data and back up messages daily. I've noticed the daily backups are quick (3-5 minutes for a few hundred to 1,000 messages), but the formatting takes hours. That's why I suspect it's reformatting everything?
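The thread does not say whether the formatter works incrementally. To make the question concrete, here is a hedged Python sketch of what "just the 1,000" could look like with one output file per day: only the days that contain messages newer than a stored checkpoint would be regenerated. Everything here is an assumption for illustration, including the `id` and `date` field names, the idea that ids increase over time, and the `last_formatted_id` checkpoint.

```python
from datetime import datetime, timezone


def days_to_reformat(messages, last_formatted_id):
    """Return the sorted UTC days (YYYY-MM-DD) that contain unformatted messages.

    Assumes each message is a dict with a numeric 'id' that increases over
    time and a Unix timestamp in 'date'. Both names are hypothetical.
    """
    days = set()
    for msg in messages:
        if msg["id"] > last_formatted_id:
            stamp = datetime.fromtimestamp(msg["date"], tz=timezone.utc)
            days.add(stamp.strftime("%Y-%m-%d"))
    return sorted(days)
```

With this approach, a daily run that adds messages to only one or two days would rewrite only those day files rather than the full two years of output.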
