This is a work-in-progress tool to solve some issues I ran into while collecting the many, many old backups I made over the years. I hate losing memories (be they pictures, old writings, pieces of code, Excel sheets with projects I imagined, notes from friends, etc.), so I kept a vast collection of unusable backups. After many years of trying to sort this out, I decided to give it a try and focus on making something that may actually work. Time will tell :)
The project includes three different executable files: delete_duplicates.py, scan.py and merge.py.
The functionality has been split into separate executables because, while they share some similarities, their goals and modes of operation are very different.
- delete_duplicates.py: The goal of this executable is to process a given source directory and build a list of files that match a set of rules (defined in the should_ignore function). If a file is not ignored and its content has already been found elsewhere in the same session, the script deletes the file (or generates a script to delete it later); see the sketch after this list.
- scan.py: The goal of this executable is to manipulate a database of objects, allowing the user to create a new scan or to work with previously scanned files.
- merge.py: The goal of this executable is to merge the content of a source folder into a target folder, manipulating the content if a file already exists. To do so, the system uses a database of previously seen files and decides, based on it, what to do with objects that have the same content.
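The core of both delete_duplicates.py and merge.py is recognising files with identical content. Below is a minimal sketch of that idea, assuming a content-hash approach; the function names (file_hash, find_duplicates) and the should_ignore signature are illustrative and may not match the project's actual code.

```python
import hashlib
import os


def file_hash(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(source_dir, should_ignore=lambda path: False):
    """Yield (duplicate_path, first_seen_path) pairs found under source_dir."""
    seen = {}  # content hash -> first path seen with that content
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            path = os.path.join(root, name)
            if should_ignore(path):
                continue
            digest = file_hash(path)
            if digest in seen:
                yield path, seen[digest]
            else:
                seen[digest] = path
```

Files that hash to an already-seen value are treated as duplicates of the first occurrence, which is the decision delete_duplicates.py acts on within a single session.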
The project includes three different versions of the DataStore backend: an in-memory dict, the shelve standard module, and sqlite3.
At the moment, the selection of the backend is not configurable. You can only change it by instantiating the corresponding class (MemoryDataStore, ShelveDataStore, or DataStore). Only DataStore implements all the required functionality so far, but it may be very slow for some large operations.
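As a rough illustration of how the three backends could sit behind a common interface, here is a sketch assuming a simple get/put key-value contract; the method names and the sqlite table layout are assumptions for the example, not the project's real API.

```python
import shelve
import sqlite3


class MemoryDataStore:
    """Backend 1: a plain in-memory dict, lost when the process exits."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)


class ShelveDataStore:
    """Backend 2: the shelve standard module, persisted to a file."""

    def __init__(self, path):
        self._db = shelve.open(path)

    def put(self, key, value):
        self._db[key] = value

    def get(self, key, default=None):
        return self._db.get(key, default)


class DataStore:
    """Backend 3: sqlite3, the most complete but potentially the slowest."""

    def __init__(self, path):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS objects (key TEXT PRIMARY KEY, value TEXT)"
        )

    def put(self, key, value):
        self._conn.execute(
            "INSERT OR REPLACE INTO objects (key, value) VALUES (?, ?)", (key, value)
        )
        self._conn.commit()

    def get(self, key, default=None):
        row = self._conn.execute(
            "SELECT value FROM objects WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else default


# Since the backend is not configurable yet, switching means instantiating a
# different class directly, for example:
store = DataStore("files.db")  # or MemoryDataStore() / ShelveDataStore("files.shelf")
```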