FileStore: a Store for files on disk #619

rkingsbury · 2022-04-06T18:26:22Z

This WIP PR is an attempt to facilitate using maggma infrastructure to process files on disk (e.g. experimental data files). It is a follow up to the closed #488. ~~This is in a very early state and intended mainly to facilitate discussion.~~

~~UPDATED 2022-04-11~~
~~UPDATED 2022-04-12~~
~~UPDATED 2022-04-13~~
~~UPDATED 2022-04-18~~
UPDATED 2022-04-25

Based on discussion with @munrojm , the idea here is to define a FileStore that provides a maggma style interface a to a directory full of files, making it possible to run Builder directly on data files and not just on mongo documents. That way, you can process files without having to write a separate Drone class for parsing. Rather, you can point a Builder at a FileStore, use the FileStore to retrieve the relevant files, and then define whatever work needs to happen in process_items of a Builder.

Right now the FileStore is initialized with a parent directory, and each subdirectory constitutes one item or record in the store. Each record contains a list of the files in that subdirectory and keeps track of their modification times. There is an internal MemoryStore that tracks all the file metadata.

Here is my thought process behind the current implementation:

It seems to me that update and remove_docs should actually create and delete files on disk if we are to adhere strongly to the Store paradigm with respect to how documents work (i.e., the files on disk that constitute the Store must ALWAYS be in sync with contents of the internal MemoryStore. We don't want to delete some records from the MemoryStore without changing the corresponding files).
If we adhere to Point # 1, then the only way to add additional metadata to a record is to write a new file. So for example, if I added a field to a record, as soon as I call update the FileStore should simultaneously 1) add the field to the mongo document in the internal MemoryStore and 2) create a .json file in the respective directory that includes the metadata.
Point # 2 is the reason that I've built FileStore to use directories as the items, rather than individual files. Directories are guaranteed to have unique names and they provide a place to store additional metadata that one might want to add.
It would also be useful to have a version of this class that can operate on a single directory full of individual files. In this case, the metadata for all records would have to be written to a single .json file.
~~Maybe it would make sense to define a base FileStore based on point # 4 and an inherited class that is further customized for data organized into directories.~~
I've revised the FileStore so that each item represents a single file. Metadata is written automatically to a .json file in root directory of the FileStore.

Basic usage and concepts are as follows:

On .connect() FileStore iterates through all files in its base directory and create a FileRecord object from each. max_depth and file_filters (fnmatch patterns) can be passed as kwargs to restrict which files are indexed.
On .connect(), if the store is not read-only, FileStore will create and connect to an internal JSONStore that creates or reads a .json file of metadata in the base directory
~~FileRecord objects record the name, parent directory, size, hash, and last modified time of the file and populate that into the internal MemoryStore~~
The internal MemoryStore is populated with dicts containing the name, full path, parent directory, size, hash, and last modified time of the file. This is accomplished via the .read() method.
Files are uniquely identified by a file_id, which is computed by read() as the hash of the file path relative to the FileStore base directory. This relative path is guaranteed to be unique by the file system. By using the relative path instead of the absolute path, we make it possible to move the entire FileStore to a new location on disk without changing file_id. NOTE: we considered adding the file creation time to the hash that constitutes the file_id. However I decided against this because its value may differ across platforms. E.g. the pathlib.stat().st_ctime attribute returns different values on Windows vs. Unix/Mac - see here)
On .update(), anything in the document except the keys populated by .read() will be written out to the .json file (and read back in next time you connect to the store). Only items with extra keys are written to the JSON (i.e., if you have 10 items in the store but add metadata to just one, only the one item will be written to the JSON). The purpose of this behavior is to prevent any duplication of data. The file_id (or whatever self.key is) and path are retained in the JSON file to make each metadata record manually identifiable
In the event that there are metadata records in the JSON file that no longer match files on disk, those records will be added to the FileStore and marked with {"orphan":True}. This can happen if, for example, you init a FileStore and later delete a file, or if you init the store with the default arguments but later restrict the file selection with max_depth or file_filters. The goal with this behavior is to preserve all metadata the user may have added and prevent data loss.
There is an include_orphans kwarg you can set on init that defines whether or not orphaned metadata records will be returned in queries
remove_docs deletes files assuming the store is not read only. It has an additional guard argument confirm which must be set to the non-default value True for the method to actually do anything.
query will attempt to read the actual contents of a file up to a certain size limit (which can be adjusted via kwarg). The contents are added under a contents key in the returned item
Here is an example returned document (a file from the test_files directory)

{'_id': ObjectId('625e581113cef6275a992abe'),
 'name': 'input.in',
 'path': '/home/ryan/miniconda3/envs/md/code/maggma/tests/test_files/file_store_test/calculation1/input.in',
 'parent': 'calculation1',
 'size': 90,
 'file_id': '2d12e9803fa0c6eaffb065c8dc3cf4fe',
 'last_updated': datetime.datetime(2022, 4, 19, 5, 23, 54, 109000),
 'hash': 'd42c9ff24dc2fde99ed831ec767bd3fb',
 'orphan': False,
 'contents': 'This is the file named input.in\nIn directory calculation1\nin the FileStore test directory.'}

Eager to hear thoughts!

Contributor Checklist

I have broken down my PR scope into the following TODO tasks
- Create a FileStore class that uses files on disk as the underlying data storage medium
- implement remove_docs method
- add a convenience method to assign metadata by parsing the file or directory name
- Refactor / streamline
- Write documentation and update docstrings
I have run the tests locally and they passed.
I have added tests, or extended existing tests, to cover any new features or bugs fixed in this PR

Possible future enhancements (separate PRs)

Integrate compression somehow
Replace mongomock with a faster in-memory alternative

lgtm-com · 2022-04-06T18:45:22Z

This pull request introduces 6 alerts when merging b846318 into 0c1ea7f - view on LGTM.com

new alerts:

6 for Unused import

lgtm-com · 2022-04-06T20:30:03Z

This pull request introduces 2 alerts when merging 9d07b06 into 0c1ea7f - view on LGTM.com

new alerts:

2 for Unused import

lgtm-com · 2022-04-06T20:55:53Z

This pull request introduces 2 alerts when merging 84ef477 into 0c1ea7f - view on LGTM.com

new alerts:

2 for Unused import

lgtm-com · 2022-04-06T21:55:30Z

This pull request introduces 2 alerts when merging d6f2a77 into 0c1ea7f - view on LGTM.com

new alerts:

2 for Unused import

lgtm-com · 2022-04-06T22:12:14Z

This pull request introduces 2 alerts when merging 7f5776b into 0c1ea7f - view on LGTM.com

new alerts:

2 for Unused import

lgtm-com · 2022-04-07T20:45:32Z

This pull request introduces 5 alerts when merging 294607d into 0c1ea7f - view on LGTM.com

new alerts:

3 for Unused import
2 for Unused local variable

lgtm-com · 2022-04-08T07:08:18Z

This pull request introduces 8 alerts when merging 038899a into 0c1ea7f - view on LGTM.com

new alerts:

5 for Unused import
2 for Unused local variable
1 for `__eq__` not overridden when adding attributes

lgtm-com · 2022-04-09T06:12:02Z

This pull request introduces 6 alerts when merging 22fdf26 into 0c1ea7f - view on LGTM.com

new alerts:

3 for Unused import
2 for Unused local variable
1 for `__eq__` not overridden when adding attributes

lgtm-com · 2022-04-09T16:28:51Z

This pull request introduces 6 alerts when merging c70c9b9 into 0c1ea7f - view on LGTM.com

new alerts:

3 for Unused import
2 for Unused local variable
1 for `__eq__` not overridden when adding attributes

codecov · 2022-04-10T01:07:07Z

Codecov Report

Merging #619 (7275093) into main (b68e7d7) will increase coverage by 0.46%.
The diff coverage is 98.17%.

@@            Coverage Diff             @@
##             main     #619      +/-   ##
==========================================
+ Coverage   88.96%   89.42%   +0.46%     
==========================================
  Files          41       42       +1     
  Lines        2891     3046     +155     
==========================================
+ Hits         2572     2724     +152     
- Misses        319      322       +3

Impacted Files	Coverage Δ
src/maggma/stores/file_store.py	`98.02% <98.02%> (ø)`
src/maggma/stores/mongolike.py	`87.97% <100.00%> (+0.09%)`	⬆️
src/maggma/core/drone.py	`92.42% <0.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 861ac74...7275093. Read the comment docs.

lgtm-com · 2022-04-10T01:22:40Z

This pull request introduces 1 alert when merging 2c815a5 into 0c1ea7f - view on LGTM.com

new alerts:

1 for `__eq__` not overridden when adding attributes

lgtm-com · 2022-04-11T00:40:49Z

This pull request introduces 1 alert when merging 2f3b891 into 0c1ea7f - view on LGTM.com

new alerts:

1 for `__eq__` not overridden when adding attributes

lgtm-com · 2022-04-11T18:00:18Z

This pull request introduces 1 alert when merging a21e5b0 into 02480e2 - view on LGTM.com

new alerts:

1 for `__eq__` not overridden when adding attributes

lgtm-com · 2022-04-11T22:00:31Z

This pull request introduces 1 alert when merging e55b64f into 02480e2 - view on LGTM.com

new alerts:

1 for `__eq__` not overridden when adding attributes

lgtm-com · 2022-04-11T23:55:38Z

This pull request introduces 1 alert when merging a512e4c into 02480e2 - view on LGTM.com

new alerts:

1 for `__eq__` not overridden when adding attributes

lgtm-com · 2022-04-13T06:14:46Z

This pull request introduces 1 alert when merging ebf88a9 into facbc20 - view on LGTM.com

new alerts:

1 for `__eq__` not overridden when adding attributes

lgtm-com · 2022-04-13T17:11:25Z

This pull request introduces 1 alert when merging db6e095 into facbc20 - view on LGTM.com

new alerts:

1 for `__eq__` not overridden when adding attributes

lgtm-com · 2022-04-18T22:07:51Z

This pull request introduces 4 alerts when merging d396773 into c393da8 - view on LGTM.com

new alerts:

2 for Unused local variable
1 for `__eq__` not overridden when adding attributes
1 for Unused import

lgtm-com · 2022-04-18T23:01:51Z

This pull request introduces 1 alert when merging ef13e43 into c393da8 - view on LGTM.com

new alerts:

1 for `__eq__` not overridden when adding attributes

lgtm-com · 2022-04-19T06:18:09Z

This pull request introduces 1 alert when merging e8e973f into c393da8 - view on LGTM.com

new alerts:

1 for `__eq__` not overridden when adding attributes

lgtm-com · 2022-04-19T06:55:40Z

This pull request introduces 1 alert when merging 390d259 into c393da8 - view on LGTM.com

new alerts:

1 for `__eq__` not overridden when adding attributes

lgtm-com · 2022-04-21T18:15:37Z

This pull request introduces 1 alert when merging 2a2e65f into 2661301 - view on LGTM.com

new alerts:

1 for `__eq__` not overridden when adding attributes

lgtm-com · 2022-04-21T18:48:40Z

This pull request introduces 1 alert when merging 16cc5ee into 2661301 - view on LGTM.com

new alerts:

1 for `__eq__` not overridden when adding attributes

…o filestore

lgtm-com · 2022-04-25T18:15:57Z

This pull request introduces 1 alert when merging faaaab8 into 861ac74 - view on LGTM.com

new alerts:

1 for `__eq__` not overridden when adding attributes

lgtm-com · 2022-04-25T19:14:07Z

This pull request introduces 1 alert when merging 51d9803 into 861ac74 - view on LGTM.com

new alerts:

1 for Unused import

rkingsbury · 2022-04-25T20:01:37Z

Hi @munrojm I'm going to take the FileStore PR out of WIP status because I think it's basically feature-complete, tested, and working well. I suspect there are some ways to refactor the "guts" of this class to be a little cleaner, reduce redundancy from parent classes, etc. I feel confident now that as long as the tests continue to pass we can rework the internals as desired. If you have any feedback in that regard (or with respect to function and kwarg naming, organization, etc.) please let me know.

Also, I would like to add a thorough documentation page before merge but I thought I'd wait for feedback from you on the interface and implementation before I write that.

Lastly, I will note that as we discussed before, the slowness of mongomock is a bit annoying, even when using FileStore on a directory of only ~500 small text files. Hopefully we can find a faster alternative someday.

munrojm · 2022-04-25T20:11:04Z

Hi @rkingsbury, that sounds great! Thanks again for all of your hard work on this. I will set aside some time tomorrow to do a detailed line-by-line and add my comments.

rkingsbury force-pushed the filestore branch from df2891e to 04b7023 Compare April 10, 2022 00:55

rkingsbury mentioned this pull request Apr 11, 2022

MemoryStore: fix groupby ignoring properties #621

Merged

4 tasks

rkingsbury changed the title ~~FileStore: initial prototype~~ FileStore: a Store for files on disk Apr 11, 2022

rkingsbury added 9 commits April 18, 2022 15:25

FileStore: initial prototype

86149a9

FileStore: datetime cleanups for Document

2d9429f

FileStore: rename RecordIdentifier -> Directory

5d1cd56

FileStore: cleanups

5a596d7

FileStore: last_updated test, tweaks

e117857

FileStore: fix key field; change to dir_name

d9bfc5c

FileStore: default_factory like behavior for hash, mtime

1731161

JSONStore: create file on init; datetime serialization fix

4cb7168

FileStore: simplify; add file_id; inherit from JSONStore

c5988db

rkingsbury added 2 commits April 18, 2022 15:26

FileStore: track_files -> file_filters

689d39c

FileStore: add include_orphans kwarg

7cc21b4

rkingsbury force-pushed the filestore branch from d396773 to 7cc21b4 Compare April 18, 2022 22:26

rkingsbury added 2 commits April 18, 2022 15:34

JSONStore: re-merge upstream

8508045

JSONStore: restore split read_json_file

ef13e43

rkingsbury added 2 commits April 18, 2022 22:42

FileStore: read file contents (with options)

3abda20

FileStore: add remove_docs

e8e973f

rkingsbury added 2 commits April 18, 2022 23:33

FileStore: fix contents for properties=None

e9de626

FileStore: properties=None vs properties=[] consistent with super

390d259

FileStore: add_metadata convenience method

2a2e65f

FileStore: auto_data option for add_metadata

16cc5ee

Merge branch 'main' of https://github.com/materialsproject/maggma int…

faaaab8

…o filestore

rkingsbury added 2 commits April 25, 2022 11:46

FileStore: rm FileRecord; use relative file_id

15d8d5e

FileStore: mypy fixes

51d9803

FileStore: rm import and refactor protected_keys

574a689

rkingsbury marked this pull request as ready for review April 25, 2022 19:57

rkingsbury added 2 commits April 30, 2022 20:39

FileStore: bugfix when path="."

cb0ec66

FileStore: bugfix when path="."

7275093

munrojm added the release:minor label May 23, 2022

munrojm merged commit a2b2b60 into materialsproject:main May 23, 2022

rkingsbury mentioned this pull request May 27, 2022

Docs updates: add FileStore and misc edits #668

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FileStore: a Store for files on disk #619

FileStore: a Store for files on disk #619

rkingsbury commented Apr 6, 2022 •

edited

Loading

lgtm-com bot commented Apr 6, 2022

lgtm-com bot commented Apr 6, 2022

lgtm-com bot commented Apr 6, 2022

lgtm-com bot commented Apr 6, 2022

lgtm-com bot commented Apr 6, 2022

lgtm-com bot commented Apr 7, 2022

lgtm-com bot commented Apr 8, 2022

lgtm-com bot commented Apr 9, 2022

lgtm-com bot commented Apr 9, 2022

codecov bot commented Apr 10, 2022 •

edited

Loading

lgtm-com bot commented Apr 10, 2022

lgtm-com bot commented Apr 11, 2022

lgtm-com bot commented Apr 11, 2022

lgtm-com bot commented Apr 11, 2022

lgtm-com bot commented Apr 11, 2022

lgtm-com bot commented Apr 13, 2022

lgtm-com bot commented Apr 13, 2022

lgtm-com bot commented Apr 18, 2022

lgtm-com bot commented Apr 18, 2022

lgtm-com bot commented Apr 19, 2022

lgtm-com bot commented Apr 19, 2022

lgtm-com bot commented Apr 21, 2022

lgtm-com bot commented Apr 21, 2022

lgtm-com bot commented Apr 25, 2022

lgtm-com bot commented Apr 25, 2022

rkingsbury commented Apr 25, 2022

munrojm commented Apr 25, 2022

FileStore: a Store for files on disk #619

FileStore: a Store for files on disk #619

Conversation

rkingsbury commented Apr 6, 2022 • edited Loading

Contributor Checklist

Possible future enhancements (separate PRs)

lgtm-com bot commented Apr 6, 2022

lgtm-com bot commented Apr 6, 2022

lgtm-com bot commented Apr 6, 2022

lgtm-com bot commented Apr 6, 2022

lgtm-com bot commented Apr 6, 2022

lgtm-com bot commented Apr 7, 2022

lgtm-com bot commented Apr 8, 2022

lgtm-com bot commented Apr 9, 2022

lgtm-com bot commented Apr 9, 2022

codecov bot commented Apr 10, 2022 • edited Loading

Codecov Report

lgtm-com bot commented Apr 10, 2022

lgtm-com bot commented Apr 11, 2022

lgtm-com bot commented Apr 11, 2022

lgtm-com bot commented Apr 11, 2022

lgtm-com bot commented Apr 11, 2022

lgtm-com bot commented Apr 13, 2022

lgtm-com bot commented Apr 13, 2022

lgtm-com bot commented Apr 18, 2022

lgtm-com bot commented Apr 18, 2022

lgtm-com bot commented Apr 19, 2022

lgtm-com bot commented Apr 19, 2022

lgtm-com bot commented Apr 21, 2022

lgtm-com bot commented Apr 21, 2022

lgtm-com bot commented Apr 25, 2022

lgtm-com bot commented Apr 25, 2022

rkingsbury commented Apr 25, 2022

munrojm commented Apr 25, 2022

rkingsbury commented Apr 6, 2022 •

edited

Loading

codecov bot commented Apr 10, 2022 •

edited

Loading