Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FileStore: a Store for files on disk #619

Merged
merged 41 commits into from
May 23, 2022

Conversation

rkingsbury
Copy link
Collaborator

@rkingsbury rkingsbury commented Apr 6, 2022

This WIP PR is an attempt to facilitate using maggma infrastructure to process files on disk (e.g. experimental data files). It is a follow up to the closed #488. This is in a very early state and intended mainly to facilitate discussion.

UPDATED 2022-04-11
UPDATED 2022-04-12
UPDATED 2022-04-13
UPDATED 2022-04-18
UPDATED 2022-04-25

Based on discussion with @munrojm , the idea here is to define a FileStore that provides a maggma style interface a to a directory full of files, making it possible to run Builder directly on data files and not just on mongo documents. That way, you can process files without having to write a separate Drone class for parsing. Rather, you can point a Builder at a FileStore, use the FileStore to retrieve the relevant files, and then define whatever work needs to happen in process_items of a Builder.

Right now the FileStore is initialized with a parent directory, and each subdirectory constitutes one item or record in the store. Each record contains a list of the files in that subdirectory and keeps track of their modification times. There is an internal MemoryStore that tracks all the file metadata.

Here is my thought process behind the current implementation:

  1. It seems to me that update and remove_docs should actually create and delete files on disk if we are to adhere strongly to the Store paradigm with respect to how documents work (i.e., the files on disk that constitute the Store must ALWAYS be in sync with contents of the internal MemoryStore. We don't want to delete some records from the MemoryStore without changing the corresponding files).
  2. If we adhere to Point # 1, then the only way to add additional metadata to a record is to write a new file. So for example, if I added a field to a record, as soon as I call update the FileStore should simultaneously 1) add the field to the mongo document in the internal MemoryStore and 2) create a .json file in the respective directory that includes the metadata.
  3. Point # 2 is the reason that I've built FileStore to use directories as the items, rather than individual files. Directories are guaranteed to have unique names and they provide a place to store additional metadata that one might want to add.
  4. It would also be useful to have a version of this class that can operate on a single directory full of individual files. In this case, the metadata for all records would have to be written to a single .json file.
  5. Maybe it would make sense to define a base FileStore based on point # 4 and an inherited class that is further customized for data organized into directories.
  6. I've revised the FileStore so that each item represents a single file. Metadata is written automatically to a .json file in root directory of the FileStore.

Basic usage and concepts are as follows:

  • On .connect() FileStore iterates through all files in its base directory and create a FileRecord object from each. max_depth and file_filters (fnmatch patterns) can be passed as kwargs to restrict which files are indexed.
  • On .connect(), if the store is not read-only, FileStore will create and connect to an internal JSONStore that creates or reads a .json file of metadata in the base directory
  • FileRecord objects record the name, parent directory, size, hash, and last modified time of the file and populate that into the internal MemoryStore
  • The internal MemoryStore is populated with dicts containing the name, full path, parent directory, size, hash, and last modified time of the file. This is accomplished via the .read() method.
  • Files are uniquely identified by a file_id, which is computed by read() as the hash of the file path relative to the FileStore base directory. This relative path is guaranteed to be unique by the file system. By using the relative path instead of the absolute path, we make it possible to move the entire FileStore to a new location on disk without changing file_id. NOTE: we considered adding the file creation time to the hash that constitutes the file_id. However I decided against this because its value may differ across platforms. E.g. the pathlib.stat().st_ctime attribute returns different values on Windows vs. Unix/Mac - see here)
  • On .update(), anything in the document except the keys populated by .read() will be written out to the .json file (and read back in next time you connect to the store). Only items with extra keys are written to the JSON (i.e., if you have 10 items in the store but add metadata to just one, only the one item will be written to the JSON). The purpose of this behavior is to prevent any duplication of data. The file_id (or whatever self.key is) and path are retained in the JSON file to make each metadata record manually identifiable
  • In the event that there are metadata records in the JSON file that no longer match files on disk, those records will be added to the FileStore and marked with {"orphan":True}. This can happen if, for example, you init a FileStore and later delete a file, or if you init the store with the default arguments but later restrict the file selection with max_depth or file_filters. The goal with this behavior is to preserve all metadata the user may have added and prevent data loss.
  • There is an include_orphans kwarg you can set on init that defines whether or not orphaned metadata records will be returned in queries
  • remove_docs deletes files assuming the store is not read only. It has an additional guard argument confirm which must be set to the non-default value True for the method to actually do anything.
  • query will attempt to read the actual contents of a file up to a certain size limit (which can be adjusted via kwarg). The contents are added under a contents key in the returned item
  • Here is an example returned document (a file from the test_files directory)
{'_id': ObjectId('625e581113cef6275a992abe'),
 'name': 'input.in',
 'path': '/home/ryan/miniconda3/envs/md/code/maggma/tests/test_files/file_store_test/calculation1/input.in',
 'parent': 'calculation1',
 'size': 90,
 'file_id': '2d12e9803fa0c6eaffb065c8dc3cf4fe',
 'last_updated': datetime.datetime(2022, 4, 19, 5, 23, 54, 109000),
 'hash': 'd42c9ff24dc2fde99ed831ec767bd3fb',
 'orphan': False,
 'contents': 'This is the file named input.in\nIn directory calculation1\nin the FileStore test directory.'}

Eager to hear thoughts!

Contributor Checklist

  • I have broken down my PR scope into the following TODO tasks
    • Create a FileStore class that uses files on disk as the underlying data storage medium
    • implement remove_docs method
    • add a convenience method to assign metadata by parsing the file or directory name
    • Refactor / streamline
    • Write documentation and update docstrings
  • I have run the tests locally and they passed.
  • I have added tests, or extended existing tests, to cover any new features or bugs fixed in this PR

Possible future enhancements (separate PRs)

  • Integrate compression somehow
  • Replace mongomock with a faster in-memory alternative

@lgtm-com
Copy link

lgtm-com bot commented Apr 6, 2022

This pull request introduces 6 alerts when merging b846318 into 0c1ea7f - view on LGTM.com

new alerts:

  • 6 for Unused import

@lgtm-com
Copy link

lgtm-com bot commented Apr 6, 2022

This pull request introduces 2 alerts when merging 9d07b06 into 0c1ea7f - view on LGTM.com

new alerts:

  • 2 for Unused import

@lgtm-com
Copy link

lgtm-com bot commented Apr 6, 2022

This pull request introduces 2 alerts when merging 84ef477 into 0c1ea7f - view on LGTM.com

new alerts:

  • 2 for Unused import

@lgtm-com
Copy link

lgtm-com bot commented Apr 6, 2022

This pull request introduces 2 alerts when merging d6f2a77 into 0c1ea7f - view on LGTM.com

new alerts:

  • 2 for Unused import

@lgtm-com
Copy link

lgtm-com bot commented Apr 6, 2022

This pull request introduces 2 alerts when merging 7f5776b into 0c1ea7f - view on LGTM.com

new alerts:

  • 2 for Unused import

@lgtm-com
Copy link

lgtm-com bot commented Apr 7, 2022

This pull request introduces 5 alerts when merging 294607d into 0c1ea7f - view on LGTM.com

new alerts:

  • 3 for Unused import
  • 2 for Unused local variable

@lgtm-com
Copy link

lgtm-com bot commented Apr 8, 2022

This pull request introduces 8 alerts when merging 038899a into 0c1ea7f - view on LGTM.com

new alerts:

  • 5 for Unused import
  • 2 for Unused local variable
  • 1 for `__eq__` not overridden when adding attributes

@lgtm-com
Copy link

lgtm-com bot commented Apr 9, 2022

This pull request introduces 6 alerts when merging 22fdf26 into 0c1ea7f - view on LGTM.com

new alerts:

  • 3 for Unused import
  • 2 for Unused local variable
  • 1 for `__eq__` not overridden when adding attributes

@lgtm-com
Copy link

lgtm-com bot commented Apr 9, 2022

This pull request introduces 6 alerts when merging c70c9b9 into 0c1ea7f - view on LGTM.com

new alerts:

  • 3 for Unused import
  • 2 for Unused local variable
  • 1 for `__eq__` not overridden when adding attributes

@codecov
Copy link

codecov bot commented Apr 10, 2022

Codecov Report

Merging #619 (7275093) into main (b68e7d7) will increase coverage by 0.46%.
The diff coverage is 98.17%.

@@            Coverage Diff             @@
##             main     #619      +/-   ##
==========================================
+ Coverage   88.96%   89.42%   +0.46%     
==========================================
  Files          41       42       +1     
  Lines        2891     3046     +155     
==========================================
+ Hits         2572     2724     +152     
- Misses        319      322       +3     
Impacted Files Coverage Δ
src/maggma/stores/file_store.py 98.02% <98.02%> (ø)
src/maggma/stores/mongolike.py 87.97% <100.00%> (+0.09%) ⬆️
src/maggma/core/drone.py 92.42% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 861ac74...7275093. Read the comment docs.

@lgtm-com
Copy link

lgtm-com bot commented Apr 10, 2022

This pull request introduces 1 alert when merging 2c815a5 into 0c1ea7f - view on LGTM.com

new alerts:

  • 1 for `__eq__` not overridden when adding attributes

@lgtm-com
Copy link

lgtm-com bot commented Apr 11, 2022

This pull request introduces 1 alert when merging 2f3b891 into 0c1ea7f - view on LGTM.com

new alerts:

  • 1 for `__eq__` not overridden when adding attributes

@lgtm-com
Copy link

lgtm-com bot commented Apr 11, 2022

This pull request introduces 1 alert when merging a21e5b0 into 02480e2 - view on LGTM.com

new alerts:

  • 1 for `__eq__` not overridden when adding attributes

@rkingsbury rkingsbury changed the title FileStore: initial prototype FileStore: a Store for files on disk Apr 11, 2022
@lgtm-com
Copy link

lgtm-com bot commented Apr 11, 2022

This pull request introduces 1 alert when merging e55b64f into 02480e2 - view on LGTM.com

new alerts:

  • 1 for `__eq__` not overridden when adding attributes

@lgtm-com
Copy link

lgtm-com bot commented Apr 11, 2022

This pull request introduces 1 alert when merging a512e4c into 02480e2 - view on LGTM.com

new alerts:

  • 1 for `__eq__` not overridden when adding attributes

@lgtm-com
Copy link

lgtm-com bot commented Apr 13, 2022

This pull request introduces 1 alert when merging ebf88a9 into facbc20 - view on LGTM.com

new alerts:

  • 1 for `__eq__` not overridden when adding attributes

@lgtm-com
Copy link

lgtm-com bot commented Apr 13, 2022

This pull request introduces 1 alert when merging db6e095 into facbc20 - view on LGTM.com

new alerts:

  • 1 for `__eq__` not overridden when adding attributes

@lgtm-com
Copy link

lgtm-com bot commented Apr 18, 2022

This pull request introduces 4 alerts when merging d396773 into c393da8 - view on LGTM.com

new alerts:

  • 2 for Unused local variable
  • 1 for `__eq__` not overridden when adding attributes
  • 1 for Unused import

@lgtm-com
Copy link

lgtm-com bot commented Apr 18, 2022

This pull request introduces 1 alert when merging ef13e43 into c393da8 - view on LGTM.com

new alerts:

  • 1 for `__eq__` not overridden when adding attributes

@lgtm-com
Copy link

lgtm-com bot commented Apr 19, 2022

This pull request introduces 1 alert when merging e8e973f into c393da8 - view on LGTM.com

new alerts:

  • 1 for `__eq__` not overridden when adding attributes

@lgtm-com
Copy link

lgtm-com bot commented Apr 19, 2022

This pull request introduces 1 alert when merging 390d259 into c393da8 - view on LGTM.com

new alerts:

  • 1 for `__eq__` not overridden when adding attributes

@lgtm-com
Copy link

lgtm-com bot commented Apr 21, 2022

This pull request introduces 1 alert when merging 2a2e65f into 2661301 - view on LGTM.com

new alerts:

  • 1 for `__eq__` not overridden when adding attributes

@lgtm-com
Copy link

lgtm-com bot commented Apr 21, 2022

This pull request introduces 1 alert when merging 16cc5ee into 2661301 - view on LGTM.com

new alerts:

  • 1 for `__eq__` not overridden when adding attributes

@lgtm-com
Copy link

lgtm-com bot commented Apr 25, 2022

This pull request introduces 1 alert when merging faaaab8 into 861ac74 - view on LGTM.com

new alerts:

  • 1 for `__eq__` not overridden when adding attributes

@lgtm-com
Copy link

lgtm-com bot commented Apr 25, 2022

This pull request introduces 1 alert when merging 51d9803 into 861ac74 - view on LGTM.com

new alerts:

  • 1 for Unused import

@rkingsbury rkingsbury marked this pull request as ready for review April 25, 2022 19:57
@rkingsbury
Copy link
Collaborator Author

Hi @munrojm I'm going to take the FileStore PR out of WIP status because I think it's basically feature-complete, tested, and working well. I suspect there are some ways to refactor the "guts" of this class to be a little cleaner, reduce redundancy from parent classes, etc. I feel confident now that as long as the tests continue to pass we can rework the internals as desired. If you have any feedback in that regard (or with respect to function and kwarg naming, organization, etc.) please let me know.

Also, I would like to add a thorough documentation page before merge but I thought I'd wait for feedback from you on the interface and implementation before I write that.

Lastly, I will note that as we discussed before, the slowness of mongomock is a bit annoying, even when using FileStore on a directory of only ~500 small text files. Hopefully we can find a faster alternative someday.

@munrojm
Copy link
Member

munrojm commented Apr 25, 2022

Hi @rkingsbury, that sounds great! Thanks again for all of your hard work on this. I will set aside some time tomorrow to do a detailed line-by-line and add my comments.

@munrojm munrojm merged commit a2b2b60 into materialsproject:main May 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants