FileStore: a Store for files on disk #619
Conversation
Codecov Report

```diff
@@            Coverage Diff             @@
##             main     #619      +/-   ##
==========================================
+ Coverage   88.96%   89.42%   +0.46%
==========================================
  Files          41       42       +1
  Lines        2891     3046     +155
==========================================
+ Hits         2572     2724     +152
- Misses        319      322       +3
```

Continue to review full report at Codecov.
Hi @munrojm, I'm going to take the […] Also, I would like to add a thorough documentation page before merge, but I thought I'd wait for feedback from you on the interface and implementation before I write that. Lastly, I will note that, as we discussed before, the slowness of […]
Hi @rkingsbury, that sounds great! Thanks again for all of your hard work on this. I will set aside some time tomorrow to do a detailed line-by-line review and add my comments.
This WIP PR is an attempt to facilitate using `maggma` infrastructure to process files on disk (e.g., experimental data files). It is a follow-up to the closed #488. ~~This is in a very early state and intended mainly to facilitate discussion.~~ UPDATED 2022-04-11, 2022-04-12, 2022-04-13, 2022-04-18, 2022-04-25.
Based on discussion with @munrojm, the idea here is to define a `FileStore` that provides a `maggma`-style interface to a directory full of files, making it possible to run a `Builder` directly on data files and not just on mongo documents. That way, you can process files without having to write a separate `Drone` class for parsing. Rather, you can point a `Builder` at a `FileStore`, use the `FileStore` to retrieve the relevant files, and then define whatever work needs to happen in `process_item` of a `Builder`.
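To make that concrete, here is a minimal sketch (not code from this PR) of what a `Builder` running on a `FileStore` could look like. It assumes maggma's standard `Builder` interface (`get_items` / `process_item` / `update_targets`); the field names (`file_id`, `size`) are taken from the description below.

```python
from maggma.core import Builder

class FileSizeBuilder(Builder):
    """Toy Builder that copies each file's size into a target Store."""

    def __init__(self, source, target, **kwargs):
        self.source = source          # a FileStore
        self.target = target          # any Store, e.g. a MemoryStore
        super().__init__(sources=[source], targets=[target], **kwargs)

    def get_items(self):
        # Each item is a metadata dict describing one file on disk
        return self.source.query()

    def process_item(self, item):
        # This is where file parsing / analysis would happen
        return {"file_id": item["file_id"], "size": item["size"]}

    def update_targets(self, items):
        self.target.update(items, key="file_id")
```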
~~Right now the `FileStore` is initialized with a parent directory, and each subdirectory constitutes one item or record in the store. Each record contains a list of the files in that subdirectory and keeps track of their modification times.~~ There is an internal `MemoryStore` that tracks all the file metadata.

Here is my thought process behind the current implementation:
1. `update` and `remove_docs` should actually create and delete files on disk if we are to adhere strongly to the `Store` paradigm with respect to how documents work (i.e., the files on disk that constitute the `Store` must ALWAYS be in sync with the contents of the internal `MemoryStore`; we don't want to delete some records from the `MemoryStore` without changing the corresponding files).
2. On `update`, the `FileStore` should simultaneously 1) add the field to the mongo document in the internal `MemoryStore` and 2) create a .json file in the respective directory that includes the metadata.
3. ~~Point #2 is the reason that I've built `FileStore` to use directories as the items, rather than individual files. Directories are guaranteed to have unique names and they provide a place to store additional metadata that one might want to add.~~
4. ~~It would also be useful to have a version of this class that can operate on a single directory full of individual files. In this case, the metadata for all records would have to be written to a single .json file.~~
5. ~~Maybe it would make sense to define a base `FileStore` based on point #4 and an inherited class that is further customized for data organized into directories.~~ UPDATED: `FileStore` is now built so that each item represents a single file. Metadata is written automatically to a .json file in the root directory of the `FileStore` (see the illustrative record after this list).
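For illustration only (the exact schema is not spelled out in this PR text), a persisted metadata record along the lines described above and in the usage notes below might look like this:

```python
# Hypothetical record from the metadata .json file; all values are made up.
# Per the usage notes below, file_id (or whatever self.key is) and path are
# always retained, only items carrying user-added keys are persisted, and
# records whose file no longer matches the store's selection criteria are
# additionally marked with {"orphan": True}.
record = {
    "file_id": "c5a6e87f...",         # hash of the path relative to the base dir
    "path": "sample_A/scan_001.csv",  # hypothetical file
    "tags": ["experiment-A"],         # user-added metadata from update()
}
```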
Basic usage and concepts are as follows:

- On `.connect()`, the `FileStore` iterates through all files in its base directory and creates a `FileRecord` object from each. `max_depth` and `file_filters` (`fnmatch` patterns) can be passed as kwargs to restrict which files are indexed (see the usage sketch after this list).
- On `.connect()`, if the store is not read-only, the `FileStore` will create and connect to an internal `JSONStore` that creates or reads a .json file of metadata in the base directory.
- ~~`FileRecord` objects record the name, parent directory, size, hash, and last modified time of the file and populate that into the internal `MemoryStore`.~~ The `MemoryStore` is populated with dicts containing the name, full path, parent directory, size, hash, and last modified time of the file. This is accomplished via the `.read()` method.
- Each item is keyed by a `file_id`, which is computed by `read()` as the hash of the file path relative to the `FileStore` base directory. This relative path is guaranteed to be unique by the file system. By using the relative path instead of the absolute path, we make it possible to move the entire `FileStore` to a new location on disk without changing `file_id`. NOTE: we considered adding the file creation time to the hash that constitutes the `file_id`. However, I decided against this because its value may differ across platforms; e.g., the `pathlib.Path.stat().st_ctime` attribute returns different values on Windows vs. Unix/Mac (see here).
- On `update()`, anything in the document except the keys populated by `.read()` will be written out to the .json file (and read back in the next time you connect to the store). Only items with extra keys are written to the JSON (i.e., if you have 10 items in the store but add metadata to just one, only the one item will be written to the JSON). The purpose of this behavior is to prevent any duplication of data. The `file_id` (or whatever `self.key` is) and `path` are retained in the JSON file to make each metadata record manually identifiable.
- Metadata records whose file is no longer part of the store are marked with `{"orphan": True}`. This can happen if, for example, you init a `FileStore` and later delete a file, or if you init the store with the default arguments but later restrict the file selection with `max_depth` or `file_filters`. The goal with this behavior is to preserve all metadata the user may have added and prevent data loss.
- There is an `include_orphans` kwarg you can set on init that defines whether or not orphaned metadata records will be returned in queries.
- `remove_docs` deletes files, assuming the store is not read-only. It has an additional guard argument `confirm` which must be set to the non-default value `True` for the method to actually do anything.
- `query` will attempt to read the actual contents of a file up to a certain size limit (which can be adjusted via kwarg). The contents are added under a `contents` key in the returned item.
- […] (`test_files` directory)
Eager to hear thoughts!
Contributor Checklist

- Add a `FileStore` class that uses files on disk as the underlying data storage medium
- […] `remove_docs` method

Possible future enhancements (separate PRs)

- Replace `mongomock` with a faster in-memory alternative