tracking issue: File ID Manager #940
Comments
Notes from 8/4 server meeting:
|
The only discussion I recall regarding Windows was whether the " |
No—sorry, this was a poorly made joke on my end in the Zoom chat, not a serious comment. (We have been battling some issues with Jupyter Server running on Windows recently, so my comment was a bit tongue-in-cheek 😅.) Jupyter Server 2.0 will most certainly support Windows as best as possible. 😎 |
Hi @dlqqq - I just finished reading the design section (excellent write-up - thank you!) and had some comments regarding its content.
(Definite nit: the return value of |
@kevin-bates Sorry for not addressing your concerns earlier! I was out on vacation for a week, and had a small mountain of tasks waiting for me when I got back. Let me address your concerns:
Well, we are actually being consistent and not wrongly assigning
Yup! That's because you renamed the file, which changes the file's associated metadata, which forces an update to
Well, not necessarily. When you "move" a file to a remote filesystem, you're really just deleting a file on the current FS and creating a new one on the remote FS. That's why your
Absolutely. Will definitely work on this once I get this work into a separate server extension.
👍 |
I may be missing something, but I don't understand how https://github.com/jupyter-server/jupyter_server_fileid is meant to be used with jupyter-server. Could you highlight in a few commands/configs what needs to be done to make the fileid extension functional with the server? |
File ID Manager design
Right now, there is no universal way in Jupyter for developers to track a file as it is created, modified, and renamed. This is essential when developers need data to be associated with a specific file across its lifetime. The File ID Manager (FIM) is a proposed Jupyter manager that associates immutable file IDs with files and tracks the path of each file across its entire lifetime. That is, a file ID can always be resolved to the current path of the corresponding file, even when the file is moved.
Use cases
Use case: Comments
Comments is a feature request that allows users to attach comments to JupyterLab documents. This offers a richer user collaboration experience, in alignment with Jupyter’s larger goal of providing stronger features along the same vein. Right now, the comments feature request has several proposed implementations for storing comments:
`.ipynb` metadata fields)
Leveraging metadata fields limits support of this feature to file types that support hidden metadata within the blob. This feature cannot work on raw text files. Furthermore, this implementation is not extensible as it requires a custom implementation for each file type.
Sidecar files mitigate this, but this implementation requires us to pollute the directory structure with sidecar files. Furthermore, users must remember to move or copy both files in tandem or else the comments data is no longer associated with the target file or with the new copy of the target file, respectively. This breaks existing scripts that perform such filesystem operations, and potential data loss leads to a very poor developer experience.
A SQLite database that abstracts the implementation away from end-users is the most promising implementation. Here we are able to leverage the benefits of using sidecar files without polluting directory structure or compromising on user and developer experience. However, there are two glaring issues with maintaining a database between the file path and the comments data:
Feature specification
The FIM is a new manager in Jupyter Server that supports the following key features:
Furthermore, all of these features must be implemented agnostic of the underlying operating system, filesystem, and kernel. This problem is deceptively simple to state but surprisingly challenging to solve. In this document, we outline a design for the FIM that achieves these features under those constraints.
Terminology
We will use a few custom terms to discuss the design on a more specific and granular basis.
Op: A filesystem operation, mainly creates, moves, copies and deletes.
In-band/Out-of-band: In-band ops are any ops performed through the Jupyter Server Contents API, which is called by the JupyterLab UI. Out-of-band ops are ops performed through any other method, e.g. shell commands or drag-and-drop in Windows File Explorer.
Stat info: file metadata returned from the `stat()` system call.
Ino: inode number. An integer associated with each file that points to where its inode is located on disk. The inode stores all relevant metadata about the file, and thus the ino is preserved across file moves within the same filesystem.
Crtime: the file’s creation time. May not be available on all platforms.
Similar stat info: when the stat info of a previously deleted file and a newly created file have the same `ino` and `crtime` (falling back to `mtime` if `crtime` is not available on the platform). This indicates an out-of-band move.
Indexing: creating a record that associates a file path with a unique, immutable, and non-reused file ID. We say that the file ID manager “indexes” a file when it stores an association between the file path and a file ID.
Indexed-but-moved: a file which was previously indexed but moved out-of-band.
Disjoint move: an out-of-band op involving deleting a file and creating a file with identical contents at a different path rather than moving a file with `mv`. Out-of-band disjoint moves are impossible to track without storing file contents in an object database like Git. Disjoint moves include:
`cp` the original file and then `rm` the original file
Takeaway: In-band ops are easy to track
Because the Contents API manager can directly call the FIM’s methods, it’s easy to track a file across its lifetime, since the FIM is informed of all ops happening to all files in a JupyterLab session. The rest of this document focuses on a strategy to track out-of-band ops.
Looking to Git for inspiration
One key idea to note here is that tracking a file across its lifetime in a platform-agnostic manner is exactly what Git does. Git does not rely on a filesystem event daemon like `inotify` to do this, and relies purely on the files themselves. This method ensures that Git works on pretty much every platform used today.
Git uses an index file `.git/index` to track all of the files under the root, and stores the original copies of each file in the objects database at `.git/objects`. Running `git init` and then `git add .` in a directory indexes all files within the root, recursively. After a commit, the recorded entries can be listed with the `git ls-tree` command:
Each entry has an object type. `blob` types represent files, while `tree` types represent directories.
Tracking a file by reading its contents is very expensive as it requires disk reads. Hence Git relies on a heuristic obtained from file metadata.
`stat()` is a system call available to all POSIX-compliant platforms, along with some “mostly POSIX-compliant” platforms including Windows. It exposes file metadata that can be employed as a heuristic for tracking file moves. For a full list of metadata types see the system call documentation. The most relevant ones used by Git are the following:
`st_mtime`: time last modified
`st_ctime`:
`st_ino`: inode number (preserved across moves)
`st_uid`: owner user ID
`st_gid`: owner group ID
`st_size`: size in bytes
Git stores this stat info within each index entry and employs this as a heuristic. If the metadata is identical, then the file or directory is almost certainly unchanged.
Git also uses this stat info to detect new and deleted files under the Git root. When adding or deleting a file, the `mtime` of the immediate parent directory is changed. For example, adding or deleting `foo/bar/baz.txt` only updates the `mtime` of `foo/bar` and not `foo`.
Thus if the stat info for the directory is different, Git can read the current contents of that directory and compare them to the old contents of the directory. Doing this across all directories under the root allows Git to detect any created or deleted file under the Git root.
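The directory `mtime` behavior described above can be checked with a short script. This is a minimal sketch, assuming a POSIX-like filesystem; the helper name `touched_dirs_after_create` is hypothetical and only serves the illustration.

```python
import os
import tempfile


def touched_dirs_after_create(root: str) -> dict:
    """Create foo/bar/baz.txt under root and report which directory mtimes changed."""
    foo = os.path.join(root, "foo")
    bar = os.path.join(foo, "bar")
    os.makedirs(bar)

    # Pin both directories to a fixed time in the past so that any change is
    # visible regardless of the filesystem's timestamp granularity.
    past = 1_000_000_000
    os.utime(foo, (past, past))
    os.utime(bar, (past, past))

    # Adding a file should only touch the immediate parent directory.
    with open(os.path.join(bar, "baz.txt"), "w") as f:
        f.write("hello")

    return {
        "foo": os.stat(foo).st_mtime != past,
        "foo/bar": os.stat(bar).st_mtime != past,
    }


with tempfile.TemporaryDirectory() as root:
    changed = touched_dirs_after_create(root)
    print(changed)
```

Only `foo/bar` is reported as changed, which is why a scan for new or deleted files can skip any directory whose recorded `mtime` still matches.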
This functionality is deceptively powerful because it allows Git to track moves very efficiently. When you think about it, a file move is almost like deleting a file and creating a new file with similar stat info. Because this stat info is preserved across moves, whenever Git detects a new file, it can compare the stat info to any deleted files. If the stat info is identical, then the file was just moved. Otherwise, Git falls back to reading the contents of the deleted file (retained in the objects database) and diffing it against the new file. If the difference is less than 50%, then Git considers the file to be renamed.
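To illustrate why stat info works as a move heuristic, the sketch below renames a file and compares its stat info before and after. The helper `similar_stat_info` is hypothetical, and it compares `ino` and `mtime` only, since a creation time is not exposed by `os.stat()` on every platform.

```python
import os
import tempfile


def similar_stat_info(old: os.stat_result, new: os.stat_result) -> bool:
    """Hypothetical check: same inode number and same modification time."""
    return old.st_ino == new.st_ino and old.st_mtime_ns == new.st_mtime_ns


with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, "a.txt")
    dst = os.path.join(root, "b.txt")
    with open(src, "w") as f:
        f.write("data")

    before = os.stat(src)
    os.rename(src, dst)  # a move within the same filesystem
    after = os.stat(dst)
    moved = similar_stat_info(before, after)

    # A second file that exists alongside b.txt must have a different inode.
    other = os.path.join(root, "c.txt")
    with open(other, "w") as f:
        f.write("data")
    unrelated = similar_stat_info(before, os.stat(other))

print(moved, unrelated)
```

The renamed file matches its old stat info, while a separate file with identical contents does not.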
However, note that Git does not care about file copies at all. To Git, a new copy of a file is just a new file, with no history associated with it. There is also no way we can detect copies efficiently (purely from stat info) without running a diff against every single file under the index. Hence, this strategy does not detect out-of-band copies.
Implementation proposal
We maintain a single table: `Files`. This has the following schema:
`path`, `ino`, `is_dir` are indexed to speed lookups.
`id`: the file ID
`path`: the file path
`ino`: the inode number of the file
`crtime`: the time the file was created
`ctime` if on Windows, `birthtime` on macOS and other BSD-likes
`mtime`: the time the file contents were last modified
`is_dir`: 1 if the file is a directory, 0 otherwise.
FIM.init()
Create SQLite tables and indices if necessary. Then index all directories under the server root.
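As a rough illustration of the table described above, the following sketch creates a `Files` table in an in-memory SQLite database. The column types, the index names, and the use of `AUTOINCREMENT` (which keeps SQLite from reusing IDs) are assumptions made for this example, not the schema from the proposal.

```python
import sqlite3


def create_files_table(con: sqlite3.Connection) -> None:
    """Create the Files table and its indices if they do not exist yet."""
    con.execute(
        """
        CREATE TABLE IF NOT EXISTS Files (
            id INTEGER PRIMARY KEY AUTOINCREMENT,  -- never reused by SQLite
            path TEXT NOT NULL UNIQUE,             -- UNIQUE also creates an index
            ino INTEGER NOT NULL,
            crtime INTEGER,                        -- NULL where unavailable
            mtime INTEGER NOT NULL,
            is_dir INTEGER NOT NULL                -- 1 for directories, 0 otherwise
        )
        """
    )
    # Hypothetical index names; only the indexed columns come from the text.
    con.execute("CREATE INDEX IF NOT EXISTS ix_Files_ino ON Files (ino)")
    con.execute("CREATE INDEX IF NOT EXISTS ix_Files_is_dir ON Files (is_dir)")


con = sqlite3.connect(":memory:")
create_files_table(con)
columns = [row[1] for row in con.execute("PRAGMA table_info(Files)")]
print(columns)
```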
FIM._stat(path: str)
Retrieves a file’s stat info and returns it in a `StatStruct`.
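One possible shape for this method is sketched below. The field names of `StatStruct`, the free function `stat_path`, and the choice to store times as integer nanoseconds are assumptions for illustration; only the platform rule (`ctime` on Windows, `birthtime` on macOS and other BSD-likes, otherwise no creation time) comes from the schema notes above.

```python
import os
import stat
import tempfile
from dataclasses import dataclass
from typing import Optional


@dataclass
class StatStruct:
    ino: int
    crtime: Optional[int]  # nanoseconds; None if the platform exposes no creation time
    mtime: int  # nanoseconds
    is_dir: bool


def stat_path(path: str) -> Optional[StatStruct]:
    """Return the stat info for path, or None if the file does not exist."""
    try:
        raw = os.stat(path)
    except OSError:
        return None

    if os.name == "nt":
        crtime = raw.st_ctime_ns  # on Windows, ctime is the creation time
    elif hasattr(raw, "st_birthtime"):
        # macOS and BSD-likes; converting the float loses some precision
        crtime = int(raw.st_birthtime * 1_000_000_000)
    else:
        crtime = None

    return StatStruct(
        ino=raw.st_ino,
        crtime=crtime,
        mtime=raw.st_mtime_ns,
        is_dir=stat.S_ISDIR(raw.st_mode),
    )


with tempfile.TemporaryDirectory() as root:
    dir_info = stat_path(root)
    missing_info = stat_path(os.path.join(root, "missing.txt"))
```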
FIM._sync_file(path: str, stat_info: StatStruct)
This private method is what detects out-of-band moves. The key idea is to look for an existing record whose stat info is similar to that of the file at `path`, i.e. one with the same `ino` and `crtime`. If `crtime` is not available, we fall back to verifying `mtime`.
If there is a record with similar stat info, we update the existing record with the new path and stat info, then return the file ID. Otherwise this method returns `None`.
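The matching logic can be sketched against an in-memory SQLite table as follows. This is a minimal sketch under stated assumptions: the free function `sync_file`, its flattened parameters, and the reduced table are hypothetical stand-ins for the private method and schema in the proposal.

```python
import sqlite3
from typing import Optional


def sync_file(
    con: sqlite3.Connection, path: str, ino: int, crtime: Optional[int], mtime: int
) -> Optional[int]:
    """Return the ID of a record with similar stat info, repointing it to path."""
    if crtime is not None:
        query = "SELECT id FROM Files WHERE ino = ? AND crtime = ?"
        row = con.execute(query, (ino, crtime)).fetchone()
    else:
        # Fallback for platforms without a creation time.
        query = "SELECT id FROM Files WHERE ino = ? AND mtime = ?"
        row = con.execute(query, (ino, mtime)).fetchone()

    if row is None:
        return None

    # The file was indexed but moved: update the path and stat info in place.
    con.execute(
        "UPDATE Files SET path = ?, crtime = ?, mtime = ? WHERE id = ?",
        (path, crtime, mtime, row[0]),
    )
    return row[0]


con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE Files (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "path TEXT UNIQUE, ino INTEGER, crtime INTEGER, mtime INTEGER)"
)
con.execute("INSERT INTO Files (path, ino, crtime, mtime) VALUES ('old.txt', 42, 1000, 2000)")

moved_id = sync_file(con, "new.txt", ino=42, crtime=1000, mtime=2500)
new_path = con.execute("SELECT path FROM Files WHERE id = ?", (moved_id,)).fetchone()[0]
unknown_id = sync_file(con, "other.txt", ino=7, crtime=1000, mtime=2000)
```

Here the record for `old.txt` keeps its ID and is repointed to `new.txt`, while a file with an unknown `ino` yields `None`.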
FIM.index(path: str): number
First, call `_stat()` on `path` to make sure the file exists. Otherwise return `None`.
Then, call `_sync_file` on `path` to check if the file was indexed-but-moved. Return the ID if so.
Finally, create a new record for the file at `path`. Return the file ID.
FIM.get_id(path: str): number
Same as `index()` except it returns `None` if the file was not indexed-but-moved. Does not create a new record and file ID for the file at `path`.
FIM._sync_all()
Syncs all new files under the entire server root. Files moved out-of-band can only appear under dirty directories, which are:
We iterate through all dirty directories under the server root and call `_sync_file()` on all of their contents. This ensures that the correct file path is associated with each file ID.
FIM.get_path(id: number): str
Call `_sync_all()`. Then find the `path` associated with the file ID.
Next, verify if the file at the path exists. If not, then return `None`.
Otherwise return the path.
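The lookup steps above can be sketched as follows. This is only an outline under stated assumptions: `get_path` is written as a free function, the sync step is passed in as a hypothetical `sync_all` callable (a no-op in the demo), and the table is reduced to the two columns the lookup needs.

```python
import os
import sqlite3
import tempfile
from typing import Callable, Optional


def get_path(
    con: sqlite3.Connection, file_id: int, sync_all: Callable[[], None]
) -> Optional[str]:
    """Resolve a file ID to its current path, or None if it cannot be found."""
    sync_all()  # pick up out-of-band moves before answering
    row = con.execute("SELECT path FROM Files WHERE id = ?", (file_id,)).fetchone()
    if row is None:
        return None
    path = row[0]
    # Guard against out-of-band deletes: only return paths that still exist.
    return path if os.path.exists(path) else None


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Files (id INTEGER PRIMARY KEY AUTOINCREMENT, path TEXT UNIQUE)")

with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "notebook.ipynb")
    open(target, "w").close()
    con.execute("INSERT INTO Files (path) VALUES (?)", (target,))

    found_matches = get_path(con, 1, sync_all=lambda: None) == target
    os.remove(target)  # simulate an out-of-band delete
    after_delete = get_path(con, 1, sync_all=lambda: None)

missing_id = get_path(con, 999, sync_all=lambda: None)
```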
FIM.[move, copy, delete]
These methods are more straightforward and are not discussed here, as they handle in-band ops that are easier to reason about.
Summary of out-of-band ops handling
`_sync_file()` and `_sync_all()` which are called by `get_id()` and `get_path()` respectively
`get_id()` and `get_path()` methods by verifying file exists before returning
Known issues
`inotify` doesn’t emit copy events at all.
`get_path()` can be slow if you move a very large directory.
`get_path()` after moving a very large directory (`/arch`) in the Linux source tree.
`crtime` has a certain precision depending on the underlying filesystem/kernel. This can be 1 nanosecond (ext* with 256-byte inodes), 100 nanoseconds (NTFS), one second (ext* with 128-byte inodes), or two seconds (FAT/FAT32).
`ino`) without changing the `crtime`.
`crtime` implementation are unable to detect moves followed by edits.
`mtime` for directories changes whenever a file underneath is added, deleted, or renamed.
`mtime`. This is discussed further in the Open Questions section.
FAQ
Why not just use inos to identify a file?
`foo` has an ino of 1 and `boo` has an ino of 2, then if `foo` gets deleted and a new file `baz` gets created, then `baz` may be given an ino of 1. This is inappropriate for our use case as a file ID should never be reused; it should track the path of one and only one file across its lifetime.
`foo` would get attached to `baz` after the ops execute.
`crtime`) to give us more confidence in a file’s identity.
Why not just use a filesystem event daemon to watch the contents of the server root?
`ino` and `crtime`.
Why is the logic for `get_path()` so complex? Do we really need to sync all the possible dirty directories under the server root to associate the correct path to a given ID?
`get_id()` is simple because we’re given the path, and hence can easily detect an out-of-band move.
`get_path()` is trickier because we’re not given the path. Hence, we need to sync every file under all dirty directories to do so.
Open questions
dramalore: https://www.anmolsarma.in/post/linux-file-creation-time/#fn:2
`stat()` syscall.
`statx()` syscall.
`mtime` if `crtime` is not available?
`mtime` and only use `ino` to compare file identity on platforms where `crtime` is not available, then newly created files following a delete could be given the same ID.
`mtime` fallback as the default behavior, because it’s possible in the future to warn users if data is associated with a file that no longer exists. It’s much trickier to determine if data was associated with the “wrong file”.
`ContentsManager` implementation?
PRs
Future steps
`journal_mode` and `synchronous`
`_sync_all()` on an interval (e.g. 1s) when the server is on.
`jupyter mv`, `jupyter cp`, etc.)