draft: File ID Manager base implementation #921
Conversation
class FileIdManager(LoggingConfigurable):
    db_path = Unicode(
I would recommend that we name this trait database_filepath to match the SessionManager:
jupyter_server/jupyter_server/services/sessions/sessionmanager.py
Lines 153 to 161 in e59610b
database_filepath = Unicode(
    default_value=":memory:",
    help=(
        "The filesystem path to SQLite Database file "
        "(e.g. /path/to/session_database.db). By default, the session "
        "database is stored in-memory (i.e. `:memory:` setting from sqlite3) "
        "and does not persist when the current Jupyter Server shuts down."
    ),
).tag(config=True)
There is also some logic here you could borrow to validate this trait:
jupyter_server/jupyter_server/services/sessions/sessionmanager.py
Lines 163 to 181 in e59610b
@validate("database_filepath")
def _validate_database_filepath(self, proposal):
    value = proposal["value"]
    if value == ":memory:":
        return value
    path = pathlib.Path(value)
    if path.exists():
        # Verify that the database path is not a directory.
        if path.is_dir():
            raise TraitError(
                "`database_filepath` expected a file path, but the given path is a directory."
            )
        # Verify that database path is an SQLite 3 Database by checking its header.
        with open(value, "rb") as f:
            header = f.read(100)
        if not header.startswith(b"SQLite format 3") and not header == b"":
            raise TraitError("The given file is not an SQLite database file.")
    return value
I think we could borrow quite a bit of SQLite connection logic in SessionManager here to avoid duplication of code. We'll need to abstract out some of these pieces and share them across classes. For example, the definition of the connection, cursor, etc. could be abstracted out of the SessionManager and used by both of these managers:
jupyter_server/jupyter_server/services/sessions/sessionmanager.py
Lines 200 to 228 in e59610b
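For illustration only, here is a minimal sketch of what that shared SQLite plumbing could look like if it were factored out. The mixin name, trait, and properties below are assumptions for this sketch, not existing jupyter_server API.

```python
# Hypothetical sketch: a mixin holding the SQLite pieces that both
# SessionManager and a FileIdManager could reuse. Names are illustrative.
import sqlite3

from traitlets import Unicode
from traitlets.config import LoggingConfigurable


class SQLiteConnectionMixin(LoggingConfigurable):
    database_filepath = Unicode(
        default_value=":memory:",
        help="Filesystem path to the SQLite database file.",
    ).tag(config=True)

    _connection = None

    @property
    def connection(self):
        # Open the connection lazily and reuse it for the manager's lifetime.
        if self._connection is None:
            self._connection = sqlite3.connect(
                self.database_filepath, isolation_level=None
            )
        return self._connection

    @property
    def cursor(self):
        return self.connection.cursor()

    def close(self):
        if self._connection is not None:
            self._connection.close()
            self._connection = None
```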
|
Thank you for working on this, @dlqqq! I think a File/Document ID will be really useful for a lot of use cases! I'm excited to see this land.
My number one concern about the current implementation is that the file ID isn't strictly unique—i.e. it's just an index in the database. In order to make this service more broadly useful, I think these IDs must be unique. Let me give a simple example...
This service could be useful for sharing notebooks between two running Jupyter Servers (not necessarily RTC, just basic copying between two servers)—an enhancement many people have asked us for. In this example, each server has its own file ID database that maps file IDs to their path on the individual filesystems. If we "share" this file from one server to another, I would hope we could use this service to have an (admittedly weak) link between …
Yes, thank you!
Sorry for jumping in here without reading the complete history... have the requirements for distributed systems/servers like JupyterHub been taken into account (uniqueness, portability across servers, ...)?
@Zsailer Hey Zach! Thank you for taking the time to review my progress so far. Right now, I'm working on implementing handling for out-of-band filesystem operations, along with a design doc that dives deeper into how the implementation actually works and some of its shortcomings. I've addressed a few of your questions below and will address the others once I'm done with my current work. Keep in mind, though, that the logic for handling out-of-band operations gets a bit more tricky, so it'll likely need a re-review.
Well, the SQL connection/statement logic is fairly simple and I'm not sure if it's worth abstracting. The SQL statements are all fairly self-contained and only take 2, maybe 3 lines per statement. Over-abstracting and splitting the source over too many different files hinders readability. But I'll take a closer look at this concern when I'm done with my current work.
I'm lost as to why a file ID needs to be, say, "globally unique" to support this use case. This is just a file copy. If the receiving server needs data (e.g. comments) attached to that copy, the sending server sends that data in tandem. The receiving server is then free to assign a different ID to that new file. IMO, there's no reason for these two servers to share a file ID for a given file on two separate filesystems because... well, they're not the same file. A copy of a file is a new file. This is how it currently works on local filesystems. Circling back to commenting: if I add a comment to one file, I should not be adding a comment to another file just because it's a copy of the original and shares the same ID. Shared IDs across files lead to shared state across files, which gets complex really fast. I'm mainly giving pushback here because using a UUID for the primary key is a performance detriment. I'm not giving a firm "no" here.
@echarles Great questions! Yes, these concerns will be taken into much further study, but only after we handle operations on local filesystems. Could you elaborate more on what you mean by "distributed [Jupyter] servers"? AFAIK JupyterHub is still a single server, so it seems more accurate to refer to it as "remote" rather than "distributed". Regarding portability and uniqueness, could you elaborate more on these as well? @kevin-bates brought these points up too. Do you mean that file IDs should be globally unique as mentioned earlier? Again, I'm still a bit lost on the use case for this; to my knowledge, a Jupyter server only runs on a single machine, and a client can't connect to multiple Jupyter servers.
Rough sketch on how to handle remote Jupyter servers/filesystems: the idea is that the …
(disclaimer: I have not taken the needed time to read and digest the dense and very useful discussion around the file id service, but I think it is not too early to lay down some requirements)
Well, if the design and the technical implementation turn out to be fairly different when we look at this later, that may be annoying (e.g. if the database generated in the case of many/distributed/remote servers is different from the local database). To what extent, I'm not sure; one could argue we don't really care, as the local database would not be migrated to any such distributed system anyway. But playing devil's advocate, I could imagine my local server has a custom server extension that sends those IDs to a corporate backend, and that one day my notebooks would have to be served from JupyterHub. The reverse is also true: a JupyterHub instance could allow the download of a notebook that could then be run on my local laptop. In short, I would prefer to cover the JupyterHub case in the design.
I am thinking of a case where the notebook files are loaded from a shared storage system, e.g. NFS. In that case, the user goes to the server 1 instance today and to the server 2 instance tomorrow. Another case is RTC, where a server hosted by JupyterHub is consumed by multiple users. So true, "distributed" is a bad name; IMHO a server is always "remote". "Ephemeral" may be better, but still not very good.
Hi @dlqqq - thanks for getting this started. Here are some initial comments.
I think there are some things to work out - primarily uniqueness in both the id (UUID vs. Integer) and path (how to deal with different root_dir values). FWIW, I took a stab at converting the implementation to use UUIDs here.
Is it expected that things like "comments", or anything that references a File ID, be in this same database? If not, I think that argues for UUIDs as well - otherwise external references could get a false positive and be "linked" to files other than what is intended.
@vidartf has a good point of perhaps decoupling this from the ContentsManager implementation and using the event bus.
jupyter_server/serverapp.py (Outdated)
@@ -1886,9 +1887,9 @@ def init_configurables(self):
    connection_dir=self.runtime_dir,
    kernel_spec_manager=self.kernel_spec_manager,
)
self.file_id_manager = FileIdManager(parent=self, log=self.log)
We should probably define a file_id_manager_class configurable like the other managers define so folks can BYO their own FID manager. This also suggests defining an abstract base class that is then referenced in the klass attribute, which allows folks to either derive from FileIdManager or essentially implement their own.
Good catch! I forgot to define FileIdManager as a trait on ServerApp. Before I fix this, is there any reason to prefer
self.file_id_manager_class = Type(klass=FileIdManager)
...
self.file_id_manager = self.file_id_manager_class(*args)
over just using
self.file_id_manager = Instance(klass=FileIdManager, args=(*args))
I don't think either is what we want.
Both imply that one has to derive from FileIdManager - which we shouldn't impose on anyone. So klass should reference an ABC (e.g., FileIdManagerABC) that only defines the public APIs as abstract. It could also contain validation logic for the db-path trait, etc.
So the user would invoke the server using --ServerApp.file_id_manager_class=my.package.MyFileIdManager, the trait would ensure that my.package.MyFileIdManager is an instance of FileIdManagerABC, and whatever interacts with the file ID manager class would be relegated to the methods defined in the ABC.
Folks that want to slightly adjust the OOTB implementation of FileIdManager (or override all of its methods entirely) are free to do so via subclassing, as they are still an instance of the ABC.
For example, with the last set of changes, an entirely different implementation is required if any non-filesystem-based ContentsManager is configured (unfortunately).
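To make the suggestion above concrete, here is a minimal sketch of the ABC-plus-trait pattern being described. The class and method names (FileIdManagerABC, index, get_id, get_path) and the trait wiring are assumptions for illustration, not the PR's final API.

```python
# Hypothetical sketch of the ABC + configurable-class pattern discussed above.
from abc import ABC, abstractmethod

from traitlets import Type
from traitlets.config import LoggingConfigurable


class FileIdManagerABC(ABC):
    """Public API only; concrete managers may store IDs however they like."""

    @abstractmethod
    def index(self, path):
        """Assign (or return an existing) file ID for `path`."""

    @abstractmethod
    def get_id(self, path):
        """Return the file ID for `path`, or None if it is not indexed."""

    @abstractmethod
    def get_path(self, id):
        """Return the current path for `id`, or None if unknown."""


class ServerAppSketch(LoggingConfigurable):
    # Stand-in for ServerApp. Users could then pass
    #   --ServerApp.file_id_manager_class=my.package.MyFileIdManager
    # and traitlets would verify the class is a FileIdManagerABC subclass.
    file_id_manager_class = Type(
        klass=FileIdManagerABC,
        help="The FileIdManager class to use; must implement FileIdManagerABC.",
    ).tag(config=True)
```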
Hey @kevin-bates! Thanks for the review. You make some excellent points. However, I literally was just about to push my latest changes on top of this PR. I'll do my best to address your comments soon. To the rest of the team: I wanted to segment design discussion into a separate tracking issue, since this project will likely have to span multiple PRs due to its size. See here: #940
I'm pushing my latest changes (which track out-of-band filesystem operations) now. EDIT: oh geez, the commit history is weirdly interwoven with review comments. Yikes. LMK if you all want me to rebase and edit the commit dates.
# uniqueness constraint relaxed here because we need to keep records
# of deleted files which may occupy same path
"path TEXT NOT NULL, "
"ino INTEGER NOT NULL UNIQUE, "
Inode (st_ino) is not unique by itself and needs to be coupled with st_dev to gain uniqueness within a set of devices managed by a given server. I also think we will have issues when mounted file systems are in play (e.g., I think I recall that the inode reflects that of the NFS server, for example, not necessarily the client mounting the filesystem, besides other idiosyncrasies that will creep in as these kinds of things vary with OS), not to mention non-file systems like Databases or Cloud Object Storage, etc. So this feels like a bit of a slippery slope, IMHO.
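For reference, a tiny sketch of what keying on the (st_dev, st_ino) pair would look like. This only illustrates the point above; it is not code from the PR.

```python
import os


def file_key(path):
    """Identify a file by (device, inode) rather than inode alone.

    st_ino is only unique per device, so pairing it with st_dev avoids
    collisions across filesystems mounted on the same server.
    """
    stat = os.stat(path)
    return (stat.st_dev, stat.st_ino)
```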
Ah... right. Good catch. I'll see what I can do about this. I think by default, the File ID Manager will only support local filesystems and NFS. For other cloud services, it's possible that they already expose a file ID, so the custom File ID manager implementation should be straightforward.
For other cloud services, it's possible that they already expose a file ID, so the custom File ID manager implementation should be straightforward.
I believe this is saying that non-file system-based ContentsManager derivations must implement their own FileIdManager. Is that correct?
Regarding NFS, I wouldn't be surprised if the st_dev value changes if a previously mounted FS is unmounted, then remounted (following the mount of others or even a reboot). (This hunch may be confirmed if I'm reading things correctly.)
I believe this is saying that non-file system-based ContentsManager derivations must implement their own FileIdManager. Is that correct?
If this is the case (prolly it is), the FileIdManager implementation should be configurable (like the ContentManager...). It may already be the case (have not looked at the code, sorry).
@echarles - this is discussed here: #921 (comment) although I don't think we should require that.
Personal benchmarks show a very significant performance gain when calling get_path(). This is because we are no longer scanning all the contents of every indexed-but-moved directory in _sync_all(), but rather intelligently appending the new paths of the directories to check for dirtiness to a deque used in the body of _sync_all().
Also fixes a few bugs (sorry, couldn't split this commit up):
- Indexing symlinks always indexes the real path that symlink refers to
- Fixes bug with the way new paths were calculated in recursive moves
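As a rough illustration of the deque-based traversal described in this commit message (the function names, the sync_file callback, and the overall shape are assumptions; the real implementation in the PR differs in detail):

```python
# Illustrative sketch only: walk the tree with a deque and only queue the new
# paths of directories detected as moved, instead of rescanning every indexed
# directory eagerly.
import os
from collections import deque


def sync_all(root_dir, sync_file):
    """Hypothetical re-sync pass.

    `sync_file(entry)` is assumed to reconcile one directory entry against the
    file ID database and return True if the entry was detected as moved.
    """
    dirty_dirs = deque([root_dir])
    while dirty_dirs:
        current = dirty_dirs.popleft()
        for entry in os.scandir(current):
            moved = sync_file(entry)
            if entry.is_dir() and moved:
                # A moved directory may contain moved children, so check its
                # new path for dirtiness; unmoved directories are skipped.
                dirty_dirs.append(entry.path)
```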
Codecov Report
@@ Coverage Diff @@
## main #921 +/- ##
==========================================
+ Coverage 71.44% 72.66% +1.21%
==========================================
Files 65 66 +1
Lines 7705 8084 +379
Branches 1289 1339 +50
==========================================
+ Hits 5505 5874 +369
+ Misses 1805 1804 -1
- Partials 395 406 +11
@kevin-bates Thank you for your suggestions! I addressed as many as I could, but I believe it's best if I start working on separating my work into a standalone server extension. The great thing about this is that we can split up each of your concerns into a separate PR to address them.
Hey team! This is now migrated into a separate server extension: https://github.com/jupyter-server/jupyter_server_fileid. Thank you all for your review comments! I've left open review questions as issues on that repo.
Great discussion, a few comments:
* Right now real-time collaboration only works where multiple users access a single Jupyter Server. The case where multiple users access multiple servers isn't handled yet, and a lot of other things would have to change to get that working. As such, I don't think the File ID service needs to solve this case yet. However, we should look forward to when it would have to deal with that case.
* The scoping of a given FileID service is a ContentsManager. Thus, if a user has multiple ContentsManagers serving different file systems, they would presumably have one file ID service for each of these. For example, if a user has a local file system and an S3 bucket exposed through 2 ContentsManagers, they would have two File ID services.
* Related to this case, I can imagine situations where a user wants to move or copy a file between different ContentsManagers (copy a file to S3, for example). In that case, it may help to have UUIDs that are globally unique.
(This is why it would have been nice if FileIDs were a function of (and emitted from) the ContentsManager.) Since the FileID service is now a consumer of ContentsManager events, how will a given FileID service instance determine that the received event was relative to a LargeFileContentsManager, an S3ContentsManager, or a FooContentsManager?
Yes, but more importantly, once objects referencing FileIDs find their way to other systems (via sharing or whatever), they will be incorrectly associated with the wrong files until UUIDs are used as the primary ID.
Kevin, I think you are asking the right questions about how the file id service will work across different file systems. At the same time, there are a lot of other issues with how Jupyter Lab/Server work with different and multiple file systems. We plan on tackling those questions after JupyterLab 4 and Jupyter Server 2 are out, and that work will include all of the questions you are asking about. For now, we want to make sure that the basic usage cases of commenting, notebook jobs, and RTC work well when files/directories are moved/renamed in JLab 3 and 4.
I see, thank you for your response. Just to be clear, I'm not as worried about the filesystem issues as I am about avoiding data integrity emergencies - per my second point above - which we can continue discussing in the linked issue.
Overview
Draft implementation of File ID service proposed here.
File ID Service maintains a bi-directional mapping between file IDs and file names. This service permits bi-directional lookup, and preserves file ID across filesystem operations done through the ContentsManager, as sketched below.
- Implement FileIdManager class with assoc. unit tests
- Integrate FileIdManager methods into ContentsManager and AsyncContentsManager filesystem methods
- Add an id property that contains the file ID to the Contents API GET/POST/PUT/PATCH /api/contents/{path} operations
- Add a File ID Manager benchmark at jupyter_server/benchmarks/fileidmanager_benchmark.py
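As a rough usage sketch of the bi-directional lookup described above (the import path and the method names index, get_id, and get_path are assumptions based on this discussion, not the final API):

```python
# Hypothetical usage of the draft FileIdManager; import path and method names
# are illustrative assumptions.
from jupyter_server.services.contents.fileidmanager import FileIdManager

fid_manager = FileIdManager()

# Forward lookup: path -> ID. index() is assumed to assign an ID if none exists.
file_id = fid_manager.index("notebooks/analysis.ipynb")
assert fid_manager.get_id("notebooks/analysis.ipynb") == file_id

# Reverse lookup: ID -> path. After a move performed through the
# ContentsManager, the ID is preserved and get_path() returns the new path.
assert fid_manager.get_path(file_id) == "notebooks/analysis.ipynb"
```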
Testing
To test with JupyterLab stable, developers can run the following in a separate conda environment:
Benchmarks
To run the File ID Manager benchmark:
Benchmarks run on an m5.12xlarge AWS EC2 instance.