-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[apiserver]: fix jgscm root folder creation, dependencies #1
Conversation
cleanup, from failed checkout attempt cleanup more cleanup cleanup cleanup
…presence of GoogleStorageContentManager.default_path
jgscm/__init__.py
Outdated
""" | ||
Splits the path into bucket name and path inside the bucket. | ||
:param path: string to split. | ||
:return: tuple(bucket name, bucket path). | ||
""" | ||
if self.default_path and self.default_path not in path: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you want path.startswith(self.default_path)
, not in?
jgscm/__init__.py
Outdated
""" | ||
Splits the path into bucket name and path inside the bucket. | ||
:param path: string to split. | ||
:return: tuple(bucket name, bucket path). | ||
""" | ||
if self.default_path and self.default_path not in path: | ||
if path[0] == "/": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you want urllib.parse.urljoin
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, if you use urljoin
, I think you need the startswith
bit, it handles concatenation of absolute URLs appropriately:
>>> urllib.parse.urljoin('gs://bucket/foo', 'gs://bucket/bar')
'gs://bucket/bar'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately the default_path is a misnomer, and not a path. It's the default bucket (name), such as "user-kljkldjf".
setup.py
Outdated
install_requires=["google-cloud>=0.32.0", "notebook>=4.2", "nbformat>=4.1", | ||
"tornado>=4", "traitlets>=4.2"], | ||
install_requires=["google-api-python-client", | ||
"google-cloud-storage", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should version these dependencies analogously.
@@ -460,7 +460,8 @@ def run_post_save_hook(self, model, os_path): | |||
model=model, | |||
contents_manager=self) | |||
except Exception: | |||
self.log.error("Post-save hook failed on %s", os_path, exc_info=True) | |||
self.log.error("Post-save hook failed on %s", | |||
os_path, exc_info=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unnecessary format change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, that's fine.
setup.py
Outdated
@@ -4,16 +4,18 @@ | |||
setup( | |||
name="jgscm", | |||
description="Jupyter Google Cloud Storage ContentsManager", | |||
version="0.1.9", | |||
version="0.2.0", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should signify this is not the original version, but our derived one. Also, this seems like a patch version change. How about 0.1.10-hail
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The -
syntax isn't a valid PEP 440 version number, for this kind of situation we're supposed to suffix a +hail
which is called a "local version identifier".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, it's not semvar? My bad? Good to know.
jgscm/__init__.py
Outdated
""" | ||
Splits the path into bucket name and path inside the bucket. | ||
:param path: string to split. | ||
:return: tuple(bucket name, bucket path). | ||
""" | ||
if self.default_path and not path.startswith(self.default_path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the check for "/" prefix that I had written in the last commit. At some point, if we decide we want to continue using this library, I think we can simplify the program, to stop stripping leading "/" in every function that also calls _parse_path (at which time we'll add the leading "/" check back into _parse_path). Until then, it appears _parse_path is never called with a leading "/".
I can add that back in for safety.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Until then, it appears _parse_path is never called with a leading "/".
Ah, OK. An assert to that effect might be good documentation.
One last potential bug: path might start with default_path, but it is just a prefix (e.g. default_path = foo, path = foobar/baz). In general, this is never going to work if you create a file in the bucket with the same name as the bucket.
""" | ||
Splits the path into bucket name and path inside the bucket. | ||
:param path: string to split. | ||
:return: tuple(bucket name, bucket path). | ||
""" | ||
if self.default_path and not path.startswith(self.default_path): | ||
path = f"{self.default_path}/{path}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I think I'm starting to see: path is not a URL but it may or may not include the bucket.
Hmm, this doesn't do what the old code does. If default_path doesn't have a leading slash, you probably need to add it in the call to startswith
. If path has a leading slash, then this code doesn't behave the same as the previous one. You probably want something like os.path.join(f'/{default_path}', path)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
default_path shouldn't have a leading slash https://github.com/src-d/jgscm#usage
c.GoogleStorageContentManager.default_path = 'path/without/starting/slash'
The _parse_path needs to return the bucket_name (no slash), and the rest of the path, with the slash separating bucket_name and the rest of the path stripped. The call right below my change is:
bucket, _, blobname = path.partition("/")
return bucket, blobname
which splits on the first slash, and strips that first slash (because the middle argument is dropped).
The behavior we want:
_parse_path("your_bucket_name_with_slash/")
we should get("your_bucket_name", "")
_parse_path("some-folder")
we should get("your_bucket_name", "some_folder")
- This would previously fail, you would get
("some_folder", "")
, which would result in the failed get operation.
- This would previously fail, you would get
_parse_path("some-folder/")
we should get("your_bucket_name", "some_folder/")
_parse_path("your_bucket_name/some-folder")
we should get("your_bucket_name", "some_folder")
_parse_path("your_bucket_name/some-folder/some_file")
we should get("your_bucket_name", "some-folder/some_file")
I've attached some examples of the call stack:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only behavior that this changes is that when you create a bucket in the root dir, and you have default_path
(bucket name) set, it will use that bucket, rather than treat the directory being created as a new bucket. That's why I had treated this as a breaking change.
And I realized it actually fixes another issue. In the old jgscm, when you created 2 python files in the root dir, the 2nd would overwrite the first. This fixes that as well. See the below before_change.log
call stack, notice it overwrites Untitled.ipynb.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Enumerating the cases makes it clear this design has a real ambiguity when there is a folder with the same name as a bucket, but let's deal with that after this goes in.
Your code still breaks on the case there is a folder that starts with (but is not equal to) the bucket name, my recommendation of startswith
is bad.
when you create a bucket in the root dir
You mean folder?
That's why I had treated this as a breaking change.
This really seems like a bug (creating things outside the default path), which is why I'd call it a patch release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Enumerating the cases makes it clear this design has a real ambiguity when there is a folder with the same name as a bucket, but let's deal with that after this goes in.
Yes, I agree, I don't like the lack of separation between folder and bucket taken in jgscm. I think the more stable solution is to always prefix the filename with the bucket name, unless in the root folder, in which case the behavior will be different depending on the flag passed (and that maybe shouldn't be default_path; someone may want to be able to create buckets even when a default bucket is specified).
Your code still breaks on the case there is a folder that starts with (but is not equal to) the bucket name, my recommendation of startswith is bad.
I will attempt to address this now.
You mean folder?
Yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it's addressed by checking for
path.startswith(f"{self.default_path}/")
This problem only affects renaming and moving, since creation uses a default name (untitled-folder
). Of the two, this fixes renaming, but moving still acts strangely with folder names that are the same as bucket names. However, jgscm moving logic is in general funny, because it completely ignores the semantic differences between folders and files (since buckets don't really distinguish these, and there's a lot of / stripping happening, in the jupyter class it extends).
Here is an example of this working renaming.mp4.zip
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I propose that we defer the moving fix for another PR, so that Konrad/Liam can test other functions. However, can work further on this if you think worth it. Whatever you prefer :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd open an issue for the file/folder-as-bucket-name problem and make it low priority. Getting IO (hadoop stuff) and user isolation going seem much more important.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, looks good now. Thanks for explaining.
Normalizes "directory" and "file" creation when
GoogleStorageContentManager.default_path
is specified. Previously, whendefault_path
was specified creating a file (say a Python file) would correctly create a blob in the bucket, while creating a directory would attempt to generate a new bucket (despite directories being blobs).As far as I can tell this is fixes all encountered issues besides the listing of the bucket in the UI ("/" instead of "/bucket" in the browser): creating folders in root, renaming folders, making multiple folders (previously only 1 directory - "untitled-folder" - would be created/overwritten for N directory creation operations)
This fix does not affect behavior when
GoogleStorageContentManager.default_path
is unspecified. This original behavior is required by cloudtools.Addresses the following open issues in the src-d/jgscm repo:
src-d#14 (dependencies)
src-d#13 (folder creation in root)
src-d#13 (comment)
cc @danking, @jigold