Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[apiserver]: fix jgscm root folder creation, dependencies #1

Merged
merged 7 commits into from
Apr 8, 2019

Conversation

akotlar
Copy link

@akotlar akotlar commented Apr 6, 2019

Normalizes "directory" and "file" creation when GoogleStorageContentManager.default_path is specified. Previously, when default_path was specified creating a file (say a Python file) would correctly create a blob in the bucket, while creating a directory would attempt to generate a new bucket (despite directories being blobs).

As far as I can tell this is fixes all encountered issues besides the listing of the bucket in the UI ("/" instead of "/bucket" in the browser): creating folders in root, renaming folders, making multiple folders (previously only 1 directory - "untitled-folder" - would be created/overwritten for N directory creation operations)

This fix does not affect behavior when GoogleStorageContentManager.default_path is unspecified. This original behavior is required by cloudtools.

Addresses the following open issues in the src-d/jgscm repo:
src-d#14 (dependencies)
src-d#13 (folder creation in root)
src-d#13 (comment)

  • "I also see an issue where if I create two new notebooks, they're both named "untitled" and the second overwrites the first."

cc @danking, @jigold

cleanup, from failed checkout attempt

cleanup

more cleanup

cleanup

cleanup
cseed
cseed previously requested changes Apr 8, 2019
"""
Splits the path into bucket name and path inside the bucket.
:param path: string to split.
:return: tuple(bucket name, bucket path).
"""
if self.default_path and self.default_path not in path:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you want path.startswith(self.default_path), not in?

"""
Splits the path into bucket name and path inside the bucket.
:param path: string to split.
:return: tuple(bucket name, bucket path).
"""
if self.default_path and self.default_path not in path:
if path[0] == "/":
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you want urllib.parse.urljoin here.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, if you use urljoin, I think you need the startswith bit, it handles concatenation of absolute URLs appropriately:

>>> urllib.parse.urljoin('gs://bucket/foo', 'gs://bucket/bar')
'gs://bucket/bar'

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately the default_path is a misnomer, and not a path. It's the default bucket (name), such as "user-kljkldjf".

setup.py Outdated
install_requires=["google-cloud>=0.32.0", "notebook>=4.2", "nbformat>=4.1",
"tornado>=4", "traitlets>=4.2"],
install_requires=["google-api-python-client",
"google-cloud-storage",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should version these dependencies analogously.

@@ -460,7 +460,8 @@ def run_post_save_hook(self, model, os_path):
model=model,
contents_manager=self)
except Exception:
self.log.error("Post-save hook failed on %s", os_path, exc_info=True)
self.log.error("Post-save hook failed on %s",
os_path, exc_info=True)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary format change.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a auto-format line-length change, 79 character length, formatted by autopep8. Should I force it to save without formatting?
Screen Shot 2019-04-08 at 1 30 37 PM

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that's fine.

setup.py Outdated
@@ -4,16 +4,18 @@
setup(
name="jgscm",
description="Jupyter Google Cloud Storage ContentsManager",
version="0.1.9",
version="0.2.0",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should signify this is not the original version, but our derived one. Also, this seems like a patch version change. How about 0.1.10-hail?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The -syntax isn't a valid PEP 440 version number, for this kind of situation we're supposed to suffix a +hail which is called a "local version identifier".

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, it's not semvar? My bad? Good to know.

"""
Splits the path into bucket name and path inside the bucket.
:param path: string to split.
:return: tuple(bucket name, bucket path).
"""
if self.default_path and not path.startswith(self.default_path):
Copy link
Author

@akotlar akotlar Apr 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the check for "/" prefix that I had written in the last commit. At some point, if we decide we want to continue using this library, I think we can simplify the program, to stop stripping leading "/" in every function that also calls _parse_path (at which time we'll add the leading "/" check back into _parse_path). Until then, it appears _parse_path is never called with a leading "/".

I can add that back in for safety.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Until then, it appears _parse_path is never called with a leading "/".

Ah, OK. An assert to that effect might be good documentation.

One last potential bug: path might start with default_path, but it is just a prefix (e.g. default_path = foo, path = foobar/baz). In general, this is never going to work if you create a file in the bucket with the same name as the bucket.

@akotlar akotlar dismissed cseed’s stale review April 8, 2019 17:37

addressed, thanks!

"""
Splits the path into bucket name and path inside the bucket.
:param path: string to split.
:return: tuple(bucket name, bucket path).
"""
if self.default_path and not path.startswith(self.default_path):
path = f"{self.default_path}/{path}"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I think I'm starting to see: path is not a URL but it may or may not include the bucket.

Hmm, this doesn't do what the old code does. If default_path doesn't have a leading slash, you probably need to add it in the call to startswith. If path has a leading slash, then this code doesn't behave the same as the previous one. You probably want something like os.path.join(f'/{default_path}', path).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

default_path shouldn't have a leading slash https://github.com/src-d/jgscm#usage

  • c.GoogleStorageContentManager.default_path = 'path/without/starting/slash'

The _parse_path needs to return the bucket_name (no slash), and the rest of the path, with the slash separating bucket_name and the rest of the path stripped. The call right below my change is:

 bucket, _, blobname = path.partition("/")
 return bucket, blobname

which splits on the first slash, and strips that first slash (because the middle argument is dropped).

The behavior we want:

  • _parse_path("your_bucket_name_with_slash/") we should get ("your_bucket_name", "")
  • _parse_path("some-folder") we should get ("your_bucket_name", "some_folder")
    • This would previously fail, you would get ("some_folder", ""), which would result in the failed get operation.
  • _parse_path("some-folder/") we should get ("your_bucket_name", "some_folder/")
  • _parse_path("your_bucket_name/some-folder") we should get ("your_bucket_name", "some_folder")
  • _parse_path("your_bucket_name/some-folder/some_file") we should get ("your_bucket_name", "some-folder/some_file")

I've attached some examples of the call stack:

Screenshot 2019-04-08 13 54 11

Screenshot 2019-04-08 13 54 32

Screenshot 2019-04-08 13 54 52

Screenshot 2019-04-08 13 55 14

Copy link
Author

@akotlar akotlar Apr 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only behavior that this changes is that when you create a bucket in the root dir, and you have default_path (bucket name) set, it will use that bucket, rather than treat the directory being created as a new bucket. That's why I had treated this as a breaking change.

And I realized it actually fixes another issue. In the old jgscm, when you created 2 python files in the root dir, the 2nd would overwrite the first. This fixes that as well. See the below before_change.log call stack, notice it overwrites Untitled.ipynb.

before_change.log
after_change.log

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enumerating the cases makes it clear this design has a real ambiguity when there is a folder with the same name as a bucket, but let's deal with that after this goes in.

Your code still breaks on the case there is a folder that starts with (but is not equal to) the bucket name, my recommendation of startswith is bad.

when you create a bucket in the root dir

You mean folder?

That's why I had treated this as a breaking change.

This really seems like a bug (creating things outside the default path), which is why I'd call it a patch release.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enumerating the cases makes it clear this design has a real ambiguity when there is a folder with the same name as a bucket, but let's deal with that after this goes in.

Yes, I agree, I don't like the lack of separation between folder and bucket taken in jgscm. I think the more stable solution is to always prefix the filename with the bucket name, unless in the root folder, in which case the behavior will be different depending on the flag passed (and that maybe shouldn't be default_path; someone may want to be able to create buckets even when a default bucket is specified).

Your code still breaks on the case there is a folder that starts with (but is not equal to) the bucket name, my recommendation of startswith is bad.

I will attempt to address this now.

You mean folder?
Yes

Copy link
Author

@akotlar akotlar Apr 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it's addressed by checking for

path.startswith(f"{self.default_path}/")

This problem only affects renaming and moving, since creation uses a default name (untitled-folder). Of the two, this fixes renaming, but moving still acts strangely with folder names that are the same as bucket names. However, jgscm moving logic is in general funny, because it completely ignores the semantic differences between folders and files (since buckets don't really distinguish these, and there's a lot of / stripping happening, in the jupyter class it extends).

Here is an example of this working renaming.mp4.zip

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose that we defer the moving fix for another PR, so that Konrad/Liam can test other functions. However, can work further on this if you think worth it. Whatever you prefer :)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd open an issue for the file/folder-as-bucket-name problem and make it low priority. Getting IO (hadoop stuff) and user isolation going seem much more important.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link

@cseed cseed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, looks good now. Thanks for explaining.

@akotlar akotlar merged commit 1695db9 into hail-is:master Apr 8, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants