Allowing Project Paths to be passed to scrunch.dataset.fork() method #463

shaikh-ma · 2024-07-25T11:30:45Z

The owner parameter of the fork() method allows creating a dataset fork in a defined project location.
But to make it work the owner parameter always needs to be the Project URL e.g., https://<company>.crunch.io/api/projects/<project_id>/.

This often requires the user, an additional step, to manually find out the project URL to pass as the owner parameter.

It is comparatively easier to work with the project folder paths e.g., "Client datasets | Projects A | Project A - 1 " than the project URL, which also is more human friendly.
Hence, this Pull Request adds a new feature to fork() method to allow the user to pass the project path to the owner parameter as well, as requested in the issue.

The previous functionality of the fork() method is still the same, only the behavior of the owner parameter has been updated.

On passing project path (instead of a project URL), the fork() method now automatically maps the folder path to its respective Project URL. Thus, preventing the user an additional step for finding out the Project URL manually.

Examples:

import scrunch

ds = scrunch.get_dataset("Dataset X", project="Project A")

# Passing project path to `owner`
ds_fork = ds.fork(preserve_owner=False, owner="Project B | Project 1")


# Passing project URL to `owner`
ds_fork = ds.fork(preserve_owner=False, owner="https://<company>.crunch.io/api/projects/<project_id>/")

Additionally, this PR handles the timeout error (TaskProgressTimeoutError) which often occurs when forking large datasets.

shaikh-ma · 2024-08-13T07:32:40Z

Hello @jjdelc, @aless10, 👋

Please could you take a look at this PR, when you get a chance.

Thanks

jjdelc

The intention of this ticket is correct, it is a good direction to move away from dataset "owners" and think of them as living inside a project.

But I don't think this implementation reflects that.

What I suggest is using a new argument called project and have it receive a Project instance (or maybe a project ID).

That way you can call dataset.fork(project=destination_project)

And I think that the default behavior should be:

def fork(self, ..., project=None, ...):
  project = project or self.project

This way, new forks without any location information would be created on the same project as the current dataset (the user being an editor on the dataset to issue a fork, should also have edit permissions on the project to place the new fork)

Bonus points: remove the owner stuff completely, it is not an official use case anymore, is anybody using it?

shaikh-ma · 2024-08-15T09:32:20Z

This sounds a great idea @jjdelc! 🙂

Initially @alextanski & I, had a similar thought to add a new parameter "project".

But we had few concerns about it, then:

It might become confusing both the developer & the users when/how to use the owner & project parameter. And their behaviour with preserve_owner.
Handling of undesired scenarios e.g., user might be passing both these parameters or none. Et cetra.

Hence, the current changes were aimed to not change the current behaviour of the fork() method while adding the needed features additionally. So as to maintain simplicity & backward compatibility.

As you've now suggested to remove the "owner" completely, just a few questions:

How should we handle the scenario when the user needs the fork dataset to be created in their personal project?
Should we rename or remove the preserve_owner parameter?

Thanks

jjdelc · 2024-08-19T11:34:00Z

You make a good point about backwards compatibility, and I think it should be easy to handle in inside the method.

For now, you may need to keep the owner behavior as it is. And add the new project field. Handle inside the call if both are sent raise an error and suggest to only send project, if none are sent, follow the current behavior.

My other point is not to require the user to send URLs, but allow for a Project() instance to be received, I would like to support arbitrary paths too, but you'll risk it that the path may not exist and have to then take care of creating and such, best to have that responsibility outside of this .fork() call and once they solved that they would have a Project() instance.

Also, add a test checking your payload.

jjdelc · 2024-08-19T11:36:11Z

scrunch/datasets.py

        if preserve_owner:
            body['owner'] = self.resource.body.owner
+        elif owner:
+            body["owner"] = (
+                owner if owner.startswith("http") else get_project(owner).url


These should be encouraged to be Crunch objects rather than strings. For projects don't allow strings, urls or paths. Because you'll have problems validating here.

Have this be a Project instance.

question: @jjdelc Double-checking on this - Isn't get_project() validating the string for us? It will return a Project instance if the path can be resolved, and raise an InvalidPathError it cannot:

InvalidPathError Traceback (most recent call last) Cell In[8], line 7 4 ygdc_crunch.Crunch() 6 name_pr = "TechOps|Alex 3" ----> 7 pr = scrunch.get_project(name_pr) File ~\AppData\Local\Continuum\anaconda2\envs\py311test\Lib\site-packages\scrunch\datasets.py:206, in get_project(project, connection) 203 _project = Project(ret) 205 if sub_project: --> 206 _project = _project.get(sub_project) 208 return _project File ~\AppData\Local\Continuum\anaconda2\envs\py311test\Lib\site-packages\scrunch\datasets.py:552, in Project.get(self, path) 550 for p_name in Path(path).get_parts(): 551 try: --> 552 node = node.get_child(p_name) 553 except KeyError: 554 raise InvalidPathError('Project not found %s' % p_name) File ~\AppData\Local\Continuum\anaconda2\envs\py311test\Lib\site-packages\scrunch\datasets.py:571, in Project.get_child(self, name) 568 return Project(tup.entity) 569 return self.root.dataset[name] --> 571 raise InvalidPathError('Invalid path: %s' % name) InvalidPathError: Invalid path: Alex 3

Ah, you're right. I missed that part of the if then yes this is correct

Thanks @jjdelc - so a green light on allowing path strings?

jjdelc · 2024-08-19T11:37:01Z

scrunch/datasets.py

@@ -2412,14 +2409,22 @@ def fork(self, description=None, name=None, is_published=False,
            **kwargs
        )

+        owner = kwargs.get("owner")
        if preserve_owner:
            body['owner'] = self.resource.body.owner


If they choose to preserve owner, what they mean is that it should be in the same place as the current project. The API already returns the project URL for the owner attribute. So this may also be self.resource.project.

And we can slowly stop the conversaion about owners, because that concept does not exist in the API anymore and will bring confusion to users that individuals have ownership of datasets.

question: @jjdelc I am a bit confused here - We want to keep the preserve_owner kwarg and add a new one, project, explicitly to the signature by your suggestion; then basically fall back to self.resource.project (simple code change to abandon the old owner property) if project is not given? Just making sure we get this right!

Yes, really at this point owner and project are equivalent. There is no concept of an individual owner in Crunch.

It looks like preserve_owner is the default behavior when no owner is passed in, if preserve_owner is False, and no owner is passed, then there's no explicit code path for this.

Thanks for confirming.

@jjdelc , the self.resource.project.self fails when forking a dataset into the personal project., i.e., when the preserve_owner = True for a dataset in the personal project.
Hence, I've kept it unchanged for now i.e, body['owner'] = self.resource.body.owner.

Additionally, I've added the project parameter as suggested and also have added the test in test_dataset.py:

Also, I've added a few validations,

jjdelc · 2024-08-19T11:38:27Z

scrunch/datasets.py

+        try:
+            _fork = self.resource.forks.create(payload).refresh()
+        except TaskProgressTimeoutError as exc:
+            _fork = exc.entity.wait_progress(exc.response).refresh()


If you get a timeout, you should continue to raise, this is an invalid call with the exception. If it timed out it quite likely failed, and the code flow shouldn't continue bc you won't have a fork to carry on with.

question: @jjdelc Do you refer here to catching the timeout inside the fork() method directly? Because it is common practice for users to catch the timeouts on forks with a progress tracking against a larger timeout. We simply do not want to have this inside the method here?

What I mean is that if pycrunch raised a timeout, there's no recovery. Calling .wait_progress with the response of the timeout is not going to work, and it won't achieve anything (maybe some python error) but this line looks like it is attempting some fallback way to recover from the timeout and still obtain a _fork instance from it.

Yep - got it now, @jjdelc! Make sense. @shaikh-ma I think if we would want control the timeout AND also return the fork when it completed, we would need to do this:

... < catch & wait > entity = exc.entity.refresh() fork_ds = scrunch.get_mutable_dataset(entity.body.id)

I think this should work (at least I think that's what we do in other scenarios)?

The line _fork = exc.entity.wait_progress(exc.response).refresh() worked as expected (without giving any error) when I tried out with a dataset.

@jjdelc Looks like handling the timeout seems to work in our example, as demonstrated by @shaikh-ma - what do you think? I guess this is the last open point, then we should be good? The test has also been added.

Appreciate your time (when you can give it some 👀)! Thanks!

So this looks like it is continuing to wait after the client has timed out, say pycrunch's timeout is 2m and the operation is taking longer than that.

On the first exception the client will raise TimeOut, but the server may still carry on updating the progress. So that's what that 2nd attempt is doing, it's giving it another chance to continue. That 2nd one on its own may also raise Timeout in case it takes longer than, now, 4 minutes. And you'll be back at the same situation.

This is equivalent of saying timeout = 2*default_timeout. You've effectively waited 2 times.

It's a way to get around longer tasks, but code wise it reads very strange as it is not clear what the intention of this 2nd attempt is. Maybe add some comments, or even better if you want to increase the timeout then best to configure a longer timeout all together

scrunch/tests/test_datasets.py

Co-authored-by: Alessio Izzo <[email protected]>

jjdelc · 2024-09-03T12:48:31Z

scrunch/datasets.py

+        try:
+            _fork = self.resource.forks.create(payload).refresh()
+        except TaskProgressTimeoutError as exc:
+            _fork = exc.entity.wait_progress(exc.response).refresh()


So this looks like it is continuing to wait after the client has timed out, say pycrunch's timeout is 2m and the operation is taking longer than that.

On the first exception the client will raise TimeOut, but the server may still carry on updating the progress. So that's what that 2nd attempt is doing, it's giving it another chance to continue. That 2nd one on its own may also raise Timeout in case it takes longer than, now, 4 minutes. And you'll be back at the same situation.

This is equivalent of saying timeout = 2*default_timeout. You've effectively waited 2 times.

It's a way to get around longer tasks, but code wise it reads very strange as it is not clear what the intention of this 2nd attempt is. Maybe add some comments, or even better if you want to increase the timeout then best to configure a longer timeout all together

jjdelc · 2024-09-03T12:48:49Z

scrunch/tests/test_datasets.py

        sess = MagicMock()
        body = JSONObject({
            'name': 'ds name',
            'description': 'ds description',
-            'owner': user_id,
+            'owner': project_id,
+            # 'owner': user_id,


This comment needs to go?

I've removed the comment

jjdelc · 2024-09-03T12:49:24Z

scrunch/tests/test_datasets.py

@@ -1795,7 +1846,8 @@ def test_fork_preserve_owner(self):
            'body': {
                'name': 'FORK #1 of ds name',
                'description': 'ds description',
-                'owner': user_id,  # Owner preserved
+                # 'owner': user_id,  # Owner preserved
+                'owner': project_id,


I think it's best to start updating the tests to use project to make sure anybody else following the implementation starts learning this is the preferred way

Hi @jjdelc ,

I've updated the try-catch block to set timeout=None to avoid getting timed out again & have also added a comment stating the purpose.

I continue to worry that this re-attempt strategy is a bad approach. What this is doing is capturing the 1st timeout and then trying again.

Recall that this is a client-side timeout, say 60s so in case the opreation in the backend ends after 61s, the client will time out, and raise an error, then the client will try again a 2nd time and proceed for a next creation.

I think that if the issue is to reduce timeout changes, this should simply set a long timeout on the very fist request. Instead of making 2 independent creation requests to the server.

Hi @jjdelc,

As we've already completed the main goal of this PR, which involves allowing Project paths & URL,
I think it would be a good idea to explore timeout handling separately.
It might need determining what timeout value might be the best fit & how this will affect the codebase/performance etc, and to add a test for testing this scenario.

Therefore, I've removed the timeout handling part from the fork() method as this step is just an extra precaution to handle timeouts, but it's not the primary focus of this PR.

Very good call @shaikh-ma , I think this is best, if the fork starts to timeout, we can address that later.

shaikh-ma · 2024-09-16T08:09:54Z

Hi @jjdelc, @aless10 👋 ,

Please let me know if anything else is needed from our end or it's good to be merged. 😄

jjdelc · 2024-09-18T11:46:34Z

Merged on master, I'll make another minor release to include this

shaikh-ma and others added 3 commits July 25, 2024 13:56

Updating datset.fork method to allow project path string

c84a7e8

Updating docstring

8d318df

Update datasets.py

5ed329e

shaikh-ma changed the title ~~Task/update fork method to allow path string~~ Task/update fork method to allow Project Paths Jul 25, 2024

shaikh-ma changed the title ~~Task/update fork method to allow Project Paths~~ Allowing Project Paths to be passed to fork() methods Jul 26, 2024

shaikh-ma marked this pull request as draft July 26, 2024 12:00

shaikh-ma added 2 commits July 31, 2024 18:03

Handling timeouts

c9f0995

Update datasets.py

0f4ca00

shaikh-ma marked this pull request as ready for review August 1, 2024 08:38

shaikh-ma changed the title ~~Allowing Project Paths to be passed to fork() methods~~ Allowing Project Paths to be passed to scrunch.dataset.fork() method Aug 2, 2024

jamesrkg requested review from jjdelc and aless10 August 14, 2024 07:35

jjdelc requested changes Aug 15, 2024

View reviewed changes

jjdelc requested changes Aug 19, 2024

View reviewed changes

Updates as per code reviews

de721c2

aless10 reviewed Aug 29, 2024

View reviewed changes

scrunch/tests/test_datasets.py Outdated Show resolved Hide resolved

shaikh-ma and others added 2 commits August 29, 2024 15:08

Update scrunch/tests/test_datasets.py

b507aeb

Co-authored-by: Alessio Izzo <[email protected]>

fixing indentation

ea2d8ec

jjdelc reviewed Sep 3, 2024

View reviewed changes

shaikh-ma added 2 commits September 3, 2024 21:07

updates as per comments

45f356f

removing time-out handling

bd23e59

jjdelc approved these changes Sep 6, 2024

View reviewed changes

aless10 approved these changes Sep 9, 2024

View reviewed changes

jjdelc merged commit c94645f into Crunch-io:master Sep 18, 2024
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allowing Project Paths to be passed to scrunch.dataset.fork() method #463

Allowing Project Paths to be passed to scrunch.dataset.fork() method #463

shaikh-ma commented Jul 25, 2024 •

edited

Loading

shaikh-ma commented Aug 13, 2024

jjdelc left a comment

shaikh-ma commented Aug 15, 2024 •

edited

Loading

jjdelc commented Aug 19, 2024

jjdelc Aug 19, 2024

alexbuchhammer Aug 19, 2024 •

edited

Loading

jjdelc Aug 20, 2024

alexbuchhammer Aug 20, 2024 •

edited

Loading

jjdelc Aug 19, 2024

alexbuchhammer Aug 19, 2024 •

edited

Loading

jjdelc Aug 20, 2024

alexbuchhammer Aug 20, 2024

shaikh-ma Aug 26, 2024 •

edited

Loading

jjdelc Aug 19, 2024

alexbuchhammer Aug 19, 2024 •

edited

Loading

jjdelc Aug 20, 2024

alexbuchhammer Aug 20, 2024 •

edited

Loading

shaikh-ma Aug 26, 2024

alexbuchhammer Aug 28, 2024

jjdelc Sep 3, 2024

jjdelc Sep 3, 2024

jjdelc Sep 3, 2024

shaikh-ma Sep 3, 2024

jjdelc Sep 3, 2024

shaikh-ma Sep 3, 2024

shaikh-ma Sep 3, 2024

jjdelc Sep 5, 2024

shaikh-ma Sep 6, 2024 •

edited

Loading

jjdelc Sep 6, 2024

shaikh-ma commented Sep 16, 2024 •

edited

Loading

jjdelc commented Sep 18, 2024

Allowing Project Paths to be passed to scrunch.dataset.fork() method #463

Allowing Project Paths to be passed to scrunch.dataset.fork() method #463

Conversation

shaikh-ma commented Jul 25, 2024 • edited Loading

shaikh-ma commented Aug 13, 2024

jjdelc left a comment

Choose a reason for hiding this comment

shaikh-ma commented Aug 15, 2024 • edited Loading

jjdelc commented Aug 19, 2024

Choose a reason for hiding this comment

alexbuchhammer Aug 19, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alexbuchhammer Aug 20, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alexbuchhammer Aug 19, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

shaikh-ma Aug 26, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alexbuchhammer Aug 19, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alexbuchhammer Aug 20, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

shaikh-ma Sep 6, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

shaikh-ma commented Sep 16, 2024 • edited Loading

jjdelc commented Sep 18, 2024

shaikh-ma commented Jul 25, 2024 •

edited

Loading

shaikh-ma commented Aug 15, 2024 •

edited

Loading

alexbuchhammer Aug 19, 2024 •

edited

Loading

alexbuchhammer Aug 20, 2024 •

edited

Loading

alexbuchhammer Aug 19, 2024 •

edited

Loading

shaikh-ma Aug 26, 2024 •

edited

Loading

alexbuchhammer Aug 19, 2024 •

edited

Loading

alexbuchhammer Aug 20, 2024 •

edited

Loading

shaikh-ma Sep 6, 2024 •

edited

Loading

shaikh-ma commented Sep 16, 2024 •

edited

Loading