Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compose caching #7131

Closed
wants to merge 9 commits into from
Closed

Conversation

atbenmurray
Copy link
Contributor

@atbenmurray atbenmurray commented Oct 13, 2023

Fixes #7130

Description

This is still a work in progress and a focus point for the discussion on #7130. At present, it provides an initial solution that only
solves the Compose always being treated as random part of the solution, but doesn't yet allow nested Composes to be iterated into.

So far, this implementation provides a solution to the issue where Compose is always treated as random.

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Integration tests passed locally by running ./runtests.sh -f -u --net --coverage.
  • Quick tests passed locally by running ./runtests.sh --quick --unittests --disttests.
  • In-line docstrings updated.
  • Documentation updated, tested make html command in the docs/ folder.

@atbenmurray atbenmurray self-assigned this Oct 13, 2023
Copy link
Contributor

@dzenanz dzenanz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good on a glance. Are the preceding commits related to this fix?

@atbenmurray
Copy link
Contributor Author

Code looks good on a glance. Are the preceding commits related to this fix?

Thanks for taking a look!
I'm out at the moment but I'll double check. I still want to modify the solution so it has the nested compose and flat compose equivalent behaviour anyway.

@@ -334,6 +334,18 @@ def __call__(self, input_, start=0, end=None, threading=False, lazy: bool | None

return result

def is_random(self):
Copy link
Contributor

@KumoLiu KumoLiu Oct 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!
Do we have another special reason to add this is_random function?
Looks like we could directly use this flatten function in the CacheDataset, it can also apply to nested compose. Do you have concerns about this easy solution?

def flatten(self):

super().__init__(data=data, transform=transform)

Copy link
Contributor Author

@atbenmurray atbenmurray Oct 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not super happy with is_random; it is just an initial attempt to solve the problem. I've thought about it over the weekend and if I can avoid introducing is_random then I'll get rid of it. In particular, if we always iterate into Compose objects, we don't need a check to determine whether the Compose object contains only provably deterministic transforms.

flatten has its own problems in that it isn't a complete mechanism and it produces a physically flattened set of transforms. I'm prepping another commit that iterates over transforms without flattening them and returns an index instead. This could then be used in the flatten implementation internally where people really need it, but I think an iterator that knows how to visit transforms is cleaner

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this has also exposed a potential issue with the relationship between Compose and OneOf, SomeOf, RandomOrder. The latter three are always random and shouldn't be iterated into when trying to find the last deterministic transform, whereas Compose really is just a container and its randomness depends on its contents.

@atbenmurray
Copy link
Contributor Author

Ok. I think I have a clean solution. I'm just testing it, then I'll either wrap up this PR and create a new one or I'll continue with this one.

Essentially, the iterative approach can replace the physical flattening quite nicely. This is good because we don't flatten the transforms when we execute the Compose pipeline but we do effective execute in a nested way. The solution is to modify execute_compose to be able to perform the iteration itself without flattening the data. The solution doesn't need is_random, although we might want to keep it (or something like it) so we don't have to keep a list of subclasses of Compose that we don't iterate into.

@atbenmurray atbenmurray mentioned this pull request Oct 18, 2023
7 tasks
@atbenmurray
Copy link
Contributor Author

Ok, @dzenanz, @KumoLiu, PR #7140 is the solution I had in mind that solves the general problem described by discussion #6172 and issue #7130. As such, I'm going to close this PR and ask you to take a look at #7140 as a better approach. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Datasets that cache in some form should be able to handle nested Compose
3 participants