Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Discussion] Is the implementation of cycler efficient? #742

Closed
NivekT opened this issue Aug 17, 2022 · 4 comments
Closed

[Discussion] Is the implementation of cycler efficient? #742

NivekT opened this issue Aug 17, 2022 · 4 comments

Comments

@NivekT
Copy link
Contributor

NivekT commented Aug 17, 2022

TL;DR: It seems in most cases users might be better off using .flatmap(lambda x: [x for _ in n_repeat]) rather than .cycle(n_repeat).

Here is the implementation, basically Cycler reads from the source DataPipe for n number of times.

Things to consider:

  1. This means repeating certain operations (e.g. reading from disk, complicated transformation) for n number of times, unless you use in_memory_cache.
  2. If shuffle is used afterwards, I believe .flatmap(lambda x: [x for _ in n_repeat]) is strictly better than .cycle(n_repeat).
  3. For input = [0, 1, 2], the major difference is that .cycle returns [0, 1, 2, 0, 1, 2] compared to .flatmap(...) returning [0, 0, 1, 1, 2, 2].

Questions:

  1. Should we change the implementation?
  2. Should we add something like .repeat() which basically does .flatmap(lambda x: [x for _ in n_repeat])?
  3. Should we advise users to use .flatmap(...) instead unless they specifically want the ordering of [0, 1, 2, 0, 1, 2]?

@VitalyFedyunin @ejguan Let me know what you think.

@NivekT
Copy link
Contributor Author

NivekT commented Aug 17, 2022

I did some quick profiling and cycle is slower for a relatively simple DataPipe.

Screen Shot 2022-08-17 at 7 01 17 PM

@ejguan
Copy link
Contributor

ejguan commented Aug 18, 2022

  • Should we add something like .repeat() which basically does .flatten(lambda x: [x for _ in n_repeat])?

I would support this proposal because I believe those two have different use cases. And, python does support such functionality via https://more-itertools.readthedocs.io/en/stable/api.html#more_itertools.repeat_each

BTW, which are you using for profiling? It looks great.

@NivekT
Copy link
Contributor Author

NivekT commented Aug 18, 2022

Adding an operation sounds good. I will probably modify the docstrings as well to mention that the other option exists and may be more suitable for some use cases.

I am using scalene for profiling.

@NivekT
Copy link
Contributor Author

NivekT commented Aug 30, 2022

Closing this as #748 has landed.

@NivekT NivekT closed this as completed Aug 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants