Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changing 1-to-M behaviour of on_disk_cache. #810

Closed

Conversation

VitalyFedyunin
Copy link
Contributor

@VitalyFedyunin VitalyFedyunin commented Oct 6, 2022

TLDR: If filename_fn of on_disk_cache ( filename_fn'1 ) generates different name from filename_fn of end_caching, it is considered 1 to many situation. In this case additional listing file would be created at filename_fn'1 position and it will be used to check cache consistency between runs.

BC breaking, on_disk_cache no longer accept generators as filename_fn

Stack from ghstack (oldest at bottom):

Differential Revision: D40148560

VitalyFedyunin added a commit that referenced this pull request Oct 6, 2022
ghstack-source-id: 4bfa54e5646d08eb5601ebe99ad23b044cef52ae
Pull Request resolved: #810
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 6, 2022
@VitalyFedyunin
Copy link
Contributor Author

@VitalyFedyunin has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@ejguan ejguan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!!!! Thank you!!!

@@ -185,13 +194,19 @@ def __init__(

if filepath_fn is not None:
_check_unpickable_fn(filepath_fn)
filepath_fn = _generator_to_list(filepath_fn) if inspect.isgeneratorfunction(filepath_fn) else filepath_fn
assert not inspect.isgeneratorfunction(filepath_fn) # BC breaking, now only str is accepted as return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then, we need to change the dataset implementation from TorchText side. e.g.: https://github.com/pytorch/text/blob/ff1fdfce8ac030a11638b2d94d54364144586253/torchtext/datasets/imdb.py#L107

Comment on lines +274 to +276
todo_dp: Any
cached_dp: Any
one_many_cached_dp: Any
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: annotated as IterDataPipe

TLDR: If filename_fn of on_disk_cache ( filename_fn'1 ) generates different name from filename_fn of end_caching, it is considered 1 to many situation. In this case additional listing file would be created at filename_fn'1 position and it will be used to check cache consistency between runs.

BC breaking, on_disk_cache no longer accept generators as `filename_fn`


Differential Revision: [D40148560](https://our.internmc.facebook.com/intern/diff/D40148560)

[ghstack-poisoned]
VitalyFedyunin added a commit that referenced this pull request Oct 11, 2022
ghstack-source-id: faa070d82c5482588a0dd5649335bbed550c6724
Pull Request resolved: #810
@VitalyFedyunin
Copy link
Contributor Author

@VitalyFedyunin has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

TLDR: If filename_fn of on_disk_cache ( filename_fn'1 ) generates different name from filename_fn of end_caching, it is considered 1 to many situation. In this case additional listing file would be created at filename_fn'1 position and it will be used to check cache consistency between runs.

BC breaking, on_disk_cache no longer accept generators as `filename_fn`


Differential Revision: [D40148560](https://our.internmc.facebook.com/intern/diff/D40148560)

[ghstack-poisoned]
VitalyFedyunin added a commit that referenced this pull request Oct 11, 2022
ghstack-source-id: 74bee209f06b86519ecaa7b996fccaad8ffbec95
Pull Request resolved: #810
@VitalyFedyunin
Copy link
Contributor Author

@VitalyFedyunin has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

VitalyFedyunin added a commit to pytorch/text that referenced this pull request Oct 12, 2022
Fixed issues with cache locks and cache files overwrites. Required to be compatible with pytorch/data#810




[ghstack-poisoned]
VitalyFedyunin added a commit to pytorch/text that referenced this pull request Oct 13, 2022
Fixed issues with cache locks and cache files overwrites. Required to be compatible with pytorch/data#810




[ghstack-poisoned]
Nayef211 pushed a commit to pytorch/text that referenced this pull request Oct 13, 2022
* Fixed on_disk_cache issues

[ghstack-poisoned]

* Update on "Fixed on_disk_cache issues"

Fixed issues with cache locks and cache files overwrites. Required to be compatible with pytorch/data#810




[ghstack-poisoned]

* Update on "Fixed on_disk_cache issues"

Fixed issues with cache locks and cache files overwrites. Required to be compatible with pytorch/data#810




[ghstack-poisoned]
joecummings added a commit to pytorch/text that referenced this pull request Oct 14, 2022
* Fixed on_disk_cache issues

[ghstack-poisoned]

* Update on "Fixed on_disk_cache issues"

Fixed issues with cache locks and cache files overwrites. Required to be compatible with pytorch/data#810




[ghstack-poisoned]

* Update on "Fixed on_disk_cache issues"

Fixed issues with cache locks and cache files overwrites. Required to be compatible with pytorch/data#810




[ghstack-poisoned]

Co-authored-by: Vitaly Fedyunin <[email protected]>
@facebook-github-bot facebook-github-bot deleted the gh/VitalyFedyunin/23/head branch October 16, 2022 14:19
ejguan pushed a commit to ejguan/text that referenced this pull request Oct 20, 2022
* Fixed on_disk_cache issues

[ghstack-poisoned]

* Update on "Fixed on_disk_cache issues"

Fixed issues with cache locks and cache files overwrites. Required to be compatible with pytorch/data#810




[ghstack-poisoned]

* Update on "Fixed on_disk_cache issues"

Fixed issues with cache locks and cache files overwrites. Required to be compatible with pytorch/data#810




[ghstack-poisoned]

Co-authored-by: Vitaly Fedyunin <[email protected]>
ejguan pushed a commit to ejguan/data that referenced this pull request Oct 20, 2022
Summary:
Pull Request resolved: pytorch#810

TLDR: If filename_fn of on_disk_cache ( filename_fn'1 ) generates different name from filename_fn of end_caching, it is considered 1 to many situation. In this case additional listing file would be created at filename_fn'1 position and it will be used to check cache consistency between runs.

BC breaking, on_disk_cache no longer accept generators as `filename_fn`

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D40148560

Pulled By: VitalyFedyunin

fbshipit-source-id: 3bb9c5f32546a3d4859cff6bda0cc84234312ab7
ejguan pushed a commit that referenced this pull request Oct 20, 2022
Summary:
Pull Request resolved: #810

TLDR: If filename_fn of on_disk_cache ( filename_fn'1 ) generates different name from filename_fn of end_caching, it is considered 1 to many situation. In this case additional listing file would be created at filename_fn'1 position and it will be used to check cache consistency between runs.

BC breaking, on_disk_cache no longer accept generators as `filename_fn`

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D40148560

Pulled By: VitalyFedyunin

fbshipit-source-id: 3bb9c5f32546a3d4859cff6bda0cc84234312ab7
abhinavarora pushed a commit to pytorch/text that referenced this pull request Oct 21, 2022
* Fixed on_disk_cache issues

[ghstack-poisoned]

* Update on "Fixed on_disk_cache issues"

Fixed issues with cache locks and cache files overwrites. Required to be compatible with pytorch/data#810




[ghstack-poisoned]

* Update on "Fixed on_disk_cache issues"

Fixed issues with cache locks and cache files overwrites. Required to be compatible with pytorch/data#810




[ghstack-poisoned]

Co-authored-by: Vitaly Fedyunin <[email protected]>

Co-authored-by: Joe Cummings <[email protected]>
Co-authored-by: Vitaly Fedyunin <[email protected]>
ejguan pushed a commit to ejguan/data that referenced this pull request Oct 23, 2022
Summary:
Pull Request resolved: pytorch#810

TLDR: If filename_fn of on_disk_cache ( filename_fn'1 ) generates different name from filename_fn of end_caching, it is considered 1 to many situation. In this case additional listing file would be created at filename_fn'1 position and it will be used to check cache consistency between runs.

BC breaking, on_disk_cache no longer accept generators as `filename_fn`

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D40148560

Pulled By: VitalyFedyunin

fbshipit-source-id: 3bb9c5f32546a3d4859cff6bda0cc84234312ab7
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants