Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor SparsePageSource, delete cache files after use #5321

Merged
merged 4 commits into from
Feb 19, 2020

Conversation

RAMitchell
Copy link
Member

This refactoring separates out many of the components used in the external memory dmatrix. In particular the prefetching and management of pages is extracted into ExternalMemoryPrefetcher<T> and different external memory page types all have their own constructors now.

Part of the reason for doing this is to manage the lifetime of the cache files and remove them when the DMatrix gets destructed. Discussion about this in #4237. Leaving behind cache files leads to many unpredictable and unwanted behaviours, for example if I train the first time using 10% of my data, then train a second time using 100% but do not change the cache prefix, the 10% data will get loaded and used, completely ignoring the training input from the second run.

src/data/sparse_page_dmatrix.h Outdated Show resolved Hide resolved
src/data/sparse_page_source.h Outdated Show resolved Hide resolved
@@ -300,77 +327,118 @@ class SparsePageSource : public DataSource<T> {
writer.PushWrite(std::move(page));
}
std::unique_ptr<dmlc::Stream> fo(
dmlc::Stream::Create(cinfo.name_info.c_str(), "w"));
dmlc::Stream::Create(cache_info_.name_info.c_str(), "w"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a check to make sure this file doesn't exist? There is a FileExists function in tests/cpp/helpers.h we can use.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ThreadedIterator is not thread safe, we should be very careful about stateful operations.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added fatal errors if cache exists or if multiple threads try to use the prefetcher. Please take a look at my test for threads. It's not technically 100% guaranteed to trigger the exception, we can remove the test if we get false negatives in CI.

@RAMitchell RAMitchell merged commit bc96ceb into dmlc:master Feb 19, 2020
@hcho3 hcho3 mentioned this pull request Feb 21, 2020
12 tasks
@lock lock bot locked as resolved and limited conversation to collaborators May 20, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants