
(draft) Defer File closing to a background thread pool and cache directory IDs #66

Closed
wants to merge 2 commits

Conversation


@jpdoyle jpdoyle commented Jul 1, 2018


My current approach to #64

if(close_it != m_files_to_close.end()) {
    m_files.emplace(id,std::move(close_it->second));
    m_files_to_close.erase(close_it);
    it = m_files.find(id);
Owner

emplace returns an iterator to the newly inserted element (as the first member of the pair it returns). A second find is an unnecessary cost.
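A minimal sketch of the suggested change, with hypothetical stand-ins for the real types (FileBase and id_type here are placeholders, not the PR's actual definitions):

#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

struct FileBase {};          // hypothetical stand-in for the real class
using id_type = std::string; // hypothetical stand-in for the real ID type

std::unordered_map<id_type, std::unique_ptr<FileBase>> m_files;
std::unordered_map<id_type, std::unique_ptr<FileBase>> m_files_to_close;

void reopen(const id_type& id)
{
    auto close_it = m_files_to_close.find(id);
    if (close_it != m_files_to_close.end()) {
        // emplace returns std::pair<iterator, bool>; keeping the iterator
        // avoids paying for a second find().
        auto it = m_files.emplace(id, std::move(close_it->second)).first;
        m_files_to_close.erase(close_it);
        (void)it; // the surrounding code in the PR goes on to use `it`
    }
}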

}
} while(being_closed);
}

Owner

What is the purpose of this block of code?

Author

This is to make sure that the file is not currently being deleted by one of the background threads. I don't know the FileBase implementation well enough to be sure it would be safe to have two FileBases referring to the same file open, with one actively being deleted by a background thread.

In pseudocode, it roughly does:

while(the file I'm trying to open is being closed) {
    wait for at least one background thread to finish (which would indicate a change to the list of files being closed)
}

// The file was reopened
return;
}
ptr = it->second.release();
Owner

The two lines below may throw exceptions, causing ptr to leak. Instead of a raw pointer, this should be a std::unique_ptr.
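A minimal sketch of the suggested fix, with hypothetical stand-ins for the real types and for the throwing lines: ownership stays in a std::unique_ptr across the code that may throw, so an exception cannot leak the file.

#include <memory>

struct FileBase {};          // hypothetical stand-in for the real class
void may_throw(FileBase&) {} // stands in for the two lines that can throw

std::unique_ptr<FileBase> take_file(std::unique_ptr<FileBase> file)
{
    // If either call throws, `file` still owns the object and destroys it
    // during stack unwinding.
    may_throw(*file);
    may_throw(*file);
    // Only after the throwing code has succeeded is it safe to hand the
    // pointer onward (and release() only if a raw pointer is truly needed).
    return file;
}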

@@ -0,0 +1,123 @@
#ifndef THREAD_POOL_HPP
Owner

Do we really need several threads to do the background closing? Seems overkill to me.

Author

5000 threads is definitely overkill, but I haven't tuned that number yet. In theory, background threads might not be necessary at all, but I haven't dug into what deleting a FileBase actually does -- it seems to be tied up in IO calls, though I'm not sure which ones (careful use of strace might be informative). With my folder I'm seeing a ~50x speedup. I had timed ls before, but find -type f is more representative: with this change it takes 2 minutes 11 seconds, versus 1 hour 51 minutes with vanilla securefs. For reference, find -type f | wc -l says this directory has 250289 files.

Owner

@netheril96 netheril96 left a comment

The test on Travis CI fails. The error message is unclear (SIGABRT), but something is wrong.

…unique_ptr, use iterator from unordered_map::emplace
@jpdoyle
Author

jpdoyle commented Jul 3, 2018

I think the Travis OSX failure was because the thread pool spawned 5000 threads. I reduced it to 50, which should be more reasonable.

Owner

@netheril96 netheril96 left a comment

I believe I now understand the basics of what you are trying to do, and it should work. Let's iron out the details before merging it.

One big thing: I think the cache of path-to-ID resolution should be deferred to a future PR, if ever. As you have discovered, it barely improves performance. Any cache mechanism runs the risk of data going out of sync, so a cache should only be added when the benefit is clear and large.

I am also amazed that you are able to understand my codebase so easily as to modify it 👍🏻.

condvars[i]->notify_all();
}

bool done() {
Owner

I have a hard time following what done() and wait() do, or what the use of condvars is. Besides, done() seems susceptible to a race condition, as it accesses work_queues without acquiring the corresponding lock.

Author

condvars lets the worker threads sleep, without wasting any CPU time, until there is work available for them to do. condvars[i]->wait(l) pauses the thread until condvars[i]->notify_all() is called on some other thread; while the thread is asleep, the lock l is unlocked, but once it wakes up again it acquires l once more.

done() has a minor data race, but it's a read-only operation. Between a job being added with add_job and finishing, done() will never return true. If there are many jobs being added and finishing simultaneously, done() might erroneously return true. wait() waits for some job to complete, or for 100ms to pass, whichever happens first.

I should add these as comments :)
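A minimal, self-contained sketch of the mechanism being described; the names here are hypothetical stand-ins, not the PR's actual members. A worker sleeps on one condition variable until work arrives, and wait_for() lets a waiting thread wake when some job completes or after 100ms, whichever comes first.

#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

std::mutex m;
std::condition_variable work_available; // workers sleep on this
std::condition_variable job_done;       // waiters sleep on this
std::queue<std::function<void()>> work_queue;

void worker_loop()
{
    for (;;) {
        std::unique_lock<std::mutex> l(m);
        // wait() releases `m` while asleep and reacquires it on wakeup.
        work_available.wait(l, [] { return !work_queue.empty(); });
        auto job = std::move(work_queue.front());
        work_queue.pop();
        l.unlock();
        job();
        job_done.notify_all(); // some job finished; waiters may recheck
    }
}

void wait_for_progress()
{
    std::unique_lock<std::mutex> l(m);
    // Wake when some job completes, or after 100ms, whichever comes first.
    job_done.wait_for(l, std::chrono::milliseconds(100));
}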

Owner

@netheril96 netheril96 Jul 5, 2018

Race condition invokes undefined behavior. Nothing, not even reading, is safe during a race.

But more importantly, I don't think we should have done() and wait(). I guess the intention is to signal the main thread calling open_as when to wake up and recheck whether the ID it wants to open is still being closed. But

  • Opening an ID that is just being closed is not a common scenario.
  • The signal is not accurate. It only signals that one ID has been closed, not which ID.
  • Each background thread has to call job_done.notify_all() after each closing. Not only does each call cost CPU cycles, it reduces parallelism since several threads may call it at the same time and have to wait for others.

In a nutshell, this design incurs a great cost, for a not-so-common scenario, and the main thread still has to recheck many times. Too much cost, too little gain.

A far simpler way is simply to let the main thread sleep for a short duration if the ID being opened is being closed by another thread.
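A minimal sketch of that simpler alternative; the predicate here is a hypothetical stand-in for "this ID is still in m_files_to_close", checked under the appropriate lock, and the sleep duration is a guess.

#include <chrono>
#include <functional>
#include <thread>

// Sleep-and-recheck: no condition variables, no notify after every close.
void wait_until_not_closing(const std::function<bool()>& is_being_closed)
{
    while (is_being_closed())
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
}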

Author

Good catch on the undefined behavior. I've been doing a bit too much platform-locked work, so I forgot about the generic C++ memory model. Even locked to x86, this code may sometimes fail due to memory-reordering effects in empty() -- so the body of the if should be in a lock guard regardless.
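A minimal sketch of that fix, with members modeled loosely on the snippets in this review (the names and the one-lock-per-queue layout are assumptions): every read of a queue happens while holding its lock.

#include <atomic>
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

std::atomic<int> active_count{0};
std::vector<std::deque<int>> work_queues(4); // hypothetical job type and pool size
std::vector<std::mutex> queue_locks(4);      // one lock per queue

bool done()
{
    if (active_count.load() != 0)
        return false;
    for (std::size_t i = 0; i < work_queues.size(); ++i) {
        // Read the queue only while holding its lock, so there is no data
        // race on its internal state.
        std::lock_guard<std::mutex> g(queue_locks[i]);
        if (!work_queues[i].empty())
            return false;
    }
    return true;
}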

Opening an ID that is just being closed is not a common scenario

Given how long deleting a FileBase currently takes, many of the directories visited by find -type f will fit this profile.

The signal is not accurate. It only signals that one ID has been closed, not which ID.

This is correct, but the signal will always trigger when the correct ID is closed, so it's just the pull-based version of this communication protocol. To wait on a particular ID, we'd have to have the thread pool know a lot more about the jobs being given to it. That might be a good idea, but given how the data race in done() snuck in, I think adding code complexity to the thread pool is risky.

Each background thread has to call job_done.notify_all() after each closing. Not only does each call cost CPU cycles, it reduces parallelism since several threads may call it at the same time and have to wait for others.

Using something like condition variables will waste a lot less CPU time than periodic polling, and when an extended wait isn't necessary the condition variable will return quickly.

Owner

@netheril96 netheril96 Jul 6, 2018

It would be best if you could run a real-world benchmark of which approach is better (and how many threads are appropriate). If you insist on the current design, I won't object.

@jpdoyle
Author

jpdoyle commented Jul 3, 2018

I agree about the cache -- I'm gonna run a small benchmarking matrix using ls and find -type f to double-check how much of an effect the caching has, but I think my implementation is too half-baked to ship. A true solution would need to actively monitor the underlying folder with something like inotify, and even that isn't perfectly reliable.

Owner

@netheril96 netheril96 left a comment

See my comment about done() and wait().

if(!active_count.load()) {
    bool anything_queued = false;
    for(const auto& q: work_queues) {
        if(!q.empty()) {
Owner

Even if we lock the individual queues before accessing them, the emptiness may change once we unlock them. Well, actually it cannot, because in our case the thread calling done() is the same as the thread calling add_job, but such long-distance coupling is not good design.

std::unique_lock<std::mutex> l(job_done_lock);
job_done.wait_for(l,std::chrono::milliseconds(100));
}
return done();
Owner

The return value is discarded by the caller. Does it serve any purpose?

if(!free_pool.done()) {
    bool being_closed = false;
    do {
        being_closed = false;
Owner

This is a no-op.

@jpdoyle jpdoyle closed this Mar 4, 2019