Create a multi thread version of BatchTask #71
Comments
Hi, Cheers, Martin |
AFAIK, Ivan (@habernal) was the last person working on this issue. I think work on it is quite advanced, but I'm not sure where it got stuck. Ivan, could you help out here? |
We started implementing MultiThreadBatchTask and some tests in MultiThreadTaskPerformanceTest and MultiThreadBatchTaskTest, but we got stuck at correctly propagating errors (some tasks fail because of missing dependencies and should be re-scheduled, while others fail "normally"). There is a test case for a large graph of dependent tasks, which we planned to use to prove that the multi-threaded solution actually speeds things up. We haven't touched it since; feel free to explore. |
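To make the two failure kinds concrete, here is a minimal sketch of how they could be told apart once tasks run through an `ExecutorService`. It is not the actual MultiThreadBatchTask code: `UnresolvedDependencyException`, `Outcome`, `classify` and `runOnce` are hypothetical names invented for this illustration. The only library behaviour it relies on is that `Future.get()` wraps whatever the task threw in an `ExecutionException`, so the original exception is available via `getCause()`.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class FailureKindSketch {

    // Hypothetical marker for "a dependency of this task is not available yet".
    static class UnresolvedDependencyException extends RuntimeException {
        UnresolvedDependencyException(String message) {
            super(message);
        }
    }

    enum Outcome { DONE, RESCHEDULE, FAILED }

    // Waits for one task and decides what the scheduler should do with it.
    static Outcome classify(Future<?> future) {
        try {
            future.get();
            return Outcome.DONE;
        } catch (ExecutionException e) {
            // The exception thrown inside the task is the cause, not e itself.
            return (e.getCause() instanceof UnresolvedDependencyException)
                    ? Outcome.RESCHEDULE
                    : Outcome.FAILED;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return Outcome.FAILED;
        }
    }

    // Convenience for the illustration: run a single task and classify it.
    static Outcome runOnce(Runnable task) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            return classify(executor.submit(task));
        } finally {
            executor.shutdown();
        }
    }
}
```

With this sketch, a task that throws `UnresolvedDependencyException` yields `RESCHEDULE`, while any other exception yields `FAILED` and could be reported to the caller.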
Thanks a lot, Johannes and Ivan. I will take a look. Multithreading would be a great addition to DKPro. Apart from that, has anyone ever thought about supporting cluster/grid solutions, such as Sun Grid Engine? This might be another option for boosting performance by allowing atomic jobs to be run in parallel in a cluster, but I am not sure how it would work for tasks such as Lucene n-gram meta-info, which need to write to the same Lucene index. |
In theory, each task uses its own context to write to and to read from. If multiple resources from different contexts need to access the same files/DBs etc., that is a separate problem and needs to be dealt with by those resources, I would say. Hence, multi-threaded BatchTasks are a big step towards supporting cluster solutions. |
Thanks a lot, Johannes, for the hint. I will have a look at the BigData sub-project. |
I can feel your pain ;-) I have been using DKPro TC to classify up to a million documents, split into smaller subsets which were processed by multiple Java threads. However, automatic parallelization is definitely preferable to manual parallelization ... |
On the UIMA list, a message was posted last week to announce this project here which allows multithreaded execution of tasks created by CAS multipliers in UIMAfit (if I understand it correctly): Might be interesting in the present context. |
@habernal Hi Ivan, I've had some time to go through the code of the MultiThreadBatchTask classes. I don't have much experience with multithreading, so it took me a while, but I think in the end I figured out where the problem was. Just a minor bug fix really, especially considering the complexity of this task: exceptionsFromCurrentLoop needs to be reset whenever the outer loop restarts, because otherwise the outer loop will run at most twice, potentially leaving a number of tasks un-executed. Two questions, though:
|
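The effect of the missing reset can be reproduced in a deliberately simplified, single-threaded sketch. This is not the MultiThreadBatchTask code: only the name `exceptionsFromCurrentLoop` comes from the comment above, while the `Task` class, the "no progress" check and the `resetPerPass` switch are assumptions made for illustration. Tasks whose dependencies are not done yet are recorded as failed and re-queued; the outer loop gives up when every pending task failed in the current pass.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class RescheduleLoopSketch {

    // Hypothetical task: a name plus the names of tasks that must finish first.
    static final class Task {
        final String name;
        final Set<String> deps;

        Task(String name, String... deps) {
            this.name = name;
            this.deps = new HashSet<>(Arrays.asList(deps));
        }
    }

    // A dependency chain a <- b <- c <- d, queued in the worst possible order.
    static List<Task> sampleChain() {
        return Arrays.asList(new Task("d", "c"), new Task("c", "b"),
                new Task("b", "a"), new Task("a"));
    }

    // Returns the names of the tasks that were executed, in execution order.
    static List<String> runAll(List<Task> tasks, boolean resetPerPass) {
        Deque<Task> queue = new ArrayDeque<>(tasks);
        Set<String> done = new LinkedHashSet<>();
        List<String> exceptionsFromCurrentLoop = new ArrayList<>();

        while (!queue.isEmpty()) {                     // outer loop: one pass over the queue
            if (resetPerPass) {
                exceptionsFromCurrentLoop.clear();     // the fix: start each pass clean
            }
            int pending = queue.size();
            for (int i = 0; i < pending; i++) {        // inner loop: try every pending task once
                Task task = queue.poll();
                if (done.containsAll(task.deps)) {
                    done.add(task.name);
                } else {
                    exceptionsFromCurrentLoop.add("unresolved: " + task.name);
                    queue.add(task);                   // re-schedule for the next pass
                }
            }
            // Give up when no pending task could run in this pass.
            if (exceptionsFromCurrentLoop.size() >= pending) {
                break;
            }
        }
        return new ArrayList<>(done);
    }
}
```

In this sketch, `runAll(sampleChain(), true)` executes `[a, b, c, d]`, whereas `runAll(sampleChain(), false)` stops after the second pass with only `[a, b]` executed, because the failures recorded in the first pass are still counted in the second.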
PS: I ran a little test on a 24-core machine with some strange results. These are the total execution times for the performance tests with different numbers of threads n set for the executor: 1 thread: Duration [ms]: 66,085 The values are suspiciously close together and don't show much of a performance boost. I have checked that the class was re-compiled correctly with the new values for n, so that can't be the problem. |
Hmm, I think the issue might be with the following line: future.get(); This waits for the result of the Future, which essentially turns it into a synchronous call, and in consequence all tasks are executed sequentially and not in parallel. But I could be wrong... |
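A small, self-contained sketch of the suspected effect, assuming the `future.get()` call sits inside the same loop that submits the tasks (the thread does not show the surrounding code). The class and method names are made up for this example, and `sleepyTask` is a stand-in for real work. `runBlocking` waits for each task right after submitting it; `runParallel` submits everything first and only then waits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class FutureGetSketch {

    // Hypothetical stand-in for a real task: sleeps to simulate work.
    static Runnable sleepyTask(long millis) {
        return () -> {
            try {
                Thread.sleep(millis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
    }

    // Blocks until the given future completes, wrapping checked exceptions.
    static void waitFor(Future<?> future) {
        try {
            future.get();
        } catch (Exception e) {
            if (e instanceof InterruptedException) {
                Thread.currentThread().interrupt();
            }
            throw new IllegalStateException(e);
        }
    }

    // Anti-pattern: get() directly after submit() means only one task is ever in flight.
    static long runBlocking(int tasks, int threads, long millis) {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        try {
            for (int i = 0; i < tasks; i++) {
                waitFor(executor.submit(sleepyTask(millis)));
            }
        } finally {
            executor.shutdown();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    // Submit all tasks first, then wait; the pool can now work on several at once.
    static long runParallel(int tasks, int threads, long millis) {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                futures.add(executor.submit(sleepyTask(millis)));
            }
            for (Future<?> future : futures) {
                waitFor(future);
            }
        } finally {
            executor.shutdown();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

With four 100 ms tasks on four threads, `runBlocking` takes roughly the sum of the task durations regardless of the pool size, while `runParallel` takes roughly the duration of a single task. Exact numbers depend on the machine.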
I haven't been able to investigate the performance issue yet. Regarding next steps in DKPro TC (but this should better be discussed on the respective mailing list): once we have a way to (maybe dynamically) set the number of threads in MultiThreadBatchTask, all ExperimentBatchTask classes in DKPro TC can probably inherit from this class (with a default thread number of 1). |
I have made some more changes to the implementation of MultiThreadBatchTask and I think I have it nailed now. Here are the results from a first test: 1 thread: Total runtime [ms]: 54,003 Not so promising. Then I figured that the actual execution time of the DummyTask might be too short in comparison to the overall runtime and the overhead of managing the threads/futures. So, I added a 10 second pause to the task to simulate some actual work, and now the results show a clear improvement when using several threads: 1 thread: Total runtime [ms]: 4,205,190 I will have some time this evening to clean up the modified code and do some final checks before submitting this. |
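The kind of measurement described here can be sketched as follows. This is an illustration only, not MultiThreadTaskPerformanceTest: the class name, method name and parameters are assumptions, and a `Thread.sleep` stands in for the pause added to the DummyTask. It runs the same batch of sleeping tasks on pools of different sizes via `ExecutorService.invokeAll`, which returns once all tasks have completed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ThreadCountTimingSketch {

    // Runs `tasks` simulated jobs of `workMillis` each on `threads` threads
    // and returns the total wall-clock runtime in milliseconds.
    static long totalRuntimeMillis(int threads, int tasks, long workMillis) {
        List<Callable<Void>> batch = new ArrayList<>();
        for (int i = 0; i < tasks; i++) {
            batch.add(() -> {
                Thread.sleep(workMillis); // simulated work, as with the paused dummy task
                return null;
            });
        }
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        try {
            executor.invokeAll(batch); // blocks until every task has finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        for (int threads : new int[] {1, 2, 4, 8}) {
            System.out.println(threads + " thread(s): Total runtime [ms]: "
                    + totalRuntimeMillis(threads, 8, 200));
        }
    }
}
```

With `workMillis` set to 0, the runtimes for different thread counts should end up close together, since pool management overhead dominates; with a noticeable pause per task, the total runtime should drop as threads are added, up to the number of tasks.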
That sounds indeed promising. Thanks for the investigations, I'm excited to test this in practice! |
I've submitted the pull request with my changes. Please review this closely before merging. The basic structure is the same as before, but I have made some substantial changes to the code. |
That is really great, thanks a lot for the hard work. As you say, this request will need some deeper reviewing - we'll try to do that and merge asap. |
Original issue reported on code.google.com by [email protected] on 10 Apr 2015 at 10:43