Add sync files scheduler class #5269
Conversation
Anyway, we should get rid of this quadratic loop: for 1k files it is about 500k iterations, but for 10k files it is 50 million. EDIT: I actually tried to find a use case for a sub-directory job inside a directory, but is that a case in the old implementation?
Force-pushed from 7bc1ec5 to 34d54b7
Good catch @mrow4a.
// all the sub files or sub directories.
QVector&lt;PropagatorJob *&gt; _subJobs;
// all the new and changed files without conflicts.
QLinkedList&lt;PropagatorJob *&gt; _syncJobs;
AFAIK not ideal because of heap fragmentation, but @ogoffart can say more about this.
It should be fine. I chose QLinkedList to be able to easily put jobs at both the end and the beginning of the list; I need that for bundling and related work. You know, everywhere there are pros and cons. Let's discuss it, but this is an implementation detail.
In that case, use QList, which reserves space at the beginning and at the end.
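(As a side note, here is a minimal sketch of why QList covers both use cases in Qt 5: its internal buffer keeps headroom on both sides, so prepend and append are both cheap. The `PropagatorJob` stand-in below is illustrative, not the client's real class.)

```cpp
// Sketch: QList handles insertion at both ends cheaply in Qt 5, so a
// separate QLinkedList is not required just for front/back insertion.
#include <QList>
#include <QDebug>

struct PropagatorJob { int id; };  // stand-in for the real job class

int main()
{
    QList<PropagatorJob *> syncJobs;
    syncJobs.append(new PropagatorJob{1});   // normal job goes to the back
    syncJobs.prepend(new PropagatorJob{0});  // urgent job goes to the front

    for (PropagatorJob *job : syncJobs)
        qDebug() << "job" << job->id;

    qDeleteAll(syncJobs);  // whoever owns the list must free the jobs
    return 0;
}
```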
Amazing. High CPU usage has plagued us there for quite some time! How does the memory usage compare? @jturcotte had worked hard to lower it. Can you look at this please?
@ogoffart needs to have a look at this from the correctness perspective :)
I guess there are more flowers like that in the code; I will look for more later. I suspect similar friends are in the discovery code. @guruz Hmm, I did not check it, but basically it should be the same as before, and it is lowered with each item, since the memory from the linked list is freed after each file is synced (I remove the item from the list and pass it to the PropagateItem job; when that job is destroyed, the item is, I guess, destroyed as well, since it is no longer in the linked list). @ogoffart Please take this as a draft; I have not looked at the unit tests yet. Do you have some suggestions already?
I did not review the whole thing, but it seems you are leaking when the sync aborts prematurely. There is indeed O(n^2) complexity. If I understand correctly, what you are doing is having two lists to reduce the n. But the real fix is what the FIXME says: cache the value so one does not always start at 0.
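To make the leak concern concrete, here is a minimal sketch of the cleanup an early abort needs when raw job pointers sit in a container; the container and member names are assumptions, not the PR's actual code.

```cpp
// Sketch: jobs still queued at abort time must be deleted explicitly,
// otherwise a prematurely stopped sync leaks them.
#include <QLinkedList>

struct PropagatorJob { virtual ~PropagatorJob() {} };

struct JobOwner {
    QLinkedList<PropagatorJob *> syncJobs;  // owns the queued jobs

    void abort()
    {
        qDeleteAll(syncJobs);  // free everything that never ran
        syncJobs.clear();
    }

    ~JobOwner() { abort(); }
};

int main()
{
    JobOwner owner;
    owner.syncJobs << new PropagatorJob << new PropagatorJob;
    owner.abort();  // no leak: both queued jobs are freed
    return 0;
}
```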
@@ -608,22 +610,23 @@ bool PropagateDirectory::scheduleNextJob()
    bool stopAtDirectory = false;
    // FIXME: use the cached value of finished job
Here is what you need to do to remove the O(n^2): cache the value of the first unfinished job and start from there instead of always starting from 0.
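A self-contained sketch of that caching idea; `Job`, `subJobs` and `firstUnfinished` are illustrative names, not the client's real API.

```cpp
// Sketch: remember the index of the first unfinished job so repeated
// scheduling calls cost O(n) in total instead of O(n^2).
#include <QVector>

struct Job {
    bool finished = false;
    bool schedule() { finished = true; return true; }  // pretend to run the job
};

struct Directory {
    QVector<Job *> subJobs;
    int firstUnfinished = 0;  // cached index; only ever moves forward

    bool scheduleNextJob()
    {
        // Skip the already-finished prefix once instead of rescanning it
        // from index 0 on every call.
        while (firstUnfinished < subJobs.size()
               && subJobs.at(firstUnfinished)->finished)
            ++firstUnfinished;

        for (int i = firstUnfinished; i < subJobs.size(); ++i)
            if (!subJobs.at(i)->finished)
                return subJobs.at(i)->schedule();

        return false;  // everything in this directory is done
    }
};

int main()
{
    Directory dir;
    dir.subJobs = {new Job, new Job, new Job};
    while (dir.scheduleNextJob()) {}  // drains the jobs front to back
    qDeleteAll(dir.subJobs);
    return 0;
}
```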
@ogoffart I did not touch this implementation of starting from 0, since I thought you had some idea or bug scenario behind it. I can also add this caching if you say that does not matter. I would anyway go with the other class syncing files at the end, since I want to add another feature there: a job scheduler which will try to fill the upload/download bandwidth by running these jobs in parallel. In the PropagateSyncFiles job I want to create a few queues from which I will extract jobs in a round-robin fashion. EDIT: this code might have to be reviewed for aborts/strange scenarios, you know.
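A rough sketch of the round-robin extraction described above; the queue names and `Job` type are hypothetical, not the planned class's API.

```cpp
// Sketch: several queues (download/upload/...) drained one item at a
// time in turn, so no single queue monopolizes the scheduler.
#include <QQueue>
#include <QString>
#include <QVector>
#include <QDebug>

struct Job { QString name; };

int main()
{
    QQueue<Job> downloads, uploads;
    downloads.enqueue({"dl-a"});
    downloads.enqueue({"dl-b"});
    uploads.enqueue({"up-a"});

    QVector<QQueue<Job> *> queues{&downloads, &uploads};
    int next = 0;          // whose turn it is
    int emptyInARow = 0;   // stop once every queue came up empty in a row

    while (emptyInARow < queues.size()) {
        QQueue<Job> *q = queues[next];
        if (!q->isEmpty()) {
            qDebug() << "run" << q->dequeue().name;  // one job per turn
            emptyInARow = 0;
        } else {
            ++emptyInARow;
        }
        next = (next + 1) % queues.size();
    }
    return 0;
}
```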
Force-pushed from 34d54b7 to 303ba6f
Ok, I pushed a new version. This should solve most of the concerns, and the implementation is very easy and lightweight. It should also reduce memory usage. Caching (and thus the CPU saving) I will add in another PR. @guruz @jturcotte @DeepDiver1975 @ogoffart EDIT: tested manually with all possible sync operations. EDIT: fixed bug - previously failing unit tests.
Force-pushed from 01f7957 to a4039b3
Force-pushed from a4039b3 to 5aa1629
@mrow4a I've read through the patch and don't understand yet why the split into subJobs and syncJobs is necessary. Wouldn't removing all those finished jobs from _subJobs (or, less invasively, keeping track of the first interesting index) have the same effect? I'm assuming the main thing you want to get rid of is iterating over all these finished jobs all the time. To be more specific:
Caching is yet another problem, for a separate PR. In this PR the loop reduction was a side effect; what I actually want is a clear separation of download/upload jobs from other jobs, which will ease the pain of adding new features. I want another component with separate queues of download/upload/chunk/update jobs for bundling, schedulers, delta sync and more. I don't want to use PropagateDirectory, which serves there only as a job dispatcher. EDIT: currently jobs are taken almost literally from the filesystem as they are seen and sent, without considering what they really are. There is no real sync protocol there. @ckamm Please open a separate PR with your solution. -> #5274
@mrow4a Thanks for explaining, I thought reducing pointless looping was your main objective. About concurrency: only the GUI thread calls scheduleNextJob. All threaded concurrency happens purely inside the QNetworkAccessManager when propagation jobs run network tasks.
@mrow4a Could you build a Windows test client with your improvements on rotor? A user has offered to help test-drive this.
Is this still relevant now that #5274 was integrated?
@ogoffart You suggested doing everything in PropagateDirectory; I will try to fit everything there, along with bundling. Let's see how it works. Otherwise it will end up with something like the solution here, if PropagateDirectory becomes too big a mess. That class is already complicated, without adding anything else.
Outdated by #5440
Hello guys:
While working on bundling, I analysed a bit of the code and found this little flower, which reduces the number of loops significantly - in this example from 506,000 to 3,000 for 1,000 files. There is also a bigger idea behind this implementation, but I will explain it later; please verify and do some QA on this small solution:
MOTIVATION:
The only change here is that for jobs other than NEW and SYNC we proceed in the old manner, while for NEW and SYNC we create a new job, insert it into that directory's _primaryJobs, and within that job extract jobs from the queue and execute them, without checking the status of all jobs that have already finished - please refer to the code in the first picture. The check is done at the end of each job, verifying whether the queue is empty and whether the number of jobs ordered equals the number of jobs finished.
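The following is a rough sketch of that split, with all names hypothetical: NEW/SYNC items land in a plain FIFO that is drained without rescanning finished jobs, everything else keeps the old container, and completion is detected by comparing jobs started against jobs finished.

```cpp
// Sketch of the described queue split, not the PR's actual classes.
#include <QQueue>
#include <QVector>

enum class Instruction { New, Sync, Other };
struct Item { Instruction instr; };

struct DirectoryScheduler {
    QQueue<Item> syncQueue;   // NEW and SYNC items, drained front to back
    QVector<Item> otherJobs;  // everything else keeps the old path
    int started = 0;
    int finished = 0;

    void add(const Item &item)
    {
        if (item.instr == Instruction::New || item.instr == Instruction::Sync)
            syncQueue.enqueue(item);
        else
            otherJobs.append(item);
    }

    // Take the next queued item without scanning any finished jobs.
    bool scheduleNext()
    {
        if (syncQueue.isEmpty())
            return false;
        syncQueue.dequeue();  // hand the item to a propagation job here
        ++started;
        return true;
    }

    void onJobFinished() { ++finished; }

    // Completion check: queue drained and every started job reported back.
    bool done() const { return syncQueue.isEmpty() && started == finished; }
};

int main()
{
    DirectoryScheduler sched;
    sched.add({Instruction::New});
    sched.add({Instruction::Other});
    sched.add({Instruction::Sync});

    while (sched.scheduleNext())
        sched.onJobFinished();  // in reality this callback is asynchronous

    return sched.done() ? 0 : 1;
}
```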
Test on localhost: I added one folder and 1000 files of 1 kB each.
The old version needed around half a million loop iterations to execute this code:
My version needs around three thousand loop iterations for the same 1000 files. The sync also got a little faster, from 58 s to 50 s on my machine, and used much less CPU.
@dragotin @DeepDiver1975 @ogoffart @guruz @cdamken