Improved performance when importing wide datasets (and fixed bug when selecting multiple variables) #8465
Conversation
updated branch from main master
@Patowhiz that seems extremely promising already.
a) I tried the ExtRemes > carcasson data from the library, which has 12,000 variables. It now took just 23 seconds to import into R-Instat. Great.
b) I then tried the Prepare > Data: Change > Transpose. It took just 3 seconds to open and load the data selector. That was (I think) fixed before.
c) I then added all into the data receiver. That took 150 seconds. I suspect that might be quite an easy fix? Could that also be done in this pull request?
d) Then I pressed Ok. It then took another 23 seconds to write the transposed data frame. I was pleased initially, but I now wonder why that takes so long. It is just writing a data frame of 4 variables and 12,000 rows.
e) Then I transposed back again. How long does R-Instat take to write a data frame internally with 12,000 columns? That took only the 23 seconds again, which is just great.
f) I then tried to see what, if anything, you have done when I ask for the metadata. This freezes up, and I have to restart R-Instat.
g) When I restarted I was pleased to see that it works fine on running the log file.
This is so, so promising. Very exciting indeed.
To report further, there is a free-to-download dataset from DHS. This has 8 data frames, one of which has 7654 variables (6290 rows) and it has others of 4200, 2200 and 1000 variables. The rds file took about 31 seconds to open initially. That is acceptable.
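To make the shape change being timed above concrete: transposing turns the 4-row by 12,000-column frame into 12,000 rows by 4 columns. This is a minimal Python sketch of that operation on a plain list-of-rows table, purely for illustration; it is not R-Instat's or R's actual code.

```python
# A hypothetical wide table: 4 rows by 12,000 columns, stored as a list of rows.
# Cell values are synthetic (row-major indices) so the result can be checked.
n_rows, n_cols = 4, 12000
wide = [[r * n_cols + c for c in range(n_cols)] for r in range(n_rows)]

# Transposing swaps rows and columns: zip(*rows) pairs up the c-th element
# of every row, giving 12,000 rows of 4 values each.
tall = [list(col) for col in zip(*wide)]

assert len(tall) == n_cols          # 12,000 rows after transposing
assert len(tall[0]) == n_rows       # 4 columns after transposing
assert tall[5][2] == 2 * n_cols + 5 # tall[c][r] == wide[r][c]
```

The transpose itself is cheap even at this width; the slow parts reported above are presumably in the surrounding UI work (filling receivers and writing the result back), not the reshaping.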
@rdstern thanks for testing this. I've been looking into the dialogs issue as well. I'll post my changes for testing once I'm done.
multiple receiver and selector changes
@lloyddewit @ChrisMarsh82 when enhancing the performance of the multiple receivers I noticed the following:
I have not yet seen any dialog which loads variables from more than 1 data frame into different options at the same time. In dialogs, each selector is associated with 1 data frame, and each receiver is always associated with the data frame selected in the selector. @rdstern, is there any dialog that deviates from this 'principle'? I wasn't able to understand why @volloholic and @dannyparsons made such an implementation; I think it is an unnecessary complication which has led to much more complexity in the design of the selector and receiver.
On a separate note, I noticed the transpose dialog (dlgTranspose) uses the selector and receiver in an 'unconventional' way. It has to be refactored to make it use the performance gains made in this PR. I suggest @N-thony have a look at it after this PR is merged.
@rdstern @lloyddewit below are my performance tests. I had 2 Chrome windows, each with multiple tabs open, plus Visual Studio and Calculator running. Before this PR, in 1 trial it took (considering the time taken, I didn't have the time to run a second trial):
I didn't bother proceeding with the trial; it was clearly a waste of time. In this PR, in 2 trials it took:
This shows dramatic performance gains after optimizing the existing code, despite the over-engineering of the controls. @rdstern you are welcome to do your own test trials using the 3 dialogs mentioned above.
@Patowhiz I will try to answer your questions later.
In the meantime I have been testing what you have done, and I suggest something very useful is there; a small tweak, not a big improvement, is all it needs for now.
a) I tried first with the extRemes dataset. I used the Describe > Summarise facility. This worked fine. It was also quite a severe test of the output window (and Maximise) because there was output for 12,000 variables. Why anyone would want to do that is unclear, but it all works fine now:
- It takes about 30 seconds to open initially. That's ok.
- Then the summarise dialog was fine - just about 3 seconds - first time, and almost immediate when I open it again.
- I copied all variables over to the receiver and that was fine too. Great!
- So was the data processing and the results in the output window.
- Then I tried opening the column metadata and it all seemed to freeze. So I restarted R-Instat! But read on.
b) Then I tried with the free DHS data. This is a real-world dataset: they distribute it to anyone, but ask lots of questions on why you need another one.
It has 8 data frames. The first is 6,000 rows by 7,500 variables! The others have 7, 971, 2,264, 4,275, 940, 526 and 326 variables. The cases are up to 35,000. So it is a formidable dataset, including one data frame with more variables than I have ever met in any survey.
a) All worked fine. I summarised all the variables in turn, switching data frames.
b) Now the surprise. I tried the column metadata. The screen looks terrible and I assumed it had frozen. But I went away for a few minutes and then found it had worked!!!
c) And the data frame metadata was fine.
d) I closed it, and then opened it again. That was almost immediate!
So, for now @Patowhiz, instead of a clever solution with the column metadata, why not have a continue-message box: "Are you sure you need the column metadata? If so, be patient; it will be slow to load the first time. Yes/No."
Then, ideally, not the mix of screens that looks as though there is a big problem. Then, as I say, it appears! For all 8 data frames. And looks great.
So, rather than getting too clever on the overall display of the metadata, I would prefer that we enhance the select (of different variables), and I hope Antoine can include the option of being able to apply the selects to the metadata as well.
@rdstern thanks for testing this. This is promising, considering it was very difficult to test such wide data sets before this PR. I spent a considerable amount of time (4 days) looking into this issue in a general way. Most solutions that I came up with risked regression in areas that aren't clearly understood. I'm tempted not to lose the momentum (despite my many other tasks). If a message is all you need, then that sounds like a good quick task for @N-thony.
I did write some prototype code that could have allowed me to support paging of the metadata. I now wonder whether I should spend more time crafting something concrete for the August release. Kindly have a look at the issues I have raised above as well. It would be good if you could get feedback from @volloholic too; I still don't quite understand why we needed that complexity. I recommend removing it now, while I understand its pitfalls well enough.
The base code that supports paging of column metadata should be usable with column selects. Basically, the selector, receivers and column metadata will probably use the same data source, so solving the paging feature is a step towards solving the column select feature.
I totally agree we definitely need to support and gain experience in developing for wide datasets. I can see them being increasingly adopted in this age of abundant computing power and AI tools. I'm very optimistic that the output window can in future comfortably handle large outputs; its design already supports the addition of such an implementation.
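The paging idea mentioned here, filling the metadata grid one page at a time instead of all 12,000 rows at once, might look something like the sketch below. This is hypothetical Python for illustration only; the prototype code referred to is not shown in this thread, and the function and field names are invented.

```python
def metadata_page(columns, page, page_size=100):
    """Return one page of column metadata, rather than the whole list.

    The grid would request page 0 first, then fetch further pages on
    demand as the user scrolls, keeping the initial fill fast.
    """
    start = page * page_size
    return columns[start:start + page_size]


# Hypothetical metadata for a 12,000-variable data frame.
cols = [{"name": f"v{i}", "type": "numeric"} for i in range(12000)]

first = metadata_page(cols, page=0)    # the grid's initial fill: 100 entries
last = metadata_page(cols, page=119)   # the final page, also 100 entries
empty = metadata_page(cols, page=120)  # past the end: nothing to fetch
```

Because the selector, receivers and column metadata would share the same underlying data source, one paging layer like this could serve all three.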
@Patowhiz If you could do the paging of the metadata window quickly, that would be great. Otherwise, it would be good to write an issue while this is all fresh in your mind.
@Patowhiz and @ChrisMarsh82, you asked:
I have answers that relate at least to some of these questions:
I do not know of any instance where we currently use this feature. It is important for future plans: it should enable dialogs (when needed, and in linked data frames) to analyse data at multiple levels, without them needing to be merged to all be at the same level. An example (though not in a current multiple receiver) could be a water-balance calculation where you calculate …
So, you are welcome to have a switch for now that simplifies (and hopefully speeds up) the work when this option does not apply. But don't get rid of the option completely.
On your point 4: if it would really help to avoid this, then we could have 2 selectors in these cases?
I don't quite understand your points 2 and 3. Maybe @ChrisMarsh82 can help? I hope this helps a bit?
updated the branch
@Patowhiz this all seems to work now. If @lloyddewit agrees, then I would be very happy to see this merged now, with another issue on what remains outstanding.
I tried with 1 medium-width example (1,000 variables). Then (initially accidentally) I opened a wide one of 7,400 variables (1 min 10 sec) and one of 12,000 variables (2 min 30 sec) while I had the metadata window open. They looked fine!
Then I did them in 2 steps, so opening and then adding the metadata window.
Here is the picture while the metadata window is being added:
It looks odd, but not terrible.
So, if merged, then I suggest an additional issue to be considered soon, perhaps for the September upgrade, as follows:
a) @Patowhiz said that the transpose dialog has different code, and maybe @N-thony could improve that. This is an obvious dialog when dealing with wide data sets, so perhaps that can be done soon.
b) Discuss and probably install a limit in the Tools > Options menu. What happens with (say) 50,000 variables? R doesn't have a particular limit.
c) Consider a message, and perhaps a "be patient" dialog, when opening the information for the first time.
@rdstern I have now added the warning message. If you press No, then you will only see an empty column metadata grid for any data frame that has a wide data set. The warning is also shown whenever a wide data set is detected. The oddity you refer to above is due to how visibility toggling of the "windows" is currently implemented. I think it was expected that the visibility toggling would happen very fast, which in this case it does not, due to the many things happening in the background in a single thread.
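The behaviour described, prompting only when a wide data set is detected and otherwise filling the grid silently, could be sketched as follows. The threshold value and all names here are assumptions for illustration (a later comment mentions a hard-coded figure around 1,000), not the actual implementation.

```python
WIDE_COLUMN_THRESHOLD = 1000  # assumed hard-coded limit, not confirmed


def should_fill_metadata_grid(n_columns, ask_user):
    """Decide whether to fill the (potentially slow) column metadata grid.

    ask_user is a callback that shows the warning and returns the user's
    answer as "yes" or "no"; it is only invoked for wide data frames.
    """
    if n_columns <= WIDE_COLUMN_THRESHOLD:
        return True  # narrow data frame: fill silently, no prompt
    return ask_user(n_columns) == "yes"


# Simulated user responses via lambdas standing in for the dialog:
assert should_fill_metadata_grid(500, lambda n: "no") is True    # never asked
assert should_fill_metadata_grid(7654, lambda n: "yes") is True  # asked, agreed
assert should_fill_metadata_grid(7654, lambda n: "no") is False  # asked, declined
```

Declining leaves the grid empty for that wide data frame, which matches the behaviour described above.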
@Patowhiz I started checking the latest changes, and then found a more serious regression. The Right-click on data frame > Delete now seems to freeze up. I think the ordinary menu item also doesn't work.
It is fine in the latest merged version. But in your branch it seems to be stuck here?
I also have comments about your warning, but wanted to make this response first. It is nothing to do with wide data sets. This was with my small survey (36 rows and 7 variables).
@Patowhiz this seems fine and I am therefore approving.
I wonder what your lower limit for the prompt is? I tried with 8 files simultaneously; that is the free set from DHS, and so a good one to show and use. It asked about all 3 of the widest, which have CR: 2,262, HR: 7,654 and IR: 4,275 variables. I tried saying Yes to everything and it was reasonably quick, but worth having the warning.
I then tried with yes to CR 2262, and No to the wider ones. Your limit is clearly less than 2262 and I wonder about making it higher? For example 3000 or even 4000 variables?
Apart from that, and the probably unrelated problem with the Delete data frame, it is looking great. I am approving, though I worry slightly about the Delete problem. Are there other vulnerable points if we merge?
@Patowhiz I have now found a small problem with the warning and that is when it is given. I am getting the warning too often.
a) I said no the first time for the 2 wide data sets. That was fine. Worked well.
b) I then closed the metadata and later I opened it again. It asked me again about those 2 sheets. That's questionable, but I suppose ok.
c) Then, when the metadata was open, I changed a numeric to a factor variable on one of the small files (just 7 variables). And it asked me yet again. That's too often, and when I was working happily on a small file.
So, at most it should ask again only when I re-open the metadata. I would also not mind if it only asked the first time. But I should be able to work with the metadata window open without it ever asking again.
@rdstern thanks for testing this. I have now fixed the issues you raised above.
I can't guarantee there will be no performance issues beyond that number when filling the grid. If you strongly want more than 1,000, I can quickly change the hard-coded figure. Regarding the prompt, I added an extra response button as a bonus, so users can choose not to give a definitive answer by pressing Cancel. I'm happy to change that implementation.
updated my branch
updated the branch with latest changes from the master
@Patowhiz I have been checking the saving of the wide data in this new version as well as the latest released version. I have been using the Carcassonheat data from the extRemes package. It is small, i.e. just 4 rows and 12,054 variables. I hope this helps. Here are 2 data books that do open (and save) with your version:
g) Finally, what took forever, but worked, was to use the old R-Instat (version 4.16), and append and then save. In the original, the save works fine and quickly. The saving is quick, but of course the appending was painfully slow.
@lloyddewit this is a major improvement.
@Patowhiz thanks, looks great, I just had some small comments
Co-authored-by: lloyddewit <[email protected]>
@lloyddewit thanks for the code review.
@Patowhiz I think there are still 8 unresolved comments above, please could you check? Thanks.
Co-authored-by: lloyddewit <[email protected]>
@lloyddewit thanks. I think I've answered all your questions and committed some of your suggestions.
updated branch with changes from master
@rdstern if you can retest/approve, then we can merge, thanks.
I now don't get any issues saving the data.
I am having some interesting problems getting the metadata sometimes.
a) I first tried with the extRemes data, with 2 sheets of 12,000 and it never finished.
b) On another occasion, with just one lot of the 12000 it gave me the following:
I continued and it then worked.
c) I then tried with versions that were saved as rds files. For the dhs data it was fine but with no message. It looked as below. For the extremes data it didn't give the message but then worked while I was writing this.
I am approving because the changes are so important. I will keep checking.
@rdstern when you say No the first time, the message prompt never gets shown to you again until you restart R-Instat. That means any other imports that follow won't prompt you about column metadata viewing. The same applies to saying Yes. As explained in my previous comments, if you accept to view the column metadata, expect some performance issues. This is due to filling the grid.
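The once-per-session behaviour described here, ask once and then reuse the answer until R-Instat restarts, can be sketched as below. This is hypothetical Python with invented names, not the actual implementation.

```python
class MetadataPromptState:
    """Remember the user's wide-data answer for the rest of the session."""

    def __init__(self):
        self._answer = None  # None means the user has not been asked yet

    def get_answer(self, ask_user):
        # Only invoke the prompt callback the first time; afterwards the
        # stored answer is reused until the application restarts (i.e. until
        # a fresh MetadataPromptState is created).
        if self._answer is None:
            self._answer = ask_user()
        return self._answer


state = MetadataPromptState()
calls = []

def ask():
    calls.append(1)   # record that the dialog was actually shown
    return "no"

first = state.get_answer(ask)   # prompts the user
second = state.get_answer(ask)  # reuses the stored answer, no prompt
```

A consequence of this design, as noted above, is that later imports in the same session inherit the earlier answer, which is why closing and re-opening the metadata no longer re-prompts.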
@Patowhiz thanks. Now I understand. I tried again with the DHS wide data, which for me is now an rds file. And it asked me in the right way. I also found it has a lot, really a lot, of empty variables. When I got rid of those, it has just one data frame of over 2,000 variables and two with more than 1,000. I said Yes and it took only 7 seconds to get the metadata! We have both approved now, so I hope this can be merged this week.
Fixes some items in issue #7161. Fixes #8286
@africanmathsinitiative/developers this is ready for review