Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved performance when importing wide datasets (and fixed bug when selecting multiple variables) #8465

Merged
merged 28 commits into from
Sep 12, 2023

Conversation

Patowhiz
Copy link
Contributor

@Patowhiz Patowhiz commented Jul 27, 2023

Fixes some items in issue #7161. Fixes #8286
@africanmathsinitiative/developers this is ready for review

Copy link
Collaborator

@rdstern rdstern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Patowhiz that seems extrmely promising already.
a) I tried the ExtRemes > carcasson data from the library, which has 12,000 variables. It now took just 23 seconds to import into R-Instat. Great.
b) I then tried the Prepare > Data: Change > Transpose. It took just 3 seconds to open and load the data selector. That was (I think) fixed before.
c) I then added all into the data receiver. That took 150 seconds. I suspect that might be quite an easy fix? Could that also be done in this pull request?
d) Then I pressed Ok. It then took another 23 seconds to write the transposed data frame. I was pleased initially, but I now wonder why that takes so long. It is just writing a data frame of 4 variables and 12,000 rows.
e) Then I transposed back again. How long does R-Instat take to write a data frame internally with 12,000 columns. That took only the 23 seconds again - which is just great.
f) I then tried to see what - if anything you have done when I ask for the metadata. This freezes up, and I have to restart R-Instat.
g) When I restart I was pleased to see that it works fine on running the log file.

This is soooooooooooo promising. Very exciting indeed.

To report further, there is a free-to-download dataset from DHS. This has 8 data frames, one of which has 7654 variables (6290 rows) and it has others of 4200, 2200 and 1000 variables. The rds file took about 31 seconds to open initially. That is acceptable.

@Patowhiz
Copy link
Contributor Author

@rdstern thanks for testing this.
Like I explained before. The column metadata needs the "paging" structures similar to the data viewer. That's not quite trivial to do correctly, but very doable. I just need to countercheck with @ChrisMarsh82 about my ideas.
I've done a temporary fix with a todo comment. The fix loads the column metadata only when it's visible.
The long term fix is to load only 1000 visible columns. This fix should also b used by the dialogs somehow.

I've been looking into the dialogs issue as well. I'll post my changes for testing once I'm done.
I'll also look at some of your items above.

@Patowhiz Patowhiz changed the title Improves performance in importing of wide datasets Improves performance in importing of wide datasets and multiple receiver Aug 8, 2023
@Patowhiz
Copy link
Contributor Author

Patowhiz commented Aug 8, 2023

@lloyddewit @ChrisMarsh82 when enhancing the performance of the multiple receivers I noticed the following;

  1. Multiple Receivers are assumed to be capable of being attached to more than 1 data frame, that is, every variable can have a different data frame.
  2. Selector uses receiver data frame to set the selector currently selected data frame in some events.
  3. We are always transforming data from one format to another between the receiver and selector even though we are using a list view control in both controls.
  4. It is also assumed that the selector can have variables that belong to multiple data frames in the same dialog at the same time.

I have not yet seen any dialog which loads variables from more than 1 data frame in different options at the same time. In dialogs each selector is associated with 1 data frame. And each receiver is always associated with the data frame selected in the selector. @rdstern is there any dialog that deviates from this 'principle'.

I wasn't able to understand why @volloholic and @dannyparsons made such an implementation, I think this is unnecessary complication which has led to much more complexity in the design of selector and receiver.
@rdstern once I have answers to the above 4 items, I can make changes that have significant performance difference. I don't want to spend more time refactoring the selector and receiver on a feature I haven't seen used anywhere yet.
This PR focused on optimizing existing code implementations, to make them optimal for wide data sets.

On a separate note, I noticed the transpose dialog (dlgTranspose) uses the selector and receiver in an 'unconventional' way. It has to be refactored to make it use the performance gains made in this PR. I suggest @N-thony to have a look at it after this PR is merged.

@Patowhiz
Copy link
Contributor Author

Patowhiz commented Aug 8, 2023

@rdstern @lloyddewit below are my performance tests. I had 2 chrome windows, each with multiple tabs open, visual studio and calculator running.

Before this PR, in 1 trial it took (considering the time taken, I didn't have the time to run a second trial);

  • 960 seconds to load the CarcasonneHeat data from extremes package
  • 10 seconds to load the one variable summaries dialog.
  • 370 seconds to add all the variables to the multiple selector of one variable summaries dialog.
  • Over 6,000 seconds to reload the one variable summaries dialog. I had stopped the application.

I didn't bother proceeding with trial, it was clearly a waste of time

In this PR, in 2 trials it took;

  • Average of 37.5 seconds to load the CarcasonneHeat data from extremes package
  • Average of 6 .5 seconds to load the one variable summaries dialog.
  • Average of 12.5 seconds to add all the variables to the multiple selector of one variable summaries dialog.
  • Average of 25.0 seconds to reload the one variable summaries dialog.
  • Average of 11.5 seconds to load the transpose dialog.
  • Average of 15.5 seconds to add all the variables to the multiple selector of transpose dialog.
  • Average of 6.5 seconds to reload the transpose dialog.

This shows exponential performance gains after optimizing the existing code despite the overengineering of the controls.

@rdstern you are welcome to do your own test trials using the 3 dialogs mentioned above.

Copy link
Collaborator

@rdstern rdstern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Patowhiz I will try to answer your questions later.
Meantime I have been testing what you have done and I suggest something very useful is there, and a small tweak - not a big improvement - is all it needs for now.
a) I tried first with the extRemes dataset. I used the Describe > Summarise facility. This worked fine. It was also quite a severe test of the output window (and Maximise) because there was output for 12,000 variables. Why anyone would want to do that, is unclear, but it all works fine now:

  1. It takes about 30 seconds to open initially. That's ok.
  2. Then the summarise dialog was fine - just about 3 seconds - first time, and almost immediate when I open it again.
  3. I copied all variables over to the receiver and that was fine too. Great!
  4. So was the data processing and the results in the output window.
  5. Then I tried opening the column metadata and it all seemed to freeze. So I restarted R-Instat! But read on.

b) Then I tried with the free DHS data. This is a for real set, because they distribute this to anyone, but ask lots of questions on why you need another one.
This has 8 dataframes. The first is 6000 rows by 7500 variables! The others have 7, 971, 2264, 4275, 940, 526 and 326 variables. The cases are up to 35,000. So a formidable dataset including one data frame with as many variables as I have met before in any survey.
a) All worked fine. I summarised all the variables in turn, switching data frames.
b) Now the surprise. I tried the column metadata. The screen looks terrible and I assumed it had frozen. But I went away for a few minutes and then found it had worked!!!
c) And the data frame metadata was fine.
d) I closed it, and then opening it again. That was almost immediate?

So, for now @Patowhiz instead of a clever solution, with the column metadata, why not have a message to continue box. Are you sure you need the column metadata? If so, be patient. It, will be slow to load the first time. Yes/No. nThen ideally not the mix of screens that looks as though there is a big problem. Then, as I say. It appears! For all 8 data frames. And looks great.

So, rather than getting too clever on the overall display of the metadata, I would prefer that we enhance the selct (of different variables) and I hope Antoine can include the option of being able to apply the selects to the metadata as well.

@Patowhiz
Copy link
Contributor Author

Patowhiz commented Aug 8, 2023

@rdstern thanks for testing this. This is promising, considering it was very difficult to test such wide data sets before this PR.

I spent quite a considerable amount of time (4 days) looking into this issue in a general way. Most solutions that I came up with risked regression in areas that aren't clearly understandable. I'm tempted to not loose the momentum (despite my many other tasks).

If a message is all you need, then that sounds as a good quick task for @N-thony.

I did write some prototype code that could have allowed me to support paging of the metadata. I now wonder whether I should spend more time in crafting something concrete for the August release.

Kindly have a look at the issues I have raised above as well. Would be good if you could get feedback from @volloholic as well, I still don't quite understand why we needed that complexity. I recommend removing it now when I understand it's pitfalls well enough.

The base code that supports paging of column metadata should be usable with columns selects. Basically, the selector, receivers and column metadata will probably use the same data source. So solving the paging feature is a step towards solving the column select feature.

I totally agree we definitely need to support and gain experience in developing for wide datasets. I can see them being increasingly adopted in this age of abundant computing power and AI tools.

I'm very optimistic that the output window can in future comfortably handle large outputs, it's design already supports addition of such implementation.

@rdstern
Copy link
Collaborator

rdstern commented Aug 8, 2023

@Patowhiz If you could do the paging the metadata window quickly, that would be great. Otherwise, it would be good to write an issue while this is all fresh in your mind.

@rdstern
Copy link
Collaborator

rdstern commented Aug 10, 2023

@Patowhiz and @ChrisMarsh82, you asked:

  1. Multiple Receivers are assumed to be capable of being attached to more than 1 data frame, that is, every variable can have a different data frame.
  2. Selector uses receiver data frame to set the selector currently selected data frame in some events.
  3. We are always transforming data from one format to another between the receiver and selector even though we are using a list view control in both controls.
  4. It is also assumed that the selector can have variables that belong to multiple data frames in the same dialog at the same time.

I have answers that relate at least to some of these questions:

  1. A multiple receiver can (in principle) have data from different data frames.

I do not know of any instance where we currently use this feature. It is important for future plans. It should enable dialogs (when needed, and in linked data frames) to analyse data at multiple levels, without them needing to be merged to all be at the same level.

An example (though not in a current multiple receiver) could be a water-balance calculation where you calculate
water balance = (rainfall - evaporation), between limits of say 0 and 100. The evaporation is sometimes the average over the years, so just 366 values - in a special data frame, while the rainfall is for individual years. Currently you have to merge the evaporation column first.

So, you are welcome to have a switch for now, that simplifies (and hopefully speeds up) the work when thism option does not apply. But don't get rid of the option completely.

Your point 4 is:
4. It is also assumed that the selector can have variables that belong to multiple data frames in the same dialog at the same time.
Most dialogs that use 2 data frames have 2 separate selectors. The climatic > crops dialog is a rare exception that has a single selector, used at the daily level and also (for other single receivers) at the summary level.
The new define climatic data may be a second example. Here most of the information is for a daily data frame, but (for the evapotranspiration dialog) we include lat and long now, and that comes from a station level.

If it would really help to avoid this, then we could have 2 selectors in these cases?

I don't quite understand your points 2 and 3. Maybe @ChrisMarsh82 can help?

I hope this helps a bit?

rdstern
rdstern previously approved these changes Aug 22, 2023
Copy link
Collaborator

@rdstern rdstern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Patowhiz this all seems to work now. If @lloyddewit agrees, then I would be very happy to see this merged now, with another issue on what remains oustanding.
I tried with 1 medium-width example (1000), variables. Then (initially accidentally) I opened a wide one 7400 (1 min 10 seconds) and 12,000 variables (2mins 30 seconds) when I had the metadata window open. They looked fine!
Then I did them in 2 steps, so opening and then adding the metadata window.
Here is the picture while the metadata window is being added:

image

It looks odd, but not terrible.

So, if merged, then I suggest an additional issue to be considered soon, perhaps for the September upgrade, as follows:
a) @Patowhiz said that the transpose dialog has different code, and maybe @N-thony could improve that. This is an obvious dialog when dealing with wide data sets, so perhaps that can be done soon.
b) Discuss and probably install a limit in the Tools > Options menu. What happens with (say) 50,000 variables, etc. R doesn't have a particular limit.
c) Consider a message and perhaps the be patient dialog, when opening the information for the first time.

@Patowhiz
Copy link
Contributor Author

@rdstern I have now added the warning message. If you press no then you will only see an empty column metadata for the data frame that has wide data sets only. The warning is also shown if a wide data set is detected.

The oddity you are referring above is due to how visibility toggling of the "windows" is currently implemented. I think it was expected that the visibility toggling will happen very fast which in this case it's not due to the many things happening in the background in a single thread.
I also noticed the oddity when you initially open R-Instat. I'll quickly have a look and see if I can correct the implementation.

Copy link
Collaborator

@rdstern rdstern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Patowhiz I started checking the latest changes, and then found a more serious bit of regression. The Right-click on data frame > Delete now seems to freeze up. I think the ordinary menu item also doesn't work.

It is fine in the latest merged version. But in your branch it seems to be stuck here?

image

I also have comments about your warning thing, but wanted to make tis response first. It is nothing to do with wide data sets. This was with my small survey (36 rows and 7 variables).

Copy link
Collaborator

@rdstern rdstern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Patowhiz this seems fine and I am therefore approving.
I wonder what your lower limit is, that ask? I tried with 8 files simultaneously that is the free set from DHS and so a good one to show and use. It asked about all 3 of the widest, which have CR: 2262, HR 7654, IR 4275 variables. I tried saying Yes to everything and it was reasonably quick - but worth having the warning.
I then tried with yes to CR 2262, and No to the wider ones. Your limit is clearly less than 2262 and I wonder about making it higher? For example 3000 or even 4000 variables?

Apart from that - and the probably unrelated problem with the Delete data frame it is looking great. I am approving though I worry slightly about the Delete problem. Are there other vulnerable points if we merge?

Copy link
Collaborator

@rdstern rdstern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Patowhiz I have now found a small problem with the warning and that is when it is given. I am getting the warning too often.
a) I said no the first time for the 2 wide data sets. That was fine. Worked well.
b) I then closed the metadata and later I opened it again. It asked me again about those 2 sheets. That's questionable, but I suppose ok.
c) Then, when the metadata was open I changed a numeric to a factor variable on one of the small files (just 7 variables). And it asked me yet again. That's too often and when I was working happily on a small file.
So, at the most it could ask repeatedly when I re-open the meta-data. I would also not mind, if it only asked the first time. But I should be able to work with the metadata window open without it ever asking again.

@Patowhiz
Copy link
Contributor Author

@rdstern thanks for testing this. I have now fixed the issues you raised above.
My limit is actually 1000 variables. The decision behind that number is based on;

  1. The data viewer default rows limit is 1000. We have tested this to know there is no performance issues.
  2. The copy feature limit is 1000 rows. We have tested this to know there is no performance issues.

I can't guarantee if there will be no performance issues beyond that number when filling the grid. If you strongly want more than 1000 I can quickly change the hard coded figure.

Regarding the prompt, I added an extra response button as a bonus. So you users can chose to not give a definitive answer by pressing 'cancel'. I'm happy to change that implementation.

updated the branch with latest changes from the master
@rdstern
Copy link
Collaborator

rdstern commented Aug 30, 2023

@Patowhiz I have been checking the saving of the wide data in this new version as well as the latest released version. I have been using the Carcassonheat data from the ExtRemes package. It is small, i.e. just 4 rows and 12054 variables.
a) I opened with the new (wide) version. This takes about 29 seconds to open. That's fine. I asked for the metadata and said no. Then I tried to save and it gave the overflow error trying to save the data book. (That was file called Extremes_wide.)
b) I tried again and this time I said yes to the metadata. That took about 2 minutes. I then tried saving and got the same error. File called extremes_wide2)
c) So I tried opening the same data in the "old" R-Instat (Version 0.7.16). That took about 12.5 minutes to load from the library, and was painful, in looking as though the software had frozen. I persisted and then saved. It appeared to save instantly. That's extremes_wide3.
d) Now I opened extremes_wide3 in the new test wide version. It opens fine. I added some small data frameas (from the survey). I then tried to save. It saves fine. So, it is writing the first data book that is the problem. Once it has one it seems fine. That's test_wide4. I can open it and save it fine with your new code.
e) I now copied the wide data frame. That takes a little time as one would expect. I now saved that data book. I called it test_wide_double, as there are now 2 wide data frames. That also saves and then opens fine in your new test version.
f) I now appended the 2 wide data frames, so I have a new one with 8 rows instead of 4 as well as the original. Now I get the error again, when saving.

image

I hope this helps.

Here are 2 data books that do open (and save) with your version:

extRemes_wide_double.zip

g) Finally, what took forever, but worked. Was to use the old R-Instat (version 4.16), and append and then save. In the original, the save works fine and quickly. The saving is quick, but of course the appending was painfully slow.

rdstern
rdstern previously approved these changes Aug 30, 2023
Copy link
Collaborator

@rdstern rdstern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lloyddewit this is a major improvement.

Copy link
Contributor

@lloyddewit lloyddewit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Patowhiz thanks, looks great, I just had some small comments

instat/dlgDeleteDataFrames.vb Outdated Show resolved Hide resolved
instat/dlgTransposeColumns.vb Outdated Show resolved Hide resolved
instat/ucrColumnMetadata.vb Show resolved Hide resolved
instat/ucrColumnMetadata.vb Outdated Show resolved Hide resolved
instat/ucrColumnMetadata.vb Outdated Show resolved Hide resolved
instat/ucrReceiver.vb Outdated Show resolved Hide resolved
instat/ucrReceiverExpression.vb Outdated Show resolved Hide resolved
instat/ucrReceiverMultiple.vb Show resolved Hide resolved
instat/ucrReceiverMultiple.vb Show resolved Hide resolved
instat/ucrReceiverMultiple.vb Show resolved Hide resolved
@Patowhiz
Copy link
Contributor Author

Patowhiz commented Sep 5, 2023

@lloyddewit thanks for the code review.
I hope I have answered all of your above questions.

@lloyddewit
Copy link
Contributor

@lloyddewit thanks for the code review. I hope I have answered all of your above questions.

@Patowhiz I think there are still 8 unresolved comments above, please could you check? thanks

@Patowhiz
Copy link
Contributor Author

Patowhiz commented Sep 6, 2023

@lloyddewit thanks. I think I've answered all your questions and committed some of your suggestions

updated branch with changes from master
@lloyddewit lloyddewit changed the title Improves performance in importing of wide datasets and multiple receiver Improved performance when importing wide datasets (and fixed bug when selecting multiple variables) Sep 8, 2023
@lloyddewit
Copy link
Contributor

@rdstern if you can retest/approve, then we can merge, thanks

Copy link
Collaborator

@rdstern rdstern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I now don't get any issues saving the data.
I am havin g some interesting problems getting the metadata sometimes.
a) I first tried with the extRemes data, with 2 sheets of 12,000 and it never finished.
b) On another occasion, with just one lot of the 12000 it gave me the following:

image

I continued and it then worked.

c) I then tried with versions that were saved as rds files. For the dhs data it was fine but with no message. It looked as below. For the extremes data it didn't give the message but then worked while I was writing this.

image

I am approving because the changes are so important. I will keep checking.

@Patowhiz
Copy link
Contributor Author

@rdstern when you say no the first time, the message prompt never gets shown to you again, until you restart R-Instat. That means an other imports that follow won't prompt you for column metadata viewing. Same applies to saying yes.
The yes or no prompt is not tied to a specific data frame. It's only tied to detection of the first data frame that has more than 1000 columns.
We could design the prompt to be tied to specific data frames if need be.

As explained in my previous comments, if you accept to view the column metadata, expect some performance issues. This is due to filling the Reogrid with many rows. The solution needed to avoid this is similar to what we currently do with the data viewer.

@rdstern
Copy link
Collaborator

rdstern commented Sep 11, 2023

@Patowhiz thanks. Now I understand. I tried again with the DHS wide data, which is for me now, an rds file. And it asked me in the right way.
Incidentally that is the dataset with one over 7000 variables, one more over 4000 and so on. It is actually the data set from DHS that they allow to be used freely for practice, and I hope sometime soon to tell them how nicely it could be analysed if only they considered R-Instat.

I also found it has a lot, really a lot, of empty variables. When I get rid of those, which I did, then it just has one data frame of over 2000 and two with more than 1000 variables. I said yes and it took only 7 seconds to get the metadata!

We have both approved now, so I hope this can be merged this week.

@lloyddewit lloyddewit merged commit 1dd4e8c into IDEMSInternational:master Sep 12, 2023
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants