Load and manipulate wide files in R-Instat more efficiently? #7161
Comments
Tagging @Patowhiz instead of assigning. |
@rdstern @dannyparsons I imported the same data using R-Instat functions in RStudio, and below is a benchmark comparison of the two pieces of software on my machine. Import command;
Benchmark time;
As you can see from the above, it only took an average of around 2 minutes more for the same functions to import into the R-Instat R object. From the above, it was clear to me that some of the optimisations had to be done at the R function level, especially in the … Getting the .NET data frame as a symbolic expression had no significant time difference; it took an average of 0.002 secs. I went further and looked at the command that gets the variables metadata. Get variables command; It took an average of 7.188 secs in RStudio and 7.489 secs in R-Instat, which means it's important that the R command be executed only when necessary. The 7.489 secs could easily become minutes if the command is executed several times in R-Instat. |
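The actual import and benchmark commands are elided above, so purely as an illustration of how such a comparison can be timed in plain R (haven::read_sav and the file name are stand-ins, not necessarily what was used in the comparison):

```r
# Illustrative only: time an SPSS import and a metadata query separately,
# using haven::read_sav as a stand-in for whatever import function was used.
library(haven)

import_time <- system.time({
  survey_data <- read_sav("ZZHR62FL.SAV")   # hypothetical file name, from the attached zip
})

metadata_time <- system.time({
  # a simple stand-in for a "get variables metadata" step: one summary row per column
  meta <- data.frame(name = names(survey_data),
                     type = vapply(survey_data, function(x) class(x)[1], character(1)))
})

import_time["elapsed"]
metadata_time["elapsed"]
```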
@Patowhiz I wonder how much further we can go with this? |
@rdstern thanks for asking. Yes, it is easy in R, as I stated in my comment above. Our biggest drawback is in the … Once the data is loaded, from what I have noticed currently, getting any data to and from R and R-Instat is pretty fast if the rows limit is 1,000 (remember even the paste functionality?). Once we load the data into R-Instat, with dynamic loading we can handle any length or size of data that the machine can handle. And this could be done in two ways;
|
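The two options themselves are elided above. As a hedged sketch of what the row-batching side of dynamic loading can look like on the R side (the function name is hypothetical, not existing R-Instat code):

```r
# Hypothetical helper: return one "page" of rows so the front end only ever
# asks R for at most page_size rows at a time.
get_data_page <- function(df, page = 1, page_size = 1000) {
  start_row <- (page - 1) * page_size + 1
  if (start_row > nrow(df)) return(df[0, , drop = FALSE])  # past the end: empty page
  end_row <- min(page * page_size, nrow(df))
  df[start_row:end_row, , drop = FALSE]
}

# Example: a 3-million-row data frame is never sent across in full.
big <- data.frame(id = seq_len(3e6), value = rnorm(3e6))
dim(get_data_page(big, page = 2))   # 1000 rows, 2 columns
```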
@Patowhiz sounds good. I am happy with your 2 above. I am not especially worried about memory, particularly as I think memory use by metadata will be dwarfed by the data itself or (perhaps soon to come) by undo's use of memory. And I really don't see memory limitations as a common problem for R-Instat users. I am really happy that we can already manage 3 million rows by 50 columns almost (but not quite) on a 32-bit implementation, and trivially on 64-bit. And I feel we will rarely need more than 1 million. The wide data sets are partly there because that is an obvious test to apply, and I would like to pass! But the widest I have seen is 7000 variables, and even that is pretty ridiculous for statistical analyses. But there is a bit of a principle here. The biggest argument against a GUI is the way it limits you, compared to learning the language. That will always be true, but I am very keen to be able to show we have worked to minimise the limitations. |
@Patowhiz excellent. Exciting too! |
@lilyclements you can have a look at my #7161 (comment) above. It would be very useful to get your general response regarding the R functions responsible for import. |
Looking into this, it seems the slowness is in the … I created some dummy data, and looking at the timings, I suggest we should consider using lists, since the process would not increase exponentially as the data size increases. For example,
So, I suggest we consider using a list, since they do not take exponentially longer as the size increases. However, I'm not sure how deep this problem runs, and so I would want to confirm with @dannyparsons before making any changes. |
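The dummy data and timings are not shown above, but a minimal sketch of the two growth patterns being compared could look like this (illustrative code only, not the actual R-Instat import functions):

```r
# Growing a data frame column-by-column copies it on every step, so the total
# work grows much faster than linearly; building a list and combining once does not.
n_rows <- 100
n_cols <- 2000

grow_data_frame <- function() {
  df <- data.frame(row.names = seq_len(n_rows))
  for (i in seq_len(n_cols)) {
    df <- cbind(df, rnorm(n_rows))           # copies the whole data frame each time
  }
  names(df) <- paste0("var", seq_len(n_cols))
  df
}

build_from_list <- function() {
  cols <- lapply(seq_len(n_cols), function(i) rnorm(n_rows))
  names(cols) <- paste0("var", seq_len(n_cols))
  as.data.frame(cols)                         # one combine at the end
}

system.time(grow_data_frame())
system.time(build_from_list())
```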
@N-thony I think this can now be added as a blocker and @lilyclements can now be tagged on it. Thanks |
@Patowhiz or @N-thony or @lilyclements I wonder where we are now with the wide data files? I tried another simple example as follows - this isn't reading, it is manipulating. The data are 10 numeric variables and 10,000 rows. (It is easy to make, and I seem to remember that you and @ChrisMarsh82 were of the view that we don't need to set a particular limit on the number of variables in R-Instat?) Then I use Prepare > Data Reshape > Transpose. This makes it into 10,000 variables and 10 rows. If that works well, then we can do the same with 100,000 rows, and hence 100,000 columns when we transpose. You guys said no limit! It takes quite a long time with the 10,000. Now that may be a one-off, and once we have produced it, then all is fine again? I ask now because (as you know) the same sort of problem with many levels of a factor seems ok now. I would like to write about this limit as well - in the help. So is this aspect: |
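For reference, the shape change described in the comment above can be reproduced in a couple of lines of plain R (a sketch of what the transpose produces, not the dialog's own code):

```r
# 10 numeric variables by 10,000 rows, transposed to 10,000 variables by 10 rows.
df_long <- as.data.frame(matrix(rnorm(10000 * 10), nrow = 10000,
                                dimnames = list(NULL, paste0("x", 1:10))))
df_wide <- as.data.frame(t(df_long))
dim(df_long)   # 10000    10
dim(df_wide)   #    10 10000
```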
@volloholic @lilyclements and I found the function that took time to execute for wide data sets to be the one below.
The function is called … Replacing line … We opted to use this solution for now to fix this issue. |
@Patowhiz have you got a PR with the changes, or do you want me to do this? Another side note - I spotted a small typo when we went through this. When we implement this, can we also change all "arguements" to say "arguments" (it happens a few times in the file) |
As a side note, we should check these changes fix the following PRs:
|
@lilyclements and @Patowhiz I hope you can make some progress during the June sprint. I suggest that once the basics are reasonably efficient, then we may later be able to "get clever" with selects to get really cool. I suggest there are 3 steps initially that you need to resolve (and one to check): a) Can you read in these data - with 12,000 variables |
@Patowhiz I thought we found the solution (at least partially) to this when you were in the UK?
Can this be implemented now? |
@lilyclements @rdstern I'm well aware of this, thanks. To optimise at the .NET level, we need to make the column metadata window 'behave' like the data viewer. This will make the column metadata window load the columns in batches of 1,000. Regarding the selectors and receivers, we need to refactor them in a way that they don't have to repeatedly load and remove variables from each other, and also enhance them to work well with more than 1,000 variables. A quick fix on the column metadata issue would be to only load the variables when the window is visible (this will effectively improve the importing experience in terms of performance), but there will be a noticeable delay when the user opens the window later. Refactoring the selector and receiver needs to be done carefully to reduce the risk of regression in the whole software. I did look into it and couldn't see an obvious quick fix. I intend to fix this in a way that complements the search feature in the selector as well. |
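To make the batching idea concrete on the R side, a hypothetical helper (the function name and metadata fields here are illustrative, not existing R-Instat code) could return metadata for one range of columns at a time:

```r
# Hypothetical sketch: fetch metadata for one batch of columns so the column
# metadata window can load 1,000 columns at a time instead of all at once.
get_column_metadata_batch <- function(df, start_col = 1, batch_size = 1000) {
  end_col <- min(start_col + batch_size - 1, ncol(df))
  cols <- seq(start_col, end_col)
  data.frame(
    name      = names(df)[cols],
    type      = vapply(df[cols], function(x) class(x)[1], character(1)),
    is_factor = vapply(df[cols], is.factor, logical(1)),
    stringsAsFactors = FALSE
  )
}

# Example: metadata for columns 1,001 to 2,000 of a wide data frame.
wide <- as.data.frame(matrix(0, nrow = 5, ncol = 7000))
nrow(get_column_metadata_batch(wide, start_col = 1001))   # 1000
```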
@Patowhiz I could easily live with a quick fix on the column metadata, so we make some quick progress on the other 3 problems. From the simple example, namely … Then there is work on the 3 problems listed by @lilyclements above, namely a) your reading in of the data initially … I don't expect this work to finish during the sprint, but could it get to the stage where it can be done by the next update, or even handed on to someone else in the team, supervised by Lily or Patrick? So, could there be a plan by the end of the sprint? |
@Patowhiz and @lilyclements currently the widest "ordinary" files I have seen are just over 7000 variables. I define "ordinary" as a data frame where the different variables are of different "types", so some may be character, others factor and many numeric. These are annoying, but also very rare. So, it would be great to be able to cope with up to (say) 8000 variables reasonably; but even if those took a long time, then as long as anything up to 2000 variables - perhaps ideally 4000 - was ok, provided you were patient, I would be happy with that. These are wide data frames where we would also like to be able to look at the column metadata. Then there are "others", and a climatic example of an "other" is the first example in the extRemes package, which is mentioned above and is in the library. This has over 12,000 variables. However, they are all the same (all numeric) except the names, and we don't need access to the metadata to see and change the names. So we don't need the column metadata. If we want to cope with a slightly more general situation then we could show (say) the first 5 or 10 columns in the metadata. Ideally, the last row in the column metadata would then apply to all the remaining variables. This will eventually be a "select", but I don't want to get ambitious too early, because that will just delay doing anything. |
ZZHR62FL.zip
This is an SPSS file that reads fine into R-Instat. It is about 3 MB zipped, but is 50 MB as an SPSS .sav file.
It has about 6000 records (rows) and over 7000 variables (columns).
When reading it into R-Instat there is initially the usual screen saying that it is taking some time. Then that stops, as I assume it is getting to the last step, which is to load the data into the R-Instat data book. However, this last step takes a long time and it looks as though R-Instat has frozen.
I gather from @dannyparsons that this step can be done much more efficiently and that this should be quite easy to implement. Possibly we should also include the waiting screen during this last step. I am used to it taking (say) 10-20 seconds to complete, and that is fine. I have not timed it, but I suggest it is at least 10 minutes currently.
We might also wish to consider whether there is a maximum that we allow. Perhaps this can - like other limits and defaults - then be changed in the Tools > Options dialogue. There isn't a problem in R with having (say) 1 million variables, perhaps by transposing a data frame with a million rows. There is a problem loading these into a dialogue, and perhaps also into a grid? The maximum in ReoGrid is 32,768 columns. This may not apply, as we still only show a small subset, but that might also be a sensible limit for us in R-Instat. ReoGrid also has a limit of 1 million rows, and that certainly doesn't apply to us. (I have tried with our new grid system and 3 million rows.)
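If a configurable maximum were added, a minimal sketch of the check on the R side might look like the following (the option name and message are made up; 32,768 is the ReoGrid column maximum mentioned above):

```r
# Hypothetical sketch of a configurable column limit.
max_columns_option <- 32768L

check_column_limit <- function(df, limit = max_columns_option) {
  if (ncol(df) > limit) {
    warning(sprintf("Data has %d columns; only the first %d will be shown in the grid.",
                    ncol(df), as.integer(limit)))
  }
  invisible(ncol(df) <= limit)
}

# Example: a very wide (transposed) data frame would trip the warning.
check_column_limit(as.data.frame(matrix(0, nrow = 1, ncol = 40000)))
```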