Replies: 4 comments
-
This comment is to report on a larger dataset. With the 64-bit setup it seems to cope well with 12 million rows (1080 stations by 30 years) for most operations. Then:
d) Simple calculation on an unstacked variable: 5.3 seconds (now 8 seconds)
i) Checking for duplicates: 630 seconds (10.5 minutes, of which 10 minutes is the check and 30 seconds is the summary of the results)
I report on my "error", which was to use the Station variable. That was fine for the long dataset (12 million rows). But in the longer dataset (36 million rows) I effectively had 3 sets of data, and hence a Station2 variable. Station2 by Date is then fine, but I used Station and Date instead, so there are many duplicates. If this is a real sticking point in our use of very long datasets, then I return to wondering whether we can do something about it. At least in Define Climatic we only want to abort if there are any duplicates, i.e. we don't need to find where they are, or how many. Similarly (though it is more complicated), in the duplicates dialogue perhaps we could check quickly whether there are going to be a lot, and ask whether we really want to continue; a possible quick check is sketched below.
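As a minimal sketch (not R-Instat's actual code) of what such a quick check could look like in the underlying R: `anyDuplicated()` can stop as soon as it finds a duplicate, so a Define Climatic check could abort without locating or counting every duplicate row. The data frame and column names below are assumptions for illustration.

```r
# Sketch of a cheap "are there any duplicates?" test, assuming a long data
# frame `climate_df` with Station and Date columns (names are illustrative).
key_cols <- climate_df[c("Station", "Date")]

# anyDuplicated() returns the index of the first duplicated row, or 0 if none,
# so it can stop early rather than flagging every duplicate.
if (anyDuplicated(key_cols) > 0L) {
  stop("Duplicate Station/Date rows found; aborting.")
}

# If a rough count is wanted first (e.g. to warn before running the full
# duplicates dialogue), a single pass gives it:
n_dups <- sum(duplicated(key_cols))
```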
-
I'm not clear whether there's a task here. You said there's a problem as a factor: is that an issue of it being slow, or a problem with defining factors? I could move this to a discussion if it is more of a report on the test with large data.
-
I moved the task to issue #6305. I think that is a problem to correct. The rest is interesting (to me anyway!) and is for reporting. I mainly wanted it to test whether we could or should extend the visible data frame. I came to the conclusion that we should leave it where it is for now. But it also led to the discussion of investigating R-Instat limits that is addressed in 2 other issues; they followed the discussion with you on refreshing the grid, and also on investigating the loading of controls in dialogues when there could be an extreme number of variables, or an extreme number of levels of a factor.
-
Sounds good, that's clear. I'll move this to a "discussion" as well.
-
The data used is the satellite data from Germany, supplied as a NetCDF file. In our EUMETSAT work we extract 30 years of data for 4 stations. Here I read in the whole dataset: essentially 1080 "stations" with 30 years of data each, so almost 12 million records (11,834,640 to be precise).
It is useful to know how R-Instat copes with this size of data. I did the following and have generally been happy with the speed. I did run out of memory with a 32-bit installation, so I am using 64-bit. I have about 20 variables and rising (so over 200 million data points).
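A rough back-of-envelope calculation (an assumption-based sketch, not a measurement) suggests why the 32-bit installation struggled: R stores numeric columns as 8-byte doubles, so a single copy of the long data frame alone approaches 2 GB.

```r
# Back-of-envelope memory estimate for the long data frame (a sketch only;
# factor columns use 4-byte integers plus their levels, so this is approximate).
rows <- 11834640          # 1080 "stations" x 30 years of records
vars <- 20                # roughly 20 variables in the long data frame

bytes <- rows * vars * 8  # 8 bytes per double
round(bytes / 1024^3, 2)  # about 1.76 GiB for a single numeric copy
# A 32-bit R process has only roughly 2-4 GB of address space, and most
# calculations need temporary working copies, hence the move to 64-bit.
```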
a) Duplicated a variable (lon): about 1.4 seconds
b) Defined it as a factor: about 31 seconds (45 levels, and there were problems with it as a factor)
c) Rounded it to 2 decimals using the calculator: about 2.6 seconds
d) Converted the rounded data to a factor: about 11 seconds (45 levels, and rounding prevented the problems)
e) Duplicated another variable (lat, simpler): 1.2 seconds
f) With the grid dimension changed to 10,000 visible rows this became 2.5 seconds, and 27.3 seconds with 100,000 rows
g) Defined it as a factor: about 13.8 seconds, or 15.6 seconds with 10,000 rows visible, or 44.1 seconds with 100,000 rows
h) Rounded to 2 decimals again: about 2.5 seconds
i) Converted the rounded data to a factor: about 10 seconds (24 levels)
j) Combined into a station factor with 1080 levels: 1.3 seconds
k) This becomes 3.5 seconds with 10,000 rows visible and 32 seconds with 100,000
l) Checking on infilling in the Climatic menu: 7.75 seconds
m) Defining climatic: 150 seconds. This is only done once, but why so long? It ran out of memory on the 32-bit machine.
n) Checking for duplicates: 75 seconds
o) Generating 4 new variables for year, month, etc.: 19.5 seconds (becomes 22.5 seconds with a 10,000-row grid), so not that bad with 4 variables to write
p) Unstacking by Station: 12.8 seconds with the default grid visible. There are now 1081 variables!
q) This becomes 16.78 seconds with 100 variables visible. (Note this is a very severe test, because I am writing all 100 variables.)
r) It becomes 17.25 seconds with 300 variables visible, 22.1 with 500 and 26.7 with 1000
s) A simple calculation on an unstacked column took 22 seconds with 1000 variables visible. It took 4 seconds when I reduced the visible variables to the default of 30, then 5.2 seconds with 100 visible and 7.8 with 300. It was reduced a bit further with even less visible (250 rows by 15 variables).
t) The speed does not seem to be affected by the visibility of the grid. (A rough base-R sketch of some of these steps is given below.)
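For orientation, here is a rough base-R sketch of what several of the steps above amount to. This is illustrative only, not the code R-Instat generates; the data frame `climate_df` and the column names lon, lat, rain and date are assumptions.

```r
# a), b): copy a variable (lon) and define the copy as a factor
climate_df$lon2 <- factor(climate_df$lon)               # ~45 levels, the slow step

# c), d): round to 2 decimals first, then convert, giving fewer spurious levels
climate_df$lon2 <- factor(round(climate_df$lon, 2))
climate_df$lat2 <- factor(round(climate_df$lat, 2))     # e)-i): same for lat

# j): combine the two factors into a station factor (45 x 24 = 1080 levels)
climate_df$station <- interaction(climate_df$lat2, climate_df$lon2, drop = TRUE)

# n): check for duplicate station/date rows
sum(duplicated(climate_df[c("station", "date")]))

# o): generate year and month variables (assuming `date` is a Date column)
climate_df$year  <- as.integer(format(climate_df$date, "%Y"))
climate_df$month <- as.integer(format(climate_df$date, "%m"))

# p): unstack rainfall by station, giving one column per station (1081 columns)
wide <- reshape(climate_df[c("station", "date", "rain")],
                idvar = "date", timevar = "station", direction = "wide")

# s): a simple calculation on one unstacked column
wide$first_station_doubled <- wide[[2]] * 2
```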
Note that I now had the data in R-Instat twice, namely the unstacked data frame and the original with more variables (11 million rows by 30 variables now). It all still seems OK.
So far this is highly satisfactory, except on the number of visible rows: there is a time penalty of perhaps 2 seconds for each variable written when the visible rows are increased to 10,000, and it is much slower with 100,000 (see above). The degradation is much less when the number of visible variables is increased, even to 100, but it is there. Note that I was testing items that write to the data frame, and we are not always doing that.
I can see good reasons for increasing the number of variables that are visible. It isn't a big deal, and I have done it; it could be set pretty high when we want, and when that is the case, then why not. But as a norm I am inclined to stay with what we have. I suggest we need to be more cautious with the number of rows: where 2,000 or 3,000 rows would make a difference to visibility, then why not, possibly again only temporarily.