Replies: 4 comments
-
This comment is to report on a larger dataset. With the 64-bit setup it seems to cope well with 12 million rows (1080 stations by 30 years) for most operations. Then:
d) Simple calculation on an unstacked variable: 5.3 seconds (now 8 seconds)
i) Checking for duplicates: 630 seconds (10.5 minutes, of which 10 minutes is the check and 30 seconds is the summary of the results)
I report on my "error", which was to use the Station variable. That was fine for the long dataset (12 million rows). But in the longer dataset (36 million rows) I effectively had 3 sets of data, and hence a Station2 variable. Station2 by Date is then fine, but I used Station and Date instead, so there are many duplicates. If this is a real sticking point in our use of very long datasets, then I return to wondering whether we can do something about it. At least in Define Climatic we only want to abort if there are any duplicates, i.e. we don't need to find where they are, or how many. Similarly (though it is more complicated), in the duplicates dialogue perhaps we could check quickly whether there are going to be a lot, and ask whether we really want to continue; a possible quick check is sketched below.
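As a minimal sketch (not R-Instat's actual code) of what such a quick check could look like in the underlying R: `anyDuplicated()` can stop as soon as it finds a duplicate, so a Define Climatic check could abort without locating or counting every duplicate row. The data frame and column names below are assumptions for illustration.

```r
# Sketch of a cheap "are there any duplicates?" test, assuming a long data
# frame `climate_df` with Station and Date columns (names are illustrative).
key_cols <- climate_df[c("Station", "Date")]

# anyDuplicated() returns the index of the first duplicated row, or 0 if none,
# so it can stop early rather than flagging every duplicate.
if (anyDuplicated(key_cols) > 0L) {
  stop("Duplicate Station/Date rows found; aborting.")
}

# If a rough count is wanted first (e.g. to warn before running the full
# duplicates dialogue), a single pass gives it:
n_dups <- sum(duplicated(key_cols))
```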
-
I'm not clear whether there's a task here. You said there's a problem as a factor: is that an issue of it being slow, or a problem with defining factors? I could move this to a discussion if it is more of a report on the test with large data.
-
I moved the task to issue #6305. I think that is a problem to correct. The rest is interesting (to me anyway!) and is for reporting. I mainly wanted it to test whether we could or should extend the visible data frame. I came to the conclusion that we should leave it where it is for now. But it also led to the discussion of investigating R-Instat limits that is addressed in 2 other issues; they followed the discussion with you on refreshing the grid, and also on investigating the loading of controls in dialogues when there could be an extreme number of variables, or an extreme number of levels of a factor.
-
Sounds good, that's clear. I'll move this to a "discussion" as well.
-
The data used is the satellite data from Germany, supplied as a NetCDF file. In our EUMETSAT work we extract 30 years of data for 4 stations. Here I read in the whole dataset: essentially 1080 "stations" with 30 years of data each, so almost 12 million records (11,834,640 to be precise).
It is useful to know how R-Instat copes with this size of data. I did the following and have generally been happy with the speed. I did run out of memory with a 32-bit installation, so I am using 64-bit. I have about 20 variables and rising (so over 200 million data points).
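A rough back-of-envelope calculation (an assumption-based sketch, not a measurement) suggests why the 32-bit installation struggled: R stores numeric columns as 8-byte doubles, so a single copy of the long data frame alone approaches 2 GB.

```r
# Back-of-envelope memory estimate for the long data frame (a sketch only;
# factor columns use 4-byte integers plus their levels, so this is approximate).
rows <- 11834640          # 1080 "stations" x 30 years of records
vars <- 20                # roughly 20 variables in the long data frame

bytes <- rows * vars * 8  # 8 bytes per double
round(bytes / 1024^3, 2)  # about 1.76 GiB for a single numeric copy
# A 32-bit R process has only roughly 2-4 GB of address space, and most
# calculations need temporary working copies, hence the move to 64-bit.
```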
a) Duplicated a variable (lon): about 1.4 seconds
b) Defined it as a factor: about 31 seconds (45 levels, and there were problems with it as a factor)
c) Rounded it to 2 decimals using the calculator: about 2.6 seconds
d) Converted the rounded data to a factor: about 11 seconds (45 levels, and rounding prevented the problems)
e) Duplicated another variable (lat, simpler): 1.2 seconds
f) With the grid dimension changed to 10,000 visible rows this became 2.5 seconds, and 27.3 seconds with 100,000 rows
g) Defined it as a factor: about 13.8 seconds, or 15.6 seconds with 10,000 rows visible, or 44.1 seconds with 100,000 rows
h) Rounded to 2 decimals again: about 2.5 seconds
i) Converted the rounded data to a factor: about 10 seconds (24 levels)
j) Combined into a station factor with 1080 levels: 1.3 seconds
k) This becomes 3.5 seconds with 10,000 rows visible and 32 seconds with 100,000
l) Checking on infilling in the Climatic menu: 7.75 seconds
m) Defining climatic: 150 seconds. This is only done once, but why so long? It ran out of memory on the 32-bit machine.
n) Checking for duplicates: 75 seconds
o) Generating 4 new variables for year, month, etc.: 19.5 seconds (becomes 22.5 seconds with a 10,000-row grid), so not that bad with 4 variables to write
p) Unstacking by Station: 12.8 seconds with the default grid visible. There are now 1081 variables!
q) This becomes 16.78 seconds with 100 variables visible. (Note this is a very severe test, because I am writing all 100 variables.)
r) It becomes 17.25 seconds with 300 variables visible, 22.1 with 500 and 26.7 with 1000
s) A simple calculation on an unstacked column took 22 seconds with 1000 variables visible. It took 4 seconds when I reduced the visible variables to the default of 30, then 5.2 seconds with 100 visible and 7.8 with 300. It was reduced a bit further with even less visible (250 rows by 15 variables).
t) The speed does not seem to be affected by the visibility of the grid. (A rough base-R sketch of some of these steps is given below.)
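For orientation, here is a rough base-R sketch of what several of the steps above amount to. This is illustrative only, not the code R-Instat generates; the data frame `climate_df` and the column names lon, lat, rain and date are assumptions.

```r
# a), b): copy a variable (lon) and define the copy as a factor
climate_df$lon2 <- factor(climate_df$lon)               # ~45 levels, the slow step

# c), d): round to 2 decimals first, then convert, giving fewer spurious levels
climate_df$lon2 <- factor(round(climate_df$lon, 2))
climate_df$lat2 <- factor(round(climate_df$lat, 2))     # e)-i): same for lat

# j): combine the two factors into a station factor (45 x 24 = 1080 levels)
climate_df$station <- interaction(climate_df$lat2, climate_df$lon2, drop = TRUE)

# n): check for duplicate station/date rows
sum(duplicated(climate_df[c("station", "date")]))

# o): generate year and month variables (assuming `date` is a Date column)
climate_df$year  <- as.integer(format(climate_df$date, "%Y"))
climate_df$month <- as.integer(format(climate_df$date, "%m"))

# p): unstack rainfall by station, giving one column per station (1081 columns)
wide <- reshape(climate_df[c("station", "date", "rain")],
                idvar = "date", timevar = "station", direction = "wide")

# s): a simple calculation on one unstacked column
wide$first_station_doubled <- wide[[2]] * 2
```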
Note that I now had the data in R-Instat twice, namely the unstacked data frame and the original with more variables (11 million rows by 30 variables now). It all still seems OK.
So far this is highly satisfactory, except on the number of visible rows: there is a time penalty of perhaps 2 seconds for each variable written when the visible rows are increased to 10,000, and it is much slower with 100,000 (see above). The degradation is much less when the number of visible variables is increased, even to 100, but it is there. Note that I was testing items that write to the data frame, and we are not always doing that.
I can see good reasons for increasing the number of variables that are visible. It isn't a big deal, and I have done it; it could be set pretty high when we want, and when that is the case, then why not. But as a norm I am inclined to stay with what we have. I suggest we need to be more cautious with the number of rows: where 2,000 or 3,000 rows would make a difference to visibility, then why not, possibly again only temporarily.