test data.table set to use only 50% of cpus #202
Output of the https://github.com/h2oai/db-benchmark/blob/63fb0fca9d7f4d9806c01e418000e42bb31778d7/_utils/compare-data.table.R script, comparing the last two runs of data.table (40 threads vs 20 threads).
source("_utils/time.R")
if (system("tail -1 time.csv | cut -d',' -f2", intern=TRUE)!="1621364165")
stop("time.csv and logs.csv should be as of 1621364165 batch run, filter out newer rows in those files")
## groupby ----
d = tail.time("data.table", "groupby", i=c(1L, 2L))
setnames(d, c("20210517_2f2f62d","20210518_2f2f62d"), c("th_40","th_20"))
if (nrow(d[(is.na(th_40) & !is.na(th_20)) | (!is.na(th_40) & is.na(th_20))])) {
stop("number of threads had an impact on completion of queries")
} else {
d = d[!is.na(th_40)]
}
d[, th_40_20:=th_40/th_20]
## improvement
d[, .(mean=mean(th_40_20), median=median(th_40_20)), .(in_rows)]
# in_rows mean median
#1: 1e7 1.0242721 0.9609988
#2: 1e8 0.9378870 0.9455267
#3: 1e9 0.9506561 0.9569359
d[, .(mean=mean(th_40_20), median=median(th_40_20)), .(knasorted)]
# knasorted mean median
#1: 1e2 cardinality factor, 0% NAs, unsorted data 1.0393667 0.9538973
#2: 1e1 cardinality factor, 0% NAs, unsorted data 0.9521915 0.9544223
#3: 2e0 cardinality factor, 0% NAs, unsorted data 0.9604950 0.9569359
#4: 1e2 cardinality factor, 0% NAs, pre-sorted data 0.9371154 0.9487804
#5: 1e2 cardinality factor, 5% NAs, unsorted data 0.9678192 0.9598999
d[, .(mean=mean(th_40_20), median=median(th_40_20)), .(question_group)]
# question_group mean median
#1: basic 0.9548596 0.9301310
#2: advanced 0.9897345 0.9806791
## worst case by data
d[, .(mean=mean(th_40_20), median=median(th_40_20)), .(in_rows, knasorted)][which.max(mean)]
# in_rows knasorted mean median
#1: 1e7 1e2 cardinality factor, 0% NAs, unsorted data 1.239259 0.9620776
## best case by data
d[, .(mean=mean(th_40_20), median=median(th_40_20)), .(in_rows, knasorted)][which.min(mean)]
# in_rows knasorted mean median
#1: 1e8 1e2 cardinality factor, 0% NAs, unsorted data 0.9235102 0.9200373
## worst case for single question
d[which.max(th_40_20)]
# in_rows knasorted question_group question th_40 th_20 th_40_20
#1: 1e7 1e2 cardinality factor, 0% NAs, unsorted data basic sum v1 by id1:id2 0.413 0.118 3.5
## best case for single question
d[which.min(th_40_20)]
# in_rows knasorted question_group question th_40 th_20 th_40_20
#1: 1e9 1e2 cardinality factor, 5% NAs, unsorted data basic sum v1 mean v3 by id3 15.22 21.104 0.7211903
## join ----
d = tail.time("data.table", "join", i=c(1L, 2L))
setnames(d, c("20210517_2f2f62d","20210518_2f2f62d"), c("th_40","th_20"))
if (nrow(d[(is.na(th_40) & !is.na(th_20)) | (!is.na(th_40) & is.na(th_20))])) {
stop("number of threads had an impact on completion of queries")
} else {
d = d[!is.na(th_40)]
}
d[, th_40_20:=th_40/th_20]
## improvement
d[, .(mean=mean(th_40_20), median=median(th_40_20)), .(in_rows)]
# in_rows mean median
#1: 1e7 1.0149302 1.0000000
#2: 1e8 0.9143243 0.9008573
d[, .(mean=mean(th_40_20), median=median(th_40_20)), .(knasorted)]
# knasorted mean median
#1: 0% NAs, unsorted data 0.9385902 0.9144130
#2: 5% NAs, unsorted data 0.9612286 0.9294773
#3: 0% NAs, pre-sorted data 0.9940629 0.9705720
## worst case by data
d[, .(mean=mean(th_40_20), median=median(th_40_20)), .(in_rows, knasorted)][which.max(mean)]
# in_rows knasorted mean median
#1: 1e7 0% NAs, pre-sorted data 1.055906 1.05
## best case by data
d[, .(mean=mean(th_40_20), median=median(th_40_20)), .(in_rows, knasorted)][which.min(mean)]
# in_rows knasorted mean median
#1: 1e8 0% NAs, unsorted data 0.8983325 0.8773762
## worst case for single question
d[which.max(th_40_20)]
# in_rows knasorted question th_40 th_20 th_40_20
#1: 1e7 5% NAs, unsorted data medium inner on factor 0.513 0.443 1.158014
## best case for single question
d[which.min(th_40_20)]
# in_rows knasorted question th_40 th_20 th_40_20
#1: 1e8 0% NAs, unsorted data medium outer on int 8.143 9.558 0.8519565
groupby
We can see that on the smallest data size (0.5GB), using 100% of CPUs instead of 50% can actually degrade performance. It seems that only the smallest size is really sensitive to the number of threads. The 3.5x slow down on a single question may look serious, but we have to remember that this is not an average; its total timing is well under a second, so it is highly sensitive to any noise, even small background processes running on the machine. The next worst case for a single question was only a 26% slow down, therefore we can treat the 3.5x case as an outlier.
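To confirm that the 3.5x case is an isolated outlier rather than a pattern, one can list the few largest ratios together with their absolute timings. A minimal sketch, reusing the groupby d table built above:
## top groupby slow-downs, with absolute timings for context
d[order(-th_40_20)][1:3, .(in_rows, question, th_40, th_20, th_40_20)]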
join
We can see that on the smallest data size (0.5GB), using 100% of CPUs instead of 50% has only a marginal impact on performance. On 5GB data the speed up is clearly visible. Join is less sensitive to the number of threads, but it is also less parallelized as of now.
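To see where the extra threads actually pay off in the join benchmark, a quick sketch (assuming the join d built above, with in_rows stored as the character labels shown in the output) averages the ratio per question on the largest data size:
## per-question thread ratio on the 5GB (1e8 rows) join data; values below 1 mean 40 threads was faster
d[in_rows == "1e8", .(mean_ratio = mean(th_40_20)), by = question][order(mean_ratio)]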
The overall conclusion seems to be that the difference between 50% (the default) and 100% is not that significant; users can safely stay on the default and they will not lose much. If users are after performance and working with larger data, then setting more threads is worthwhile. This of course assumes the process does not run in a shared environment. In a shared environment, or on a desktop computer where the user performs other activities, I would advise keeping the default 50%. fyi @mattdowle
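For reference, a minimal sketch of how a user can switch between the two settings from R; it assumes a recent data.table where setDTthreads() accepts a percent argument:
library(data.table)
getDTthreads(verbose = TRUE)   # default uses 50% of logical CPUs
setDTthreads(percent = 100)    # opt in to all logical CPUs on a dedicated machine
## ... run the heavy groupby / join ...
setDTthreads(percent = 50)     # restore the default, e.g. on a shared machine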
And this is what I had assumed :). Thank you for running these tests!
Analyze the impact of using 50% of CPUs for data.table (the default) vs. the current setting of 100%.