Benchmark on bigger data #184

Closed
AdrianAntico opened this issue Aug 27, 2021 · 6 comments

@AdrianAntico

This is a pretty cool package and I'm starting to test some of the functions available. I already updated a function inside my R package because of it.

I saw you mention that, as data grows, data.table by itself may well become faster. I just reran one of the benchmarks on a bigger data set and pasted the code and results below. Are your C/C++ operations parallelized? If so, any idea why the benchmark results would reverse for bigger data sets?

library(data.table)
library(collapse)
library(magrittr)


# Build bigger data set
DT <- qDT(wlddev)
for (i in 1:6) {
  DT <- data.table::rbindlist(list(DT, DT))
} 

# Row count
DT[, .N]
# 53968896 rows


microbenchmark::microbenchmark(
  collapse = DT %>% gby(country) %>% get_vars(9:13) %>% fmean,
  data.table = DT[, lapply(.SD, mean, na.rm = TRUE), keyby = country, .SDcols = 9:13],
  data.table_base = DT[, lapply(.SD, base::mean, na.rm = TRUE), keyby = country, .SDcols = 9:13],
  hybrid_bad = DT[, lapply(.SD, fmean), keyby = country, .SDcols = 9:13],
  hybrid_ok = DT[, fmean(gby(.SD, country)), .SDcols = c(1L, 9:13)])

Unit: milliseconds
            expr       min        lq      mean    median        uq       max neval
        collapse 1199.5509 1204.3485 1231.1221 1207.0481 1211.7302 1645.8245   100
      data.table  513.9339  543.9380  603.1337  585.2910  636.5891  942.6035   100
 data.table_base 2439.0390 2545.0094 2611.9677 2589.7708 2654.7859 3027.9439   100
      hybrid_bad  846.0849  856.0343  875.6276  859.4227  873.3109 1125.8064   100
       hybrid_ok 1202.0246 1205.8919 1236.3116 1208.3820 1221.5040 1418.8378   100
@SebKrantz
Owner

Hello, thanks for the benchmarks. The package is completely serial at the C/C++ level, with no parallelism yet, and that explains the benchmarks. data.table offers sub-column-level parallelism, which is great for huge data but possibly slow on data of moderate size (1 mio obs). I have planned to parallelize collapse, hopefully with an initial release by the end of the year, but there will be a release before that which adds a few things I just find essential (like multiple assignment, reference operations on vectors and matrices, a faster table function, etc.). Parallelization will be an ambitious project, as I intend to transition fluently between serial code, column-level parallelism and sub-column-level parallelism depending on the data size, in a thread-safe manner. The aim at the end is to have a package that satisfies big-data people but preserves the current speed of statistical computations on small and medium-sized data. Initial benchmarks with parallel versions of the functions show that a parallel collapse would be very competitive with data.table for big data.
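
A quick way to probe this explanation would be to pin data.table to a single thread with setDTthreads() and rerun the grouped mean from the benchmark above; if the gap closes, sub-column multithreading is indeed what reverses the ranking on the larger data. A minimal sketch (reusing the DT object built above):

library(data.table)
library(collapse)
library(magrittr)

# Pin data.table to one thread for a like-for-like serial comparison;
# setDTthreads() invisibly returns the previous setting so it can be restored.
old_threads <- setDTthreads(1L)

microbenchmark::microbenchmark(
  times = 10L,
  collapse   = DT %>% gby(country) %>% get_vars(9:13) %>% fmean,
  data.table = DT[, lapply(.SD, mean, na.rm = TRUE), keyby = country, .SDcols = 9:13]
)

# Restore the previous thread setting
setDTthreads(old_threads)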

@AdrianAntico
Author

Thanks for the update. Sounds like you have a lot on your plate. If you're open to suggestions, I made one to the data.table and datatable guys yesterday that you might find interesting.

Rdatatable/data.table#2778 (comment)

@SebKrantz
Owner

Thanks, but I don't think I'll do rolling functions, also because the roll package is an absolute blast (way faster than data.table).

@AdrianAntico
Author

I'm not sure about your comment. I'm seeing the opposite performance-wise: for single-variable / single-period column creation, data.table comes out 12.7% faster.

For multiple columns and multiple periods, the effect compounds. In the example below, data.table shows roughly a 23x run-time advantage.

# Create fake Panel Data----
Count <- 1L
for(Level in LETTERS) {
  datatemp <- RemixAutoML::FakeDataGenerator(
    Correlation = 0.75,
    N = 25000L,
    ID = 0L,
    ZIP = 0L,
    FactorCount = 0L,
    AddDate = TRUE,
    Classification = FALSE,
    MultiClass = FALSE)
  datatemp[, Factor1 := eval(Level)]
  if(Count == 1L) {
    data <- data.table::copy(datatemp)
  } else {
    data <- data.table::rbindlist(
      list(data, data.table::copy(datatemp)))
  }
  Count <- Count + 1L
}

# Copy data
data1 <- data.table::copy(data)

# Benchmark
microbenchmark::microbenchmark(
  times = 1,
  data[, temp := roll::roll_mean(x = Adrian, width = 5), by = "Factor1"],
  data1[, temp := data.table::frollmean(x = Adrian, n = 5), by = "Factor1"]
)

# Unit: milliseconds
# expr    min     lq   mean median     uq    max neval
# data[, `:=`(temp, roll::roll_mean(x = Adrian, width = 5)),    by = "Factor1"] 9.1589 9.1589 9.1589 9.1589 9.1589 9.1589     1
# data1[, `:=`(temp, data.table::frollmean(x = Adrian, n = 5)), by = "Factor1"] 7.9950 7.9950 7.9950 7.9950 7.9950 7.9950     1


# Create fake Panel Data----
Count <- 1L
for(Level in LETTERS) {
  datatemp <- RemixAutoML::FakeDataGenerator(
    Correlation = 0.75,
    N = 25000L,
    ID = 0L,
    ZIP = 0L,
    FactorCount = 0L,
    AddDate = TRUE,
    Classification = FALSE,
    MultiClass = FALSE)
  datatemp[, Factor1 := eval(Level)]
  if(Count == 1L) {
    data <- data.table::copy(datatemp)
  } else {
    data <- data.table::rbindlist(
      list(data, data.table::copy(datatemp)))
  }
  Count <- Count + 1L
}

# Copy data
data1 <- data.table::copy(data)

# Define cols and periods
periods <- 1:10
cols <- names(data)[1:5]
MA_Names <- c()
for(t in cols) for(p in periods) MA_Names <- c(MA_Names, paste0(t, p))

# roll function
roll_test <- function() {
  for(t in cols) {
    for(p in periods) {
      data[, paste0(t, p) := roll::roll_mean(get(t), width = p), by = "Factor1"]
    }
  }
}

# data.table function
dt_test <- function() {
  data1[, (MA_Names) := data.table::frollmean(x = .SD, n = c(periods)), .SDcols = c(cols), by = "Factor1"]
}

# Benchmark
microbenchmark::microbenchmark(
  times = 1,
  roll_test(),
  dt_test()
)

# Unit: milliseconds
#        expr       min        lq      mean    median        uq       max neval
# roll_test() 2330.8222 2330.8222 2330.8222 2330.8222 2330.8222 2330.8222     1
# dt_test()    100.1643  100.1643  100.1643  100.1643  100.1643  100.1643     1
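
One thing worth sanity-checking in dt_test(): frollmean() returns one result per column x window combination, and the assignment to MA_Names only lines up if the returned list is ordered the same way the names were built (columns in the outer loop, periods in the inner loop). A toy check along these lines (made-up columns a and b) makes the layout explicit:

library(data.table)

# Tiny table just to inspect the output layout of frollmean()
# with multiple columns and multiple window sizes
toy <- data.table(a = as.numeric(1:6), b = as.numeric(6:1))
res <- frollmean(toy, n = c(2, 3))
length(res)  # 4 = 2 columns x 2 windows

# Name the results the same way MA_Names is built above
# (outer loop over columns, inner loop over periods) and inspect
nm <- c()
for (t in c("a", "b")) for (p in c(2, 3)) nm <- c(nm, paste0(t, p))
names(res) <- nm
res  # check that e.g. "a3" really is the 3-period rolling mean of column a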

@SebKrantz
Owner

You are right, they are actually quite comparable. The roll version appears to be slightly faster when operating column-wise on a matrix than the data.table version applied over the columns of a data.table. I guess my impression was just based on a feeling I had when working with roll and matrices. In any case, the roll package implements all of the necessary operations (including rolling linear models), so I don't really see the need to go into rolling statistics.
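
For reference, roll works column-wise on a numeric matrix in a single call and also covers rolling regressions; a small illustration on made-up data would be something like:

library(roll)

# Rolling statistics computed column-wise on a matrix in one call
m <- matrix(rnorm(1e4 * 3), ncol = 3)
mu  <- roll_mean(m, width = 5)   # 5-period rolling mean of every column
sdv <- roll_sd(m, width = 5)     # 5-period rolling standard deviation

# Rolling linear model: regress column 1 on column 2 over a 50-observation window
fit <- roll_lm(x = m[, 2, drop = FALSE], y = m[, 1, drop = FALSE], width = 50)
str(fit$coefficients)            # rolling intercept and slope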

@AdrianAntico
Author

AdrianAntico commented Aug 27, 2021

I'd argue that the use case I described to the data.table folks is pretty unique. I built a version for myself, but it utilizes data.table, not C/C++. While creating rolling stats quickly over full data sets is important, the partial-data-set use case is much more critical from a model-scoring perspective, where milliseconds matter most. I'm okay if you don't want to pursue this; I figured I'd run it by you just in case.
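
To sketch the idea (a simplified illustration using the data, Adrian and Factor1 objects from the example above, not the actual implementation): at scoring time only the trailing window per group is needed to score the newest row, so there is no need to roll over the entire panel.

library(data.table)

# Scoring-time shortcut: keep only the last `width` observations per group
# and compute the single trailing mean needed for the newest row
# (output column name MA_Adrian is just illustrative)
width <- 5L
score_stats <- data[, .(MA_Adrian = mean(tail(Adrian, width))), by = Factor1]
score_stats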

Thanks for your time.
