Odd Performance Characteristics In Benchmarks #1695

RoloEdits · 2023-10-20T05:57:50Z

Description
While benchmarking calamine and updating its README to show where its performance sits among the language ecosystems, I used excelize as the library for Go. During the benchmarking I noticed odd behavior.

This is the program I put together, adapted from the example.

package main

import (
        "fmt"
        "github.com/xuri/excelize/v2"
)

func main() {
        // Open workbook
        file, err := excelize.OpenFile(`NYC_311_SR_2010-2020-sample-1M.xlsx`)

        if err != nil {
                fmt.Println(err)
                return
        }

        defer func() {
                // Close the spreadsheet.
                if err := file.Close(); err != nil {
                        fmt.Println(err)
                }
        }()

        // Read all rows of the worksheet into memory
        rows, err := file.GetRows("NYC_311_SR_2010-2020-sample-1M")
        if err != nil {
                fmt.Println(err)
                return
        }

        // Iterate over rows
        for _, row := range rows {
                _ = row
        }
}

The benchmarks gave this result:

0.22.1 calamine.exe
  Time (mean ± σ):     25.278 s ±  0.424 s    [User: 24.852 s, System: 0.470 s]
  Range (min … max):   24.980 s … 26.369 s    10 runs

v2.8.0 excelize.exe
  Time (mean ± σ):     199.709 s ± 11.671 s    [User: 158.678 s, System: 69.350 s]
  Range (min … max):   193.934 s … 232.725 s    10 runs

I'm an outsider with basically zero Go knowledge, so excuse me if this turns out to be nothing, but most Rust vs Go benchmarks are usually not that far apart. A 7.9x difference seems out of the ordinary.

In another benchmark, I noticed excessive reading: 11x the file size on disk.

There was also writing, even though the program has no writing logic.

The CPU also shows a lot of spikes, which I presume come from garbage collection.

The dataset is from https://raw.githubusercontent.com/wiki/jqnatividad/qsv/files/NYC_311_SR_2010-2020-sample-1M.7z, saved as an xlsx file: 1M rows, 41 columns, 28M cells with values.

Output of go version:

go version go1.21.3 windows/amd64

Excelize version or commit ID:

v2.8.0

Environment details (OS, Microsoft Excel™ version, physical, etc.):
OS: Windows 11
CPU: RYZEN 9 5900X @ 4GHz
SSD: Sabrent 2TB Gen 4 PCIE

xuri · 2023-10-20T07:06:34Z

Thanks for your issue. There are two kinds of functions in the excelize library: normal mode functions and stream mode functions. The stream mode functions are used to generate or read a worksheet with a large amount of data at lower resource usage. Please try using the rows iterator like this:

package main

import (
    "fmt"

    "github.com/xuri/excelize/v2"
)

func main() {
    // Open workbook
    file, err := excelize.OpenFile(`NYC_311_SR_2010-2020-sample-1M.xlsx`)

    if err != nil {
        fmt.Println(err)
        return
    }

    defer func() {
        // Close the spreadsheet.
        if err := file.Close(); err != nil {
            fmt.Println(err)
        }
    }()

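    // Get a streaming row iterator for the worksheet
    rows, err := file.Rows("NYC_311_SR_2010-2020-sample-1M")
    if err != nil {
        fmt.Println(err)
        return
    }
    // Walk the rows one at a time without collecting them in memory
    for rows.Next() {
    }
    // Close the row iterator
    if err := rows.Close(); err != nil {
        fmt.Println(err)
    }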
}

2.6 GHz 6-Core Intel Core i7, 16 GB 2667 MHz DDR4, 500GB SSD, macOS Sonoma 14.0, go1.20 darwin/amd64

v2.8.0 excelize
  Time (mean ± σ):     62.155 s ±  3.452 s    [User: 60.534 s, System: 3.326 s]
  Range (min … max):   54.745 s … 68.057 s    10 runs

RoloEdits · 2023-10-20T07:27:16Z

Thanks for the quick response. This is the updated data.

Benchmark 1: excelize.exe
  Time (mean ± σ):     44.254 s ±  0.574 s    [User: 46.071 s, System: 7.754 s]
  Range (min … max):   42.947 s … 44.911 s    10 runs

I still notice writes being done, as well as the memory usage. I guess this is just part of the implementation?

I'll be sure to update the benchmarks in calamine's docs.
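For reference, the ratios behind these comparisons can be checked with a few lines of Go. This is only a sketch of the arithmetic. It uses the three mean times RoloEdits posted from the same Windows machine. The macOS figure is left out because it comes from different hardware. The helper name `ratio` is made up for illustration.

```go
package main

import "fmt"

// ratio reports how many times longer slow took than fast.
// Hypothetical helper, not part of any library.
func ratio(slow, fast float64) float64 {
	return slow / fast
}

func main() {
	// Mean wall-clock times in seconds, copied from the benchmark output
	// posted in this thread (all from the same Windows machine).
	const (
		calamine        = 25.278  // calamine 0.22.1
		excelizeGetRows = 199.709 // excelize v2.8.0 using GetRows
		excelizeRows    = 44.254  // excelize v2.8.0 using the Rows iterator
	)

	fmt.Printf("GetRows vs calamine: %.2fx\n", ratio(excelizeGetRows, calamine))
	fmt.Printf("Rows vs calamine:    %.2fx\n", ratio(excelizeRows, calamine))
	fmt.Printf("GetRows vs Rows:     %.2fx\n", ratio(excelizeGetRows, excelizeRows))
}
```

On these numbers, switching from GetRows to the Rows iterator shrinks the gap to calamine from roughly 7.9x to roughly 1.75x.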

xuri · 2023-10-20T07:39:57Z

To avoid high memory usage when reading large files, this library lets users specify the UnzipXMLSizeLimit option when opening the workbook. It sets the memory limit, in bytes, for unzipping the worksheet and shared string table. Worksheet XML is extracted to the system temporary directory when the file size exceeds this value, which is why you see data being written in reading mode. You can change the default to avoid this behavior. See also the docs and issue #1581.

xuri · 2023-10-21T10:42:44Z

I've closed this. If you have any further questions, let me know and I can reopen it anytime.

RoloEdits added a commit to RoloEdits/calamine that referenced this issue Oct 21, 2023

docs(performance): update excelize data and add openpyxl

36f37be

Previous `excelize` data was gotten using an improper iterator. New code comes from [here](qax-os/excelize#1695 (comment)).

RoloEdits mentioned this issue Oct 21, 2023

"Big-Data File" or "Database" based backend for very large spreadsheets tafia/calamine#368

Closed

xuri closed this as completed Oct 21, 2023
