Odd Performance Characteristics In Benchmarks #1695

Closed
RoloEdits opened this issue Oct 20, 2023 · 4 comments

@RoloEdits

Description
While benchmarking calamine and updating its README to show where its performance stands across language ecosystems, I used excelize as the Go library. During the benchmarking I noticed some odd behavior.

This is the program I put together, adapted from the example.

package main

import (
        "fmt"
        "github.com/xuri/excelize/v2"
)

func main() {
        // Open workbook
        file, err := excelize.OpenFile(`NYC_311_SR_2010-2020-sample-1M.xlsx`)

        if err != nil {
                fmt.Println(err)
                return
        }

        defer func() {
                // Close the spreadsheet.
                if err := file.Close(); err != nil {
                        fmt.Println(err)
                }
        }()

        // Get worksheet
        rows, err := file.GetRows("NYC_311_SR_2010-2020-sample-1M")
        if err != nil {
                fmt.Println(err)
                return
        }

        // Iterate over rows
        for _, row := range rows {
                _ = row
        }
}

The benchmarks gave this result:

0.22.1 calamine.exe
  Time (mean ± σ):     25.278 s ±  0.424 s    [User: 24.852 s, System: 0.470 s]
  Range (min … max):   24.980 s … 26.369 s    10 runs

v2.8.0 excelize.exe
  Time (mean ± σ):     199.709 s ± 11.671 s    [User: 158.678 s, System: 69.350 s]
  Range (min … max):   193.934 s … 232.725 s    10 runs

I'm an outsider with basically zero Go knowledge, so excuse me if this turns out to be nothing, but most Rust vs. Go benchmarks are usually not that far apart. A 7.9x difference seems out of the ordinary.
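As a quick check of the 7.9x figure, here is a minimal Go sketch that recomputes it from the mean times reported above. The function names `speedup` and `systemShare` are illustrative only, and the system-time share is just one rough way to read the numbers, not a diagnosis.

```go
package main

import "fmt"

// speedup reports how many times longer slow took than fast.
func speedup(slow, fast float64) float64 {
	return slow / fast
}

// systemShare reports the fraction of total CPU time (user + system)
// that was spent in the kernel.
func systemShare(user, system float64) float64 {
	return system / (user + system)
}

func main() {
	// All values are in seconds, copied from the benchmark output above.
	fmt.Printf("excelize vs calamine: %.1fx\n", speedup(199.709, 25.278))          // 7.9x
	fmt.Printf("excelize system share: %.1f%%\n", 100*systemShare(158.678, 69.350)) // 30.4%
	fmt.Printf("calamine system share: %.1f%%\n", 100*systemShare(24.852, 0.470))   // 1.9%
}
```

The comparatively large system-time share for the excelize run is at least consistent with the extra disk reads and writes described below, though the timing data alone does not prove the cause.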

In another benchmark, I noticed excessive reading, about 11x the file's size on disk:
[Image: bytes_from_disk]

As well as writing, even though the program has no write logic:
[Image: bytes_to_disk]

The CPU usage also shows a lot of spikes, which I presume come from garbage collection:
[Image: cpu_usage]

The dataset is from https://raw.githubusercontent.com/wiki/jqnatividad/qsv/files/NYC_311_SR_2010-2020-sample-1M.7z, saved as an xlsx file: 1M rows, 41 columns, and 28M cells with values.

Output of go version:

go version go1.21.3 windows/amd64

Excelize version or commit ID:

v2.8.0

Environment details (OS, Microsoft Excel™ version, physical, etc.):
OS: Windows 11
CPU: RYZEN 9 5900X @ 4GHz
SSD: Sabrent 2TB Gen 4 PCIE

@xuri
Member

xuri commented Oct 20, 2023

Thanks for your issue. There are two kinds of functions in the excelize library: normal mode functions and stream mode functions. Stream mode functions generate or read worksheets containing large amounts of data with lower resource usage. Please try using the rows iterator like this:

package main

import (
    "fmt"

    "github.com/xuri/excelize/v2"
)

func main() {
    // Open workbook
    file, err := excelize.OpenFile(`NYC_311_SR_2010-2020-sample-1M.xlsx`)

    if err != nil {
        fmt.Println(err)
        return
    }

    defer func() {
        // Close the spreadsheet.
        if err := file.Close(); err != nil {
            fmt.Println(err)
        }
    }()

    // Get worksheet
    rows, err := file.Rows("NYC_311_SR_2010-2020-sample-1M")
    if err != nil {
        fmt.Println(err)
        return
    }
    for rows.Next() {
    }
}

2.6 GHz 6-Core Intel Core i7, 16 GB 2667 MHz DDR4, 500GB SSD, macOS Sonoma 14.0, go1.20 darwin/amd64

v2.8.0 excelize
  Time (mean ± σ):     62.155 s ±  3.452 s    [User: 60.534 s, System: 3.326 s]
  Range (min … max):   54.745 s … 68.057 s    10 runs

@RoloEdits
Author

Thanks for the quick response. This is the updated data.

Benchmark 1: excelize.exe
  Time (mean ± σ):     44.254 s ±  0.574 s    [User: 46.071 s, System: 7.754 s]
  Range (min … max):   42.947 s … 44.911 s    10 runs

[Image: bytes_from_disk]

I still notice writes being done. I guess this is just part of the implementation?

[Image: bytes_to_disk]

[Image: cpu_usage]

And the memory usage:
[Image: mem_usage]
[Image: virt_mem_usage]

I'll be sure to update the benchmarks in calamine's docs.

@xuri
Member

xuri commented Oct 20, 2023

To avoid high memory usage when reading large files, this library lets users specify the UnzipXMLSizeLimit option when opening the workbook. It sets the memory limit, in bytes, for unzipping the worksheet and shared string table. Worksheet XML is extracted to the system temporary directory when its size exceeds this value, which is why you see data being written even in read mode. You can change the default to avoid this behavior. See also the docs and issue #1581.

RoloEdits added a commit to RoloEdits/calamine that referenced this issue Oct 21, 2023
Previous `excelize` data was gotten using an improper iterator. New code comes from [here](qax-os/excelize#1695 (comment)).
@xuri
Member

xuri commented Oct 21, 2023

I've closed this issue. If you have any further questions, please let me know and I can reopen it at any time.

@xuri xuri closed this as completed Oct 21, 2023