Improve efficiency of the default date/time parsing methods (16% speed-up) #1298

Merged 7 commits into develop on Aug 29, 2023

Conversation

sequba
Contributor

@sequba sequba commented Aug 14, 2023

Test case

A spreadsheet with 500k rows and 10 columns filled with string data. Total of 5M data cells, no formulas, no date/time values.

Script:

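import { HyperFormula, AlwaysDense } from 'hyperformula'

// Empty engine with a row limit large enough for the 500k-row test sheet
const hf = HyperFormula.buildFromArray([], {
  licenseKey: 'gpl-v3',
  maxRows: 500100,
  useStats: true,
  chooseAddressMappingPolicy: new AlwaysDense(),
})

// 500k rows x 10 columns, every cell holding the same string value
const data = Array(500000).fill(0).map(() => Array(10).fill(0).map(() => 'A500000'))

// Time the bulk load (in ms), then print the engine's internal statistics
const start = Date.now()
hf.setSheetContent(0, data)
const end = Date.now()
console.log(end - start)
console.log(hf.getStats())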

Ideas for improvement

  • Function getTopSortedWithSccSubgraphFrom (42.7% of total time): the iterative implementation of Tarjan's algorithm, which topologically sorts the dependency graph and finds its cycles (SCCs). In this test case the dependency graph is trivial: it contains only isolated nodes and no edges, so the sort could likely be done more efficiently.
  • Function parseDateTimeFromConfigFormats (21.2% of total time): tries to parse every string as a date/time value. This test case contains no date/time values, so time could be saved by detecting that quickly and skipping the heavy parsing operations.
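
One possible shape of a shortcut for the first idea, sketched under assumptions: when no node has an outgoing edge, every node is its own trivial SCC and any node order is a valid topological order, so the full Tarjan traversal can be skipped. The `topSortEdgeless` helper, its return shape, and the Map-of-Sets graph representation are hypothetical and do not reflect HyperFormula's internal graph classes.

```javascript
// Hypothetical helper, not part of HyperFormula.
// The graph is modelled as Map<node, Set<successor>>.
// Returns the nodes in a valid topological order when the graph has no edges,
// or null to signal that the caller must fall back to the full algorithm.
function topSortEdgeless(adjacency) {
  const sorted = []
  for (const [node, successors] of adjacency) {
    if (successors.size > 0) {
      return null // an edge exists, so the shortcut does not apply
    }
    sorted.push(node)
  }
  // No edges means no cycles, so the list of cycled nodes is empty.
  return { sorted, cycled: [] }
}
```

The check costs one pass over the nodes and gives up at the first edge it finds, so it adds little overhead for sheets that do contain formulas.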

This PR focuses on optimizing date/time parsing functions

  1. The date and time format strings (provided in the engine's configuration) must be pre-processed before they can be used to parse date/time values from input strings. I extracted this pre-processing code and memoized it, so that it runs only once per configured format instead of once for every data cell in the spreadsheet.
  2. I introduced a quick regexp check that detects early the strings that certainly cannot be parsed as date/time values, which avoids running the expensive parsing operations on them.
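
The two changes above can be illustrated with a minimal sketch. This is not the code from this PR: `preprocessFormat`, `memoize`, `QUICK_CHECK`, and `parseDateTime` are hypothetical names, the tokenizing and parsing logic is a simplified stand-in, and the regular expression is only an assumed example of a cheap pre-filter.

```javascript
// Hypothetical sketch; names, parsing logic, and the regexp are illustrative
// assumptions, not HyperFormula's actual implementation.

let preprocessCalls = 0 // counts how often the expensive step really runs

// Stand-in for the per-format pre-processing step:
// lower-case the format string and split it into letter tokens.
function preprocessFormat(format) {
  preprocessCalls++
  return format.toLowerCase().split(/[^a-z]+/).filter(Boolean)
}

// Generic single-argument memoization backed by a Map.
function memoize(fn) {
  const cache = new Map()
  return (arg) => {
    if (!cache.has(arg)) {
      cache.set(arg, fn(arg))
    }
    return cache.get(arg)
  }
}

const preprocessFormatMemoized = memoize(preprocessFormat)

// Assumed pre-filter: a candidate must start with a digit and may contain only
// digits, whitespace, common separators, and the letters of an am/pm marker.
const QUICK_CHECK = /^\s*\d[\d\s:.\/\-apm]*$/i

function parseDateTime(input, formats) {
  if (!QUICK_CHECK.test(input)) {
    return undefined // rejected before any format is pre-processed
  }
  const parts = input.trim().split(/\D+/).filter(Boolean)
  for (const format of formats) {
    const tokens = preprocessFormatMemoized(format) // computed once per format
    if (tokens.length === parts.length) {
      // Simplified stand-in for real parsing: pair each token with a number.
      return Object.fromEntries(tokens.map((token, i) => [token, Number(parts[i])]))
    }
  }
  return undefined
}
```

With `formats = ['DD/MM/YYYY']`, a cell such as `'A500000'` fails the regexp and returns immediately, while repeated calls with date-like strings reuse the cached tokens, so `preprocessCalls` stays at 1 no matter how many cells are parsed.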

Results

Total time:
Before: 32264ms
After: 26391ms

Function parseDateTimeFromConfigFormats:
Before: 21.2%
After: 6.3%

Profiler (Chrome Dev Tools): [screenshots: before, after]

This result corresponds to ~16% speed-up of HyperFormula for this use-case.

How did you test your changes?

  • correctness verified by the full suite of unit tests
  • verified the improvement on the test case described above
  • ran the regular performance benchmarks (results included in this PR as a comment below)

Types of changes

  • Breaking change (a fix or a feature because of which an existing functionality doesn't work as expected anymore)
  • New feature or improvement (a non-breaking change that adds functionality)
  • Bug fix (a non-breaking change that fixes an issue)
  • Additional language file, or a change to an existing language file (translations)
  • Change to the documentation

Related issues:

  1. Slow graph building with 500k of rows without formulas #876

Checklist:

  • I have reviewed the guidelines about Contributing to HyperFormula and I confirm that my code follows the code style of this project.
  • I have signed the Contributor License Agreement.
  • My change is compliant with the OpenDocument standard.
  • My change is compatible with Microsoft Excel.
  • My change is compatible with Google Sheets.
  • I described my changes in the CHANGELOG.md file.
  • My changes require a documentation update.
  • My changes require a migration guide.

@github-actions

github-actions bot commented Aug 14, 2023

Performance comparison of head (4224b3a) vs base (276f731)

                                     testName |   base |  head |  change
------------------------------------------------------------------------
                                      Sheet A | 1003.6 | 981.2 |  -2.23%
                                      Sheet B |  358.7 | 380.3 |  +6.02%
                                      Sheet T |  397.8 |   308 | -22.57%
                                Column ranges |  851.4 | 885.9 |  +4.05%
Sheet A:  change value, add/remove row/column |     51 |    56 |  +9.80%
 Sheet B: change value, add/remove row/column |    424 |   423 |  -0.24%
                   Column ranges - add column |    330 |   586 | +77.58%
                Column ranges - without batch |    866 |   912 |  +5.31%
                        Column ranges - batch |    240 |   453 | +88.75%
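
The change column is consistent with the relative difference of head against base, which can be checked with a small sketch. The `relativeChange` function is a hypothetical helper written for this check and is not taken from the benchmark tooling; the table does not state the unit of the base and head values.

```javascript
// Hypothetical helper: relative change of head versus base, in percent,
// rounded to two decimal places as in the table above.
function relativeChange(base, head) {
  return Number((((head - base) / base) * 100).toFixed(2))
}

console.log(relativeChange(397.8, 308)) // Sheet T row
console.log(relativeChange(330, 586))   // "Column ranges - add column" row
```

For example, the Sheet T row gives (308 - 397.8) / 397.8 ≈ -22.57%, matching the reported value.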

@sequba sequba changed the base branch from master to develop August 14, 2023 15:22
@sequba sequba self-assigned this Aug 16, 2023
@sequba sequba requested a review from budnix August 16, 2023 09:14
Member

@budnix budnix left a comment

Looks good 👌

@sequba sequba merged commit ef996d2 into develop Aug 29, 2023
21 checks passed
@sequba sequba deleted the feature/issue-876-datetime-parsing branch August 29, 2023 11:52
@sequba sequba changed the title from "Improve efficiency of the default date/time parsing methods" to "Improve efficiency of the default date/time parsing methods (16% speed-up)" Aug 30, 2023

codecov bot commented Oct 30, 2024

Codecov Report

Attention: Patch coverage is 98.80952% with 1 line in your changes missing coverage. Please review.

Project coverage is 97.23%. Comparing base (276f731) to head (4224b3a).
Report is 81 commits behind head on develop.

Files with missing lines | Patch % | Lines
--------------------------------------------------
src/DateTimeDefault.ts   |  98.70% | 1 Missing ⚠️

@@             Coverage Diff             @@
##           develop    #1298      +/-   ##
===========================================
+ Coverage    97.20%   97.23%   +0.03%     
===========================================
  Files          167      167              
  Lines        14299    14304       +5     
  Branches      3064     3065       +1     
===========================================
+ Hits         13899    13909      +10     
  Misses         395      395              
+ Partials         5        0       -5     
Files with missing lines | Coverage Δ
------------------------------------------------------------
src/DateTimeHelper.ts    | 96.19% <100.00%> (+0.54%) ⬆️
src/format/format.ts     | 99.30% <100.00%> (ø)
src/DateTimeDefault.ts   | 97.43% <98.70%> (+3.68%) ⬆️

... and 3 files with indirect coverage changes
