Summary
Resolves #1506.
As detailed at https://github.com/johnkerl/miller/blob/6.11.0/pkg/mlrval/mlrmap.go#L1-L61, there is a performance trade-off in using hashmaps (vs. linear search) for key lookups within records. At lower column counts, computing the hash maps takes slightly more time than is saved by having them. But at higher column counts (see #1506), the penalty for not hashing becomes prohibitive.
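The shape of this trade-off can be illustrated with a small, self-contained micro-benchmark. Note this is plain Python (a dict vs. a linear scan over key/value pairs) standing in for Miller's hashed vs. unhashed mlrmap, so the absolute numbers and the crossover point are illustrative only, not measurements of Miller itself:

```python
import time

def linear_lookup(pairs, key):
    # Unhashed record: scan the key/value pairs in order.
    for k, v in pairs:
        if k == key:
            return v
    return None

def bench(ncols, nrecords=50):
    """Time nrecords 'records' of ncols fields, looking up every key once."""
    keys = [f"f{i}" for i in range(ncols)]
    pairs = [(k, 1) for k in keys]

    # Unhashed: no setup cost, but O(ncols) per lookup.
    t0 = time.perf_counter()
    for _ in range(nrecords):
        for k in keys:
            linear_lookup(pairs, k)
    linear_s = time.perf_counter() - t0

    # Hashed: pay to build a map per record, then O(1) lookups.
    t0 = time.perf_counter()
    for _ in range(nrecords):
        hashed = dict(pairs)
        for k in keys:
            hashed[k]
    hashed_s = time.perf_counter() - t0
    return linear_s, hashed_s

for ncols in (5, 50, 500):
    linear_s, hashed_s = bench(ncols)
    print(f"{ncols:4d} columns: linear {linear_s:.4f}s  hashed {hashed_s:.4f}s")
```

At small column counts the per-record map construction dominates; as the column count grows, the linear scan's quadratic total cost takes over.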
Thus Miller will now default to record-hashing. Users can still (as always) use
mlr --no-hash-records
for a small performance gain on low-column-count data.

Analysis
Preparation of data
Here is a script to generate TSV files of varying row and column counts:
mkt.py
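The body of mkt.py is collapsed in this view. As a rough, hypothetical stand-in (the argument handling and field naming here are assumptions, not the original script), a generator of R-row, C-column TSV might look like:

```python
#!/usr/bin/env python3
# Hypothetical stand-in for mkt.py: write an nrows x ncols TSV to stdout.
# Field names and cell values are placeholders, not the original script's.
import sys

def write_tsv(nrows, ncols, out=sys.stdout):
    # Header row: f1, f2, ..., fC.
    print("\t".join(f"f{j+1}" for j in range(ncols)), file=out)
    # Data rows: distinct integer cell values.
    for i in range(nrows):
        print("\t".join(str(i * ncols + j) for j in range(ncols)), file=out)

if __name__ == "__main__":
    write_tsv(int(sys.argv[1]), int(sys.argv[2]))
```

Invoked as, e.g., python3 mkt.py 1000 100000 > wide.tsv (the file name is illustrative).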
Example output:
We can create files of varying dimensions like this:
File-size details:
Timings
Comparison of ingest performance
Recall that the "tall" case is where previous performance optimizations have been focused, while the "wide" case is the current area of interest surfaced by #1506. TL;DR: this PR yields a huge improvement in the wide case, and is near break-even in the tall case.
Analysis of ingest-performance timings
Again, varying the row and column counts widely (which was not previously done at https://miller.readthedocs.io/en/latest/new-in-miller-6/#performance-benchmarks), we see a game-changing improvement in the wide case and a near break-even in the tall case.
Other benchmarks
Following https://miller.readthedocs.io/en/latest/new-in-miller-6/#performance-benchmarks.
Prep:
cp mlr mlr-hn
before this PR.
cp mlr mlr-hy
after this PR.

Outputs using the above benchmark scripts:
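The raw outputs are collapsed above; they come from the project's benchmark scripts linked earlier. For reference, a harness in the same spirit could be sketched as follows (the binary names mlr-hn/mlr-hy follow the prep step; the data-file names and trial count are placeholders, not what the actual benchmarks used):

```python
#!/usr/bin/env python3
# Compare ingest time of the pre-PR (mlr-hn) and post-PR (mlr-hy) binaries
# by timing `mlr --tsv cat` over the same file and discarding its output.
import statistics
import subprocess
import time

def median_seconds(argv, trials=5):
    """Median wall-clock time of running argv, output discarded."""
    samples = []
    for _ in range(trials):
        t0 = time.perf_counter()
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)

def speedup(before_s, after_s):
    """How many times faster the after-PR binary handles the same file."""
    return before_s / after_s

if __name__ == "__main__":
    for path in ["tall.tsv", "wide.tsv"]:  # placeholder data files
        hn = median_seconds(["./mlr-hn", "--tsv", "cat", path])
        hy = median_seconds(["./mlr-hy", "--tsv", "cat", path])
        print(f"{path}: mlr-hn {hn:.2f}s  mlr-hy {hy:.2f}s  "
              f"({speedup(hn, hy):.1f}x)")
```

Taking the median over several trials damps scheduler and cache noise in wall-clock timings.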
Conclusion