Enable record-hashing by default #1507

Merged
johnkerl merged 2 commits into main from kerl/hash-on on Feb 26, 2024
@johnkerl johnkerl commented Feb 26, 2024

Summary

Resolves #1506.

There is a performance trade-off between using hashmaps and linear search for key lookups within records (see https://github.com/johnkerl/miller/blob/6.11.0/pkg/mlrval/mlrmap.go#L1-L61 for details). At low column counts, computing the hash maps takes a little more time than is saved by having them. At high column counts (see #1506), the penalty for not hashing becomes prohibitive.

Thus record-hashing is now the default. Users can still (as always) use mlr --no-hash-records for a small performance gain on low-column-count data.
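
To make the trade-off concrete, here is a minimal Python sketch (not Miller's Go implementation) of an insertion-ordered record whose key lookups use either a linear scan or a key-to-position dict. The Record class, the build helper, and the comparisons counter are hypothetical names for illustration, and the sketch assumes that each field insertion first checks whether the key is already present.

```python
class Record:
    """Insertion-ordered key/value record with an optional key-to-position dict."""

    def __init__(self, use_hash):
        self.keys = []
        self.values = []
        # When hashing is on, map each key to its position in self.keys.
        self.index = {} if use_hash else None
        # Number of key comparisons performed by linear scans.
        self.comparisons = 0

    def _find(self, key):
        """Return the position of key, or -1 if it is absent."""
        if self.index is not None:
            return self.index.get(key, -1)
        for i, k in enumerate(self.keys):
            self.comparisons += 1
            if k == key:
                return i
        return -1

    def put(self, key, value):
        """Overwrite the value if key exists, else append a new field."""
        i = self._find(key)
        if i >= 0:
            self.values[i] = value
            return
        if self.index is not None:
            self.index[key] = len(self.keys)
        self.keys.append(key)
        self.values.append(value)


def build(ncol, use_hash):
    """Build one record with ncol distinct keys, as when ingesting one wide row."""
    rec = Record(use_hash)
    for j in range(ncol):
        rec.put("k%07d" % j, j)
    return rec


if __name__ == "__main__":
    for ncol in (10, 100, 1000):
        print(ncol, build(ncol, False).comparisons, build(ncol, True).comparisons)
```

Under these assumptions, building an n-field record without the dict costs n(n-1)/2 key comparisons, while the hashed variant does none but pays for maintaining the dict on every record, which is the overhead that shows up on narrow data.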

Analysis

Preparation of data

Here is a script to generate TSV files of varying row and column counts:

mkt.py

#!/usr/bin/env python

import sys

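# Defaults: 2 rows (the header plus 1 data row) and 100 columns.
nrow = 2
ncol = 100
# Usage: mkt.py [ncol]  or  mkt.py [nrow] [ncol]
if len(sys.argv) == 2:
    ncol = int(sys.argv[1])
if len(sys.argv) == 3:
    nrow = int(sys.argv[1])
    ncol = int(sys.argv[2])

# The first row is the header (prefix "k"); all later rows are data (prefix "v").
prefix = "k"
for i in range(nrow):
    for j in range(ncol):
        if j == 0:
            sys.stdout.write("%s%07d" % (prefix, j))
        else:
            sys.stdout.write("\t%s%07d" % (prefix, j))
    sys.stdout.write("\n")
    prefix = "v"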

Example output:

$ ./mkt.py 3 5
k0000000	k0000001	k0000002	k0000003	k0000004
v0000000	v0000001	v0000002	v0000003	v0000004
v0000000	v0000001	v0000002	v0000003	v0000004

We can create files of varying dimensions like this:

wide_nrows="2 3 4 5 6 7 8 9 10"
wide_ncols="10000 20000 40000 60000 80000 100000"

tall_nrows="20000 30000 40000 50000 60000 70000 80000 90000 100000"
tall_ncols="10 20 40 60 80 100"

echo WIDE DATA

for nrow in $wide_nrows; do
  for ncol in $wide_ncols; do
    echo $nrow $ncol
    ./mkt.py $nrow $ncol > wide-$nrow-$ncol.tsv
  done
done

echo TALL DATA

for nrow in $tall_nrows; do
  for ncol in $tall_ncols; do
    echo $nrow $ncol
    ./mkt.py $nrow $ncol > tall-$nrow-$ncol.tsv
  done
done

File-size details:


$ ls -lh tall-*.tsv | head
-rw-r--r--  1 kerl  staff   8.6M Feb 25 19:15 tall-100000-10.tsv
-rw-r--r--  1 kerl  staff    86M Feb 25 19:15 tall-100000-100.tsv
-rw-r--r--  1 kerl  staff    17M Feb 25 19:15 tall-100000-20.tsv
-rw-r--r--  1 kerl  staff    34M Feb 25 19:15 tall-100000-40.tsv
-rw-r--r--  1 kerl  staff    51M Feb 25 19:15 tall-100000-60.tsv
-rw-r--r--  1 kerl  staff    69M Feb 25 19:15 tall-100000-80.tsv
-rw-r--r--  1 kerl  staff   1.7M Feb 25 19:15 tall-20000-10.tsv
-rw-r--r--  1 kerl  staff    17M Feb 25 19:15 tall-20000-100.tsv
-rw-r--r--  1 kerl  staff   3.4M Feb 25 19:15 tall-20000-20.tsv
-rw-r--r--  1 kerl  staff   6.9M Feb 25 19:15 tall-20000-40.tsv
$ ls -lh wide-*.tsv | head
-rw-r--r--  1 kerl  staff   879K Feb 25 19:15 wide-10-10000.tsv
-rw-r--r--  1 kerl  staff   8.6M Feb 25 19:15 wide-10-100000.tsv
-rw-r--r--  1 kerl  staff   1.7M Feb 25 19:15 wide-10-20000.tsv
-rw-r--r--  1 kerl  staff   3.4M Feb 25 19:15 wide-10-40000.tsv
-rw-r--r--  1 kerl  staff   5.1M Feb 25 19:15 wide-10-60000.tsv
-rw-r--r--  1 kerl  staff   6.9M Feb 25 19:15 wide-10-80000.tsv
-rw-r--r--  1 kerl  staff   176K Feb 25 19:14 wide-2-10000.tsv
-rw-r--r--  1 kerl  staff   1.7M Feb 25 19:15 wide-2-100000.tsv
-rw-r--r--  1 kerl  staff   352K Feb 25 19:14 wide-2-20000.tsv
-rw-r--r--  1 kerl  staff   703K Feb 25 19:15 wide-2-40000.tsv


Timings

echo WIDE OLD

for nrow in $wide_nrows; do
  for ncol in $wide_ncols; do
    echo $nrow $ncol $(justtime -r mlr --no-hash-records --tsv nothing wide-$nrow-$ncol.tsv)
  done
done | mlr --pprint --hi label nrow,ncol,seconds | tee wide-old.tbl

echo TALL OLD

for nrow in $tall_nrows; do
  for ncol in $tall_ncols; do
    echo $nrow $ncol $(justtime -r mlr --no-hash-records --tsv nothing tall-$nrow-$ncol.tsv)
  done
done | mlr --pprint --hi label nrow,ncol,seconds | tee tall-old.tbl

echo WIDE NEW

for nrow in $wide_nrows; do
  for ncol in $wide_ncols; do
    echo $nrow $ncol $(justtime -r mlr --hash-records --tsv nothing wide-$nrow-$ncol.tsv)
  done
done | mlr --pprint --hi label nrow,ncol,seconds | tee wide-new.tbl

echo TALL NEW

for nrow in $tall_nrows; do
  for ncol in $tall_ncols; do
    echo $nrow $ncol $(justtime -r mlr --hash-records --tsv nothing tall-$nrow-$ncol.tsv)
  done
done | mlr --pprint --hi label nrow,ncol,seconds | tee tall-new.tbl
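
The justtime helper used above is not shown in this PR. As a rough stand-in, the following Python sketch measures the wall-clock time of one command. The name time_command is hypothetical, and its behavior is an assumption rather than a description of what justtime -r does.

```python
import subprocess
import time


def time_command(argv):
    """Run argv to completion, discard its stdout, and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(argv, stdout=subprocess.DEVNULL, check=True)
    return time.perf_counter() - start


# Example (requires mlr and the generated file to be present):
#   seconds = time_command(["mlr", "--hash-records", "--tsv", "nothing", "wide-10-100000.tsv"])
```

Discarding stdout keeps terminal output cost out of the measurement; a single run is still noisy, so repeating each measurement is advisable.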

Comparison of ingest performance

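Recall that the "tall" case is where previous performance optimizations have been focused, and the "wide" case is the current area of interest as surfaced by #1506. TL;DR: this PR gives a huge improvement in the wide case, along with a near break-even in the tall case. In the output below, tables come in pairs: the first of each pair is from the old timings (--no-hash-records) and the second is from the new timings (--hash-records), with all times in seconds.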

echo COMPARE WIDE

mlr --pprint reshape -s nrow,seconds wide-old.tbl
mlr --pprint reshape -s nrow,seconds wide-new.tbl
mlr --pprint reshape -s ncol,seconds wide-old.tbl
mlr --pprint reshape -s ncol,seconds wide-new.tbl

echo COMPARE TALL

mlr --pprint reshape -s nrow,seconds tall-old.tbl
mlr --pprint reshape -s nrow,seconds tall-new.tbl
mlr --pprint reshape -s ncol,seconds tall-old.tbl
mlr --pprint reshape -s ncol,seconds tall-new.tbl

COMPARE WIDE

ncol   2      3      4      5      6      7      8      9      10
10000  0.207  0.240  0.347  0.464  0.575  1.719  0.799  0.925  1.021
20000  0.461  0.929  1.373  1.800  2.253  2.704  3.162  3.596  4.043
40000  1.812  3.635  5.816  7.185  9.049  10.803 12.583 14.354 16.198
60000  4.138  8.225  12.673 16.077 20.219 24.332 28.371 32.349 36.474
80000  7.226  14.651 21.572 28.757 36.113 43.305 51.895 58.540 65.093
100000 11.375 22.816 33.969 45.590 57.351 68.354 80.531 92.286 106.348

ncol   2     3     4     5     6     7     8     9     10
10000  0.017 0.020 0.021 0.025 0.027 0.029 0.032 0.033 0.036
20000  0.020 0.026 0.028 0.034 0.038 0.043 0.048 0.053 0.057
40000  0.026 0.035 0.043 0.053 0.062 0.073 0.081 0.095 0.106
60000  0.032 0.047 0.061 0.079 0.099 0.111 0.124 0.136 0.151
80000  0.037 0.055 0.073 0.099 0.118 0.139 0.154 0.175 0.193
100000 0.043 0.063 0.089 0.113 0.139 0.165 0.184 0.214 0.243

nrow 10000 20000 40000  60000  80000  100000
2    0.207 0.461 1.812  4.138  7.226  11.375
3    0.240 0.929 3.635  8.225  14.651 22.816
4    0.347 1.373 5.816  12.673 21.572 33.969
5    0.464 1.800 7.185  16.077 28.757 45.590
6    0.575 2.253 9.049  20.219 36.113 57.351
7    1.719 2.704 10.803 24.332 43.305 68.354
8    0.799 3.162 12.583 28.371 51.895 80.531
9    0.925 3.596 14.354 32.349 58.540 92.286
10   1.021 4.043 16.198 36.474 65.093 106.348

nrow 10000 20000 40000 60000 80000 100000
2    0.017 0.020 0.026 0.032 0.037 0.043
3    0.020 0.026 0.035 0.047 0.055 0.063
4    0.021 0.028 0.043 0.061 0.073 0.089
5    0.025 0.034 0.053 0.079 0.099 0.113
6    0.027 0.038 0.062 0.099 0.118 0.139
7    0.029 0.043 0.073 0.111 0.139 0.165
8    0.032 0.048 0.081 0.124 0.154 0.184
9    0.033 0.053 0.095 0.136 0.175 0.214
10   0.036 0.057 0.106 0.151 0.193 0.243

COMPARE TALL

ncol 20000 30000 40000 50000 60000 70000 80000 90000 100000
10   0.046 0.062 0.077 0.093 0.106 0.121 0.137 0.150 0.169
20   0.079 0.111 0.140 0.172 0.206 0.236 0.268 0.294 0.327
40   0.154 0.226 0.288 0.359 0.429 0.494 0.563 0.629 0.699
60   0.250 0.361 0.475 0.587 0.709 0.817 0.928 1.039 1.152
80   0.354 0.521 0.685 0.869 1.023 1.194 1.348 1.513 1.685
100  0.484 0.719 0.948 1.165 1.406 1.620 1.847 2.071 2.308

ncol 20000 30000 40000 50000 60000 70000 80000 90000 100000
10   0.054 0.072 0.091 0.109 0.135 0.152 0.176 0.189 0.206
20   0.096 0.135 0.180 0.217 0.264 0.304 0.347 0.376 0.414
40   0.180 0.255 0.337 0.415 0.510 0.585 0.674 0.732 0.809
60   0.275 0.396 0.524 0.651 0.794 0.926 1.063 1.151 1.273
80   0.345 0.503 0.674 0.882 1.009 1.177 1.396 1.453 1.611
100  0.413 0.606 0.814 1.033 1.224 1.425 1.580 1.763 1.957

nrow   10    20    40    60    80    100
20000  0.046 0.079 0.154 0.250 0.354 0.484
30000  0.062 0.111 0.226 0.361 0.521 0.719
40000  0.077 0.140 0.288 0.475 0.685 0.948
50000  0.093 0.172 0.359 0.587 0.869 1.165
60000  0.106 0.206 0.429 0.709 1.023 1.406
70000  0.121 0.236 0.494 0.817 1.194 1.620
80000  0.137 0.268 0.563 0.928 1.348 1.847
90000  0.150 0.294 0.629 1.039 1.513 2.071
100000 0.169 0.327 0.699 1.152 1.685 2.308

nrow   10    20    40    60    80    100
20000  0.054 0.096 0.180 0.275 0.345 0.413
30000  0.072 0.135 0.255 0.396 0.503 0.606
40000  0.091 0.180 0.337 0.524 0.674 0.814
50000  0.109 0.217 0.415 0.651 0.882 1.033
60000  0.135 0.264 0.510 0.794 1.009 1.224
70000  0.152 0.304 0.585 0.926 1.177 1.425
80000  0.176 0.347 0.674 1.063 1.396 1.580
90000  0.189 0.376 0.732 1.151 1.453 1.763
100000 0.206 0.414 0.809 1.273 1.611 1.957
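
As a quick reading of the wide tables above, this Python sketch computes how much faster the hashed runs are for the 10-row files, and how run time grows each time the column count doubles. The numbers are copied from the nrow=10 rows of the wide tables; the variable names are illustrative only.

```python
# Seconds to ingest the 10-row wide files, copied from the nrow=10 rows above.
ncols = [10000, 20000, 40000, 60000, 80000, 100000]
no_hash = [1.021, 4.043, 16.198, 36.474, 65.093, 106.348]  # mlr --no-hash-records
hashed = [0.036, 0.057, 0.106, 0.151, 0.193, 0.243]        # mlr --hash-records

# How many times faster the hashed run is at each column count.
speedups = [old / new for old, new in zip(no_hash, hashed)]

# Growth in run time per doubling of column count: 10000 -> 20000 -> 40000 -> 80000.
doubling_idx = [0, 1, 2, 4]
pairs = list(zip(doubling_idx, doubling_idx[1:]))
no_hash_growth = [no_hash[b] / no_hash[a] for a, b in pairs]
hashed_growth = [hashed[b] / hashed[a] for a, b in pairs]

for ncol, s in zip(ncols, speedups):
    print("ncol=%6d  speedup=%.0fx" % (ncol, s))
print("no-hash growth per doubling:", ["%.2f" % g for g in no_hash_growth])
print("hashed growth per doubling: ", ["%.2f" % g for g in hashed_growth])
```

The no-hash times grow by roughly 4x per doubling of the column count, which is consistent with a cost that is quadratic in the number of columns, while the hashed times grow by less than 2x per doubling.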

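Analysis of ingest-performance timings

Note on column names: the commands below rename the seconds column of tall-new.tbl (--hash-records) to old_seconds and the seconds column of tall-old.tbl (--no-hash-records) to new_seconds, so the two labels are swapped relative to the file names. The ratio column is therefore the no-hash time as a percentage of the hash time: values below 100 mean hashing is slower, and values above 100 mean hashing is faster.
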
mlr --pprint \
  --from tall-new.tbl \
  rename seconds,old_seconds \
  then join -j nrow,ncol -f tall-old.tbl \
  then rename seconds,new_seconds \
  then put '$ratio=int(100*$new_seconds/$old_seconds)'

nrow   ncol new_seconds old_seconds ratio
20000  10   0.046       0.054       85
20000  20   0.079       0.096       82
20000  40   0.154       0.180       85
20000  60   0.250       0.275       90
20000  80   0.354       0.345       102
20000  100  0.484       0.413       117
30000  10   0.062       0.072       86
30000  20   0.111       0.135       82
30000  40   0.226       0.255       88
30000  60   0.361       0.396       91
30000  80   0.521       0.503       103
30000  100  0.719       0.606       118
40000  10   0.077       0.091       84
40000  20   0.140       0.180       77
40000  40   0.288       0.337       85
40000  60   0.475       0.524       90
40000  80   0.685       0.674       101
40000  100  0.948       0.814       116
50000  10   0.093       0.109       85
50000  20   0.172       0.217       79
50000  40   0.359       0.415       86
50000  60   0.587       0.651       90
50000  80   0.869       0.882       98
50000  100  1.165       1.033       112
60000  10   0.106       0.135       78
60000  20   0.206       0.264       78
60000  40   0.429       0.510       84
60000  60   0.709       0.794       89
60000  80   1.023       1.009       101
60000  100  1.406       1.224       114
70000  10   0.121       0.152       79
70000  20   0.236       0.304       77
70000  40   0.494       0.585       84
70000  60   0.817       0.926       88
70000  80   1.194       1.177       101
70000  100  1.620       1.425       113
80000  10   0.137       0.176       77
80000  20   0.268       0.347       77
80000  40   0.563       0.674       83
80000  60   0.928       1.063       87
80000  80   1.348       1.396       96
80000  100  1.847       1.580       116
90000  10   0.150       0.189       79
90000  20   0.294       0.376       78
90000  40   0.629       0.732       85
90000  60   1.039       1.151       90
90000  80   1.513       1.453       104
90000  100  2.071       1.763       117
100000 10   0.169       0.206       82
100000 20   0.327       0.414       78
100000 40   0.699       0.809       86
100000 60   1.152       1.273       90
100000 80   1.685       1.611       104
100000 100  2.308       1.957       117

mlr --pprint \
  --from tall-new.tbl \
  rename seconds,old_seconds \
  then join -j nrow,ncol -f tall-old.tbl \
  then rename seconds,new_seconds \
  then put '$ratio=int(100*$new_seconds/$old_seconds)' \
  then stats1 -a mean -f ratio

ratio_mean
91.92592592592592

mlr --pprint \
  --from tall-new.tbl \
  rename seconds,old_seconds \
  then join -j nrow,ncol -f tall-old.tbl \
  then rename seconds,new_seconds \
  then put '$ratio=int(100*$new_seconds/$old_seconds)' \
  then sort -n ratio

nrow   ncol new_seconds old_seconds ratio
40000  20   0.140       0.180       77
70000  20   0.236       0.304       77
80000  10   0.137       0.176       77
80000  20   0.268       0.347       77
60000  10   0.106       0.135       78
60000  20   0.206       0.264       78
90000  20   0.294       0.376       78
100000 20   0.327       0.414       78
50000  20   0.172       0.217       79
70000  10   0.121       0.152       79
90000  10   0.150       0.189       79
20000  20   0.079       0.096       82
30000  20   0.111       0.135       82
100000 10   0.169       0.206       82
80000  40   0.563       0.674       83
40000  10   0.077       0.091       84
60000  40   0.429       0.510       84
70000  40   0.494       0.585       84
20000  10   0.046       0.054       85
20000  40   0.154       0.180       85
40000  40   0.288       0.337       85
50000  10   0.093       0.109       85
90000  40   0.629       0.732       85
30000  10   0.062       0.072       86
50000  40   0.359       0.415       86
100000 40   0.699       0.809       86
80000  60   0.928       1.063       87
30000  40   0.226       0.255       88
70000  60   0.817       0.926       88
60000  60   0.709       0.794       89
20000  60   0.250       0.275       90
40000  60   0.475       0.524       90
50000  60   0.587       0.651       90
90000  60   1.039       1.151       90
100000 60   1.152       1.273       90
30000  60   0.361       0.396       91
80000  80   1.348       1.396       96
50000  80   0.869       0.882       98
40000  80   0.685       0.674       101
60000  80   1.023       1.009       101
70000  80   1.194       1.177       101
20000  80   0.354       0.345       102
30000  80   0.521       0.503       103
90000  80   1.513       1.453       104
100000 80   1.685       1.611       104
50000  100  1.165       1.033       112
70000  100  1.620       1.425       113
60000  100  1.406       1.224       114
40000  100  0.948       0.814       116
80000  100  1.847       1.580       116
20000  100  0.484       0.413       117
90000  100  2.071       1.763       117
100000 100  2.308       1.957       117
30000  100  0.719       0.606       118
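
For reference, the same ratio can be computed in plain Python. This sketch uses the six nrow=100000 rows from the tables above and names the columns by flag rather than by old/new: no_hash holds the tall-old.tbl timings (--no-hash-records) and hashed holds the tall-new.tbl timings (--hash-records). The variable names are illustrative.

```python
# (ncol, no_hash_seconds, hashed_seconds) for nrow=100000, copied from the tables above.
rows = [
    (10, 0.169, 0.206),
    (20, 0.327, 0.414),
    (40, 0.699, 0.809),
    (60, 1.152, 1.273),
    (80, 1.685, 1.611),
    (100, 2.308, 1.957),
]

# Same formula as the Miller put expression: truncate 100 * a / b to an integer.
ratios = [int(100 * no_hash / hashed) for _, no_hash, hashed in rows]

for (ncol, no_hash, hashed), ratio in zip(rows, ratios):
    verdict = "hashing slower" if ratio < 100 else "hashing faster or equal"
    print("ncol=%3d  ratio=%3d  %s" % (ncol, ratio, verdict))
```

For these rows the crossover sits between 60 and 80 columns: below it hashing costs some time, and above it hashing saves time.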

Again, we see that with widely varying row and column counts (which was not previously done at https://miller.readthedocs.io/en/latest/new-in-miller-6/#performance-benchmarks), we have a game-changing improvement in the wide case and a near-break-even in the tall case.

Other benchmarks

These follow the benchmarks at https://miller.readthedocs.io/en/latest/new-in-miller-6/#performance-benchmarks.

Prep:

  • Before this PR: build mlr, then cp mlr mlr-hn.
  • After this PR: build mlr, then cp mlr mlr-hy.

mlr --ocsv repeat -n 100 data/medium > ~/tmp/big.csv
mlr --c2d cat ~/tmp/big.csv > ~/tmp/big.dkvp
mlr --c2j cat ~/tmp/big.csv > ~/tmp/big.json
mlr --c2x cat ~/tmp/big.csv > ~/tmp/big.xtab
mlr --c2n cat ~/tmp/big.csv > ~/tmp/big.nidx

Outputs using the above benchmark scripts:

$ sh scripts/chain-cmps.sh ./mlr-hn ./mlr-hy

TIME IN SECONDS 0.432 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv check
TIME IN SECONDS 0.409 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv check
TIME IN SECONDS 0.412 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv check
TIME IN SECONDS 0.528 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv check
TIME IN SECONDS 0.528 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv check
TIME IN SECONDS 0.532 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv check

TIME IN SECONDS 0.438 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv cat
TIME IN SECONDS 0.432 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv cat
TIME IN SECONDS 0.435 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv cat
TIME IN SECONDS 0.540 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv cat
TIME IN SECONDS 0.543 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv cat
TIME IN SECONDS 0.539 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv cat

TIME IN SECONDS 0.418 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv tail
TIME IN SECONDS 0.424 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv tail
TIME IN SECONDS 0.418 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv tail
TIME IN SECONDS 0.535 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv tail
TIME IN SECONDS 0.533 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv tail
TIME IN SECONDS 0.544 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv tail

TIME IN SECONDS 1.302 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv tac
TIME IN SECONDS 1.149 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv tac
TIME IN SECONDS 1.036 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv tac
TIME IN SECONDS 1.422 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv tac
TIME IN SECONDS 1.223 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv tac
TIME IN SECONDS 1.202 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv tac

TIME IN SECONDS 1.115 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv sort -f shape
TIME IN SECONDS 1.118 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv sort -f shape
TIME IN SECONDS 0.933 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv sort -f shape
TIME IN SECONDS 1.212 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv sort -f shape
TIME IN SECONDS 1.437 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv sort -f shape
TIME IN SECONDS 1.444 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv sort -f shape

TIME IN SECONDS 0.986 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv sort -n quantity
TIME IN SECONDS 1.033 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv sort -n quantity
TIME IN SECONDS 1.154 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv sort -n quantity
TIME IN SECONDS 1.337 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv sort -n quantity
TIME IN SECONDS 1.364 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv sort -n quantity
TIME IN SECONDS 1.354 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv sort -n quantity

TIME IN SECONDS 0.422 -- ./mlr-hn --c2p stats1 -a min,mean,max -f quantity,rate -g shape /Users/kerl/tmp/big.csv
TIME IN SECONDS 0.420 -- ./mlr-hn --c2p stats1 -a min,mean,max -f quantity,rate -g shape /Users/kerl/tmp/big.csv
TIME IN SECONDS 0.421 -- ./mlr-hn --c2p stats1 -a min,mean,max -f quantity,rate -g shape /Users/kerl/tmp/big.csv
TIME IN SECONDS 0.541 -- ./mlr-hy --c2p stats1 -a min,mean,max -f quantity,rate -g shape /Users/kerl/tmp/big.csv
TIME IN SECONDS 0.542 -- ./mlr-hy --c2p stats1 -a min,mean,max -f quantity,rate -g shape /Users/kerl/tmp/big.csv
TIME IN SECONDS 0.542 -- ./mlr-hy --c2p stats1 -a min,mean,max -f quantity,rate -g shape /Users/kerl/tmp/big.csv

TIME IN SECONDS 0.564 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv put -f scripts/chain-1.mlr
TIME IN SECONDS 0.566 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv put -f scripts/chain-1.mlr
TIME IN SECONDS 0.568 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv put -f scripts/chain-1.mlr
TIME IN SECONDS 0.606 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv put -f scripts/chain-1.mlr
TIME IN SECONDS 0.598 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv put -f scripts/chain-1.mlr
TIME IN SECONDS 0.606 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv put -f scripts/chain-1.mlr

----------------------------------------------------------------

$ sh scripts/chain-lengths.sh ./mlr-hn ./mlr-hy

TIME IN SECONDS 0.560 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv then put -f scripts/chain-1.mlr
TIME IN SECONDS 0.548 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv then put -f scripts/chain-1.mlr
TIME IN SECONDS 0.543 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv then put -f scripts/chain-1.mlr
TIME IN SECONDS 0.636 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv then put -f scripts/chain-1.mlr
TIME IN SECONDS 0.584 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv then put -f scripts/chain-1.mlr
TIME IN SECONDS 0.578 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv then put -f scripts/chain-1.mlr

TIME IN SECONDS 0.593 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr
TIME IN SECONDS 0.609 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr
TIME IN SECONDS 0.597 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr
TIME IN SECONDS 0.633 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr
TIME IN SECONDS 0.631 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr
TIME IN SECONDS 0.623 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr

TIME IN SECONDS 0.692 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr
TIME IN SECONDS 0.689 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr
TIME IN SECONDS 0.679 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr
TIME IN SECONDS 0.743 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr
TIME IN SECONDS 0.743 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr
TIME IN SECONDS 0.865 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr

TIME IN SECONDS 0.841 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr
TIME IN SECONDS 0.843 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr
TIME IN SECONDS 0.835 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr
TIME IN SECONDS 0.902 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr
TIME IN SECONDS 0.922 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr
TIME IN SECONDS 0.911 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr

TIME IN SECONDS 1.011 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr
TIME IN SECONDS 1.008 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr
TIME IN SECONDS 0.980 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr
TIME IN SECONDS 1.066 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr
TIME IN SECONDS 1.245 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr
TIME IN SECONDS 1.108 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr

TIME IN SECONDS 1.128 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr
TIME IN SECONDS 1.129 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr
TIME IN SECONDS 1.168 -- ./mlr-hn --csv --from /Users/kerl/tmp/big.csv then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr
TIME IN SECONDS 1.288 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr
TIME IN SECONDS 1.253 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr
TIME IN SECONDS 1.241 -- ./mlr-hy --csv --from /Users/kerl/tmp/big.csv then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr then put -f scripts/chain-1.mlr


$ sh scripts/time-big-files ./mlr-hn ./mlr-hy

TIME IN SECONDS 0.484 -- ./mlr-hn --csv cat /Users/kerl/tmp/big.csv
TIME IN SECONDS 0.425 -- ./mlr-hn --csv cat /Users/kerl/tmp/big.csv
TIME IN SECONDS 0.425 -- ./mlr-hn --csv cat /Users/kerl/tmp/big.csv
TIME IN SECONDS 0.594 -- ./mlr-hy --csv cat /Users/kerl/tmp/big.csv
TIME IN SECONDS 0.537 -- ./mlr-hy --csv cat /Users/kerl/tmp/big.csv
TIME IN SECONDS 0.537 -- ./mlr-hy --csv cat /Users/kerl/tmp/big.csv

TIME IN SECONDS 0.493 -- ./mlr-hn --csvlite cat /Users/kerl/tmp/big.csv
TIME IN SECONDS 0.488 -- ./mlr-hn --csvlite cat /Users/kerl/tmp/big.csv
TIME IN SECONDS 0.486 -- ./mlr-hn --csvlite cat /Users/kerl/tmp/big.csv
TIME IN SECONDS 0.599 -- ./mlr-hy --csvlite cat /Users/kerl/tmp/big.csv
TIME IN SECONDS 0.597 -- ./mlr-hy --csvlite cat /Users/kerl/tmp/big.csv
TIME IN SECONDS 0.602 -- ./mlr-hy --csvlite cat /Users/kerl/tmp/big.csv

TIME IN SECONDS 0.642 -- ./mlr-hn --dkvp cat /Users/kerl/tmp/big.dkvp
TIME IN SECONDS 0.638 -- ./mlr-hn --dkvp cat /Users/kerl/tmp/big.dkvp
TIME IN SECONDS 0.638 -- ./mlr-hn --dkvp cat /Users/kerl/tmp/big.dkvp
TIME IN SECONDS 0.759 -- ./mlr-hy --dkvp cat /Users/kerl/tmp/big.dkvp
TIME IN SECONDS 0.761 -- ./mlr-hy --dkvp cat /Users/kerl/tmp/big.dkvp
TIME IN SECONDS 0.764 -- ./mlr-hy --dkvp cat /Users/kerl/tmp/big.dkvp

TIME IN SECONDS 1.793 -- ./mlr-hn --nidx cat /Users/kerl/tmp/big.nidx
TIME IN SECONDS 1.787 -- ./mlr-hn --nidx cat /Users/kerl/tmp/big.nidx
TIME IN SECONDS 1.788 -- ./mlr-hn --nidx cat /Users/kerl/tmp/big.nidx
TIME IN SECONDS 1.901 -- ./mlr-hy --nidx cat /Users/kerl/tmp/big.nidx
TIME IN SECONDS 1.905 -- ./mlr-hy --nidx cat /Users/kerl/tmp/big.nidx
TIME IN SECONDS 1.902 -- ./mlr-hy --nidx cat /Users/kerl/tmp/big.nidx

TIME IN SECONDS 0.635 -- ./mlr-hn --xtab cat /Users/kerl/tmp/big.xtab
TIME IN SECONDS 0.630 -- ./mlr-hn --xtab cat /Users/kerl/tmp/big.xtab
TIME IN SECONDS 0.634 -- ./mlr-hn --xtab cat /Users/kerl/tmp/big.xtab
TIME IN SECONDS 0.635 -- ./mlr-hy --xtab cat /Users/kerl/tmp/big.xtab
TIME IN SECONDS 0.632 -- ./mlr-hy --xtab cat /Users/kerl/tmp/big.xtab
TIME IN SECONDS 0.645 -- ./mlr-hy --xtab cat /Users/kerl/tmp/big.xtab

TIME IN SECONDS 5.735 -- ./mlr-hn --json cat /Users/kerl/tmp/big.json
TIME IN SECONDS 5.224 -- ./mlr-hn --json cat /Users/kerl/tmp/big.json
TIME IN SECONDS 5.220 -- ./mlr-hn --json cat /Users/kerl/tmp/big.json
TIME IN SECONDS 5.444 -- ./mlr-hy --json cat /Users/kerl/tmp/big.json
TIME IN SECONDS 5.483 -- ./mlr-hy --json cat /Users/kerl/tmp/big.json
TIME IN SECONDS 5.328 -- ./mlr-hy --json cat /Users/kerl/tmp/big.json

Conclusion

  • Ingestion performance is dramatically improved in the high-column-count case
  • Ingestion performance is near-break-even in the low-column-count case
  • Other prior benchmarks are near-break-even

@johnkerl johnkerl marked this pull request as ready for review February 26, 2024 02:51
@johnkerl johnkerl merged commit fb1f7f8 into main Feb 26, 2024
6 checks passed
@johnkerl johnkerl deleted the kerl/hash-on branch February 26, 2024 02:51
Development

Successfully merging this pull request may close these issues.

Read performance can be improved for high-column-count data