
Jonas' optimization ideas #7

Open · jonashaag opened this issue Feb 20, 2023 · 7 comments

@jonashaag (Contributor)

I’ll use this to brain dump a few ideas. Maybe some of them are useful.

@jonashaag (Contributor, Author) commented Feb 20, 2023

STATUS: This is done for lightgbm (#15); for sklearn we're not doing it (#19).

We could try Parquet for storing the arrays. It has great support for sparse arrays (lots of NaNs, maybe even lots of arbitrary identical values).

I think Parquet is also pretty smart about using the smallest possible integer type on disk.

Also, in Parquet repeated values are essentially free because of Run Length Encoding.

We should be able to embed Parquet data into the pickle file.
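
A minimal sketch of how the embedding could work, assuming pyarrow as the Parquet library (the helper names here are hypothetical, not existing project code):

```python
import io
import pickle

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

def array_to_parquet_bytes(arr: np.ndarray) -> bytes:
    """Serialize a 1-D array into an in-memory Parquet buffer."""
    buf = io.BytesIO()
    # Parquet picks compact physical types and applies dictionary/RLE
    # encoding by default, which is where the savings come from.
    pq.write_table(pa.table({"values": arr}), buf)
    return buf.getvalue()

def parquet_bytes_to_array(data: bytes) -> np.ndarray:
    """Restore the array from the embedded Parquet bytes."""
    return pq.read_table(io.BytesIO(data)).column("values").to_numpy()

# The Parquet payload is plain bytes, so it embeds in a pickle like any object.
payload = array_to_parquet_bytes(np.arange(1_000_000) % 7)
blob = pickle.dumps({"children_left": payload})
restored = parquet_bytes_to_array(pickle.loads(blob)["children_left"])
```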

@jonashaag (Contributor, Author) commented Feb 20, 2023

STATUS: We don't need this for lightgbm since it uses Parquet, and the sklearn code currently has no boolean arrays.

We can use NumPy's pack functionality to represent boolean arrays as bitmaps (Parquet does this by default).
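
For reference, a minimal sketch using NumPy's standard packbits/unpackbits API:

```python
import numpy as np

mask = np.array([True, False, True, True, False, False, True, False])

packed = np.packbits(mask)                        # 8 bools (8 bytes) -> 1 byte
restored = np.unpackbits(packed, count=mask.size).astype(bool)

assert np.array_equal(mask, restored)
```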

@YYYasin19 (Contributor)

> We could try Parquet for storing the arrays.

You mean storing the whole model (for example, all 300 trees) in one large Parquet table, right?
That seems to come in at around 10 MB without compression, and 6.7 MB with compression='gzip', for an lgbm model file that is initially 20 MB. This is without any further optimization, not even string parsing etc.

@jonashaag (Contributor, Author) commented Feb 26, 2023

In a real-world model I just benchmarked, ALL the children_left arrays look like this:

[1, 2, 3, ..., 42, -1, -1, ..., N]

i.e. it is equivalent to range(1, N+1) with some -1 entries for the leaves.

If we replace the -1 entries with a more efficient representation, we can save ~10% of the final size.

Examples of more efficient representations:

  • List of -1 positions
  • Bitmap/boolean array of -1 positions
  • Encoding -1 as the previous value, e.g. [1, 2, 3, ..., 42, 42, 42, ..., N]; this should help with compression because it doesn't destroy the pattern as much (see the sketch below)
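
A minimal sketch of the third variant. It assumes child indices are unique within the array (true for tree child pointers), so adjacent equal values can only come from the encoding; the helper names are hypothetical:

```python
import numpy as np

def encode_prev(children_left: np.ndarray) -> np.ndarray:
    """Replace each -1 with the preceding value, preserving the
    monotone [1, 2, 3, ...] pattern for the compressor."""
    out = children_left.copy()
    for i in range(1, len(out)):
        if out[i] == -1:
            out[i] = out[i - 1]
    return out

def decode_prev(encoded: np.ndarray) -> np.ndarray:
    """Invert encode_prev: a value equal to its predecessor was a -1.
    Walk right to left so each comparison sees still-encoded values."""
    out = encoded.copy()
    for i in range(len(out) - 1, 0, -1):
        if out[i] == out[i - 1]:
            out[i] = -1
    return out

a = np.array([1, 2, 3, 42, -1, -1, 7])
assert np.array_equal(decode_prev(encode_prev(a)), a)
```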

@jonashaag (Contributor, Author) commented Mar 3, 2023

STATUS: Parquet seems to handle this just fine, not sure about lzma

We found a lot of values like 1e-35 in the lgbm data. Are they effectively NaN? If so, we could replace them with NaN and profit from Parquet's bitmap-based representation of missing values.
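
If they are, a minimal sketch of the replacement using pyarrow's mask argument, so the affected entries end up in the validity bitmap rather than the data pages (the 1e-30 threshold is a guess, assuming the ~1e-35 values really are placeholders):

```python
import numpy as np
import pyarrow as pa

values = np.array([0.5, 1e-35, 2.0, 1e-35, 0.25])
placeholder = np.abs(values) < 1e-30    # threshold is an assumption
# Masked entries become nulls, stored as single bits in the validity
# bitmap instead of full 8-byte floats in the data pages.
arr = pa.array(values, mask=placeholder)
```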

@jonashaag (Contributor, Author) commented Mar 13, 2023

Combine sklearn trees into a single array to profit from potentially better Parquet compression. E.g. if your random forest has 100 trees, concatenate the 100 per-tree arrays, like we do with lightgbm.

It could be that this doesn't give much of a reduction if the trees are large enough and the forests small enough, though. We can easily check manually.
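
A minimal sketch of the concatenation for a fitted sklearn forest (`forest` is a hypothetical fitted RandomForestClassifier/Regressor); storing per-tree offsets lets us slice the trees back out:

```python
import numpy as np

# One concatenated array per node attribute, plus offsets for slicing.
trees = [est.tree_ for est in forest.estimators_]
children_left = np.concatenate([t.children_left for t in trees])
offsets = np.cumsum([0] + [t.node_count for t in trees])

# Recover tree i's slice when deserializing:
i = 3
tree_i_children_left = children_left[offsets[i]:offsets[i + 1]]
```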
