
Jonas' optimization ideas #7

Open · jonashaag opened this issue Feb 20, 2023 · 7 comments

@jonashaag (Contributor)

I’ll use this to brain dump a few ideas. Maybe some of them are useful.

@jonashaag (Contributor, Author) commented Feb 20, 2023

STATUS: This is done for lightgbm (#15); for sklearn we're not doing it (#19).

We could try Parquet for storing the arrays. It has great support for sparse arrays (lots of NaNs, maybe even lots of arbitrary identical values).

I think Parquet is also pretty smart about using the smallest possible integer type on disk.

Also, in Parquet repeated values are essentially free because of Run Length Encoding.

We should be able to embed Parquet data into the pickle file.
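
A minimal sketch of how the embedding could work, assuming pyarrow as the Parquet library (the helper names here are hypothetical, not existing project code):

```python
import io
import pickle

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

def array_to_parquet_bytes(arr: np.ndarray) -> bytes:
    """Serialize a 1-D array into an in-memory Parquet buffer."""
    buf = io.BytesIO()
    # Parquet picks compact physical types and applies dictionary/RLE
    # encoding by default, which is where the savings come from.
    pq.write_table(pa.table({"values": arr}), buf)
    return buf.getvalue()

def parquet_bytes_to_array(data: bytes) -> np.ndarray:
    """Restore the array from the embedded Parquet bytes."""
    return pq.read_table(io.BytesIO(data)).column("values").to_numpy()

# The Parquet payload is plain bytes, so it embeds in a pickle like any object.
payload = array_to_parquet_bytes(np.arange(1_000_000) % 7)
blob = pickle.dumps({"children_left": payload})
restored = parquet_bytes_to_array(pickle.loads(blob)["children_left"])
```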

@jonashaag (Contributor, Author) commented Feb 20, 2023

STATUS: We don't need this for lightgbm since it uses Parquet, and the sklearn code currently has no boolean arrays.

We can use NumPy's pack functionality to represent boolean arrays as bitmaps (Parquet does this by default).
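
For reference, a minimal sketch using NumPy's standard packbits/unpackbits API:

```python
import numpy as np

mask = np.array([True, False, True, True, False, False, True, False])

packed = np.packbits(mask)                        # 8 bools (8 bytes) -> 1 byte
restored = np.unpackbits(packed, count=mask.size).astype(bool)

assert np.array_equal(mask, restored)
```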

@YYYasin19 (Contributor)

> We could try Parquet for storing the arrays.

You mean storing the whole model (for example, all 300 trees) in one large Parquet table, right?
That seems to come in at around 10 MB without compression, and 6.7 MB with compression='gzip', for an lgbm model file that is initially 20 MB. This is without any further optimization, not even string parsing etc.

@jonashaag (Contributor, Author) commented Feb 26, 2023

In a real-world model I just benchmarked, ALL the children_left arrays look like this:

[1, 2, 3, ..., 42, -1, -1, ..., N]

i.e. it is equivalent to range(1, N+1) with some -1 entries for the leaves.

If we replace the -1 entries with a more efficient representation, we can save ~10% of the final size.

Examples of more efficient representations:

  • List of -1 positions
  • Bitmap/boolean array of -1 positions
  • Encoding -1 as the previous value, e.g. [1, 2, 3, ..., 42, 42, 42, ..., N]; this should help with compression because it doesn't destroy the pattern as much (see the sketch below)
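
A minimal sketch of the third variant. It assumes child indices are unique within the array (true for tree child pointers), so adjacent equal values can only come from the encoding; the helper names are hypothetical:

```python
import numpy as np

def encode_prev(children_left: np.ndarray) -> np.ndarray:
    """Replace each -1 with the preceding value, preserving the
    monotone [1, 2, 3, ...] pattern for the compressor."""
    out = children_left.copy()
    for i in range(1, len(out)):
        if out[i] == -1:
            out[i] = out[i - 1]
    return out

def decode_prev(encoded: np.ndarray) -> np.ndarray:
    """Invert encode_prev: a value equal to its predecessor was a -1.
    Walk right to left so each comparison sees still-encoded values."""
    out = encoded.copy()
    for i in range(len(out) - 1, 0, -1):
        if out[i] == out[i - 1]:
            out[i] = -1
    return out

a = np.array([1, 2, 3, 42, -1, -1, 7])
assert np.array_equal(decode_prev(encode_prev(a)), a)
```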

@jonashaag (Contributor, Author) commented Mar 3, 2023

STATUS: Parquet seems to handle this just fine, not sure about lzma

We found a lot of values like 1e-35 in the lgbm data. Are they effectively NaN? If so, we could replace them with NaN and profit from Parquet's bitmap-based representation of missing values.
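
If they are, a minimal sketch of the replacement using pyarrow's mask argument, so the affected entries end up in the validity bitmap rather than the data pages (the 1e-30 threshold is a guess, assuming the ~1e-35 values really are placeholders):

```python
import numpy as np
import pyarrow as pa

values = np.array([0.5, 1e-35, 2.0, 1e-35, 0.25])
placeholder = np.abs(values) < 1e-30    # threshold is an assumption
# Masked entries become nulls, stored as single bits in the validity
# bitmap instead of full 8-byte floats in the data pages.
arr = pa.array(values, mask=placeholder)
```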

@jonashaag (Contributor, Author) commented Mar 13, 2023

Combine sklearn trees into a single array to profit from potentially better Parquet compression. E.g. if your random forest has 100 trees, concatenate the 100 per-tree arrays, like we do with lightgbm.

It could be that this doesn't give much of a reduction if the trees are large enough and the forests small enough, though. We can easily check manually.
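
A minimal sketch of the concatenation for a fitted sklearn forest (`forest` is a hypothetical fitted RandomForestClassifier/Regressor); storing per-tree offsets lets us slice the trees back out:

```python
import numpy as np

# One concatenated array per node attribute, plus offsets for slicing.
trees = [est.tree_ for est in forest.estimators_]
children_left = np.concatenate([t.children_left for t in trees])
offsets = np.cumsum([0] + [t.node_count for t in trees])

# Recover tree i's slice when deserializing:
i = 3
tree_i_children_left = children_left[offsets[i]:offsets[i + 1]]
```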
