Jonas' optimization ideas #7
STATUS: This is done for lightgbm (#15), and for sklearn we're not doing it (#19). We could try Parquet for storing the arrays. It has great support for sparse arrays (lots of NaNs, maybe even lots of arbitrary identical values). I think Parquet is also pretty smart about using the smallest possible integer type on disk. Also, in Parquet repeated values are essentially free because of Run Length Encoding. We should be able to embed Parquet data into the pickle file.
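A minimal sketch of what embedding Parquet data in the pickle could look like, assuming pyarrow as the Parquet library; the helper names, array names, and the wrapper dict are illustrative, not taken from the repository:

```python
import pickle
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

def arrays_to_parquet_bytes(arrays: dict) -> bytes:
    """Encode a dict of 1-D NumPy arrays as an in-memory Parquet file."""
    table = pa.table({name: pa.array(values) for name, values in arrays.items()})
    sink = pa.BufferOutputStream()
    pq.write_table(table, sink)  # dictionary/RLE encoding makes repeated values cheap
    return sink.getvalue().to_pybytes()

def parquet_bytes_to_arrays(data: bytes) -> dict:
    """Decode embedded Parquet bytes back into NumPy arrays."""
    table = pq.read_table(pa.BufferReader(data))
    return {name: table[name].to_numpy() for name in table.column_names}

# Hypothetical arrays standing in for tree data: lots of NaNs and repeats.
arrays = {
    "threshold": np.array([0.5, 0.5, 0.5, np.nan]),
    "feature": np.array([3, 3, 3, -2]),
}
payload = pickle.dumps({"format": "parquet", "data": arrays_to_parquet_bytes(arrays)})
restored = parquet_bytes_to_arrays(pickle.loads(payload)["data"])
```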
STATUS: We don't need this for lightgbm since it uses Parquet, and the sklearn code currently has no boolean arrays. We can use NumPy's …
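The reference to NumPy above is cut off in the thread; one plausible reading (an assumption, not confirmed by the issue) is bit-packing boolean arrays with np.packbits, which stores eight booleans per byte:

```python
import numpy as np

# Hypothetical boolean array; not from the repository.
flags = np.random.rand(1_000_000) > 0.5

packed = np.packbits(flags)                                  # ~1,000,000 bytes -> 125,000 bytes
restored = np.unpackbits(packed, count=flags.size).astype(bool)
assert np.array_equal(flags, restored)
```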
You mean storing the whole model (for example, all 300 trees) in one large Parquet table, right?
In a real-world model I just benchmarked we have ALL …

i.e. it is equivalent to … If we replace the …

Examples of more efficient representations: …
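The inline code in the comment above was lost in extraction, so this is purely illustrative of the kind of "more efficient representation" it hints at: an array whose values are all identical can be stored as (value, run length) pairs rather than element by element, which is essentially what Parquet's run-length encoding does. The array contents below are placeholders:

```python
import numpy as np

def run_length_encode(arr: np.ndarray):
    """Return (values, run_lengths) for a 1-D array without NaNs."""
    change_points = np.flatnonzero(arr[1:] != arr[:-1]) + 1
    starts = np.concatenate(([0], change_points))
    lengths = np.diff(np.concatenate((starts, [arr.size])))
    return arr[starts], lengths

dense = np.full(1_000_000, 20.0)             # every value identical
values, lengths = run_length_encode(dense)   # -> ([20.0], [1000000])
```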
STATUS: Parquet seems to handle this just fine; not sure about lzma. We found in the lgbm data a lot of values like …
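For the lzma side of the comparison, a quick sketch of measuring plain pickle against lzma-compressed pickle; the array contents are placeholders for the repeated lgbm values mentioned above:

```python
import lzma
import pickle
import numpy as np

arr = np.full(100_000, 0.1000000000000001)   # placeholder for near-duplicate lgbm values
raw = pickle.dumps(arr)
compressed = lzma.compress(raw)
print(len(raw), len(compressed))             # lzma collapses the repeated 8-byte pattern
```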
Combine sklearn trees into a single array to profit from potentially better Parquet compression. E.g. if your random forest has 100 trees, concatenate the 100 per-tree arrays, like we do with lightgbm. It could be that this doesn't give much reduction if the trees are large enough and the forests small enough, though. We can easily check manually.
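A rough sketch of what concatenating the per-tree arrays might look like for a fitted scikit-learn forest; the attribute names come from sklearn's tree_ interface, while the forest and data are placeholders:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Concatenate the same attribute from every tree into one long array,
# so a single Parquet column covers the whole forest.
thresholds = np.concatenate([est.tree_.threshold for est in forest.estimators_])
features = np.concatenate([est.tree_.feature for est in forest.estimators_])

# Per-tree node counts are kept so the arrays can be split back on load.
node_counts = np.array([est.tree_.node_count for est in forest.estimators_])
per_tree_thresholds = np.split(thresholds, np.cumsum(node_counts)[:-1])
```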
I’ll use this to brain dump a few ideas. Maybe some of them are useful.