Added EnsembleLda for stable LDA topics #2980
…into EnsembleLda
…tion to simply pickling the whole thing.
…into EnsembleLda
… differences across architectures
It didn't work out on Friday and probably won't today either, but we will get back to it shortly and finish what is still open.
Tests should pass now. We think dataclasses and typing would make for a much cleaner codebase, even though the review did not ask for them directly, but doing it correctly would require somewhat large changes. We therefore decided to leave those changes for a future PR. We are looking forward to contributing this to version 4.1 and await your feedback.
@aloosley: added two simple dataclasses, Topic and Cluster, now. It is already a lot better, and it wasn't much work after all. I think this is all right. Tests are failing because whatever https://dl.bintray.com/sbt/debian is, it is not accessible. Maybe this is a temporary issue.
Note to self: need to work around bintray being sunset.
Decoupled multiprocessing code from EnsembleLda class.
This reduces the length of the class by several hundred lines, making it slightly easier to understand.
Added _generate_topic_models_worker function to clarify distinction between single-process and multi-process code.
Fixed flake8 problem (l is an ambiguous variable name).
Adjusted _teardown function (removed i parameter, it's only for logs).
Moved _MAX_RANDOM_STATE to module level.
@aloosley @sezanzeb Finally got a chance to sit down and have a final look at this. Everything looks good, with the exception that the EnsembleLda class was still a bit too busy: on top of the actual model functionality, it's dealing with a ton of multiprocessing stuff. This makes it difficult to understand what's going on. I mentioned this earlier, but it may have been lost in the rest of the comments. Anyway, to help things move along, I made the changes myself. I hope you guys don't mind. All the EnsembleLda tests pass locally (they're still running in CI as I write this), so I don't think I broke anything, but just in case I did, can you please have a glance at the changes here: 71b33dd?
we don't want to hide the details of the problem
Taking a look at the commit, 90% is the excavation of logic from the …
Thank you so much @mpenkov and @piskvorky for your continuous interest in this! I hope the community will find good use for it :)
Many thanks @mpenkov and @piskvorky for believing in this stable topic modeling idea (in turn @sezanzeb's thesis project) and working with us to get it out there for the world to use more easily.
Reopened #2282 because I lost access to the repo there.
I'll merge the up-to-date develop branch now. I'm not sure whether the required documentation was provided; I'll check that (#2673).