Improve documentation and examples #101
Comments
I'm having an issue when I'm trying to replicate the Boston House Pricing Prediction notebook. I'm not sure if the package names are outdated or I made a mistake installing them but I get the following error when I'm importing the packages in the first block:
In fact, none of the I have no trouble with installing
Is it an issue with the documentation? If so, can you please at least briefly explain how I would get started with sklearn-genetic with my RF algorithm on a Boston House Pricing-like dataset? I'm really into the idea of GA for my master's thesis and I can't really go back at this point :) Thanks in advance |
Hi @emirtarik, I hope you are doing great
pip install sklearn-genetic-opt[all]
Let me know if this fixes the problem |
Thanks for the quick reply @rodrigo-arenas. Now it makes sense. Sorry about that misunderstanding. This did fix my problem however I'm still having the problems with Thanks again |
No problem @emirtarik. I just ran the whole notebook without issues. If you are using Jupyter notebooks directly, make sure you installed the package in the right environment and that you restarted the kernel. I'd also suggest using a virtual environment, so dependencies from your other projects don't get mixed up |
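For anyone hitting the same import error, the environment check suggested above could be done from inside the notebook roughly like this. It is only a minimal sketch: `check_install` is a helper name made up for illustration, and `sklearn_genetic` is assumed to be the import name that the `sklearn-genetic-opt` package installs.

```python
import importlib.util
import sys


def check_install(module_name):
    """Return True if the top-level module can be found by the running interpreter."""
    return importlib.util.find_spec(module_name) is not None


# The interpreter behind the current kernel; pip has to install into this one.
print("Interpreter:", sys.executable)
print("sklearn_genetic importable:", check_install("sklearn_genetic"))
```

If the second line prints False, the package most likely went into a different environment than the one the kernel runs in. Installing with the printed interpreter (`<that path> -m pip install sklearn-genetic-opt[all]`) and restarting the kernel is the usual way out.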
Yes, it was an import error, but I think it was related to the Python environment on my work computer, because after trying on my Jupyter server and local Python, I gave it a try on Colab and it worked! I'll use it there instead. Thanks a lot @rodrigo-arenas, maybe I can contribute to the docs once I understand more of how this works. On a side note, do you know how this would run on sparse matrices? I have a lot of categorical variables to use, which are all encoded. |
No problem, I'm glad you made it work. Sparse matrices are not really something related to the package itself, in the sense that it doesn't have an explicit algorithm that gives special treatment to this kind of dataset. The direct impact is that you might need to increase the number of individuals and generations to explore the whole space. On the other hand, you can just see it as a regular machine learning problem: you could, for example, use some preprocessing steps, like t-SNE or PCA, to reduce the number of dimensions in your dataset. You can also try not to one-hot encode all the variables, but use different techniques (depending on the nature of your data) that don't create a new column for every new value. I hope it helps |
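To illustrate the last suggestion, here is a minimal sketch of one encoding that keeps a single column per variable. Frequency encoding is just one option picked for illustration, not something the library prescribes, and the data and function names are made up.

```python
from collections import Counter


def one_hot_width(values):
    """Number of columns one-hot encoding would create for this variable."""
    return len(set(values))


def frequency_encode(values):
    """Replace each category by its relative frequency (one numeric column)."""
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]


# Made-up categorical column, for illustration only.
city = ["paris", "lyon", "paris", "nice", "paris", "lyon"]

print("one-hot columns:", one_hot_width(city))
print("frequency-encoded column:", frequency_encode(city))
```

The trade-off is that two categories with the same frequency become indistinguishable, so whether this is acceptable depends on the data.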
Hi again @rodrigo-arenas, Following your suggestions, I was able to work with a labeled dataset. I was hesitant to do this since my categoricals are not exactly ordinal, so I was afraid that this would complicate interpretation. For instance, you wouldn't do this in a linear approach, so as not to attach any meaning to the marginal increase in categories under a single variable. With sparse matrices, as in a one-hot encoded dataset, I was getting 'nan' returns, so I had to find another way (still not 100% sure though, so I will check with my advisor). I'm still having low fitness scores, but this is highly likely to be related to the limits of the dataset I'm currently working with. About dimension reduction, I was kinda hoping that the GA would provide an unconventional dimension reduction technique, as in I would be able to see which features are most important in choosing optimized new generations. Which brings me to today's question :) Do you think there would be a way to look at gene frequencies used in the GASearch process? I would want to compare it to the classic RF feature importances graph obtained by using MDI, or simply compare it with some coefficients obtained by my linear models. A good example would be this paper. I realize this is getting off topic for this issue entry and I apologize, but maybe it will help others looking through the docs and issues with similar problems. Also, this is currently the only way I am aware of to reach you :) |
Hi @emirtarik About the second part, of the gene frequencies, you can check exactly which hyperparameters the model tried using at each step in the case of
|
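As a rough idea of what a gene-frequency comparison could look like once the evaluated individuals are available, here is a minimal sketch. It assumes each individual is a binary mask over the features (as in a feature-selection setup), which this thread does not confirm. The feature names, the population, and `gene_frequencies` are all made up for illustration and are not part of the library's API.

```python
from collections import Counter


def gene_frequencies(masks, feature_names):
    """Fraction of individuals in which each feature (gene) is switched on."""
    counts = Counter()
    for mask in masks:
        for name, selected in zip(feature_names, mask):
            if selected:
                counts[name] += 1
    return {name: counts[name] / len(masks) for name in feature_names}


# Hypothetical feature names and individuals, made up for illustration.
features = ["rooms", "age", "tax", "crime"]
population = [
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
]

for name, freq in gene_frequencies(population, features).items():
    print(f"{name}: {freq:.2f}")
```

Frequencies like these could then be put next to MDI importances or linear coefficients, keeping in mind that they measure how often a feature survived selection, not the size of its effect.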
Hi, as requested in CONTRIBUTING.md, I'm letting you know that I'm working on this issue. I will likely not fix it completely, but I can probably make it better |
Hi @GuiTaek for sure, just let us know which sections you'll be working on, so other people don't overwrite it |
You're welcome. For now, as I haven't ever used this library (I came here from the good first issue tag), I would like to tackle the first larger page, https://sklearn-genetic-opt.readthedocs.io/en/stable/tutorials/basic_usage.html. Maybe more later, I don't know yet. Edit: Would you like more atomic pull requests, or would you prefer that I combine everything into one pull request? |
Hi @GuiTaek, yeah for sure, you can start on that one. In this case, it would be great to have one pull request per page, so we keep subjects separated. Thanks |
Hi @rodrigo-arenas, then I'll collect every suggestion I have for one page and make a pull request. May I also touch content? E.g. I think it is possible to improve the example, as the max-score doesn't increase much. It would be better advertising if it increased gradually from low to high. I already have an example, though I'm not satisfied with it, as it throws warnings. |
Hi @GuiTaek, yes, the examples can be improved. Just take into account that the ones shown in the tutorials section are usually pretty simple, so users can get started right away. What I mean is, for example, if the tutorial is about adapters, I'd showcase how to set up that parameter, and not add callbacks or loggers that might mix two subjects. On the other hand, if you look at the Jupyter notebook examples, I think there is a big opportunity to make them better. In those notebooks, there is no problem with modifying them and adding complex features in a single notebook, since those are meant to showcase all the library capabilities in one place. I hope it makes sense |
Hi @rodrigo-arenas, hope you're doing well. I'm trying to use the The first is that I'm trying to use labeled data instead of encoded and this directly results in the error below.
The second is when I try to do this with encoded data, I am well able to do it, however it asks me to turn my sparse matrix into an array using Thanks a lot |
Hi @emirtarik, as you mentioned, you can't pass a labeled dataset. This is not specifically because of this package, but because scikit-learn won't work with such a structure, so there is really nothing I can do from this library. This must be solved in a pipeline with some encoding as a preprocessing step. The Please, for future questions, make sure to create a new bug report or question if it's not related to this issue (documentation improvement), so we don't mix different subjects in this thread. Greetings |
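For readers landing here with the same problem, the "encoding as a preprocessing step" idea could look roughly like this in plain scikit-learn. This is a minimal sketch with made-up data. The step names and hyperparameters are arbitrary choices, and nothing in it is specific to sklearn-genetic-opt.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up rows of string labels: [neighbourhood, heating type].
X = [
    ["north", "gas"],
    ["south", "electric"],
    ["north", "electric"],
    ["east", "gas"],
    ["south", "gas"],
    ["east", "electric"],
]
y = [210.0, 180.0, 195.0, 230.0, 185.0, 220.0]

# The encoder turns the labels into a sparse one-hot matrix inside the
# pipeline, so the raw labels never reach the forest directly.
model = Pipeline(
    steps=[
        ("encode", OneHotEncoder(handle_unknown="ignore")),
        ("forest", RandomForestRegressor(n_estimators=10, random_state=0)),
    ]
)

model.fit(X, y)
predictions = model.predict(X)
print(predictions.shape)
```

Because the pipeline is itself a regular scikit-learn estimator, it can be used wherever a plain estimator is expected, and the random forest consumes the encoder's sparse output without a manual conversion to a dense array.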
OK, I'll keep in mind that it should be easy. |
Unfortunately, I cannot make a draft pull request and request a review. I'd like to have a review, as I have quite a bundle of changes and I am particularly unsure about the whole example thing: Is it OK that I intentionally use a "wrong" range to show the powers of this library? Is it clear that a user has to change it according to what I have written? See also the draft pull request as well as the commits. I'm not finished though, as there is more on this page I haven't touched. |
I made a full pull request, as I feared that you couldn't see the draft pull request. |
Hi @GuiTaek, thanks for notifying me. I just saw the PR, I'll be reviewing it this weekend |
Hi @rodrigo-arenas, I had a lot of university work, but it looks like you merged it. I didn't expect that, to be honest! Thank you very much! Sorry for the late response |
I'm opening this issue for newcomers who would like to contribute to an open-source project.
The idea is to improve the current docs and add more examples using the library. You can see the current docs files here
You could also add external articles to the package showcasing some applications, see these for example
Here are the stable docs