Improve documentation and examples #101

rodrigo-arenas · 2022-06-16T13:47:27Z

I'm opening this issue for newcomers who would like to contribute to an open-source project.

The idea is to improve the current docs and add more examples that use the library. You can see the current docs files here.

You could also add external articles showcasing applications of the package; see these for example.

Here are the stable docs.

emirtarik · 2022-07-19T13:16:52Z

Hi @rodrigo-arenas

I'm having an issue when I'm trying to replicate the Boston House Pricing Prediction notebook.

I'm not sure whether the package names are outdated or I made a mistake installing them, but I get the following error when importing the packages in the first block:

from sklearn_genetic import GASearchCV

ModuleNotFoundError: No module named 'sklearn_genetic'

In fact, none of the sklearn_genetic imports seem to work. I've checked this issue from the sklearn-genetic repository but it's not exactly the same problem.

I have no trouble with installing sklearn-genetic and I have:

Python==3.7.3
sklearn==0.23.1
deap==1.3

Is it an issue with the documentation? If so, can you please at least briefly explain how I would get started with sklearn-genetic using my RF algorithm on a dataset like Boston House Pricing? I'm really into the idea of GA for my master's thesis and I can't really go back at this point :)

Thanks in advance

rodrigo-arenas · 2022-07-19T13:22:51Z

Hi @emirtarik, I hope you are doing great.
I think there might be a misunderstanding: the sklearn-genetic package has nothing to do with sklearn-genetic-opt (this package); they just happen to share part of the name. Make sure you are installing the right one using

pip install sklearn-genetic-opt[all]

Let me know if this fixes the problem

emirtarik · 2022-07-19T14:06:14Z

Thanks for the quick reply @rodrigo-arenas.

Now it makes sense. Sorry about the misunderstanding. This did fix my problem; however, I'm still having problems with .space, .plots and .callbacks. Do you know why these might be missing?

Thanks again

rodrigo-arenas · 2022-07-19T14:38:26Z

No problem @emirtarik
Can you share what error you are getting? Is it an import error?

I just ran the whole notebook without issues. If you are using Jupyter notebooks directly, make sure you installed the package in the right environment and that you restarted the kernel.

I'd also suggest using a virtual environment, so that dependencies from your other projects don't get mixed up.
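A quick way to check both points from inside the notebook is to print which interpreter the kernel is running and whether the module can be found from it. This is only a minimal sketch using the standard library; `environment_report` is a hypothetical helper name and is not part of sklearn-genetic-opt.

```python
import importlib.util
import sys


def environment_report(module_name):
    """Return the running interpreter and whether a top-level module is importable."""
    # find_spec returns None when a top-level module cannot be found.
    spec = importlib.util.find_spec(module_name)
    return {
        "interpreter": sys.executable,
        "module": module_name,
        "importable": spec is not None,
        "location": getattr(spec, "origin", None),
    }


if __name__ == "__main__":
    # sklearn-genetic-opt is imported as `sklearn_genetic` (see the import above in this thread).
    for key, value in environment_report("sklearn_genetic").items():
        print(f"{key}: {value}")
```

If `interpreter` is not the Python of the environment where you ran `pip install`, the kernel is using a different environment than the one holding the package.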

emirtarik · 2022-07-19T15:46:09Z

Yes, it was an import error, but I think it was related to the Python environment on my work computer: after trying on my Jupyter server and local Python, I gave it a try on Colab and it worked! I'll use it there instead.

Thanks a lot @rodrigo-arenas, maybe I can contribute to the docs once I understand more of how this works.

On a side note, do you know how this would run on sparse matrices? I have a lot of categorical variables to use which are all encoded.

rodrigo-arenas · 2022-07-19T15:55:12Z

No problem, I'm glad you made it work.
And for sure, new contributors are welcome!

For sparse matrices, it's not really something related to the package itself, in the sense that it doesn't have an explicit algorithm that gives special treatment to this kind of dataset. The direct impact is that you might need to increase the number of individuals and generations to explore the whole space.

On the other hand, you can just see it as a regular machine learning problem. You could, for example, use a preprocessing step like t-SNE or PCA to reduce the number of dimensions in your dataset. You can also try not to one-hot encode all the variables, but instead use different techniques (depending on the nature of your data) that don't create a new column for every new value. I hope it helps.
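To illustrate the last point, here is a small sketch of one encoding that keeps a categorical variable in a single column. Frequency encoding is my own choice for the illustration and is not named in this thread; the helper names are hypothetical. Note that categories with the same frequency become indistinguishable, so whether it fits depends on the data.

```python
from collections import Counter


def frequency_encode(values):
    """Replace each category with its relative frequency in `values`."""
    counts = Counter(values)
    total = len(values)
    return [counts[value] / total for value in values]


def one_hot_width(values):
    """Number of columns a plain one-hot encoding would create."""
    return len(set(values))


colors = ["red", "blue", "red", "green", "red", "blue"]
encoded = frequency_encode(colors)

print(one_hot_width(colors))  # columns a one-hot encoding would need
print(encoded)                # one numeric column instead
```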

emirtarik · 2022-07-21T15:04:12Z

Hi again @rodrigo-arenas,

Following your suggestions, I was able to work with a labeled dataset. I was hesitant to do this since my categoricals are not exactly ordinal, so I was afraid it would complicate interpretation. For instance, you wouldn't do this in a linear approach, so as not to attach any meaning to the marginal increase between categories under a single variable. With sparse matrices, as in a one-hot encoded dataset, I was getting 'nan' returns, so I had to find another way (still not 100% sure though, so I will check with my advisor). I'm still getting low fitness scores, but this is highly likely to be related to the limits of the dataset I'm currently working with.

About dimension reduction, I was kinda hoping that the GA would provide an unconventional dimension reduction technique, as in I would be able to see which features are most important in choosing optimized new generations. Which brings me to today's question :) Do you think there would be a way to look at gene frequencies used in the GASearch process? I would want to compare it to the classic RF feature importances graph obtained by using MDI, or simply compare it with some coefficients obtained by my linear models. A good example would be this paper.

I realize this is getting off topic for this issue and I apologize, but maybe it will help others looking through the docs and issues with similar problems. Also, this is currently the only way I am aware of to reach you :)

rodrigo-arenas · 2022-07-21T19:07:25Z

Hi @emirtarik
I understand. As you mentioned, the encoding strategy might have those impacts.

About the second part, the gene frequencies: you can check exactly which hyperparameters the model tried at each step in the case of GASearchCV, or which features it selected in GAFeatureSelectionCV. You have different options:

You can explore the logbook object which contains all this information.
You can also check the cv_results_ object, for example, check this notebook
You can plot the sampled space of hyperparameters, using this function
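Once the per-candidate feature selections have been pulled out of those objects, the "gene frequency" asked about above is just a count per feature. The sketch below assumes the selections are already available as a list of boolean masks, one entry per feature; how to extract them from the logbook or cv_results_ is not shown here, and `selection_frequency` is a hypothetical helper, not a library function.

```python
from collections import Counter


def selection_frequency(masks, feature_names):
    """Fraction of evaluated candidates in which each feature was selected."""
    if not masks:
        return {name: 0.0 for name in feature_names}
    counts = Counter()
    for mask in masks:
        if len(mask) != len(feature_names):
            raise ValueError("each mask needs one entry per feature")
        # Count every feature whose gene is switched on in this candidate.
        counts.update(name for name, keep in zip(feature_names, mask) if keep)
    return {name: counts[name] / len(masks) for name in feature_names}


# Hypothetical data; real masks would come from the search history.
names = ["age", "rooms", "crime"]
masks = [
    [True, True, False],
    [True, False, False],
    [True, True, True],
    [False, True, False],
]
print(selection_frequency(masks, names))
```

The resulting fractions can then be plotted next to the random forest feature importances for a side-by-side comparison.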

GuiTaek · 2022-07-25T18:43:44Z

Hi, as requested in CONTRIBUTING.md, I'm letting you know that I'm working on this issue. I probably won't resolve it completely, but I can likely improve things.

rodrigo-arenas · 2022-07-25T19:02:24Z

Hi @GuiTaek, for sure. Just let us know which sections you'll be working on, so other people don't overwrite them.
Thanks!

GuiTaek · 2022-07-30T11:16:43Z

You're welcome. For now, as I have never used this library (I came from the good first issue tag), I would like to tackle the first larger page, https://sklearn-genetic-opt.readthedocs.io/en/stable/tutorials/basic_usage.html. Maybe more later, I don't know yet.

Edit: Would you like more atomic pull requests, or would you prefer that I combine everything into one pull request?

rodrigo-arenas · 2022-07-30T18:44:39Z

Hi @GuiTaek, yeah, for sure, you can start on that one. In this case, one pull request per page would be great, so we keep subjects separated.

Thanks

GuiTaek · 2022-07-31T14:26:44Z

Hi @rodrigo-arenas, then I'll collect every suggestion I have for one page and make a pull request. May I also touch the content? For example, I think the example can be improved, as the max score doesn't increase much. It would be better advertising if it increased gradually from low to high. I already have an example, though I'm not satisfied with it because it throws warnings.

rodrigo-arenas · 2022-07-31T16:19:49Z

Hi @GuiTaek, yes, the examples can be improved. Just take into account that the ones shown in the tutorials section are usually pretty simple, so users can get started right away. What I mean is that if, for example, the tutorial is about adapters, I'd showcase how to set up that parameter and not add callbacks or loggers, which might mix two subjects.

On the other hand, if you look at the Jupyter notebook examples, I think there is a big opportunity to make them better. In those notebooks there is no problem with modifying them and adding complex features in a single notebook, since they are meant to showcase all the library's capabilities in one place.

I hope it makes sense

emirtarik · 2022-08-08T10:16:34Z

Hi @rodrigo-arenas, hope you're doing well.

I'm trying to use the GAFeatureSelectionCV as you suggested to understand the importance of attributes in my dataset, however I have a two-sided problem in this regard.

The first is that I'm trying to use labeled data instead of encoded and this directly results in the error below.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-34-81501a102f7c> in <module>()
      1 # Train and select the features
----> 2 evolved_selection.fit(x_train_labeled_ga, y_train_ga)

4 frames
/usr/local/lib/python3.7/dist-packages/deap/tools/support.py in record(self, **infos)
    336         """
    337         apply_to_all = {k: v for k, v in infos.items() if not isinstance(v, dict)}
--> 338         for key, value in infos.items():
    339             if isinstance(value, dict):
    340                 chapter_infos = value.copy()

RuntimeError: dictionary changed size during iteration

The second is that when I try this with encoded data, it works; however, it asks me to turn my sparse matrix into an array using the matrix's .toarray() method. After doing so, I am unsure how to interpret the resulting .best_features_ array. Can you please give a brief explanation of how I can pair this with my set of variables? I'm imagining something along the lines of OneHotEncoder.get_feature_names().

Thanks a lot

rodrigo-arenas · 2022-08-08T16:46:19Z

Hi @emirtarik, as you mentioned, you can't pass a labeled dataset. This is not specifically because of this package, but because scikit-learn won't work with such a structure, so there is really nothing I can do from this library. It must be solved in a pipeline with some encoding as a preprocessing step.

The best_features_ attribute returns one value per input column, so you must know what each column of your dataset means in order to interpret it. If the only transformation you are doing is a one-hot encoding, then you can use, for example, the get_feature_names_out() method of the encoder to get the name of each column. Each value in best_features_ is then True if that column was selected and False otherwise.

For future questions that are not related to this issue (documentation improvement), please create a new bug report or question, so we don't mix different subjects in this thread.
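As a sketch of that pairing, the snippet below filters a list of column names with a boolean mask. The column names and the mask are made-up stand-ins for what get_feature_names_out() and best_features_ would hold, and `selected_feature_names` is a hypothetical helper, not part of either library.

```python
from itertools import compress


def selected_feature_names(column_names, support_mask):
    """Return the names whose mask entry is truthy, preserving column order."""
    column_names = list(column_names)
    support_mask = list(support_mask)
    if len(column_names) != len(support_mask):
        raise ValueError("mask and column names must have the same length")
    return list(compress(column_names, support_mask))


# Hypothetical values standing in for encoder.get_feature_names_out()
# and for the fitted selector's best_features_.
column_names = ["color_blue", "color_green", "color_red", "size_large", "size_small"]
best_features = [True, False, True, False, True]

print(selected_feature_names(column_names, best_features))
```

The length check matters in practice: if the mask and the names differ in length, the data passed to fit went through a transformation that the names do not reflect.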

Greetings

GuiTaek · 2022-08-13T09:24:51Z

> Hi @GuiTaek, yes, the examples can be improved. Just take into account that the ones shown in the tutorials section are usually pretty simple, so users can get started right away. What I mean is that if, for example, the tutorial is about adapters, I'd showcase how to set up that parameter and not add callbacks or loggers, which might mix two subjects.
>
> On the other hand, if you look at the Jupyter notebook examples, I think there is a big opportunity to make them better. In those notebooks there is no problem with modifying them and adding complex features in a single notebook, since they are meant to showcase all the library's capabilities in one place.
>
> I hope it makes sense

OK, I'll keep in mind that the tutorial examples should stay simple.

GuiTaek · 2022-08-28T17:29:40Z

Unfortunately, I cannot make a draft pull request and request a review. I'd like a review, as I have quite a bundle of changes and I am particularly unsure about the example: is it OK that I intentionally use a "wrong" range to show the power of this library? Is it clear that a user has to change it according to what I have written? See also the draft pull request as well as the commits. I'm not finished though, as there is more on this page that I haven't touched.

GuiTaek · 2022-09-10T12:36:30Z

I opened a full pull request, as I feared that you couldn't see the draft pull request.

rodrigo-arenas · 2022-09-10T13:55:52Z

Hi @GuiTaek, thanks for notifying me. I just saw the PR and I'll be reviewing it this weekend.

GuiTaek · 2022-12-27T17:13:52Z

Hi @rodrigo-arenas, I had a lot of university work, but it looks like you merged it. I didn't expect that, to be honest! Thank you very much, and sorry for the late response.

rodrigo-arenas added documentation Improvements or additions to documentation help wanted Extra attention is needed good first issue Good for newcomers up-for-grabs labels Jun 16, 2022

rodrigo-arenas mentioned this issue Nov 8, 2022

Contributing to this project #113

Closed

Chailex mentioned this issue Apr 9, 2023

Changes made to documentation #132

Merged
