Exploration of the Vehicles dataset based on #2 #36
Conversation
Thanks for this super well explored and extensive PR! It's really nice to see the depth you went into in this exploration, and you've done a great job documenting both the narrative of the analysis and your insights and conclusions.
It was a great idea to try out the voting classifier. Interestingly, however, it currently doesn't really improve on the best performing individual models. This is probably because it includes many of the worse performing ones as well, and also because of the high correlations between the models (i.e. combining them probably doesn't introduce much diversity). Given this result, I would leave that for later. Alternatively, you could try using only the top 3 models, say, along the lines sketched below.
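For what that could look like, a minimal sketch (the three models named here are illustrative placeholders, not necessarily your actual top 3):

from sklearn.ensemble import VotingClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Illustrative top-3 pick; substitute whichever models actually scored best
top3 = [
    ("qda", QuadraticDiscriminantAnalysis()),
    ("rf", RandomForestClassifier()),
    ("gb", GradientBoostingClassifier()),
]
# Soft voting averages predict_proba, so all members must support it
vote_top3 = VotingClassifier(estimators=top3, voting="soft")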
I'm a little surprised that the performance scores for the tuned model are so close to those of the untuned models. To me that would warrant a little further investigation to make sure everything is working properly. However, given the breadth of the model space, if you want to investigate further I recommend focusing on the top few performing models only.
I'm requesting changes for a few specific points below, after which this will be ready to merge. If you want to continue pursuing model exploration, I recommend looking into the misclassified points. Are the top models all misclassifying the same points? Is that why they show high correlations? If you want to look into this, you can split off that work into a separate PR and tie it in with #7 or #9.
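If it helps, here is a rough sketch of that check, assuming (as hypothetical names) a dict MLA_predict of per-model validation-set predictions and the validation labels y_val:

import numpy as np

# miss[name] is True wherever that model got a validation point wrong
miss = {name: preds != np.asarray(y_val) for name, preds in MLA_predict.items()}

# Fraction of validation points misclassified by every model at once;
# a high value would go a long way to explaining the high correlations
shared = np.logical_and.reduce(list(miss.values())).mean()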
- Please make sure to use the correct relative path to the dataset (I think it needs an additional ..).
- Please apply black formatting to the two modules (mainly to wrap the long lines).
- The pair plot is difficult to read because of its size. I would see if there's a way to make the axis labels more readable.
- Please include some comments explaining how you chose your train-test split proportions, and also why you decided to use ShuffleSplit for CV instead of, say, K-fold CV. This is of particular interest as this project is about exploring these choices.
- When you run cross-validation in run_models, you should be using the training data only, rather than the full dataset. That way, the test data is held out till the end for an independent performance evaluation. The "test" set split off by ShuffleSplit in CV functions as a validation set (effectively a third split). Similarly, all hyperparameter tuning should be done on the training set only.
- Similarly, when you normalize the data, you should compute the mean and SD using the training set only. With CV, you would further compute them using the training split of each fold only. This can be accomplished easily using a Pipeline and StandardScaler (see the sketch right after this list).
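On the last two points, a minimal sketch of the intended setup (LogisticRegression and the split sizes are placeholders; X and y stand in for your features and labels):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score, train_test_split

# Hold out the test set once, at the start
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The Pipeline refits the scaler on the training portion of every CV split,
# so the mean/SD never leak information from the held-out data
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

cv = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
scores = cross_val_score(pipe, X_train, y_train, cv=cv)  # training data only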
dev/Saumya/algorithm_helpers.py
Outdated
#set name and parameters
MLA_name = alg.__class__.__name__
MLA_compare.loc[row_index, 'MLA Name'] = MLA_name
MLA_compare.loc[row_index, 'MLA Parameters'] = str(alg.get_params())
Assigning individual cells of the DF like this works fine, but the code is a bit cumbersome. Personally I would do something like create a dict for each row, append each row to a list, and convert to a DF at the end.
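Something along these lines (a sketch; algorithms here is a stand-in for your list of models):

import pandas as pd

rows = []
for alg in algorithms:
    rows.append({
        "MLA Name": alg.__class__.__name__,
        "MLA Parameters": str(alg.get_params()),
    })
MLA_compare = pd.DataFrame(rows)  # one conversion at the end instead of cell-by-cell writes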
This was great advice! Since adjusting to it, I have been doing this in all my notebooks. It is super easy to append rows (just appending dicts to the list) while keeping the code maintainable :)
dev/Saumya/algorithm_helpers.py
Outdated
#save MLA predictions that could be used later
alg.fit(X, y)
MLA_predict[MLA_name] = alg.predict(X)
Note that when you do this you are getting training set predictions. What would be better is to split off a validation set at the start and compute predictions on that - that would give better estimates of the correlations.
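Roughly like this (a sketch reusing the names from the snippet above; X and y are your full features and labels):

from sklearn.model_selection import train_test_split

# Split off a validation set once, before the model loop
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

alg.fit(X_train, y_train)
# Validation-set predictions give more honest correlation estimates
MLA_predict[MLA_name] = alg.predict(X_val)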
Yes, this was a mistake on my part. It actually caused my correlation heatmap to give faulty correlation values from the predictions, and I could not understand why that was happening. Thank you, it is fixed now.
dev/Saumya/algorithm_helpers.py
Outdated
#Removed models without attribute 'predict_proba' (required for vote classifier) and models with a 1.0 correlation to another model
#Using default hyper-parameters
vote_est = [
    #Ensemble Methods
Maybe turn the get_algorithms output into a dict, and just list the keys here, so as to avoid reinitializing the models. Not really a performance issue, more code consistency and maintenance.
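For instance (a sketch; the specific models listed are just examples):

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

def get_algorithms():
    """Return the candidate models keyed by name, initialized once."""
    return {
        "RandomForest": RandomForestClassifier(),
        "GradientBoosting": GradientBoostingClassifier(),
        "QDA": QuadraticDiscriminantAnalysis(),
    }

algorithms = get_algorithms()
# Reuse the same instances rather than re-initializing them for the ensemble
vote_est = [(name, algorithms[name]) for name in ("RandomForest", "QDA")]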
That's right, the code looks a lot better now. Thank you :)
dev/Saumya/algorithm_helpers.py
Outdated
start_total = time.perf_counter()
for clf, param in zip(vote_est, grid_param):
I would make grid_param into a dict using the same keys as vote_est to make it explicit, rather than relying on zip. This will make code updates easier.
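For example (a sketch; the grids are illustrative and X_train/y_train are assumed to come from the initial hold-out split):

from sklearn.model_selection import GridSearchCV

# Keyed by the same names as vote_est, so a mismatch raises a KeyError
# instead of zip silently mispairing models and grids
grid_param = {
    "RandomForest": {"n_estimators": [100, 300]},
    "QDA": {"reg_param": [0.0, 0.1]},
}

for name, clf in vote_est:
    search = GridSearchCV(clf, grid_param[name], cv=5)
    search.fit(X_train, y_train)  # tune on the training set only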
Yes, I later realized that even if the lists were inconsistent, zip would do its work silently; the results would be wrong and I would not even know where the error lay. Thank you for pointing this out!
@dzeber thank you so much for such a detailed review. I have been pondering it for the last couple of days and have tried addressing most of the issues you pointed out, along with working on other aspects. I thought at first that I'd push individual changes and reply to the individual reviews, but it seems like a better idea to get everything implemented by Tuesday and make all the comments after that, so you can take a single look once you get a chance :)
@SaumyaSinghal please let us know if you are still actively working on this PR.
@mlopatka I am really sorry for the delay and inactivity. Please remove the inactive flag. We were suddenly told to evacuate our college hostels in the wake of COVID-19, and that caused a lot of hassle, shifting, and procuring documents throughout last week.
@SaumyaSinghal No worries at all! I hope you are keeping healthy and safe.
@mlopatka thank you so much. This cheered me up :D
@SaumyaSinghal please resolve the merge conflicts.
Here I have made an analysis of the vehicles.csv dataset. My exploration can be summarised as follows:
The maximum accuracy I got was ~84% using Quadratic Discriminant Analysis.
Although this is aimed at #2, I think this touches some of the other issues too. I’ll open separate PRs for them soon.
I will keep improving and documenting but I’d be very happy to get a review of my approach :)
I have also tried to modularise and re-use as much code as possible, but I'd highly appreciate any suggestions on that front (and also any Python tricks to shorten my code; I'm sure I have missed a few :D).