
Exploration of the Vehicles dataset based on #2 #36

Closed

Conversation

Contributor

@SaumyaSinghal commented Mar 9, 2020

Here I have made an analysis of the vehicles.csv dataset. My exploration can be summarised as follows:

  • Load the dataset
  • Basic exploration and visualisation of the data
  • Train and test a range of classifiers (trees, SVMs, Gaussian models, discriminant analysis, etc.) using cross-validation
  • Tables and plots of the accuracies and predictions
  • A voting classifier combining all of the classifiers used individually so far
  • Hyper-parameter tuning using grid search
  • Train and test the individual models as well as the voting model with tuned hyper-parameters
  • Basic comparison between the results

The maximum accuracy I got was ~84% using Quadratic Discriminant Analysis.

Although this is aimed at #2, I think this touches some of the other issues too. I’ll open separate PRs for them soon.

I will keep improving and documenting, but I’d be very happy to get a review of my approach :)
I have also tried to modularise and re-use as much code as possible, but I’d highly appreciate any suggestions on that (and also any Python tricks to shorten my code; I’m sure I have missed a few :D).

Contributor

@dzeber left a comment

Thanks for this super well explored and extensive PR! It's really nice to see how much depth you went into here, and you've done a great job documenting both the narrative of the analysis and your insights and conclusions.

It was a great idea to try out the voting classifier. Interestingly, however, it currently doesn't really improve on the best performing individual models. This is probably because it includes many of the worse-performing ones as well, and also because of the high correlations between models (i.e. combining them probably doesn't introduce much diversity). Given this result, I would leave that for later. Alternatively, you could try using only the top 3 models, say.

I'm a little surprised that the performance scores for the tuned model are so close to those of the untuned models. To me that would warrant a little further investigation to make sure everything is working properly. However, given the breadth of the model space, if you want to investigate further I recommend focusing only on the top few performing models.

I'm requesting changes for a few specific points below, after which this will be ready to merge. If you want to continue pursuing model exploration, I recommend looking into the misclassified points. Are the top models all misclassifying the same points? Is that why they show high correlations? If you want to look into this, you can split off that work into a separate PR and tie it in with #7 or #9.

  • Please make sure to use the correct relative path to the dataset (I think it needs an additional ..)
  • Please apply black formatting to the two modules (mainly to wrap the long lines).
  • The pair plot is difficult to read because of the size. I would see if there's a way to make the axis labels more readable.
  • Please include some comments explaining how you chose your train-test split proportions, and also why you decided to use the ShuffleSplit for CV instead of, say, K-fold CV. This is of particular interest as this project is about exploring these choices.
  • When you run cross-validation in run_models, you should be using the training data only, rather than the full dataset. That way, the test data is held out until the end for an independent performance evaluation. The "test" set that ShuffleSplit splits off during CV then serves as a validation set. Similarly, all hyper-parameter tuning should be done on the training set only.
  • Similarly, when you normalize the data, you should compute the mean and SD using the training set only; with CV, you would further compute them using only the training split of each fold. This can be accomplished easily using a Pipeline and StandardScaler (see the sketch after this list).
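
A minimal sketch of that setup, assuming features X and labels y are already loaded from vehicles.csv; the SVC placeholder model and the split sizes are illustrative rather than the PR's actual choices:

from sklearn.model_selection import train_test_split, cross_val_score, ShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hold out the test set once, up front; all CV and tuning use X_train only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# The scaler is re-fit on each fold's training split only, so the validation
# split never leaks into the mean/SD estimates.
pipe = Pipeline([("scale", StandardScaler()), ("model", SVC())])
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
scores = cross_val_score(pipe, X_train, y_train, cv=cv)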

#set name and parameters
MLA_name = alg.__class__.__name__
MLA_compare.loc[row_index, 'MLA Name'] = MLA_name
MLA_compare.loc[row_index, 'MLA Parameters'] = str(alg.get_params())
Contributor

Assigning individual cells of the DF like this works fine but the code is a bit cumbersome. Personally I would do something like create a dict for each row, append each row to a list, and convert to DF at the end.
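
For example, a minimal sketch of that pattern (algorithms here stands in for whatever iterable of estimators the notebook loops over):

import pandas as pd

rows = []
for alg in algorithms:
    # one dict per row instead of assigning individual cells
    rows.append(
        {
            "MLA Name": alg.__class__.__name__,
            "MLA Parameters": str(alg.get_params()),
        }
    )

MLA_compare = pd.DataFrame(rows)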

Contributor Author

This was great advice! Since adjusting to it, I have been doing this in all my notebooks. It is super easy to append rows (just appending to the list) while keeping the code maintainable :)


#save MLA predictions that could be used later
alg.fit(X, y)
MLA_predict[MLA_name] = alg.predict(X)
Contributor

Note that when you do this you are getting training-set predictions. What would be better is to split off a validation set at the start and compute predictions on that; this would give better estimates of the correlations.
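
A sketch of that idea, under the same assumptions as above (a held-out X_train/y_train, plus the alg, MLA_name, and MLA_predict objects from the snippet):

from sklearn.model_selection import train_test_split

# split a validation set off the training data; the sizes are illustrative
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0, stratify=y_train
)

alg.fit(X_fit, y_fit)
# validation-set predictions give a fairer view of between-model correlations
MLA_predict[MLA_name] = alg.predict(X_val)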

Contributor Author

Yes, this was a mistake on my part. It actually caused my correlation heatmap to show faulty correlation values from the predictions, and I could not understand why it was happening. Thank you, it is fixed now.

#Removed models without attribute 'predict_proba' (required for vote classifier) and models with a 1.0 correlation to another model
#Using default hyper-parameters
vote_est = [
#Ensemble Methods
Contributor

Maybe turn the get_algorithms output into a dict, and just list the keys here, so as to avoid reinitializing the models. It's not really a performance issue, more a matter of code consistency and maintenance.
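
A minimal sketch of that idea; the body of get_algorithms and the particular models listed are assumptions for illustration, not the PR's exact set:

from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.svm import SVC

def get_algorithms():
    # every candidate model keyed by a short, stable name
    return {
        "random_forest": RandomForestClassifier(),
        "gradient_boosting": GradientBoostingClassifier(),
        "svc": SVC(probability=True),  # predict_proba needed for soft voting
    }

algorithms = get_algorithms()

# build the voting estimator list by key instead of re-instantiating models
vote_keys = ["random_forest", "gradient_boosting", "svc"]
vote_est = [(name, algorithms[name]) for name in vote_keys]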

Contributor Author

That's right, the code looks a lot better now. Thank you :)



start_total = time.perf_counter()
for clf, param in zip(vote_est, grid_param):
Contributor

I would make grid_param into a dict using the same keys as vote_est to make it explicit, rather than relying on zip. This will make code updates easier.
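
A minimal sketch of keying the grids by model name, continuing from the dict-based vote_est sketch above; the grid values here are illustrative, not the ones from the PR:

# hyper-parameter grids keyed by the same names as the models
grid_param = {
    "random_forest": {"n_estimators": [100, 300], "max_depth": [None, 10]},
    "gradient_boosting": {"learning_rate": [0.05, 0.1]},
    "svc": {"C": [0.1, 1, 10]},
}

# iterating by key turns any model/grid mismatch into an explicit KeyError
# instead of a silently mis-aligned zip
for name, estimator in vote_est:
    params = grid_param[name]
    print(name, params)  # e.g. feed these into GridSearchCV(estimator, params, ...)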

Contributor Author

Yes, I later realized that even if the lists were inconsistent, zip would silently do its work, the results would be wrong, and I would not even know where the error lay. Thank you for pointing this out!

@SaumyaSinghal
Contributor Author

@dzeber thank you so much for such a detailed review. I have been mulling this over for the last couple of days and have tried to address most of the issues you pointed out, along with working on other aspects of it.

I thought at first that I'd push individual changes and reply to each review comment as I went, but it seems like a better idea to get everything implemented by Tuesday and post all the replies after that, so that you can take a single look once you get a chance :)

@mlopatka
Contributor

@SaumyaSinghal please let us know if you are still actively working on this PR.
I'll add a possibly-inactive tag for now; if you flag me here, we'll review and remove it.

@mlopatka added the "Inactive? potentially orphaned PRs (after 7 days of inactivity these will be closed without merge)" label Mar 20, 2020
@SaumyaSinghal
Contributor Author

@mlopatka I am really sorry for the delay and inactivity. Please remove the inactive flag. We were suddenly told to evacuate our college hostels in the wake of COVID-19, and that caused a lot of hassle with moving and procuring documents throughout last week.
Things have settled down now; I am at home and have been able to resume since yesterday. I'll try to finish this as quickly as possible. I am also working on the other issues in parallel and hope to finish some of them soon.

@mlopatka removed the "Inactive? potentially orphaned PRs (after 7 days of inactivity these will be closed without merge)" label Mar 20, 2020
@mlopatka
Contributor

@SaumyaSinghal No worries at all! I hope you are keeping healthy and safe.
I'm using the inactive tags as a way to help @dzeber and me spend our time most effectively. The tag has been removed from this PR, and I'll loop back for a review as soon as I can.

@SaumyaSinghal
Contributor Author

@mlopatka thank you so much. This cheered me up :D

@mlopatka
Contributor

@SaumyaSinghal please resolve the merge conflicts.
There is a wealth of advice on how to do this in the Gitter channel, as several of your fellow Outreachy applicants have come across this kind of issue.
