
Exploration of the Vehicles dataset based on #2 #36

Closed

Conversation

Contributor

@SaumyaSinghal commented Mar 9, 2020

Here I have made an analysis of the vehicles.csv dataset. My exploration can be summarised as follows:

  • Load the dataset
  • Basic exploration and visualisation of the data
  • Train and test a range of classifiers (trees, SVMs, Gaussian models, discriminant analysis, etc.) using cross-validation
  • Tables and plots of the accuracies and predictions
  • A voting classifier combining all of the classifiers used individually so far
  • Hyper-parameter tuning using grid search
  • Train and test the individual models as well as the voting model with tuned hyper-parameters
  • Basic comparison between the results

The maximum accuracy I got was ~84% using Quadratic Discriminant Analysis.

Although this is aimed at #2, I think this touches some of the other issues too. I’ll open separate PRs for them soon.

I will keep improving and documenting, but I’d be very happy to get a review of my approach :)
I have also tried to modularise and re-use as much code as possible, but I’d highly appreciate any suggestions on that (and also any Python tricks to shorten my code; I’m sure I have missed a few :D).

Contributor

@dzeber left a comment

Thanks for this super well explored and extensive PR! It's really nice to see how much depth you went into here, and you've done a great job documenting both the narrative of the analysis and your insights and conclusions.

It was a great idea to try out the voting classifier. Interestingly, however, it currently doesn't really improve on the best performing individual models. This is probably because it includes many of the worse-performing ones as well, and also because of the high correlations between models (i.e. combining them probably doesn't introduce much diversity). Given this result, I would leave that for later. Alternatively, you could try using only the top 3 models, say.

I'm a little surprised that the performance scores for the tuned model are so close to those of the untuned models. To me that would warrant a little further investigation to make sure everything is working properly. However, given the breadth of the model space, if you want to investigate further I recommend focusing only on the top few performing models.

I'm requesting changes for a few specific points below, after which this will be ready to merge. If you want to continue pursuing model exploration, I recommend looking into the misclassified points. Are the top models all misclassifying the same points? Is that why they show high correlations? If you want to look into this, you can split off that work into a separate PR and tie it in with #7 or #9.

  • Please make sure to use the correct relative path to the dataset (I think it needs an additional ..)
  • Please apply black formatting to the two modules (mainly to wrap the long lines).
  • The pair plot is difficult to read because of the size. I would see if there's a way to make the axis labels more readable.
  • Please include some comments explaining how you chose your train-test split proportions, and also why you decided to use the ShuffleSplit for CV instead of, say, K-fold CV. This is of particular interest as this project is about exploring these choices.
  • When you run cross-validation in run_models, you should be using the training data only, rather than the full dataset. That way, the test data is held out until the end for an independent performance evaluation. The "test" set that ShuffleSplit splits off during CV then serves as a validation set. Similarly, all hyper-parameter tuning should be done on the training set only.
  • Similarly, when you normalize the data, you should compute the mean and SD using the training set only; with CV, you would further compute them using only the training split of each fold. This can be accomplished easily using a Pipeline and StandardScaler (see the sketch after this list).
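
A minimal sketch of that setup, assuming features X and labels y are already loaded from vehicles.csv; the SVC placeholder model and the split sizes are illustrative rather than the PR's actual choices:

from sklearn.model_selection import train_test_split, cross_val_score, ShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hold out the test set once, up front; all CV and tuning use X_train only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# The scaler is re-fit on each fold's training split only, so the validation
# split never leaks into the mean/SD estimates.
pipe = Pipeline([("scale", StandardScaler()), ("model", SVC())])
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
scores = cross_val_score(pipe, X_train, y_train, cv=cv)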

#set name and parameters
MLA_name = alg.__class__.__name__
MLA_compare.loc[row_index, 'MLA Name'] = MLA_name
MLA_compare.loc[row_index, 'MLA Parameters'] = str(alg.get_params())
Contributor

Assigning individual cells of the DF like this works fine but the code is a bit cumbersome. Personally I would do something like create a dict for each row, append each row to a list, and convert to DF at the end.
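
For example, a minimal sketch of that pattern (algorithms here stands in for whatever iterable of estimators the notebook loops over):

import pandas as pd

rows = []
for alg in algorithms:
    # one dict per row instead of assigning individual cells
    rows.append(
        {
            "MLA Name": alg.__class__.__name__,
            "MLA Parameters": str(alg.get_params()),
        }
    )

MLA_compare = pd.DataFrame(rows)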

Contributor Author

This was great advice! Since adjusting to it, I have been doing this in all my notebooks. It is super easy to append rows (just appending to the list) while keeping the code maintainable :)


#save MLA predictions that could be used later
alg.fit(X, y)
MLA_predict[MLA_name] = alg.predict(X)
Contributor

Note that when you do this you are getting training-set predictions. What would be better is to split off a validation set at the start and compute predictions on that; this would give better estimates of the correlations.
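
A sketch of that idea, under the same assumptions as above (a held-out X_train/y_train, plus the alg, MLA_name, and MLA_predict objects from the snippet):

from sklearn.model_selection import train_test_split

# split a validation set off the training data; the sizes are illustrative
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0, stratify=y_train
)

alg.fit(X_fit, y_fit)
# validation-set predictions give a fairer view of between-model correlations
MLA_predict[MLA_name] = alg.predict(X_val)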

Contributor Author

Yes, this was a mistake on my part. It actually caused my correlation heatmap to show faulty correlation values from the predictions, and I could not understand why it was happening. Thank you, it is fixed now.

#Removed models without attribute 'predict_proba' (required for vote classifier) and models with a 1.0 correlation to another model
#Using default hyper-parameters
vote_est = [
#Ensemble Methods
Contributor

Maybe turn the get_algorithms output into a dict, and just list the keys here, so as to avoid reinitializing the models. It's not really a performance issue, more a matter of code consistency and maintenance.
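
A minimal sketch of that idea; the body of get_algorithms and the particular models listed are assumptions for illustration, not the PR's exact set:

from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.svm import SVC

def get_algorithms():
    # every candidate model keyed by a short, stable name
    return {
        "random_forest": RandomForestClassifier(),
        "gradient_boosting": GradientBoostingClassifier(),
        "svc": SVC(probability=True),  # predict_proba needed for soft voting
    }

algorithms = get_algorithms()

# build the voting estimator list by key instead of re-instantiating models
vote_keys = ["random_forest", "gradient_boosting", "svc"]
vote_est = [(name, algorithms[name]) for name in vote_keys]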

Contributor Author

That's right, the code looks a lot better now. Thank you :)



start_total = time.perf_counter()
for clf, param in zip(vote_est, grid_param):
Contributor

I would make grid_param into a dict using the same keys as vote_est to make it explicit, rather than relying on zip. This will make code updates easier.
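
A minimal sketch of keying the grids by model name, continuing from the dict-based vote_est sketch above; the grid values here are illustrative, not the ones from the PR:

# hyper-parameter grids keyed by the same names as the models
grid_param = {
    "random_forest": {"n_estimators": [100, 300], "max_depth": [None, 10]},
    "gradient_boosting": {"learning_rate": [0.05, 0.1]},
    "svc": {"C": [0.1, 1, 10]},
}

# iterating by key turns any model/grid mismatch into an explicit KeyError
# instead of a silently mis-aligned zip
for name, estimator in vote_est:
    params = grid_param[name]
    print(name, params)  # e.g. feed these into GridSearchCV(estimator, params, ...)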

Contributor Author

Yes, I later realized that even if the lists were inconsistent, zip would silently do its work, the results would be wrong, and I would not even know where the error lay. Thank you for pointing this out!

@SaumyaSinghal
Contributor Author

@dzeber thank you so much for such a detailed review. I have been mulling this over for the last couple of days and have tried to address most of the issues you pointed out, along with working on other aspects of it.

I thought at first that I'd push individual changes and reply to each review comment as I went, but it seems like a better idea to get everything implemented by Tuesday and post all the replies after that, so that you can take a single look once you get a chance :)

@mlopatka
Contributor

@SaumyaSinghal please let us know if you are still actively working on this PR.
I'll add a possibly-inactive tag for now; if you flag me here, we'll review and remove it.

@mlopatka added the "Inactive? potentially orphaned PRs (after 7 days of inactivity these will be closed without merge)" label Mar 20, 2020
@SaumyaSinghal
Contributor Author

@mlopatka I am really sorry for the delay and inactivity. Please remove the inactive flag. We were suddenly told to evacuate our college hostels in the wake of COVID-19, and that caused a lot of hassle with moving and procuring documents throughout last week.
Things have settled down now; I am at home and have been able to resume since yesterday. I'll try to finish this as quickly as possible. I am also working on the other issues in parallel and hope to finish some of them soon.

@mlopatka removed the "Inactive? potentially orphaned PRs (after 7 days of inactivity these will be closed without merge)" label Mar 20, 2020
@mlopatka
Contributor

@SaumyaSinghal No worries at all! I hope you are keeping healthy and safe.
I'm using the inactive tags as a way to help @dzeber and me spend our time most effectively. The tag has been removed from this PR, and I'll loop back for a review as soon as I can.

@SaumyaSinghal
Contributor Author

@mlopatka thank you so much. This cheered me up :D

@mlopatka
Contributor

@SaumyaSinghal please resolve the merge conflicts.
There is a wealth of advice on how to do this in the Gitter channel, as several of your fellow Outreachy applicants have come across this kind of issue.
