2018-08-23 workshop notes #14

Closed
21 of 30 tasks
agitter opened this issue Aug 24, 2018 · 1 comment

agitter commented Aug 24, 2018

Here are some of my notes and possible revisions from the pilot workshop. We can discuss these in person before implementing any changes.

Agenda and slides:

  • Look for places to add more interactivity in the initial slides. Could ask about ML examples in their area after showing the examples in the slides.
  • Discuss the relationship between the classifiers we present and their regression analogs.
  • Could expand the setup guide and ask participants to try installing the software before the workshop.
  • Use the same dataset in the initial slides, notebook, and software example to avoid having to explain multiple datasets early on.
  • Show the notebook at the end of the workshop and illustrate the ML pipeline in the software.
  • Define folds on a slide.
  • Add more examples of why the train/validate/test split is needed. Move the data splitting and cross-validation discussion even earlier. Perhaps introduce overfitting at this point (a splitting sketch appears after this list).
  • Note the other cross-validation strategies in the slides and link to the vocabulary guide.
  • Add discussion in GitHub Issues as another next step in the slides.
  • Annotate the y and y hat notation in logistic regression.
  • Consider showing an example of how a trained logistic regression model makes a prediction y hat (see the prediction sketch after this list).
  • Add useful discussion points to the notes section of the slides to help new instructors lead the workshop. (Instructor material for slides #20)
  • Update gender in decision tree example.
  • Provide hints about which datasets and settings to use to explore the questions.
  • After the example ML papers, go into more detail for one: features, class labels, classifier, what was learned and why it matters.
  • Work on a correspondence between a real biological problem and a 2d toy example.
  • Reference Google crash course for a possible ordering.
  • Another possible ordering: ML motivation with examples, test out 1-2 classifiers in the software, learn about them in more depth in the slides, revisit classifiers in software with knowledge of the hyperparameters, overfitting/underfitting and cross-validation, compare selecting on training set only (hold out 0%) versus cross-validation, then finish with data loading and other classifiers.
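
To make the splitting and cross-validation bullets concrete, here is a minimal sketch, assuming scikit-learn and a synthetic dataset (not workshop code):

```python
# Sketch of the train/validate/test idea; the dataset and settings are
# placeholders, not the workshop's data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Hold out a test set that is never touched during model selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# 5-fold cross-validation on the training portion: each fold acts once as
# the validation set while the other four folds are used for fitting.
model = LogisticRegression()
print(cross_val_score(model, X_train, y_train, cv=5).mean())

# Only after model selection is done do we look at the test set.
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```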
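
For the y hat bullet, a hedged illustration of how a fitted logistic regression turns features into a prediction; the weighted-sum-plus-logistic arithmetic is standard, but the data and variable names here are invented:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
model = LogisticRegression().fit(X, y)

x_new = X[:1]  # a single example to score
# The model computes a weighted sum of the features plus an intercept...
z = x_new @ model.coef_.T + model.intercept_
# ...and squashes it through the logistic function to get P(y = 1 | x).
p = 1 / (1 + np.exp(-z))
print(p.ravel(), model.predict_proba(x_new)[:, 1])  # these agree
print(model.predict(x_new))  # y hat: 1 if p >= 0.5, else 0
```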

Software:

  • Mac OS opens the wrapper script in an editor instead of executing it. Need alternative instructions for launching the software.
  • Need more guidance for running the software on Windows when Anaconda is not on the PATH. (Improving the wrapper script for Windows #21)
  • Determine why Windows does not launch the GUI the first time the batch script is run. (Improving the wrapper script for Windows #21)
  • Add a note about the warning Windows shows about running a batch script from an unknown publisher. (Improving the wrapper script for Windows #21)
  • Add a note about common NumPy or other warnings that can be safely ignored.
  • Clear the unlabeled data after loading a new labeled dataset.
  • Neural networks do not provide a class weight option because scikit-learn does not implement it yet. See pull request https://github.com/scikit-learn/scikit-learn/pull/11723 for progress; a resampling workaround is sketched after this list. (Neural networks class weight gitter-lab/ml4bio#8)
  • Create an issue with the error message a student received on Mac OS. (Error message #15)
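
For the class weight item, one possible stopgap (a sketch of manual oversampling; this approach is my assumption, not something ml4bio implements) is to rebalance the training data before fitting, since MLPClassifier accepts neither class_weight nor per-sample weights:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.utils import resample

# Synthetic imbalanced data: roughly 90% class 0, 10% class 1.
X, y = make_classification(n_samples=400, weights=[0.9], random_state=0)

# Oversample the minority class (with replacement) to match the majority.
minority = y == 1
n_needed = int((~minority).sum() - minority.sum())
X_extra, y_extra = resample(X[minority], y[minority],
                            n_samples=n_needed, random_state=0)
X_bal = np.vstack([X, X_extra])
y_bal = np.concatenate([y, y_extra])

clf = MLPClassifier(max_iter=500, random_state=0).fit(X_bal, y_bal)
```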

Data and guides:

  • Document what pre-processing was done on the neurotoxicity dataset to reduce the features to 1000 genes; this is described in the paper.
  • Update the performance guide to explain the performance of a random classifier and how the area under the PR curve depends on class imbalance; a quick check is sketched after this list. (Random PR curve for performance guide #22)
  • Consider adding a toy dataset that is imbalanced and non-linearly separable to help explore different performance measures.
  • Add a data cleaning and pre-processing guide with examples (Data Carpentry resources?).
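
For the random-classifier item, a quick numerical check (synthetic labels and scores; an assumed setup, not the performance guide's wording) that the average precision of random scores tracks the positive class fraction:

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
for pos_frac in (0.5, 0.1):
    y_true = rng.random(100_000) < pos_frac  # imbalanced binary labels
    scores = rng.random(100_000)             # a classifier that guesses
    ap = average_precision_score(y_true, scores)
    print(f"positive fraction {pos_frac:.1f} -> average precision ~ {ap:.3f}")
```

The baseline of the PR curve moves with class imbalance, unlike the ROC curve, whose random baseline stays at 0.5; that contrast could go in the performance guide.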

agitter commented Dec 5, 2018

@fmu2 completed almost all of these suggestions from the 2018-08-23 workshop. I created new specific issues for the remaining comments we may want to address. The others can be safely ignored in my opinion.

agitter closed this as completed Dec 5, 2018