Reduce number of datasets and focus dataset type #12

Open
lakikowolfe opened this issue Aug 10, 2020 · 2 comments

@lakikowolfe
Collaborator

TA used a wide variety of datasets. In the last class he switched between many different datasets to illustrate his points, and these datasets vary widely in the type of data they capture.

We want to reduce the number of datasets used, ideally working with only one or two datasets with a biological focus throughout the course.

@lakikowolfe
Collaborator Author

lakikowolfe commented Aug 10, 2020

Data audit

  • How the data was used
  • dtypes
  • missing data?
  • Include dummy datasets TA made
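
The dtype and missing-data checks above could be run the same way on each candidate dataset. The following is a minimal sketch, assuming the datasets are loaded as pandas DataFrames; the helper name `audit_dataset` and the toy data are hypothetical and only stand in for a real course dataset.

```python
import pandas as pd


def audit_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, missing values, and distinct values per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),   # column data types as strings
        "n_missing": df.isna().sum(),     # count of missing entries
        "n_unique": df.nunique(),         # distinct non-missing values
    })


# Hypothetical toy data (not from the course materials)
toy = pd.DataFrame({
    "species": ["a", "b", None, "a"],
    "length_mm": [1.2, 3.4, 5.6, None],
})

report = audit_dataset(toy)
print(report)
```

Running the same helper over every dataset gives one comparable table per dataset, which may make it easier to see which ones mix categorical and numeric columns.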

Class 1

  • Commute Time Dataset
    • Feature engineering and EDA
    • No missing data
    • Generic dataset with both categorical and numeric data

Class 2

  • Commute Time Dataset
    • Viz of single variables and relationships, linear regression, mean squared error, random forests

Class 3

  • Dummy dataset of 0 and 1 as an example of categorical data
  • Dummy dataset of two random clouds of points to illustrate decision boundaries
  • Tennis dataset
    • all categorical variables, target variable is yes/no played tennis
  • Iris dataset
    • All numeric variables except for target variable (categorical: species)
  • Dummy dataset for random forest

Class 4

  • Dummy data to show the curse of dimensionality
  • Iris dataset to show the benefits of PCA
    • Pair plot
    • PCA
  • Dummy data to superimpose the first component line over a series of random points
  • Dummy data and custom code to illustrate eigenvectors
  • Centered faces dataset: "Eigenfaces"
  • Dummy dataset of clusters to show k-means
  • Arrests data
    • four numeric vars
  • NCI60 for PCA and hierarchical clustering
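
As one illustration of how a single dataset might carry the PCA material in this class, here is a minimal sketch of PCA on the Iris dataset. It assumes scikit-learn is available; standardizing the features first is a choice made for this sketch, not necessarily what TA's notebook does.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the four numeric Iris features (the species target is not used here)
iris = load_iris()

# Standardize so each feature contributes on the same scale (an assumption)
X = StandardScaler().fit_transform(iris.data)

# Project the four features onto the first two principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

explained = pca.explained_variance_ratio_
print(X_2d.shape)
print(explained.sum())
```

The same pattern should apply to a biological dataset such as NCI60, with the data-loading step swapped out.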

@lakikowolfe
Collaborator Author

The Tennis dataset from class 3 can be replaced by Ted's OHSU cvd dataset.
