A repository of code and data for the USCB sklearn workshop on February 7th & 8th, 2019
Kaai can arrive early (8:00 AM) to informally talk about using pandas and numpy for reading and processing data if there is interest
-
Lecture - Basics of machine learning (1-2 hrs)
- Training and test set
- Model types & model complexity
- Basic data types
- Data imbalance
- Performance metrics
- Advanced data types (NLP, computer-vision, time-series, etc.)
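
To illustrate why data imbalance and the choice of performance metric are listed together, here is a minimal sketch. It assumes a made-up dataset with 95 negatives and 5 positives and uses scikit-learn's `DummyClassifier` as a deliberately naive model; none of this comes from the workshop data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Made-up imbalanced labels: 95 samples of class 0, 5 samples of class 1.
X = np.zeros((100, 1))  # features are irrelevant for this naive model
y = np.array([0] * 95 + [1] * 5)

# A "model" that always predicts the most frequent class.
clf = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = clf.predict(X)

accuracy = accuracy_score(y, y_pred)                    # 95 of 100 labels match
recall = recall_score(y, y_pred)                        # no positive is ever found
balanced_acc = balanced_accuracy_score(y, y_pred)       # mean of per-class recall

print(f"accuracy={accuracy:.2f} recall={recall:.2f} balanced accuracy={balanced_acc:.2f}")
# prints: accuracy=0.95 recall=0.00 balanced accuracy=0.50
```

The accuracy looks strong even though the minority class is never detected, which is why recall or balanced accuracy is often reported alongside it.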
-
Code along - First pass: band gap regression (1 hr)
- download data
- perform a train-test split
- convert formulae to features
- perform linear regression (linear learner)
- perform random-forest regression (non-linear)
- generate error metrics & figures
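
The first-pass steps above could be sketched roughly as follows. This is not the workshop code: the element list, the `featurize` helper, the random formulae, and the target values are all hypothetical stand-ins (the target is a synthetic function of the features, not a real band gap), and the download step is skipped.

```python
import re

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

ELEMENTS = ["Ga", "As", "Zn", "O", "Si"]  # hypothetical, tiny element vocabulary


def featurize(formula):
    """Toy featurizer: 'Ga2O3' -> fractional amount of each element in ELEMENTS."""
    counts = dict.fromkeys(ELEMENTS, 0.0)
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        counts[symbol] += float(amount) if amount else 1.0
    total = sum(counts.values())
    return [counts[el] / total for el in ELEMENTS]


# Build random two-element formulae in place of downloaded data.
rng = np.random.default_rng(0)
formulae = []
for _ in range(200):
    a, b = rng.choice(ELEMENTS, size=2, replace=False)
    formulae.append(f"{a}{rng.integers(1, 4)}{b}{rng.integers(1, 4)}")

X = np.array([featurize(f) for f in formulae])
# Synthetic target (NOT a real band gap): depends on the features plus noise.
y = 3.0 * X[:, 3] + X[:, 0] * X[:, 1] + 0.05 * rng.normal(size=len(X))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    scores[name] = {
        "mae": mean_absolute_error(y_test, y_pred),
        "r2": r2_score(y_test, y_pred),
    }
    print(f"{name}: MAE={scores[name]['mae']:.3f}, R2={scores[name]['r2']:.3f}")
```

Both models are scored on the same held-out test set so that the linear and non-linear learners can be compared directly.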
-
Code along - Second pass: band gap regression (1.5 hrs)
- remove duplicates and make sure data looks okay
- perform a train-test split
- convert formula to features
- discuss model parameter selection
- implement cross-validation
- implement grid-search
- generate error metrics & figures
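
A minimal sketch of the second-pass additions (duplicate removal, cross-validation, and grid search) might look like this. The data frame, its column names, and the parameter grid are assumptions chosen for illustration, and the target is synthetic rather than a measured band gap.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Synthetic stand-in for a featurized dataset (column names are hypothetical).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(0, 1, size=(150, 4)), columns=["f0", "f1", "f2", "f3"])
df["target"] = 2.0 * df["f0"] + np.sin(3 * df["f1"]) + 0.1 * rng.normal(size=len(df))
df = pd.concat([df, df.iloc[:20]], ignore_index=True)  # inject 20 duplicate rows

# Remove exact duplicates before splitting, so copies cannot leak into the test set.
clean = df.drop_duplicates().reset_index(drop=True)

X = clean.drop(columns="target").to_numpy()
y = clean["target"].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 5-fold cross-validation of a baseline model, using the training data only.
baseline = RandomForestRegressor(n_estimators=50, random_state=0)
cv_scores = cross_val_score(
    baseline, X_train, y_train, cv=5, scoring="neg_mean_absolute_error"
)
print(f"baseline CV MAE: {-cv_scores.mean():.3f}")

# Grid search over a small, arbitrary parameter grid with the same CV scheme.
param_grid = {"max_depth": [3, None], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid,
    cv=5,
    scoring="neg_mean_absolute_error",
)
search.fit(X_train, y_train)

# The test set is touched once, after model selection is finished.
test_mae = mean_absolute_error(y_test, search.predict(X_test))
print(f"best params: {search.best_params_}, test MAE: {test_mae:.3f}")
```

Keeping the test set out of both cross-validation and grid search is the design choice that makes the final error estimate trustworthy.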
-
Individual work - Aflow regression: predicting bulk modulus (until the end of day)
- Implement code (Taylor/Kaai will be available for questions)
- Fill out individual code sections. Answers are revealed before moving on (4 or 5 parts).
- Implement code (Taylor/Kaai will be available for questions)
Homework: Think about data that might be interesting to learn on. We will talk about your ideas in the morning.
-
Code along (do you want this many coding examples?) - A quick classification problem: predicting crystal structure (1 hr)
- Augment same code structure for classification
- generate error metrics & review recall vs precision.
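
As a rough illustration of reusing the regression structure for classification and reviewing recall vs precision, the sketch below swaps in a classifier and classification metrics. The data come from scikit-learn's `make_classification` with two imbalanced classes; they are a synthetic stand-in, not crystal structure labels.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split

# Synthetic binary problem with roughly 80% / 20% class balance.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, weights=[0.8, 0.2], random_state=0
)

# stratify=y keeps the class ratio similar in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
precision = precision_score(y_test, y_pred)  # tp / (tp + fp)
recall = recall_score(y_test, y_pred)        # tp / (tp + fn)

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"precision={precision:.3f} recall={recall:.3f}")
print(classification_report(y_test, y_pred))
```

Precision answers "how many predicted positives were right", while recall answers "how many actual positives were found"; which one matters more depends on the cost of each kind of mistake.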
-
Individual work - metal/non-metal band gap classification.
- Work through full ML workflow while
- Coaching - Discuss research ideas and data with Taylor/Kaai
- Using matplotlib to make publication-quality figures
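
One possible starting point for such a figure is a predicted-vs-actual (parity) plot. The sketch below uses made-up values, and the figure size, font size, resolution, and output filename are arbitrary assumptions rather than workshop requirements.

```python
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Made-up "actual" and "predicted" values for illustration only.
rng = np.random.default_rng(0)
y_true = rng.uniform(0, 5, size=100)
y_pred = y_true + rng.normal(scale=0.3, size=100)

# Global style tweaks: readable fonts and slightly heavier axes.
plt.rcParams.update({"font.size": 12, "axes.linewidth": 1.2})

fig, ax = plt.subplots(figsize=(3.5, 3.5))  # roughly single-column width, in inches
ax.scatter(y_true, y_pred, s=12, alpha=0.7, edgecolor="none")

limits = [0, 5.5]
ax.plot(limits, limits, "k--", linewidth=1, label="ideal")  # y = x reference line
ax.set_xlim(limits)
ax.set_ylim(limits)
ax.set_xlabel("Actual value")
ax.set_ylabel("Predicted value")
ax.set_aspect("equal")
ax.legend(frameon=False)
fig.tight_layout()

out_path = Path("parity_plot.png")  # hypothetical output filename
fig.savefig(out_path, dpi=300)      # high resolution for print
plt.close(fig)
```

Setting the figure size in inches to match the target column width, then saving at a high dpi, avoids rescaling the figure (and its fonts) later.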