The following analysis uses publicly available data from the TIMSS official website. The program is written to analyze all of the available data, but the scope of this repository is limited to 10 countries for the purposes of storage and run time.
- How does a student's environment at home, in the classroom, and at school affect academic understanding?
- Are there specific teacher behaviors that lead to improved understanding in specific disciplines?
- What can students and teachers do to improve student academic understanding?
The project will cover a variety of analyses regarding data collected. The project will include:
- prediction of student score based on student attitudes and demographics, school characteristics
- prediction of student scores based on teacher attitudes and practices
- recommendation engine for additional study problems for a given student (or group of students from a school or country)
pandas
numpy
pyreadstat
matplotlib
seaborn
glob
random
re
statsmodels
sklearn
xgboost
- gather and organize data for variety of SPSS files (
glob
,pandas.read_spss
) - gather item information and metadata (
pandas.read_excel
) - drop unnecessary/irrelevant columns (
pandas.drop
) - simplify and combine assessment scores for easier application
- average student assessment scores to a single number
- explore data structure (
pandas.info()
,pandas.describe()
) - visualize basic relationships (
matplotlib
,seaborn
) - record initial observation and plan for data preparation
- perform data preparation on each dataset
- drop irrelevant information for that specific data (
pandas.drop
) - convert datatypes for appropriate analysis (
pandas.astype
) - give columns more descriptive names (
pandas.rename
) - perform any necessary transformations (
pandas.apply
) - average columns into like groups (
pandas.groupby
) - join data from other tables when helpful to the analysis (
pandas.join
)
- drop irrelevant information for that specific data (
- store processed data (
pandas.to_csv
)
Regression
- identifying variables of interest
- create dummy variables for string objects (
pandas.get_dummies
) - split the data for training and testing (
sklearn.model_selection.train_test_split
) - create and train the linear regression models (
statsmodels.api.OLS
) - explore relative weights of variables in regression models
- create predictions and confidence intervals for test dataset (
statsmodels.api.OLS.predict
,statsmodels.api.OLS.get_prediction
) - evaluate effectiveness of models and confidence intervals
- visualize models (
seaborn
,matplotlib
)
Most models accounted for ~50% of the variation in average test scores for the subject of interest (student, school, teacher). This was adequate considering that school and teacher variables were not included in the student regression model; therefore, it would be nearly impossible for one of the models to account for all variables contributing to the achievement of a student, school, or teacher.
All 80% confidence intervals achieved the appropriate accuracy. This means that ranges of assessment scores generated by the model are accurate for approximately 80% of the data. The other 20% of the data are not captured due to being extreme or unlikely results. For example, most of the students score around 500, so it is difficult to generate representative models for students who score closer to 300 or 700 (the relative extremes).
Classification
- identifying and gathering variables of interest
- split the data for training and testing (
sklearn.model_selection.train_test_split
) - create base models
sklearn.ensemble.RandomForestClassifier
xgboost.XGBClassifier
sklearn.neighbors.KNeighborsClassifier
- set up parameter grids for optimization
- fit and optimize base model with grid search of parameters (
sklearn.model_selection.GridSearchCV
) - display results and accuracy (
sklearn.metrics.confusion_matrix
) - compare similarity between different models
The relatively vague nature of the classifications of school socio-background left some interpretation issues for the models. For example, there are no clear objective boundaries between a classification of "More Affluent" and "Neither More Affluent nor More Disadvantaged", so it is unclear when the model should jump between those two classifications.
In the end, the optimized XGBoost Classifier had the highest accuracy (54%) and the lowest number of significant errors (misidentifying "More Affluent" as "More Disadvantaged" and vice versa only 22 times out of 236 total test values).
Overall, the model could be improved by including relevant averages for student demographics within each school (immigration and academic language data, home resources) to provide a richer set of inputs on which to base the classification.
Recommendation
- gather assessment item information
- drop irrelevant information to construct user_item dataframe
- create collection of helper functions to:
- identify similar users (
find_similar_users
) - gather specific item information (
get_item_info
) - gather items assessed by a user (
get_user_items
) - create ranked user-to-user recommendations for new items (
user_user_recs
) - implement FunkSVD to estimate score on previously unknown items (
FunkSVD
) - estimate scores for unassessed items for a user (
user_item_predict
) - implement custom sigmoid activation function for smoother predictions (
sig
)
- identify similar users (
- test recommendation engine
The recommendation system takes into account user assessment history to find users with similar experience. From there, it finds new assessment items from the similar users and ranks them based on prevalence (number of users who assessed the item) and difficulty (average assessment score). The system then selects 10 items to recommend to the user as new study items.
With the additional analysis of FunkSVD and a sigmoid activation function, the recommendations estimate the user's performance on the recommended assessment items. The user can then decide how to best use their resources to prepare for future assessments. A user could focus on reviewing items with high predictions (items with a high chance of success), low predictions (items with a low chance of success), or items with an unclear prediction (items on which a user could have more influence on the outcome).
The recommended items are significant and clearly based on numerical evidence, but the predicted scores are not as reliable. Due to the unpredictable nature of estimating unknowns with a random sample of 99 other students with the FunkSVD algorithm, the predicted scores for the 10 recommended assessment items can vary significantly depending on the sample. For example, in the tests run for comparison, there were 3 items out of 10 that had significantly different predictions. Therefore, users should use the predictions of the recommendation system loosely as they decide which problems to study first.