Assignment projects for a Machine Learning course
TODO:
- Text Classification
[] Change the labels from suicide/non-suicide to depression/non-depression
[] Preprocess text
[] Feature extraction - word embeddings (GloVe / word2vec)
[] Create the model
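The first two items above could be sketched as follows. This is a minimal illustration using only the standard library, not the project's actual pipeline: the label strings, the `LABEL_MAP` dictionary, and the `relabel` and `preprocess` helpers are all assumptions made for the example, and the cleaning rules (lowercasing, dropping URLs and non-letter characters) are one possible choice among many.

```python
import re

# Assumed mapping from the original dataset labels to the new ones.
LABEL_MAP = {"suicide": "depression", "non-suicide": "non-depression"}


def relabel(label: str) -> str:
    """Map an original label to its depression/non-depression counterpart."""
    key = label.strip().lower()
    if key not in LABEL_MAP:
        raise ValueError(f"unexpected label: {label!r}")
    return LABEL_MAP[key]


def preprocess(text: str) -> list[str]:
    """Lowercase, drop URLs and non-letter characters, then split on whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # keep ASCII letters only
    return text.split()
```

The resulting token lists are the kind of input that a GloVe or word2vec lookup would consume in the feature extraction step.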
- Classification using DASS and Demographic Data
This part of the assignment uses the public Depression Anxiety Stress Scales Responses dataset on Kaggle. To work with the notebook, create the folder structure data/dass_data and place the downloaded and extracted data in the dass_data folder.
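If it helps, the folder structure can also be created from Python. The sketch below assumes it is run from the root of the repo; the `ensure_data_dir` helper is a hypothetical name introduced for this example.

```python
from pathlib import Path


def ensure_data_dir(repo_root: str = ".") -> Path:
    """Create data/dass_data under repo_root if it does not exist yet."""
    data_dir = Path(repo_root) / "data" / "dass_data"
    # parents=True creates the intermediate "data" folder as well;
    # exist_ok=True makes the call safe to repeat.
    data_dir.mkdir(parents=True, exist_ok=True)
    return data_dir
```

Calling `ensure_data_dir()` returns the path where the extracted Kaggle files should be placed.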
[] EDA to check data quality and each column's correlation with the target variable
- The columns checked so far are relatively clean, except for the "major" column, which requires some cleaning.
- The current correlation test shows that the individual question scores correlate highly with severity; the demographic data shows promise despite its lower correlation.
- The age column has some odd values that need to be handled.
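One possible way to handle the odd age values is to treat anything outside a plausible range as missing. The sketch below is only an illustration: the notes above do not say what the odd values look like, so the bounds `MIN_AGE` and `MAX_AGE` and the `clean_age` helper are assumptions that should be adjusted after inspecting the data.

```python
from typing import Optional

MIN_AGE = 10   # assumed lower bound for a plausible respondent age
MAX_AGE = 100  # assumed upper bound


def clean_age(raw: object) -> Optional[int]:
    """Return the age as an int, or None if it is unparseable or implausible."""
    try:
        age = int(float(raw))
    except (TypeError, ValueError, OverflowError):
        # Non-numeric strings, None, NaN and infinity all end up here.
        return None
    return age if MIN_AGE <= age <= MAX_AGE else None
```

Rows where `clean_age` returns None can then be dropped or imputed, depending on how many there are.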
[] Feature Engineering
- The features currently used are the individual scores for each question and some of the demographic columns.
- The "major" column still has over 5000 unique values even after replacing NaN values with "None", so it requires standardization and cleaning.
- Feature scaling currently uses the min-max scaler. Need to check whether categorical features need to be scaled or can be scaled differently.
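The two open points above, standardizing the "major" column and min-max scaling, can be illustrated with a small standard-library sketch. The `normalise_major` and `min_max_scale` helpers are hypothetical names for this example, and the normalization rules (lowercasing, stripping punctuation, collapsing whitespace) are an assumed first pass that will not merge synonyms or misspellings.

```python
import re


def normalise_major(raw: object) -> str:
    """Collapse case, punctuation and whitespace variants of a free-text major."""
    if raw is None or str(raw).strip() == "":
        return "none"
    text = re.sub(r"[^a-z\s]", " ", str(raw).lower())
    return " ".join(text.split()) or "none"


def min_max_scale(values: list[float]) -> list[float]:
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

On the categorical question: a one-hot column that contains both 0 and 1 is left unchanged by min-max scaling, so scaling such columns is harmless but has no effect.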
[] Models to Test
- Logistic regression, SVM, XGBoost, decision trees
- Logistic regression seems to perform well on the test set; try cross-validation and new data to further test generalisation.
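For the cross-validation step, the splitting logic can be sketched as below. This is a plain-Python illustration of shuffled k-fold splitting, not the project's code; the `k_fold_indices` helper and its default parameters are assumptions. In practice scikit-learn's `KFold` or `cross_val_score` would do the same job with less code.

```python
import random


def k_fold_indices(n_samples: int, k: int = 5, seed: int = 0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size
```

Each sample appears in exactly one test fold, so averaging the per-fold scores gives a less optimistic estimate than a single train/test split.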
- Run the command
    pip install -r requirements.txt
  to install the required modules.
- Download the models folder and place it in the root directory of this repo.
- Start the API by running the
    app.py
  script.
- Run the application using the following sequence of commands:
    cd app
    streamlit run app.py
- Azri Anwar (Azri's GitHub)
- Tengku Naim (Tengku's GitHub)
- Khairol Hazeeq (Khairol's GitHub)
- Afiq Irfan (Afiq's GitHub)