Welcome to my portfolio repository, where I apply machine learning techniques to analyze mental health datasets, specifically focusing on student anxiety and depression. My aim is to demonstrate how data-driven approaches can provide valuable insights into mental health issues.
In this project, I explore various machine learning models using the Students Anxiety and Depression dataset from Kaggle.
- Models used: Linear SVM, Logistic Regression, Random Forest
- Accuracy achieved: Over 95% for both SVM and Logistic Regression
- Confusion matrix: fewer misclassifications for Linear SVM and Logistic Regression than for the original model
- Added Linear SVM and Logistic Regression models.
- Improved the confusion-matrix results.
- Applied additional exploratory data analysis to better understand the data.
- Python, Scikit-learn, Matplotlib, Seaborn
Note: I did not perform the dataset cleaning, but I added new features and improved model training and evaluation.
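The two top-scoring models can be trained and compared with a few lines of scikit-learn. This is a minimal sketch: since the exact feature columns of the Kaggle dataset are not reproduced here, it uses `make_classification` as a stand-in for the cleaned features and labels.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Stand-in for the cleaned Students Anxiety and Depression features/labels.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

results = {}
for model in (LinearSVC(), LogisticRegression(max_iter=1000)):
    model.fit(X_train, y_train)
    results[type(model).__name__] = accuracy_score(y_test, model.predict(X_test))

print(results)
```

Swapping in the real dataset only requires replacing `X` and `y` with the prepared feature matrix and target column.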
This project uses TensorFlow and Scikit-learn to detect suicidal and depressive comments from three subreddits.
- Subreddits Used:
- r/SuicideWatch -> Label: Suicide
- r/depression -> Label: Depression
- r/teenagers -> Label: Non-Depression
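The subreddit-to-label mapping above can be applied directly to a DataFrame of scraped posts; the column names below are assumptions for illustration.

```python
import pandas as pd

# Map each source subreddit to its class label, as described above.
LABELS = {
    "SuicideWatch": "Suicide",
    "depression": "Depression",
    "teenagers": "Non-Depression",
}

# Stand-in rows; real data would come from the three subreddits.
posts = pd.DataFrame({
    "subreddit": ["SuicideWatch", "depression", "teenagers"],
    "text": ["example post", "example post", "example post"],
})
posts["label"] = posts["subreddit"].map(LABELS)
print(posts[["subreddit", "label"]])
```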
To understand the dataset, I performed the following visualizations:
- Distribution of labels across subreddits.
- Word clouds to show the most frequent words associated with each class.
- Correlation heatmaps to detect relationships between variables.
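The label-distribution and correlation plots can be produced with Seaborn as sketched below, using stand-in data since the real DataFrame is not shown here (the word clouds additionally require the `wordcloud` package, omitted for brevity).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Stand-in data; the real rows come from the three subreddits.
df = pd.DataFrame({
    "label": ["Suicide"] * 3 + ["Depression"] * 4 + ["Non-Depression"] * 5,
    "word_count": [12, 45, 30, 22, 18, 50, 33, 8, 15, 27, 40, 19],
})

# Distribution of labels across subreddits.
sns.countplot(data=df, x="label")
plt.savefig("label_distribution.png")
plt.close()

# Correlation heatmap between numeric features (second feature is a stand-in).
df["char_count"] = df["word_count"] * 5
sns.heatmap(df[["word_count", "char_count"]].corr(), annot=True)
plt.savefig("correlation_heatmap.png")
plt.close()
```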
The dataset was preprocessed using the following steps:
- Tokenization and removal of stopwords.
- Normalization by converting all text to lowercase.
- Lemmatization to reduce words to their base form.
- Removal of outliers based on word count distributions.
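A dependency-free sketch of the first three steps is shown below; the full pipeline would use a proper lemmatizer (e.g. NLTK's `WordNetLemmatizer`), which is left out here to keep the example self-contained.

```python
import re

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS


def preprocess(text):
    """Lowercase, tokenize, and remove stopwords from a comment."""
    text = text.lower()                    # normalization
    tokens = re.findall(r"[a-z']+", text)  # simple regex tokenization
    return [t for t in tokens if t not in ENGLISH_STOP_WORDS]


print(preprocess("I have been feeling very anxious lately"))
```

Outlier removal then amounts to filtering out documents whose token counts fall outside chosen bounds of the word-count distribution.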
Several deep learning architectures were tested, including:
- LSTM (Long Short-Term Memory) Networks: Suitable for sequential text data.
- Convolutional Neural Networks (CNNs): For text classification.
- BERT-based Model: State-of-the-art performance for NLP tasks.
For each model, I used the following setup:
- Optimizer: Adam with a learning rate of 0.001
- Loss function: Binary Cross-Entropy
- Batch size: 64
- Epochs: 10
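The LSTM variant with the setup above can be sketched in Keras as follows; the vocabulary size and sequence length are assumptions, and the `fit` call is commented out because it needs the real tokenized data.

```python
import tensorflow as tf

VOCAB_SIZE = 20_000  # assumed vocabulary size
MAX_LEN = 200        # assumed padded sequence length

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 64),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Training setup from the list above: Adam at 0.001, binary cross-entropy.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

# model.fit(X_train, y_train, batch_size=64, epochs=10, validation_split=0.1)
```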
The performance of each model was evaluated using:
- Accuracy, Precision, Recall, F1-Score
- Confusion Matrix for visualizing misclassifications.
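All four metrics and the confusion matrix come straight from scikit-learn; the labels and predictions below are stand-ins for the real test-set outputs.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Stand-in test labels and model predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-Score :", f1_score(y_true, y_pred))

cm = confusion_matrix(y_true, y_pred)  # rows: true labels, cols: predictions
print(cm)
```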
Here's a summary of the results:
| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| LSTM | 92% | 90% | 88% | 89% |
| CNN | 93% | 91% | 89% | 90% |
| BERT | 95% | 93% | 92% | 93% |
Feel free to explore the code and make contributions!
Contact Information:
- Email: [email protected]
- LinkedIn: Nicolas Vargas