The Company Relationship Analysis Tool is an AI-driven project that uses historical stock price movements to uncover relationships between US companies. We cluster companies by industry sector and by stock price correlation, then test whether training prediction models per cluster improves the accuracy of stock price movement predictions while also revealing inter-company relationships.
- Identify Intercompany Relationships: Highlight partnerships, shared dependencies, or financial risks.
- Enhance Stock Price Prediction: Use advanced clustering and regression techniques to refine predictions.
- Visualize Relationships: Employ heatmaps and hierarchical clustering dendrograms to represent correlations and relationships.
- Business Impact:
  - Investment Insights: Detect partnerships and acquisition opportunities.
  - Risk Mitigation: Uncover vulnerabilities in shared dependencies.
  - Market Expansion: Facilitate partnerships or mergers.
  - Customer Insights: Identify shared customer bases for targeted marketing.
  - Crisis Management: Anticipate cascading effects during disruptions.
- Data Source: Historical stock price data from Yahoo Finance for all S&P 500 companies.
- Time Period: 4 years of data (September 2020 - September 2024).
- Features:
  - Daily percent returns calculated to capture relative price movements.
- Preprocessing:
  - Aggregated data into prices, volume, and daily returns.
  - Removed rows with NaNs to ensure data integrity.
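The return calculation and NaN removal described above can be sketched as follows. This is a minimal illustration, not the project's notebook code: the tickers `AAA`, `BBB`, `CCC` and their prices are made up, and a real run would start from closing prices downloaded from Yahoo Finance.

```python
import numpy as np
import pandas as pd

# Synthetic closing prices for three hypothetical tickers (illustration only).
dates = pd.bdate_range("2020-09-01", periods=6)
prices = pd.DataFrame(
    {
        "AAA": [100.0, 102.0, 101.0, 103.0, 104.0, 104.0],
        "BBB": [50.0, 49.0, 50.0, np.nan, 52.0, 53.0],
        "CCC": [20.0, 20.5, 21.0, 21.0, 20.0, 22.0],
    },
    index=dates,
)

# Daily percent return: (P_t - P_{t-1}) / P_{t-1} * 100.
# fill_method=None keeps missing prices as NaN instead of forward-filling them.
returns = prices.pct_change(fill_method=None) * 100

# Drop every day on which at least one company has no return.
# The first day is always dropped because it has no previous price.
clean_returns = returns.dropna(how="any")
```

Dropping whole rows keeps all companies aligned on the same dates, which the correlation step needs, at the cost of losing days where only one ticker is missing.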
- Heatmaps:
  - Created absolute correlation heatmaps to visualize similarities between companies.
  - Generated dissimilarity heatmaps to explore distances between company clusters.
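As a rough sketch of the matrices behind these heatmaps, the snippet below builds an absolute correlation matrix and a dissimilarity matrix from synthetic returns. The definition `1 - |correlation|` for dissimilarity is an assumption made for illustration; the source does not state the exact formula used. The tickers and the random data are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic daily returns: AAA and BBB follow a shared "market" factor,
# CCC moves against it, and DDD is independent noise.
rng = np.random.default_rng(0)
market = rng.normal(size=250)
returns = pd.DataFrame(
    {
        "AAA": market + 0.3 * rng.normal(size=250),
        "BBB": market + 0.3 * rng.normal(size=250),
        "CCC": -market + 0.3 * rng.normal(size=250),
        "DDD": rng.normal(size=250),
    }
)

# Pearson correlation between every pair of companies.
corr = returns.corr()

# Absolute correlation treats strong negative co-movement as a relationship too.
abs_corr = corr.abs()

# Assumed dissimilarity: 0 for perfectly (anti-)correlated pairs, 1 for uncorrelated ones.
dissimilarity = 1.0 - abs_corr
```

Either matrix can then be passed to a plotting call such as `seaborn.heatmap(abs_corr)` to render the heatmap.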
- Hierarchical Clustering:
  - Used dissimilarity matrices and various linkage methods (e.g., single, complete, ward) to form clusters.
  - Visualized clusters with dendrograms to understand company relationships.
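One way to run hierarchical clustering on a precomputed dissimilarity matrix is with SciPy, sketched below. The 4x4 matrix and tickers are made up, and SciPy itself is an assumption here (it is not in the listed requirements, although scikit-learn depends on it). The example uses single, complete, and average linkage; SciPy also accepts `method="ward"` on a precomputed matrix, but its documentation notes that Ward is only correctly defined for Euclidean distances.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical dissimilarity matrix: AAA/BBB are close, CCC/DDD are close.
tickers = ["AAA", "BBB", "CCC", "DDD"]
dissimilarity = np.array(
    [
        [0.0, 0.1, 0.8, 0.9],
        [0.1, 0.0, 0.7, 0.8],
        [0.8, 0.7, 0.0, 0.2],
        [0.9, 0.8, 0.2, 0.0],
    ]
)

# linkage() expects the condensed (upper-triangle) form of a distance matrix.
condensed = squareform(dissimilarity)

results = {}
for method in ("single", "complete", "average"):
    Z = linkage(condensed, method=method)
    # Cut the tree so that at most two flat clusters remain.
    labels = fcluster(Z, t=2, criterion="maxclust")
    results[method] = dict(zip(tickers, labels))
```

The linkage matrix `Z` from any of these runs can be passed to `scipy.cluster.hierarchy.dendrogram(Z, labels=tickers)` to draw the dendrogram.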
- Silhouette Analysis:
  - Evaluated cluster quality by measuring cohesion and separation.
  - Experimented with different numbers of clusters and linkage methods for optimal results.
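The search over cluster counts might look like the sketch below, which scores each cut of the tree with scikit-learn's `silhouette_score` on a precomputed dissimilarity matrix. The six-point matrix is a toy example built from made-up positions on a line, and the choice of average linkage and the range of `k` are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.metrics import silhouette_score

# Toy dissimilarity matrix: three tight pairs of "companies" on a line.
positions = np.array([0.00, 0.04, 0.40, 0.45, 0.90, 0.96])
dissimilarity = np.abs(positions[:, None] - positions[None, :])

Z = linkage(squareform(dissimilarity), method="average")

scores = {}
for k in range(2, 5):
    labels = fcluster(Z, t=k, criterion="maxclust")
    # metric="precomputed" tells scikit-learn the input is already a distance matrix.
    scores[k] = silhouette_score(dissimilarity, labels, metric="precomputed")

# The cluster count with the highest mean silhouette is the preferred cut.
best_k = max(scores, key=scores.get)
```

The same loop can be wrapped in an outer loop over linkage methods to compare them on equal footing.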
- Baseline Model: Predicts the mean of the training set.
- XGBoost Models:
  - Three strategies:
    - Trained on all companies and all stock data.
    - Trained a model per cluster (companies clustered by their industry).
    - Trained a model per cluster (companies clustered by hierarchical clustering).
  - Used Monday-Thursday data as features to predict Friday returns.
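To make the Monday-Thursday feature layout and the baseline concrete, here is a minimal sketch for a single hypothetical company with random returns. The week-grouping approach, the 6/2 chronological split, and all variable names are assumptions for illustration rather than the project's actual pipeline.

```python
import numpy as np
import pandas as pd

# Eight full Monday-Friday weeks of synthetic daily returns (2024-01-01 is a Monday).
rng = np.random.default_rng(42)
dates = pd.bdate_range("2024-01-01", periods=40)
returns = pd.Series(rng.normal(0.0, 1.0, size=len(dates)), index=dates, name="ret")

frame = returns.to_frame()
frame["weekday"] = frame.index.dayofweek  # Monday=0 ... Friday=4
# Label each row with the Monday of its week so that days group by week.
frame["week"] = frame.index - pd.to_timedelta(frame.index.dayofweek, unit="D")

# One row per week, one column per weekday; drop weeks with a missing day.
weekly = frame.pivot(index="week", columns="weekday", values="ret").dropna()

X = weekly[[0, 1, 2, 3]].to_numpy()  # Monday-Thursday returns as features
y = weekly[4].to_numpy()             # Friday return as target

# Chronological split: earlier weeks for training, later weeks for testing.
n_train = 6
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

# Baseline: always predict the mean Friday return seen in training.
baseline_pred = np.full_like(y_test, y_train.mean())
baseline_rmse = float(np.sqrt(np.mean((y_test - baseline_pred) ** 2)))
```

A regressor such as `xgboost.XGBRegressor` could be fit on `X_train`, `y_train` and scored on the same test weeks to compare against this baseline. For the per-cluster strategies, the same construction would be repeated on the subset of companies in each cluster.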
- Heatmap Insights:
  - Companies within the same sector (e.g., technology, healthcare) show higher correlations.
  - Cross-sector relationships highlight unique interdependencies.
- Clustering Performance:
  - Hierarchical Clustering:
    - Dendrograms provided a detailed view of intercompany relationships.
    - Optimal clusters identified using silhouette analysis.
  - Industry-Based Clustering:
    - Produced intuitive results but lacked predictive improvement.
- Model Evaluation:
  - Baseline Model:
    - RMSE: 4.605
  - No Clustering:
    - RMSE: 1.797, R²: 0.85
  - Sector-Based Clustering:
    - Average RMSE: 1.786, R²: 0.833
  - Hierarchical Clustering:
    - RMSE: 1.779, R²: 0.453
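Key Insight: While clustering provided valuable insights into company relationships, it did not significantly improve predictive accuracy. RMSE was nearly identical across the three XGBoost strategies (1.797, 1.786, and 1.779), and the simplest model (no clustering) achieved the highest R² (0.85), so it performed best overall.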
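For reference, the two metrics reported above can be computed as in the sketch below, which uses the standard definitions of RMSE and R². The helper names and sample values are illustrative. The last lines are a back-of-the-envelope check that rests on an assumption: if the baseline's RMSE (4.605) approximates the spread of the test targets, the no-clustering RMSE (1.797) implies an R² close to the reported 0.85.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: typical size of a prediction error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot: share of target variance explained."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Made-up targets: predicting their own mean explains none of the variance.
y_true = np.array([1.0, -2.0, 0.5, 3.0])
mean_pred = np.full_like(y_true, y_true.mean())
r2_of_mean = r2(y_true, mean_pred)
rmse_of_perfect = rmse(y_true, y_true)

# Rough consistency check under the assumption stated above.
implied_r2 = 1.0 - (1.797 / 4.605) ** 2
```

Because R² is measured relative to the variance of the targets in each evaluation set, models with similar RMSE can report quite different R² values when they are evaluated on different subsets of companies.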
- Model Improvements:
  - Fine-tune and train the no-clustering model further.
  - Experiment with LightGBM and CatBoost for faster training and improved accuracy.
- Enhanced Data Integration:
  - Incorporate news and social media sentiment analysis.
  - Implement dynamic retraining with updated data.
- User Interaction:
  - Develop interactive interfaces for analysts, including "what-if" scenario tools and company relationship visualization.
- Project Overview
- Methodology
- Results and Key Findings
- Potential Next Steps
- Installation
- Usage
- Contributing
- License
- Credits and Acknowledgments
- Python 3.8+
- Jupyter Notebook
- Required libraries:
  - pandas
  - numpy
  - matplotlib
  - seaborn
  - scikit-learn
  - xgboost
  - yfinance
- Clone the repository:
    git clone https://github.com/your-repo-url.git
    cd your-repo-folder
- Set up a virtual environment (optional but recommended):
    python3 -m venv env
    source env/bin/activate  # On Windows, use `env\Scripts\activate`
- Install dependencies:
pip install -r requirements.txt
- Launch Jupyter Notebook:
jupyter notebook
- Open `preprocess.ipynb` and follow the steps to preprocess stock price data.
  - Use Yahoo Finance or another API to download the historical stock price data.
- Use `matrices_and_heatmaps.ipynb` for exploratory data analysis.
  - Visualize correlation matrices and hierarchical clusters.
- Train models using:
  - `naive_xgboost.ipynb` for a single model on all data.
  - `task3a-linkage-silhouette.ipynb` to experiment with clustering methods and silhouette analysis.
- Evaluate model performance using RMSE and R² metrics.
- Visualize and analyze results from clustering and XGBoost models in `task2_matrices_and_heatmaps_with_task3a.ipynb`.
Apache License 2.0
- Sara Deshmukh (Rutgers University - New Brunswick)
- Victoria Kim (Virginia Tech)
- Alaina Lin (Brown University)
- Chelsey Parker (Georgia State University)
- Raj Rana (Stevens Institute of Technology)
- Kassie Papasotiriou
- Annita Vapsi
- Antony Papadimitriou
- Samy Lokanandi
- Jesse Dylan Ward
- Yahoo Finance API: For retrieving historical stock price data.
- Python Libraries:
  - pandas: Data manipulation and analysis.
  - numpy: Numerical computing.
  - scikit-learn: Machine learning and data