This project develops a model for predicting taxi fares during rush-hour periods in New York City. It combines clustering and regression techniques, and the resulting insights can benefit ride-hailing companies such as Uber and InDrive.
The primary goal is to predict taxi fares accurately during high-demand rush-hour periods. The analysis covers data preprocessing, clustering, and regression to model and forecast fares from features such as location-specific demand, distance to airports, and pickup patterns.
- Source: Official New York Transportation Data.
- Size: Over 3 million rows.
- Key Features:
  - Pickup and Drop-off Locations
  - Fare Amount
  - Passenger Count
  - Distance Metrics
  - Time and Date of Pickup
- Type casting for optimal memory usage.
- Cleaning invalid or missing entries.
- Feature engineering to encode characteristics of pickup locations, including:
  - Demand
  - Demand Volatility
  - Distance to Airport
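
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names (`fare_amount`, `passenger_count`, `pickup_latitude`, `pickup_longitude`, `pickup_datetime`, `pickup_zone`), the approximate JFK coordinates, and the definitions of demand (mean pickups per hour in a zone) and demand volatility (standard deviation of those hourly counts) are all assumptions.

```python
import numpy as np
import pandas as pd

# Approximate JFK coordinates (latitude, longitude); an assumption for illustration.
JFK = (40.6413, -73.7781)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))


def preprocess(df):
    """Drop invalid rows, downcast numeric columns, add distance to the airport."""
    df = df.dropna(subset=["fare_amount", "pickup_latitude", "pickup_longitude"])
    # Comparisons with NaN are False, so rows missing passenger_count are dropped too.
    df = df[(df["fare_amount"] > 0) & (df["passenger_count"] > 0)].copy()
    # Smaller dtypes reduce memory on multi-million-row frames.
    df["fare_amount"] = df["fare_amount"].astype("float32")
    df["passenger_count"] = df["passenger_count"].astype("int8")
    df["dist_to_jfk_km"] = haversine_km(
        df["pickup_latitude"], df["pickup_longitude"], *JFK
    )
    return df


def add_demand_features(df, zone_col="pickup_zone"):
    """Attach per-zone demand and demand volatility (hypothetical definitions)."""
    # Number of pickups per zone and hour.
    hourly = df.groupby([zone_col, df["pickup_datetime"].dt.floor("h")]).size()
    # Mean and standard deviation of the hourly counts within each zone.
    stats = hourly.groupby(level=0).agg(demand="mean", demand_volatility="std")
    return df.join(stats, on=zone_col)
```

A zone observed in only one hour gets a missing `demand_volatility`, so such rows would need imputing or dropping before modelling.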
- Used matplotlib for histograms, boxplots, and time-series visualizations.
- Visualized fare-related variables on a New York City map using GeoPandas.
- Models used:
  - KMeans
  - DBSCAN
  - Spectral Clustering
  - Gaussian Mixture Model
- Evaluation Metrics:
  - Silhouette Score
  - Davies-Bouldin Index
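
As an illustration of how the two metrics above can be used to compare clusterings, the sketch below scores KMeans for several cluster counts with scikit-learn. The synthetic data, the `score_kmeans` helper, and the choice to standardize features first are assumptions made for the example, not details taken from the project.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import StandardScaler


def score_kmeans(X, k_values, random_state=0):
    """Fit KMeans for each k and return both metrics per k."""
    X_scaled = StandardScaler().fit_transform(X)
    results = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X_scaled)
        results[k] = {
            # Silhouette: higher is better, range [-1, 1].
            "silhouette": silhouette_score(X_scaled, labels),
            # Davies-Bouldin: lower is better, minimum 0.
            "davies_bouldin": davies_bouldin_score(X_scaled, labels),
        }
    return results


# Synthetic stand-in for pickup-location features: three well-separated blobs.
rng = np.random.default_rng(0)
centers = ((0, 0), (5, 5), (0, 5))
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in centers])

scores = score_kmeans(X, k_values=range(2, 6))
best_k = max(scores, key=lambda k: scores[k]["silhouette"])
```

The same two scoring calls work on labels from the other listed models. Both metrics need at least two distinct labels, and DBSCAN marks noise points with the label `-1`, which would normally be filtered out before scoring.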
- Models Applied:
  - Linear Regression
  - Ridge Regression
  - Lasso Regression
  - Support Vector Regressor (SVR)
  - Random Forest
  - Gradient Boosting
  - XGBoost
- Simple and Stochastic Hill Climbing
- Simulated Annealing
- Optimal Weighted Ensemble
- Stacking
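
One way a weighted ensemble could be tuned by stochastic hill climbing is sketched below. It is a hypothetical illustration rather than the project's custom implementation: the function names, the Gaussian perturbation, the RMSE objective, and the synthetic validation predictions are all assumptions.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def hill_climb_weights(preds, y_true, n_iter=2000, step=0.05, seed=0):
    """Search for non-negative weights summing to 1 that minimize RMSE.

    preds has shape (n_models, n_samples): held-out predictions per model.
    """
    rng = np.random.default_rng(seed)
    n_models = preds.shape[0]
    weights = np.full(n_models, 1.0 / n_models)  # start from a plain average
    best = rmse(y_true, weights @ preds)
    for _ in range(n_iter):
        # Propose a random neighbour, then project back onto valid weights.
        candidate = np.clip(weights + rng.normal(0.0, step, n_models), 0.0, None)
        if candidate.sum() == 0:
            continue
        candidate /= candidate.sum()
        score = rmse(y_true, candidate @ preds)
        # Hill climbing: keep the neighbour only if it improves the score.
        if score < best:
            weights, best = candidate, score
    return weights, best


# Synthetic validation set: three "models" with increasing noise levels.
rng = np.random.default_rng(1)
y_val = rng.uniform(5, 60, size=500)
preds = np.vstack([y_val + rng.normal(0, s, size=500) for s in (0.5, 1.0, 2.0)])

weights, best = hill_climb_weights(preds, y_val)
```

Simulated annealing differs mainly in the acceptance rule: it sometimes accepts a worse candidate, with a probability that shrinks as a temperature parameter is lowered, which helps the search escape local optima. The weights should be fitted on held-out predictions so the ensemble does not simply reward the model that overfits the training data most.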
- Clustering revealed distinct pickup location patterns that significantly influenced fare amounts.
- Ensemble methods outperformed individual regression models in predicting fares during rush hours.
- Feature-engineered variables like demand volatility and airport distances enhanced prediction accuracy.
- Languages: Python
- Libraries:
  - Data Manipulation: Pandas, NumPy
  - Visualization: Matplotlib, GeoPandas
  - Clustering and Regression: scikit-learn, XGBoost
  - Ensemble Techniques: Custom implementations
- Incorporate real-time traffic data to enhance prediction accuracy.
- Explore deep learning models like RNNs and LSTMs for time-series predictions.
- Expand the dataset to include multi-city data for broader applicability.