New York Taxi Fare Analysis

This project develops a model for predicting taxi fares during rush-hour periods in New York City. It combines clustering and regression techniques, and its findings may be useful to ride-hailing companies such as Uber and InDrive.

Project Overview

The primary goal was to predict taxi fares accurately during high-demand rush-hour periods. The analysis covers data preprocessing, clustering, and regression, and models fares from features such as location-specific demand, distance to airports, and pickup patterns.

Dataset

Source: Official New York Transportation Data.
Size: Over 3 million rows.
Key Features:
- Pickup and Drop-off Locations
- Fare Amount
- Passenger Count
- Distance Metrics
- Time and Date of Pickup

Data Preprocessing

Type casting for optimal memory usage.
Cleaning invalid or missing entries.
Feature engineering to encode characteristics of pickup locations, including:
- Demand
- Demand Volatility
- Distance to Airport
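
The preprocessing steps above could look roughly like the following sketch. It is not taken from the project's notebooks: the column names (`pickup_zone`, `pickup_datetime`, `fare_amount`, `pickup_lat`, `pickup_lon`), the definitions of demand (mean pickups per active hour in a zone) and demand volatility (standard deviation of those hourly counts), and the choice of JFK as the reference airport are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Approximate JFK coordinates (assumption; the project may use other airports).
JFK_LAT, JFK_LON = 40.6413, -73.7781


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Cast numeric columns to the smallest dtype pandas considers safe."""
    out = df.copy()
    for col in out.select_dtypes(include=np.floating).columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    for col in out.select_dtypes(include=np.integer).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    return out


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing key fields or non-positive fares."""
    out = df.dropna(subset=["fare_amount", "pickup_lat", "pickup_lon"])
    return out[out["fare_amount"] > 0]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))


def add_airport_distance(df: pd.DataFrame) -> pd.DataFrame:
    """Add the pickup-to-airport distance as a new column."""
    out = df.copy()
    out["dist_to_airport_km"] = haversine_km(
        out["pickup_lat"], out["pickup_lon"], JFK_LAT, JFK_LON
    )
    return out


def zone_features(df: pd.DataFrame) -> pd.DataFrame:
    """Per-zone demand and demand volatility from hourly pickup counts.

    Only hours with at least one pickup are counted.
    """
    hourly = (
        df.groupby(["pickup_zone", df["pickup_datetime"].dt.floor("60min")])
        .size()
        .rename("pickups")
        .reset_index()
    )
    stats = hourly.groupby("pickup_zone")["pickups"].agg(
        demand="mean", demand_volatility="std"
    )
    # A zone with a single active hour has no defined std; treat it as 0.
    return stats.fillna(0.0).reset_index()
```

The zone-level features can then be merged back onto the trip table by `pickup_zone` before modelling.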

Methodology

Exploratory Data Analysis (EDA)

Used matplotlib for histograms, boxplots, and time-series visualizations.
Visualized fare-related variables on a New York City map using GeoPandas.
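
As a rough illustration of the matplotlib part of the EDA, the sketch below draws a fare histogram next to per-hour boxplots. The column names `fare_amount` and `pickup_hour`, the function name, and the output path are hypothetical; the project's actual plots and the GeoPandas map are not reproduced here.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_fare_overview(df: pd.DataFrame, path: str = "fare_overview.png") -> str:
    """Save a two-panel figure: fare histogram and fare boxplots by pickup hour."""
    fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

    ax_hist.hist(df["fare_amount"].to_numpy(), bins=30)
    ax_hist.set_xlabel("Fare amount")
    ax_hist.set_ylabel("Trips")
    ax_hist.set_title("Fare distribution")

    hours = sorted(df["pickup_hour"].unique())
    groups = [
        df.loc[df["pickup_hour"] == h, "fare_amount"].to_numpy() for h in hours
    ]
    ax_box.boxplot(groups)  # boxes are drawn at positions 1..len(hours)
    ax_box.set_xticks(range(1, len(hours) + 1))
    ax_box.set_xticklabels([str(h) for h in hours])
    ax_box.set_xlabel("Pickup hour")
    ax_box.set_ylabel("Fare amount")
    ax_box.set_title("Fare by pickup hour")

    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)
    return path
```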

Clustering

Models used:
- KMeans
- DBSCAN
- Spectral Clustering
- Gaussian Mixture Model
Evaluation Metrics:
- Silhouette Score
- Davies-Bouldin Index
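
One way to compare the four clustering models on the two metrics is sketched below with scikit-learn. The hyperparameters (`n_clusters=3`, `eps=0.5`, `min_samples=5`), the standardization step, and the decision to exclude DBSCAN noise points from scoring are assumptions for the example, not the project's settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans, SpectralClustering
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler


def compare_clusterers(X, n_clusters=3, seed=0):
    """Fit each model and report silhouette (higher is better) and
    Davies-Bouldin (lower is better) on standardized features."""
    X = StandardScaler().fit_transform(np.asarray(X, dtype=float))
    models = {
        "KMeans": KMeans(n_clusters=n_clusters, n_init=10, random_state=seed),
        "DBSCAN": DBSCAN(eps=0.5, min_samples=5),
        "Spectral": SpectralClustering(n_clusters=n_clusters, random_state=seed),
        "GMM": GaussianMixture(n_components=n_clusters, random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        labels = model.fit_predict(X)
        mask = labels != -1  # DBSCAN labels noise as -1; leave it out of scoring
        if len(set(labels[mask])) < 2:
            scores[name] = None  # both metrics need at least two clusters
            continue
        scores[name] = {
            "silhouette": float(silhouette_score(X[mask], labels[mask])),
            "davies_bouldin": float(davies_bouldin_score(X[mask], labels[mask])),
        }
    return scores
```

Because the two metrics point in opposite directions, a model that ranks well on both is usually the safer pick.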

Regression

Models Applied:
- Linear Regression
- Ridge Regression
- Lasso Regression
- Support Vector Regressor (SVR)
- Random Forest
- Gradient Boosting
- XGBoost
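
The listed models can be compared under a common protocol, for instance cross-validated RMSE, as in this sketch. The hyperparameters and the use of 5-fold cross-validation are assumptions; XGBoost is left out to keep the example dependent on scikit-learn only, though `xgboost.XGBRegressor` could be added to the dictionary in the same way if the package is installed.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def compare_regressors(X, y, cv=5, seed=0):
    """Return the mean cross-validated RMSE for each candidate model."""
    models = {
        "Linear": LinearRegression(),
        # Regularized and kernel models are scale-sensitive, so standardize first.
        "Ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
        "Lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.1)),
        "SVR": make_pipeline(StandardScaler(), SVR(C=10.0)),
        "RandomForest": RandomForestRegressor(n_estimators=100, random_state=seed),
        "GradientBoosting": GradientBoostingRegressor(random_state=seed),
    }
    results = {}
    for name, model in models.items():
        neg_rmse = cross_val_score(
            model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
        )
        results[name] = float(-neg_rmse.mean())  # flip sign: lower is better
    return results
```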

Ensemble Techniques

Simple and Stochastic Hill Climbing
Simulated Annealing
Optimal Weighted Ensemble
Stacking
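
The project's ensemble code is described as custom and is not shown, so the following is only a minimal sketch of how stochastic hill climbing and simulated annealing might search for blending weights. It assumes `preds` is a matrix with one column of held-out predictions per base model, that weights are non-negative and sum to 1, and that RMSE is the objective; the step size, iteration count, and cooling schedule are arbitrary choices.

```python
import numpy as np


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def _normalize(w):
    """Project onto non-negative weights that sum to 1."""
    w = np.clip(w, 0.0, None)
    total = w.sum()
    return w / total if total > 0 else np.full(len(w), 1.0 / len(w))


def hill_climb_weights(preds, y, n_iter=2000, step=0.05, seed=0):
    """Stochastic hill climbing: keep a random perturbation only if RMSE improves."""
    rng = np.random.default_rng(seed)
    w = np.full(preds.shape[1], 1.0 / preds.shape[1])  # start from equal weights
    best = rmse(y, preds @ w)
    for _ in range(n_iter):
        cand = _normalize(w + rng.normal(scale=step, size=w.shape))
        score = rmse(y, preds @ cand)
        if score < best:
            w, best = cand, score
    return w, best


def anneal_weights(preds, y, n_iter=2000, step=0.05, t0=1.0, cooling=0.995, seed=0):
    """Simulated annealing: worse candidates are accepted with probability
    exp(-delta / T), and T shrinks geometrically each iteration."""
    rng = np.random.default_rng(seed)
    w = np.full(preds.shape[1], 1.0 / preds.shape[1])
    cur = rmse(y, preds @ w)
    best_w, best = w, cur
    temp = t0
    for _ in range(n_iter):
        cand = _normalize(w + rng.normal(scale=step, size=w.shape))
        score = rmse(y, preds @ cand)
        if score < cur or rng.random() < np.exp(-(score - cur) / max(temp, 1e-12)):
            w, cur = cand, score
            if cur < best:  # remember the best state visited so far
                best_w, best = w, cur
        temp *= cooling
    return best_w, best
```

Weights should be tuned on predictions the base models did not train on, otherwise the blend tends to favor whichever model overfits most.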

Key Findings

Clustering revealed distinct pickup location patterns that significantly influenced fare amounts.
Ensemble methods outperformed individual regression models in predicting fares during rush hours.
Engineered features such as demand volatility and distance to airport improved prediction accuracy.

Tools and Technologies

Languages: Python
Libraries:
- Data Manipulation: Pandas, NumPy
- Visualization: Matplotlib, GeoPandas
- Clustering and Regression: scikit-learn, XGBoost
- Ensemble Techniques: Custom implementations

Future Work

Incorporate real-time traffic data to enhance prediction accuracy.
Explore recurrent deep learning models such as LSTMs for time-series predictions.
Expand the dataset to include multi-city data for broader applicability.

haverstein/NewYork-Taxi-Fare-Analysis
