A project on clustering the New York zones and then applying regression techniques to model the fare price of taxi rides

haverstein/NewYork-Taxi-Fare-Analysis

New York Taxi Fare Analysis

This project builds a model for predicting taxi fares during rush-hour periods in New York City. It first clusters pickup zones and then applies regression techniques to model fares; the resulting insights are relevant to ride-hailing companies such as Uber and InDrive.


Table of Contents

  1. Project Overview
  2. Dataset
  3. Methodology
  4. Key Findings
  5. Tools and Technologies
  6. Usage
  7. Future Work

Project Overview

The primary goal of this project is to predict taxi fares accurately during high-demand rush-hour periods. The analysis combines data preprocessing, clustering, and regression to model and forecast fares from features such as location-specific demand, distance to airports, and pickup patterns.


Dataset

  • Source: Official New York Transportation Data.
  • Size: Over 3 million rows.
  • Key Features:
    • Pickup and Drop-off Locations
    • Fare Amount
    • Passenger Count
    • Distance Metrics
    • Time and Date of Pickup

Data Preprocessing

  • Type casting for optimal memory usage.
  • Cleaning invalid or missing entries.
  • Feature engineering to encode characteristics of pickup locations, including:
    • Demand
    • Demand Volatility
    • Distance to Airport
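
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration on a toy table, not the project's actual code: the column names, the JFK coordinates, the choice of "trips per zone per hour" as the demand measure, and the use of the standard deviation of that count as volatility are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Assumed airport location (approximate JFK coordinates), for illustration only.
JFK_LAT, JFK_LON = 40.6413, -73.7781


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))


# Toy trips table; the schema is hypothetical.
trips = pd.DataFrame({
    "pickup_zone": [1, 1, 1, 1, 1, 2, 2, 2],
    "pickup_hour": [8, 8, 8, 9, 9, 8, 9, 9],
    "pickup_lat": [40.75] * 5 + [40.65] * 3,
    "pickup_lon": [-73.99] * 5 + [-73.80] * 3,
    "fare_amount": [12.5, 9.0, 14.0, 11.0, None, 52.0, 48.0, -3.0],
    "passenger_count": [1, 2, 1, 1, 1, 1, 2, 3],
})

# Cleaning: drop rows with a missing or non-positive fare.
trips = trips.dropna(subset=["fare_amount"])
trips = trips[trips["fare_amount"] > 0]

# Type casting: smaller dtypes reduce memory on multi-million-row tables.
trips = trips.astype({
    "pickup_zone": "int16",
    "pickup_hour": "int8",
    "passenger_count": "int8",
    "fare_amount": "float32",
})

# Demand: trips per zone per hour, summarised per zone as mean and std.
hourly = (trips.groupby(["pickup_zone", "pickup_hour"])
               .size()
               .reset_index(name="trips"))
zone_features = hourly.groupby("pickup_zone")["trips"].agg(
    demand="mean", demand_volatility="std")

# Distance to airport, measured from each zone's mean pickup coordinates.
centroids = trips.groupby("pickup_zone")[["pickup_lat", "pickup_lon"]].mean()
zone_features["airport_km"] = haversine_km(
    centroids["pickup_lat"], centroids["pickup_lon"], JFK_LAT, JFK_LON)

print(zone_features)
```

The resulting per-zone table can then be joined back onto individual trips by zone so that each ride carries the characteristics of its pickup location.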

Methodology

Exploratory Data Analysis (EDA)

  • Used matplotlib for histograms, boxplots, and time-series visualizations.
  • Visualized fare-related variables on a New York City map using GeoPandas.
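
As a small illustration of the matplotlib part of the EDA, the sketch below draws a fare histogram and a rush-hour versus off-peak boxplot. The fares are synthetic, and the set of hours treated as rush hour is an assumption made for the example; the map visualizations with GeoPandas are not reproduced here.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data standing in for the real trips table.
fares = rng.gamma(shape=2.0, scale=8.0, size=1000)
hours = rng.integers(0, 24, size=1000)

# Assumed rush-hour definition, for illustration only.
rush = np.isin(hours, [7, 8, 9, 16, 17, 18])

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram of fare amounts.
counts, edges, _ = ax_hist.hist(fares, bins=30)
ax_hist.set_xlabel("Fare amount (USD)")
ax_hist.set_ylabel("Number of trips")
ax_hist.set_title("Fare distribution")

# Boxplot comparing rush-hour and off-peak fares.
ax_box.boxplot([fares[rush], fares[~rush]])
ax_box.set_xticklabels(["Rush hour", "Off-peak"])
ax_box.set_ylabel("Fare amount (USD)")
ax_box.set_title("Fares by period")

fig.tight_layout()
```

Calling `fig.savefig("fare_eda.png")` afterwards writes the figure to disk.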

Clustering

  • Models used:
    • KMeans
    • DBSCAN
    • Spectral Clustering
    • Gaussian Mixture Model
  • Evaluation Metrics:
    • Silhouette Score
    • Davies-Bouldin Index
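
A comparison of the four clustering models on the two metrics could look roughly like the sketch below. It uses synthetic blobs rather than the taxi zones, and the hyperparameters (three clusters, `eps=0.3`) are placeholders, not the values used in the project. Excluding DBSCAN noise points before scoring is a choice made for this example.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans, SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic, well-separated data standing in for zone-level features.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [-5, 5]],
                  cluster_std=0.6, random_state=0)
X = StandardScaler().fit_transform(X)

models = {
    "KMeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "DBSCAN": DBSCAN(eps=0.3, min_samples=5),
    "SpectralClustering": SpectralClustering(n_clusters=3, random_state=0),
    "GaussianMixture": GaussianMixture(n_components=3, random_state=0),
}

scores = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    mask = labels != -1  # DBSCAN labels noise points as -1
    if len(set(labels[mask])) < 2:
        continue  # both metrics need at least two clusters
    scores[name] = (
        silhouette_score(X[mask], labels[mask]),      # higher is better
        davies_bouldin_score(X[mask], labels[mask]),  # lower is better
    )

for name, (sil, dbi) in scores.items():
    print(f"{name:20s} silhouette={sil:.3f}  davies_bouldin={dbi:.3f}")
```

The two metrics point in opposite directions: a higher silhouette score and a lower Davies-Bouldin index both indicate tighter, better-separated clusters.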

Regression

  • Models Applied:
    • Linear Regression
    • Ridge Regression
    • Lasso Regression
    • Support Vector Regressor (SVR)
    • Random Forest
    • Gradient Boosting
    • XGBoost
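
The regression models listed above share scikit-learn's `fit`/`predict` interface, so they can be compared in a single loop. The sketch below does this on synthetic data: the feature names and the fare formula are invented for the example, and the hyperparameters are placeholders. XGBoost is left out to avoid the extra dependency; `xgboost.XGBRegressor` follows the same interface and could be added to the dictionary.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n = 500

# Hypothetical features; the real project uses its own engineered columns.
distance_km = rng.uniform(0.5, 25.0, n)
demand = rng.uniform(0.0, 1.0, n)
airport_km = rng.uniform(1.0, 30.0, n)
X = np.column_stack([distance_km, demand, airport_km])

# Invented fare formula with Gaussian noise, for illustration only.
y = 3.0 + 1.8 * distance_km + 4.0 * demand + rng.normal(0.0, 1.5, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    # SVR is scale-sensitive, so it is wrapped with a scaler.
    "SVR": make_pipeline(StandardScaler(), SVR(C=100.0)),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=0),
    "GradientBoosting": GradientBoostingRegressor(random_state=0),
}

rmse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse[name] = float(np.sqrt(mean_squared_error(y_test, pred)))

for name, value in sorted(rmse.items(), key=lambda item: item[1]):
    print(f"{name:18s} RMSE={value:.3f}")
```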

Ensemble Techniques

  • Simple and Stochastic Hill Climbing
  • Simulated Annealing
  • Optimal Weighted Ensemble
  • Stacking
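
To illustrate how a search method can tune a weighted ensemble, the sketch below runs a stochastic hill climb over blending weights to minimize validation RMSE. It is a minimal example under stated assumptions, not the project's custom implementation: the function name `hill_climb_weights`, the Gaussian perturbation step, and the synthetic "model predictions" are all hypothetical.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def hill_climb_weights(preds, y_true, n_iter=2000, step=0.05, seed=0):
    """Stochastic hill climbing over ensemble weights.

    preds has shape (n_models, n_samples). Weights stay non-negative and
    sum to 1. A random perturbation is kept only if it lowers the RMSE.
    """
    rng = np.random.default_rng(seed)
    n_models = preds.shape[0]
    weights = np.full(n_models, 1.0 / n_models)  # start from equal weights
    best = rmse(y_true, weights @ preds)
    for _ in range(n_iter):
        candidate = np.clip(weights + rng.normal(0.0, step, n_models), 0.0, None)
        if candidate.sum() == 0.0:
            continue  # degenerate proposal, skip it
        candidate /= candidate.sum()
        score = rmse(y_true, candidate @ preds)
        if score < best:
            weights, best = candidate, score
    return weights, best


# Synthetic validation targets and three "models" with different noise levels.
rng = np.random.default_rng(1)
y_val = rng.uniform(5.0, 60.0, 200)
preds = np.vstack([y_val + rng.normal(0.0, s, 200) for s in (1.0, 3.0, 6.0)])

equal_rmse = rmse(y_val, preds.mean(axis=0))
weights, best = hill_climb_weights(preds, y_val)
print("equal-weight RMSE:", round(equal_rmse, 3))
print("hill-climbed RMSE:", round(best, 3), "weights:", weights.round(3))
```

Simulated annealing differs from this loop in one respect: it sometimes accepts a worse candidate, with a probability that shrinks as a temperature parameter is lowered, which helps it escape local minima. Stacking replaces the fixed weights with a meta-model trained on the base models' predictions; scikit-learn provides `StackingRegressor` for this.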

Key Findings

  • Clustering revealed distinct pickup location patterns that significantly influenced fare amounts.
  • Ensemble methods outperformed individual regression models in predicting fares during rush hours.
  • Feature-engineered variables like demand volatility and airport distances enhanced prediction accuracy.

Tools and Technologies

  • Languages: Python
  • Libraries:
    • Data Manipulation: Pandas, NumPy
    • Visualization: Matplotlib, GeoPandas
    • Clustering and Regression: scikit-learn, XGBoost
    • Ensemble Techniques: Custom implementations

Future Work

  • Incorporate real-time traffic data to enhance prediction accuracy.
  • Explore deep learning models such as recurrent neural networks (RNNs), including LSTMs, for time-series predictions.
  • Expand the dataset to include multi-city data for broader applicability.
