
fix: correct timestamp calculations in flight datasets & add generation script #626

Merged
6 commits merged into vega:main on Nov 11, 2024

Conversation

dsmedia
Collaborator

@dsmedia dsmedia commented Nov 10, 2024

Flights Dataset Improvements & Generation Script

Resolves #607

Overview

This PR addresses known issues with the flights datasets and introduces a versatile generation script.

Key Changes

Timestamp Calculation Fix

Affects: flights-2k/5k/10k/20k.json

Fixed incorrect departure timestamps by properly handling date-boundary cases. Previously, each timestamp incorrectly combined:

  • the scheduled flight date
  • the actual departure time

so flights whose actual departure crossed midnight were stamped with the wrong day.

Example Issues:

  • Late-night delay crossing midnight
    Scheduled: Jan 1, 11:45 PM + 45 min delay
    Old: Jan 1, 12:30 AM (incorrect)
    New: Jan 2, 12:30 AM (correct)
  • Early-morning departure leaving ahead of schedule
    Scheduled: Jan 2, 12:30 AM − 45 min early
    Old: Jan 2, 11:45 PM (incorrect)
    New: Jan 1, 11:45 PM (correct)

Implementation Note: 2k/5k/10k/20k datasets corrected in-place rather than through resampling to minimize impact on existing applications. Minor distance variations (~1 mile) may appear due to source data reconciliation.
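The boundary cases above come down to computing the actual departure as the full scheduled datetime plus the delay, rather than pasting the actual clock time onto the scheduled date. A minimal sketch in pandas (illustrative values, not taken from the datasets):

```python
import pandas as pd

scheduled = pd.Timestamp("2001-01-01 23:45")
delay = pd.Timedelta(minutes=45)

# Old (buggy): keep the scheduled date, substitute the actual clock time
actual_clock = (scheduled + delay).strftime("%H:%M:%S")
old = pd.Timestamp(f"2001-01-01 {actual_clock}")  # Jan 1, 00:30 -- wrong day

# New (fixed): add the delay to the full scheduled datetime
new = scheduled + delay                           # Jan 2, 00:30 -- correct day
```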

In-place correction script (for reference)
import pandas as pd
import glob
import zipfile
import datetime as dt
import json

def load_source_data(zip_path):
    """Load and combine CSV files from ZIP archives."""
    dfs = []
    for zip_file in glob.glob(zip_path):
        with zipfile.ZipFile(zip_file) as z:
            for filename in z.namelist():
                if filename.endswith('.csv'):
                    df_temp = pd.read_csv(z.open(filename), encoding='ISO-8859-1')
                    dfs.append(df_temp)
    
    return pd.concat(dfs, ignore_index=True)

def process_time_columns(df):
    """Process and standardize time-related columns."""
    # Select relevant columns
    columns = ['FlightDate', 'CRSDepTime', 'DepTime', 'DepDelay', 
               'ArrDelay', 'Distance', 'Origin', 'Dest', 'Cancelled']
    df = df[columns].dropna(subset=['DepDelay', 'ArrDelay'])

    # Process actual departure times
    df['DepTime'] = df['DepTime'].fillna(0).astype(int)
    df['DepTime'] = df['DepTime'].apply(lambda x: '0000' if x == 2400 else f"{x:04d}")
    df['DepTimeFormatted'] = pd.to_datetime(df['DepTime'], format='%H%M').dt.strftime('%H:%M:%S')
    df['RepoDateMatch'] = pd.to_datetime(df['FlightDate']) + pd.to_timedelta(df['DepTimeFormatted'])

    # Process scheduled departure times
    df['CRSDepTime'] = df['CRSDepTime'].fillna(0).astype(int)
    df['CRSDepTime'] = df['CRSDepTime'].apply(lambda x: '0000' if x == 2400 else f"{x:04d}")
    df['CRSDepTimeFormatted'] = pd.to_datetime(df['CRSDepTime'], format='%H%M').dt.strftime('%H:%M:%S')
    df['ScheduledDepDateTime'] = pd.to_datetime(df['FlightDate']) + pd.to_timedelta(df['CRSDepTimeFormatted'])

    # Calculate actual departure datetime
    df['ActualDepDateTime'] = df['ScheduledDepDateTime'] + pd.to_timedelta(df['DepDelay'], unit='minutes')
    
    return df

def merge_with_target_datasets(source_df, target_dfs):
    """Merge processed source data with target datasets."""
    processed_dfs = {}
    
    for df_name, target_df in target_dfs.items():
        # Create temporary delay column
        temp_df = target_df.copy()
        temp_df['delay_int'] = temp_df['delay'].astype(int)
        
        # Merge with source data
        merged_df = temp_df.merge(
            source_df[['RepoDateMatch', 'ArrDelay', 'Origin', 'Dest', 
                      'ActualDepDateTime', 'Distance']].drop_duplicates(
                ['RepoDateMatch', 'ArrDelay', 'Origin', 'Dest'], 
                keep='first'
            ),
            left_on=['date', 'delay_int', 'origin', 'destination'],
            right_on=['RepoDateMatch', 'ArrDelay', 'Origin', 'Dest'],
            how='left'
        )
        
        # Clean up merged dataframe
        processed_dfs[df_name] = merged_df.drop(
            ['RepoDateMatch', 'ArrDelay', 'Origin', 'Dest', 'delay_int'], 
            axis=1
        ).rename(columns={'Distance': 'DistanceFromSource'})
        
        # Update date column with actual departure time
        processed_dfs[df_name]['date'] = processed_dfs[df_name]['ActualDepDateTime']
        processed_dfs[df_name] = processed_dfs[df_name][
            ['date', 'delay', 'DistanceFromSource', 'origin', 'destination']
        ]
    
    return processed_dfs

def export_processed_data(processed_dfs, output_path_template):
    """Export processed dataframes to JSON files."""
    for df_name, df in processed_dfs.items():
        # Prepare data for export
        export_df = df.copy()
        export_df['date'] = export_df['date'].dt.strftime('%Y/%m/%d %H:%M')
        export_df['delay'] = export_df['delay'].astype(int)
        export_df = export_df.dropna(subset=['DistanceFromSource'])
        export_df['DistanceFromSource'] = export_df['DistanceFromSource'].astype(int)
        export_df = export_df.rename(columns={'DistanceFromSource': 'distance'})
        
        # Export to JSON
        output_path = output_path_template.format(df_name)
        with open(output_path, 'w') as f:
            json.dump(export_df.to_dict('records'), f, 
                     ensure_ascii=False, separators=(',',':'))

def main():
    # Load source data
    source_df = load_source_data('scripts/tmp/*.zip')
    
    # Process time columns
    processed_source_df = process_time_columns(source_df)
    
    # Load target datasets
    target_dfs = {
        'flights-2k': pd.read_json('data/flights-2k.json'),
        'flights-5k': pd.read_json('data/flights-5k.json'),
        'flights-10k': pd.read_json('data/flights-10k.json'),
        'flights-20k': pd.read_json('data/flights-20k.json')
    }
    
    # Merge datasets
    processed_dfs = merge_with_target_datasets(processed_source_df, target_dfs)
    
    # Export processed data
    export_processed_data(processed_dfs, 'data/new{}.json')

if __name__ == '__main__':
    main()
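As an aside, `process_time_columns` above zero-pads the raw BTS `HHMM` integers and remaps 2400 (BTS's encoding of midnight) to `0000`, since the `%H%M` format cannot parse hour 24. A quick sketch of just that conversion, on illustrative values:

```python
import pandas as pd

dep = pd.Series([2400, 5, 1330])  # raw BTS HHMM integers
# 2400 means midnight in BTS data; %H%M only accepts hours 00-23
dep = dep.apply(lambda x: "0000" if x == 2400 else f"{x:04d}")
as_time = pd.to_datetime(dep, format="%H%M").dt.strftime("%H:%M:%S")
print(list(as_time))  # ['00:00:00', '00:05:00', '13:30:00']
```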

Dataset Completeness

Affects: flights-3m.csv

  • Restored to full 3 million rows
  • Implements same timestamp calculation fix as above

New Generation Script

New file: flights.py

A comprehensive tool for processing U.S. DOT Bureau of Transportation Statistics (BTS) On-Time Flight Performance data with greater flexibility. It produces the date formats used in this repo.

Output Format Options

| Format | Example | Description |
|---|---|---|
| MMDDHHMM | 06142330 | June 14, 23:30 |
| ISO | 2023/06/14 23:30 | ISO-style datetime |
| Decimal | 23.75 | 23:45 in decimal form |
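All three formats can be derived from a single datetime with the standard library; a small sketch (variable names are illustrative, not flights.py's actual API):

```python
from datetime import datetime

ts = datetime(2023, 6, 14, 23, 30)
mmddhhmm = ts.strftime("%m%d%H%M")          # "06142330"
iso_style = ts.strftime("%Y/%m/%d %H:%M")   # "2023/06/14 23:30"
decimal = ts.hour + ts.minute / 60          # 23.5 (23:45 would be 23.75)
```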

Key Features

  • Customization options
    • Row count control
    • Date range filtering
    • Column selection
    • Random sampling with seed
    • Date change flagging
  • Data processing
    • Proper date boundary handling
    • Multiple output formats (CSV/JSON)
    • Comprehensive statistics generation

📚 Next Steps

Update docs. Confirm that .arrow version of flights data is formatted as intended.

@domoritz
Member

@dsmedia
Collaborator Author

dsmedia commented Nov 10, 2024

Thank you. Should we delete https://github.com/vega/vega-datasets/blob/main/scripts/flights.js?

flights.js would be used to generate random subsets from the (already processed) flights-3m.csv, while flights.py generates datasets from the original source data, which is not available in the repo. An advantage of keeping the Python version is that it can generate datasets even larger than 3 million rows and can also process newer data. I think flights.js can be retired. It's undoubtedly convenient to have a script that works with local files, but all else equal I favor generation scripts that can work back from original sources.

@domoritz
Member

Ah, makes sense. Thanks.

@domoritz
Member

I want to look at the actual diffs in the files. It's a bit tricky to do that in the web interface.

@domoritz
Member

Why did some of the distances change?

[Screenshot, 2024-11-11: file diff showing changed distance values]

Other than that, looks good.

@dsmedia
Collaborator Author

dsmedia commented Nov 11, 2024

Why did some of the distances change?

Great question! Looking into the 20k dataset, I found that 1,440 flights (7.2%) show slightly different distances (each off by exactly 1 mile). Notably, the changes all involve flights to or from six airports: SJC, ONT, BWI, OAK, MCO, LAX. Without knowing exactly when or how the source data was originally pulled, it's difficult to pinpoint the cause.

Some analysis in Python:
import requests
import pandas as pd
from collections import defaultdict

# URLs for the JSON files
original_url = "https://raw.githubusercontent.com/vega/vega-datasets/main/data/flights-20k.json"
revised_url = "https://raw.githubusercontent.com/vega/vega-datasets/91ddb14165126ade0baa318da5e676095d09a129/data/flights-20k.json"

# Fetch and load both JSONs
original_data = requests.get(original_url).json()
revised_data = requests.get(revised_url).json()

def normalize_pair(airport1, airport2):
    """Return airports in alphabetical order to normalize pairs"""
    return tuple(sorted([airport1, airport2]))

def get_airport_stats(flights_data, different_pairs, excluded_airports=set()):
    """Calculate stats for airports, excluding specified airports"""
    # Get all airports involved in changed pairs, excluding the ones we've already processed
    affected_airports = set()
    for pair in different_pairs:
        if not (pair[0] in excluded_airports or pair[1] in excluded_airports):
            affected_airports.update(pair)
    
    affected_airports = affected_airports - excluded_airports
    
    if not affected_airports:
        return None
    
    airport_stats = []
    for airport in affected_airports:
        total_flights = sum(1 for flight in flights_data 
                          if (flight['origin'] == airport or flight['destination'] == airport) and
                          not (flight['origin'] in excluded_airports or flight['destination'] in excluded_airports))
        
        changed_flights = sum(1 for flight in flights_data 
                            if (flight['origin'] == airport or flight['destination'] == airport) and
                            not (flight['origin'] in excluded_airports or flight['destination'] in excluded_airports) and
                            normalize_pair(flight['origin'], flight['destination']) in different_pairs)
        
        if total_flights > 0:
            percent_changed = (changed_flights / total_flights * 100)
            airport_stats.append({
                'airport': airport,
                'total_flights': total_flights,
                'changed_flights': changed_flights,
                'percent_changed': round(percent_changed, 2)
            })
    
    return pd.DataFrame(airport_stats)

# First identify all pairs with different distances and collect change information
original_distances = defaultdict(set)
revised_distances = defaultdict(set)
different_pairs = set()
changes = []

# Get total number of unique airport pairs in original dataset
all_pairs = set()
for flight in original_data:
    pair = normalize_pair(flight['origin'], flight['destination'])
    all_pairs.add(pair)
    original_distances[pair].add(flight['distance'])

for flight in revised_data:
    pair = normalize_pair(flight['origin'], flight['destination'])
    revised_distances[pair].add(flight['distance'])

total_pairs = len(all_pairs)

for pair in set(original_distances.keys()) | set(revised_distances.keys()):
    orig_dist = list(original_distances[pair])[0] if pair in original_distances else None
    rev_dist = list(revised_distances[pair])[0] if pair in revised_distances else None
    if orig_dist != rev_dist:
        different_pairs.add(pair)
        changes.append({
            'pair': pair,
            'original_distance': orig_dist,
            'revised_distance': rev_dist,
            'change': rev_dist - orig_dist if (orig_dist and rev_dist) else None
        })

affected_pairs = len(different_pairs)

# Overall statistics
total_flights = len(original_data)
total_changed_flights = sum(1 for flight in original_data 
                          if normalize_pair(flight['origin'], flight['destination']) in different_pairs)

print(f"\nOverall Statistics:")
print(f"Total flights in dataset: {total_flights}")
print(f"Total flights with changed distances: {total_changed_flights}")
print(f"Percentage of flights affected: {(total_changed_flights/total_flights)*100:.2f}%")

print(f"\nAirport Pair Statistics:")
print(f"Total unique airport pairs in dataset: {total_pairs}")
print(f"Airport pairs with changed distances: {affected_pairs}")
print(f"Percentage of airport pairs affected: {(affected_pairs/total_pairs)*100:.2f}%")

# Distribution of changes
changes_df = pd.DataFrame(changes)
# Count the number of unique airport pairs for each type of change
pair_change_distribution = changes_df['change'].value_counts()
print("\nChange Distribution (number of unique airport pairs):")
print(f"+1 mile: {pair_change_distribution.get(1, 0)} pairs ({(pair_change_distribution.get(1, 0)/total_pairs)*100:.2f}% of all pairs)")
print(f"-1 mile: {pair_change_distribution.get(-1, 0)} pairs ({(pair_change_distribution.get(-1, 0)/total_pairs)*100:.2f}% of all pairs)")

# Count actual flights affected by each type of change
flight_changes = {1: 0, -1: 0}
for flight in original_data:
    pair = normalize_pair(flight['origin'], flight['destination'])
    if pair in different_pairs:
        change = list(revised_distances[pair])[0] - list(original_distances[pair])[0]
        flight_changes[change] += 1

print("\nFlights affected by each type of change:")
print(f"+1 mile: {flight_changes[1]} flights ({(flight_changes[1]/total_flights)*100:.2f}% of all flights)")
print(f"-1 mile: {flight_changes[-1]} flights ({(flight_changes[-1]/total_flights)*100:.2f}% of all flights)")

# Iterative analysis
excluded_airports = set()
running_total = 0
iteration = 1

while True:
    df = get_airport_stats(original_data, different_pairs, excluded_airports)
    
    if df is None or df.empty:
        break
        
    df = df.sort_values('changed_flights', ascending=False)
    top_airport = df.iloc[0]
    running_total += top_airport['changed_flights']
    
    print(f"\nIteration {iteration} - Most affected remaining airport:")
    print(f"Airport: {top_airport['airport']}")
    print(f"Changed flights: {top_airport['changed_flights']} ({(top_airport['changed_flights']/total_changed_flights)*100:.1f}% of all changes)")
    print(f"Running total: {running_total} ({(running_total/total_changed_flights)*100:.1f}% of all changes)")
    print(f"Total flights for this airport (excluding previous airports): {top_airport['total_flights']}")
    print(f"Percent changed: {top_airport['percent_changed']}%")
    
    excluded_airports.add(top_airport['airport'])
    iteration += 1

...and the output:

Overall Statistics:
Total flights in dataset: 20000
Total flights with changed distances: 1440
Percentage of flights affected: 7.20%

Airport Pair Statistics:
Total unique airport pairs in dataset: 332
Airport pairs with changed distances: 18
Percentage of airport pairs affected: 5.42%

Change Distribution (number of unique airport pairs):
+1 mile: 13 pairs (3.92% of all pairs)
-1 mile: 5 pairs (1.51% of all pairs)

Flights affected by each type of change:
+1 mile: 1162 flights (5.81% of all flights)
-1 mile: 278 flights (1.39% of all flights)

Iteration 1 - Most affected remaining airport:
Airport: SJC
Changed flights: 666 (46.2% of all changes)
Running total: 666 (46.2% of all changes)
Total flights for this airport (excluding previous airports): 1046
Percent changed: 63.67%

Iteration 2 - Most affected remaining airport:
Airport: ONT
Changed flights: 370 (25.7% of all changes)
Running total: 1036 (71.9% of all changes)
Total flights for this airport (excluding previous airports): 742
Percent changed: 49.87%

Iteration 3 - Most affected remaining airport:
Airport: BWI
Changed flights: 223 (15.5% of all changes)
Running total: 1259 (87.4% of all changes)
Total flights for this airport (excluding previous airports): 1774
Percent changed: 12.57%

Iteration 4 - Most affected remaining airport:
Airport: OAK
Changed flights: 135 (9.4% of all changes)
Running total: 1394 (96.8% of all changes)
Total flights for this airport (excluding previous airports): 1493
Percent changed: 9.04%

Iteration 5 - Most affected remaining airport:
Airport: MCO
Changed flights: 31 (2.2% of all changes)
Running total: 1425 (99.0% of all changes)
Total flights for this airport (excluding previous airports): 728
Percent changed: 4.26%

Iteration 6 - Most affected remaining airport:
Airport: LAX
Changed flights: 15 (1.0% of all changes)
Running total: 1440 (100.0% of all changes)
Total flights for this airport (excluding previous airports): 1227
Percent changed: 1.22%

@domoritz
Member

Thanks for the analysis. It seems not super consequential, so I think we are good.

@domoritz domoritz merged commit f617597 into vega:main Nov 11, 2024
2 checks passed
Successfully merging this pull request may close these issues.

Flights 3m only has 200k rows