
fix: correct timestamp calculations in flight datasets & add generation script #626

Merged
6 commits merged into vega:main on Nov 11, 2024

Conversation

dsmedia
Collaborator

@dsmedia dsmedia commented Nov 10, 2024

Flights Dataset Improvements & Generation Script

Resolves #607

Overview

This PR addresses known issues with the flights datasets and introduces a versatile generation script.

Key Changes

Timestamp Calculation Fix

Affects: flights-2k/5k/10k/20k.json

Fixed incorrect departure timestamps by properly handling date-boundary cases. Previously, each timestamp incorrectly combined:

  • the scheduled flight date
  • the actual departure time

so flights whose actual departure crossed midnight were stamped with the wrong day.

Example Issues:

  • Late-night delay crossing midnight
    Scheduled: Jan 1, 11:45 PM + 45 min delay
    Old: Jan 1, 12:30 AM (incorrect)
    New: Jan 2, 12:30 AM (correct)
  • Early-morning departure leaving ahead of schedule
    Scheduled: Jan 2, 12:30 AM − 45 min early
    Old: Jan 2, 11:45 PM (incorrect)
    New: Jan 1, 11:45 PM (correct)

Implementation Note: 2k/5k/10k/20k datasets corrected in-place rather than through resampling to minimize impact on existing applications. Minor distance variations (~1 mile) may appear due to source data reconciliation.
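The boundary cases above come down to computing the actual departure as the full scheduled datetime plus the delay, rather than pasting the actual clock time onto the scheduled date. A minimal sketch in pandas (illustrative values, not taken from the datasets):

```python
import pandas as pd

scheduled = pd.Timestamp("2001-01-01 23:45")
delay = pd.Timedelta(minutes=45)

# Old (buggy): keep the scheduled date, substitute the actual clock time
actual_clock = (scheduled + delay).strftime("%H:%M:%S")
old = pd.Timestamp(f"2001-01-01 {actual_clock}")  # Jan 1, 00:30 -- wrong day

# New (fixed): add the delay to the full scheduled datetime
new = scheduled + delay                           # Jan 2, 00:30 -- correct day
```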

In-place correction script (for reference)
import pandas as pd
import glob
import zipfile
import datetime as dt
import json

def load_source_data(zip_path):
    """Load and combine CSV files from ZIP archives."""
    dfs = []
    for zip_file in glob.glob(zip_path):
        with zipfile.ZipFile(zip_file) as z:
            for filename in z.namelist():
                if filename.endswith('.csv'):
                    df_temp = pd.read_csv(z.open(filename), encoding='ISO-8859-1')
                    dfs.append(df_temp)
    
    return pd.concat(dfs, ignore_index=True)

def process_time_columns(df):
    """Process and standardize time-related columns."""
    # Select relevant columns
    columns = ['FlightDate', 'CRSDepTime', 'DepTime', 'DepDelay', 
               'ArrDelay', 'Distance', 'Origin', 'Dest', 'Cancelled']
    df = df[columns].dropna(subset=['DepDelay', 'ArrDelay'])

    # Process actual departure times
    df['DepTime'] = df['DepTime'].fillna(0).astype(int)
    df['DepTime'] = df['DepTime'].apply(lambda x: '0000' if x == 2400 else f"{x:04d}")
    df['DepTimeFormatted'] = pd.to_datetime(df['DepTime'], format='%H%M').dt.strftime('%H:%M:%S')
    df['RepoDateMatch'] = pd.to_datetime(df['FlightDate']) + pd.to_timedelta(df['DepTimeFormatted'])

    # Process scheduled departure times
    df['CRSDepTime'] = df['CRSDepTime'].fillna(0).astype(int)
    df['CRSDepTime'] = df['CRSDepTime'].apply(lambda x: '0000' if x == 2400 else f"{x:04d}")
    df['CRSDepTimeFormatted'] = pd.to_datetime(df['CRSDepTime'], format='%H%M').dt.strftime('%H:%M:%S')
    df['ScheduledDepDateTime'] = pd.to_datetime(df['FlightDate']) + pd.to_timedelta(df['CRSDepTimeFormatted'])

    # Calculate actual departure datetime
    df['ActualDepDateTime'] = df['ScheduledDepDateTime'] + pd.to_timedelta(df['DepDelay'], unit='minutes')
    
    return df

def merge_with_target_datasets(source_df, target_dfs):
    """Merge processed source data with target datasets."""
    processed_dfs = {}
    
    for df_name, target_df in target_dfs.items():
        # Create temporary delay column
        temp_df = target_df.copy()
        temp_df['delay_int'] = temp_df['delay'].astype(int)
        
        # Merge with source data
        merged_df = temp_df.merge(
            source_df[['RepoDateMatch', 'ArrDelay', 'Origin', 'Dest', 
                      'ActualDepDateTime', 'Distance']].drop_duplicates(
                ['RepoDateMatch', 'ArrDelay', 'Origin', 'Dest'], 
                keep='first'
            ),
            left_on=['date', 'delay_int', 'origin', 'destination'],
            right_on=['RepoDateMatch', 'ArrDelay', 'Origin', 'Dest'],
            how='left'
        )
        
        # Clean up merged dataframe
        processed_dfs[df_name] = merged_df.drop(
            ['RepoDateMatch', 'ArrDelay', 'Origin', 'Dest', 'delay_int'], 
            axis=1
        ).rename(columns={'Distance': 'DistanceFromSource'})
        
        # Update date column with actual departure time
        processed_dfs[df_name]['date'] = processed_dfs[df_name]['ActualDepDateTime']
        processed_dfs[df_name] = processed_dfs[df_name][
            ['date', 'delay', 'DistanceFromSource', 'origin', 'destination']
        ]
    
    return processed_dfs

def export_processed_data(processed_dfs, output_path_template):
    """Export processed dataframes to JSON files."""
    for df_name, df in processed_dfs.items():
        # Prepare data for export
        export_df = df.copy()
        export_df['date'] = export_df['date'].dt.strftime('%Y/%m/%d %H:%M')
        export_df['delay'] = export_df['delay'].astype(int)
        export_df = export_df.dropna(subset=['DistanceFromSource'])
        export_df['DistanceFromSource'] = export_df['DistanceFromSource'].astype(int)
        export_df = export_df.rename(columns={'DistanceFromSource': 'distance'})
        
        # Export to JSON
        output_path = output_path_template.format(df_name)
        with open(output_path, 'w') as f:
            json.dump(export_df.to_dict('records'), f, 
                     ensure_ascii=False, separators=(',',':'))

def main():
    # Load source data
    source_df = load_source_data('scripts/tmp/*.zip')
    
    # Process time columns
    processed_source_df = process_time_columns(source_df)
    
    # Load target datasets
    target_dfs = {
        'flights-2k': pd.read_json('data/flights-2k.json'),
        'flights-5k': pd.read_json('data/flights-5k.json'),
        'flights-10k': pd.read_json('data/flights-10k.json'),
        'flights-20k': pd.read_json('data/flights-20k.json')
    }
    
    # Merge datasets
    processed_dfs = merge_with_target_datasets(processed_source_df, target_dfs)
    
    # Export processed data
    export_processed_data(processed_dfs, 'data/new{}.json')

if __name__ == '__main__':
    main()
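As an aside, `process_time_columns` above zero-pads the raw BTS `HHMM` integers and remaps 2400 (BTS's encoding of midnight) to `0000`, since the `%H%M` format cannot parse hour 24. A quick sketch of just that conversion, on illustrative values:

```python
import pandas as pd

dep = pd.Series([2400, 5, 1330])  # raw BTS HHMM integers
# 2400 means midnight in BTS data; %H%M only accepts hours 00-23
dep = dep.apply(lambda x: "0000" if x == 2400 else f"{x:04d}")
as_time = pd.to_datetime(dep, format="%H%M").dt.strftime("%H:%M:%S")
print(list(as_time))  # ['00:00:00', '00:05:00', '13:30:00']
```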

Dataset Completeness

Affects: flights-3m.csv

  • Restored to full 3 million rows
  • Implements same timestamp calculation fix as above

New Generation Script

New file: flights.py

A comprehensive tool for processing U.S. DOT Bureau of Transportation Statistics (BTS) On-Time Flight Performance data with greater flexibility. It produces the date formats used in this repo.

Output Format Options

| Format | Example | Description |
|---|---|---|
| MMDDHHMM | 06142330 | June 14, 23:30 |
| ISO | 2023/06/14 23:30 | ISO-style datetime |
| Decimal | 23.75 | 23:45 in decimal form |
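All three formats can be derived from a single datetime with the standard library; a small sketch (variable names are illustrative, not flights.py's actual API):

```python
from datetime import datetime

ts = datetime(2023, 6, 14, 23, 30)
mmddhhmm = ts.strftime("%m%d%H%M")          # "06142330"
iso_style = ts.strftime("%Y/%m/%d %H:%M")   # "2023/06/14 23:30"
decimal = ts.hour + ts.minute / 60          # 23.5 (23:45 would be 23.75)
```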

Key Features

  • Customization options
    • Row count control
    • Date range filtering
    • Column selection
    • Random sampling with seed
    • Date change flagging
  • Data processing
    • Proper date boundary handling
    • Multiple output formats (CSV/JSON)
    • Comprehensive statistics generation

📚 Next Steps

Update docs. Confirm that .arrow version of flights data is formatted as intended.

@domoritz
Member

@dsmedia
Collaborator Author

dsmedia commented Nov 10, 2024

Thank you. Should we delete https://github.com/vega/vega-datasets/blob/main/scripts/flights.js?

flights.js would be used to generate random subsets from the (already processed) flights-3m.csv, while flights.py generates datasets from the original source data, which is not available in the repo. An advantage of keeping the Python version is that it can generate datasets even larger than 3 million rows and can also process newer data. I think flights.js can be retired. It's undoubtedly convenient to have a script that works with local files, but all else equal I favor generation scripts that can work back from original sources.

@domoritz
Member

Ah, makes sense. Thanks.

@domoritz
Member

I want to look at the actual diffs in the files. It's a bit tricky to do that in the web interface.

@domoritz
Member

Why did some of the distances change?

[Screenshot, 2024-11-11: file diff showing changed distance values]

Other than that, looks good.

@dsmedia
Collaborator Author

dsmedia commented Nov 11, 2024

Why did some of the distances change?

Great question! Looking into the 20k dataset, I found that 1,440 flights (7.2%) show slightly different distances (each off by exactly 1 mile). Notably, the changes all involve flights to or from six airports: SJC, ONT, BWI, OAK, MCO, LAX. Without knowing exactly when or how the source data was originally pulled, it's difficult to pinpoint the cause.

Some analysis in Python:
import requests
import pandas as pd
from collections import defaultdict

# URLs for the JSON files
original_url = "https://raw.githubusercontent.com/vega/vega-datasets/main/data/flights-20k.json"
revised_url = "https://raw.githubusercontent.com/vega/vega-datasets/91ddb14165126ade0baa318da5e676095d09a129/data/flights-20k.json"

# Fetch and load both JSONs
original_data = requests.get(original_url).json()
revised_data = requests.get(revised_url).json()

def normalize_pair(airport1, airport2):
    """Return airports in alphabetical order to normalize pairs"""
    return tuple(sorted([airport1, airport2]))

def get_airport_stats(flights_data, different_pairs, excluded_airports=set()):
    """Calculate stats for airports, excluding specified airports"""
    # Get all airports involved in changed pairs, excluding the ones we've already processed
    affected_airports = set()
    for pair in different_pairs:
        if not (pair[0] in excluded_airports or pair[1] in excluded_airports):
            affected_airports.update(pair)
    
    affected_airports = affected_airports - excluded_airports
    
    if not affected_airports:
        return None
    
    airport_stats = []
    for airport in affected_airports:
        total_flights = sum(1 for flight in flights_data 
                          if (flight['origin'] == airport or flight['destination'] == airport) and
                          not (flight['origin'] in excluded_airports or flight['destination'] in excluded_airports))
        
        changed_flights = sum(1 for flight in flights_data 
                            if (flight['origin'] == airport or flight['destination'] == airport) and
                            not (flight['origin'] in excluded_airports or flight['destination'] in excluded_airports) and
                            normalize_pair(flight['origin'], flight['destination']) in different_pairs)
        
        if total_flights > 0:
            percent_changed = (changed_flights / total_flights * 100)
            airport_stats.append({
                'airport': airport,
                'total_flights': total_flights,
                'changed_flights': changed_flights,
                'percent_changed': round(percent_changed, 2)
            })
    
    return pd.DataFrame(airport_stats)

# First identify all pairs with different distances and collect change information
original_distances = defaultdict(set)
revised_distances = defaultdict(set)
different_pairs = set()
changes = []

# Get total number of unique airport pairs in original dataset
all_pairs = set()
for flight in original_data:
    pair = normalize_pair(flight['origin'], flight['destination'])
    all_pairs.add(pair)
    original_distances[pair].add(flight['distance'])

for flight in revised_data:
    pair = normalize_pair(flight['origin'], flight['destination'])
    revised_distances[pair].add(flight['distance'])

total_pairs = len(all_pairs)

for pair in set(original_distances.keys()) | set(revised_distances.keys()):
    orig_dist = list(original_distances[pair])[0] if pair in original_distances else None
    rev_dist = list(revised_distances[pair])[0] if pair in revised_distances else None
    if orig_dist != rev_dist:
        different_pairs.add(pair)
        changes.append({
            'pair': pair,
            'original_distance': orig_dist,
            'revised_distance': rev_dist,
            'change': rev_dist - orig_dist if (orig_dist and rev_dist) else None
        })

affected_pairs = len(different_pairs)

# Overall statistics
total_flights = len(original_data)
total_changed_flights = sum(1 for flight in original_data 
                          if normalize_pair(flight['origin'], flight['destination']) in different_pairs)

print(f"\nOverall Statistics:")
print(f"Total flights in dataset: {total_flights}")
print(f"Total flights with changed distances: {total_changed_flights}")
print(f"Percentage of flights affected: {(total_changed_flights/total_flights)*100:.2f}%")

print(f"\nAirport Pair Statistics:")
print(f"Total unique airport pairs in dataset: {total_pairs}")
print(f"Airport pairs with changed distances: {affected_pairs}")
print(f"Percentage of airport pairs affected: {(affected_pairs/total_pairs)*100:.2f}%")

# Distribution of changes
changes_df = pd.DataFrame(changes)
# Count the number of unique airport pairs for each type of change
pair_change_distribution = changes_df['change'].value_counts()
print("\nChange Distribution (number of unique airport pairs):")
print(f"+1 mile: {pair_change_distribution.get(1, 0)} pairs ({(pair_change_distribution.get(1, 0)/total_pairs)*100:.2f}% of all pairs)")
print(f"-1 mile: {pair_change_distribution.get(-1, 0)} pairs ({(pair_change_distribution.get(-1, 0)/total_pairs)*100:.2f}% of all pairs)")

# Count actual flights affected by each type of change
flight_changes = {1: 0, -1: 0}
for flight in original_data:
    pair = normalize_pair(flight['origin'], flight['destination'])
    if pair in different_pairs:
        change = list(revised_distances[pair])[0] - list(original_distances[pair])[0]
        flight_changes[change] += 1

print("\nFlights affected by each type of change:")
print(f"+1 mile: {flight_changes[1]} flights ({(flight_changes[1]/total_flights)*100:.2f}% of all flights)")
print(f"-1 mile: {flight_changes[-1]} flights ({(flight_changes[-1]/total_flights)*100:.2f}% of all flights)")

# Iterative analysis
excluded_airports = set()
running_total = 0
iteration = 1

while True:
    df = get_airport_stats(original_data, different_pairs, excluded_airports)
    
    if df is None or df.empty:
        break
        
    df = df.sort_values('changed_flights', ascending=False)
    top_airport = df.iloc[0]
    running_total += top_airport['changed_flights']
    
    print(f"\nIteration {iteration} - Most affected remaining airport:")
    print(f"Airport: {top_airport['airport']}")
    print(f"Changed flights: {top_airport['changed_flights']} ({(top_airport['changed_flights']/total_changed_flights)*100:.1f}% of all changes)")
    print(f"Running total: {running_total} ({(running_total/total_changed_flights)*100:.1f}% of all changes)")
    print(f"Total flights for this airport (excluding previous airports): {top_airport['total_flights']}")
    print(f"Percent changed: {top_airport['percent_changed']}%")
    
    excluded_airports.add(top_airport['airport'])
    iteration += 1

...and the output:

Overall Statistics:
Total flights in dataset: 20000
Total flights with changed distances: 1440
Percentage of flights affected: 7.20%

Airport Pair Statistics:
Total unique airport pairs in dataset: 332
Airport pairs with changed distances: 18
Percentage of airport pairs affected: 5.42%

Change Distribution (number of unique airport pairs):
+1 mile: 13 pairs (3.92% of all pairs)
-1 mile: 5 pairs (1.51% of all pairs)

Flights affected by each type of change:
+1 mile: 1162 flights (5.81% of all flights)
-1 mile: 278 flights (1.39% of all flights)

Iteration 1 - Most affected remaining airport:
Airport: SJC
Changed flights: 666 (46.2% of all changes)
Running total: 666 (46.2% of all changes)
Total flights for this airport (excluding previous airports): 1046
Percent changed: 63.67%

Iteration 2 - Most affected remaining airport:
Airport: ONT
Changed flights: 370 (25.7% of all changes)
Running total: 1036 (71.9% of all changes)
Total flights for this airport (excluding previous airports): 742
Percent changed: 49.87%

Iteration 3 - Most affected remaining airport:
Airport: BWI
Changed flights: 223 (15.5% of all changes)
Running total: 1259 (87.4% of all changes)
Total flights for this airport (excluding previous airports): 1774
Percent changed: 12.57%

Iteration 4 - Most affected remaining airport:
Airport: OAK
Changed flights: 135 (9.4% of all changes)
Running total: 1394 (96.8% of all changes)
Total flights for this airport (excluding previous airports): 1493
Percent changed: 9.04%

Iteration 5 - Most affected remaining airport:
Airport: MCO
Changed flights: 31 (2.2% of all changes)
Running total: 1425 (99.0% of all changes)
Total flights for this airport (excluding previous airports): 728
Percent changed: 4.26%

Iteration 6 - Most affected remaining airport:
Airport: LAX
Changed flights: 15 (1.0% of all changes)
Running total: 1440 (100.0% of all changes)
Total flights for this airport (excluding previous airports): 1227
Percent changed: 1.22%

@domoritz
Member

Thanks for the analysis. It seems not super consequential, so I think we are good.

@domoritz domoritz merged commit f617597 into vega:main Nov 11, 2024
2 checks passed
Successfully merging this pull request may close these issues.

Flights 3m only has 200k rows