fix: correct timestamp calculations in flight datasets & add generation script #626
Conversation
Thank you. Should we delete https://github.com/vega/vega-datasets/blob/main/scripts/flights.js?

flights.js would be used to generate random subsets from the (already processed) flights-3m.csv, while flights.py generates datasets from original source data not available in the repo. An advantage of keeping the Python version is that it can generate even larger datasets than 3M rows and can also process newer data. I think flights.js can be retired. It's undoubtedly convenient to have a script that can work with local files, but all else equal I favor generation scripts that work back from original sources.

Ah, makes sense. Thanks.

I want to look at the actual diffs in the files. It's a bit tricky to do that in the web interface.
Great question! Looking into the 20k dataset, I found that 1,440 flights (7%) show slightly different distances (each by 1 mile). Notably, the changes all involved flights to or from six airports: SJC, ONT, BWI, OAK, MCO, LAX. Without knowing exactly when or how the source data was originally pulled, it's difficult to identify the cause.

Some analysis in Python:

```python
import requests
import pandas as pd
from collections import defaultdict

# URLs for the JSON files
original_url = "https://raw.githubusercontent.com/vega/vega-datasets/main/data/flights-20k.json"
revised_url = "https://raw.githubusercontent.com/vega/vega-datasets/91ddb14165126ade0baa318da5e676095d09a129/data/flights-20k.json"

# Fetch and load both JSONs
original_data = requests.get(original_url).json()
revised_data = requests.get(revised_url).json()

def normalize_pair(airport1, airport2):
    """Return airports in alphabetical order to normalize pairs"""
    return tuple(sorted([airport1, airport2]))

def get_airport_stats(flights_data, different_pairs, excluded_airports=set()):
    """Calculate stats for airports, excluding specified airports"""
    # Get all airports involved in changed pairs, excluding the ones we've already processed
    affected_airports = set()
    for pair in different_pairs:
        if not (pair[0] in excluded_airports or pair[1] in excluded_airports):
            affected_airports.update(pair)
    affected_airports = affected_airports - excluded_airports
    if not affected_airports:
        return None
    airport_stats = []
    for airport in affected_airports:
        total_flights = sum(1 for flight in flights_data
                            if (flight['origin'] == airport or flight['destination'] == airport) and
                            not (flight['origin'] in excluded_airports or flight['destination'] in excluded_airports))
        changed_flights = sum(1 for flight in flights_data
                              if (flight['origin'] == airport or flight['destination'] == airport) and
                              not (flight['origin'] in excluded_airports or flight['destination'] in excluded_airports) and
                              normalize_pair(flight['origin'], flight['destination']) in different_pairs)
        if total_flights > 0:
            percent_changed = (changed_flights / total_flights * 100)
            airport_stats.append({
                'airport': airport,
                'total_flights': total_flights,
                'changed_flights': changed_flights,
                'percent_changed': round(percent_changed, 2)
            })
    return pd.DataFrame(airport_stats)

# First identify all pairs with different distances and collect change information
original_distances = defaultdict(set)
revised_distances = defaultdict(set)
different_pairs = set()
changes = []

# Get total number of unique airport pairs in original dataset
all_pairs = set()
for flight in original_data:
    pair = normalize_pair(flight['origin'], flight['destination'])
    all_pairs.add(pair)
    original_distances[pair].add(flight['distance'])
for flight in revised_data:
    pair = normalize_pair(flight['origin'], flight['destination'])
    revised_distances[pair].add(flight['distance'])
total_pairs = len(all_pairs)

for pair in set(original_distances.keys()) | set(revised_distances.keys()):
    orig_dist = list(original_distances[pair])[0] if pair in original_distances else None
    rev_dist = list(revised_distances[pair])[0] if pair in revised_distances else None
    if orig_dist != rev_dist:
        different_pairs.add(pair)
        changes.append({
            'pair': pair,
            'original_distance': orig_dist,
            'revised_distance': rev_dist,
            'change': rev_dist - orig_dist if (orig_dist and rev_dist) else None
        })
affected_pairs = len(different_pairs)

# Overall statistics
total_flights = len(original_data)
total_changed_flights = sum(1 for flight in original_data
                            if normalize_pair(flight['origin'], flight['destination']) in different_pairs)
print(f"\nOverall Statistics:")
print(f"Total flights in dataset: {total_flights}")
print(f"Total flights with changed distances: {total_changed_flights}")
print(f"Percentage of flights affected: {(total_changed_flights/total_flights)*100:.2f}%")
print(f"\nAirport Pair Statistics:")
print(f"Total unique airport pairs in dataset: {total_pairs}")
print(f"Airport pairs with changed distances: {affected_pairs}")
print(f"Percentage of airport pairs affected: {(affected_pairs/total_pairs)*100:.2f}%")

# Distribution of changes
changes_df = pd.DataFrame(changes)
# Count the number of unique airport pairs for each type of change
pair_change_distribution = changes_df['change'].value_counts()
print("\nChange Distribution (number of unique airport pairs):")
print(f"+1 mile: {pair_change_distribution.get(1, 0)} pairs ({(pair_change_distribution.get(1, 0)/total_pairs)*100:.2f}% of all pairs)")
print(f"-1 mile: {pair_change_distribution.get(-1, 0)} pairs ({(pair_change_distribution.get(-1, 0)/total_pairs)*100:.2f}% of all pairs)")

# Count actual flights affected by each type of change
flight_changes = {1: 0, -1: 0}
for flight in original_data:
    pair = normalize_pair(flight['origin'], flight['destination'])
    if pair in different_pairs:
        change = list(revised_distances[pair])[0] - list(original_distances[pair])[0]
        flight_changes[change] += 1
print("\nFlights affected by each type of change:")
print(f"+1 mile: {flight_changes[1]} flights ({(flight_changes[1]/total_flights)*100:.2f}% of all flights)")
print(f"-1 mile: {flight_changes[-1]} flights ({(flight_changes[-1]/total_flights)*100:.2f}% of all flights)")

# Iterative analysis
excluded_airports = set()
running_total = 0
iteration = 1
while True:
    df = get_airport_stats(original_data, different_pairs, excluded_airports)
    if df is None or df.empty:
        break
    df = df.sort_values('changed_flights', ascending=False)
    top_airport = df.iloc[0]
    running_total += top_airport['changed_flights']
    print(f"\nIteration {iteration} - Most affected remaining airport:")
    print(f"Airport: {top_airport['airport']}")
    print(f"Changed flights: {top_airport['changed_flights']} ({(top_airport['changed_flights']/total_changed_flights)*100:.1f}% of all changes)")
    print(f"Running total: {running_total} ({(running_total/total_changed_flights)*100:.1f}% of all changes)")
    print(f"Total flights for this airport (excluding previous airports): {top_airport['total_flights']}")
    print(f"Percent changed: {top_airport['percent_changed']}%")
    excluded_airports.add(top_airport['airport'])
    iteration += 1
```

...and the output, which reports overall statistics, airport pair statistics, the change distribution, flights affected by each type of change, and six iterations identifying the most affected remaining airport at each step.
Thanks for the analysis. Seems like it's not super consequential to change this, so I think we are good.
Flights Dataset Improvements & Generation Script

Overview

This PR addresses known issues with the `flights` datasets and introduces a versatile generation script.

Key Changes
Timestamp Calculation Fix

Affects: `flights-2k/5k/10k/20k.json`

Fixed incorrect departure date calculations by properly handling date boundary cases; previously, date and time components could be combined across the wrong day.

Implementation Note: the `2k/5k/10k/20k` datasets were corrected in place rather than resampled, to minimize impact on existing applications. Minor distance variations (~1 mile) may appear due to source data reconciliation.

In-place correction script (for reference)
Dataset Completeness

Affects: `flights-3m.csv`
New Generation Script

New file: `flights.py`

A comprehensive tool for processing U.S. DOT (BTS) On-Time Flight Performance data with enhanced flexibility and features. Creates the date formats used in this repo.

Output Format Options

- 06142330 (compact numeric timestamp)
- 2023/06/14 23:30 (full date/time string)
- 23.75 (decimal hours)
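For illustration, all three output formats can be derived from a single datetime. This is only a sketch of the conversions; flights.py's actual implementation may differ:

```python
from datetime import datetime

dt = datetime(2023, 6, 14, 23, 30)

compact = dt.strftime("%m%d%H%M")         # compact numeric form, e.g. '06142330'
full = dt.strftime("%Y/%m/%d %H:%M")      # full date/time string, e.g. '2023/06/14 23:30'
decimal_hours = dt.hour + dt.minute / 60  # hour of day as a decimal, e.g. 23.5
```

Note the compact form drops the year, so consumers must know which year a dataset covers to reconstruct full timestamps.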
Key Features

Customization Options

- Row count control
- Date range filtering
- Column selection
- Random sampling with seed
- Date change flagging

Data Processing

- Proper date boundary handling
- Multiple output formats (CSV/JSON)
- Comprehensive statistics generation
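A minimal sketch of how the customization options above might compose, using plain dictionaries in place of real BTS records. The function and field names here are hypothetical, not taken from flights.py:

```python
import random

def sample_flights(flights, n_rows, start=None, end=None, columns=None, seed=42):
    """Filter by date range, project columns, then draw a seeded random sample."""
    if start is not None:
        flights = [f for f in flights if f["date"] >= start]
    if end is not None:
        flights = [f for f in flights if f["date"] < end]
    if columns is not None:
        flights = [{k: f[k] for k in columns} for f in flights]
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    return rng.sample(flights, min(n_rows, len(flights)))

# Toy records standing in for parsed BTS rows
flights = [{"date": f"2023/06/{d:02d}", "delay": d % 7, "distance": 100 * d}
           for d in range(1, 31)]
subset = sample_flights(flights, 5, start="2023/06/10", columns=["date", "delay"])
```

Seeded sampling is what makes regeneration deterministic: rerunning the script over the same source data yields byte-identical output.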
📚 Next Steps

Update docs. Confirm that the .arrow version of the flights data is formatted as intended.