
Flights 3m only has 200k rows #607

Closed
domoritz opened this issue Sep 19, 2024 · 7 comments · Fixed by #626
Comments

@domoritz (Member)

domoritz commented Sep 19, 2024

https://github.com/vega/vega-datasets/blob/main/data/flights-3m.csv seems to only have 200k rows.

wc -l flights-3m.csv
  231084 flights-3m.csv
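Note that `wc -l` counts the header line, so 231084 lines corresponds to 231083 data rows. A minimal sketch of the off-by-one, using a hypothetical miniature CSV with the same layout as flights-3m.csv:

```python
import csv
import io

# Hypothetical sample: one header line plus three data rows,
# mirroring the flights-3m.csv column layout.
sample = (
    "date,delay,distance,origin,destination\n"
    "01010001,14,405,MCI,MDW\n"
    "01010530,-5,2300,JFK,LAX\n"
    "01011215,0,700,SEA,SFO\n"
)

line_count = sample.count("\n")  # what `wc -l` reports: 4
data_rows = len(list(csv.reader(io.StringIO(sample)))) - 1  # minus header: 3

print(line_count, data_rows)  # 4 3
```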

Added in 1e70098 by @arvind

@dsmedia (Collaborator)

dsmedia commented Sep 20, 2024

Looks like the count in flights_200k may also be off.

from vega_datasets import data

datasets = ['flights_2k', 'flights_5k', 'flights_10k', 'flights_20k', 'flights_200k', 'flights_3m']

for dataset_name in datasets:
    dataset = getattr(data, dataset_name)()
    row_count = len(dataset)
    print(f"{dataset_name}: {row_count} rows")

Results:

flights_2k: 2000 rows
flights_5k: 5000 rows
flights_10k: 10000 rows
flights_20k: 20000 rows
flights_200k: 231083 rows
flights_3m: 231083 rows

We could regenerate the 3m rows using this script, create a CSV from the 3m parquet file here, or take another approach?

@dsmedia (Collaborator)

dsmedia commented Nov 3, 2024

Hi @domoritz, would either of these approaches work for regenerating the full 3m dataset? Let me know if you need any clarification or have another preferred solution in mind.

@domoritz (Member, Author)

domoritz commented Nov 3, 2024

The 200k is the same as https://square.github.io/crossfilter/ and we should make sure to keep that as close as possible.

The data is from 2001, I think, so it seems we will need to download it again.

D select min(FL_DATE), max(FL_DATE) from "flights-10m.parquet";
┌──────────────┬──────────────┐
│ min(FL_DATE) │ max(FL_DATE) │
│     date     │     date     │
├──────────────┼──────────────┤
│ 2006-01-01   │ 2007-06-30   │
└──────────────┴──────────────┘

There is a script to generate the flights data in https://github.com/vega/vega-datasets/blob/main/scripts/flights.js (coming from the commit I mentioned in the original issue).

Please go ahead. Would be great to fix this.

@dsmedia (Collaborator)

dsmedia commented Nov 5, 2024

I wanted to highlight a small inconsistency in how dates and times seem to be handled in the current flights datasets. Taking the first row of the current flights-3m.csv as an example:

Current format:

date,delay,distance,origin,destination
01010001,14,405,MCI,MDW

The date field (01010001) encodes Jan 1, 2001, 00:01 in MMDDHHMM form (the year 2001 is fixed for all rows). This appears to differ from the source FAA data:
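The MMDDHHMM encoding can be decoded with the standard library; a minimal sketch assuming the fixed year 2001 (the function name is hypothetical, not part of vega-datasets):

```python
from datetime import datetime

def decode_flight_date(field: str, year: int = 2001) -> datetime:
    """Decode the MMDDHHMM date field used in flights-3m.csv."""
    month, day = int(field[0:2]), int(field[2:4])
    hour, minute = int(field[4:6]), int(field[6:8])
    return datetime(year, month, day, hour, minute)

print(decode_flight_date("01010001"))  # 2001-01-01 00:01:00
```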

FlightDate: 2001-01-01
CRSDepTime: 2350 (scheduled departure 23:50)
DepTime: 1.0 (actual departure 00:01)
DepDelay: 11.0 minutes
ArrDelay: 14.0 minutes

This flight actually departed on Jan 2 (next day) at 00:01, 11 minutes after its scheduled time of 23:50 on Jan 1. The current dataset's date encoding doesn't seem to capture this overnight boundary crossing.

After reviewing how this dataset is typically used for flight delay analysis, I believe we should encode the scheduled departure time (CRSDepTime) rather than actual departure time. This aligns with the common use case of analyzing delays based on scheduled flight times, with the delay field (using arrival delay) capturing the actual impact on travelers.

The good news is that, since the 2001 FAA source data is still available, we'll be able to regenerate the dataset from the same source as the original, ensuring consistency with the historical data while potentially addressing these data consistency issues.

@domoritz (Member, Author)

domoritz commented Nov 6, 2024

Not sure who is using the 3m dataset anyway. I'm happy to change that one. The other ones we want to be more careful with.

@dsmedia (Collaborator)

dsmedia commented Nov 6, 2024

Agreed. One approach would be to:

1. Update the 3m dataset with the correct combination of scheduled departure date and scheduled departure time.
2. Leave the other flights datasets as they are, but note in sources.md the potential issue with the blurring of scheduled departure date and actual departure time.
3. Add a generation script demonstrating how to replicate the new 3m dataset from the source files using the correct datetime method, with an option to build smaller subsets using the same methodology if desired.

@domoritz (Member, Author)

domoritz commented Nov 6, 2024

I think there is value in looking at actual departure time rather than scheduled. It would be great if we could fix the date for the ones where we didn't have it right. Same for the 3m one.
