
Flights 3m only has 200k rows #607

Closed
domoritz opened this issue Sep 19, 2024 · 7 comments · Fixed by #626
Comments

@domoritz (Member)

domoritz commented Sep 19, 2024

https://github.com/vega/vega-datasets/blob/main/data/flights-3m.csv seems to only have 200k rows.

wc -l flights-3m.csv
  231084 flights-3m.csv
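Note that `wc -l` counts the header line, so 231084 lines corresponds to 231083 data rows. A minimal sketch of the off-by-one, using a hypothetical miniature CSV with the same layout as flights-3m.csv:

```python
import csv
import io

# Hypothetical sample: one header line plus three data rows,
# mirroring the flights-3m.csv column layout.
sample = (
    "date,delay,distance,origin,destination\n"
    "01010001,14,405,MCI,MDW\n"
    "01010530,-5,2300,JFK,LAX\n"
    "01011215,0,700,SEA,SFO\n"
)

line_count = sample.count("\n")  # what `wc -l` reports: 4
data_rows = len(list(csv.reader(io.StringIO(sample)))) - 1  # minus header: 3

print(line_count, data_rows)  # 4 3
```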

Added in 1e70098 by @arvind

@dsmedia (Collaborator)

dsmedia commented Sep 20, 2024

Looks like the count in flights_200k may also be off.

from vega_datasets import data

datasets = ['flights_2k', 'flights_5k', 'flights_10k', 'flights_20k', 'flights_200k', 'flights_3m']

for dataset_name in datasets:
    dataset = getattr(data, dataset_name)()
    row_count = len(dataset)
    print(f"{dataset_name}: {row_count} rows")

Results:

flights_2k: 2000 rows
flights_5k: 5000 rows
flights_10k: 10000 rows
flights_20k: 20000 rows
flights_200k: 231083 rows
flights_3m: 231083 rows

We could regenerate the 3m rows using this script, create a CSV from the 3m parquet file here, or take another approach?

@dsmedia (Collaborator)

dsmedia commented Nov 3, 2024

Hi @domoritz, would either of these approaches work for regenerating the full 3m dataset? Let me know if you need any clarification or have another preferred solution in mind.

@domoritz (Member, Author)

domoritz commented Nov 3, 2024

The 200k is the same as https://square.github.io/crossfilter/ and we should make sure to keep that as close as possible.

The data is from 2001, I think, so it seems we will need to download it again.

D select min(FL_DATE), max(FL_DATE) from "flights-10m.parquet";
┌──────────────┬──────────────┐
│ min(FL_DATE) │ max(FL_DATE) │
│     date     │     date     │
├──────────────┼──────────────┤
│ 2006-01-01   │ 2007-06-30   │
└──────────────┴──────────────┘

There is a script to generate the flights data in https://github.com/vega/vega-datasets/blob/main/scripts/flights.js (coming from the commit I mentioned in the original issue).

Please go ahead. Would be great to fix this.

@dsmedia (Collaborator)

dsmedia commented Nov 5, 2024

I wanted to highlight a small inconsistency in how dates and times seem to be handled in the current flights datasets. Taking the first row of the current flights-3m.csv as an example:

Current format:

date,delay,distance,origin,destination
01010001,14,405,MCI,MDW

The date field (01010001) encodes Jan 1, 2001, 00:01 in MMDDHHMM form (the year 2001 is fixed for all rows). This appears to differ from the source FAA data:
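The MMDDHHMM encoding can be decoded with the standard library; a minimal sketch assuming the fixed year 2001 (the function name is hypothetical, not part of vega-datasets):

```python
from datetime import datetime

def decode_flight_date(field: str, year: int = 2001) -> datetime:
    """Decode the MMDDHHMM date field used in flights-3m.csv."""
    month, day = int(field[0:2]), int(field[2:4])
    hour, minute = int(field[4:6]), int(field[6:8])
    return datetime(year, month, day, hour, minute)

print(decode_flight_date("01010001"))  # 2001-01-01 00:01:00
```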

FlightDate: 2001-01-01
CRSDepTime: 2350 (scheduled departure 23:50)
DepTime: 1.0 (actual departure 00:01)
DepDelay: 11.0 minutes
ArrDelay: 14.0 minutes

This flight actually departed on Jan 2 (next day) at 00:01, 11 minutes after its scheduled time of 23:50 on Jan 1. The current dataset's date encoding doesn't seem to capture this overnight boundary crossing.

After reviewing how this dataset is typically used for flight delay analysis, I believe we should encode the scheduled departure time (CRSDepTime) rather than actual departure time. This aligns with the common use case of analyzing delays based on scheduled flight times, with the delay field (using arrival delay) capturing the actual impact on travelers.

The good news is that, since the 2001 FAA source data is still available, we'll be able to regenerate the dataset from the same source as the original, ensuring consistency with the historical data while potentially addressing these data consistency issues.

@domoritz (Member, Author)

domoritz commented Nov 6, 2024

Not sure who is using the 3m dataset anyway. I'm happy to change that one. The other ones we want to be more careful with.

@dsmedia (Collaborator)

dsmedia commented Nov 6, 2024

Agreed. One approach would be to:

1. Update the 3m dataset with the correct combination of scheduled departure date and scheduled departure time.
2. Leave the other flights datasets as they are, but note in sources.md the potential issue with the blurring of scheduled departure date and actual departure time.
3. Add a generation script demonstrating how to replicate the new 3m dataset from the source files using the correct datetime method, with an option to build smaller subsets using the same methodology if desired.

@domoritz (Member, Author)

domoritz commented Nov 6, 2024

I think there is value in looking at actual departure time rather than scheduled. It would be great if we could fix the date for the ones where we didn't have it right. Same for the 3m one.
