Flights 3m only has 200k rows #607
Looks like the count in

    from vega_datasets import data

    datasets = ['flights_2k', 'flights_5k', 'flights_10k', 'flights_20k', 'flights_200k', 'flights_3m']
    for dataset_name in datasets:
        dataset = getattr(data, dataset_name)()
        row_count = len(dataset)
        print(f"{dataset_name}: {row_count} rows")

Results:
We could regenerate the 3m rows using this script, create a CSV from the 3m parquet file here, or do something else?
Hi @domoritz, would either of these approaches work for regenerating the full 3m dataset? Let me know if you need any clarification or have another preferred solution in mind.
The 200k dataset is the same as the one at https://square.github.io/crossfilter/ and we should make sure to keep it as close to that as possible. I think the data is from 2001, so it seems we will need to download it again.
There is a script to generate the flights data at https://github.com/vega/vega-datasets/blob/main/scripts/flights.js (coming from the commit I mentioned in the original issue). Please go ahead. It would be great to fix this.
I wanted to highlight a small inconsistency in how dates and times seem to be handled in the current flights datasets. Taking the first row of the current flights-3m.csv as an example:

Current format:

The date field (01010001) encodes Jan 1, 2001, 00:01 (the year 2001 is fixed for all rows). This appears to differ from the source FAA data:

FlightDate: 2001-01-01

This flight actually departed on Jan 2 (the next day) at 00:01, 11 minutes after its scheduled time of 23:50 on Jan 1. The current dataset's date encoding doesn't seem to capture this overnight boundary crossing.

After reviewing how this dataset is typically used for flight delay analysis, I believe we should encode the scheduled departure time (CRSDepTime) rather than the actual departure time. This aligns with the common use case of analyzing delays based on scheduled flight times, with the delay field (using arrival delay) capturing the actual impact on travelers.

The good news is that, since the 2001 FAA source data is still available, we'll be able to regenerate the dataset from the same source as the original, keeping it consistent with the historical data while potentially addressing these issues.
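To make the overnight case concrete, here is a minimal sketch that decodes the current date field and compares it with the actual departure computed from the scheduled time plus the delay. It assumes the field is laid out as MMDDHHMM with the year fixed at 2001, as described above. The function names and the `dep_delay_min` parameter are hypothetical and are not taken from scripts/flights.js.

```python
from datetime import datetime, timedelta

YEAR = 2001  # per the thread, the year is fixed for all rows


def decode_date_field(value: str, year: int = YEAR) -> datetime:
    """Decode an MMDDHHMM string such as '01010001' (assumed layout)."""
    return datetime.strptime(f"{year}{value}", "%Y%m%d%H%M")


def encode_date_field(moment: datetime) -> str:
    """Encode a datetime back into the assumed MMDDHHMM layout."""
    return moment.strftime("%m%d%H%M")


def actual_departure(flight_date: str, crs_dep_time: str, dep_delay_min: int) -> datetime:
    """Scheduled departure plus delay; timedelta handles the midnight rollover."""
    scheduled = datetime.strptime(f"{flight_date} {crs_dep_time}", "%Y-%m-%d %H%M")
    return scheduled + timedelta(minutes=dep_delay_min)


# Values from the example row: scheduled 23:50 on Jan 1, departed 11 minutes late.
current = decode_date_field("01010001")
corrected = actual_departure("2001-01-01", "2350", 11)

print(current)                       # 2001-01-01 00:01:00
print(corrected)                     # 2001-01-02 00:01:00
print(encode_date_field(corrected))  # 01020001
```

The two datetimes differ by exactly one day, which is the boundary crossing that the current encoding loses when it pairs the scheduled date with the actual departure time.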
Not sure who is using the 3m dataset anyway. I'm happy to change that one. The other ones we want to be more careful with.
Agreed. One approach would be to:
1. Update the 3m dataset with the correct combination of scheduled departure date and scheduled departure time.
2. Leave the other flights datasets as they are, but note in sources.md the potential issue with the blurring of scheduled departure date + actual departure time.
3. Add a generation script to demonstrate how to replicate the new 3m dataset from source files using the correct datetime method, and build in the option to build smaller subsets using the correct methodology if desired.
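A rough sketch of what such a generation step could look like, under several assumptions: source rows are dicts keyed by FlightDate and CRSDepTime (named in this thread) plus an ArrDelay column (an assumed column name for the arrival delay), the output keeps the MMDDHHMM date layout, and subsets are drawn with a seeded sample so they can be reproduced. None of this reflects the existing scripts/flights.js; every function name here is hypothetical.

```python
import random
from datetime import datetime


def scheduled_date_field(flight_date: str, crs_dep_time: str) -> str:
    """Combine scheduled date and scheduled time into MMDDHHMM (assumed layout)."""
    scheduled = datetime.strptime(f"{flight_date} {crs_dep_time}", "%Y-%m-%d %H%M")
    return scheduled.strftime("%m%d%H%M")


def build_rows(source_rows):
    """Map source records to output records using scheduled date + scheduled time."""
    return [
        {
            "date": scheduled_date_field(row["FlightDate"], row["CRSDepTime"]),
            "delay": row["ArrDelay"],  # assumed column name for arrival delay
        }
        for row in source_rows
    ]


def subset(rows, n, seed=0):
    """Pick n rows reproducibly, keeping them in their original order."""
    rng = random.Random(seed)
    picked = sorted(rng.sample(range(len(rows)), n))
    return [rows[i] for i in picked]


# Made-up sample records, for illustration only.
sample_source = [
    {"FlightDate": "2001-01-01", "CRSDepTime": "2350", "ArrDelay": 14},
    {"FlightDate": "2001-01-02", "CRSDepTime": "0615", "ArrDelay": -3},
    {"FlightDate": "2001-01-02", "CRSDepTime": "1200", "ArrDelay": 0},
]
rows = build_rows(sample_source)
small = subset(rows, 2, seed=42)
```

Fixing the seed means a smaller file can be rebuilt identically from the same source data, which matters if the subsets are meant to stay stable across regenerations.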
I think there is value in looking at actual departure time rather than scheduled. It would be great if we could fix the date for the ones where we didn't have it right. Same for the 3m one.
https://github.com/vega/vega-datasets/blob/main/data/flights-3m.csv seems to only have 200k rows.
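The row count can also be checked directly on a downloaded copy of the CSV without any third-party packages. This is a minimal sketch using only the standard library; the helper name `count_data_rows` is hypothetical, and it assumes the file has a single header row.

```python
import csv
import io


def count_data_rows(file_obj, has_header: bool = True) -> int:
    """Count CSV records in an open text file, excluding the header if present."""
    total = sum(1 for _ in csv.reader(file_obj))
    return max(total - 1, 0) if has_header else total


# Small in-memory demo; column names are made up for the example.
demo = io.StringIO("date,delay\n01010001,14\n01010600,-3\n")
print(count_data_rows(demo))  # 2

# For a real file (the local path is an assumption):
# with open("flights-3m.csv", newline="") as f:
#     print(count_data_rows(f))
```

Opening the file with `newline=""` follows the csv module's recommendation, so quoted fields containing line breaks are counted as one record.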
Added in 1e70098 by @arvind