By looking at the RITA flight dataset for 2008, this visualization explores the relationship between average flight departure delay time and the hour of day. It also explores the relationship between delay time and other time periods, such as day of week and month of year.
I suggest you first view the visualization, and then read the full design process.
After exploring the data, trying out different chart types, playing with the narrative and getting feedback, I found that the best way to communicate the main point was a simple bar chart. The bar chart shows the average delay time for departing flights by hour of day. It is a clear design choice because the viewer can easily compare delay times by the height of the bars. Also, given its widespread usage, viewers are already trained to interpret this type of plot.
The process of getting feedback was a good lesson that less is more, and that the main message of the visualization comes first.
That's why I chose the bar chart over the fancier grid visualization as the main plot: it simply communicates the message faster and more clearly. After discovering the curious relationship between flight volume and delay time, I also explored other charts, such as a scatter plot of flight count against delay time and more complex scatter plots that used size and color to add the hour variable. None of these added much to the main point; they actually seemed to draw attention away from it, and users took much longer to understand them. So I decided to keep two bar graphs. The first, average flight delay by hour of day, is the most important one because it gives readers an actionable metric they can use when choosing flight schedules. To complement and explain it, I then present the flight volume plot.
Additional design choices around the bar chart were, first, to add focus to the hours with the highest average delay times, which is the main point of the visualization. I accomplished this by changing the color of the bars of interest. Then, to improve the viewer's interaction, I also limited the tooltip message to the exact number of minutes. Again, I found that less is more: a cluttered tooltip hurts the viewer more than it helps.
Visual encodings | Variable |
---|---|
distance x | hour of day |
distance y | average delay time |
color hue | high/low average delay time |
The same points apply to the flight volume bar chart. Keeping both charts the same type helps the viewer see the relationship between the two variables.
After the viewer gets a feel for how delay time varies across the hours of the day, a natural next step is to present the grid visualization, which combines the bar chart variables and adds another: day of week.
For the grid visualization, I decided to use a calendar grid comparing how flight delays vary depending on the hour of day and day of week. The inspiration came from a Trulia graph depicting the time of day and day of week when most people go house hunting. The idea behind the design is to use color as a way to communicate the amount of delay time, with hotter colors representing longer delays and cooler colors shorter ones. The reason for choosing colors and a grid chart is that a single graph can show the relationship between three variables in an aesthetically pleasing way. It lets us look at questions like: what is the worst day and time of the week for flight delays? Are delays worse on early Tuesdays or late Fridays?

What I found out based on feedback is that this visualization is not so easy to digest, and the color coding was far from clear. To address these concerns, I changed the color coding from multiple hues to different intensities of a single hue. This definitely made the grid easier to interpret; there is simply less room for the viewer to misunderstand. The tooltip is a really helpful interaction for the same reason: the user can explore the data much more freely and get concrete values instead of guessing what the third level of color intensity means.
Visual encodings | Variable |
---|---|
distance x | hour of day |
distance y | day of week |
color hue | flight delay time |
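The single-hue encoding described above can be sketched in a few lines of Python. This is only an illustration under assumptions, not the code used in the visualization: the function names, the five intensity levels, the hue value and the lightness range are all hypothetical choices.

```python
def intensity_level(value, lo, hi, levels=5):
    """Map a value in [lo, hi] to an integer level from 0 to levels - 1."""
    if hi <= lo or levels < 2:
        raise ValueError("need hi > lo and at least two levels")
    fraction = (value - lo) / (hi - lo)
    fraction = min(max(fraction, 0.0), 1.0)  # clamp out-of-range values
    return min(int(fraction * levels), levels - 1)


def single_hue_color(value, lo, hi, levels=5, hue=210):
    """Return a CSS hsl() string: same hue, darker for longer delays."""
    level = intensity_level(value, lo, hi, levels)
    # Assumed lightness range: 90% (shortest delay) down to 30% (longest).
    lightness = round(90 - level * 60 / (levels - 1))
    return f"hsl({hue}, 70%, {lightness}%)"


# Hypothetical average delays in minutes, on an assumed 0 to 40 minute scale.
lightest = single_hue_color(0, 0, 40)
darkest = single_hue_color(40, 0, 40)
```

Because only the lightness changes, the viewer has to read a single ordered scale rather than remember what several unrelated hues stand for.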
The grid visualization confirms the message the bar charts first give, and it also tells us something interesting:
there is not much difference in average delay times between days of week
The next chart is a line chart, which shows the average delay time by month of year. A line chart was chosen to show how the delay evolves month by month. After exploring multiple years, I found that the monthly variation stays roughly the same, so it appears to be seasonal. To draw focus to the months with the highest delay times, I changed their color; they turn out to be February (not quite sure why), June through August (summer vacations) and December (holidays).
Visual encodings | Variable |
---|---|
distance x | month of year |
distance y | average delay time |
color hue | high/low average delay time |
The layout changed a lot based on the narrative of the visualization, as the sketches below show. The final layout was chosen because it best communicated the core message. I chose an author-driven approach to the visualization and tried to follow a natural flow of curiosity and practicality (usefulness) of the information. By these principles, the bar chart was the clear first pick for presenting the most information quickly and clearly.
I also decided to keep all graphs because I believe that each one gives an important piece of advice. Even though the clearest one is the hour of day plot, the grid visualization also tells us that it does not matter which day of the week you travel: delay time stays roughly the same. The month graph gives us an idea of the seasonality of flights and maybe a hint that delays depend on weather as well.
This is the first sketch I did to get an overview idea of the visualizations I was planning to use.
This first implementation took some time. The biggest problem was formatting the data so that the site uses an aggregated sample, because the original files are 600 MB or larger. So I implemented the process_data.py script to take care of that.
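As an illustration of the kind of aggregation described above, here is a minimal Python sketch. It is not the actual process_data.py: the function name aggregate_by_hour and the column name CRSDepTime (scheduled departure time as HHMM) are assumptions, while DepDelay is the delay column mentioned in the feedback. The sample rows are made up.

```python
import csv
import io
from collections import defaultdict


def aggregate_by_hour(rows):
    """Count flights and average the departure delay per scheduled hour."""
    totals = defaultdict(lambda: {"flights": 0, "delay_sum": 0.0})
    for row in rows:
        delay = row.get("DepDelay", "")
        sched = row.get("CRSDepTime", "")
        if not delay or delay == "NA" or not sched.isdigit():
            continue  # skip rows with a missing delay or departure time
        hour = (int(sched) // 100) % 24  # 1430 -> 14, 2400 -> 0
        totals[hour]["flights"] += 1
        totals[hour]["delay_sum"] += float(delay)
    return {
        hour: {
            "flights": t["flights"],
            "avg_delay": t["delay_sum"] / t["flights"],
        }
        for hour, t in sorted(totals.items())
    }


# Made-up sample standing in for one of the large yearly files.
sample = io.StringIO(
    "CRSDepTime,DepDelay\n"
    "1430,10\n"
    "1455,20\n"
    "0605,NA\n"
    "0610,-2\n"
)
summary = aggregate_by_hour(csv.DictReader(sample))
```

Reading the file row by row keeps memory use flat, so the same loop would work on a file of several hundred megabytes, and the resulting summary is small enough to ship with the site.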
For the second version, the biggest changes based on feedback were:
- Included a legend for the first graph, since the message was not clear.
- Added chart titles
- Fixed an issue with the bar chart including NaN values
- Removed non-existing 2009 data from line chart
For the third version, the biggest changes based on feedback were:
- Fixed issue with some labels in first visualization being different sizes
- Updated the labels in legend
- Updated the x and y axis labels for both lower charts.
For the fourth version, there were some big changes based on Udacity coach feedback:
- General improvement of main point, story and structure
- All charts are using same dataset for 2008 and using average delay time
- Changed hours on grid visualization to be consistent with other graph, 24 hour format
- Improved titles and subtitles
- In line chart, moved December after November
- Added tooltip to grid visualization and better explained color coding
- Added comments to HTML and CSS and explained commented out code.
- Added Pixelapse feedback (number 4)
- Improved popovers for bar and line chart
- Changed color palette of grid visualization
For the final version, a new graph was added and narrative improved:
- Fixed issue of average of averages by always doing a weighted average.
- Added new chart depicting volumes of flights
- Improved narrative of visualization.
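The "average of averages" issue in the list above can be shown with a small Python sketch. The function name and the numbers are hypothetical and only illustrate why weighting by flight count matters; this is not the code from the project.

```python
def weighted_average(groups):
    """Combine per-group averages, weighting each by its flight count."""
    total_flights = sum(count for count, _ in groups)
    if total_flights == 0:
        raise ValueError("no flights to average")
    return sum(count * avg for count, avg in groups) / total_flights


# Hypothetical (flight_count, average_delay_minutes) pairs for two hours.
groups = [(100, 5.0), (10, 50.0)]

# Naive average of averages: treats both hours as equally important.
naive = sum(avg for _, avg in groups) / len(groups)

# Weighted average: equals the mean delay over all 110 individual flights.
weighted = weighted_average(groups)
```

In this made-up example the naive figure is 27.5 minutes while the weighted one is about 9.1 minutes, because the quiet hour with few flights no longer counts as much as the busy one.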
After hearing the feedback and thinking about the message, I believe that these three graphs are the right design choices.
Getting good feedback is king! It is the only way to validate that the visualization is accomplishing its purpose and communicating the message as intended. A good communication principle tells us that the message that really matters is the one received by the viewer, not the one we believe we are sending.
After multiple iterations, I found out that keeping focus on the main point and checking if the feedback was consistent with the objective is what design is all about.
To that point, the bar chart is the one that most clearly communicates the message. Once the viewer has been educated by the first graph, we can take them to a more elaborate plot, the grid visualization. Finally, to complement the message, we present the line plot, which completes the exploration of delay time across time periods.
I believe that the three charts clearly show:
- The worst flight delays tend to happen between 11pm and 3am
- December holidays and summer have a huge peak in flight delays
- There is not much difference between days of the week
An important note: in the descriptions used in the visualization, people seem to prefer the 12-hour clock format, even though the plots use a 24-hour clock format. This inconsistency is quite interesting. I suspect that when reading text we are trained to interpret the 12-hour format rapidly because we are more accustomed to it, but when analyzing a graph we prefer steadily increasing axis labels, as we expect from plots. It appears that there are two different expectations for the clock format depending on context.
To validate and review my visualizations, I conducted 4 interviews. Most were completed through email and one using Pixelapse. Below you can find the comments. I also included part of the Udacity coach's feedback.
What do the colors of the first graphic mean? Both graphs below don't clearly show the relationship they are tying to show, it takes time to understand. I liked a lot the interaction that both small graphs have. What does DepDelay mean? The graph in the bottom left does not have units of the time being measured (months, days, years)? What is the purpose of the graphs? I like the colors and design of the graphs, but I feel they lack context. Months 5-9 increase DepDelay, surely because of the increased amount of flights in summer time.
Visualization.
- Do you understand the message? yes, a correlation is being made between the departure delays and the time of the delay with the day of the week, the hour and also the month of the year.
- What do you notice in the visualization? departure delays have a tendency to happen more during the afternoon, there are months with highest delays (must be high demand of travel), also on Monday and weekends.
- What questions do you have about the data? it will be interesting to correlate the data with the amount of demand for travel, given that delays must happen by the amount of passengers, bags, etc... that make all the process longer. Also months that have a high delay on flights are high demand months given holidays, so is it more that the airlines are overselling their flights that is causing the delays, or given the high amount of air traffic delay is prone to happen.
- What relationships do you notice? high amount of delays happen on late hours of the day and holiday months.
- What do you think is the main takeaway from this visualization? departure delays are mainly on late hours of the day and specially on Monday. Also there is a relationship year by year on the months where delays are higher so this data could help the passenger to plan ahead what hour of the day is best to travel and also to have in mind in which month delays are to be expected. Additionally is powerful information for airports to better organize air traffic and most of all for the airlines to plan ahead their flight schedule and if no major change can be made to preempt the delays they have to budget money to pay fees to passengers when delays affect their connecting flights or plans.
- Is there something you don’t understand in the graphic? I got a little confused with the second graphic, given that the x-axis only says time and is being based in a 24 hour clock vs the previews chart that is based in a 12 hour clock.
The description for the colors for delay bar for the graph Departure delay by hour of day and day of week could improve. I would put on the left, "No delay" and on the right "Delayed 12,042 minutes" I like the design of each graph. The content now gives a great sight into what each graph is about. One change that might be good is to have an abbreviation of each month (jan, feb, etc...) instead of numbers for the graph Departure delay time (minutes) by month Overall, I see improvement with each version.
When i opened the project, what i saw at first was the title about flight delays in US. Then i skip some information and start trying to understand what graphics are saying. I started with the colorful square based graphic. I understood pretty well it´s information. Then with the blue bars i didn’t understand quickly so i skip it. By this time i started to read a little bit more other texts than could help me to read the graphics. Then i watched the third chart where i started to ask if there are a pattern of this data year by year. I saw that each year this numbers increased. finally i returned to read the second chart that in this time was more easy to understand. Yes, after i read all information it was clear. In first place that the first and second chart are very related. and for conclusion that there is a big amount of time lost in some way and this phenomenon happens depending on the time, day or month. The question i had was that if this situation is getting better or worse thru time. between the first and second chart. I didn’t understand why in the first chart the hour 1a is bigger than the others. No, after i read all information it was clear. It took me like 10min watching all the information in my phone.
You should work further on the data and your use of it in order to make a visualization that is easy to understand. The way that delays have been aggregated in each of the three plots seems to be different and in each case it is difficult to interpret. The note stating the aggregation (above the legend) is very small and hard to read. It is not very easy to understand what 12,000 minutes of delays means in practice. It is currently not possible to understand whether large total delays are caused by long delays or simply many flights in total. It would be better to show the average delay per flight. (Don’t forget to look carefully at which average would be best for this - would a mean or a median be more appropriate?) This would make the graphic easier to relate to a real-life situation. It also disentangles the effect of the number of flights at a given time (hour/day of week/month of year) from the lengths of the delays at that time. This could lead to an interesting investigation of whether delays occur more often at busy times of day. Depending on your findings, this could make a more compelling story.
- add a dynamic grid table, with the ability to change the data for a given year
- add summary circle dashboard on top
- To process the original RITA files, I used R; there is a file called Processing.Rmd.
- The data used is data/2008-DateTime.csv, loaded via D3.js.
http://sebasibarguen.github.io/udacity-nanodegree-visualization/