A slide deck summarizing the implementation and results can be found here
A more detailed write-up can be found here
I used available flight location data to find the most heavily traveled segments of the skies, which correspond to the defined routes that planes generally follow.
ADS-B Exchange provides snapshots of global flight data taken every 5 seconds. Snapshots from the first of every month are available for free download here: https://www.adsbexchange.com/data/#sample
I downloaded the ADS-B snapshot data from the 1st of each of the past 12 months (available as gzip archives) and extracted the raw JSON files. (See code)
I wrote a Python script to load each JSON file into memory and write the relevant fields to a CSV file. Notably, I only wrote records that had non-null values for each of those relevant fields; otherwise the record was discarded. (See code)
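Since the actual script is linked above, the following is only a minimal sketch of this extraction step. It assumes each snapshot JSON has an "acList" array of aircraft records; the field names (Icao, Lat, Long, Type, PosTime) are illustrative ADS-B Exchange fields, not taken from the project code.

```python
import csv
import json
import sys

# Relevant fields to keep (assumed names, for illustration only)
FIELDS = ["Icao", "Lat", "Long", "Type", "PosTime"]

def json_to_csv(json_path, writer):
    with open(json_path) as f:
        snapshot = json.load(f)
    for record in snapshot.get("acList", []):
        values = [record.get(field) for field in FIELDS]
        # Keep only records with non-null values for every relevant field.
        if all(v is not None for v in values):
            writer.writerow(values)

if __name__ == "__main__":
    writer = csv.writer(sys.stdout)
    for path in sys.argv[1:]:
        json_to_csv(path, writer)
```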
This step was reworked in MapReduce (Step 5). Original description:
I transformed the snapshot data into spatial flight segments: look at two consecutive snapshots and find all planes that appear in both. The flight segment for each such plane is the location data (coordinates) from the two snapshots: Location1<latitude, longitude>, Location2<latitude, longitude>.
When considering the segments between snapshots, 5 seconds is too little time for planes to move a useful distance. So I chose to iterate through the snapshots in steps of 5 minutes to allow significant flight segments to form. (I also iterate through the snapshots at every offset between 0 and 300 seconds so that I do not miss flight segments that do not start at a time divisible by 5 minutes. By analogy, counting by 5s starting from 0, 1, 2, 3, and 4 hits every number between 0 and 100.)
For latitude and longitude, two decimal places of precision means the location is accurate to within 1.11 kilometers (according to http://wiki.gis.com/wiki/index.php/Decimal_degrees). This was a judgment call made in conjunction with selecting 5-minute intervals for flight segments; the precision of the coordinates and the length of the interval could be an area for optimization.
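As an illustration of the pairing logic described above (not the project's actual code), here is a minimal sketch in Python. It assumes the snapshots have already been loaded into a dictionary keyed by their time offset in seconds; the function and variable names are hypothetical.

```python
SNAPSHOT_SPACING = 5    # seconds between ADS-B snapshots
SEGMENT_INTERVAL = 300  # 5-minute flight segments

def build_segments(snapshots):
    """snapshots: dict mapping second-offset -> {hex_id: (lat, lon)}."""
    segments = []
    last_time = max(snapshots)
    # Stepping by 5 minutes alone would only pair snapshots at offsets 0, 300, 600, ...
    # Repeating the sweep from every starting offset 0, 5, ..., 295 covers all
    # snapshot times, just as counting by 5 from 0, 1, 2, 3, and 4 hits every number.
    for offset in range(0, SEGMENT_INTERVAL, SNAPSHOT_SPACING):
        for start in range(offset, last_time - SEGMENT_INTERVAL + 1, SEGMENT_INTERVAL):
            end = start + SEGMENT_INTERVAL
            if start not in snapshots or end not in snapshots:
                continue
            # A plane present in both snapshots yields one 5-minute segment.
            for hex_id, loc1 in snapshots[start].items():
                loc2 = snapshots[end].get(hex_id)
                if loc2 is not None:
                    segments.append((hex_id, loc1, loc2))
    return segments
```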
I used a chain job to run several mappers before and after my main reducer step.
I wrote a MapReduce job (mapper only) to round the latitude and longitude coordinates of each record to 2 decimal places. Rounding to 2 decimal places means the coordinates are accurate to within roughly 500 meters of error. This is sufficient for looking at 5-minute segments (discussed further in the next step): the flight path is still represented while similar paths (within roughly 500 meters) can be aggregated together.
(See code)
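The actual mapper is in the linked code; a minimal sketch of what such a rounding step could look like under Hadoop Streaming is shown below. The tab-separated column layout (hex_id, timestamp, latitude, longitude, ...) is an assumption.

```python
#!/usr/bin/env python
import sys

# Assumed input: hex_id \t timestamp \t latitude \t longitude \t ...
for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    try:
        lat, lon = float(fields[2]), float(fields[3])
    except (IndexError, ValueError):
        continue  # skip malformed records
    # Round coordinates to 2 decimal places so nearby paths fall on the same grid point.
    fields[2], fields[3] = "%.2f" % lat, "%.2f" % lon
    print("\t".join(fields))
```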
I wrote a MapReduce job (mapper + reducer) to transform the “snapshots,” an aircraft’s current location at a timestamp, to “segments,” an aircraft’s start and end location over a specific interval. (I chose 5 minute intervals somewhat arbitrarily after trying different values and examining the segments that resulted from the transformation process.)
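A sketch of how such a job might be structured with Hadoop Streaming follows; the actual job is in the linked code. The key layout, column order, and windowing here are assumptions, and the offset sweep from the original description is omitted for brevity.

```python
#!/usr/bin/env python
import sys

INTERVAL = 300  # 5-minute windows

def mapper():
    # Assumed input: hex_id \t timestamp \t lat \t lon
    for line in sys.stdin:
        hex_id, ts, lat, lon = line.rstrip("\n").split("\t")[:4]
        window = int(float(ts)) // INTERVAL
        # Key by aircraft and 5-minute window so the reducer sees one group per segment.
        print("%s:%d\t%s\t%s\t%s" % (hex_id, window, ts, lat, lon))

def reducer():
    key = first = last = None
    for line in sys.stdin:
        k, ts, lat, lon = line.rstrip("\n").split("\t")
        record = (float(ts), lat, lon)
        if k != key:
            if key is not None:
                emit(key, first, last)
            key, first, last = k, record, record
        else:
            first = min(first, record)
            last = max(last, record)
    if key is not None:
        emit(key, first, last)

def emit(key, first, last):
    hex_id = key.split(":")[0]
    # Segment = earliest and latest position inside the window.
    print("\t".join([hex_id, first[1], first[2], last[1], last[2]]))

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```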
I wrote a MapReduce job (mapper only) to remove flight segments whose start and end coordinates were the same, which meant the aircraft was stationary. (This occurs often when planes are on the ground at an airport shortly before takeoff or after landing, but the radar is still tracking the plane's location.)
(See code)
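This filter is simple enough that a sketch is essentially the whole job; as above, the tab-separated column layout is an assumption rather than the project's actual format.

```python
#!/usr/bin/env python
import sys

# Assumed input: hex_id \t start_lat \t start_lon \t end_lat \t end_lon
for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        continue
    # Drop segments whose (rounded) start and end coordinates are identical:
    # the aircraft did not move, e.g. it was parked at a gate.
    if (fields[1], fields[2]) != (fields[3], fields[4]):
        print("\t".join(fields))
```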
I loaded the output of the last MapReduce job into Hive as an external table (simply pointing Hive to the HDFS directory of Step 4’s output). I also created a “metadata” table in Hive from the output of Step 1 that contained static attributes of a flight (such as aircraft type). This was a “SELECT DISTINCT” query that yielded 1 row of metadata for every flight (identified by hex id).
CREATE EXTERNAL TABLE at_seg(id STRING, seg STRING, dist FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE
LOCATION '/user/ntk9692/air-traffic/mr_chain/output-final2';
insert overwrite directory '/user/ntk9692/air-traffic/hex-metadata-1'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE
SELECT hex, aircraft FROM snapshots GROUP BY hex, aircraft;
CREATE EXTERNAL TABLE hex_meta(id STRING, aircraft STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE
LOCATION '/user/ntk9692/air-traffic/hex-metadata-1';
I executed several Hive queries on the joined data from the two tables described in Step 5. These queries returned the top 100 flight segments for certain conditions (such as in the NYC area), determined by counting the frequency of each segment in the dataset of all segments.
SELECT seg, count(*) as num_flights FROM at_seg
GROUP BY seg ORDER BY num_flights DESC LIMIT 10;
SELECT seg, count(*) as num_flights FROM at_seg
WHERE dist > 0.1
GROUP BY seg ORDER BY num_flights DESC LIMIT 1000;
hive -e 'use ntk9692; SELECT seg, count(*) as num_flights FROM at_seg WHERE dist > 0.1 GROUP BY seg ORDER BY num_flights DESC LIMIT 1000;' | sed 's/[\t]/,/g' > top-1000.csv
SELECT seg, count(*) as num_flights FROM at_seg
JOIN hex_meta ON (at_seg.id = hex_meta.id)
WHERE dist > 0.1 AND (aircraft LIKE "%B74%" OR aircraft LIKE "%A380%")
GROUP BY seg ORDER BY num_flights DESC LIMIT 1000;
hive -e 'use ntk9692; SELECT seg, count(*) as num_flights FROM at_seg JOIN hex_meta ON (at_seg.id = hex_meta.id) WHERE dist > 0.1 AND (aircraft LIKE "%B74%" OR aircraft LIKE "%A380%") GROUP BY seg ORDER BY num_flights DESC LIMIT 1000;' | sed 's/[\t]/,/g' > top-1000-heavy.csv
SELECT seg, count(*) as num_flights FROM at_seg
JOIN hex_meta ON (at_seg.id = hex_meta.id)
WHERE dist > 0.1 AND (aircraft LIKE "%B73%" OR aircraft LIKE "%B75%" OR aircraft LIKE "%A32%" OR aircraft LIKE "%A2%" OR aircraft LIKE "%A31%")
GROUP BY seg ORDER BY num_flights DESC LIMIT 1000;
hive -e 'use ntk9692; SELECT seg, count(*) as num_flights FROM at_seg JOIN hex_meta ON (at_seg.id = hex_meta.id) WHERE dist > 0.1 AND (aircraft LIKE "%B73%" OR aircraft LIKE "%B75%" OR aircraft LIKE "%A32%" OR aircraft LIKE "%A2%" OR aircraft LIKE "%A31%") GROUP BY seg ORDER BY num_flights DESC LIMIT 1000;' | sed 's/[\t]/,/g' > top-1000-standard.csv
I wrote a Python script to transform the flight segments returned by the queries into GeoJSON, which can be imported directly into ArcGIS Online, a web-based mapping application. The maps with the resulting segments are included in the next sections.
(See code)
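The conversion itself is straightforward. Below is a minimal sketch, assuming the query output CSV ends up with four coordinate columns followed by the flight count; the actual encoding of the seg field is in the linked code, so this layout is an assumption.

```python
import csv
import json
import sys

def csv_to_geojson(csv_path):
    features = []
    with open(csv_path) as f:
        for row in csv.reader(f):
            # Assumed row layout: lat1, lon1, lat2, lon2, num_flights
            *coords, num_flights = row
            lat1, lon1, lat2, lon2 = map(float, coords)
            features.append({
                "type": "Feature",
                # GeoJSON orders coordinates as [longitude, latitude]
                "geometry": {"type": "LineString",
                             "coordinates": [[lon1, lat1], [lon2, lat2]]},
                "properties": {"num_flights": int(num_flights)},
            })
    return {"type": "FeatureCollection", "features": features}

if __name__ == "__main__":
    json.dump(csv_to_geojson(sys.argv[1]), sys.stdout, indent=2)
```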
Maps generated in ArcGIS
https://www.arcgis.com/home/webmap/viewer.html?webmap=4e52e534d7d04ba9b706a066876cde7f
https://www.arcgis.com/home/webmap/viewer.html?webmap=57d15b5f7a304ebab9dba322d9c34700