From aae39571981c84c7fc29eac0e27209fe22f9f933 Mon Sep 17 00:00:00 2001 From: Alexey Milovidov Date: Fri, 29 Mar 2024 23:39:55 +0100 Subject: [PATCH] Add README --- README.md | 342 ++++++++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 330 insertions(+), 12 deletions(-) diff --git a/README.md b/README.md index 92f9538..8beef51 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,6 @@ # ADS-B Massive Visualizer -This website allows you to aggregate and visualize massive amounts of air traffic data. The data is hosted in a ClickHouse database and queried on the fly. You can tune the visualizations with custom SQL queries and drill-down from 50 billion records down to individual data records. +This website is a technology demo that allows you to aggregate and visualize massive amounts of air traffic data. The data is hosted in a ClickHouse database and queried on the fly. You can tune the visualizations with custom SQL queries and drill-down from 50 billion records down to individual data records. ## Examples @@ -173,13 +173,329 @@ But DC-3 is unanimously the best: ## Data Source -ADS-B (Automatic Dependent Surveillance–Broadcast) is a radio protocol that is used to broadcast the data such as coordinates, altitude and velocity, and plenty of other interesting data. It is broadcast by "transponders" installed on airplanes (and not only planes). This protocol is unencrypted and there is no restrictions for collecting, using, or redistributing this data. Most of passenger airplanes are obligated to broadcast this data, and in certain countries even gliders, drones, and airport ground vehicles. Military and private light aircrafts might broadcast or not broadcast the data. +ADS-B (Automatic Dependent Surveillance–Broadcast) is a radio protocol that is used to broadcast the data such as coordinates, altitude and velocity, and plenty of other interesting data. It is broadcast by "transponders" installed on airplanes (and not only planes). This protocol is unencrypted and there are no restrictions for collecting, using, or redistributing this data. Most of passenger airplanes are obligated to broadcast this data, and in certain countries even gliders, drones, and airport ground vehicles. Military and private light aircrafts might broadcast or not broadcast the data. It is possible to collect this data out of thin air using your own radio receiver (e.g. in the form of SDR), although your receiver will see the data only in a certain range of your location. There are platforms for sharing and exchange of this data. Some of these platforms invite participants to share the data but restrict its redistribution by providing a commercial access to it. While the source data, broadcast by airplanes is essentially public domain, the companies may produce and license derivative works from this data. We use the data from two sources: [ADSB.lol](https://www.adsb.lol/) (full historical data is provided without restrictions: 30..50 million records per day, available since 2023) and [ADSB-Exchange](https://www.adsbexchange.com/products/historical-data/) (only provides samples of data from first day of each month: around 1.2 billion records per day with better coverage). +## Implementation Details + +The website is implemented as a single HTML page. It does not use JavaScript frameworks and the source code is not minified, so you can easily [read it](https://github.com/ClickHouse/adsb.exposed/blob/main/index.html). + +### Rendering the Map + +It uses a [Leaflet](https://github.com/Leaflet/Leaflet/) library to display the map. It adds two layers on the map. The background layer uses tiles from OpenStreetMap for a usual geographic map. The main layer overlays the visualization on top of the background map. + +The visualisation layer uses a `GridLayer` with a custom callback function `createTile` which generates Canvas elements on the fly: + +``` +L.GridLayer.ClickHouse = L.GridLayer.extend({ + createTile: function(coords, done) { + let tile = L.DomUtil.create('canvas', 'leaflet-tile'); + tile.width = 1024; + tile.height = 1024; + render(this.options.table, this.options.priority, coords, tile).then(err => done(err, tile)); + return tile; + } +}); + +const layer_options = { + tileSize: 1024, + minZoom: 2, + maxZoom: 19, + minNativeZoom: 2, + maxNativeZoom: 16, + attribution: '© Alexey Milovidov, ClickHouse, Inc. (data: adsb.lol, adsbexchange.com)' +}; +``` + +It uses tiles of 1024x1024 size for high resolution and to lower the number of requests to the database. + +The rendering function performs a request to ClickHouse using its HTTP API with the JavaScript's `fetch` function: + +``` +const query_id = `${uuid}-${query_sequence_num}-${table}-${coords.z - 2}-${coords.x}-${coords.y}`; +const hosts = getHosts(key); +const url = host => `${host}/?user=website&default_format=RowBinary` + + `&query_id=${query_id}&replace_running_query=1` + + `¶m_table=${table}¶m_sampling=${[0, 100, 10, 1][priority]}` + + `¶m_z=${coords.z - 2}¶m_x=${coords.x}¶m_y=${coords.y}`; + +progress_update_period = 1; +const response = await Promise.race(hosts.map(host => fetch(url(host), { method: 'POST', body: sql }))); +``` + +The SQL query can be edited by a user on the fly using a form on the page to adjust the visualization. It is a parameterized query, accepting tile coordinates (x, y) and zoom level as parameters. + +The query returns RGBA values of each pixel of the image in the `RowBinary` format (1024x1024 pixels, 1048576 rows, 4 bytes each, 4 MiB in total for each tile). It uses `zstd` compression in HTTP response as long as the browser supports it. It was a nice observation that `zstd` compression over raw pixels bitmap works better than `PNG`! (not surprising, though). + +Most of the times compress several times. But anyway, hundreds of megabytes have to be transferred over network. That's why the service can feel slow on bad Internet connections. + +``` +let ctx = tile.getContext('2d'); +let image = ctx.createImageData(1024, 1024, {colorSpace: 'display-p3'}); +let arr = new Uint8ClampedArray(buf); + +for (let i = 0; i < 1024 * 1024 * 4; ++i) { image.data[i] = arr[i]; } + +ctx.putImageData(image, 0, 0, 0, 0, 1024, 1024); +``` + +The data is put on the canvas and returned. We use the "Display P3" color space to have a wider gamut in supporting browsers. + +We use three different tables with different levels of detail: `planes_mercator` contains 100% of data, `planes_mercator_sample10` contains 10% of data and `planes_mercator_sample100` contains 1% of data. The loading starts with 1% sample to provide instant response even while rendering the whole world. After loading the first level of detail, it continues of the next level of 10%, and then it continues with 100% of data. This gives a nice effect of progressive loading. + +The image data is also cached on the client simply in a JavaScript object: + +``` +if (!cached_tiles[key]) cached_tiles[key] = []; + +/// If there is a higer-detail tile, skip rendering of this level of detal. +if (cached_tiles[key][priority + 1]) return; + +buf = cached_tiles[key][priority]; +``` + +The downside is that after browsing for a certain time, the page will eat too much memory. + +### Database and Queries + +The database is pretty small by ClickHouse standards. It has 44.47 billion rows in the `planes_mercator` table as of Mar 29th 2024 and continuously updated with the new records. It takes 1.6 TB of disk space. + +The table schema is as follows (you can read it in the [setup.sql](https://github.com/ClickHouse/adsb.exposed/blob/main/setup.sql) source): +``` +CREATE TABLE planes_mercator +( + mercator_x UInt32 MATERIALIZED 0xFFFFFFFF * ((lon + 180) / 360), + mercator_y UInt32 MATERIALIZED 0xFFFFFFFF * (1/2 - log(tan((lat + 90) / 360 * pi())) / 2 / pi()), + + INDEX idx_x (mercator_x) TYPE minmax, + INDEX idx_y (mercator_y) TYPE minmax, + + time DateTime64(3), + date Date, + icao String, + r String, + t LowCardinality(String), + dbFlags Int32, + noRegData Bool, + ownOp LowCardinality(String), + year UInt16, + desc LowCardinality(String), + lat Float64, + lon Float64, + altitude Int32, + ground_speed Float32, + track_degrees Float32, + flags UInt32, + vertical_rate Int32, + aircraft_alert Int64, + aircraft_alt_geom Int64, + aircraft_gva Int64, + aircraft_nac_p Int64, + aircraft_nac_v Int64, + aircraft_nic Int64, + aircraft_nic_baro Int64, + aircraft_rc Int64, + aircraft_sda Int64, + aircraft_sil Int64, + aircraft_sil_type LowCardinality(String), + aircraft_spi Int64, + aircraft_track Float64, + aircraft_type LowCardinality(String), + aircraft_version Int64, + aircraft_category Enum8( + 'A0', 'A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', + 'B0', 'B1', 'B2', 'B3', 'B4', 'B5', 'B6', 'B7', + 'C0', 'C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', + 'D0', 'D1', 'D2', 'D3', 'D4', 'D5', 'D6', 'D7', + ''), + aircraft_emergency Enum8('', 'none', 'general', 'downed', 'lifeguard', 'minfuel', 'nordo', 'unlawful', 'reserved'), + aircraft_flight LowCardinality(String), + aircraft_squawk String, + aircraft_baro_rate Int64, + aircraft_nav_altitude_fms Int64, + aircraft_nav_altitude_mcp Int64, + aircraft_nav_modes Array(Enum8('althold', 'approach', 'autopilot', 'lnav', 'tcas', 'vnav')), + aircraft_nav_qnh Float64, + aircraft_geom_rate Int64, + aircraft_ias Int64, + aircraft_mach Float64, + aircraft_mag_heading Float64, + aircraft_oat Int64, + aircraft_roll Float64, + aircraft_tas Int64, + aircraft_tat Int64, + aircraft_true_heading Float64, + aircraft_wd Int64, + aircraft_ws Int64, + aircraft_track_rate Float64, + aircraft_nav_heading Float64, + source LowCardinality(String), + geometric_altitude Int32, + geometric_vertical_rate Int32, + indicated_airspeed Int32, + roll_angle Float32 +) ENGINE = MergeTree ORDER BY (mortonEncode(mercator_x, mercator_y), time); +``` + +It contains `lat`, `lon` columns with coordinates, and we use `MATERIALIZED` columns to automatically convert them to coordinates in the [Web-Mercator projection](https://en.wikipedia.org/wiki/Web_Mercator_projection), which is used by the Leaflet software and most of the maps on the Internet. The mercator coordinates are stored in UInt32 which makes it easy to do arithmetics with tile coordinates and zoom levels in a SQL query. + +We create level-of-detail tables with [Materialized Views](https://clickhouse.com/docs/en/guides/developer/cascading-materialized-views), so they are calculated automatically: + +``` +CREATE TABLE planes_mercator_sample10 AS planes_mercator; +CREATE TABLE planes_mercator_sample100 AS planes_mercator; + +CREATE MATERIALIZED VIEW view_sample10 TO planes_mercator_sample10 AS SELECT * FROM planes_mercator WHERE rand() % 10 = 0; +CREATE MATERIALIZED VIEW view_sample100 TO planes_mercator_sample100 AS SELECT * FROM planes_mercator WHERE rand() % 100 = 0; +``` + +We use a service on our staging environment in [ClickHouse Cloud](https://clickhouse.com/cloud). The staging environment is used to test new ClickHouse versions and new types of infrastructure that we implement. For example, we can try different type of machines, different scales of the service, or we can test new features, such as distributed cache, that are in development. Also, the staging environment uses fault injection: we interrupt network connections with a certain probability to ensure that the service operates normally. It uses chaos engineering: we terminate various machines of clickhouse-server and clickhouse-keeper at random, and also randomly scale the service back and forth to a different number of machines. This is how this project facilitates the development and testing of our service. + +We also created a backup service and do requests to both of them concurrently. Whichever service returns first, it will be used. This is how we can avoid downtime while still using our staging environment. + +Let's take a look at an SQL query for the "Boeing vs. Airbus" visualization: + +``` +WITH + bitShiftLeft(1::UInt64, {z:UInt8}) AS zoom_factor, + bitShiftLeft(1::UInt64, 32 - {z:UInt8}) AS tile_size, + + tile_size * {x:UInt16} AS tile_x_begin, + tile_size * ({x:UInt16} + 1) AS tile_x_end, + + tile_size * {y:UInt16} AS tile_y_begin, + tile_size * ({y:UInt16} + 1) AS tile_y_end, + + mercator_x >= tile_x_begin AND mercator_x < tile_x_end + AND mercator_y >= tile_y_begin AND mercator_y < tile_y_end AS in_tile, + + bitShiftRight(mercator_x - tile_x_begin, 32 - 10 - {z:UInt8}) AS x, + bitShiftRight(mercator_y - tile_y_begin, 32 - 10 - {z:UInt8}) AS y, + + y * 1024 + x AS pos, + + count() AS total, + sum(desc LIKE 'BOEING%') AS boeing, + sum(desc LIKE 'AIRBUS%') AS airbus, + sum(NOT (desc LIKE 'BOEING%' OR desc LIKE 'AIRBUS%')) AS other, + + greatest(1000000 DIV {sampling:UInt32} DIV zoom_factor, total) AS max_total, + greatest(1000000 DIV {sampling:UInt32} DIV zoom_factor, boeing) AS max_boeing, + greatest(1000000 DIV {sampling:UInt32} DIV zoom_factor, airbus) AS max_airbus, + greatest(1000000 DIV {sampling:UInt32} DIV zoom_factor, other) AS max_other, + + pow(total / max_total, 1/5) AS transparency, + + 255 * (1 + transparency) / 2 AS alpha, + pow(boeing, 1/5) * 256 DIV (1 + pow(max_boeing, 1/5)) AS red, + pow(airbus, 1/5) * 256 DIV (1 + pow(max_airbus, 1/5)) AS green, + pow(other, 1/5) * 256 DIV (1 + pow(max_other, 1/5)) AS blue + +SELECT round(red)::UInt8, round(green)::UInt8, round(blue)::UInt8, round(alpha)::UInt8 +FROM {table:Identifier} +WHERE in_tile +GROUP BY pos ORDER BY pos WITH FILL FROM 0 TO 1024*1024 +``` + +It uses the `WITH` clause to define aliases for various expressions. It uses type-safe parameters (x, y, z, sampling, table) for [parameterized queries](https://clickhouse.com/docs/en/sql-reference/syntax#defining-and-using-query-parameters). + +The first part of the query calculates the condition `in_tile` which is used in the `WHERE` section to filter the data in the requested tile. Then it calculates the colors: alpha, red, green, blue. They are adjusted by the `pow` function for better uniformity, clamped to the 0..255 range, and converted to `UInt8`. The sampling parameter is used for adjustment in a way so that queries with lower level of detail will return pictures with mostly the same relative colors. We group by the pixel coordinate `pos` and also use the [`WITH FILL` modifier](https://clickhouse.com/docs/en/sql-reference/statements/select/order-by#order-by-expr-with-fill-modifier) in the ORDER BY to fill zeros in the pixel positions that have no data. As a result, we will get an RGBA bitmap with the exact 1024x1024 size. + +### Reports + +If you select an area with the right mouse button, or using a selection tool, it will generate a report from the database for the selection area. This is entirely straightforward. For example, here is a query for the top aircraft types: + +``` +const sql_types = ` + WITH mercator_x >= {left:UInt32} AND mercator_x < {right:UInt32} + AND mercator_y >= {top:UInt32} AND mercator_y < {bottom:UInt32} AS in_tile + SELECT t, anyIf(desc, desc != '') AS desc, count() AS c + FROM {table:Identifier} + WHERE t != '' AND ${condition} + GROUP BY t + ORDER BY c DESC + LIMIT 100`; +``` + +The report is calculated for flight numbers, aircraft types, registration (tail numbers), and owners. You can click on any item and it will apply filter to the main SQL query. For example, click on `A388` and it will show you a visualization for Airbus 380-800. + +As a bonus, if you move the cursor over an aircraft type, it will go to Wikipedia API and try to find a picture of this aircraft. It often finds something else on Wikipedia, though. + +### Saved Queries + +You can edit a query and then share a link. The query is converted to a 128-bit hash and saved in the same ClickHouse database: + +``` +async function saveQuery(text) { + const sql = `INSERT INTO saved_queries (text) FORMAT RawBLOB`; + const hosts = getHosts(null); + const url = host => `${host}/?user=website_saved_queries&query=${encodeURIComponent(sql)}`; + const response = await Promise.all(hosts.map(host => fetch(url(host), { method: 'POST', body: text }))); +} + +async function loadQuery(hash) { + const sql = `SELECT text FROM saved_queries WHERE hash = unhex({hash:String}) LIMIT 1`; + const hosts = getHosts(null); + const url = host => `${host}/?user=website_saved_queries&default_format=JSON¶m_hash=${hash}`; + const response = await Promise.race(hosts.map(host => fetch(url(host), { method: 'POST', body: sql }))); + const data = await response.json(); + return data.data[0].text; +} +``` + +We use a different user `website_saved_queries` with different access control and quotas for these queries. + +### Progress Bar + +It is nice to display a progress bar with the amount of data processed in rows and bytes. + +``` +const sql = `SELECT + sum(read_rows) AS r, + sum(total_rows_approx) AS t, + sum(read_bytes) AS b, + r / max(elapsed) AS rps, + b / max(elapsed) AS bps, + formatReadableQuantity(r) AS formatted_rows, + formatReadableSize(b) AS formatted_bytes, + formatReadableQuantity(rps) AS formatted_rps, + formatReadableSize(bps) AS formatted_bps + FROM clusterAllReplicas(default, system.processes) + WHERE user = 'website' AND startsWith(query_id, {uuid:String})`; + + const hosts = getHosts(uuid); + const url = host => `${host}/?user=website_progress&default_format=JSON¶m_uuid=${uuid}`; + + let responses = await Promise.all(hosts.map(host => fetch(url(host), { method: 'POST', body: sql }))); + let datas = await Promise.all(responses.map(response => response.json())); +``` + +We just select from `system.processes` across all servers of the cluster. Actually it displays not the precise progress, because there are many tiles requested in parallel and many queries, with some of them finished and some in progress. The query will see only in-progress queries, so the number of total processed records will be lower than the actual. + +We also color the progress bar differently when we are loading the first level of detail, the second level of detail, etc. + +### Cache Locality + +The service in ClickHouse Cloud can use multiple replicas and by default, the requests are routed to an arbitrary healthy replica. If a query is going to process a large amount of data, it will be parallelized across many replicas, but if it is a small query it can use just a single replica. The data is stored in AWS S3 and each replica pod also has locally attached SSD which is used for the cache, and consequently, the page cache in memory also makes an effect. + +To achieve better performance, we use a "sticky connections" feature of ClickHouse Cloud: + +``` +let prefix = ''; +if (stick) { + const hash = sipHash128(JSON.stringify(stick)).substr(0, 3); + prefix = `${hash}.sticky.`; +} +const cloud_host = `https://${prefix}kvzqttvc2n.eu-west-1.aws.clickhouse-staging.com`; +``` + +It maps a query to a pseudorandom host name, e.g. `xyz.sticky.kvzqttvc2n.eu-west-1.aws.clickhouse-staging.com` so that the same query will be mapped to the same hostname. The load balancer maps this host name to a replica using a consistent hashing algorithm. + + ## References ### Similar Projects @@ -224,13 +540,15 @@ This blog post shows how to analyze a single day of data using a Python script. ### List Of Exchanges -- adsb.lol -- adsb.fi -- adsb.im -- adsb.one -- asdbhub.org -- adsbexchange.com -- flightaware.com -- flightradar24.com -- opensky-network.org -- openflights.org +- https://adsb.lol/ - the best resource, provides historical data for download with no strings attached; +- https://adsb.fi/ - advertises open and unfiltered access, but does not provide batch data access; +- https://adsb.one/ - advertises being dedicated to public interest, but does not provide batch data access; +- https://adsb.im/ - does not provide batch data access; +- https://www.adsbhub.org/ +- https://adsbexchange.com/ - a large data archive, but no longer provides the full data publicly after the acquisition; +- https://flightaware.com/ +- https://flightradar24.com/ - a popular website, good coverage, but no public data access; +- https://opensky-network.org/ - supposed to be open for research purpuses, but they didn't answer to email; +- https://openflights.org/ + +I want to help you with your database infrastructure. Please [contact me here](https://clickhouse.com/company/contact?loc=adsb).