fix(route/duckdb): change blogs link and author #17856
Merged
Conversation
Successfully generated as following: http://localhost:1200/duckdb/news - Success ✔️
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
<channel>
<title>DuckDB News</title>
<link>https://duckdb.org/news/</link>
<atom:link href="http://localhost:1200/duckdb/news" rel="self" type="application/rss+xml"></atom:link>
<description>DuckDB News - Powered by RSSHub</description>
<generator>RSSHub</generator>
<webMaster>[email protected] (RSSHub)</webMaster>
<language>en</language>
<lastBuildDate>Tue, 10 Dec 2024 13:24:55 GMT</lastBuildDate>
<ttl>5</ttl>
<item>
<title>The DuckDB Avro Extension</title>
<description><div class="content">
<div class="contentwidth">
<h1>The DuckDB Avro Extension</h1>
<div class="infoline">
<div class="icon">
<img src="https://duckdb.org/images/blog/authors/hannes_muehleisen.jpg" alt="Author Avatar" referrerpolicy="no-referrer">
</div>
<div>
<span class="author">Hannes Mühleisen</span>
<div class="publishedinfo">
<span>Published on</span>
<span class="date">2024-12-09</span>
</div>
</div>
</div>
<div class="excerpt">
<p><em>TL;DR: DuckDB now supports reading Avro files through the <code class="language-plaintext highlighter-rouge">avro</code> Community Extension.</em></p>
</div>
<h2 id="the-apache-avro-format">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/09/duckdb-avro-extension.html#the-apache-avro-format">The Apache™ Avro™ Format</a>
</h2>
<p><a href="https://avro.apache.org/">Avro</a> is a binary format for record data. Like many innovations in the data space, Avro was <a href="https://vimeo.com/7362534">developed</a> by <a href="https://en.wikipedia.org/wiki/Doug_Cutting">Doug Cutting</a> as part of the Apache Hadoop project <a href="https://github.com/apache/hadoop/commit/8296413d4988c08343014c6808a30e9d5e441bfc">in around 2009</a>. Avro gets its name – somewhat obscurely – from a defunct <a href="https://en.wikipedia.org/wiki/Avro">British aircraft manufacturer</a>. The company famously built over 7,000 <a href="https://en.wikipedia.org/wiki/Avro_Lancaster">Avro Lancaster heavy bombers</a> under the challenging conditions of World War 2. But we digress.</p>
<p>The Avro format is yet another attempt to solve the dimensionality reduction problem that occurs when transforming a complex <em>multi-dimensional data structure</em> like tables (possibly with nested types) to a <em>single-dimensional storage layout</em> like a flat file, which is just a sequence of bytes. The most fundamental question that arises here is whether to use a columnar or a row-major layout. Avro uses a row-major layout, which differentiates it from its famous cousin, the <a href="https://parquet.apache.org/">Apache™ Parquet™</a> format. There are valid use cases for a row-major format: for example, appending a few rows to a Parquet file is difficult and inefficient because of Parquet's columnar layout and due to the fact that the Parquet metadata is stored <em>at the back</em> of the file. In a row-major format like Avro with the metadata <em>up top</em>, we can “just” add those rows to the end of the file and we're done. This enables Avro to handle appends of a few rows somewhat efficiently.</p>
<p>Avro-encoded data can appear in several ways, e.g., in <a href="https://en.wikipedia.org/wiki/Remote_procedure_call">RPC messages</a> but also in files. In the following, we focus on files since those survive long-term.</p>
<h3 id="header-block">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/09/duckdb-avro-extension.html#header-block">Header Block</a>
</h3>
<p>Avro “object container” files are encoded using a comparatively simple binary <a href="https://avro.apache.org/docs/++version++/specification/#object-container-files">format</a>: each file starts with a <strong>header block</strong> that first has the <a href="https://en.wikipedia.org/wiki/List_of_file_signatures">magic bytes</a> <code class="language-plaintext highlighter-rouge">Obj1</code>. Then, a metadata “map” (a list of string-bytearray key-value pairs) follows. The map is only strictly required to contain a single entry for the <code class="language-plaintext highlighter-rouge">avro.schema</code> key. This key contains the Avro file schema encoded as JSON. Here is an example for such a schema:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"namespace"</span><span class="p">:</span><span class="w"> </span><span class="s2">"example.avro"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"record"</span><span class="p">,</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"User"</span><span class="p">,</span><span class="w">
</span><span class="nl">"fields"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"name"</span><span class="p">,</span><span class="w"> </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"string"</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"favorite_number"</span><span class="p">,</span><span class="w"> </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"int"</span><span class="p">,</span><span class="w"> </span><span class="s2">"null"</span><span class="p">]},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"favorite_color"</span><span class="p">,</span><span class="w"> </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"string"</span><span class="p">,</span><span class="w"> </span><span class="s2">"null"</span><span class="p">]}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>The Avro schema defines a record structure. Records can contain scalar data fields (like <code class="language-plaintext highlighter-rouge">int</code>, <code class="language-plaintext highlighter-rouge">double</code>, <code class="language-plaintext highlighter-rouge">string</code>, etc.) but also more complex types like records (similar to <a href="https://duckdb.org/docs/sql/data_types/struct.html">DuckDB <code class="language-plaintext highlighter-rouge">STRUCT</code>s</a>), unions and lists. As a sidenote, it is quite strange that a data format for the definition of record structures would fall back to another format like JSON to describe itself, but such are the oddities of Avro.</p>
<h3 id="data-blocks">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/09/duckdb-avro-extension.html#data-blocks">Data Blocks</a>
</h3>
<p>The header concludes with 16 randomly chosen bytes as a “sync marker”. The header is followed by an arbitrary number of <strong>data blocks</strong>: each data block starts with a record count, followed by a size and a byte array containing the actual records. Optionally, the bytes can be compressed with deflate (gzip), which will be known from the header metadata.</p>
<p>The data bytes can only be decoded using the schema. The <a href="https://avro.apache.org/docs/++version++/specification/#object-container-files">object file specification</a> contains the details on how each type is encoded. For example, in the example schema we know each value is a record of three fields. The root-level record will encode its entries in the order they are declared. There are no actual bytes required for this. First we will be reading the <code class="language-plaintext highlighter-rouge">name</code> field. Strings consist of a length followed by the string bytes. Like other formats (e.g., Thrift), Avro uses <a href="https://en.wikipedia.org/wiki/Variable-length_quantity#Zigzag_encoding">variable-length integers with zigzag encoding</a> to store lengths and counts and the like. After reading the string, we can proceed to <code class="language-plaintext highlighter-rouge">favorite_number</code>. This field is a union type (encoded with the <code class="language-plaintext highlighter-rouge">[]</code> syntax). This union can have values of two types, <code class="language-plaintext highlighter-rouge">int</code> and <code class="language-plaintext highlighter-rouge">null</code>. The <code class="language-plaintext highlighter-rouge">null</code> type is a bit odd, it can only be used to encode the fact that a value is missing. To decode the <code class="language-plaintext highlighter-rouge">favorite_number</code> fields, we first read an <code class="language-plaintext highlighter-rouge">int</code> that encodes which choice of the union was used. Afterward, we use the “normal” decoders to read the values (e.g., <code class="language-plaintext highlighter-rouge">int</code> or <code class="language-plaintext highlighter-rouge">null</code>). The same can be done for <code class="language-plaintext highlighter-rouge">favorite_color</code>. Each data block again ends with the sync marker. The sync marker can be used to verify that the block was fully written and that there is no garbage in the file.</p>
<h2 id="the-duckdb-avro-community-extension">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/09/duckdb-avro-extension.html#the-duckdb-avro-community-extension">The DuckDB <code class="language-plaintext highlighter-rouge">avro</code> Community Extension</a>
</h2>
<p>We have developed a DuckDB Community Extension that enables DuckDB to <em>read</em> <a href="https://avro.apache.org/">Apache Avro™</a> files.</p>
<p>The extension does not contain Avro <em>write</em> functionality. This is on purpose: by not providing a writer, we hope to decrease the number of Avro files in the world over time.</p>
<h3 id="installation--loading">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/09/duckdb-avro-extension.html#installation--loading">Installation &amp; Loading</a>
</h3>
<p>Installation is simple through the DuckDB Community Extension repository; just type</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">INSTALL</span> <span class="n">avro</span> <span class="k">FROM</span> <span class="n">community</span><span class="p">;</span>
<span class="k">LOAD</span> <span class="n">avro</span><span class="p">;</span>
</code></pre></div></div>
<p>in a DuckDB instance near you. There is currently no build for Wasm because of dependencies (sigh).</p>
<h3 id="the-read_avro-function">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/09/duckdb-avro-extension.html#the-read_avro-function">The <code class="language-plaintext highlighter-rouge">read_avro</code> Function</a>
</h3>
<p>The extension adds a single DuckDB function, <code class="language-plaintext highlighter-rouge">read_avro</code>. This function can be used like so:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">FROM</span> <span class="nf">read_avro</span><span class="p">(</span><span class="s1">'some_example_file.avro'</span><span class="p">);</span>
</code></pre></div></div>
<p>This function will expose the contents of the Avro file as a DuckDB table. You can then use any arbitrary SQL constructs to further transform this table.</p>
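<p>For instance — a minimal sketch, assuming a hypothetical <code class="language-plaintext highlighter-rouge">users.avro</code> file that follows the example <code class="language-plaintext highlighter-rouge">User</code> schema shown above — a regular aggregation can run directly on the scan:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- hypothetical file name; columns come from the example User schema
SELECT favorite_color, count(*) AS cnt
FROM read_avro('users.avro')
GROUP BY favorite_color
ORDER BY cnt DESC;
</code></pre></div></div>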
<h3 id="file-io">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/09/duckdb-avro-extension.html#file-io">File IO</a>
</h3>
<p>The <code class="language-plaintext highlighter-rouge">read_avro</code> function is integrated into DuckDB's file system abstraction, meaning you can read Avro files directly from e.g., HTTP or S3 sources. For example:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">FROM</span> <span class="nf">read_avro</span><span class="p">(</span><span class="s1">'http://blobs.duckdb.org/data/userdata1.avro'</span><span class="p">);</span>
<span class="k">FROM</span> <span class="nf">read_avro</span><span class="p">(</span><span class="s1">'s3://my-example-bucket/some_example_file.avro'</span><span class="p">);</span>
</code></pre></div></div>
<p>should “just” work.</p>
<p>You can also <a href="https://duckdb.org/docs/sql/functions/pattern_matching.html#globbing"><em>glob</em> multiple files</a> in a single read call or pass a list of files to the functions:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">FROM</span> <span class="nf">read_avro</span><span class="p">(</span><span class="s1">'some_example_file_*.avro'</span><span class="p">);</span>
<span class="k">FROM</span> <span class="nf">read_avro</span><span class="p">([</span><span class="s1">'some_example_file_1.avro'</span><span class="p">,</span> <span class="s1">'some_example_file_2.avro'</span><span class="p">]);</span>
</code></pre></div></div>
<p>If the filenames somehow contain valuable information (as is unfortunately all-too-common), you can pass the <code class="language-plaintext highlighter-rouge">filename</code> argument to <code class="language-plaintext highlighter-rouge">read_avro</code>:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">FROM</span> <span class="nf">read_avro</span><span class="p">(</span><span class="s1">'some_example_file_*.avro'</span><span class="p">,</span> <span class="k">filename</span> <span class="o">=</span> <span class="k">true</span><span class="p">);</span>
</code></pre></div></div>
<p>This will result in an additional column in the result set that contains the actual filename of the Avro file.</p>
<h3 id="schema-conversion">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/09/duckdb-avro-extension.html#schema-conversion">Schema Conversion</a>
</h3>
<p>This extension automatically translates the Avro Schema to the DuckDB schema. <em>All</em> Avro types can be translated, except for <em>recursive type definitions</em>, which DuckDB does not support.</p>
<p>The type mapping is very straightforward except for Avro's “unique” way of handling <code class="language-plaintext highlighter-rouge">NULL</code>. Unlike other systems, Avro does not treat <code class="language-plaintext highlighter-rouge">NULL</code> as a possible value in a range of e.g., <code class="language-plaintext highlighter-rouge">INTEGER</code> but instead represents <code class="language-plaintext highlighter-rouge">NULL</code> as a union of the actual type with a special <code class="language-plaintext highlighter-rouge">NULL</code> type. This is different to DuckDB, where any value can be <code class="language-plaintext highlighter-rouge">NULL</code>. Of course DuckDB also supports <code class="language-plaintext highlighter-rouge">UNION</code> types, but this would be quite cumbersome to work with.</p>
<p>This extension <em>simplifies</em> the Avro schema where possible: an Avro union of any type and the special null type is simplified to just the non-null type. For example, an Avro record of the union type <code class="language-plaintext highlighter-rouge">["int", "null"]</code> (like <code class="language-plaintext highlighter-rouge">favorite_number</code> in the <a href="https://duckdb.org/2024/12/09/duckdb-avro-extension.html#header-block">example</a>) becomes a DuckDB <code class="language-plaintext highlighter-rouge">INTEGER</code>, which just happens to be <code class="language-plaintext highlighter-rouge">NULL</code> sometimes. Similarly, an Avro union that contains only a single type is converted to the type it contains. For example, an Avro record of the union type <code class="language-plaintext highlighter-rouge">["int"]</code> also becomes a DuckDB <code class="language-plaintext highlighter-rouge">INTEGER</code>.</p>
<p>The extension also “flattens” the Avro schema. Avro defines tables as root-level “record” fields, which are the same as DuckDB <code class="language-plaintext highlighter-rouge">STRUCT</code> fields. For more convenient handling, this extension turns the entries of a single top-level record into top-level columns.</p>
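<p>To check how a given file was mapped, the converted schema can be inspected with <code class="language-plaintext highlighter-rouge">DESCRIBE</code> — again a sketch against the hypothetical <code class="language-plaintext highlighter-rouge">users.avro</code> file. Per the rules above, the <code class="language-plaintext highlighter-rouge">["int", "null"]</code> union should surface as a plain (nullable) <code class="language-plaintext highlighter-rouge">INTEGER</code> column and the top-level record fields as individual columns:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- inspect the DuckDB schema derived from the Avro schema
DESCRIBE SELECT * FROM read_avro('users.avro');
</code></pre></div></div>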
<h3 id="implementation">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/09/duckdb-avro-extension.html#implementation">Implementation</a>
</h3>
<p>Internally, this extension uses the “official” <a href="https://avro.apache.org/docs/++version++/api/c/">Apache Avro C API</a>, albeit with some minor patching to allow reading Avro files from memory.</p>
<h3 id="limitations--next-steps">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/09/duckdb-avro-extension.html#limitations--next-steps">Limitations &amp; Next Steps</a>
</h3>
<p>In the following, we disclose the limitations of the <code class="language-plaintext highlighter-rouge">avro</code> DuckDB extension along with our plans to mitigate them in the future:</p>
<ul>
<li>
<p>The extension currently does not make use of <strong>parallelism</strong> when reading either a single (large) Avro file or when reading a list of files. Adding support for parallelism in the latter case is on the roadmap.</p>
</li>
<li>
<p>There is currently no support for projection or filter <strong>pushdown</strong>, but this is also planned at a later stage.</p>
</li>
<li>
<p>There is currently no support for the Wasm or the Windows-MinGW builds of DuckDB due to issues with the Avro library dependency (sigh again). We plan to fix this eventually.</p>
</li>
<li>
<p>As mentioned above, DuckDB cannot express recursive type definitions that Avro has. This is unlikely to ever change.</p>
</li>
<li>
<p>There is no support to allow users to provide a separate Avro schema file. This is unlikely to change: all Avro files we have seen so far had their schema embedded.</p>
</li>
<li>
<p>There is currently no support for the <code class="language-plaintext highlighter-rouge">union_by_name</code> flag that other readers in DuckDB support. This is planned for the future.</p>
</li>
</ul>
<h2 id="conclusion">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/09/duckdb-avro-extension.html#conclusion">Conclusion</a>
</h2>
<p>The new <code class="language-plaintext highlighter-rouge">avro</code> Community Extension for DuckDB enables DuckDB to read Avro files directly as if they were tables. If you have a bunch of Avro files, go ahead and try it out! We'd love to <a href="https://github.com/hannes/duckdb_avro/issues">hear from you</a> if you run into any issues.</p>
</div>
</div>
<div class="toc_sidebar">
<div class="toc_menu">
<h5>In this article</h5>
<ul id="toc" class="section-nav">
<li class="toc-entry toc-h2"><a href="https://duckdb.org/2024/12/09/duckdb-avro-extension.html#the-apache-avro-format">The Apache™ Avro™ Format</a>
<ul>
<li class="toc-entry toc-h3"><a href="https://duckdb.org/2024/12/09/duckdb-avro-extension.html#header-block">Header Block</a></li>
<li class="toc-entry toc-h3"><a href="https://duckdb.org/2024/12/09/duckdb-avro-extension.html#data-blocks">Data Blocks</a></li>
</ul>
</li>
<li class="toc-entry toc-h2"><a href="https://duckdb.org/2024/12/09/duckdb-avro-extension.html#the-duckdb-avro-community-extension">The DuckDB avro Community Extension</a>
<ul>
<li class="toc-entry toc-h3"><a href="https://duckdb.org/2024/12/09/duckdb-avro-extension.html#installation--loading">Installation &amp; Loading</a></li>
<li class="toc-entry toc-h3"><a href="https://duckdb.org/2024/12/09/duckdb-avro-extension.html#the-read_avro-function">The read_avro Function</a></li>
<li class="toc-entry toc-h3"><a href="https://duckdb.org/2024/12/09/duckdb-avro-extension.html#file-io">File IO</a></li>
<li class="toc-entry toc-h3"><a href="https://duckdb.org/2024/12/09/duckdb-avro-extension.html#schema-conversion">Schema Conversion</a></li>
<li class="toc-entry toc-h3"><a href="https://duckdb.org/2024/12/09/duckdb-avro-extension.html#implementation">Implementation</a></li>
<li class="toc-entry toc-h3"><a href="https://duckdb.org/2024/12/09/duckdb-avro-extension.html#limitations--next-steps">Limitations &amp; Next Steps</a></li>
</ul>
</li>
<li class="toc-entry toc-h2"><a href="https://duckdb.org/2024/12/09/duckdb-avro-extension.html#conclusion">Conclusion</a></li>
</ul>
</div>
</div>
</description>
<link>https://duckdb.org/2024/12/09/duckdb-avro-extension.html</link>
<guid isPermaLink="false">https://duckdb.org/2024/12/09/duckdb-avro-extension.html</guid>
<pubDate>Mon, 09 Dec 2024 00:00:00 GMT</pubDate>
<author>Hannes Mühleisen</author>
</item>
<item>
<title>DuckDB: Running TPC-H SF100 on Mobile Phones</title>
<description><div class="content">
<div class="contentwidth">
<h1>DuckDB: Running TPC-H SF100 on Mobile Phones</h1>
<div class="infoline">
<div class="icon">
</div>
<div>
<span class="author">Gabor Szarnyas, Laurens Kuiper, Hannes Mühleisen</span>
<div class="publishedinfo">
<span>Published on</span>
<span class="date">2024-12-06</span>
</div>
</div>
</div>
<div class="excerpt">
<p><em>TL;DR: DuckDB runs on mobile platforms such as iOS and Android, and completes the TPC-H benchmark faster than state-of-the-art research systems on big iron machines 20 years ago.</em></p>
</div>
<p>A few weeks ago, we set out to perform a series of experiments to answer two simple questions:</p>
<ol>
<li>Can DuckDB complete the TPC-H queries on the SF100 data set when running on a new smartphone?</li>
<li>If so, can DuckDB complete a run in less than 400 seconds, i.e., faster than the system in the research paper that originally introduced vectorized query processing?</li>
</ol>
<p>These questions took us on an interesting quest.
Along the way, we had a lot of fun and learned the difference between a cold run and a <em>really cold</em> run.
Read on to find out more.</p>
<h2 id="a-song-of-dry-ice-and-fire">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/06/duckdb-tpch-sf100-on-mobile.html#a-song-of-dry-ice-and-fire">A Song of Dry Ice and Fire</a>
</h2>
<p>Our first attempt was to use an iPhone, namely an <a href="https://www.gsmarena.com/apple_iphone_16_pro-13315.php">iPhone 16 Pro</a>.
This phone has 8 GB memory and a 6-core CPU with 2 performance cores (running at 4.05 GHz) and 4 efficiency cores (running at 2.42 GHz).</p>
<p>We implemented the application using the <a href="https://duckdb.org/docs/api/swift.html">DuckDB Swift client</a> and loaded the benchmark on the phone, all 30 GB of it.
We quickly found that the iPhone can indeed run the workload without any problems – except that it heated up during the workload. This prompted the phone to perform thermal throttling, slowing down the CPU to reduce heat production. Due to this, DuckDB took 615.1 seconds. Not bad but not enough to reach our goal.</p>
<p>The results got us thinking: what if we improve the cooling of the phone? To this end, we purchased a box of dry ice, which has a temperature below -50 degrees Celsius, and put the phone in the box for the duration of the experiments.</p>
<div align="center">
<img src="https://duckdb.org/images/blog/tpch-mobile/ice-cooled-iphone-1.jpg" alt="iPhone in a box of dry ice, running TPC-H" width="600px" referrerpolicy="no-referrer"></div>
<div align="center">iPhone in a box of dry ice, running TPC-H. Don't try this at home.</div>
<p>This helped a lot: DuckDB completed in 478.2 seconds. This is a more than 20% improvement – but we still didn't manage to be under 400 seconds.</p>
<div align="center">
<img src="https://duckdb.org/images/blog/tpch-mobile/ice-cooled-iphone-2.jpg" alt="The phone with icing on it, a few minutes after finishing the benchmark" width="300px" referrerpolicy="no-referrer"></div>
<div align="center">The phone a few minutes after finishing the benchmark. It no longer booted because the battery was too cold!</div>
<h2 id="do-androids-dream-of-electric-ducks">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/06/duckdb-tpch-sf100-on-mobile.html#do-androids-dream-of-electric-ducks">Do Androids Dream of Electric Ducks?</a>
</h2>
<p>In our next experiment, we picked up a <a href="https://www.gsmarena.com/samsung_galaxy_s24_ultra-12771.php">Samsung Galaxy S24 Ultra phone</a>, which runs Android 14. This phone is full of interesting hardware. First, it has an 8-core CPU with 4 different core types (1×3.39 GHz, 3×3.10 GHz, 2×2.90 GHz and 2×2.20 GHz). Second, it has a huge amount of RAM – 12 GB to be precise. Finally, its cooling system includes a <a href="https://www.sammobile.com/news/galaxy-s24-sustain-performance-bigger-vapor-chamber/">vapor chamber</a> for improved heat dissipation.</p>
<p>We ran DuckDB in the <a href="https://termux.dev/en/">Termux terminal emulator</a>. We compiled DuckDB <a href="https://duckdb.org/docs/api/cli/overview.html">CLI client</a> from source following the <a href="https://duckdb.org/docs/dev/building/build_instructions.html#android">Android build instructions</a> and ran the experiments from the command line.</p>
<div align="center">
<img src="https://duckdb.org/images/blog/tpch-mobile/duckdb-termux-android-emulator.png" alt="Screenshot of DuckDB in Termux, running in the Android emulator" width="600px" referrerpolicy="no-referrer"></div>
<div align="center">DuckDB in Termux, running in the Android emulator</div>
<p>In the end, it wasn't even close. The Android phone completed the benchmark in 235.0 seconds, outperforming our baseline by around 40%.</p>
<h2 id="never-was-a-cloudy-day">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/06/duckdb-tpch-sf100-on-mobile.html#never-was-a-cloudy-day">Never Was a Cloudy Day</a>
</h2>
<p>The results got us thinking: how do the results stack up among cloud servers? We picked two x86-based cloud instances in AWS EC2 with instance-attached NVMe storage.</p>
<p>The details of these benchmarks are far less interesting than those of the previous ones. We booted up the instances with Ubuntu 24.04 and ran DuckDB in the command line. We found that an <a href="https://instances.vantage.sh/aws/ec2/r6id.large"><code class="language-plaintext highlighter-rouge">r6id.large</code> instance</a> (2 vCPUs with 16 GB RAM) completes the queries in 570.8 seconds, which is roughly on-par with an air-cooled iPhone. However, an <a href="https://instances.vantage.sh/aws/ec2/r6id.xlarge"><code class="language-plaintext highlighter-rouge">r6id.xlarge</code></a> (4 vCPUs with 32 GB RAM) completes the benchmark in 166.2 seconds, faster than any result we achieved on phones.</p>
<h2 id="summary-of-duckdb-results">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/06/duckdb-tpch-sf100-on-mobile.html#summary-of-duckdb-results">Summary of DuckDB Results</a>
</h2>
<p>The table contains a summary of the DuckDB benchmark results.</p>
<table>
<thead>
<tr>
<th>Setup</th>
<th style="text-align: right">CPU cores</th>
<th style="text-align: right">Memory</th>
<th style="text-align: right">Runtime</th>
</tr>
</thead>
<tbody>
<tr>
<td>iPhone 16 Pro (air-cooled)</td>
<td style="text-align: right">6</td>
<td style="text-align: right">8 GB</td>
<td style="text-align: right">615.1 s</td>
</tr>
<tr>
<td>iPhone 16 Pro (dry ice-cooled)</td>
<td style="text-align: right">6</td>
<td style="text-align: right">8 GB</td>
<td style="text-align: right">478.2 s</td>
</tr>
<tr>
<td>Samsung Galaxy S24 Ultra</td>
<td style="text-align: right">8</td>
<td style="text-align: right">12 GB</td>
<td style="text-align: right">235.0 s</td>
</tr>
<tr>
<td>AWS EC2 <code class="language-plaintext highlighter-rouge">r6id.large</code></td>
<td style="text-align: right">2</td>
<td style="text-align: right">16 GB</td>
<td style="text-align: right">570.8 s</td>
</tr>
<tr>
<td>AWS EC2 <code class="language-plaintext highlighter-rouge">r6id.xlarge</code></td>
<td style="text-align: right">4</td>
<td style="text-align: right">32 GB</td>
<td style="text-align: right">166.2 s</td>
</tr>
</tbody>
</table>
<h2 id="historical-context">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/06/duckdb-tpch-sf100-on-mobile.html#historical-context">Historical Context</a>
</h2>
<p>So why did we set out to run these experiments in the first place?</p>
<p>Just a few weeks ago, <a href="https://cwi.nl/">CWI</a>, the birthplace of DuckDB, held a ceremony for the <a href="https://www.cwi.nl/en/events/dijkstra-awards/cwi-lectures-dijkstra-fellowship/">Dijkstra Fellowship</a>.
The fellowship was awarded to Marcin Żukowski for his pioneering role in the development of database management systems and his successful entrepreneurial career that resulted in systems such as <a href="https://en.wikipedia.org/wiki/Actian_Vector">VectorWise</a> and <a href="https://en.wikipedia.org/wiki/Snowflake_Inc.">Snowflake</a>.</p>
<p>A lot of ideas that originate in Marcin's research are used in DuckDB. Most importantly, <em>vectorized query processing</em> allows DuckDB to be both fast and portable at the same time.
With his co-authors Peter Boncz and Niels Nes, he first described this paradigm in the CIDR 2005 paper <a href="https://www.cidrdb.org/cidr2005/papers/P19.pdf">“MonetDB/X100: Hyper-Pipelining Query Execution”</a>.</p>
<blockquote>
<p>The terms <em>vectorization,</em> <em>hyper-pipelining,</em> and <em>superscalar</em> refer to the same idea: processing data in slices, which turns out to be a good compromise between row-at-a-time and column-at-a-time. DuckDB's query engine uses the same principle.</p>
</blockquote>
<p>This paper was published in January 2005, so it's safe to assume that it was finalized in late 2004 – almost exactly 20 years ago!</p>
<p>If we read the paper, we learn that the experiments were carried out on an HP workstation equipped with 12 GB of memory (the same amount as the Samsung phone has today!).
It also had an Itanium CPU and looked like this:</p>
<div align="center">
<img src="https://duckdb.org/images/blog/tpch-mobile/hp-itanium-workstation.jpg" alt="The Itanium2 workstation used in the original experiments" width="600px" referrerpolicy="no-referrer"></div>
<div align="center">The Itanium2 workstation used in the original experiments (source: <a href="https://commons.wikimedia.org/wiki/File:HP-HP9000-ZX6000-Itanium2-Workstation_11.jpg">Wikimedia</a>)</div>
<blockquote>
<p>Upon its release in 2001, the <a href="https://en.wikipedia.org/wiki/Itanium">Itanium</a> was aimed at the high-end market with the goal of eventually replacing the then-dominant x86 architecture with a new instruction set that focused heavily on <a href="https://en.wikipedia.org/wiki/Single_instruction,_multiple_data">SIMD (single instruction, multiple data)</a>. While this ambition did not work out, the Itanium was the state-of-the-art architecture of its day. Due to the focus on the server market, the Itanium CPUs had a large amount of cache: the <a href="https://www.intel.com/content/www/us/en/products/sku/27982/intel-itanium-processor-1-30-ghz-3m-cache-400-mhz-fsb/specifications.html">1.3 GHz Itanium2 model used in the experiments</a> had 3 MB of L2 cache, while Pentium 4 CPUs released around that time only had 0.5–1 MB.</p>
</blockquote>
<p>The paper provides a detailed breakdown of the runtimes:</p>
<div align="center">
<img src="https://duckdb.org/images/blog/tpch-mobile/cidr2005-monetdb-x100-results.png" alt="Benchmark results from the CIDR 2005 paper “MonetDB/X100: Hyper-Pipelining Query Execution”" width="450px" referrerpolicy="no-referrer"></div>
<div align="center">Benchmark results from the paper “MonetDB/X100: Hyper-Pipelining Query Execution”</div>
<p>The total runtime of the TPC-H SF100 queries was 407.9 seconds – hence our baseline for the experiments.
Here is a video of Hannes presenting the results at the event:</p>
<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/H1N2Jr34jwU?si=7wYychjmxpRWPqcm&amp;start=1617" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="no-referrer" allowfullscreen=""></iframe>
</div>
<p>And here are all results visualized on a plot:</p>
<div align="center">
<img src="https://duckdb.org/images/blog/tpch-mobile/tpch-mobile-experiment-runtimes.svg" alt="Plot with the TPC-H SF100 experiment results for MonetDB/X100 and DuckDB" width="750px" referrerpolicy="no-referrer"></div>
<div align="center">TPC-H SF100 total query runtimes for MonetDB/X100 and DuckDB</div>
<h2 id="conclusion">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/06/duckdb-tpch-sf100-on-mobile.html#conclusion">Conclusion</a>
</h2>
<p>It was a long journey from the original vectorized execution paper to running an analytical database on a phone.
Many key innovations happened that allowed these results, and the big improvement in hardware is just one of them.
Another crucial component is that compiler optimizations became a lot more sophisticated.
Thanks to this, while the MonetDB/X100 system needed to use explicit SIMD, DuckDB can rely on the <a href="https://en.wikipedia.org/wiki/Automatic_vectorization">auto-vectorization</a> of our (carefully constructed) loops.</p>
<p>All that's left is to answer the questions that we posed at the beginning of our journey.
Yes, DuckDB can run TPC-H SF100 on a mobile phone.
And yes, in some cases it can even outperform a research prototype running on a high-end machine of 2004 – on a modern smartphone that fits in your pocket.</p>
<p>And with newer hardware, smarter compilers and yet-to-be-discovered database optimizations, future versions are only going to be faster.</p>
</div>
</div>
<div class="toc_sidebar">
<div class="toc_menu">
<h5>In this article</h5>
<ul id="toc" class="section-nav">
<li class="toc-entry toc-h2"><a href="https://duckdb.org/2024/12/06/duckdb-tpch-sf100-on-mobile.html#a-song-of-dry-ice-and-fire">A Song of Dry Ice and Fire</a></li>
<li class="toc-entry toc-h2"><a href="https://duckdb.org/2024/12/06/duckdb-tpch-sf100-on-mobile.html#do-androids-dream-of-electric-ducks">Do Androids Dream of Electric Ducks?</a></li>
<li class="toc-entry toc-h2"><a href="https://duckdb.org/2024/12/06/duckdb-tpch-sf100-on-mobile.html#never-was-a-cloudy-day">Never Was a Cloudy Day</a></li>
<li class="toc-entry toc-h2"><a href="https://duckdb.org/2024/12/06/duckdb-tpch-sf100-on-mobile.html#summary-of-duckdb-results">Summary of DuckDB Results</a></li>
<li class="toc-entry toc-h2"><a href="https://duckdb.org/2024/12/06/duckdb-tpch-sf100-on-mobile.html#historical-context">Historical Context</a></li>
<li class="toc-entry toc-h2"><a href="https://duckdb.org/2024/12/06/duckdb-tpch-sf100-on-mobile.html#conclusion">Conclusion</a></li>
</ul>
</div>
</div>
</description>
<link>https://duckdb.org/2024/12/06/duckdb-tpch-sf100-on-mobile.html</link>
<guid isPermaLink="false">https://duckdb.org/2024/12/06/duckdb-tpch-sf100-on-mobile.html</guid>
<pubDate>Fri, 06 Dec 2024 00:00:00 GMT</pubDate>
<author>Gabor Szarnyas, Laurens Kuiper, Hannes Mühleisen</author>
</item>
<item>
<title>CSV Files: Dethroning Parquet as the Ultimate Storage File Format — or Not?</title>
<description><div class="content">
<div class="contentwidth">
<h1>CSV Files: Dethroning Parquet as the Ultimate Storage File Format — or Not?</h1>
<div class="infoline">
<div class="icon">
<img src="https://duckdb.org/images/blog/authors/pedro_holanda.jpg" alt="Author Avatar" referrerpolicy="no-referrer">
</div>
<div>
<span class="author">Pedro Holanda</span>
<div class="publishedinfo">
<span>Published on</span>
<span class="date">2024-12-05</span>
</div>
</div>
</div>
<div class="excerpt">
<p><em>TL;DR: Data analytics primarily uses two types of storage format files: human-readable text files like CSV and performance-driven binary files like Parquet. This blog post compares these two formats in an ultimate showdown of performance and flexibility, where there can be only one winner.</em></p>
</div>
<h2 id="file-formats">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/05/csv-files-dethroning-parquet-or-not.html#file-formats">File Formats</a>
</h2>
<h3 id="csv-files">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/05/csv-files-dethroning-parquet-or-not.html#csv-files">CSV Files</a>
</h3>
<p>Data is most <a href="https://www.vldb.org/pvldb/vol17/p3694-saxena.pdf">commonly stored</a> in human-readable file formats, like JSON or CSV files. These file formats are easy to operate on, since anyone with a text editor can simply open, alter, and understand them.</p>
<p>For many years, CSV files have had a bad reputation for being slow and cumbersome to work with. In practice, if you want to operate on a CSV file using your favorite database system, you must follow this recipe (sketched in SQL right after the list):</p>
<ol>
<li>Manually discover its schema by opening the file in a text editor.</li>
<li>Create a table with the given schema.</li>
<li>Manually figure out the dialect of the file (e.g., which character is used for a quote?)</li>
<li>Load the file into the table using a <code class="language-plaintext highlighter-rouge">COPY</code> statement and with the dialect set.</li>
<li>Start querying it.</li>
</ol>
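<p>Here is a minimal sketch of that traditional recipe in DuckDB SQL; the file path, column names, types, and dialect options are assumptions for illustration only:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- steps 1-3: schema and dialect worked out by hand beforehand
CREATE TABLE people (name VARCHAR, age INTEGER);
-- step 4: load the file with the dialect set explicitly
COPY people FROM 'path/to/file.csv' (DELIMITER ';', HEADER false);
-- step 5: start querying
SELECT * FROM people;
</code></pre></div></div>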
<p>Not only is this process tedious, but parallelizing a CSV file reader is <a href="https://www.microsoft.com/en-us/research/uploads/prod/2019/04/chunker-sigmod19.pdf">far from trivial</a>. This means most systems either process it single-threaded or use a two-pass approach.</p>
<p>Additionally, <a href="https://youtu.be/YrqSp8m7fmk?si=v5rmFWGJtpiU5_PX&amp;t=624">CSV files are wild</a>: although <a href="https://www.ietf.org/rfc/rfc4180.txt">RFC-4180</a> exists as a CSV standard, it is <a href="https://aic.ai.wu.ac.at/~polleres/publications/mitl-etal-2016OBD.pdf">commonly ignored</a>. Systems must therefore be sufficiently robust to handle these files as if they come straight from the wild west.</p>
<p>Last but not least, CSV files are wasteful: data is always laid out as strings. For example, numeric values like <code class="language-plaintext highlighter-rouge">1000000000</code> take 10 bytes instead of 4 bytes if stored as an <code class="language-plaintext highlighter-rouge">int32</code>. Additionally, since the data layout is row-wise, opportunities to apply <a href="https://duckdb.org/2022/10/28/lightweight-compression.html">lightweight columnar compression</a> are lost.</p>
<h3 id="parquet-files">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/05/csv-files-dethroning-parquet-or-not.html#parquet-files">Parquet Files</a>
</h3>
<p>Due to these shortcomings, performance-driven file formats like Parquet have gained significant popularity in recent years. Parquet files cannot be opened by general text editors, cannot be easily edited, and have a rigid schema. However, they store data in columns, apply various compression techniques, partition the data into row groups, maintain statistics about these row groups, and define their schema directly in the file.</p>
<p>These features make Parquet a monolith of a file format — highly inflexible but efficient and fast. It is easy to read data from a Parquet file since the schema is well-defined. Parallelizing a scanner is straightforward, as each thread can independently process a row group. Filter pushdown is also simple to implement, as each row group contains statistical metadata, and the file sizes are very small.</p>
<p>The conclusion should be simple: if you have small files and need flexibility, CSV files are fine. However, for data analysis, one should pivot to Parquet files, right? Well, this pivot may not be a hard requirement anymore – read on to find out why!</p>
<h2 id="reading-csv-files-in-duckdb">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/05/csv-files-dethroning-parquet-or-not.html#reading-csv-files-in-duckdb">Reading CSV Files in DuckDB</a>
</h2>
<p>For the past few releases, DuckDB has doubled down on delivering not only an easy-to-use CSV scanner but also an extremely performant one. This scanner features its own custom <a href="https://duckdb.org/2023/10/27/csv-sniffer.html">CSV sniffer</a>, parallelization algorithm, buffer manager, casting mechanisms, and state machine-based parser.</p>
<p>For usability, the previous paradigm of manual schema discovery and table creation has been changed. Instead, DuckDB now utilizes a CSV Sniffer, similar to those found in dataframe libraries like Pandas.
This allows for querying CSV files as easily as:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">FROM</span> <span class="s1">'path/to/file.csv'</span><span class="p">;</span>
</code></pre></div></div>
<p>Tables can also be created from CSV files, without any prior schema definition, with:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">t</span> <span class="k">AS</span> <span class="k">FROM</span> <span class="s1">'path/to/file.csv'</span><span class="p">;</span>
</code></pre></div></div>
<p>Furthermore, the reader became one of the fastest CSV readers in analytical systems, as can be seen by the load times of the <a href="https://github.com/ClickHouse/ClickBench/commit/0aba4247ce227b3058d22846ca39826d27262fe0">latest iteration</a> of <a href="https://benchmark.clickhouse.com/#eyJzeXN0ZW0iOnsiQWxsb3lEQiI6ZmFsc2UsIkFsbG95REIgKHR1bmVkKSI6ZmFsc2UsIkF0aGVuYSAocGFydGl0aW9uZWQpIjpmYWxzZSwiQXRoZW5hIChzaW5nbGUpIjpmYWxzZSwiQXVyb3JhIGZvciBNeVNRTCI6ZmFsc2UsIkF1cm9yYSBmb3IgUG9zdGdyZVNRTCI6ZmFsc2UsIkJ5Q29uaXR5IjpmYWxzZSwiQnl0ZUhvdXNlIjpmYWxzZSwiY2hEQiAoRGF0YUZyYW1lKSI6ZmFsc2UsImNoREIgKFBhcnF1ZXQsIHBhcnRpdGlvbmVkKSI6ZmFsc2UsImNoREIiOmZhbHNlLCJDaXR1cyI6ZmFsc2UsIkNsaWNrSG91c2UgQ2xvdWQgKGF3cykiOmZhbHNlLCJDbGlja0hvdXNlIENsb3VkIChhenVyZSkiOmZhbHNlLCJDbGlja0hvdXNlIENsb3VkIChnY3ApIjpmYWxzZSwiQ2xpY2tIb3VzZSAoZGF0YSBsYWtlLCBwYXJ0aXRpb25lZCkiOmZhbHNlLCJDbGlja0hvdXNlIChkYXRhIGxha2UsIHNpbmdsZSkiOmZhbHNlLCJDbGlja0hvdXNlIChQYXJxdWV0LCBwYXJ0aXRpb25lZCkiOmZhbHNlLCJDbGlja0hvdXNlIChQYXJxdWV0LCBzaW5nbGUpIjpmYWxzZSwiQ2xpY2tIb3VzZSAod2ViKSI6ZmFsc2UsIkNsaWNrSG91c2UiOnRydWUsIkNsaWNrSG91c2UgKHR1bmVkKSI6dHJ1ZSwiQ2xpY2tIb3VzZSAodHVuZWQsIG1lbW9yeSkiOnRydWUsIkNsb3VkYmVycnkiOmZhbHNlLCJDcmF0ZURCIjpmYWxzZSwiQ3J1bmNoeSBCcmlkZ2UgZm9yIEFuYWx5dGljcyAoUGFycXVldCkiOmZhbHNlLCJEYXRhYmVuZCI6dHJ1ZSwiRGF0YUZ1c2lvbiAoUGFycXVldCwgcGFydGl0aW9uZWQpIjpmYWxzZSwiRGF0YUZ1c2lvbiAoUGFycXVldCwgc2luZ2xlKSI6ZmFsc2UsIkFwYWNoZSBEb3JpcyI6ZmFsc2UsIkRyaWxsIjpmYWxzZSwiRHJ1aWQiOmZhbHNlLCJEdWNrREIgKERhdGFGcmFtZSkiOmZhbHNlLCJEdWNrREIgKG1lbW9yeSkiOnRydWUsIkR1Y2tEQiAoUGFycXVldCwgcGFydGl0aW9uZWQpIjpmYWxzZSwiRHVja0RCIjpmYWxzZSwiRWxhc3RpY3NlYXJjaCI6ZmFsc2UsIkVsYXN0aWNzZWFyY2ggKHR1bmVkKSI6ZmFsc2UsIkdsYXJlREIiOmZhbHNlLCJHcmVlbnBsdW0iOmZhbHNlLCJIZWF2eUFJIjpmYWxzZSwiSHlkcmEiOmZhbHNlLCJJbmZvYnJpZ2h0IjpmYWxzZSwiS2luZXRpY2EiOmZhbHNlLCJNYXJpYURCIENvbHVtblN0b3JlIjpmYWxzZSwiTWFyaWFEQiI6ZmFsc2UsIk1vbmV0REIiOmZhbHNlLCJNb25nb0RCIjpmYWxzZSwiTW90aGVyRHVjayI6ZmFsc2UsIk15U1FMIChNeUlTQU0pIjpmYWxzZSwiTXlTUUwiOmZhbHNlLCJPY3RvU1FMIjpmYWxzZSwiT3hsYSI6ZmFsc2UsIlBhbmRhcyAoRGF0YUZyYW1lKSI6ZmFsc2UsIlBhcmFkZURCIChQYXJxdWV0LCBwYXJ0aXRpb25lZCkiOmZhbHNlLCJQYXJhZGVEQiAoUGFycXVldCwgc2luZ2xlKSI6ZmFsc2UsInBnX2R1Y2tkYiAoTW90aGVyRHVjayBlbmFibGVkKSI6ZmFsc2UsInBnX2R1Y2tkYiI6ZmFsc2UsIlBpbm90IjpmYWxzZSwiUG9sYXJzIChEYXRhRnJhbWUpIjpmYWxzZSwiUG9sYXJzIChQYXJxdWV0KSI6ZmFsc2UsIlBvc3RncmVTUUwgKHR1bmVkKSI6ZmFsc2UsIlBvc3RncmVTUUwiOmZhbHNlLCJRdWVzdERCIjp0cnVlLCJSZWRzaGlmdCI6ZmFsc2UsIlNlbGVjdERCIjpmYWxzZSwiU2luZ2xlU3RvcmUiOmZhbHNlLCJTbm93Zmxha2UiOmZhbHNlLCJTcGFyayI6ZmFsc2UsIlNRTGl0ZSI6ZmFsc2UsIlN0YXJSb2NrcyI6ZmFsc2UsIlRhYmxlc3BhY2UiOmZhbHNlLCJUZW1ibyBPTEFQIChjb2x1bW5hcikiOmZhbHNlLCJUaW1lc2NhbGUgQ2xvdWQiOmZhbHNlLCJUaW1lc2NhbGVEQiAobm8gY29sdW1uc3RvcmUpIjpmYWxzZSwiVGltZXNjYWxlREIiOmZhbHNlLCJUaW55YmlyZCAoRnJlZSBUcmlhbCkiOmZhbHNlLCJVbWJyYSI6dHJ1ZX0sInR5cGUiOnsiQyI6dHJ1ZSwiY29sdW1uLW9yaWVudGVkIjp0cnVlLCJQb3N0Z3JlU1FMIGNvbXBhdGlibGUiOnRydWUsIm1hbmFnZWQiOnRydWUsImdjcCI6dHJ1ZSwic3RhdGVsZXNzIjp0cnVlLCJKYXZhIjp0cnVlLCJDKysiOnRydWUsIk15U1FMIGNvbXBhdGlibGUiOnRydWUsInJvdy1vcmllbnRlZCI6dHJ1ZSwiQ2xpY2tIb3VzZSBkZXJpdmF0aXZlIjp0cnVlLCJlbWJlZGRlZCI6dHJ1ZSwic2VydmVybGVzcyI6dHJ1ZSwiZGF0YWZyYW1lIjp0cnVlLCJhd3MiOnRydWUsImF6dXJlIjp0cnVlLCJhbmFseXRpY2FsIjp0cnVlLCJSdXN0Ijp0cnVlLCJzZWFyY2giOnRydWUsImRvY3VtZW50Ijp0cnVlLCJHbyI6dHJ1ZSwic29tZXdoYXQgUG9zdGdyZVNRTCBjb21wYXRpYmxlIjp0cnVlLCJEYXRhRnJhbWUiOnRydWUsInBhcnF1ZXQiOnRydWUsInRpbWUtc2VyaWVzIjp0cnVlfSwibWFjaGluZSI6eyIxNiB2Q1BVIDEyOEdCIjpmYWxzZSwiOCB2Q1BVIDY0R0IiOmZhbHNlLCJzZXJ2ZXJsZXNzIjpmYWxzZSwiMTZhY3UiOmZhbHNlLCJjNmEuNHhsYXJnZSwgNTAwZ2IgZ3AyIjpmYWxzZSwiTCI6ZmFsc2UsIk0iOmZh
bHNlLCJTIjpmYWxzZSwiWFMiOmZhbHNlLCJjNmEubWV0YWwsIDUwMGdiIGdwMiI6dHJ1ZSwiMTkyR0IiOmZhbHNlLCIyNEdCIjpmYWxzZSwiMzYwR0IiOmZhbHNlLCI0OEdCIjpmYWxzZSwiNzIwR0IiOmZhbHNlLCI5NkdCIjpmYWxzZSwiZGV2IjpmYWxzZSwiNzA4R0IiOmZhbHNlLCJjNW4uNHhsYXJnZSwgNTAwZ2IgZ3AyIjpmYWxzZSwiQW5hbHl0aWNzLTI1NkdCICg2NCB2Q29yZXMsIDI1NiBHQikiOmZhbHNlLCJjNS40eGxhcmdlLCA1MDBnYiBncDIiOmZhbHNlLCJjNmEuNHhsYXJnZSwgMTUwMGdiIGdwMiI6ZmFsc2UsImNsb3VkIjpmYWxzZSwiZGMyLjh4bGFyZ2UiOmZhbHNlLCJyYTMuMTZ4bGFyZ2UiOmZhbHNlLCJyYTMuNHhsYXJnZSI6ZmFsc2UsInJhMy54bHBsdXMiOmZhbHNlLCJTMiI6ZmFsc2UsIlMyNCI6ZmFsc2UsIjJYTCI6ZmFsc2UsIjNYTCI6ZmFsc2UsIjRYTCI6ZmFsc2UsIlhMIjpmYWxzZSwiTDEgLSAxNkNQVSAzMkdCIjpmYWxzZSwiYzZhLjR4bGFyZ2UsIDUwMGdiIGdwMyI6ZmFsc2UsIjE2IHZDUFUgNjRHQiI6ZmFsc2UsIjQgdkNQVSAxNkdCIjpmYWxzZSwiOCB2Q1BVIDMyR0IiOmZhbHNlfSwiY2x1c3Rlcl9zaXplIjp7IjEiOnRydWUsIjIiOmZhbHNlLCI0IjpmYWxzZSwiOCI6ZmFsc2UsIjE2IjpmYWxzZSwiMzIiOmZhbHNlLCI2NCI6ZmFsc2UsIjEyOCI6ZmFsc2UsInNlcnZlcmxlc3MiOmZhbHNlLCJ1bmRlZmluZWQiOmZhbHNlfSwibWV0cmljIjoibG9hZCIsInF1ZXJpZXMiOlt0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlXX0=">ClickBench</a>. In this benchmark, the data is loaded from an <a href="https://datasets.clickhouse.com/hits_compatible/hits.csv.gz">82 GB uncompressed CSV file</a> into a database table.</p>
<div align="center">
<img src="https://duckdb.org/images/blog/csv-vs-parquet-clickbench.png" alt="Image showing the ClickBench result 2024-12-05" width="800px" referrerpolicy="no-referrer"></div>
<div align="center">ClickBench CSV loading times (2024-12-05)</div>
<h2 id="comparing-csv-and-parquet">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/05/csv-files-dethroning-parquet-or-not.html#comparing-csv-and-parquet">Comparing CSV and Parquet</a>
</h2>
<p>With the large boost in usability and performance for the CSV reader, one might ask: what is the actual difference in performance when loading a CSV file compared to a Parquet file into a table? Additionally, how do these formats differ when running queries directly on them?</p>
<p>To find out, we will run a few examples using both CSV and Parquet files containing TPC-H data to shed light on their differences. All scripts used to generate the benchmarks of this blogpost can be found in a <a href="https://github.com/pdet/csv_vs_parquet">repository</a>.</p>
<h3 id="usability">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/05/csv-files-dethroning-parquet-or-not.html#usability">Usability</a>
</h3>
<p>In terms of usability, scanning CSV files and Parquet files can differ significantly.</p>
<p>In simple cases, where all options are correctly detected by DuckDB, running queries on either CSV or Parquet files can be done directly.</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">FROM</span> <span class="s1">'path/to/file.csv'</span><span class="p">;</span>
<span class="k">FROM</span> <span class="s1">'path/to/file.parquet'</span><span class="p">;</span>
</code></pre></div></div>
<p>Things can differ drastically for wild, rule-breaking <a href="https://reddead.fandom.com/wiki/Arthur_Morgan">Arthur Morgan</a>-like CSV files. This is evident from the number of parameters that can be set for each scanner. The <a href="https://duckdb.org/docs/data/parquet/overview.html">Parquet</a> scanner has a total of six parameters that can alter how the file is read. For the majority of cases, the user will never need to manually adjust any of them.</p>
<p>The CSV reader, on the other hand, depends on the sniffer being able to automatically detect many different configuration options. For example: What is the delimiter? How many rows should it skip from the top of the file? Are there any comments? And so on. This results in over <a href="https://duckdb.org/docs/data/csv/overview.html">30 configuration options</a> that the user might have to manually adjust to properly parse their CSV file. Again, this number of options is necessary due to the lack of a widely adopted standard. However, in most scenarios, users can rely on the sniffer or, at most, change one or two options.</p>
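<p>When the sniffer cannot work everything out on its own, those options can be set explicitly on <code class="language-plaintext highlighter-rouge">read_csv</code>. A small sketch for a hypothetical semicolon-delimited, header-less file — the option values here are assumptions for illustration:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- override the sniffer for a hypothetical, non-standard CSV file
FROM read_csv('path/to/wild_file.csv',
    delim = ';',
    header = false,
    skip = 2,
    quote = '"');
</code></pre></div></div>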
<p>The CSV reader also has an extensive error-handling system and will always provide suggestions for options to review if something goes wrong.</p>
<p>To give you an example of how the DuckDB error-reporting system works, consider the following CSV file:</p>
<pre><code class="language-csv">Clint Eastwood;94
Samuel L. Jackson
</code></pre>
<p>In this file, the second line is missing the value for the second column.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="high |
github-actions bot added the Auto: Route Test Complete (Auto route test has finished on given PR) label on Dec 10, 2024
Successfully generated as following: http://localhost:1200/duckdb/news - Success ✔️
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
<channel>
<title>DuckDB News</title>
<link>https://duckdb.org/news/</link>
<atom:link href="http://localhost:1200/duckdb/news" rel="self" type="application/rss+xml"></atom:link>
<description>DuckDB News - Powered by RSSHub</description>
<generator>RSSHub</generator>
<webMaster>[email protected] (RSSHub)</webMaster>
<language>en</language>
<lastBuildDate>Tue, 10 Dec 2024 17:30:49 GMT</lastBuildDate>
<ttl>5</ttl>
<item>
<title>The DuckDB Avro Extension</title>
<description><div class="excerpt">
<p><em>TL;DR: DuckDB now supports reading Avro files through the <code class="language-plaintext highlighter-rouge">avro</code> Community Extension.</em></p>
</div>
<h2 id="the-apache-avro-format">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/09/duckdb-avro-extension.html#the-apache-avro-format">The Apache™ Avro™ Format</a>
</h2>
<p><a href="https://avro.apache.org/">Avro</a> is a binary format for record data. Like many innovations in the data space, Avro was <a href="https://vimeo.com/7362534">developed</a> by <a href="https://en.wikipedia.org/wiki/Doug_Cutting">Doug Cutting</a> as part of the Apache Hadoop project <a href="https://github.com/apache/hadoop/commit/8296413d4988c08343014c6808a30e9d5e441bfc">in around 2009</a>. Avro gets its name – somewhat obscurely – from a defunct <a href="https://en.wikipedia.org/wiki/Avro">British aircraft manufacturer</a>. The company famously built over 7,000 <a href="https://en.wikipedia.org/wiki/Avro_Lancaster">Avro Lancaster heavy bombers</a> under the challenging conditions of World War 2. But we digress.</p>
<p>The Avro format is yet another attempt to solve the dimensionality reduction problem that occurs when transforming a complex <em>multi-dimensional data structure</em> like tables (possibly with nested types) to a <em>single-dimensional storage layout</em> like a flat file, which is just a sequence of bytes. The most fundamental question that arises here is whether to use a columnar or a row-major layout. Avro uses a row-major layout, which differentiates it from its famous cousin, the <a href="https://parquet.apache.org/">Apache™ Parquet™</a> format. There are valid use cases for a row-major format: for example, appending a few rows to a Parquet file is difficult and inefficient because of Parquet's columnar layout and due to the fact the Parquet metadata is stored <em>at the back</em> of the file. In a row-major format like Avro with the metadata <em>up top</em>, we can “just” add those rows to the end of the files and we're done. This enables Avro to handle appends of a few rows somewhat efficiently.</p>
<p>Avro-encoded data can appear in several ways, e.g., in <a href="https://en.wikipedia.org/wiki/Remote_procedure_call">RPC messages</a> but also in files. In the following, we focus on files since those survive long-term.</p>
<h3 id="header-block">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/09/duckdb-avro-extension.html#header-block">Header Block</a>
</h3>
<p>Avro “object container” files are encoded using a comparatively simple binary <a href="https://avro.apache.org/docs/++version++/specification/#object-container-files">format</a>: each file starts with a <strong>header block</strong> that first has the <a href="https://en.wikipedia.org/wiki/List_of_file_signatures">magic bytes</a> <code class="language-plaintext highlighter-rouge">Obj1</code>. Then, a metadata “map” (a list of string-bytearray key-value pairs) follows. The map is only strictly required to contain a single entry for the <code class="language-plaintext highlighter-rouge">avro.schema</code> key. This key contains the Avro file schema encoded as JSON. Here is an example for such a schema:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"namespace"</span><span class="p">:</span><span class="w"> </span><span class="s2">"example.avro"</span><span class="p">,</span><span class="w">
</span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"record"</span><span class="p">,</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"User"</span><span class="p">,</span><span class="w">
</span><span class="nl">"fields"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"name"</span><span class="p">,</span><span class="w"> </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"string"</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"favorite_number"</span><span class="p">,</span><span class="w"> </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"int"</span><span class="p">,</span><span class="w"> </span><span class="s2">"null"</span><span class="p">]},</span><span class="w">
</span><span class="p">{</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"favorite_color"</span><span class="p">,</span><span class="w"> </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"string"</span><span class="p">,</span><span class="w"> </span><span class="s2">"null"</span><span class="p">]}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>The Avro schema defines a record structure. Records can contain scalar data fields (like <code class="language-plaintext highlighter-rouge">int</code>, <code class="language-plaintext highlighter-rouge">double</code>, <code class="language-plaintext highlighter-rouge">string</code>, etc.) but also more complex types like records (similar to <a href="https://duckdb.org/docs/sql/data_types/struct.html">DuckDB <code class="language-plaintext highlighter-rouge">STRUCT</code>s</a>), unions and lists. As a sidenote, it is quite strange that a data format for the definition of record structures would fall back to another format like JSON to describe itself, but such are the oddities of Avro.</p>
<h3 id="data-blocks">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/09/duckdb-avro-extension.html#data-blocks">Data Blocks</a>
</h3>
<p>The header concludes with 16 randomly chosen bytes as a “sync marker”. The header is followed by an arbitrary number of <strong>data blocks</strong>: each data block starts with a record count, followed by a size and a byte array containing the actual records. Optionally, the bytes can be compressed with deflate (gzip); whether compression is used is recorded in the header metadata.</p>
<p>The data bytes can only be decoded using the schema. The <a href="https://avro.apache.org/docs/++version++/specification/#object-container-files">object file specification</a> contains the details on how each type is encoded. For example, in the example schema we know each value is a record of three fields. The root-level record will encode its entries in the order they are declared. There are no actual bytes required for this. First, we read the <code class="language-plaintext highlighter-rouge">name</code> field. Strings consist of a length followed by the string bytes. Like other formats (e.g., Thrift), Avro uses <a href="https://en.wikipedia.org/wiki/Variable-length_quantity#Zigzag_encoding">variable-length integers with zigzag encoding</a> to store lengths and counts and the like. After reading the string, we can proceed to <code class="language-plaintext highlighter-rouge">favorite_number</code>. This field is a union type (encoded with the <code class="language-plaintext highlighter-rouge">[]</code> syntax). This union can have values of two types, <code class="language-plaintext highlighter-rouge">int</code> and <code class="language-plaintext highlighter-rouge">null</code>. The <code class="language-plaintext highlighter-rouge">null</code> type is a bit odd: it can only be used to encode the fact that a value is missing. To decode the <code class="language-plaintext highlighter-rouge">favorite_number</code> fields, we first read an <code class="language-plaintext highlighter-rouge">int</code> that encodes which choice of the union was used. Afterward, we use the “normal” decoders to read the values (e.g., <code class="language-plaintext highlighter-rouge">int</code> or <code class="language-plaintext highlighter-rouge">null</code>). The same can be done for <code class="language-plaintext highlighter-rouge">favorite_color</code>. Each data block again ends with the sync marker. The sync marker can be used to verify that the block was fully written and that there is no garbage in the file.</p>
<h2 id="the-duckdb-avro-community-extension">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/09/duckdb-avro-extension.html#the-duckdb-avro-community-extension">The DuckDB <code class="language-plaintext highlighter-rouge">avro</code> Community Extension</a>
</h2>
<p>We have developed a DuckDB Community Extension that enables DuckDB to <em>read</em> <a href="https://avro.apache.org/">Apache Avro™</a> files.</p>
<p>The extension does not contain Avro <em>write</em> functionality. This is on purpose: by not providing a writer, we hope to decrease the number of Avro files in the world over time.</p>
<h3 id="installation--loading">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/09/duckdb-avro-extension.html#installation--loading">Installation &amp; Loading</a>
</h3>
<p>Installation is simple through the DuckDB Community Extension repository: just type</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">INSTALL</span> <span class="n">avro</span> <span class="k">FROM</span> <span class="n">community</span><span class="p">;</span>
<span class="k">LOAD</span> <span class="n">avro</span><span class="p">;</span>
</code></pre></div></div>
<p>in a DuckDB instance near you. There is currently no build for Wasm because of dependencies (sigh).</p>
<h3 id="the-read_avro-function">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/09/duckdb-avro-extension.html#the-read_avro-function">The <code class="language-plaintext highlighter-rouge">read_avro</code> Function</a>
</h3>
<p>The extension adds a single DuckDB function, <code class="language-plaintext highlighter-rouge">read_avro</code>. This function can be used like so:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">FROM</span> <span class="nf">read_avro</span><span class="p">(</span><span class="s1">'some_example_file.avro'</span><span class="p">);</span>
</code></pre></div></div>
<p>This function will expose the contents of the Avro file as a DuckDB table. You can then use any arbitrary SQL constructs to further transform this table.</p>
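<p>For instance, assuming the file follows the <code class="language-plaintext highlighter-rouge">User</code> schema from the example above (so the flattened columns are <code class="language-plaintext highlighter-rouge">name</code>, <code class="language-plaintext highlighter-rouge">favorite_number</code>, and <code class="language-plaintext highlighter-rouge">favorite_color</code>), a sketch of such a follow-up transformation could look like this:</p>
<pre><code class="language-sql">SELECT favorite_color, count(*) AS n
FROM read_avro('some_example_file.avro')
WHERE favorite_number IS NOT NULL
GROUP BY favorite_color
ORDER BY n DESC;
</code></pre>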
<h3 id="file-io">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/09/duckdb-avro-extension.html#file-io">File IO</a>
</h3>
<p>The <code class="language-plaintext highlighter-rouge">read_avro</code> function is integrated into DuckDB's file system abstraction, meaning you can read Avro files directly from e.g., HTTP or S3 sources. For example:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">FROM</span> <span class="nf">read_avro</span><span class="p">(</span><span class="s1">'http://blobs.duckdb.org/data/userdata1.avro'</span><span class="p">);</span>
<span class="k">FROM</span> <span class="nf">read_avro</span><span class="p">(</span><span class="s1">'s3://my-example-bucket/some_example_file.avro'</span><span class="p">);</span>
</code></pre></div></div>
<p>should “just” work.</p>
<p>You can also <a href="https://duckdb.org/docs/sql/functions/pattern_matching.html#globbing"><em>glob</em> multiple files</a> in a single read call or pass a list of files to the functions:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">FROM</span> <span class="nf">read_avro</span><span class="p">(</span><span class="s1">'some_example_file_*.avro'</span><span class="p">);</span>
<span class="k">FROM</span> <span class="nf">read_avro</span><span class="p">([</span><span class="s1">'some_example_file_1.avro'</span><span class="p">,</span> <span class="s1">'some_example_file_2.avro'</span><span class="p">]);</span>
</code></pre></div></div>
<p>If the filenames somehow contain valuable information (as is unfortunately all-too-common), you can pass the <code class="language-plaintext highlighter-rouge">filename</code> argument to <code class="language-plaintext highlighter-rouge">read_avro</code>:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">FROM</span> <span class="nf">read_avro</span><span class="p">(</span><span class="s1">'some_example_file_*.avro'</span><span class="p">,</span> <span class="k">filename</span> <span class="o">=</span> <span class="k">true</span><span class="p">);</span>
</code></pre></div></div>
<p>This will result in an additional column in the result set that contains the actual filename of the Avro file.</p>
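<p>The extra column behaves like any other column; a small sketch (with the hypothetical glob from above):</p>
<pre><code class="language-sql">SELECT filename, count(*) AS records
FROM read_avro('some_example_file_*.avro', filename = true)
GROUP BY filename
ORDER BY filename;
</code></pre>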
<h3 id="schema-conversion">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/09/duckdb-avro-extension.html#schema-conversion">Schema Conversion</a>
</h3>
<p>This extension automatically translates the Avro Schema to the DuckDB schema. <em>All</em> Avro types can be translated, except for <em>recursive type definitions</em>, which DuckDB does not support.</p>
<p>The type mapping is very straightforward except for Avro's “unique” way of handling <code class="language-plaintext highlighter-rouge">NULL</code>. Unlike other systems, Avro does not treat <code class="language-plaintext highlighter-rouge">NULL</code> as a possible value in a range of e.g., <code class="language-plaintext highlighter-rouge">INTEGER</code> but instead represents <code class="language-plaintext highlighter-rouge">NULL</code> as a union of the actual type with a special <code class="language-plaintext highlighter-rouge">NULL</code> type. This is different to DuckDB, where any value can be <code class="language-plaintext highlighter-rouge">NULL</code>. Of course DuckDB also supports <code class="language-plaintext highlighter-rouge">UNION</code> types, but this would be quite cumbersome to work with.</p>
<p>This extension <em>simplifies</em> the Avro schema where possible: an Avro union of any type and the special null type is simplified to just the non-null type. For example, an Avro record of the union type <code class="language-plaintext highlighter-rouge">["int", "null"]</code> (like <code class="language-plaintext highlighter-rouge">favorite_number</code> in the <a href="https://duckdb.org/2024/12/09/duckdb-avro-extension.html#header-block">example</a>) becomes a DuckDB <code class="language-plaintext highlighter-rouge">INTEGER</code>, which just happens to be <code class="language-plaintext highlighter-rouge">NULL</code> sometimes. Similarly, an Avro union that contains only a single type is converted to the type it contains. For example, an Avro record of the union type <code class="language-plaintext highlighter-rouge">["int"]</code> also becomes a DuckDB <code class="language-plaintext highlighter-rouge">INTEGER</code>.</p>
<p>The extension also “flattens” the Avro schema. Avro defines tables as root-level “record” fields, which are the same as DuckDB <code class="language-plaintext highlighter-rouge">STRUCT</code> fields. For more convenient handling, this extension turns the entries of a single top-level record into top-level columns.</p>
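<p>Under these rules, the <code class="language-plaintext highlighter-rouge">User</code> schema from the example above should come out as a flat three-column table. One way to inspect the converted schema is a plain <code class="language-plaintext highlighter-rouge">DESCRIBE</code>; the column types shown below are what we would expect, not output copied from a real file:</p>
<pre><code class="language-sql">DESCRIBE FROM read_avro('some_example_file.avro');
-- expected, roughly:
-- name             VARCHAR
-- favorite_number  INTEGER
-- favorite_color   VARCHAR
</code></pre>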
<h3 id="implementation">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/09/duckdb-avro-extension.html#implementation">Implementation</a>
</h3>
<p>Internally, this extension uses the “official” <a href="https://avro.apache.org/docs/++version++/api/c/">Apache Avro C API</a>, albeit with some minor patching to allow reading Avro files from memory.</p>
<h3 id="limitations--next-steps">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/09/duckdb-avro-extension.html#limitations--next-steps">Limitations &amp; Next Steps</a>
</h3>
<p>In the following, we disclose the limitations of the <code class="language-plaintext highlighter-rouge">avro</code> DuckDB extension along with our plans to mitigate them in the future:</p>
<ul>
<li>
<p>The extension currently does not make use of <strong>parallelism</strong> when reading either a single (large) Avro file or a list of files. Adding support for parallelism in the latter case is on the roadmap.</p>
</li>
<li>
<p>There is currently no support for projection or filter <strong>pushdown</strong>, but this is also planned at a later stage.</p>
</li>
<li>
<p>There is currently no support for the Wasm or the Windows-MinGW builds of DuckDB due to issues with the Avro library dependency (sigh again). We plan to fix this eventually.</p>
</li>
<li>
<p>As mentioned above, DuckDB cannot express recursive type definitions that Avro has. This is unlikely to ever change.</p>
</li>
<li>
<p>There is no support to allow users to provide a separate Avro schema file. This is unlikely to change, as all Avro files we have seen so far had their schema embedded.</p>
</li>
<li>
<p>There is currently no support for the <code class="language-plaintext highlighter-rouge">union_by_name</code> flag that other readers in DuckDB support (see the sketch after this list). This is planned for the future.</p>
</li>
</ul>
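<p>For reference, here is roughly what the <code class="language-plaintext highlighter-rouge">union_by_name</code> flag looks like with DuckDB's existing readers (a sketch with made-up file names); the idea is to align columns by name across files whose schemas differ:</p>
<pre><code class="language-sql">-- supported today by read_parquet / read_csv, not yet by read_avro
FROM read_parquet(['file_v1.parquet', 'file_v2.parquet'], union_by_name = true);
</code></pre>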
<h2 id="conclusion">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/09/duckdb-avro-extension.html#conclusion">Conclusion</a>
</h2>
<p>The new <code class="language-plaintext highlighter-rouge">avro</code> Community Extension for DuckDB enables DuckDB to read Avro files directly as if they were tables. If you have a bunch of Avro files, go ahead and try it out! We'd love to <a href="https://github.com/hannes/duckdb_avro/issues">hear from you</a> if you run into any issues.</p>
</description>
<link>https://duckdb.org/2024/12/09/duckdb-avro-extension.html</link>
<guid isPermaLink="false">https://duckdb.org/2024/12/09/duckdb-avro-extension.html</guid>
<pubDate>Mon, 09 Dec 2024 00:00:00 GMT</pubDate>
<author>Hannes Mühleisen</author>
</item>
<item>
<title>DuckDB: Running TPC-H SF100 on Mobile Phones</title>
<description><div class="excerpt">
<p><em>TL;DR: DuckDB runs on mobile platforms such as iOS and Android, and completes the TPC-H benchmark faster than state-of-the-art research systems on big iron machines 20 years ago.</em></p>
</div>
<p>A few weeks ago, we set out to perform a series of experiments to answer two simple questions:</p>
<ol>
<li>Can DuckDB complete the TPC-H queries on the SF100 data set when running on a new smartphone?</li>
<li>If so, can DuckDB complete a run in less than 400 seconds, i.e., faster than the system in the research paper that originally introduced vectorized query processing?</li>
</ol>
<p>These questions took us on an interesting quest.
Along the way, we had a lot of fun and learned the difference between a cold run and a <em>really cold</em> run.
Read on to find out more.</p>
<h2 id="a-song-of-dry-ice-and-fire">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/06/duckdb-tpch-sf100-on-mobile.html#a-song-of-dry-ice-and-fire">A Song of Dry Ice and Fire</a>
</h2>
<p>Our first attempt was to use an iPhone, namely an <a href="https://www.gsmarena.com/apple_iphone_16_pro-13315.php">iPhone 16 Pro</a>.
This phone has 8 GB memory and a 6-core CPU with 2 performance cores (running at 4.05 GHz) and 4 efficiency cores (running at 2.42 GHz).</p>
<p>We implemented the application using the <a href="https://duckdb.org/docs/api/swift.html">DuckDB Swift client</a> and loaded the benchmark on the phone, all 30 GB of it.
We quickly found that the iPhone can indeed run the workload without any problems – except that it heated up during the workload. This prompted the phone to perform thermal throttling, slowing down the CPU to reduce heat production. Due to this, DuckDB took 615.1 seconds. Not bad but not enough to reach our goal.</p>
<p>The results got us thinking: what if we improve the cooling of the phone? To this end, we purchased a box of dry ice, which has a temperature below -50 degrees Celsius, and put the phone in the box for the duration of the experiments.</p>
<div align="center">
<img src="https://duckdb.org/images/blog/tpch-mobile/ice-cooled-iphone-1.jpg" alt="iPhone in a box of dry ice, running TPC-H" width="600px" referrerpolicy="no-referrer"></div>
<div align="center">iPhone in a box of dry ice, running TPC-H. Don't try this at home.</div>
<p>This helped a lot: DuckDB completed in 478.2 seconds. This is a more than 20% improvement – but we still didn't manage to be under 400 seconds.</p>
<div align="center">
<img src="https://duckdb.org/images/blog/tpch-mobile/ice-cooled-iphone-2.jpg" alt="The phone with icing on it, a few minutes after finishing the benchmark" width="300px" referrerpolicy="no-referrer"></div>
<div align="center">The phone a few minutes after finishing the benchmark. It no longer booted because the battery was too cold!</div>
<h2 id="do-androids-dream-of-electric-ducks">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/06/duckdb-tpch-sf100-on-mobile.html#do-androids-dream-of-electric-ducks">Do Androids Dream of Electric Ducks?</a>
</h2>
<p>In our next experiment, we picked up a <a href="https://www.gsmarena.com/samsung_galaxy_s24_ultra-12771.php">Samsung Galaxy S24 Ultra phone</a>, which runs Android 14. This phone is full of interesting hardware. First, it has an 8-core CPU with 4 different core types (1×3.39 GHz, 3×3.10 GHz, 2×2.90 GHz and 2×2.20 GHz). Second, it has a huge amount of RAM – 12 GB to be precise. Finally, its cooling system includes a <a href="https://www.sammobile.com/news/galaxy-s24-sustain-performance-bigger-vapor-chamber/">vapor chamber</a> for improved heat dissipation.</p>
<p>We ran DuckDB in the <a href="https://termux.dev/en/">Termux terminal emulator</a>. We compiled DuckDB <a href="https://duckdb.org/docs/api/cli/overview.html">CLI client</a> from source following the <a href="https://duckdb.org/docs/dev/building/build_instructions.html#android">Android build instructions</a> and ran the experiments from the command line.</p>
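<p>For readers who want to reproduce a (smaller) run, the benchmark can also be driven entirely from SQL with DuckDB's <code class="language-plaintext highlighter-rouge">tpch</code> extension; the sketch below uses scale factor 1 rather than 100 to stay phone-friendly:</p>
<pre><code class="language-sql">INSTALL tpch;
LOAD tpch;
CALL dbgen(sf = 1);  -- generate the TPC-H tables
PRAGMA tpch(1);      -- run query 1; repeat for queries 2 through 22
</code></pre>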
<div align="center">
<img src="https://duckdb.org/images/blog/tpch-mobile/duckdb-termux-android-emulator.png" alt="Screenshot of DuckDB in Termux, running in the Android emulator" width="600px" referrerpolicy="no-referrer"></div>
<div align="center">DuckDB in Termux, running in the Android emulator</div>
<p>In the end, it wasn't even close. The Android phone completed the benchmark in 235.0 seconds, outperforming our baseline by around 40%.</p>
<h2 id="never-was-a-cloudy-day">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/06/duckdb-tpch-sf100-on-mobile.html#never-was-a-cloudy-day">Never Was a Cloudy Day</a>
</h2>
<p>The results got us thinking: how do the results stack up among cloud servers? We picked two x86-based cloud instances in AWS EC2 with instance-attached NVMe storage.</p>
<p>The details of these benchmarks are far less interesting than those of the previous ones. We booted up the instances with Ubuntu 24.04 and ran DuckDB in the command line. We found that an <a href="https://instances.vantage.sh/aws/ec2/r6id.large"><code class="language-plaintext highlighter-rouge">r6id.large</code> instance</a> (2 vCPUs with 16 GB RAM) completes the queries in 570.8 seconds, which is roughly on-par with an air-cooled iPhone. However, an <a href="https://instances.vantage.sh/aws/ec2/r6id.xlarge"><code class="language-plaintext highlighter-rouge">r6id.xlarge</code></a> (4 vCPUs with 32 GB RAM) completes the benchmark in 166.2 seconds, faster than any result we achieved on phones.</p>
<h2 id="summary-of-duckdb-results">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/06/duckdb-tpch-sf100-on-mobile.html#summary-of-duckdb-results">Summary of DuckDB Results</a>
</h2>
<p>The table contains a summary of the DuckDB benchmark results.</p>
<table>
<thead>
<tr>
<th>Setup</th>
<th style="text-align: right">CPU cores</th>
<th style="text-align: right">Memory</th>
<th style="text-align: right">Runtime</th>
</tr>
</thead>
<tbody>
<tr>
<td>iPhone 16 Pro (air-cooled)</td>
<td style="text-align: right">6</td>
<td style="text-align: right">8 GB</td>
<td style="text-align: right">615.1 s</td>
</tr>
<tr>
<td>iPhone 16 Pro (dry ice-cooled)</td>
<td style="text-align: right">6</td>
<td style="text-align: right">8 GB</td>
<td style="text-align: right">478.2 s</td>
</tr>
<tr>
<td>Samsung Galaxy S24 Ultra</td>
<td style="text-align: right">8</td>
<td style="text-align: right">12 GB</td>
<td style="text-align: right">235.0 s</td>
</tr>
<tr>
<td>AWS EC2 <code class="language-plaintext highlighter-rouge">r6id.large</code></td>
<td style="text-align: right">2</td>
<td style="text-align: right">16 GB</td>
<td style="text-align: right">570.8 s</td>
</tr>
<tr>
<td>AWS EC2 <code class="language-plaintext highlighter-rouge">r6id.xlarge</code></td>
<td style="text-align: right">4</td>
<td style="text-align: right">32 GB</td>
<td style="text-align: right">166.2 s</td>
</tr>
</tbody>
</table>
<h2 id="historical-context">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/06/duckdb-tpch-sf100-on-mobile.html#historical-context">Historical Context</a>
</h2>
<p>So why did we set out to run these experiments in the first place?</p>
<p>Just a few weeks ago, <a href="https://cwi.nl/">CWI</a>, the birthplace of DuckDB, held a ceremony for the <a href="https://www.cwi.nl/en/events/dijkstra-awards/cwi-lectures-dijkstra-fellowship/">Dijkstra Fellowship</a>.
The fellowship was awarded to Marcin Żukowski for his pioneering role in the development of database management systems and his successful entrepreneurial career that resulted in systems such as <a href="https://en.wikipedia.org/wiki/Actian_Vector">VectorWise</a> and <a href="https://en.wikipedia.org/wiki/Snowflake_Inc.">Snowflake</a>.</p>
<p>A lot of ideas that originate in Marcin's research are used in DuckDB. Most importantly, <em>vectorized query processing</em> allows DuckDB to be both fast and portable at the same time.
With his co-authors Peter Boncz and Niels Nes, he first described this paradigm in the CIDR 2005 paper <a href="https://www.cidrdb.org/cidr2005/papers/P19.pdf">“MonetDB/X100: Hyper-Pipelining Query Execution”</a>.</p>
<blockquote>
<p>The terms <em>vectorization,</em> <em>hyper-pipelining,</em> and <em>superscalar</em> refer to the same idea: processing data in slices, which turns out to be a good compromise between row-at-a-time and column-at-a-time processing. DuckDB's query engine uses the same principle.</p>
</blockquote>
<p>This paper was published in January 2005, so it's safe to assume that it was finalized in late 2004 – almost exactly 20 years ago!</p>
<p>If we read the paper, we learn that the experiments were carried out on an HP workstation equipped with 12 GB of memory (the same amount as the Samsung phone has today!).
It also had an Itanium CPU and looked like this:</p>
<div align="center">
<img src="https://duckdb.org/images/blog/tpch-mobile/hp-itanium-workstation.jpg" alt="The Itanium2 workstation used in original the experiments" width="600px" referrerpolicy="no-referrer"></div>
<div align="center">The Itanium2 workstation used in original the experiments (source: <a href="https://commons.wikimedia.org/wiki/File:HP-HP9000-ZX6000-Itanium2-Workstation_11.jpg">Wikimedia</a>)</div>
<blockquote>
<p>Upon its release in 2001, the <a href="https://en.wikipedia.org/wiki/Itanium">Itanium</a> was aimed at the high-end market with the goal of eventually replacing the then-dominant x86 architecture with a new instruction set that focused heavily on <a href="https://en.wikipedia.org/wiki/Single_instruction,_multiple_data">SIMD (single instruction, multiple data)</a>. While this ambition did not work out, the Itanium was the state-of-the-art architecture of its day. Due to the focus on the server market, the Itanium CPUs had a large amount of cache: the <a href="https://www.intel.com/content/www/us/en/products/sku/27982/intel-itanium-processor-1-30-ghz-3m-cache-400-mhz-fsb/specifications.html">1.3 GHz Itanium2 model used in the experiments</a> had 3 MB of L2 cache, while Pentium 4 CPUs released around that time only had 0.5–1 MB.</p>
</blockquote>
<p>The paper provides a detailed breakdown of the runtimes:</p>
<div align="center">
<img src="https://duckdb.org/images/blog/tpch-mobile/cidr2005-monetdb-x100-results.png" alt="Benchmark results from the CIDR 2005 paper “MonetDB/X100: Hyper-Pipelining Query Execution”" width="450px" referrerpolicy="no-referrer"></div>
<div align="center">Benchmark results from the paper “MonetDB/X100: Hyper-Pipelining Query Execution”</div>
<p>The total runtime of the TPC-H SF100 queries was 407.9 seconds – hence our baseline for the experiments.
Here is a video of Hannes presenting the results at the event:</p>
<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/H1N2Jr34jwU?si=7wYychjmxpRWPqcm&amp;start=1617" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="no-referrer" allowfullscreen=""></iframe>
</div>
<p>And here are all results visualized on a plot:</p>
<div align="center">
<img src="https://duckdb.org/images/blog/tpch-mobile/tpch-mobile-experiment-runtimes.svg" alt="Plot with the TPC-H SF100 experiment results for MonetDB/X100 and DuckDB" width="750px" referrerpolicy="no-referrer"></div>
<div align="center">TPC-H SF100 total query runtimes for MonetDB/X100 and DuckDB</div>
<h2 id="conclusion">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/06/duckdb-tpch-sf100-on-mobile.html#conclusion">Conclusion</a>
</h2>
<p>It was a long journey from the original vectorized execution paper to running an analytical database on a phone.
Many key innovations happened that allowed these results, and the big improvement in hardware is just one of them.
Another crucial component is that compiler optimizations became a lot more sophisticated.
Thanks to this, while the MonetDB/X100 system needed to use explicit SIMD, DuckDB can rely on the <a href="https://en.wikipedia.org/wiki/Automatic_vectorization">auto-vectorization</a> of our (carefully constructed) loops.</p>
<p>All that's left is to answer the questions that we posed at the beginning of our journey.
Yes, DuckDB can run TPC-H SF100 on a mobile phone.
And yes, in some cases it can even outperform a research prototype running on a high-end machine of 2004 – on a modern smartphone that fits in your pocket.</p>
<p>And with newer hardware, smarter compilers and yet-to-be-discovered database optimizations, future versions are only going to be faster.</p>
</description>
<link>https://duckdb.org/2024/12/06/duckdb-tpch-sf100-on-mobile.html</link>
<guid isPermaLink="false">https://duckdb.org/2024/12/06/duckdb-tpch-sf100-on-mobile.html</guid>
<pubDate>Fri, 06 Dec 2024 00:00:00 GMT</pubDate>
<author>Gabor Szarnyas, Laurens Kuiper, Hannes Mühleisen</author>
</item>
<item>
<title>CSV Files: Dethroning Parquet as the Ultimate Storage File Format — or Not?</title>
<description><div class="excerpt">
<p><em>TL;DR: Data analytics primarily uses two types of storage format files: human-readable text files like CSV and performance-driven binary files like Parquet. This blog post compares these two formats in an ultimate showdown of performance and flexibility, where there can be only one winner.</em></p>
</div>
<h2 id="file-formats">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/05/csv-files-dethroning-parquet-or-not.html#file-formats">File Formats</a>
</h2>
<h3 id="csv-files">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/05/csv-files-dethroning-parquet-or-not.html#csv-files">CSV Files</a>
</h3>
<p>Data is most <a href="https://www.vldb.org/pvldb/vol17/p3694-saxena.pdf">commonly stored</a> in human-readable file formats, like JSON or CSV files. These file formats are easy to operate on, since anyone with a text editor can simply open, alter, and understand them.</p>
<p>For many years, CSV files have had a bad reputation for being slow and cumbersome to work with. In practice, if you want to operate on a CSV file using your favorite database system, you must follow this recipe:</p>
<ol>
<li>Manually discover its schema by opening the file in a text editor.</li>
<li>Create a table with the given schema.</li>
<li>Manually figure out the dialect of the file (e.g., which character is used for a quote?)</li>
<li>Load the file into the table using a <code class="language-plaintext highlighter-rouge">COPY</code> statement and with the dialect set.</li>
<li>Start querying it.</li>
</ol>
<p>Not only is this process tedious, but parallelizing a CSV file reader is <a href="https://www.microsoft.com/en-us/research/uploads/prod/2019/04/chunker-sigmod19.pdf">far from trivial</a>. This means most systems either process it single-threaded or use a two-pass approach.</p>
<p>Additionally, <a href="https://youtu.be/YrqSp8m7fmk?si=v5rmFWGJtpiU5_PX&amp;t=624">CSV files are wild</a>: although <a href="https://www.ietf.org/rfc/rfc4180.txt">RFC-4180</a> exists as a CSV standard, it is <a href="https://aic.ai.wu.ac.at/~polleres/publications/mitl-etal-2016OBD.pdf">commonly ignored</a>. Systems must therefore be sufficiently robust to handle these files as if they come straight from the wild west.</p>
<p>Last but not least, CSV files are wasteful: data is always laid out as strings. For example, numeric values like <code class="language-plaintext highlighter-rouge">1000000000</code> take 10 bytes instead of 4 bytes if stored as an <code class="language-plaintext highlighter-rouge">int32</code>. Additionally, since the data layout is row-wise, opportunities to apply <a href="https://duckdb.org/2022/10/28/lightweight-compression.html">lightweight columnar compression</a> are lost.</p>
<h3 id="parquet-files">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/05/csv-files-dethroning-parquet-or-not.html#parquet-files">Parquet Files</a>
</h3>
<p>Due to these shortcomings, performance-driven file formats like Parquet have gained significant popularity in recent years. Parquet files cannot be opened by general text editors, cannot be easily edited, and have a rigid schema. However, they store data in columns, apply various compression techniques, partition the data into row groups, maintain statistics about these row groups, and define their schema directly in the file.</p>
<p>These features make Parquet a monolith of a file format — highly inflexible but efficient and fast. It is easy to read data from a Parquet file since the schema is well-defined. Parallelizing a scanner is straightforward, as each thread can independently process a row group. Filter pushdown is also simple to implement, as each row group contains statistical metadata, and the file sizes are very small.</p>
<p>The conclusion should be simple: if you have small files and need flexibility, CSV files are fine. However, for data analysis, one should pivot to Parquet files, right? Well, this pivot may not be a hard requirement anymore – read on to find out why!</p>
<h2 id="reading-csv-files-in-duckdb">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/05/csv-files-dethroning-parquet-or-not.html#reading-csv-files-in-duckdb">Reading CSV Files in DuckDB</a>
</h2>
<p>For the past few releases, DuckDB has doubled down on delivering not only an easy-to-use CSV scanner but also an extremely performant one. This scanner features its own custom <a href="https://duckdb.org/2023/10/27/csv-sniffer.html">CSV sniffer</a>, parallelization algorithm, buffer manager, casting mechanisms, and state machine-based parser.</p>
<p>For usability, the previous paradigm of manual schema discovery and table creation has been changed. Instead, DuckDB now utilizes a CSV Sniffer, similar to those found in dataframe libraries like Pandas.
This allows for querying CSV files as easily as:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">FROM</span> <span class="s1">'path/to/file.csv'</span><span class="p">;</span>
</code></pre></div></div>
<p>Or tables to be created from CSV files, without any prior schema definition with:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">t</span> <span class="k">AS</span> <span class="k">FROM</span> <span class="s1">'path/to/file.csv'</span><span class="p">;</span>
</code></pre></div></div>
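<p>As a side note, the sniffer's decisions themselves can be inspected in recent DuckDB versions; a minimal sketch, assuming a hypothetical file path:</p>
<pre><code class="language-sql">FROM sniff_csv('path/to/file.csv');
</code></pre>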
<p>Furthermore, the reader became one of the fastest CSV readers in analytical systems, as can be seen by the load times of the <a href="https://github.com/ClickHouse/ClickBench/commit/0aba4247ce227b3058d22846ca39826d27262fe0">latest iteration</a> of <a href="https://benchmark.clickhouse.com/#eyJzeXN0ZW0iOnsiQWxsb3lEQiI6ZmFsc2UsIkFsbG95REIgKHR1bmVkKSI6ZmFsc2UsIkF0aGVuYSAocGFydGl0aW9uZWQpIjpmYWxzZSwiQXRoZW5hIChzaW5nbGUpIjpmYWxzZSwiQXVyb3JhIGZvciBNeVNRTCI6ZmFsc2UsIkF1cm9yYSBmb3IgUG9zdGdyZVNRTCI6ZmFsc2UsIkJ5Q29uaXR5IjpmYWxzZSwiQnl0ZUhvdXNlIjpmYWxzZSwiY2hEQiAoRGF0YUZyYW1lKSI6ZmFsc2UsImNoREIgKFBhcnF1ZXQsIHBhcnRpdGlvbmVkKSI6ZmFsc2UsImNoREIiOmZhbHNlLCJDaXR1cyI6ZmFsc2UsIkNsaWNrSG91c2UgQ2xvdWQgKGF3cykiOmZhbHNlLCJDbGlja0hvdXNlIENsb3VkIChhenVyZSkiOmZhbHNlLCJDbGlja0hvdXNlIENsb3VkIChnY3ApIjpmYWxzZSwiQ2xpY2tIb3VzZSAoZGF0YSBsYWtlLCBwYXJ0aXRpb25lZCkiOmZhbHNlLCJDbGlja0hvdXNlIChkYXRhIGxha2UsIHNpbmdsZSkiOmZhbHNlLCJDbGlja0hvdXNlIChQYXJxdWV0LCBwYXJ0aXRpb25lZCkiOmZhbHNlLCJDbGlja0hvdXNlIChQYXJxdWV0LCBzaW5nbGUpIjpmYWxzZSwiQ2xpY2tIb3VzZSAod2ViKSI6ZmFsc2UsIkNsaWNrSG91c2UiOnRydWUsIkNsaWNrSG91c2UgKHR1bmVkKSI6dHJ1ZSwiQ2xpY2tIb3VzZSAodHVuZWQsIG1lbW9yeSkiOnRydWUsIkNsb3VkYmVycnkiOmZhbHNlLCJDcmF0ZURCIjpmYWxzZSwiQ3J1bmNoeSBCcmlkZ2UgZm9yIEFuYWx5dGljcyAoUGFycXVldCkiOmZhbHNlLCJEYXRhYmVuZCI6dHJ1ZSwiRGF0YUZ1c2lvbiAoUGFycXVldCwgcGFydGl0aW9uZWQpIjpmYWxzZSwiRGF0YUZ1c2lvbiAoUGFycXVldCwgc2luZ2xlKSI6ZmFsc2UsIkFwYWNoZSBEb3JpcyI6ZmFsc2UsIkRyaWxsIjpmYWxzZSwiRHJ1aWQiOmZhbHNlLCJEdWNrREIgKERhdGFGcmFtZSkiOmZhbHNlLCJEdWNrREIgKG1lbW9yeSkiOnRydWUsIkR1Y2tEQiAoUGFycXVldCwgcGFydGl0aW9uZWQpIjpmYWxzZSwiRHVja0RCIjpmYWxzZSwiRWxhc3RpY3NlYXJjaCI6ZmFsc2UsIkVsYXN0aWNzZWFyY2ggKHR1bmVkKSI6ZmFsc2UsIkdsYXJlREIiOmZhbHNlLCJHcmVlbnBsdW0iOmZhbHNlLCJIZWF2eUFJIjpmYWxzZSwiSHlkcmEiOmZhbHNlLCJJbmZvYnJpZ2h0IjpmYWxzZSwiS2luZXRpY2EiOmZhbHNlLCJNYXJpYURCIENvbHVtblN0b3JlIjpmYWxzZSwiTWFyaWFEQiI6ZmFsc2UsIk1vbmV0REIiOmZhbHNlLCJNb25nb0RCIjpmYWxzZSwiTW90aGVyRHVjayI6ZmFsc2UsIk15U1FMIChNeUlTQU0pIjpmYWxzZSwiTXlTUUwiOmZhbHNlLCJPY3RvU1FMIjpmYWxzZSwiT3hsYSI6ZmFsc2UsIlBhbmRhcyAoRGF0YUZyYW1lKSI6ZmFsc2UsIlBhcmFkZURCIChQYXJxdWV0LCBwYXJ0aXRpb25lZCkiOmZhbHNlLCJQYXJhZGVEQiAoUGFycXVldCwgc2luZ2xlKSI6ZmFsc2UsInBnX2R1Y2tkYiAoTW90aGVyRHVjayBlbmFibGVkKSI6ZmFsc2UsInBnX2R1Y2tkYiI6ZmFsc2UsIlBpbm90IjpmYWxzZSwiUG9sYXJzIChEYXRhRnJhbWUpIjpmYWxzZSwiUG9sYXJzIChQYXJxdWV0KSI6ZmFsc2UsIlBvc3RncmVTUUwgKHR1bmVkKSI6ZmFsc2UsIlBvc3RncmVTUUwiOmZhbHNlLCJRdWVzdERCIjp0cnVlLCJSZWRzaGlmdCI6ZmFsc2UsIlNlbGVjdERCIjpmYWxzZSwiU2luZ2xlU3RvcmUiOmZhbHNlLCJTbm93Zmxha2UiOmZhbHNlLCJTcGFyayI6ZmFsc2UsIlNRTGl0ZSI6ZmFsc2UsIlN0YXJSb2NrcyI6ZmFsc2UsIlRhYmxlc3BhY2UiOmZhbHNlLCJUZW1ibyBPTEFQIChjb2x1bW5hcikiOmZhbHNlLCJUaW1lc2NhbGUgQ2xvdWQiOmZhbHNlLCJUaW1lc2NhbGVEQiAobm8gY29sdW1uc3RvcmUpIjpmYWxzZSwiVGltZXNjYWxlREIiOmZhbHNlLCJUaW55YmlyZCAoRnJlZSBUcmlhbCkiOmZhbHNlLCJVbWJyYSI6dHJ1ZX0sInR5cGUiOnsiQyI6dHJ1ZSwiY29sdW1uLW9yaWVudGVkIjp0cnVlLCJQb3N0Z3JlU1FMIGNvbXBhdGlibGUiOnRydWUsIm1hbmFnZWQiOnRydWUsImdjcCI6dHJ1ZSwic3RhdGVsZXNzIjp0cnVlLCJKYXZhIjp0cnVlLCJDKysiOnRydWUsIk15U1FMIGNvbXBhdGlibGUiOnRydWUsInJvdy1vcmllbnRlZCI6dHJ1ZSwiQ2xpY2tIb3VzZSBkZXJpdmF0aXZlIjp0cnVlLCJlbWJlZGRlZCI6dHJ1ZSwic2VydmVybGVzcyI6dHJ1ZSwiZGF0YWZyYW1lIjp0cnVlLCJhd3MiOnRydWUsImF6dXJlIjp0cnVlLCJhbmFseXRpY2FsIjp0cnVlLCJSdXN0Ijp0cnVlLCJzZWFyY2giOnRydWUsImRvY3VtZW50Ijp0cnVlLCJHbyI6dHJ1ZSwic29tZXdoYXQgUG9zdGdyZVNRTCBjb21wYXRpYmxlIjp0cnVlLCJEYXRhRnJhbWUiOnRydWUsInBhcnF1ZXQiOnRydWUsInRpbWUtc2VyaWVzIjp0cnVlfSwibWFjaGluZSI6eyIxNiB2Q1BVIDEyOEdCIjpmYWxzZSwiOCB2Q1BVIDY0R0IiOmZhbHNlLCJzZXJ2ZXJsZXNzIjpmYWxzZSwiMTZhY3UiOmZhbHNlLCJjNmEuNHhsYXJnZSwgNTAwZ2IgZ3AyIjpmYWxzZSwiTCI6ZmFsc2UsIk0iOmZh
bHNlLCJTIjpmYWxzZSwiWFMiOmZhbHNlLCJjNmEubWV0YWwsIDUwMGdiIGdwMiI6dHJ1ZSwiMTkyR0IiOmZhbHNlLCIyNEdCIjpmYWxzZSwiMzYwR0IiOmZhbHNlLCI0OEdCIjpmYWxzZSwiNzIwR0IiOmZhbHNlLCI5NkdCIjpmYWxzZSwiZGV2IjpmYWxzZSwiNzA4R0IiOmZhbHNlLCJjNW4uNHhsYXJnZSwgNTAwZ2IgZ3AyIjpmYWxzZSwiQW5hbHl0aWNzLTI1NkdCICg2NCB2Q29yZXMsIDI1NiBHQikiOmZhbHNlLCJjNS40eGxhcmdlLCA1MDBnYiBncDIiOmZhbHNlLCJjNmEuNHhsYXJnZSwgMTUwMGdiIGdwMiI6ZmFsc2UsImNsb3VkIjpmYWxzZSwiZGMyLjh4bGFyZ2UiOmZhbHNlLCJyYTMuMTZ4bGFyZ2UiOmZhbHNlLCJyYTMuNHhsYXJnZSI6ZmFsc2UsInJhMy54bHBsdXMiOmZhbHNlLCJTMiI6ZmFsc2UsIlMyNCI6ZmFsc2UsIjJYTCI6ZmFsc2UsIjNYTCI6ZmFsc2UsIjRYTCI6ZmFsc2UsIlhMIjpmYWxzZSwiTDEgLSAxNkNQVSAzMkdCIjpmYWxzZSwiYzZhLjR4bGFyZ2UsIDUwMGdiIGdwMyI6ZmFsc2UsIjE2IHZDUFUgNjRHQiI6ZmFsc2UsIjQgdkNQVSAxNkdCIjpmYWxzZSwiOCB2Q1BVIDMyR0IiOmZhbHNlfSwiY2x1c3Rlcl9zaXplIjp7IjEiOnRydWUsIjIiOmZhbHNlLCI0IjpmYWxzZSwiOCI6ZmFsc2UsIjE2IjpmYWxzZSwiMzIiOmZhbHNlLCI2NCI6ZmFsc2UsIjEyOCI6ZmFsc2UsInNlcnZlcmxlc3MiOmZhbHNlLCJ1bmRlZmluZWQiOmZhbHNlfSwibWV0cmljIjoibG9hZCIsInF1ZXJpZXMiOlt0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlXX0=">ClickBench</a>. In this benchmark, the data is loaded from an <a href="https://datasets.clickhouse.com/hits_compatible/hits.csv.gz">82 GB uncompressed CSV file</a> into a database table.</p>
<div align="center">
<img src="https://duckdb.org/images/blog/csv-vs-parquet-clickbench.png" alt="Image showing the ClickBench result 2024-12-05" width="800px" referrerpolicy="no-referrer"></div>
<div align="center">ClickBench CSV loading times (2024-12-05)</div>
<h2 id="comparing-csv-and-parquet">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/05/csv-files-dethroning-parquet-or-not.html#comparing-csv-and-parquet">Comparing CSV and Parquet</a>
</h2>
<p>With the large boost in usability and performance for the CSV reader, one might ask: what is the actual difference in performance when loading a CSV file compared to a Parquet file into a table? Additionally, how do these formats differ when running queries directly on them?</p>
<p>To find out, we will run a few examples using both CSV and Parquet files containing TPC-H data to shed light on their differences. All scripts used to generate the benchmarks of this blog post can be found in a <a href="https://github.com/pdet/csv_vs_parquet">repository</a>.</p>
<h3 id="usability">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/05/csv-files-dethroning-parquet-or-not.html#usability">Usability</a>
</h3>
<p>In terms of usability, scanning CSV files and Parquet files can differ significantly.</p>
<p>In simple cases, where all options are correctly detected by DuckDB, running queries on either CSV or Parquet files can be done directly.</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">FROM</span> <span class="s1">'path/to/file.csv'</span><span class="p">;</span>
<span class="k">FROM</span> <span class="s1">'path/to/file.parquet'</span><span class="p">;</span>
</code></pre></div></div>
<p>Things can differ drastically for wild, rule-breaking <a href="https://reddead.fandom.com/wiki/Arthur_Morgan">Arthur Morgan</a>-like CSV files. This is evident from the number of parameters that can be set for each scanner. The <a href="https://duckdb.org/docs/data/parquet/overview.html">Parquet</a> scanner has a total of six parameters that can alter how the file is read. For the majority of cases, the user will never need to manually adjust any of them.</p>
<p>The CSV reader, on the other hand, depends on the sniffer being able to automatically detect many different configuration options. For example: What is the delimiter? How many rows should it skip from the top of the file? Are there any comments? And so on. This results in over <a href="https://duckdb.org/docs/data/csv/overview.html">30 configuration options</a> that the user might have to manually adjust to properly parse their CSV file. Again, this number of options is necessary due to the lack of a widely adopted standard. However, in most scenarios, users can rely on the sniffer or, at most, change one or two options.</p>
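<p>When the sniffer does need a nudge, overriding one or two options is usually enough. A sketch with a hypothetical file and options picked purely for illustration:</p>
<pre><code class="language-sql">FROM read_csv('path/to/file.csv', delim = ';', skip = 2, header = true);
</code></pre>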
<p>The CSV reader also has an extensive error-handling system and will always provide suggestions for options to review if something goes wrong.</p>
<p>To give you an example of how the DuckDB error-reporting system works, consider the following CSV file:</p>
<pre><code class="language-csv">Clint Eastwood;94
Samuel L. Jackson
</code></pre>
<p>In this file, the second line is missing the value for the second column.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Invalid Input Error: CSV Error on Line: 2
Original Line: Samuel L. Jackson
Expected Number of Columns: 2 Found: 1
Possible fixes:
* Enable null padding (null_padding=true) to replace missing values with NULL
* Enable ignore errors (ignore_errors=true) to skip this row
file = western_actors.csv
delimiter = , (Auto-Detected)
quote = " (Auto-Detected)
escape = " (Auto-Detected)
new_line = \n (Auto-Detected)
header = false (Auto-Detected)
skip_rows = 0 (Auto-Detected)
comment = \0 (Auto-Detected)
date_format = (Auto-Detected)
timestamp_format = (Auto-Detected)
null_padding = 0
sample_size = 20480
ignore_errors = false
all_varchar = 0
</code></pre></div></div>
<p>DuckDB provides detailed information about any errors encountered. It highlights the line of the CSV file where the issue occurred, presents the original line, and suggests possible fixes for the error, such as ignoring the problematic line or filling missing values with <code class="language-plaintext highlighter-rouge">NULL</code>. It also displays the full configuration used to scan the file and indicates whether the options were auto-detected or manually set.</p>
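<p>Following one of the suggested fixes, the file from the example can be read by padding the missing value with <code class="language-plaintext highlighter-rouge">NULL</code> (a sketch of the proposed option, nothing more):</p>
<pre><code class="language-sql">FROM read_csv('western_actors.csv', null_padding = true);
</code></pre>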
<p>The bottom line here is that, even with the advancements in CSV usability, the strictness of Parquet files makes them much easier to operate on.</p>
<p>Of course, if you need to open your file in a text editor or Excel, you will need to have your data in CSV format. Note that Parquet files do have some visualizers, like <a href="https://www.tadviewer.com/">TAD</a>.</p>
<h3 id="performance">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/05/csv-files-dethroning-parquet-or-not.html#performance">Performance</a>
</h3>
<p>There are primarily two ways to operate on files using DuckDB:</p>
<ol>
<li>
<p>The user creates a DuckDB table from the file and uses the table in future queries. This is a loading process, commonly used if you want to store your data as DuckDB tables or if you will run many queries on them. Also, note that these are the only possible scenarios for most database systems (e.g., Oracle, SQL Server, PostgreSQL, SQLite, …).</p>
</li>
<li>
<p>One might run a query directly on the file scanner without creating a table. This is useful for scenarios where the user has limitations on memory and disk space, or if queries on these files are only executed once. Note that this scenario is typically not supported by database systems but is common for dataframe libraries (e.g., Pandas).</p>
</li>
</ol>
<p>To fairly compare the scanners, we provide the table schemas upfront, ensuring that the scanners produce the exact same data types. We also set <code class="language-plaintext highlighter-rouge">preserve_insertion_order = false</code>, as this can impact the parallelization of both scanners, and set <code class="language-plaintext highlighter-rouge">max_temp_directory_size = '0GB'</code> to ensure no data is spilled to disk, with all experiments running fully in memory.</p>
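<p>Concretely, these two settings are applied per session before loading; a minimal sketch:</p>
<pre><code class="language-sql">SET preserve_insertion_order = false;
SET max_temp_directory_size = '0GB';
</code></pre>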
<p>We use the default writers for both CSV files and Parquet (with the default Snappy compression), and also run a variation of Parquet with <code class="language-plaintext highlighter-rouge">CODEC 'zstd', COMPRESSION_LEVEL 1</code>, as this can speed up querying/loading times.</p>
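<p>The benchmark files themselves can be written with <code class="language-plaintext highlighter-rouge">COPY ... TO</code>; a sketch of the three variants, with made-up file names and the option spelling taken from the variant named above:</p>
<pre><code class="language-sql">COPY lineitem TO 'lineitem.csv';
COPY lineitem TO 'lineitem_snappy.parquet' (FORMAT parquet);
COPY lineitem TO 'lineitem_zstd.parquet' (FORMAT parquet, CODEC 'zstd', COMPRESSION_LEVEL 1);
</code></pre>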
<p>For all experiments, we use an Apple M1 Max, with 64 GB RAM. We use TPC-H scale factor 20 and report the median times from 5 runs.</p>
<h4 id="creating-tables">
<a style="text-decoration: none;" href="https://duckdb.org/2024/12/05/csv-files-dethroning-parquet-or-not.html#creating-tables">Creating Tables</a>
</h4>
<p>For creating the table, we focus on the <code class="language-plaintext highlighter-rouge">lineitem</code> table.</p>
<p>After defining the schema, both files can be loaded with a simple <code class="language-plaintext highlighter-rouge">COPY</code> statement, with no additional parameters set. Note that even with the schema defined, the CSV sniffer will still be executed to determine the dialect (e.g., quote character, delimiter character, etc.) and match types and names.</p>
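<p>In other words, once the <code class="language-plaintext highlighter-rouge">lineitem</code> schema exists, loading either format is a one-liner (file names are illustrative):</p>
<pre><code class="language-sql">COPY lineitem FROM 'lineitem.csv';
COPY lineitem FROM 'lineitem.parquet';
</code></pre>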
<table>
<thead>
<tr>
<th>Name</th>
<th style="text-align: right">Time (s)</th>
<th style="text-align: right">Size (GB)</th>
</tr>
</thead>
<tbody>
<tr>
<td>CSV</td>
<td style="text-align: right">11.76</td>
<td style="text-align: right">15.95</td>
</tr>
<tr>
<td>Parquet Snappy</td>
<td style="text-align: right">5.21</td>
<td style="text-align: right">3.78</td>
</tr>
<tr>
<td>Parquet ZSTD</td>
<td style="text-align: right">5.52</td>
<td style="text-align: right">3.22</td>
</tr>
</tbody>
</table>
<p>We can see that the Parquet files are definitely smaller (about 5× smaller than the CSV file), but the performance difference is not drastic.</p>
<p>The CSV scanner is only about 2× slower than the Parquet scanner. It's also important to note that some of the cost associated with these operations (~1-2 seconds) is related to the insertion into the DuckDB table, not the sc |
artefaritaKuniklo pushed a commit to artefaritaKuniklo/RSSHub that referenced this pull request on Dec 13, 2024:
* fix(route/duckdb): change blogs link and author
* fix(route/duckdb): update description selector
Involved Issue
Close #
Example for the Proposed Route(s)
New RSS Route Checklist
Puppeteer
Note
Due to an update of the DuckDB blog page layout, the old way of fetching links no longer works.