duckdb 1.29.0; self-host extensions #1734

Merged: 47 commits, Nov 2, 2024
Changes from 39 commits
Commits
c70e1bc
explicit duckdb 1.29.0; self-host core extensions; document
Fil Oct 8, 2024
0029c8c
configure which extensions are self-hosted
Fil Oct 10, 2024
feeaad8
Merge branch 'main' into fil/duckdb-wasm-1.29
Fil Oct 10, 2024
33aa5cb
hash extensions
Fil Oct 10, 2024
543f823
better docs
Fil Oct 10, 2024
7475589
cleaner duckdb manifest — now works in scripts and embeds
Fil Oct 11, 2024
47b6bd0
restructure code, extensible manifest
Fil Oct 11, 2024
abd0380
test, documentation
Fil Oct 11, 2024
7ac5d1d
much nicer config
Fil Oct 11, 2024
0adcb36
document config
Fil Oct 11, 2024
5365371
add support for mvp, clean config & documentation
Fil Oct 11, 2024
1fdf717
parametrized the initial LOAD in DuckDBClient
Fil Oct 11, 2024
bc712c3
tests
Fil Oct 11, 2024
2fb2878
bake-in the extensions manifest
Fil Oct 11, 2024
bc49674
fix test
Fil Oct 11, 2024
9a13f2a
don't activate spatial on the documentation
Fil Oct 11, 2024
e2c8b6c
Merge branch 'main' into fil/duckdb-wasm-1.29
Fil Oct 14, 2024
4a5128d
refactor: hash individual extensions, include the list of platforms i…
Fil Oct 14, 2024
13f892c
don't copy extensions twice
Fil Oct 14, 2024
8bb2866
Merge branch 'main' into fil/duckdb-wasm-1.29
Fil Oct 18, 2024
43ef6eb
Merge branch 'main' into fil/duckdb-wasm-1.29
Fil Oct 19, 2024
6764969
Merge branch 'main' into fil/duckdb-wasm-1.29
mbostock Oct 20, 2024
d72f0c3
Update src/duckdb.ts
Fil Oct 20, 2024
d6fc020
remove DuckDBClientReport utility
Fil Oct 21, 2024
69f25a2
renames
Fil Oct 21, 2024
30788e3
p for platform
Fil Oct 21, 2024
710f36a
centralize DUCKDBWASMVERSION and DUCKDBVERSION
Fil Oct 21, 2024
4f58100
clearer
Fil Oct 21, 2024
a8cfdcd
better config; manifest.extensions now lists individual extensions on…
Fil Oct 21, 2024
490d969
validate extension names; centralize DUCKDBBUNDLES
Fil Oct 21, 2024
aaff8f8
fix tests
Fil Oct 21, 2024
bc39bbe
Merge branch 'main' into fil/duckdb-wasm-1.29
Fil Oct 30, 2024
8bd0972
copy edit
Fil Oct 30, 2024
b90c22a
support loading non-self-hosted extensions
Fil Oct 30, 2024
b37be07
test duckdb config normalization & defaults
Fil Oct 30, 2024
9abaf57
documentation
Fil Oct 30, 2024
ccc0073
typography
Fil Oct 30, 2024
26c7a6f
doc
Fil Oct 31, 2024
4416dd3
Merge branch 'main' into fil/duckdb-wasm-1.29
mbostock Nov 1, 2024
7704416
use view for <50MB
mbostock Nov 1, 2024
1dde616
docs, shorthand, etc.
mbostock Nov 1, 2024
0491966
annotate fixes
mbostock Nov 1, 2024
be26385
disable telemetry on annotate tests, too
mbostock Nov 1, 2024
a23d3e4
tidier duckdb manifest
mbostock Nov 1, 2024
c753728
Merge branch 'main' into fil/duckdb-wasm-1.29
mbostock Nov 1, 2024
6e828c9
remove todo
mbostock Nov 1, 2024
365dbe3
more robust duckdb: scheme
mbostock Nov 2, 2024
23 changes: 23 additions & 0 deletions docs/config.md
@@ -301,6 +301,29 @@ export default {
};
```

## duckdb <a href="https://github.com/observablehq/framework/pull/1734" class="observablehq-version-badge" data-version="prerelease" title="Added in #1734"></a>

The **duckdb** option specifies which DuckDB [extensions](./sql#extensions) to self-host and make available in `sql` code blocks and `DuckDBClient` instances.

Its **extensions** property is an object whose keys are extension names and whose values describe each extension: its **source**, whether to **install** (self-host) it, and whether to **load** it immediately.

The **source** property identifies the repository from which to download the extension. It defaults to `core`, which points to `https://extensions.duckdb.org/`. You can use `core`, `community` (which points to `https://community-extensions.duckdb.org/`), or a custom `https://` URL, for example if you develop your own extensions.

By default, "json" and "parquet" are installed but not loaded: DuckDB autoloads them, so there is no reason to load them before they are needed. If you don’t want to self-host an extension, set its **install** property to false; you can still load it from its source by calling `INSTALL` and `LOAD`.

As a shorthand, you can specify `name: true` to install and load the named extension from the "core" repository. (And `name: false` is shorthand for `{install: false, load: false}`.)

For example, a typical configuration for a geospatial data app might install and load “spatial” from `core` and “h3” from `community`:

```js run=false
duckdb: {
  extensions: {
    spatial: true,
    h3: {source: "community"}
  }
}
```
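
Going by the shorthand rules above and the defaults in `normalizeDuckDB` later in this PR (where **install** and **load** both default to true for an object value), the example appears to be equivalent to the longhand form below. This is a sketch for illustration only; the `config` constant is a hypothetical wrapper that makes the fragment a complete statement, and in a real project the object would be part of the config file's default export.

```js
// Longhand form of {spatial: true, h3: {source: "community"}} (a sketch).
// `config` is a hypothetical name used only to make this fragment standalone.
const config = {
  duckdb: {
    extensions: {
      // `true` expands to: core repository, self-hosted, loaded at startup.
      spatial: {source: "core", install: true, load: true},
      // Only the source is given; install and load fall back to true.
      h3: {source: "community", install: true, load: true}
    }
  }
};
```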

## markdownIt <a href="https://github.com/observablehq/framework/releases/tag/v1.1.0" class="observablehq-version-badge" data-version="^1.1.0" title="Added in v1.1.0"></a>

A hook for registering additional [markdown-it](https://github.com/markdown-it/markdown-it) plugins. For example, to use [markdown-it-footnote](https://github.com/markdown-it/markdown-it-footnote), first install the plugin with either `npm add markdown-it-footnote` or `yarn add markdown-it-footnote`, then register it like so:
18 changes: 18 additions & 0 deletions docs/lib/duckdb.md
@@ -105,3 +105,21 @@ const sql = DuckDBClient.sql({quakes: `https://earthquake.usgs.gov/earthquakes/f
```sql echo
SELECT * FROM quakes ORDER BY updated DESC;
```

## Extensions

DuckDB’s [extensions](../sql#extensions)<a href="https://github.com/observablehq/framework/pull/1734" class="observablehq-version-badge" data-version="prerelease" title="Added in #1734"></a> are supported.

By default, `DuckDBClient.of` and `DuckDBClient.sql` load the extensions referenced in the [configuration](../config#duckdb). If you want a different environment, you can pass options listing the extensions you want to load.

For example, pass an empty array to instantiate a DuckDBClient with no loaded extensions (even if your configuration lists several extensions):

```js echo run=false
const simpledb = DuckDBClient.of({}, {load: []});
```

Or, create a geospatial tagged template literal:

```js echo run=false
const geosql = DuckDBClient.sql({}, {load: ["spatial", "h3"]});
```
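
The way the `load` option overrides the configuration can be sketched as a small decision rule, modeled on the `registerExtensions` function in this PR: when `load` is given, only the listed names are loaded; otherwise each extension's configured default applies. The helper names `shouldLoad` and `decide`, and the sample `configured` object, are hypothetical and exist only for this illustration.

```js
// Hypothetical helper mirroring the rule in registerExtensions:
// an explicit `load` option wins; otherwise use the configured default.
function shouldLoad(name, configuredLoad, loadOption) {
  return loadOption ? loadOption.includes(name) : configuredLoad;
}

// Example per-extension load defaults (made up for this sketch).
const configured = {spatial: true, h3: true, json: false};

// Apply the rule to every configured extension.
const decide = (loadOption) =>
  Object.fromEntries(
    Object.entries(configured).map(([name, l]) => [name, shouldLoad(name, l, loadOption)])
  );

decide(undefined); // {spatial: true, h3: true, json: false}
decide([]); // {spatial: false, h3: false, json: false}
decide(["spatial"]); // {spatial: true, h3: false, json: false}
```

Note that an empty array is truthy in JavaScript, which is why `{load: []}` disables every extension rather than falling back to the configuration.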
44 changes: 44 additions & 0 deletions docs/sql.md
@@ -206,3 +206,47 @@ Inputs.table(await sql([`SELECT * FROM gaia WHERE source_id IN (${[source_ids]})
When interpolating values into SQL queries, be careful to avoid [SQL injection](https://en.wikipedia.org/wiki/SQL_injection) by properly escaping or sanitizing user input. The example above is safe only because `source_ids` are known to be numeric.

</div>

## Extensions <a href="https://github.com/observablehq/framework/pull/1734" class="observablehq-version-badge" data-version="prerelease" title="Added in #1734"></a>

DuckDB has a flexible extension mechanism that allows extensions to be loaded dynamically. Extensions may add support for additional file formats, introduce new types, or provide domain-specific functionality.

Framework can download and host the extensions of your choice. By default, only "json" and "parquet" are self-hosted, but you can add more by specifying them in the [configuration](./config). The self-hosted extensions are served from the `/_duckdb/` directory with a content-hashed URL, ensuring optimal performance and allowing you to work offline and from a server you control.

The self-hosted extensions are immediately available in all the `sql` code blocks and [DuckDBClient](./lib/duckdb) instances. For example, the query below works instantly since the "json" extension is configured:

```sql echo
SELECT bbox FROM read_json('https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.geojson');
```

Likewise, with the “spatial” extension configured, you could directly run:

```sql echo run=false
SELECT ST_Area('POLYGON((0 0, 0 1, 1 1, 1 0, 0 0))'::GEOMETRY) as area;
```

If you use an extension that is not self-hosted, DuckDB falls back to loading it directly from DuckDB’s servers. For example, this documentation does not have the “inet” extension configured for self-hosting.

```sql echo
SELECT '127.0.0.1'::INET AS ipv4, '2001:db8:3c4d::/48'::INET AS ipv6;
```

During development, you can experiment freely with extensions that are not self-hosted. For example, to try out the “h3” `community` extension:

```sql echo run=false
INSTALL h3 FROM community;
LOAD h3;
SELECT format('{:x}', h3_latlng_to_cell(37.77, -122.43, 9)) AS cell_id;
```

<small>(this returns the H3 cell [`892830828a3ffff`](https://h3geo.org/#hex=892830828a3ffff))</small>

For performance and ergonomics, we strongly recommend adding all the extensions you actually use to the [configuration](./config#duckdb).

<div class="tip">

To see which extensions are actually in use on a page, inspect the network tab in your browser, or run the following query: `FROM duckdb_extensions() WHERE loaded;`.

</div>

These features are tied to DuckDB-Wasm 1.29 and depend strongly on its development cycle.
23 changes: 21 additions & 2 deletions src/build.ts
@@ -3,6 +3,7 @@ import {existsSync} from "node:fs";
import {copyFile, readFile, rm, stat, writeFile} from "node:fs/promises";
import {basename, dirname, extname, join} from "node:path/posix";
import type {Config} from "./config.js";
import {getDuckDBManifest} from "./duckdb.js";
import {CliError} from "./error.js";
import {getClientPath, prepareOutput} from "./files.js";
import {findModule, getModuleHash, readJavaScript} from "./javascript/module.js";
@@ -53,7 +54,7 @@ export async function build(
{config}: BuildOptions,
effects: BuildEffects = new FileBuildEffects(config.output, join(config.root, ".observablehq", "cache"))
): Promise<void> {
- const {root, loaders} = config;
+ const {root, loaders, duckdb} = config;
Telemetry.record({event: "build", step: "start"});

// Prepare for build (such as by emptying the existing output root).
@@ -140,6 +141,20 @@ export async function build(
effects.logger.log(cachePath);
}

// Copy over the DuckDB extensions and create the DuckDB manifest.
for (const path of globalImports) {
if (path.startsWith("/_duckdb/")) {
const sourcePath = join(cacheRoot, path);
effects.output.write(`${faint("build")} ${path} ${faint("→")} `);
const contents = await readFile(sourcePath);
const hash = createHash("sha256").update(contents).digest("hex").slice(0, 8);
const alias = applyHash(path, hash);
aliases.set(path, alias);
await effects.writeFile(alias, contents);
}
}
const duckDBManifest = await getDuckDBManifest(duckdb, {root, aliases});

// Generate the client bundles. These are initially generated into the cache
// because we need to rewrite any npm and node imports to be hashed; this is
// handled generally for all global imports below.
@@ -149,6 +164,9 @@ export async function build(
effects.output.write(`${faint("bundle")} ${path} ${faint("→")} `);
const clientPath = getClientPath(path === "/_observablehq/client.js" ? "index.js" : path.slice("/_observablehq/".length)); // prettier-ignore
const define: {[key: string]: string} = {};
if (path === "/_observablehq/stdlib/duckdb.js") {
define["DUCKDB_MANIFEST"] = JSON.stringify(duckDBManifest);
}
const contents = await rollupClient(clientPath, root, path, {minify: true, keepNames: true, define});
await prepareOutput(cachePath);
await writeFile(cachePath, contents);
@@ -204,7 +222,7 @@ export async function build(
// Anything in _observablehq also needs a content hash, but anything in _npm
// or _node does not (because they are already necessarily immutable).
for (const path of globalImports) {
- if (path.endsWith(".js")) continue;
+ if (path.endsWith(".js") || path.startsWith("/_duckdb/")) continue;
const sourcePath = join(cacheRoot, path);
effects.output.write(`${faint("build")} ${path} ${faint("→")} `);
if (path.startsWith("/_observablehq/")) {
@@ -398,6 +416,7 @@ function validateLinks(outputs: Map<string, {resolvers: Resolvers}>): [valid: Li
}

function applyHash(path: string, hash: string): string {
if (path.startsWith("/_duckdb/")) return join("/_duckdb/", `${hash}-${path.slice("/_duckdb/".length)}`);

Member @mbostock, Oct 20, 2024:

First, why do we need special hashing behavior for DuckDB files? And second, if we do, we should have a separate function or inline it instead of putting this into applyHash. The only place we currently call applyHash with a path that starts with /_duckdb/ is on L151, so we can just do the special behavior there. But I’d like to avoid the special behavior unless it’s necessary.

Contributor Author:

We need this because when you run `INSTALL foo FROM host`, DuckDB loads the extension from `${host}/v1.1.1/wasm_{p}/foo.duckdb_extension.wasm` and does not accept anything else.

(If we could give the full path, I would have gone with `${host}/v1.1.1/wasm_{p}/foo.duckdb_extension.${hash}.wasm`, but that doesn't seem to be supported. Maybe I missed something, though.)

Member:

Yep makes sense once I grokked DuckDB’s requirements on the custom repositories for extensions. But I still think we should break out this into a separate function rather than overloading applyHash, and I’m not sure we need content hashes on anything under _duckdb/ assuming that they don’t change the contents of extensions independently of the DuckDB version. If they do, then we’ll need a separate custom repository per extension since they’ll all have different hashes…

Contributor Author:

Yes—currently I don't rely on any constancy from the extensions repository. Even if duckdb's hosted extensions are indeed changed only when the duckdb version changes, it won't necessarily be the case for custom extensions. Which is why each extension gets its own INSTALL FROM command with a specific hashed path.
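
A minimal sketch of the separate helper discussed in this thread, assuming the hash is prepended to the first path segment under `/_duckdb/` so that the rest of the path keeps the layout DuckDB expects. The name `applyDuckDBHash`, the example cache path, and the example hash are all hypothetical; none of them is taken from the PR.

```js
const DUCKDB_PREFIX = "/_duckdb/";

// Hypothetical standalone helper: put the content hash in front of the first
// segment under /_duckdb/, leaving the remainder of the path untouched.
function applyDuckDBHash(path, hash) {
  if (!path.startsWith(DUCKDB_PREFIX)) throw new Error(`not a DuckDB path: ${path}`);
  return `${DUCKDB_PREFIX}${hash}-${path.slice(DUCKDB_PREFIX.length)}`;
}

// Example with a made-up cache path and hash.
applyDuckDBHash("/_duckdb/spatial/v1.1.1/wasm_eh/spatial.duckdb_extension.wasm", "1a2b3c4d");
// "/_duckdb/1a2b3c4d-spatial/v1.1.1/wasm_eh/spatial.duckdb_extension.wasm"
```

Keeping this out of `applyHash` means the generic function handles a single naming scheme, which is the separation the reviewer asked for.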

const ext = extname(path);
let name = basename(path, ext);
if (path.endsWith(".js")) name = name.replace(/(^|\.)_esm$/, ""); // allow hash to replace _esm
50 changes: 37 additions & 13 deletions src/client/stdlib/duckdb.js
@@ -29,16 +29,26 @@ import * as duckdb from "npm:@duckdb/duckdb-wasm";
// ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
// POSSIBILITY OF SUCH DAMAGE.

- const bundle = await duckdb.selectBundle({
- mvp: {
- mainModule: import.meta.resolve("npm:@duckdb/duckdb-wasm/dist/duckdb-mvp.wasm"),
- mainWorker: import.meta.resolve("npm:@duckdb/duckdb-wasm/dist/duckdb-browser-mvp.worker.js")
- },
- eh: {
- mainModule: import.meta.resolve("npm:@duckdb/duckdb-wasm/dist/duckdb-eh.wasm"),
- mainWorker: import.meta.resolve("npm:@duckdb/duckdb-wasm/dist/duckdb-browser-eh.worker.js")
- }
- });
// Baked-in manifest.
// eslint-disable-next-line no-undef
const manifest = DUCKDB_MANIFEST;

const candidates = {
...(manifest.bundles.includes("mvp") && {
mvp: {
mainModule: import.meta.resolve("npm:@duckdb/duckdb-wasm/dist/duckdb-mvp.wasm"),
mainWorker: import.meta.resolve("npm:@duckdb/duckdb-wasm/dist/duckdb-browser-mvp.worker.js")
}
}),
...(manifest.bundles.includes("eh") && {
eh: {
mainModule: import.meta.resolve("npm:@duckdb/duckdb-wasm/dist/duckdb-eh.wasm"),
mainWorker: import.meta.resolve("npm:@duckdb/duckdb-wasm/dist/duckdb-browser-eh.worker.js")
}
})
};
const bundle = await duckdb.selectBundle(candidates);
const activePlatform = manifest.bundles.find((key) => bundle.mainModule === candidates[key].mainModule);
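
The `candidates` object relies on spreading a conditional expression: spreading `false` into an object literal adds no properties, so a bundle is listed only when the manifest names it. A small standalone illustration of that pattern follows; the function name `pickCandidates` and the file names are placeholders for this sketch, not the resolved URLs used by the real module.

```js
// Build an object that contains only the bundles enabled in `bundles`.
// File names are placeholders for illustration.
function pickCandidates(bundles) {
  return {
    // `false && {...}` evaluates to false, and spreading false adds nothing.
    ...(bundles.includes("mvp") && {mvp: {mainModule: "duckdb-mvp.wasm"}}),
    ...(bundles.includes("eh") && {eh: {mainModule: "duckdb-eh.wasm"}})
  };
}

Object.keys(pickCandidates(["eh"])); // ["eh"]
Object.keys(pickCandidates(["mvp", "eh"])); // ["mvp", "eh"]
```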

const logger = new duckdb.ConsoleLogger(duckdb.LogLevel.WARNING);

@@ -169,6 +179,7 @@ export class DuckDBClient {
config = {...config, query: {...config.query, castBigIntToDouble: true}};
}
await db.open(config);
await registerExtensions(db, config);
await Promise.all(Object.entries(sources).map(([name, source]) => insertSource(db, name, source)));
return new DuckDBClient(db);
}
@@ -178,9 +189,22 @@ export class DuckDBClient {
}
}

- Object.defineProperty(DuckDBClient.prototype, "dialect", {
- value: "duckdb"
- });
+ Object.defineProperty(DuckDBClient.prototype, "dialect", {value: "duckdb"});

Member:

The DatabaseClient.dialect isn’t used for anything in Framework. I think it’s used in notebooks for ejecting from a SQL cell, but there’s no analogous concept in Framework so maybe we should consider dropping it. Not directly related to this PR though!


async function registerExtensions(db, {load}) {
const connection = await db.connect();
try {
await Promise.all(
manifest.extensions.map(([name, {[activePlatform]: ref, load: l}]) =>
connection
.query(`INSTALL "${name}" FROM '${ref.startsWith("https://") ? ref : import.meta.resolve(`../..${ref}`)}'`)
.then(() => (load ? load.includes(name) : l) && connection.query(`LOAD "${name}"`))
)
);
} finally {
await connection.close();
}
}
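
To make the `INSTALL` statement built above easier to follow, here is a sketch that isolates the string construction. It substitutes `new URL(relative, base)` for `import.meta.resolve`, which is an approximation that ignores import maps; the helper name `installStatement`, the `moduleUrl` parameter, and the example URLs are assumptions made for this illustration.

```js
// Hypothetical helper: build the INSTALL statement for one extension.
// `moduleUrl` stands in for the URL of the client module (an assumption).
function installStatement(name, ref, moduleUrl) {
  // Absolute https references are used as-is; self-hosted references are
  // resolved two directories up from the module, i.e. against the site root.
  const from = ref.startsWith("https://") ? ref : new URL(`../..${ref}`, moduleUrl).href;
  return `INSTALL "${name}" FROM '${from}'`;
}

const moduleUrl = "https://example.com/_observablehq/stdlib/duckdb.js";

installStatement("spatial", "/_duckdb/1a2b3c4d-spatial", moduleUrl);
// INSTALL "spatial" FROM 'https://example.com/_duckdb/1a2b3c4d-spatial'

installStatement("h3", "https://community-extensions.duckdb.org", moduleUrl);
// INSTALL "h3" FROM 'https://community-extensions.duckdb.org'
```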

async function insertSource(database, name, source) {
source = await source;
44 changes: 43 additions & 1 deletion src/config.ts
@@ -8,6 +8,7 @@ import {pathToFileURL} from "node:url";
import he from "he";
import type MarkdownIt from "markdown-it";
import wrapAnsi from "wrap-ansi";
import {DUCKDBBUNDLES} from "./duckdb.js";
import {visitFiles} from "./files.js";
import {formatIsoDate, formatLocaleDate} from "./format.js";
import type {FrontMatter} from "./frontMatter.js";
@@ -76,6 +77,11 @@ export interface SearchConfigSpec {
index?: unknown;
}

export interface DuckDBConfig {
bundles: string[];
extensions: {[name: string]: {install?: false; load: boolean; source: string}};
}

export interface Config {
root: string; // defaults to src
output: string; // defaults to dist
@@ -98,6 +104,7 @@ export interface Config {
normalizePath: (path: string) => string;
loaders: LoaderResolver;
watchPath?: string;
duckdb: DuckDBConfig;
}

export interface ConfigSpec {
@@ -127,6 +134,7 @@ export interface ConfigSpec {
preserveIndex?: unknown;
preserveExtension?: unknown;
markdownIt?: unknown;
duckdb?: unknown;
}

interface ScriptSpec {
@@ -262,6 +270,7 @@ export function normalizeConfig(spec: ConfigSpec = {}, defaultRoot?: string, wat
const search = spec.search == null || spec.search === false ? null : normalizeSearch(spec.search as any);
const interpreters = normalizeInterpreters(spec.interpreters as any);
const normalizePath = getPathNormalizer(spec);
const duckdb = normalizeDuckDB(spec.duckdb);

// If this path ends with a slash, then add an implicit /index to the
// end of the path. Otherwise, remove the .html extension (we use clean
@@ -312,7 +321,8 @@ export function normalizeConfig(spec: ConfigSpec = {}, defaultRoot?: string, wat
md,
normalizePath,
loaders: new LoaderResolver({root, interpreters}),
- watchPath
+ watchPath,
+ duckdb
};
if (pages === undefined) Object.defineProperty(config, "pages", {get: () => readPages(root, md)});
if (sidebar === undefined) Object.defineProperty(config, "sidebar", {get: () => config.pages.length > 0});
@@ -499,3 +509,35 @@ export function mergeStyle(
export function stringOrNull(spec: unknown): string | null {
return spec == null || spec === false ? null : String(spec);
}

function duckDBExtensionSource(source?: string): string {
return source === undefined || source === "core"
? "https://extensions.duckdb.org"
: source === "community"
? "https://community-extensions.duckdb.org"
: (source = String(source)).startsWith("https://")
? source
: (() => {
throw new Error(`unsupported DuckDB extension source ${source}`);
})();
}

function normalizeDuckDB(spec: unknown): DuckDBConfig {
const extensions: {[name: string]: any} = {};
for (const [name, config] of Object.entries(spec?.["extensions"] ?? {json: {load: false}, parquet: {load: false}})) {
if (!/^\w+$/.test(name)) throw new Error(`illegal extension name ${name}`);
if (config != null) {
extensions[name] =
config === true
? {load: true, install: true, source: duckDBExtensionSource()}
: config === false
? {load: false, install: false, source: duckDBExtensionSource()}
: {
source: duckDBExtensionSource(config["source"]),
install: Boolean(config["install"] ?? true),
load: Boolean(config["load"] ?? true)
};
}
}
return {bundles: DUCKDBBUNDLES, extensions};
}
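
For experimenting with the normalization rules outside the build, the two functions above can be ported to plain JavaScript roughly as follows. This is a sketch under stated assumptions: the names `resolveSource` and `normalizeExtension` are hypothetical, the port handles a single extension value rather than the whole `duckdb` option, and it omits the name validation and the `bundles` field.

```js
// Port of the source resolution: "core" and "community" map to DuckDB's
// repositories, any other value must be an https URL.
function resolveSource(source) {
  if (source === undefined || source === "core") return "https://extensions.duckdb.org";
  if (source === "community") return "https://community-extensions.duckdb.org";
  source = String(source);
  if (source.startsWith("https://")) return source;
  throw new Error(`unsupported DuckDB extension source ${source}`);
}

// Port of the per-extension normalization: accepts the true/false shorthand
// or an object; install and load default to true for objects.
function normalizeExtension(config) {
  if (config === true) return {source: resolveSource(), install: true, load: true};
  if (config === false) return {source: resolveSource(), install: false, load: false};
  return {
    source: resolveSource(config.source),
    install: Boolean(config.install ?? true),
    load: Boolean(config.load ?? true)
  };
}

normalizeExtension({source: "community"});
// {source: "https://community-extensions.duckdb.org", install: true, load: true}
normalizeExtension({load: false});
// {source: "https://extensions.duckdb.org", install: true, load: false}
```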