duckdb 1.29.0; self-host extensions #1734
About the package size

Along with the new features, this release is much heavier: the base files have doubled in size, and with the addition of extensions the binaries now take 153M of disk space on the server.

Fortunately this is not what the user has to download. First, depending on the browser used, they will only download either the "mvp" version of the wasm files (older browsers) or the "eh" version, which is slightly more performant. Second, they will not load all extensions (and only "spatial" is quite big). Third, the wasm files are gzipped when transmitted to the browser.

But it still doubles the (compressed) size of the base files from 4MB to ~8MB (depending on the extensions needed… here I compare 1.28.0 with 1.29.0+parquet). Is there a case to be made for users who would prefer to stay with 1.28 because of that? I would prefer not to, since it would add a lot of complexity.

About self-hosting extensions

A key feature of Framework is self-hosting. I didn't want to support 1.29 without self-hosting at least the extensions that used to be part of the monolithic 1.28 ("parquet", "json").

The status of extensions is however still a bit unclear to me. Some of the core extensions are built-in (such as httpfs). It's unclear how that list will change in the future (httpfs changed status during development, I think). So instead of linking to

Moreover, only the core extensions are self-hosted for now, and we might want a path also for people who want to self-host community extensions (such as "h3") or custom extensions.

Maybe self-hosting "all the core extensions" is too much, and we could have a smaller list of extensions we self-host. However, judging from the sizes of extensions, "spatial" dwarfs all the others, so it might not make sense to try to optimize disk space if we keep "spatial".

Another option would be to make a configurable list of self-hosted extensions (including community and custom extensions). We would then have to pass that list to
Rather than overloading the meaning of https://extensions.duckdb.org/ and npm:extensions.duckdb.org, we’d probably want a duckdb: protocol for specifying extensions, and to put them in _duckdb parallel to _npm. But that’s quite a bit of machinery to support DuckDB extensions…
src/client/stdlib/duckdb.js
Outdated
  "tpch",
  "vss"
]
  .map((ext) => `INSTALL ${ext} FROM '${repo}';`)
How do these paths get content-hashed and/or versioned (for immutable caching)?
I believe they are versioned by the 1.1.1 in their path, but I'm not sure what happens server-side in duckdb-land. I'm talking with @carlopi to understand this better.
An alternative approach could be to publish a package on jsr or npm with the extensions we want to self-host.
I guess my inclination is to have users explicitly list which DuckDB extensions they want, and where they come from. And then Framework can download them for self-hosting. So maybe in the config you would say something like:

export default {
  duckdb: {
    extensions: {
      json: "https://extensions.duckdb.org/v1.1.1/wasm_eh/json.duckdb_extension.wasm",
      parquet: "https://extensions.duckdb.org/v1.1.1/wasm_eh/parquet.duckdb_extension.wasm"
    }
  }
};

If we wanted to have shorthand, we could also allow something like:

export default {
  duckdb: {
    extensions: {
      json: true,
      parquet: true
    }
  }
};

Or even shorter:

export default {
  duckdb: {
    extensions: ["json", "parquet"]
  }
};

So Framework would self-host the specified files. And internally we’d have some resolution magic so that
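The three config shapes proposed above could be reduced to one internal form before any downloading happens. The following is only a sketch of that normalization step: the names `ExtensionSpec`, `ResolvedExtension`, `CORE_REPO`, and `normalizeExtensions` are hypothetical and not part of Framework, and using https://extensions.duckdb.org as the default source for the shorthand forms is an assumption based on the URLs quoted in the comment.

```typescript
// Hypothetical types for illustration only; not Framework's actual API.
type ExtensionSpec = string[] | Record<string, string | boolean>;

interface ResolvedExtension {
  name: string;
  source: string; // repository or full URL the extension is fetched from
}

// Assumption: shorthand entries resolve to the core repository seen in the URLs above.
const CORE_REPO = "https://extensions.duckdb.org";

function normalizeExtensions(spec: ExtensionSpec): ResolvedExtension[] {
  // Turn the array shorthand ["json", "parquet"] into [name, true] pairs.
  const entries: [string, string | boolean][] = Array.isArray(spec)
    ? spec.map((name): [string, string | boolean] => [name, true])
    : Object.entries(spec);
  const resolved: ResolvedExtension[] = [];
  for (const [name, value] of entries) {
    if (value === false) continue; // explicitly disabled
    // Names end up in SQL and file paths, so restrict them to word characters.
    if (!/^\w+$/.test(name)) throw new Error(`invalid extension name: ${name}`);
    resolved.push({name, source: value === true ? CORE_REPO : value});
  }
  return resolved;
}
```

With this shape, `["json", "parquet"]` and `{json: true, parquet: true}` produce the same result, while an explicit string keeps whatever source the user wrote.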
About the Currently if several sql blocks use spatial functions (for example), you have to remember to type To avoid this issue we could maybe hoist any Or, maybe simpler, we could add a top-level config in front-matter. Something like:
or keep the
(we could also make it possible to reference an Excel or Shapefile dataset in front-matter, since |
@Fil In my previous comment I meant that it could be specified in the project config. But we could also let it be specified in the page front matter, overriding the project config if different pages want different extensions.
The config option would indicate which extensions are self-hosted and where they're sourced from. Thus, they would be

For many core extensions this is happening implicitly, when duckdb recognizes that one of the functions or file formats used belongs to a given extension (the extension is then said to be “auto-loaded”). The documentation in

Currently, when an extension needs to be loaded explicitly, it has to be mentioned in every sql code block, because their order of execution is not guaranteed. That's a bit too much, and I think the correct level to define these

I hadn't thought about loading all the configured/self-hosted extensions on all the pages, thinking that it should depend on what the page needs (e.g., for better performance on pages that don't need "spatial"). But I reckon this would make it easier to use, and maybe I'm overcomplicating things for the sake of the hypothetical project that might need an extension on a given page and not on another one. Maybe we should opt for simplicity. (I'll play with the various possibilities to see how it feels.)
Right, so the config could say whether to load the extension explicitly or to let it autoload if desired. But in either case the installing (and optionally loading) of any desired extensions would happen prior to the

Having equivalent extension registration for the front matter as for the project config makes sense.
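To make the idea in this exchange concrete, here is a rough sketch of how page front matter might override the project config, and how the merged result could be turned into statements that run once before any sql block. Everything named here (`ExtensionConfig`, `mergeExtensions`, `extensionStatements`, and the `load` flag) is hypothetical; only the `INSTALL … FROM '…'` form (quoted from the diff above) and DuckDB's `LOAD` statement are taken as given.

```typescript
// Hypothetical per-extension settings; the `load` flag mirrors the
// "load explicitly vs. let it autoload" choice discussed above.
interface ExtensionConfig {
  source: string; // repository to install from
  load: boolean; // true: emit LOAD; false: rely on autoloading
}

type Extensions = Record<string, ExtensionConfig>;

// Page front matter wins over the project config, entry by entry.
function mergeExtensions(project: Extensions, page: Extensions = {}): Extensions {
  return {...project, ...page};
}

// Quote a value as a SQL string literal, doubling embedded single quotes.
function quote(value: string): string {
  return `'${value.replace(/'/g, "''")}'`;
}

// Build the statements to run once, before any sql block executes.
function extensionStatements(extensions: Extensions): string[] {
  const statements: string[] = [];
  for (const [name, {source, load}] of Object.entries(extensions)) {
    if (!/^\w+$/.test(name)) throw new Error(`invalid extension name: ${name}`);
    statements.push(`INSTALL ${name} FROM ${quote(source)};`);
    if (load) statements.push(`LOAD ${name};`);
  }
  return statements;
}
```

Running these statements up front, rather than inside each sql block, is what would remove the dependency on block execution order.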
(not quite there yet: still need to do hashing, per-extension configuration of the LOAD command, and per page configuration)
docs/lib/duckdb.md
Outdated
```sql echo run=false
INSTALL json FROM core;
-- use JSON features
INSTALL custom FROM 'https://example.com/';
I think we should discourage people from installing extensions from within SQL blocks: doing so globally changes the behavior of the DuckDBClient instance and can lead to race conditions/nondeterministic behavior across blocks, and also because we want to favor self-hosting of extensions rather than hotlinking to an external website.
The recommended way to install extensions should be via the front matter or the project config (or to do it in JavaScript by redefining the sql literal and awaiting the loading of the extensions).
Getting closer. TODO:
src/libraries.ts
Outdated
if (!duckdb) throw new Error("Implementation error: missing duckdb configuration");
for (const [name, {source}] of Object.entries(duckdb.extensions)) {
  for (const platform of duckdb.bundles) {
    implicits.add(`duckdb:${platform},${name},${source}`);
Any reason not to follow the same convention that DuckDB does here for custom repositories?
I guess one answer is that name and platform should never contain a comma, making it easier to split, but there’s no guarantee that source doesn’t contain a comma, so it’s not safe to path.split(",") and expect to get all the parts back out again. I’m going to tweak this a bit to match the DuckDB convention and make it more robust.
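As an illustration of why a path-shaped specifier is easier to take apart than a comma-separated one, the sketch below builds and parses URLs with the layout visible in the config example earlier in this thread (`…/v1.1.1/wasm_eh/json.duckdb_extension.wasm`). It assumes that layout is the convention being referred to; `ExtensionRef`, `extensionUrl`, and `parseExtensionUrl` are hypothetical names, not the code that was eventually committed.

```typescript
// Hypothetical description of one extension file, for illustration only.
interface ExtensionRef {
  repo: string; // e.g. "https://extensions.duckdb.org"; may contain commas
  version: string; // e.g. "1.1.1"
  platform: string; // e.g. "wasm_eh", as seen in the URLs above
  name: string; // e.g. "json"
}

// Build a URL following the layout {repo}/v{version}/{platform}/{name}.duckdb_extension.wasm.
function extensionUrl({repo, version, platform, name}: ExtensionRef): string {
  const base = repo.replace(/\/+$/, ""); // drop trailing slashes
  return `${base}/v${version}/${platform}/${name}.duckdb_extension.wasm`;
}

// Recover the parts. The last three path segments are fixed-shape, so the
// repository can contain any characters (including commas) without ambiguity.
function parseExtensionUrl(url: string): ExtensionRef {
  const match = /^(.*)\/v([^/]+)\/([^/]+)\/([^/]+)\.duckdb_extension\.wasm$/.exec(url);
  if (!match) throw new Error(`not an extension URL: ${url}`);
  const [, repo = "", version = "", platform = "", name = ""] = match;
  return {repo, version, platform, name};
}
```

Because parsing anchors on the end of the path rather than on a separator character, a source such as `https://example.com/a,b` round-trips intact.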
src/config.ts
Outdated
@@ -499,3 +521,41 @@ export function mergeStyle(
export function stringOrNull(spec: unknown): string | null {
  return spec == null || spec === false ? null : String(spec);
}

// TODO convert array of names
TODO Remove this to-do.
How do we use/enable the extensions support? Do we need to wait for an official release, or can we install the prerelease version?
It's possible but difficult; my recommendation is to wait (a few days max) for the next release of Framework.
🎉 1.29.0! new version of DuckDB-wasm 🎉
https://github.com/duckdb/duckdb-wasm/releases/tag/v1.29.0
The repo had 296 commits since the last stable release a year ago. This is not including the commits on the linked DuckDB itself, which is now in version 1.1.1.
See https://duckdb.org/2024/09/09/announcing-duckdb-110 for the new features and changes in DuckDB. For example, the nice HISTOGRAM() function:
The most notable new feature in duckdb-wasm is the support for extensions, in particular the "spatial" extension which includes the whole of GDAL, enabling geographic compute (projections, areas, etc), and introducing compatibility with dozens of new formats (shapefiles, excel sheets, etc.).
Other extensions: autocomplete, fts, icu, inet, json, parquet, spatial, sqlite_scanner, substrait, tpcds, tpch, vss.
related: