wasm RFC #541
Comments
Keen to hear your thoughts @jonfk |
A couple of remarks. If I recall correctly, the smallest wasm tantivy build I managed to get so far was 900KB, and that involved a lot of cheating & trimming. Some of this work will affect external crates, like the stemming crate. I haven't followed the state of the standardization of "WASI", but if we get an Mmap there, it might be a very nice use case outside of browser development. This could be a very nice way to distribute tantivy-cli on any platform, in a very safe way. The people from wasmer were kind enough to actually link I have no idea how syscalls work, so it would be nice to dig a little on that. |
According to this doc Mmap is not likely to be part of the standardized WASI any time soon. That's a bummer. On tantivy side, that will mean some heavy changes on the way we do io if we want to get compatible eventually. |
Thanks for your notes. Below are my thoughts on the 3 specific topics.
size
I have grepped for
mmap
I am going off the wasm spec, rather than WASI. According to the wasm design doc, mmap support is in the future ideas section. I am guessing the rust-wasm team will wait for it to stabilise and hopefully provide a wrapper in the mmap crate. Either way, you make an important point. For the foreseeable future, mmap calls won't be supported by wasm. This means we will need to change our internals and keep them compatible with the same index format across different platforms. If and when mmap support is added, we will need to add conditional compilation flags inside the core library to allow mmap on both wasm and server-side.
distribution as a wasi binary for other languages to use
I didn't think of it. My original suggestion was to build tantivy wasm for browsers. However, you are right to draw attention to the potential of wasmer and other wasm runtimes that can help us. They will provide the wrappers and integrations of wasm binaries with CPython and Java. This will save us time, while enabling more developers to embed tantivy across different applications.
Overall
I prefer to under-promise and over-deliver rather than vice versa. If wasm is hard to manage and will make it hard to work on server-side features, I would not accept this. In my opinion, promising wasm support now and ignoring or removing it later seems more damaging to tantivy than not promising it at all. I am still curious to get more thoughts on the pain points/costs/disadvantages of adding wasm as a first-class target. It feels like a big commitment for development, toolchain inclusion and CI/ops provisioning. |
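To make the "conditional compilation flags" idea concrete, here is a minimal sketch of how a storage backend could be gated on the target. The names `ByteSource`, `RamSource` and `FileSource` are hypothetical and are not tantivy APIs; the file-backed variant uses plain reads where a real native build would presumably use mmap.

```rust
use std::io;

/// Hypothetical abstraction over "where index bytes come from".
pub trait ByteSource {
    fn read_range(&self, offset: usize, len: usize) -> io::Result<Vec<u8>>;
}

/// Available on every target: bytes held entirely in memory.
pub struct RamSource {
    data: Vec<u8>,
}

impl RamSource {
    pub fn new(data: Vec<u8>) -> Self {
        Self { data }
    }
}

impl ByteSource for RamSource {
    fn read_range(&self, offset: usize, len: usize) -> io::Result<Vec<u8>> {
        // Reject ranges that overflow or run past the end of the buffer.
        let end = offset
            .checked_add(len)
            .filter(|&e| e <= self.data.len())
            .ok_or_else(|| io::Error::new(io::ErrorKind::UnexpectedEof, "range out of bounds"))?;
        Ok(self.data[offset..end].to_vec())
    }
}

/// Compiled only when the target is NOT wasm32. A real build might mmap here.
#[cfg(not(target_arch = "wasm32"))]
pub struct FileSource {
    path: std::path::PathBuf,
}

#[cfg(not(target_arch = "wasm32"))]
impl FileSource {
    pub fn new(path: std::path::PathBuf) -> Self {
        Self { path }
    }
}

#[cfg(not(target_arch = "wasm32"))]
impl ByteSource for FileSource {
    fn read_range(&self, offset: usize, len: usize) -> io::Result<Vec<u8>> {
        use std::io::{Read, Seek, SeekFrom};
        let mut file = std::fs::File::open(&self.path)?;
        file.seek(SeekFrom::Start(offset as u64))?;
        let mut buf = vec![0u8; len];
        file.read_exact(&mut buf)?;
        Ok(buf)
    }
}
```

Code written against the trait compiles unchanged on both targets; only the set of available implementations differs.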
This would also make it easy to use tantivy in WebExtensions for Firefox/Chrome or MailExtensions in Thunderbird. |
Is this RFC about having a .wasm binary for this crate, or having a WASM target-compatible library that serves as a dependency of other libraries? If the latter, there is a very simple guide in the WebAssembly Book that might help. |
I found that this crate can be built for the WASM target with the wasm-bindgen feature enabled and default features (mmap) disabled. (Probably intended.) So I think it could be used like this:
[dependencies]
tantivy = { version = "0.11.3", default-features = false, features = ["wasm-bindgen"] }
But I cannot be sure, since the current test suite relies on mmap, and so do many of tantivy's APIs. |
did not look deeper, but could this mmap in wasi-lib help: |
@urbien This part is probably more relevant. So the WASI-libc does add support for mmap, but it will read and load the entire file in anonymous memory. |
I came here looking for a static web site search engine. Think Wikipedia-size data with all the files pre-processed and search indexes laid down on disk for static serving. A WASM module in the browser would read the inverted index files as needed to execute the search at hand. I'm not familiar with the layout of your index files, but I wouldn't want each individual file to be too big or too small, and a single search shouldn't need too many different files to complete, since each "seek" could be another network fetch. A website archived to content-addressable storage that wants to include search would need everything pre-computed at build and upload time. Keeping the content unchangeable and free of any backend infrastructure enables reliable sneaker-net movement of these archives and usage across air-gapped networks. |
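As a rough illustration of why "each seek could be another network fetch", the sketch below maps a byte range inside an index file onto fixed-size chunks, assuming the static host serves the index split into equally sized pieces. The function name and the chunking scheme are assumptions for illustration, not anything tantivy does.

```rust
/// Returns the inclusive (first, last) chunk indices touched by a read of
/// `len` bytes starting at `offset`, or `None` for an empty read.
pub fn chunks_for_range(offset: u64, len: u64, chunk_size: u64) -> Option<(u64, u64)> {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    if len == 0 {
        return None;
    }
    let first = offset / chunk_size;
    let last = (offset + len - 1) / chunk_size;
    Some((first, last))
}

/// Number of network fetches a read would cost if nothing is cached.
pub fn fetch_count(offset: u64, len: u64, chunk_size: u64) -> u64 {
    match chunks_for_range(offset, len, chunk_size) {
        Some((first, last)) => last - first + 1,
        None => 0,
    }
}
```

A two-byte read that straddles a chunk boundary costs two fetches, which is the trade-off between "too big" and "too small" files described above.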
@ngbrown Do you have a specific use case? |
One very specific example, though it's not the first use of a static search that I've thought of, would be the need for an air-gapped Wikipedia, because some countries block it now. IPFS has a project that provides this (https://blog.ipfs.io/24-uncensorable-wikipedia/) along with several methods to transport the snapshot across the network block. One option is pinning on nodes that have connectivity to both partitions of the network, and the other is to copy an entire snapshot package to a USB drive (https://dweb-primer.ipfs.io/avenues-for-access/sneakernets). If you check out their snapshot, you'll see it has no search. Dealing with Wikipedia without search is less than ideal. As far as I know, there's no working solution for static search on sites this big, because the indexes would be very large to download just for one specific user's needs. JavaScript and WASM are allowed as part of this snapshot, so that's why I thought a full search engine like Tantivy could be leveraged. |
@ngbrown searching static website is common, but they are usually small enough that javascript libraries do ok. I have compiled tantivy to WASM before, but the resulting file is around 1MB at least. (more with stemming), so it is not really worth doing it for a simple use case. Wikipedia on a USB key is interesting. For different reasons, current version of tantivy is not great for this use case however... |
I've used the in-memory JavaScript search solutions for the smaller use cases (think blog or help documentation). So they do work, but I don't know of one that partitions the indexes and only loads segments on demand for the really big cases. I really don't think a 1MB WASM file would be a big deal compared to the size of the indexes for a big site like Wikipedia. Microsoft is downloading multiples of that to get .NET running in WASM. Do you have any more information of this alternative? The good news is that this use case doesn't need backwards compatible files. They would just get re-built each time, for each version. |
I think this blocks matrix-org/seshat#84 which is useful in element matrix web client. Is it possible to remove non wasm compatible parts (I see mmap in this issue) via some cfg? |
@phiresky That's awesome. I'll check out your proof of concept, thanks for sharing it. With Tantivy's use of mmap for storing the index, what does memory usage look like to the end user in your proof of concept (i.e. in Activity Monitor)? (Warning: I'm not well-versed in mmap'd files, so there may be incorrect assumptions in the following question...) Is there a way to tell the OS to limit resident memory usage (RSS) with an mmap'd file? I want to use Tantivy via WASM for a large client-side index, but I'm concerned the end user will perceive a bunch of memory usage (i.e. in Activity Monitor) when, in practice, the OS should manage the mmap'd file's resident memory usage as it pleases. |
In my POC I replace memory mapping with a manual "page cache". Basically a replacement for what the OS does when memory mapping. With memory mapping, the OS chooses how much of the file to keep in memory based on how much RAM you have, and automatically evicts it when other stuff needs the space. We can't really do this automatic choosing in the browser, since we can't know what other programs on the computer need. So you can basically choose whatever memory usage you want. You actually have to in my POC, since otherwise it will cache everything forever. So it would probably need an LRU system with a fixed limit on memory usage. Depending on your needs you might want to cache e.g. the whole In my tests, normal queries fetch around 1-10MB of data, which is then the same as the memory usage (except for internal data structures, but those shouldn't be very large). So it really shouldn't be much more than a normal website. |
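The "LRU system with a fixed limit" mentioned above could look roughly like the sketch below. This is not the POC's code: `PageCache` and its methods are hypothetical names, and the limit is expressed as a number of pages (so the memory bound is the capacity times the page size).

```rust
use std::collections::{HashMap, VecDeque};

/// A tiny LRU page cache with a fixed page-count limit.
pub struct PageCache {
    capacity: usize,              // maximum number of pages kept in memory
    pages: HashMap<u64, Vec<u8>>, // page index -> page bytes
    order: VecDeque<u64>,         // front = least recently used
}

impl PageCache {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self { capacity, pages: HashMap::new(), order: VecDeque::new() }
    }

    /// Marks `page` as the most recently used entry.
    fn touch(&mut self, page: u64) {
        if let Some(pos) = self.order.iter().position(|&p| p == page) {
            self.order.remove(pos);
        }
        self.order.push_back(page);
    }

    /// Returns the cached page, calling `fetch` (e.g. a network request)
    /// only on a miss and evicting the least recently used page if full.
    pub fn get_or_fetch<F>(&mut self, page: u64, fetch: F) -> &[u8]
    where
        F: FnOnce(u64) -> Vec<u8>,
    {
        if !self.pages.contains_key(&page) {
            if self.pages.len() >= self.capacity {
                if let Some(evicted) = self.order.pop_front() {
                    self.pages.remove(&evicted);
                }
            }
            self.pages.insert(page, fetch(page));
        }
        self.touch(page);
        &self.pages[&page]
    }

    pub fn contains(&self, page: u64) -> bool {
        self.pages.contains_key(&page)
    }
}
```

The linear scan in `touch` keeps the sketch short; a production cache would likely use a linked structure for constant-time updates.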
Thanks for providing Tantivy and keeping this ticket open. Even though I know it's not supported, I tried and seemed to get close, but can't get Tantivy working in WASM even though it compiles. I get a panic when actually calling Tantivy code in a WASM context. I know this isn't supported yet, so I'm not surprised it doesn't work, but if anybody has any quick pointers on hacks/workarounds the code is below. My strategy was:
I think my WASM/Rust/JS is right. I think my Tantivy code is right. I told Tantivy I only care about an in-memory index. But I still can't get the combination of technologies working. The update from 0.14 here makes it seem like the underlying data storage Tantivy uses is getting closer to some form of in-memory + WASM capability, but I'm not sure if it can work in any WASM environment yet (even if limited?). Is the underlying issue with WASM the data storage aspect? For me, I want the Lucene-like capabilities, stemming, etc. but am okay with an in-memory index.
The panic, as close as I can get!
Cargo.toml
[package]
# search client!
name = "sc"
version = "0.1.0"
edition = "2021"
default-run = "sc"
[lib]
crate-type = ["cdylib"]
[dependencies]
anyhow = "1.0"
fake = { version = "2.6.1", features=['derive'] }
num-format = "0.4.4"
rand = "0.8.5" # required by fake
tantivy = { version = "0.19.2", default-features = false }
uuid = "1.3.3"
wasm-bindgen = "0.2.86"
# needed for WASM
getrandom = { version = "0.2.2", features = ["js"] }
web-sys = { version = "0.3.63", features = ['console'] }
[target.'cfg(target_arch = "wasm32")'.dependencies]
console_error_panic_hook = "0.1.6"
src/lib.rs
use wasm_bindgen::prelude::*;
mod logging;
mod search_index;
mod test_data;
#[wasm_bindgen]
extern "C" {
    fn get_js_name() -> JsValue;
}
#[wasm_bindgen]
pub fn get_rust_name() -> String {
    "BenRs".to_string()
}
#[wasm_bindgen(start)]
pub fn start() {
    let js_name = get_js_name();
    let js_name_string = js_name.as_string().unwrap();
    println!("Hello {}, you wild alien from JavaScript!", js_name_string);
    println!("test search");
    let goofy_string = search_index::test_two();
    search_index::test();
    println!("test search: done, string is {}", goofy_string);
}
src/test_data/mod.rs
use fake::Fake;
// using `faker` module with locales
use fake::faker::name::raw::*;
use fake::locales::*;
pub fn fake_names(count: usize) -> Vec<String> {
    (0..count).map(|_| Name(EN).fake()).collect()
}
src/search_index/mod.rs
use std::time::Instant;
use num_format::{Locale, ToFormattedString};
use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::*;
use tantivy::{doc, Index, ReloadPolicy};
use uuid::Uuid;
use crate::test_data::fake_names;
const NUM_TEST_RECORDS: usize = 100_000;
// const NUM_TEST_RECORDS: usize = 1_000_000;
// const NUM_TEST_RECORDS: usize = 1;
const TEST_QUERY: &'static str = "standefer";
const DEBUG: bool = false;
#[derive(Debug)]
struct TestDoc {
    uuid: Uuid,
    name: String,
}
pub fn test_two() -> String {
    println!("Geting a string from within search_index module");
    "hi from search_index!".to_string()
}
pub fn test() -> tantivy::Result<()> {
    let mut now: Instant;
    // Define schema
    let mut schema_builder = Schema::builder();
    schema_builder.add_u64_field("uuid_hi", STORED);
    schema_builder.add_u64_field("uuid_lo", STORED);
    schema_builder.add_text_field("name", TEXT | STORED);
    let schema = schema_builder.build();
    // Create the index, this will create meta.json in the directory
    let index = Index::create_in_ram(schema.clone());
    // Get fake data
    now = Instant::now();
    let mut names = fake_names(NUM_TEST_RECORDS);
    names.push("Ben Standefer".to_string());
    let docs: Vec<TestDoc> = names.into_iter().map(|name| {
        TestDoc {
            uuid: Uuid::new_v4(),
            name: name.to_string(),
        }
    }).collect();
    if DEBUG {
        for doc in &docs {
            println!("{:?}", doc);
        }
    }
    println!("Generating data took: {:?} ({} records)", now.elapsed(), NUM_TEST_RECORDS.to_formatted_string(&Locale::en));
    // Write
    now = Instant::now();
    let mut index_writer = index.writer(50_000_000)?;
    let uuid_hi_field = schema.get_field("uuid_hi").unwrap();
    let uuid_lo_field = schema.get_field("uuid_lo").unwrap();
    let name_field = schema.get_field("name").unwrap();
    for doc in &docs {
        let (uuid_hi, uuid_lo) = doc.uuid.as_u64_pair();
        index_writer.add_document(doc!(
            uuid_hi_field => uuid_hi,
            uuid_lo_field => uuid_lo,
            name_field => doc.name.to_string(),
        ))?;
    }
    index_writer.commit()?;
    println!("Indexing data took: {:?}", now.elapsed());
    // Search
    now = Instant::now();
    let reader = index
        .reader_builder()
        .reload_policy(ReloadPolicy::OnCommit)
        .try_into()?;
    let searcher = reader.searcher();
    let query_parser = QueryParser::for_index(&index, vec![name_field]);
    let query = query_parser.parse_query(TEST_QUERY)?;
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
    for (_score, doc_address) in top_docs {
        let retrieved_doc = searcher.doc(doc_address)?;
        println!("{}", schema.to_json(&retrieved_doc));
        println!("{}", Uuid::from_u64_pair(
            retrieved_doc.get_first(uuid_hi_field).unwrap().as_u64().unwrap(),
            retrieved_doc.get_first(uuid_lo_field).unwrap().as_u64().unwrap(),
        ));
    }
    println!("Search took: {:?}", now.elapsed());
    Ok(())
}
src/logging.rs
#[cfg(target_arch = "wasm32")]
#[macro_export]
macro_rules! println {
    ($($arg:tt)*) => (web_sys::console::log_1(&format!($($arg)*).into()))
}
#[cfg(not(target_arch = "wasm32"))]
#[macro_export]
macro_rules! println {
    ($($arg:tt)*) => (std::println!($($arg)*))
}
src/target_test.rs
pub fn test() {
    #[cfg(target_arch = "wasm32")]
    {
        println!("target test: wasm32");
    }
    #[cfg(not(target_arch = "wasm32"))]
    {
        println!("target test: NOT wasm32");
    }
}
src/main.rs
use anyhow::Result;
mod logging;
mod target_test;
mod search_index;
mod test_data;
fn main() -> Result<()> {
    target_test::test();
    println!("Let's do this!");
    search_index::test()?;
    Ok(())
}
index.html
<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8">
    <title>Hello from Rust!</title>
    <script type="module">
      import init, { greet } from './sc.js';
      async function run() {
        await init('./sc_bg.wasm');
        console.log(greet("World"));
      }
      run();
    </script>
  </head>
  <body></body>
</html>
index.js
<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8">
    <title>Hello from Rust!</title>
    <script type="module">
      import init, { greet } from './sc.js';
      async function run() {
        await init('./sc_bg.wasm');
        console.log(greet("World"));
      }
      run();
    </script>
  </head>
  <body></body>
</html>
package.json
{
  "scripts": {
    "build": "webpack",
    "serve": "webpack serve"
  },
  "devDependencies": {
    "@babel/core": "^7.22.1",
    "@babel/preset-env": "^7.22.4",
    "@wasm-tool/wasm-pack-plugin": "1.5.0",
    "babel-loader": "^9.1.2",
    "html-webpack-plugin": "^5.3.2",
    "source-map-loader": "^4.0.1",
    "text-encoding": "^0.7.0",
    "webpack": "^5.49.0",
    "webpack-cli": "^4.7.2",
    "webpack-dev-server": "^3.11.2"
  }
}
.babelrc
{
  "presets": ["@babel/preset-env"]
}
webpack.config.js
const path = require('path');
const HtmlWebpackPlugin = require('html-webpack-plugin');
const webpack = require('webpack');
const WasmPackPlugin = require("@wasm-tool/wasm-pack-plugin");
module.exports = {
  entry: './index.js',
  output: {
    path: path.resolve(__dirname, 'dist'),
    filename: 'index.js',
  },
  plugins: [
    new HtmlWebpackPlugin(),
    new WasmPackPlugin({
      crateDirectory: path.resolve(__dirname, ".")
    }),
    // Have this example work in Edge which doesn't ship `TextEncoder` or
    // `TextDecoder` at this time.
    new webpack.ProvidePlugin({
      TextDecoder: ['text-encoding', 'TextDecoder'],
      TextEncoder: ['text-encoding', 'TextEncoder']
    })
  ],
  module: {
    rules: [
      {
        test: /\.js$/,
        exclude: /node_modules/,
        use: {
          loader: 'babel-loader',
          options: {
            presets: ['@babel/preset-env'],
          },
        },
      },
      {
        test: /\.js$/,
        enforce: "pre",
        use: ["source-map-loader"],
      },
    ],
  },
  mode: 'development',
  experiments: {
    asyncWebAssembly: true
  }
}; |
hi, looks like the error comes from Instant which does not support wasm, see https://internals.rust-lang.org/t/is-std-instant-on-webassembly-possible/18913 |
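Assuming the panic does come from `std::time::Instant::now()` on `wasm32-unknown-unknown`, one possible workaround for the timing code in the repro is a small target-gated stopwatch. This is a sketch: `Stopwatch` is a hypothetical helper, and the wasm branch assumes the `js-sys` crate has been added as a dependency. It only removes the `Instant` calls from the user's own code and would not help if the dependency itself calls `Instant` internally.

```rust
/// Native targets: backed by the monotonic `Instant` clock.
#[cfg(not(target_arch = "wasm32"))]
pub struct Stopwatch(std::time::Instant);

#[cfg(not(target_arch = "wasm32"))]
impl Stopwatch {
    pub fn start() -> Self {
        Stopwatch(std::time::Instant::now())
    }

    /// Elapsed time in milliseconds since `start`.
    pub fn elapsed_ms(&self) -> f64 {
        self.0.elapsed().as_secs_f64() * 1000.0
    }
}

/// wasm32: backed by JavaScript's `Date.now()` via `js-sys` (assumed dependency).
/// `Date.now()` is wall-clock time, so it is not guaranteed to be monotonic.
#[cfg(target_arch = "wasm32")]
pub struct Stopwatch(f64);

#[cfg(target_arch = "wasm32")]
impl Stopwatch {
    pub fn start() -> Self {
        Stopwatch(js_sys::Date::now())
    }

    /// Elapsed time in milliseconds since `start`.
    pub fn elapsed_ms(&self) -> f64 {
        js_sys::Date::now() - self.0
    }
}
```

In the repro, `now = Instant::now();` followed by `now.elapsed()` would become `let sw = Stopwatch::start();` followed by `sw.elapsed_ms()`.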
You may want to check out wasix https://wasmer.io/posts/announcing-wasix, they also got a version of tantivy working |
Whether or not |
@aguynamedben And also you can check Tantivy fork and search server based on this fork https://github.com/izihawa/summa It is patched for WASM (and working there perfectly, at least for reads):
Also, I did small patches for my case: parallelized compression of Both fork and search server have been in production for a long time, I'm keeping it in line with Tantivy master branch. |
Here is the WASM package https://github.com/izihawa/summa/tree/master/summa-wasm. It works for me in all browsers, including Safari on iOS, but I have never prepared it for being public, so it suffers from a lack of documentation, except for several indirectly related articles in the blog: 1 and 2. Anyway, you can look at how it is done. It utilizes a ThreadPool based on WebWorkers for parallelizing search load. Together with async code patches and careful usage (e.g. using hotcache and not using fieldnorms), it works very fast on multi-segment indices and multi-term queries, even if the index lives on the network. |
Summary
Commit to wasm as one of the targets for tantivy.
Motivation
Makes tantivy available to server-side and web developers natively. Enables developers to use the same index format on server and client, and gives clients native bindings to read and query the index, e.g. client-side queries on small-enough index files.
I expect this to give us a competitive advantage over Lucene and help library adoption rates.
Reference-level explanation
Introduce cargo workspaces
Using Cloudflare's wirefilter as an example, we would move current src/ directory to server/ and create a new directory wasm/.
Provide methods to index on server
The server indexer has 2 entry-points: tantivy-cli and library.
Library
Add a method to IndexWriter that serializes the index to a file.
Helps user build a tantivy index that is later serialized into a binary format that tantivy wasm understands.
tantivy-cli/indexer
Add a flag/question at the end to give users an option to serialize the index to wasm format.
Make the wasm library easy to compile and integrate
Add functions for
Enable integrations
Use wasm-pack to build tantivy-wasm. Ship the repo with a JS/HTML component that makes it easy to integrate tantivy wasm to web applications (backend and frontend).
Drawbacks
tantivy was originally conceived as a library for developing server-side indexers.
Making a serious commitment to wasm will affect feature development, programming style and devops infrastructure.
Incompatibility/lack of features
Although wasm support is being increasingly adopted by browser engines (Chromium and Firefox), the API surface is still limited and continues to change fast. For example, threading support is currently a work in progress in the wasm runtime.
Having a clear focus on 1 type of platform (servers) allows us to optimize our solution using SIMD, Rust intrinsics for different platforms and system-specific structure for lock-less programming provided by crossbeam.
Programming style
Changes in any relevant traits and structures cannot break the wasm build. Will require implementing some functionality twice with conditional compilation flags. This might introduce compromises in code, when Linux-specific features are sacrificed for the sake of wasm-compatibility.
Lose system-specific performance gains.
Adopting an immature platform with little industry backing. The probability of wasm being abandoned is much greater than that of Linux.
CI infra/build times
Including the wasm target and dependencies (wasm-bindgen, wasm-pack) in every CI build will increase CI time. Since these dependencies are yet to stabilize, we run the risk of being guinea pigs for the bugs in wasm-bindgen, which might break our builds and introduce potentially indefinite delays.
Alternatives
Start a separate project under the tantivy organization. Only guarantee wasm compatibility at every git tag/release checkpoint. This will keep feature development at its current rate.
Unresolved questions
How to test wasm? Headless browser or in pure Rust?
Future possibilities
Extend tantivy wasm to support the IndexWriter trait. Enable building a wasm application that indexes uploaded files in the tantivy format in the browser. Users will be able to build an index in browser memory and download it to run on their server.
This will further extend tantivy wasm demo and allow users to build an index and run queries against it on client-side.