Thoughts on using SQLite for the WACZ format #62
One more thing is that you probably want compression, and might want hash-based indexing as you do in your archiveweb.page system. So that would make it something like:

create table requests(id integer primary key autoincrement, timestamp date, ..., headers_compression_format text, headers blob, content_hash references contents(sha256));
create table contents(sha256 text primary key, compression_format text, data blob);
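For concreteness, a minimal sketch of how one capture might be stored under that scheme (purely illustrative; the values and the exact column set are assumptions, since the schema above elides some columns):

```sql
-- Hypothetical illustration: storing one capture with hash-based deduplication.
-- The body is inserted once, keyed by its SHA-256; the request row just references it.
INSERT OR IGNORE INTO contents(sha256, compression_format, data)
VALUES ('9f86d081884c7d65...', 'zstd', X'28B52FFD');          -- placeholder hash and bytes

INSERT INTO requests(timestamp, headers_compression_format, headers, content_hash)
VALUES ('2021-09-19T21:26:00Z', 'none', X'485454502F312E31',   -- raw header bytes
        '9f86d081884c7d65...');
```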
I'll see if I can convert a WARC archive to the above SQLite format later today, then we can check how many HTTP requests it has to do.
Thanks for the detailed response! To clarify, I was thinking specifically of using sqlite in place of the CDXJ index, and possibly page and text indices, not the raw WARC data.
Yes, a key reason is backwards (and forwards) compatibility: there are many PBs of WARC data out there, and WARC is a standard format that is widely used in the web archiving community. By simply storing the WARCs in the WACZ, there is always a way to transition between the two for tools that only support WARCs. Any new version of the WARC standard can also be accommodated, and there are efforts to support zstd compression in WARCs as well. The CDX(J) format, although in common use now, is less standardized, and there is no standard for page or text indices thus far. I am thinking of SQLite in place of the CDX(J) and/or the page/text indices. The table could look something like:
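(The schema from the original comment isn't preserved here; as a hedged reconstruction, a CDXJ-equivalent table might look roughly like the following, with column names mirroring the usual CDXJ fields rather than any actual spec:)

```sql
-- Rough sketch of a CDXJ-style index table; column names are assumptions.
CREATE TABLE cdx (
  url      TEXT NOT NULL,    -- lookup key (or a SURT/reversed URL)
  ts       TEXT NOT NULL,    -- 14-digit capture timestamp
  mime     TEXT,
  status   INTEGER,
  digest   TEXT,             -- payload hash
  filename TEXT NOT NULL,    -- which WARC inside the WACZ
  offset   INTEGER NOT NULL,
  length   INTEGER NOT NULL
);
CREATE INDEX cdx_url_ts ON cdx(url, ts);
```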
A quick test shows that a SQLite table is about the same size as (slightly smaller than) the uncompressed CDXJ, while a compressed CDXJ would still be a lot smaller... so there is that. Probably a custom grouping approach, as you've experimented with, might still be necessary.
Sure, that makes sense. I still think you could fairly easily make a bidirectional conversion from a set of WARC files to an extensible SQLite DB and ensure there's no information loss, and a specification of that could be the next version of the WARC file format. But I can understand the desire to keep around the original files and to have a less drastic transition.
That sounds kind of unexpected to me, since the JSON format has a lot of overhead (when not compressed). I converted with the following commands:

rg '(.*?) (\d+) \{(.*)\}' cdx-00000 --replace '{"revurl": "$1", "ts": $2, $3}' > cdx-0000.jsonl
jq -s . cdx-0000.jsonl > cdx-0000.json
sqlite-utils insert cdx-0000.sqlite3 cdx cdx-0000.json

l cdx*
2,7M Sep 19 21:26 cdx-00000.jsonl
2,1M Sep 19 21:27 cdx-00000.sqlite3
2,5M Sep 19 21:19 cdx-00000

So trivially converted, that's 16% smaller than the uncompressed CDXJ, though that is before adding an index on revurl. But there's also a ton of redundancy in there because it's not normalized. With queries like

create table filenames(id integer primary key not null, filename text not null);
insert or ignore into filenames(filename) select filename from cdx;
update cdx set filename = (select id from filenames where filenames.filename = cdx.filename);

the SQLite size goes down to 1.3 MB, a 48% size reduction over the uncompressed file. Still 5x larger than the compressed file, but that doesn't allow random access.

Testing with […] shows that finding the entry for one URL takes 7 HTTP requests of 4 kB each: 3 for the revurl b-tree lookup and 4 for the cdx table lookup. Not sure if that's good or bad. I guess with "only" 8000 entries it's still better to just compress and download the whole thing. Attached are the SQLite DB and the original CDXJ file.
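For context, the lookup being measured is presumably something along these lines against the normalized tables above; the SURT-style URL is a placeholder and the join back to filenames is an assumption:

```sql
-- The kind of lookup measured above: find the latest capture of one URL,
-- resolving the normalized filename id back to its text.
SELECT c.revurl, c.ts, f.filename
FROM cdx AS c
JOIN filenames AS f ON f.id = c.filename
WHERE c.revurl = 'com,example)/'   -- placeholder reversed URL
ORDER BY c.ts DESC
LIMIT 1;
```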
One obvious reason to use SQLite for the indexes would be that you can easily add a fairly efficient full-text-search index for fuzzy matching in e.g. page titles (or contents) without downloading all of them. No idea how you would build that otherwise; I don't think there's any precedent for full-text / inverted-index lookups from a static HTTP server. Other than that, the indexes should be more efficient and more flexible than self-written ones.
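As a rough illustration of what that could look like with SQLite's built-in FTS5 (table and column names are made up for the example):

```sql
-- Minimal FTS5 sketch: index titles and extracted text, then do prefix matching
-- without fetching the whole index.
CREATE VIRTUAL TABLE page_search USING fts5(title, text);

INSERT INTO page_search(title, text)
VALUES ('Example Domain', 'This domain is for use in illustrative examples in documents.');

SELECT title, snippet(page_search, 1, '[', ']', '…', 8) AS hit
FROM page_search
WHERE page_search MATCH 'exampl*'   -- prefix query, a cheap form of fuzzy matching
ORDER BY rank;
```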
Thanks for checking it with this file, that seems like a good test case, and I can definitely find something bigger with a lot more data! By comparison, inside that WACZ, the size is compressed cdx-00000.gz (263902) + cluster.idx (2430) = 266332, so ~266K, which is about 10x smaller. For CDX lookup, there should only be 2 HTTP requests: one for the compressed block (from cdx-00000.gz) and one to get the WARC record from one of the WARCs. (There are a few other requests initially from the WACZ itself.) 7 HTTP requests does not sound so great by comparison, so this doesn't seem like a good alternative based on this dataset. Of course, this index is particularly optimized for url + timestamp lookup, so other types of queries would be more tricky.

Perhaps application to the page index / text index might be more interesting. Yes, I was thinking of some sort of incremental sharded loading that is populated into the local FlexSearch instance, but you're right, I'm not aware of any existing standard approach for this!
Hm, the domain-reversed URL we may actually be able to skip, as wabac.js currently doesn't use it - experimenting with whether replay on this scale can be done sufficiently well without it. Most other operations are done on range queries in IndexedDB with extra filtering.

Yeah, I think probably some sort of batch row compression like this would be needed to get comparable storage size to the compressed cdx...
Huh, so it just does exact matching by URL? I just tried again with a 64 kB page size and querying by URL: I get 4 HTTP requests for the first lookup by URL, and 2 for the second lookup with a different random URL. With similar URLs, the next query takes 0 or 1 HTTP requests. So yeah, just comparing these, it doesn't look like SQLite has much of an advantage, and it only gets the same "performance" as the manually-tuned CDXJ index with some tuning of its own.
Yes, you're not going to get that same size with SQLite without pretty hacky stuff. Though I'm not sure if the total size here is too much of an issue, considering it's only fetched in smaller parts, and the total size will in any case be much smaller than the rest of the WARC files.
I'd say SQLite FTS scales to maybe 5-50 GB of plain-text content with this method (static hosting), though the DB will get pretty large. quickwit-oss/tantivy#1067 scales to searching in multiple terabytes of text, fetching only 1-2 MB of data from the index.
Yes, you're right, the .idx file is sort of equivalent to the b-tree cache!
It does a lot of prefix search, usually from the '?', and then finds the best match based on a few heuristics and rules.
Yeah, perhaps not worth the switch just for this. Perhaps if additional querying capability were needed, that could be worth it, but this is intended primarily for replay. Thanks again for trying this out!
Yeah, that's probably right at this scale. It matters when dealing with PBs of data, but not at the scale of most WACZ files.
That is interesting! Perhaps optimization/index format improvements should focus on search, as we don't yet have a good scalable solution for full-text search! I wasn't familiar with this project before (hope that PR gets merged!)
Interested in exploring the idea of using SQLite just for the full-text search. In this way, it'll be an optional enhancement, but not required. The CDXJ approach works well for the core replay, now having it scale to over 1TB, ex: https://replayweb.page/?source=https%3A%2F%2Ffiles.sucho.org%2Farchives%2Fchnu-cv-ua.wacz#view=pages&url=http%3A%2F%2Fchnu.cv.ua%2Findex.php%3Fpage%3Dua&ts=20220324190736 - but for the full-text search, we don't yet have a scalable solution. The idea would be to essentially convert the pages.jsonl / extra-pages.jsonl to a SQLite DB. The table would consist of the page id, url, timestamp, title and extracted text.
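A hedged sketch of what that pages table plus a search index could look like - the names and the external-content FTS setup are illustrative choices, not part of any proposal here:

```sql
-- Sketch: pages derived from pages.jsonl / extra-pages.jsonl, with an
-- external-content FTS5 index so the extracted text is stored only once.
CREATE TABLE pages (
  id      INTEGER PRIMARY KEY,  -- rowid alias
  page_id TEXT UNIQUE,          -- page id from pages.jsonl
  url     TEXT NOT NULL,
  ts      TEXT NOT NULL,
  title   TEXT,
  text    TEXT                  -- extracted text
);

CREATE VIRTUAL TABLE pages_fts USING fts5(
  title, text,
  content='pages', content_rowid='id'
);

-- Keep the index in sync on insert (delete/update would need similar triggers).
CREATE TRIGGER pages_ai AFTER INSERT ON pages BEGIN
  INSERT INTO pages_fts(rowid, title, text) VALUES (new.id, new.title, new.text);
END;
```

The external-content variant is shown only because it avoids duplicating the text inside the FTS index; a plain FTS5 table would also work at the cost of extra size.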
If we're talking about just text content, I think that scale seems sufficient!
Ilya, this might be interesting for you:
quickwit-oss/tantivy#1067
(Hint: faster and more efficient than SQLite)
Yes, that PR is from @phiresky, who mentions it above :) It doesn't seem like it's going to get merged, though, based on the lack of activity there :/ Another factor is that this would add a dependency on tantivy, which is not as ubiquitous as SQLite.
I haven't checked in a while, but if I remember correctly the tantivy authors actually added some other changes that should make a wasm build easier (a smaller PR with far fewer changes). It's still implementation work someone would need to do, though, and the tantivy authors themselves probably aren't too interested in doing it.
Can you send a (large) pages.jsonl file so I can play around? https://files.sucho.org/archives/chnu-cv-ua.wacz doesn't contain the plain text extraction.
@ikreymer here's the Python code I used for a different project that loads a set of WARC files into an SQLite database: https://github.com/phiresky/warc-sqlite (with a suboptimal format and without advanced functionality right now). I added a small CLI and docs. Try it with […]
Yes, we turned off text extraction due to scaling issues for now. Here are some WACZ files that do have text in the pages.jsonl:
https://files.sucho.org/archives/dako-gov-ua.wacz
https://files.sucho.org/archives/echo-msk-ru.wacz
Can send the separate pages.jsonl later. Thanks so much for taking a look (whenever you have time)!
Thanks, I extracted them from the links you sent.

Do you think the search should rather happen within the service worker or on the client side? In the service worker there might be a bit of an issue in that you can't create another (web) worker from the service worker, so running the SQLite engine in there might block other queries while it's running.
That's a good point. I was imagining it would be a part of wabac.js, or an add-on to it, since it handles the WACZ loading. wabac.js can also be run as a web worker - it is used that way for the initial loading. Then, replayweb.page could just perform search queries via the […]
I implemented a prototype: webrecorder/wabac.js#63 We'll need bigger pages.jsonl files to really test it; the 50 MB ones aren't too interesting.
This is a response to your email, but I thought I'd post it publicly since it might interest other people.
Here are some fairly unstructured thoughts without a real "conclusion":
I'd say SQLite is definitely a possibility for that use case.
SQLite has an "integrated" zip-file replacement https://sqlite.org/sqlar/doc/trunk/README.md (not that great for this since it uses DEFLATE and doesn't directly have any other file metadata or indexes, but it's precedent for storing files in SQLite).
SQLite has a BLOB I/O API that allows a byte sequence stored in there to be read as a stream (as opposed to getting it all at once with a normal SQL query).
Audacity recently changed their project format, which consists of many audio tracks, to use SQLite instead of a directory of files: https://www.audacityteam.org/audacity-3-0-0-released/ . These can be pretty large audio files, so it seems to work fine for that.
SQLite has a limitation of 2GB for one blob which is probably not that relevant for webrecorder but can definitely be hit when using it to archive normal files
I've looked at the WACZ format before and it seems well thought out, but a bit unnecessarily complicated - kind of reinventing a database structure inside a ZIP file. I also understand in general why people like plain text etc formats - but is there a specific reason why you need to retain normal WARC files with side-car indexes in your WACZ format?
Putting everything in a cleanly normalized SQLite database has a few obvious advantages - for one, it becomes trivial to continuously insert more requests into it without having to rewrite any indexes; all indexing is managed automatically, and adding a new key by which to query data is trivial. For example, in browsertrix-crawler I couldn't really find out if I can stop the crawling and resume it later, or if it always has to run through completely, and I guess it only writes out the final archive at the end?
You can write to and read from an SQLite database from multiple processes at the same time with ease (single writer but multiple readers with journal_mode=WAL).
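For reference, that mode is enabled with a one-time pragma on the database file (standard SQLite, shown only for concreteness):

```sql
PRAGMA journal_mode=WAL;    -- persists in the database file; one writer, many readers
PRAGMA synchronous=NORMAL;  -- commonly paired with WAL (optional)
```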
The SQLite database storage format is pretty efficient - for example it transparently uses var-ints for all integers. Text/binary data is not compressed though.
I see that accessing CDXJ uses binary search for looking up stuff - SQLite uses a B-tree (as do basically all databases), which in general should result in a constant factor fewer requests, since the tree depth is lower. It can also be updated more efficiently (afaik). Since request latency is significant for partial fetching, this is probably a good improvement. You can check the depth of the b-tree with sqlite3_analyzer. I just looked at a DB with 2M rows: the SQLite index has depth 4, while a normal binary-search lookup would have depth 20. SQLite always first looks up the rowid in an index b-tree, then the real row in the table b-tree. Since each HTTP request has fairly high overhead, it can make sense to pre-cache the inner nodes of the b-trees beforehand. Someone implemented this, streamlining the HTTP request pattern with a sidecar file: phiresky/sql.js-httpvfs#15 . In your WACZ format, the equivalent would be caching the first few layers of the binary tree induced by the binary search (but it would be less useful, since the tree is much deeper).
CDXJ gzips groups of lines - this is something that would be harder with SQLite since you can't compress many rows together without denormalizing the DB in some way (e.g. storing many real rows per DB row).
To solve the above I've actually developed https://github.com/phiresky/sqlite-zstd , which allows transparent row-level compression in SQLite with dictionaries - so if you have 10000 similar entries, all the common information need only be stored once. It's great, but a bit experimental, and right now there is no wasm support.
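A very rough usage sketch, going from memory of the sqlite-zstd README - the function names and config keys are recalled, not verified, so treat them as assumptions:

```sql
-- Assumes the sqlite-zstd extension has already been loaded into the connection.
-- Turn on transparent compression for one column, with dictionary training:
SELECT zstd_enable_transparent(
  '{"table": "requests", "column": "headers",
    "compression_level": 19, "dict_chooser": "''a''"}'
);
-- Compress existing rows in bounded background batches:
SELECT zstd_incremental_maintenance(null, 0.5);
```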
SQLite would be pretty much a black box in this context - you probably don't want to modify SQLite core, so anything you need that falls outside the flexibility of SQLite will be hard to do.
One disadvantage of SQLite is that it would be harder to reason about exactly how much data needs to be fetched when querying by something specific, and having a wasm "blob" between layers of your code is annoying in general.
SQLite is very robust and established, but using it from within the browser is not. So if your use case was only native programs I'd say definitely use it instead of a home-brew format, but for web-based it's less clear.
To use SQLite as your file format, you'd have to decide on the database structure. So that would be something like:
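(The schema from the original comment isn't preserved; the following is a hedged reconstruction of the denormalized structure being described, with headers stored together with the body as in WARC - all names are assumptions:)

```sql
-- Denormalized sketch: one row per capture, headers kept with the body.
CREATE TABLE responses (
  id        INTEGER PRIMARY KEY,
  url       TEXT NOT NULL,
  timestamp TEXT NOT NULL,
  status    INTEGER,
  headers   TEXT NOT NULL,   -- full header block, stored alongside the body
  body      BLOB
);
CREATE INDEX responses_url_ts ON responses(url, timestamp);
```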
You could then also easily add a full-text search index so you can search for websites with specific text in the titles or even content.
The "cleanest" way would be to put everything in 3NF form. That way you could add indexes on whatever your want, including looking for requests with a specific header value etc. But you probably want to denormalize the data somewhat since you always want all the headers together with the content (like in WARC), without any joining overhead. Then it's harder to index the headers individually, but the main use case is (probably) efficient. Putting it in a more normal form would mean something like
Note that this is much more complicated, but it also deduplicates the data since most headers and even header values are probably used multiple times.