Bounty: Public database syncing system (150$) #373
Comments
I'm not sure how useful IPFS chunking would be, as a large amount of the CSV is modified in day-to-day operations, most notably votes. You can try this yourself by chunking and deduping different days from my historical archive |
The Postgres approach is probably the best, but it does restrict all mirrors to using Postgres instead of SQLite |
braindump while I finish my readings https://gist.github.com/mchangrh/d8e2837b372756c867429eb20e946798 |
I think I have a very basic implementation that would only require
and high-bandwidth rsync+nginx server(s) for hosting, but that's quite obvious + I am willing to provide (offtopic) https://gist.github.com/mchangrh/3c3a9e870dfc1b8aee574c791dab83ad
tl;dr:
master (Ajay)
storage (single / geodistributed) / T1 Mirror
active mirror
mirror from scratch / back from the dead
|
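A minimal sketch of what an "active mirror" pulling from a T1 rsync mirror could look like, assuming a small Node script that shells out to rsync on an interval; the upstream host, module path, and local directory are hypothetical placeholders, not part of the proposal above:

```ts
// Hypothetical "active mirror" loop: periodically rsync the CSV dumps / WALs
// from an upstream T1 mirror. Upstream URL and local path are placeholders.
import { execFile } from "child_process";

const UPSTREAM = "rsync://t1-mirror.example.org/sponsorblock/"; // hypothetical T1 mirror
const LOCAL_DIR = "/srv/sponsorblock-mirror/";
const INTERVAL_MS = 5 * 60 * 1000; // aim for the ~5 minute target from the issue

function pull(): void {
  // --archive keeps timestamps, --compress helps on slow links, and
  // --delete drops files removed upstream so the mirror stays consistent.
  execFile(
    "rsync",
    ["--archive", "--compress", "--delete", UPSTREAM, LOCAL_DIR],
    (err, _stdout, stderr) => {
      if (err) console.error("rsync failed:", stderr);
      else console.log("mirror updated at", new Date().toISOString());
    }
  );
}

pull();
setInterval(pull, INTERVAL_MS);
```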
Is there a reason this would require more bandwidth than now? Kind of the point of this is to reduce the bandwidth used by many people downloading CSVs. I think the 16 MB WALs should keep it from taking too much. |
the point would be to offload the bandwidth from the main server onto these mirrors, but they would probably be serving exports as well as these base_dumps and WALs. If they are slower than the main server or just generally bad, clients will probably just fall back to the main server |
I guess that part really depends on the amount of uptake past services like sb.ltn.fi, but that's a good point. Separation of concerns is nice |
the main advantage is that the tier 1 mirrors can be dumb: a small amount of storage (1-3 GB) with minimal CPU usage and some RAM, since all they have are
if the primary storage server isn't generating the archive, then tar (which is minimal) and db generation can run on a docker runner |
I found the addon today and it's great! You could also abuse HTTP range requests and host an index to the CSV lines based on time. The program just asks for the next CSV slice based on its newest ID, and the server sends back the correct range (and possibly a URL). Then the program downloads just the new lines. The beauty of this approach is that any plain old nginx instance could do this without any major changes. Additionally, since the server is serving as an index, we can point to mirrors and round-robin the load. Real talk though: compress those CSV files. Even gzip is good enough.
GZIP and XZ are both installed on most systems, and anyone downloading these files from the website should know how to deal with them. If they are being used programmatically for updating, there are ways of just opening the compressed file and reading it in all the popular languages. |
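To make the range-request idea concrete, here is a minimal sketch of a client that remembers the byte offset it has already consumed and asks only for the newly appended part of the CSV; the mirror URL and the offset bookkeeping are assumptions, not part of any existing SponsorBlock API:

```ts
// Hypothetical incremental download via HTTP range requests: keep the byte
// offset already consumed and ask nginx only for what was appended after it.
const CSV_URL = "https://mirror.example.org/sponsorTimes.csv"; // assumed mirror URL

async function fetchNewLines(
  lastByteOffset: number
): Promise<{ lines: string[]; nextOffset: number }> {
  const res = await fetch(CSV_URL, {
    headers: { Range: `bytes=${lastByteOffset}-` }, // only the new bytes
  });

  if (res.status === 416) {
    // Range not satisfiable: nothing has been appended since the last sync.
    return { lines: [], nextOffset: lastByteOffset };
  }
  if (res.status !== 206) {
    throw new Error(`expected a 206 partial response, got ${res.status}`);
  }

  const chunk = await res.text();
  // Only keep complete lines; a trailing partial line is re-requested next time.
  const complete = chunk.slice(0, chunk.lastIndexOf("\n") + 1);
  return {
    lines: complete.split("\n").filter((line) => line.length > 0),
    nextOffset: lastByteOffset + Buffer.byteLength(complete),
  };
}
```

Note that this only picks up appended lines, which is exactly the limitation raised in the next comment for rows that get modified in place.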
would this work for modified lines? A lot of the delta is not just new segments being added but votes and views on existing segments too |
I don't see why you couldn't have the update logic overwrite the record if it encounters the same ID (essentially upsert each record based on its ID). While duplication is an issue, you can also provide a periodic base file, and if someone is too far behind they just redownload the entire base file. These base files would just be snapshots of the database dumped periodically to compress the updated fields. It definitely keeps it simple, which I think is the biggest benefit. |
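A rough illustration of that upsert flow, assuming a SQLite mirror accessed through better-sqlite3 and a simplified subset of columns (the real sponsorTimes schema has more fields):

```ts
// Hypothetical upsert of incoming delta rows into a local SQLite mirror.
// Table/column names are simplified for illustration.
import Database from "better-sqlite3";

const db = new Database("sponsorTimes.db");
db.exec(`CREATE TABLE IF NOT EXISTS sponsorTimes (
  UUID TEXT PRIMARY KEY,
  videoID TEXT,
  startTime REAL,
  endTime REAL,
  votes INTEGER,
  views INTEGER
)`);

// ON CONFLICT ... DO UPDATE overwrites vote/view counts for rows that already
// exist, so edited records from a delta file replace the stale copies.
const upsert = db.prepare(`
  INSERT INTO sponsorTimes (UUID, videoID, startTime, endTime, votes, views)
  VALUES (@UUID, @videoID, @startTime, @endTime, @votes, @views)
  ON CONFLICT(UUID) DO UPDATE SET
    votes = excluded.votes,
    views = excluded.views
`);

// `rows` would come from parsing a delta CSV; a transaction keeps it fast.
export function applyDelta(rows: Record<string, unknown>[]): void {
  const applyAll = db.transaction((batch: Record<string, unknown>[]) => {
    for (const row of batch) upsert.run(row);
  });
  applyAll(rows);
}
```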
1/2 done - distribution and downloading of files:
gist explaining flags and why lz4 was chosen (w/ benchmarks)
docker container for mirroring csv files
aports merge for lz4 https://gitlab.alpinelinux.org/alpine/aports/-/merge_requests/26616 (if closed, will probably use debian as a base instead or build ourselves) |
It might not be a bad idea to migrate to a decentralized database like https://github.com/amark/gun |
Very cool, but I don't know if it would fit SponsorBlock since there would need to be an absolute point of truth; having a publicly writable database without the application layer would not be ideal. |
2/2 done - the only things missing would be
If everything else is okay, then I'll do a quick change from alpine to debian for rsync's lz4 - pending aports merge |
@mchangrh I think that's a misconception: there is no "single source of truth", which means there are multiple sources of truth, as each node just stores the graph that concerns it. The entire database is the sum of all nodes. It removes the need for distributing and syncing databases; that's handled automatically by the Gun protocol.
Not really; by that logic cryptocurrencies could never exist. We can reach out to the gun team on gitter to see how they can help this project.
Yes, but sponsorblock isn't a cryptocurrency; a cryptocurrency doesn't need a single source of truth because the source of truth is established by verifying with other peers. Quite a while ago, when the BTC market was booming, there were concerns when a single pool mined 3 consecutive blocks. If a malicious actor wanted, they could spin up a majority of gun peers and override most of the database. It's cool tech, but it doesn't fit with the master/slave, single-source-of-truth model that sponsorblock requires. Sponsorblock requires this because the vipUsers table should only be accessible to the admin; a bad actor could change the vipUsers table and remove or add locked segments, or ban users. It's very cool, but even if we only used gun for sponsorTimes there would still be the issue of distributing the other tables. |
This is now complete! Check out @mchangrh's project to see how to host a mirror yourself: https://github.com/mchangrh/sb-mirror
There should be a system that allows anyone to have a near-realtime mirror (5 minutes optimally, at least every 20 minutes) of at least the sponsorTimes table. This was discussed in Discord (message) / Matrix (message).
Possible options:
Use a system similar to MusicBrainz, where a Postgres replication feed gets mirrored in a read-only way for other users to download. There should then be a way for the server to automatically read from this replication feed and import it into their local database. This would have to be done in a secure way that only replicates the public tables (see the sketch after this list).
Reuse the existing Webhook implementation and create a receiving end that can be configured to save this info to the database (a sketch of such a receiver follows further below).
Host the CSVs via IPFS and have a wrapper to automatically retrieve the latest versions using IPFS. This should work as IPFS stores files as chunks and will know which chunks are new and which are old. This also means that when someone is first downloading the entire db, everyone's mirror will act as a seeder, even if they are on a slightly outdated version of the file (since they have access to the specific chunks).
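For the first option, a sketch of how Postgres logical replication could be restricted to the public tables; the publication/subscription names, the table list, and the connection handling are assumptions rather than the real schema or setup:

```ts
// Hypothetical setup for the logical-replication option: the main server
// publishes only public tables; a mirror subscribes to that publication.
import { Client } from "pg";

// Run once on the main (publisher) database.
export async function createPublicPublication(adminUrl: string): Promise<void> {
  const publisher = new Client({ connectionString: adminUrl });
  await publisher.connect();
  // Only the tables listed here are ever sent to mirrors.
  await publisher.query(
    `CREATE PUBLICATION public_mirror FOR TABLE "sponsorTimes", "userNames"`
  );
  await publisher.end();
}

// Run once on each mirror; Postgres then keeps the listed tables in sync.
export async function subscribeMirror(
  mirrorUrl: string,
  publisherConnInfo: string
): Promise<void> {
  const mirror = new Client({ connectionString: mirrorUrl });
  await mirror.connect();
  await mirror.query(
    `CREATE SUBSCRIPTION sb_mirror
       CONNECTION '${publisherConnInfo}'
       PUBLICATION public_mirror`
  );
  await mirror.end();
}
```

Logical replication ships data changes but not DDL, so the mirror would need the public table definitions created up front and the publisher would need `wal_level = logical`; it also means mirrors have to run Postgres, as noted earlier in the thread.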
The solution should be easy to run when starting up the SponsorBlock server. Docker would probably be needed to ease deployment if it requires external dependencies beyond just node.
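For the webhook option, a minimal sketch of what a receiving end could look like; the route, payload shape, and storage helper are assumptions and not the existing Webhook format:

```ts
// Hypothetical receiving end for the webhook option: accept pushed segment
// updates over HTTP and hand them to whatever local storage the mirror uses.
import express from "express";

// Stand-in for the mirror's storage layer (e.g. the upsert sketch above).
function applyDelta(rows: unknown[]): void {
  console.log(`would upsert ${rows.length} record(s) into the local database`);
}

const app = express();
app.use(express.json());

app.post("/mirror/webhook", (req, res) => {
  const rows = Array.isArray(req.body) ? req.body : [req.body];
  try {
    applyDelta(rows);
    res.sendStatus(200);
  } catch (err) {
    console.error("failed to store webhook payload", err);
    res.sendStatus(500);
  }
});

app.listen(8080, () => console.log("mirror webhook listener on :8080"));
```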
Bounty has been solved. Bounty of $150 USD via PayPal to whoever solves this issue. To avoid duplicate work, make sure to comment on this issue with what you are doing.