
Bounty: Public database syncing system (150$) #373

Closed · ajayyy opened this issue Sep 29, 2021 · 18 comments

Comments

@ajayyy (Owner) commented Sep 29, 2021

There should be a system that allows anyone to maintain a near-realtime mirror (updated every 5 minutes ideally, every 20 minutes at least) of at least the sponsorTimes table.

This was discussed in Discord (message) / Matrix (message).

Possible options:

  • Use a system similar to MusicBrainz, where a Postgres replication feed gets mirrored in a read-only way for other users to download. There should then be a way for the server to automatically read from this replication feed and import it into the local database. This would have to be done in a secure way that only replicates the public tables.

  • Reuse the existing webhook implementation and create a receiving end that can be configured to save this info to the database.

  • Host the CSVs via IPFS and have a wrapper that automatically retrieves the latest versions using IPFS. This should work because IPFS stores files as chunks and knows which chunks are new and which are old. It also means that when someone first downloads the entire db, everyone's mirror will act as a seeder, even if it is on a slightly outdated version of the file (since it still has the shared chunks). A rough sketch of this option follows the list.
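
A very rough sketch of the wrapper side of the IPFS option (the IPNS key is a placeholder; this only illustrates the idea, it is not an implementation):

```sh
# Resolve the latest published dump through a stable IPNS name (placeholder key),
# fetch it, and pin it so this mirror keeps seeding the chunks for everyone else.
PUBLISHER_KEY="k51qzi5uqu5d...placeholder"
CID_PATH=$(ipfs name resolve "/ipns/${PUBLISHER_KEY}")
ipfs get "${CID_PATH}" -o sponsorTimes.csv
ipfs pin add "${CID_PATH}"
```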


The solution should be easy to run when starting up the SponsorBlock server. Docker would probably be needed to ease deployment if it requires external dependencies beyond just node.


The bounty has been solved. Bounty of $150 USD via PayPal to whoever solves this issue. To avoid duplicate work, make sure to comment on this issue with what you are working on.

@ajayyy added the help wanted label Sep 29, 2021
@ajayyy pinned this issue Sep 29, 2021
@mchangrh (Contributor):

I'm not sure how useful IPFS chunking would be, as a large amount of the CSV is modified in day-to-day operations, most notably votes. You can try this yourself by chunking and deduping different days from my historical archive:

http://sb-archive.mchang.xyz/mirror/

@ajayyy (Owner, Author) commented Sep 29, 2021

The Postgres approach is probably the best, but it does restrict all mirrors to using Postgres instead of SQLite.

@mchangrh (Contributor):

braindump while I finish my readings

https://gist.github.com/mchangrh/d8e2837b372756c867429eb20e946798

@mchangrh (Contributor) commented Sep 30, 2021

I think I have a very basic implementation that would only require

  • rsync
  • cron

and high bandwidth RSync+nginx server(s) for hosting, but that's quite obvious + I am willing to provide them (offtopic)

https://gist.github.com/mchangrh/3c3a9e870dfc1b8aee574c791dab83ad

tl;dr (rough command sketch after the outline)

master (Ajay)

  • base_backup every 12 hours
  • ship WALs to storage once done (can also ship via rclone to "cloud storage")
  • prod storage w/ ssh/ webhook/ anything

storage (single/ geodistributed) / T1 Mirror

  • retain full history OR last base + WAL
  • rsync, nginx
  • (on prod) - generate archive (tar with base + all WAL)
  • (on prod) - generate .db for sqlite3 (wal replays are too cumbersome on stock sqlite)
  • (on prod) - prod mirrors

active mirror

  • (on prod) | cron - get latest WAL(s)
  • any new WALs are automatically replayed

mirror from scratch / back from the dead

  • docker-compose pull etc...
  • get latest archive OR base + WAL
  • restore base, start in recovery to stream in WALs
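
Rough sketch of the commands behind that outline (hosts and paths are placeholders, and these are just the stock Postgres archive/restore knobs rather than the exact setup from the gist):

```sh
# master (placeholder paths/hosts) - postgresql.conf has WAL archiving enabled:
#   archive_mode = on
#   archive_command = 'rsync -a %p storage:/srv/sb/wal/%f'
# every 12 hours, cron takes a fresh base backup and ships it to storage:
pg_basebackup -D /tmp/sb-base -Ft -z
rsync -a /tmp/sb-base/ storage:/srv/sb/base/

# active mirror - cron pulls any new WALs; the standby replays them because
# restore_command in postgresql.conf points at the local archive copy:
#   restore_command = 'cp /srv/sb/wal-local/%f "%p"'
rsync -a storage:/srv/sb/wal/ /srv/sb/wal-local/
```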

@ajayyy (Owner, Author) commented Sep 30, 2021

high bandwidth RSync+nginx server(s)

Is there a reason this would require more bandwidth than now? Kind of the point of this is to reduce the bandwidth used by many people downloading CSVs. I think the 16 MB WALs should keep it from taking too much.

@mchangrh (Contributor):

Is there a reason this would require more bandwidth than now? Kind of the point of this is to reduce the bandwidth used by many people downloading CSVs. I think the 16 MB WALs should keep it from taking too much.

The point would be to offload bandwidth from the main server onto these mirrors. They would probably be serving exports as well as these base_dumps and WALs, but if they are slower than the main server or just generally bad, people will probably just fall back to the main server.

@ajayyy (Owner, Author) commented Sep 30, 2021

I guess that part really depends on the amount of uptake beyond services like sb.ltn.fi, but that's a good point. Separation of concerns is nice.

@mchangrh (Contributor):

The main advantage is that the tier 1 mirrors can be dumb:

a small amount of storage (1-3 GB) with minimal CPU usage and some RAM, since all they run are

  • nginx server
  • rsync server
  • nginx webhook listener / sshd

If the primary storage server isn't generating the archive, then the tar step (which is minimal) and the db generation can run on a docker runner.
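
For example, the entire sync side of a T1 mirror could be a single cron'd rsync pull (hosts and paths are made up), with nginx just serving the same directory as static files:

```sh
# T1 mirror: pull whatever the storage server has; nginx serves the same
# directory read-only (placeholder host/module/paths)
rsync -a --delete rsync://storage.example/sb/ /var/www/sb-mirror/
```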

@stranger-danger-zamu commented Oct 5, 2021

I found the addon today and it's great!

You could also abuse HTTP range requests and host an index to the CSV lines based on time. The program asks for the next CSV slice based on its newest ID, and the server sends back the correct range (and possibly a URL). The program then downloads only the new lines.

The beauty of this approach is that any plain old nginx instance could do this without any major changes. Additionally, since the server is acting as an index, we can point to mirrors and round-robin the load.
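
For example, the client side could be as simple as this (the index endpoint and hostname are made up, just to show the shape of it):

```sh
# LAST_ID: highest ID already present in the local copy.
# Ask the (hypothetical) index for the byte offset where newer rows start,
# then fetch only that range and append it to the local CSV.
OFFSET=$(curl -s "https://mirror.example/sponsorTimes.index?after=${LAST_ID}")
curl -s -H "Range: bytes=${OFFSET}-" \
  "https://mirror.example/sponsorTimes.csv" >> sponsorTimes.csv
```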


Real talk though. Compress those CSV files. Even gzip is good enough:

original file        original size   gzip'd size   xz'd size
categoryVotes.csv    3.4M            1.2M          996K
lockCategories.csv   3.0M            292K          216K
sponsorTimes.csv     635M            272M          208M
unlistedVideos.csv   5.5M            1.8M          1.2M
userNames.csv        5.1M            2.6M          2.4M
vipUsers.csv         4.0K            4.0K          4.0K
warnings.csv         84K             32K           28K

gzip and xz are both installed on most systems, and anyone downloading these files from the website should know how to deal with them.

If they are being used programmatically for updating, all the popular languages have ways of opening and reading the compressed file directly.
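
For example, from a shell the compressed dumps can be streamed straight into whatever does the import, without ever writing the uncompressed file to disk (URLs are placeholders):

```sh
# stream the gzip'd dump and count rows without storing the uncompressed CSV
curl -s "https://mirror.example/sponsorTimes.csv.gz" | zcat | wc -l
# same idea for the xz variant
curl -s "https://mirror.example/sponsorTimes.csv.xz" | xz -dc | wc -l
```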

@mchangrh (Contributor) commented Oct 5, 2021

Would this work for modified lines? A lot of the delta is not just new segments being added, but votes and views on existing segments too.

@stranger-danger-zamu commented Oct 6, 2021

Would this work for modified lines? A lot of the delta is not just new segments being added, but votes and views on existing segments too.

I don't see why you couldn't have the update logic overwrite the record when it encounters the same ID (essentially just upsert each record based on its ID).
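
A rough sketch of that upsert against Postgres (the database name, table, and column names here are my assumptions, not taken from the real schema):

```sh
# load the delta CSV into a temp table, then upsert into the real table by ID;
# votes/views get overwritten for rows that already exist
psql sponsorblock <<'SQL'
CREATE TEMP TABLE delta (LIKE "sponsorTimes" INCLUDING ALL);
\copy delta FROM 'sponsorTimes.delta.csv' WITH (FORMAT csv, HEADER true)
INSERT INTO "sponsorTimes" SELECT * FROM delta
ON CONFLICT ("UUID") DO UPDATE
  SET "votes" = EXCLUDED."votes",
      "views" = EXCLUDED."views";
SQL
```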

While duplication is an issue, you can also provide a periodic base file, and if someone is too far behind they just redownload the entire base file. These base files would just be snapshots of the database dumped periodically, compacting all the accumulated field updates.

It definitely keeps it simple, which I think is the biggest benefit.

@mchangrh (Contributor):

1/2 done - distribution and downloading of files

gist explaining flags and why lz4 was chosen (w/ benchmarks)
https://gist.github.com/mchangrh/3d4a967732f8994cfcfdc05a8e22dc4f

docker container for mirroring csv files
https://github.com/mchangrh/sb-mirror

aports merge request for lz4: https://gitlab.alpinelinux.org/alpine/aports/-/merge_requests/26616 (if it gets closed, will probably use debian as a base instead or build it ourselves)

@whizzzkid (Contributor):

It might not be a bad idea to migrate to a decentralized database like https://github.com/amark/gun

@mchangrh (Contributor):

It might not be a bad idea to migrate to a decentralized database like https://github.com/amark/gun

Very cool, but I don't know if it would fit SponsorBlock, since there would need to be an absolute point of truth; having a publicly writeable database without the application layer would not be ideal.

@mchangrh (Contributor):

1/2 done - distribution and downloading of files

gist explaining flags and why lz4 was chosen (w/ benchmarks) https://gist.github.com/mchangrh/3d4a967732f8994cfcfdc05a8e22dc4f

docker container for mirroring csv files https://github.com/mchangrh/sb-mirror

aports merge request for lz4: https://gitlab.alpinelinux.org/alpine/aports/-/merge_requests/26616 (if it gets closed, will probably use debian as a base instead or build it ourselves)

2/2 done - with mode=mirror in config, almost all of the work is done

The only things missing would be:

  • MSSQL support with BULKIMPORT
  • possibly finagling SQLite to not do version updates
  • (improbable but doable - will probably do for fun tomorrow) - runner to start up node server automatically

If everything else is okay, then I'll do a quick change from alpine to debian for rsync's lz4, pending the aports merge.

@whizzzkid (Contributor):

@mchangrh I think that's a misconception: there is no "single source of truth", which means there are multiple sources of truth, as each node just stores the graph that concerns it. The entire database is the sum of all nodes. It does away with the need to distribute and sync databases; that's handled automatically by the Gun protocol.

since there would need to be an absolute point of truth; having a publicly writeable database without the application layer would not be ideal.

Not really; by that logic cryptocurrencies could never exist. We can reach out to the Gun team on Gitter to see how they can help this project.

@mchangrh (Contributor) commented Oct 22, 2021

Not really; by that logic cryptocurrencies could never exist.

Yes, but SponsorBlock isn't a cryptocurrency. A cryptocurrency doesn't need a single source of truth; its source of truth comes from verifying with other peers. Quite a while ago, when the BTC market was booming, there were concerns because a single pool mined 3 consecutive blocks. If a malicious actor wanted, they could spin up a majority of Gun peers and override most of the database. It's cool tech, but it doesn't fit with the master/slave, single-source-of-truth model that SponsorBlock requires.

This is required by SponsorBlock because the vipUsers table should only be accessible to the admin; otherwise a bad actor could change the vipUsers table to remove or add locked segments, or ban users.

It's very cool, but even if we only used Gun for sponsorTimes, there would still be the issue of distributing the other tables.

@ajayyy removed the help wanted label Oct 27, 2021
@ajayyy closed this as completed in 19f7bbc Oct 28, 2021
@ajayyy unpinned this issue Oct 28, 2021
@ajayyy (Owner, Author) commented Oct 28, 2021

This is now complete! Check out @mchangrh's project to see how to host a mirror yourself: https://github.com/mchangrh/sb-mirror
