Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kitsune2 Bootstrap Server--Core #10

Open
wants to merge 15 commits into
base: main
Choose a base branch
from
Open

Kitsune2 Bootstrap Server--Core #10

wants to merge 15 commits into from

Conversation

neonphog
Copy link
Collaborator

@neonphog neonphog commented Nov 21, 2024

This is the core functionality of the kitsune2 bootstrap server code.

To make the PRs more manageable, I've broken them out into some preparatory work before this, and all the endpoint test after this.

  • See the previous Prep #9 PR for that prep work.
  • See the follow on Testing #11 PR for api endpoint tests and general notes about this effort in the PR description.


// validate expires_at is not more than 30 min after created_at
if info.expires_at - info.created_at
> (std::time::Duration::from_secs(60 * 30).as_micros() as i64)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should move this value to a const in the config file.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done: 263d5ab


// validate created at is less than 3 min in the future
if info.created_at
> now + (std::time::Duration::from_secs(60 * 3).as_micros() as i64)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should move this value to a const in the config file.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done: 263d5ab


// validate created at is not older than 3 min ago
if info.created_at
< now - (std::time::Duration::from_secs(60 * 3).as_micros() as i64)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should move this value to a const in the config file.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done: 263d5ab

}
}

fn maint_worker(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read this and thought it was a typo on "main_worker" -- maybe spelling it fully would be cleaer.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'll change it to prune_worker? I hate typing maintenance, I always get the e's and a's backwards the first time.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done: 263d5ab


/// A map of spaces.
#[derive(Clone)]
pub struct SpaceMap(Arc<Mutex<HashMap<bytes::Bytes, Space>>>);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to use bytes::Bytes instead of the wrapper type you made SpaceId?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I see this server / protocol being useful for WAN discovery outside of projects using kitsune itself. Perhaps it would actually be better to exist outside this monorepo, but that is more overhead for us. At the very least, I didn't want this to have any dependencies on other kitsune crates.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case do you want to define a SpaceId or similar wrapper type in this crate just for readability?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can start with just a typedef for code clarity. I don't think there's any need for a newtype yet.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done: 263d5ab


impl SpaceMap {
/// Read the content of a space.
pub fn read(&self, space: &bytes::Bytes) -> std::io::Result<Vec<u8>> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here - should this be a SpaceId?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done: 263d5ab

Base automatically changed from bootstrap-prep to main November 22, 2024 17:25
fn drop(&mut self) {
self.cont.store(false, std::sync::atomic::Ordering::SeqCst);
for worker in self.workers.drain(..) {
let _ = worker.join().expect("Failure shutting down worker thread");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any value in passing the error through here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would you like to pass this through? This is a drop impl, we don't have many options. Technically you're not supposed to panic in drop impls, you'll get a stack dump that is not very useful... I'm open to other suggestions... but rust's handling of panics in other threads is pretty poor, I haven't really heard of a better pattern.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't the main thread = server thread panic here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It needn't be the main thread if the server were passed to a different thread. But yes, whichever thread the server instance variable is dropped on will panic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't that stop the whole server when it's only about dropping workers?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest that this doesn't have to be a drop implementation here. The main program is doing the drop explicitly but you only need to create one instance of this per program, so it could equally be a shutdown method that can return an error?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A drop implementation that calls shutdown().unwrap() would be good, to make sure that the user gets some attempt at shutdown if they forget to call it

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done: ebac6ff

/// and having too many file handles open.
///
/// Start with 10MiB?
const MAX_PER_TEMPFILE: u64 = 1024 * 1024 * 10;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also be in the config?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pressure leading to wanting this larger is the inefficiency of having multiple tempfiles open. This lets us store more than 10_000 entries in one file (because infos are often much less than 1024 bytes long). It doesn't feel like we should need this to be much larger.

The pressure leading to wanting this smaller is the constraints of disk space. On a 16 core system (which cloud servers will likely not have anywhere near that) the default production config will have 64 active write tempfiles, giving 64MiB. Even if we have 10 older generations still around, that's only 640 MiB. Do we think servers will ever have that little disk space? Or to look at it from the opposite perspective, let's say the cloud server provides 10GiB of disk space. We can store 10,485,760 agent records at once, even if many of those are defunct, waiting to be cleaned up.

I don't feel like dev-ops folks are going to understand the consequences of making changes to this number. Providing the option to may just be confusing, and not practically useful. What do you think?

P.S. Re-reading this post, I'm not sure I even described it without being confusing...

Copy link
Contributor

@jost-s jost-s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Pruning is tested in the other PR - so just a couple of questions.

crates/bootstrap_srv/src/parse.rs Outdated Show resolved Hide resolved
fn drop(&mut self) {
self.cont.store(false, std::sync::atomic::Ordering::SeqCst);
for worker in self.workers.drain(..) {
let _ = worker.join().expect("Failure shutting down worker thread");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't the main thread = server thread panic here?

crates/bootstrap_srv/src/server.rs Outdated Show resolved Hide resolved
///
/// Defaults:
/// - `testing = 32`
/// - `production = 32`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be higher in production? Like 10k?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is just for bootstrap. I'd prefer to handle more spaces and less agents per space than the other way around. Once they are in the space, they can get additional peers by gossip.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense!

@@ -0,0 +1,83 @@
/// An entry with known content.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to point to where this is documented from here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done: 85dffc7

use crate::*;
use tiny_http::*;

/// Don't allow created_at to be greater or less than this far away from now.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Don't allow created_at to be greater or less than this far away from now.
/// Don't allow created_at to be greater than this far away from now.

I think the "absolute" is implied by the sentence and the less is a little confusing

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done: b407784

fn drop(&mut self) {
self.cont.store(false, std::sync::atomic::Ordering::SeqCst);
for worker in self.workers.drain(..) {
let _ = worker.join().expect("Failure shutting down worker thread");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest that this doesn't have to be a drop implementation here. The main program is doing the drop explicitly but you only need to create one instance of this per program, so it could equally be a shutdown method that can return an error?

fn drop(&mut self) {
self.cont.store(false, std::sync::atomic::Ordering::SeqCst);
for worker in self.workers.drain(..) {
let _ = worker.join().expect("Failure shutting down worker thread");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A drop implementation that calls shutdown().unwrap() would be good, to make sure that the user gets some attempt at shutdown if they forget to call it

let prune_cont = cont.clone();
let prune_space_map = space_map.clone();
workers.push(std::thread::spawn(move || {
prune_worker(config, prune_cont, prune_space_map)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be bad if this thread died and we didn't know. Doesn't necessarily seem like a problem for this issue but would that be a good thing to track somewhere?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we haven't adopted any tracing in this binary yet (it is less needed with server code, because we probably just want it all going to systemd log anyways) I've just eprintln-d it for now. We can do something different easily in the future: 7923234

return Err(std::io::Error::other("InvalidExpiresAt"));
}

// validate signature (do this at the end because it's more expensive
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// validate signature (do this at the end because it's more expensive
// validate signature (do this at the end because it's more expensive)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done: 68c1004

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This module is a little scary to me :) The documentation makes sense, and I can follow the code, I'm not sure how much I can asses the behavior

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. It would be nice to find a way to test it, at least a little bit. Perhaps if I made the max file size configurable at least internally, and then wrote some tests that explicitly dropped read handles and checked if the files still exist on-disk? I don't know if that would be flaky or not...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been through this in detail now and I'm much happier. Thank you for letting me try and fail to pick it apart so I could make sense of it :)

Copy link
Member

@ThetaSinner ThetaSinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still working through the store, will pick up after meetings

//! open a new tempfile for writing.
//! - The older read handles will persist the existence of the older tempfiles
//! until the last read reference is dropped, at which point the tempfile
//! will be cleaned up by the os.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the OS guaranteed to do this right away? I thought this would only happen on reboot unless the tempfile implementation in the code has logic to cleanup when dropped

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tempfile drop implementation indeed cleans up the tempfile. If that fails for any reason (panic on the thread) then it is up to the os to clean it up. That's completely system specific.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd optionally include that in the documentation here but maybe it's too much detail

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some details: ea028b0

crates/bootstrap_srv/src/store.rs Show resolved Hide resolved
crates/bootstrap_srv/src/store.rs Show resolved Hide resolved
crates/bootstrap_srv/src/store.rs Show resolved Hide resolved
crates/bootstrap_srv/src/store.rs Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants