Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Actually provision read-only downstairs! #728

Merged
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
167 changes: 144 additions & 23 deletions agent/src/datafile.rs
Original file line number Diff line number Diff line change
Expand Up @@ -259,7 +259,7 @@ impl DataFile {
id: request.id.clone(),
name: request.name.clone(),
port_number,
state: State::Created, // no work to do, just run smf
state: State::Requested,
};

info!(
Expand Down Expand Up @@ -302,26 +302,26 @@ impl DataFile {
let running_snapshots =
inner.running_snapshots.get_mut(&request.id).unwrap();

if running_snapshots.get(&request.name).is_none() {
info!(
self.log,
"not running for region {} snapshot {}, returning Ok",
request.id.0,
request.name
);

return Ok(());
}
match running_snapshots.get_mut(&request.name) {
None => {
info!(
self.log,
"not running for region {} snapshot {}, returning Ok",
request.id.0,
request.name
);

info!(
self.log,
"removing snapshot {}-{}", request.id.0, request.name
);
return Ok(());
}

running_snapshots.remove(&request.name);
Some(request) => {
info!(
self.log,
"removing snapshot {}-{}", request.id.0, request.name
);

if running_snapshots.is_empty() {
inner.running_snapshots.remove(&request.id);
request.state = State::Tombstoned;
}
}

/*
Expand Down Expand Up @@ -435,6 +435,36 @@ impl DataFile {
self.store(inner);
}

/**
* Mark a particular running snapshot as failed to provision.
*/
pub fn fail_rs(&self, region_id: &RegionId, snapshot_name: &str) {
let mut inner = self.inner.lock().unwrap();

let mut rs = inner
.running_snapshots
.get_mut(region_id)
.unwrap()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if, and when would these unwraps possibly fail?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The agent would panic, and the service would restart, scan the data file, and attempt to do the same operation. It may then succeed, or it may crash again. Too many crashes would put the service in a maintenance state and it remain disabled.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know for sure that a panic'd agent will restart and retry? I know there are places where we don't
correctly handle panics, but I was not sure what happens to the agent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll verify this but I'm pretty sure.

.get_mut(snapshot_name)
.unwrap();

let nstate = State::Failed;
if rs.state == nstate {
return;
}

info!(
self.log,
"running snapshot {} state: {:?} -> {:?}",
rs.id.0,
rs.state,
nstate,
);
rs.state = nstate;

self.store(inner);
}

/**
* Mark a particular region as provisioned.
*/
Expand Down Expand Up @@ -465,6 +495,51 @@ impl DataFile {
Ok(())
}

/**
* Mark a particular running snapshot as provisioned.
*/
leftwo marked this conversation as resolved.
Show resolved Hide resolved
pub fn created_rs(
&self,
region_id: &RegionId,
snapshot_name: &str,
) -> Result<()> {
let mut inner = self.inner.lock().unwrap();

let mut rs = inner
.running_snapshots
.get_mut(region_id)
.unwrap()
.get_mut(snapshot_name)
.unwrap();

let nstate = State::Created;

match &rs.state {
State::Requested => (),
State::Tombstoned => {
/*
* Nexus requested that we destroy this running snapshot before
* we finished provisioning it.
*/
return Ok(());
}

x => bail!("created running_snapshot in weird state {:?}", x),
}

info!(
self.log,
"running snapshot {} state: {:?} -> {:?}",
rs.id.0,
rs.state,
nstate,
);
rs.state = nstate;

self.store(inner);
Ok(())
}

/**
* Mark a particular region as destroyed.
*/
Expand All @@ -490,6 +565,40 @@ impl DataFile {
Ok(())
}

/**
* Mark a particular running snapshot as destroyed.
*/
pub fn destroyed_rs(
&self,
region_id: &RegionId,
snapshot_name: &str,
) -> Result<()> {
let mut inner = self.inner.lock().unwrap();

let mut rs = inner
.running_snapshots
.get_mut(region_id)
.unwrap()
.get_mut(snapshot_name)
.unwrap();

let nstate = State::Destroyed;

info!(
self.log,
"running snapshot {} state: {:?} -> {:?}",
rs.id.0,
rs.state,
nstate,
);
rs.state = nstate;

// XXX shouldn't this remove the record?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

XXX How can we tell?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't remove the record, then any attempt by Nexus to provision a running snapshot with the same name for the same region wouldn't work, it would return the existing record with state == destroyed. We currently keep the record around for debugging purposes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I'm confused about what the XXX was for.

I think removing the record is what we want, right? Otherwise we are leaking resources.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think leaving the records there is important to keep Nexus sagas idempotent. If they request that a region is to be destroyed, then the saga node reruns, we'll want the "destroy" command to not return a 404 (because the record is gone), but instead a 204 because the region is already destroyed. I've removed the XXX comments and added a note about sagas in 03086e6.


self.store(inner);
Ok(())
}

/**
* Nexus has requested that we destroy this particular region. Returns
* true if the region is already destroyed, false if not destroyed
Expand Down Expand Up @@ -547,11 +656,11 @@ impl DataFile {
}

/**
* The worker thread will request the first region that is in a
* particular state. If there are no tasks in the provided state,
* we sleep waiting for work to do.
* The worker thread will request the first resource that is in a
* particular state. If there are no resources in the provided state,
* wait on the condition variable.
*/
pub fn first_in_states(&self, states: &[State]) -> Region {
pub fn first_in_states(&self, states: &[State]) -> Resource {
let mut inner = self.inner.lock().unwrap();

loop {
Expand All @@ -565,7 +674,19 @@ impl DataFile {
for s in states {
for r in inner.regions.values() {
if &r.state == s {
return r.clone();
return Resource::Region(r.clone());
}
}

for (rid, r) in &inner.running_snapshots {
for (name, rs) in r {
if &rs.state == s {
return Resource::RunningSnapshot(
rid.clone(),
name.clone(),
rs.clone(),
);
}
}
}
}
Expand Down
Loading