Add worker tasks to the agent. #1393
base: main
Conversation
This enables the crucible agent to have more than one (currently up to 5) outstanding region creation operation while still handling creation and deletion of snapshots (SMF services) and deletion of regions. Updated the agent-antagonist test to print a little less in some places and a little more in others. agent-antagonist now has a new test that just tests downstairs cloning operations in a loop.
Some changes needed:
- adding work to the queue has to be idempotent, as the agent is called from a saga context (see the sketch after this list)
- what happens when there is work on the queue and someone tombstones the region? does that tombstone serialize behind the region creation job?
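On the idempotency point: not the agent's actual code, but a minimal sketch of what idempotent enqueueing could look like, assuming a HashMap-backed work queue keyed by region id (the type and helper names here are illustrative):

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, Hash, PartialEq, Eq)]
struct RegionId(String);

#[derive(Debug)]
struct CreateJob {
    region_id: RegionId,
}

#[derive(Default)]
struct WorkQueue {
    jobs: HashMap<RegionId, CreateJob>,
}

impl WorkQueue {
    /// Enqueue a creation job for a region. A saga may retry the same
    /// request, so calling this a second time with the same id must be
    /// a no-op rather than a second job.
    fn enqueue_create(&mut self, id: RegionId) {
        self.jobs
            .entry(id.clone())
            .or_insert_with(|| CreateJob { region_id: id });
    }
}

fn main() {
    let mut q = WorkQueue::default();
    let id = RegionId("r1".to_string());
    q.enqueue_create(id.clone());
    q.enqueue_create(id); // retried by the saga: still only one job
    assert_eq!(q.jobs.len(), 1);
}
```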
// room on the work queue. Otherwise, they remain
// requested until we can service them.
if r.state == State::Requested {
    assert!(!outer.work_queue.contains_key(&r.id));
this assert shouldn't be required, this function should still hold the lock on outer and we just checked contains_key
@@ -69,7 +84,7 @@ fn command(log: &Logger, bin: &'static str, args: &[&str]) -> Result<String> {
     info!(log, "{} {:?} took {:?}", bin, args, elapsed);

     if !cmd.status.success() {
-        bail!("zfs list failed!");
+        bail!("command: {} {:?} failed {:?}", bin, args, cmd);
where's the dunce emoji
to be clear I meant myself! :o
    name.clone(),
    rs.clone(),
);
if outer.work_queue.contains_key(rid) {
With this branch, nothing currently does work on running snapshots right?
@@ -400,15 +404,17 @@ where
     */
    let expected_downstairs_instances = regions
        .iter()
-       .filter(|r| r.state == State::Created)
+       .filter(|r| r.state == State::Created || r.state == State::Requested)
I don't think this is right - if I remember correctly this would start a downstairs process for a region that either hasn't had its dataset created yet, or could still be cloning.
        .map(|r| format!("{}-{}", downstairs_prefix, r.id.0))
        .collect::<HashSet<_>>();

    let expected_snapshot_instances: Vec<String> = running_snapshots
        .iter()
        .flat_map(|(_, n)| {
            n.iter()
-               .filter(|(_, rs)| rs.state == State::Created)
+               .filter(|(_, rs)| {
+                   rs.state == State::Created || rs.state == State::Requested
Similarly, I don't think this is right, as it would start a downstairs process for a snapshot's .zfs/snapshot/... directory when that might not exist yet
@@ -799,6 +805,10 @@ where
     }
     crucible_smf::CommitResult::OutOfDate => {
         error!(log, "concurrent modification?!");
+        panic!(
why panic here and not in the other check for concurrent modification?
// that does not already have a job on work queue, so
// now we will create that job and spawn a task for
// it to do the work in.
let log0 = log.new(o!("component" => "worktask"));
this log context should also have the region id as a tag / whatever slog calls it :)
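Something along these lines is presumably what's being asked for, assuming r.id.0 is the region's string id as elsewhere in the agent (a hypothetical tweak, not part of the PR):

```rust
// Hypothetical: tag the child logger with the region id so every message
// emitted by the spawned worker task identifies which region it is
// operating on.
let log0 = log.new(o!(
    "component" => "worktask",
    "region" => r.id.0.clone(),
));
```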
Do you mean calling the agent to take an action has to be idempotent? If we call the agent several times it should only do the work once?
It should, yes (though I'll verify).
Yeah - if Nexus sends the same request, it should receive the same answer, polling until the state changes to
I thought it should as well, yeah
Added more tasks to the crucible-agent.
We now spawn up to 5 tasks to handle a region creation operation. This includes a clone operation where we create an empty region, then fill it by copying from another running downstairs.
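The PR itself tracks these jobs through the agent's work queue; purely as an illustration of the "up to 5 concurrent creates" idea, here is a hedged sketch using a tokio semaphore (CreateJob and run_region_create are stand-ins, not the agent's real types):

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

/// Illustrative stand-in for a queued region-creation job.
struct CreateJob {
    region_id: String,
}

/// Placeholder for the real work: create the dataset (and, for a clone,
/// copy the region contents from a running source downstairs).
async fn run_region_create(job: CreateJob) {
    println!("created region {}", job.region_id);
}

const MAX_CREATE_TASKS: usize = 5;

#[tokio::main]
async fn main() {
    // At most five creation jobs run concurrently; the rest wait here
    // for a permit, mirroring the "currently 5" limit described above.
    let permits = Arc::new(Semaphore::new(MAX_CREATE_TASKS));
    let mut handles = Vec::new();

    for i in 0..20 {
        let permits = Arc::clone(&permits);
        handles.push(tokio::spawn(async move {
            let _permit = permits.acquire_owned().await.unwrap();
            run_region_create(CreateJob { region_id: format!("r{i}") }).await;
        }));
    }

    for h in handles {
        h.await.unwrap();
    }
}
```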
I attempted to do this in a way that preserved much of how the worker tasks operated, instead of refactoring it.
I moved what was in the main worker() loop for region create into a spawned task; it should be the same set of operations that existed in the previous version. The existing doorbell mechanism used in the first_in_states() loop is also used to indicate that a worker task has finished. first_in_states() will also now check for jobs that have completed and remove them from the work queue.

To keep SMF happy, and to not clobber ourselves, I put a lock around the SMF update. This seemed simpler than trying to break apart apply_smf() into the various things it does and then try to only call the part I need when I need it. Other actions besides region creation will still be processed by the agent's current worker task.
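A rough sketch of that serialization, with SmfState and apply_smf_locked standing in for the agent's real SMF state and apply_smf() body (the actual lock type and scope in the PR may differ):

```rust
use std::sync::Arc;
use tokio::sync::Mutex;

/// Stand-in for whatever state apply_smf() reads and rewrites.
struct SmfState;

/// Placeholder for the existing apply_smf() body: reconcile running SMF
/// service instances with the regions and snapshots we expect to exist.
fn apply_smf_locked(_state: &mut SmfState) {}

#[tokio::main]
async fn main() {
    // One lock shared by the main worker loop and every spawned create
    // task, so only one of them is rewriting SMF state at any moment.
    let smf = Arc::new(Mutex::new(SmfState));

    let smf_for_task = Arc::clone(&smf);
    let task = tokio::spawn(async move {
        let mut guard = smf_for_task.lock().await;
        apply_smf_locked(&mut *guard);
    });

    {
        let mut guard = smf.lock().await;
        apply_smf_locked(&mut *guard);
    }

    task.await.unwrap();
}
```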
Updated the agent-antagonist test to print a little less in some places and a little more in others.
Because agent-antagonist will go much faster than Nexus (and has fewer restrictions) I added a loop in the clone path that will try to connect to the source endpoint and verify the remote downstairs is responding before trying to clone from it. I don't believe this kind of check is needed in Omicron.
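That check could be as simple as repeatedly trying to open a TCP connection to the clone source; a hedged sketch (the actual test may probe the downstairs differently):

```rust
use std::net::SocketAddr;
use std::time::Duration;
use tokio::net::TcpStream;
use tokio::time::sleep;

/// Keep trying to reach the source downstairs before starting the clone,
/// giving up after `attempts` tries. Returns true once a connection
/// succeeds, meaning the remote side is at least accepting connections.
async fn wait_for_source_downstairs(addr: SocketAddr, attempts: usize) -> bool {
    for _ in 0..attempts {
        if TcpStream::connect(addr).await.is_ok() {
            return true;
        }
        sleep(Duration::from_secs(1)).await;
    }
    false
}
```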
agent-antagonist now has a new test that just tests downstairs cloning operations in a loop.
I added a "work queue" API endpoint to the agent. I'll come back when this lands in Omicron and add that to omdb.